Why (not) pdf?

Text Collector lets you print text messages by converting them to pdf. What we call “text messages,” of course, includes messages with both text and pictures. Sometimes they include other types of attachments, like dirty .gif files, but that’s another article. For now, I’m just discussing images and text.

On the surface, pdf seems ideal: it’s universally viewable, supports pagination, and, unlike images, includes text in a searchable way. But I don’t really like pdf as an ediscovery interchange format.

Why? Pdf is too complicated. As file formats go, it’s far from the worst monster out there, but it’s also far from simple. As a consequence, the many different programs that generate pdf often get it slightly wrong; they’re not necessarily bad programs: they’re just dealing with a complicated problem and make mistakes.

Pdf viewers grapple with the resultant problems and show you something that looks correct, so everything seems fine at first.

When you want to edit pdf, however, things quickly go wrong. For a typical ediscovery operation like stamping Bates numbers on your pdf files, the small errors compound and you have a significant chance that the result will be illegibly damaged.

Assume some set of pdf files P , and an operation b that you want to do on them to produce an output set, O .

b:P \rightarrow O

You can visually check some number v for correctness.

When the size of O is larger than v there will be some subset E, larger than zero, that is terribly broken.

E \subset O \land |O| > v \Rightarrow |E| > 0

Ulfers’ Law of Batch Pdf Editing

Second, since it is complex, pdf allows all manner of invisible content. This makes redacting pdf hazardous and if you have a highly-developed sense of self-doubt, it’s hard to shake the feeling you’ve done something wrong that allows the redaction to be removed.

So, why does Text Collector convert messages to pdf and not something else?

There are no suitable alternatives. Html is universally viewable but has no notion of pagination and very limited image embedding. Microsoft Word format is too easy to edit, and comparable in complexity to pdf. Mhtml never got universal support and it lacks pagination anyway. Tiffs and text are too large and useless to average people. How about svg?

So pdf it is, for better or worse.

Where have my emoji gone?

Bob, what were your six biggest challenges in developing pdf?

Dave, that’s easy. Fonts, fonts, fonts, fonts, fonts and fonts. In that order.

The most noticeable limitation of Text Collector is that when it converts your text messages to pdf, it drops emoticons. Instead of smiley faces, it displays diamond-question mark things, “replacement characters.”

As ever, fonts are the problem.

When you need to display something laid out just so, such that it prints on a page and everything fits correctly, text alone isn’t enough.

Text, nowadays, usually means some encoding of Unicode. Technically, it’s a “Unicode transformation format,” which is where we get the “utf” in utf-8. Impress your friends and family with that.

Unicode represents abstract concepts, like the idea of the letter A or a smile. Emoji, just like letters A through Z are Unicode text, but precisely how letter A or winky face actually gets written to the page depends on your font.

A A

If you want to ensure that your text fits on the page, doesn’t run over or under the things around it, or, indeed displays at all, you must also include a font. In pdf terms, this is called “font embedding,” and it is what Android’s built-in pdf library does.

So far so good: just embed the emoji font and smiles or piles of poo come along for free, right? Wrong.

Most fonts specify letter outlines and leave color of the letters up to the user. Thus, I can say that I want letter A in blue or orange and both are the same font:

A A

Not so, Google’s emoji font. It includes cats, dragons and piles of poo in glorious, and heavyweight, color. Including this font makes for a massive pdf: almost 10 megabytes, on Android 6. It only takes two or three such documents to blow past Gmail’s attachment size limit. Typical collections would run into hundreds of megabytes, or gigabytes. They could easily consume all available space on the device.

This isn’t a new problem, and there’s a well-known solution: font subsetting. That is, only include glyphs for the small handful of characters that you actually use. Android’s built-in pdf library can’t do that.

So, to get emoji, I have to bring in an entirely new pdf library… and I’ve been working on this long enough. Version 1 needs to ship. Emoji will have to wait till version 2.