Why (not) pdf?

Text Collector lets you print text messages by converting them to pdf. What we call “text messages,” of course, includes messages with both text and pictures. Sometimes they include other types of attachments, like dirty .gif files, but that’s another article. For now, I’m just discussing images and text.

On the surface, pdf seems ideal: it’s universally viewable, supports pagination, and, unlike images, includes text in a searchable way. But I don’t really like pdf as an ediscovery interchange format.

Why? Pdf is too complicated. As file formats go, it’s far from the worst monster out there, but it’s also far from simple. As a consequence, the many different programs that generate pdf often get it slightly wrong; they’re not necessarily bad programs: they’re just dealing with a complicated problem and make mistakes.

Pdf viewers grapple with the resultant problems and show you something that looks correct, so everything seems fine at first.

When you want to edit pdf, however, things quickly go wrong. For a typical ediscovery operation like stamping Bates numbers on your pdf files, the small errors compound and you have a significant chance that the result will be illegibly damaged.

Assume some set of pdf files P , and an operation b that you want to do on them to produce an output set, O .

b:P \rightarrow O

You can visually check some number v for correctness.

When the size of O is larger than v there will be some subset E, larger than zero, that is terribly broken.

E \subset O \land |O| > v \Rightarrow |E| > 0

Ulfers’ Law of Batch Pdf Editing

Second, since it is complex, pdf allows all manner of invisible content. This makes redacting pdf hazardous and if you have a highly-developed sense of self-doubt, it’s hard to shake the feeling you’ve done something wrong that allows the redaction to be removed.

So, why does Text Collector convert messages to pdf and not something else?

There are no suitable alternatives. Html is universally viewable but has no notion of pagination and very limited image embedding. Microsoft Word format is too easy to edit, and comparable in complexity to pdf. Mhtml never got universal support and it lacks pagination anyway. Tiffs and text are too large and useless to average people. How about svg?

So pdf it is, for better or worse.

Keyboard Savior Xtreme released

Tired of websites trapping your shortcut keys for their own ends? Me too. Usually, the problem is that they take over the Firefox slash-to-search shortcut. But what can we do about it?

First, we could tolerate it and just use ctrl+f instead. On some slash-abusing sites, like Bitbucket, this isn’t a terrible option: they rarely include long pages that require quick jumps. In api documentation, however, it’s nightmarish.

Second, we could try to stamp out the evil by filing bugs and fighting for justice. There is some hope for this: Github, for example, used to abuse the slash key and no longer does. In general, however, it devolves into Whac-A-Mole. Django rightly rejected slash abuse in in 2008, only to have it sneak into the Django docs in 2015. [As if to prove my point, Github reinstated slash abuse not long after I wrote this.]

Finally, we could just fix it. This Greasemonkey script lets you list known abusers and prevent them from seeing slash keystrokes. After some time, however, I realized that my Greasemonkey approach did not go far enough: it only prevents abuse of one keystroke and only on selected sites.

In fact, there are only a handful of sensible reasons for any website to capture a keystroke, ever. Why not just stop them all?

So, I welcome Keyboard Savior Xtreme. Take back all your keystrokes.

Comefrom0x10 released

After a long hiatus while I built Text Collector, last week I finally returned to my paradigm shifting language, Comefrom0x10. It now has a home page on Read the Docs that features a tutorial, standard library documentation and more.

Except for a couple minor bugs, its implementation was actually functionally complete eight months ago. I hesitated to release it, however, because of rather embarrassing performance problems.

Now, it wouldn’t be fair to say that Cf0x10 is just slow. It’s catastrophically slow. The brainfuck.cf0x10 program takes 10 seconds to run helloworld.b on the laptop I’m using to write this, and gets dramatically worse as the program gets longer.

What went wrong?

It’s not a fundamental problem with the comefrom paradigm, but a consequence of the twisted way the language took on a life of its own during implementation. I started with the idea that I was building a rather ordinary stack-based interpreter, but Cf0x10 would have none of it. As it evolved, the original idea became a disfigured mutant: I can demonstrate with tests that it works, but it’s too convoluted to allow necessary optimizations.

Oh well, as they say, first make it right, then make it fast release it.