Text Collector version one released

Today, Legal Text Collector graduated from beta to production, version 1.0. It’s been a long time coming: I started working on it seven months ago. I made my first notes on the idea about five years ago.

Exactly thirty years ago today, Reagan said, “open this gate. Mr. Gorbachev, tear down this wall,” and in a small way, I think I can sympathize with Gorbachev. The main change from beta to production is that now people can start posting reviews: I’m opening the gates of public criticism.

Google allows public reviews only after a production release. In the safety of beta, Google provides a private feedback option, which nobody used. Many people did, however, send me questions and bug reports via email.

It’s hard to overstate the the value of that feedback. My friend Trish, my aunt Meg and my erstwhile colleague Jon of Sandline were especially helpful in testing the early alpha versions; strangers who stumbled upon later public betas kindly sent me information to resolve some of the final bugs. Through 14 alphas and 16 betas, Text Collector grew from 3.5 thousand lines to 5.5 thousand, a testament to how long the long tail can be.

So it is a dramatic moment. Much remains to do, but at some point I must tear down the wall to public criticism. Private feedback has been overwhelmingly positive, so this doesn’t worry me too much, but only time will tell.

It’s a good day to print text messages

Even though my elevator pitch for Legal Text Collector is, “It lets you print your text messages,” in all the time that I’ve been writing it, I never actually printed any messages.

Text Collector doesn’t (yet) print text messages directly. Instead, it converts them to pdf, and you’re free to copy and print pdf as you please. In development, I just examined output pdf. Many, many times. Printing is a formality.

Still, this gives me an itchy feeling. I find it unsatisfying to leave the final step untried.

So I celebrated the beta 15 release by actually printing messages. On paper. In color and grayscale.

Why celebrate beta 15? I expect it to be the last beta release before 1.0. Every 1.0 feature is in and every 1.0 bug is fixed. In other words, beta 15 is Version One. All that remains is to take a deep breath and bless it with a “production” designation.


Where have my emoji gone?

Bob, what were your six biggest challenges in developing pdf?

Dave, that’s easy. Fonts, fonts, fonts, fonts, fonts and fonts. In that order.

The most noticeable limitation of Text Collector is that when it converts your text messages to pdf, it drops emoticons. Instead of smiley faces, it displays diamond-question mark things, “replacement characters.”

As ever, fonts are the problem.

When you need to display something laid out just so, such that it prints on a page and everything fits correctly, text alone isn’t enough.

Text, nowadays, usually means some encoding of Unicode. Technically, it’s a “Unicode transformation format,” which is where we get the “utf” in utf-8. Impress your friends and family with that.

Unicode represents abstract concepts, like the idea of the letter A or a smile. Emoji, just like letters A through Z are Unicode text, but precisely how letter A or winky face actually gets written to the page depends on your font.


If you want to ensure that your text fits on the page, doesn’t run over or under the things around it, or, indeed displays at all, you must also include a font. In pdf terms, this is called “font embedding,” and it is what Android’s built-in pdf library does.

So far so good: just embed the emoji font and smiles or piles of poo come along for free, right? Wrong.

Most fonts specify letter outlines and leave color of the letters up to the user. Thus, I can say that I want letter A in blue or orange and both are the same font:


Not so, Google’s emoji font. It includes cats, dragons and piles of poo in glorious, and heavyweight, color. Including this font makes for a massive pdf: almost 10 megabytes, on Android 6. It only takes two or three such documents to blow past Gmail’s attachment size limit. Typical collections would run into hundreds of megabytes, or gigabytes. They could easily consume all available space on the device.

This isn’t a new problem, and there’s a well-known solution: font subsetting. That is, only include glyphs for the small handful of characters that you actually use. Android’s built-in pdf library can’t do that.

So, to get emoji, I have to bring in an entirely new pdf library… and I’ve been working on this long enough. Version 1 needs to ship. Emoji will have to wait till version 2.

Why Android threading isn’t good enough

Text Collector exports your text messages organized by conversation. Conversations – also known as message threads or strings – are crucial to understanding the context of text messages. If Text Collector did something more naive: put every message in one big document according to date, perhaps, some messages would find themselves incomprehensibly woven together with other unrelated messages.

Apps on Android access messages through “content providers,” and since Android does include a content provider that lists messages according to conversation, the easy way to get message threads is to ask that provider. It’s not what Text Collector does.

The trouble with Android’s threading implementation is that it allows a message to be part of one, and only one, thread.

Android’s design is sensible enough when there are only two people on a conversation. In that case, messages intuitively belong to just one thread. Suppose Alice is having a conversation with Bob, and separate conversation with Charlie: she simply has two threads, one for Charlie, the other for Bob.

A more nuanced way of looking at message threads might also divide messages by topic. In email, for example, threads can be based on the subject line. For text messages however, a simple categorization by recipients is more intuitive.

Eventually, Alice finds herself discussing the same topic with both her friends, so she copies Bob and Charlie on a message that says, “Hey Bob, Charlie says you’re wrong.” This group message has to go into a thread, but Android doesn’t know whether to put it in the Bob thread or the Charlie thread. And lo, Android creates a third thread.

Intuitively, the group message actually continues both conversations, but since Android can only give it one thread identifier, it can’t represent that. People can join conversations, leave conversations and have related conversations on the side, but Android’s threading model represents none of this.

While using the phone, these broken threads aren’t too confusing, because you read the messages with context still fresh in your mind. They can be very confusing, however, when reading messages long after they were sent. If Text Collector used Android’s threading model, you would often need to cross-reference several documents to understand where messages really belong.

Instead, Text Collector puts the message in both the Bob document and the Charlie document. There’s a bit more to it, but if you want gory details, check the website.

No matter which document you read, Text Collector’s threading strategy makes the whole conversation understandable, and it has another happy effect for legal review: all Bob messages appear in a single document, and same for Charlie. Thus, in the common case where you want to collect Alice’s messages and review everything she discussed with Bob, you don’t have to worry that they might be scattered across a mess of different files.

Mind you, this is not magic. If Bob does something tricky, like change phone numbers, Text Collector won’t know about it.

Build one to throw away

Most software projects inadvertently plan to build a throwaway product for delivery to customers, instead of heeding Fred Brooks:

In most projects, the first system built is barely usable…

Plan one to throw away; you will, anyhow.

Luckily, when building Text Collector, only my own management expectations burdened me, so I had an opportunity to build a real pilot and chuck it.

I wrote the pilot in Java and it included all the major features I knew I would need. Thanks to Steve Yegg popping up for the first time in a few years, I heard about Kotlin, so I switched to Kotlin for the real program.

Now that it’s in alpha, I can reasonably compare the size of the two programs:

Java Kotlin
Lines  1.9k  3.5k
Words 6k  13k
Bytes 71k  146k

By many standards, this is small, and smaller yet when you consider that I’ve counted using simply the “wc” command, so the numbers include whitespace and comments.

The pilot included all major functionality, but ignored most edge cases. It featured mms and sms collection with:

  • Date filtering
  • Message preview (small collections only)
  • Pdf creation
  • Pdf sharing
  • Inline image rendering (no scaling)
  • Zooming (without panning or clamping)

For the real program, I kept all the pilot features except contact filtering, added handling for many edge cases, and added features:

  • Organization by conversation
  • Zip creation
  • Reporting usage statistics and errors
  • Failure diagnostics and debug features
  • Option to cancel collections in progress
  • Pre-calculated layout (allows preview for large collection)

So, the Kotlin app has roughly twice the features, with roughly twice the amount of code. At first glance, it doesn’t seem like that significant a difference, but breaking it down this way lets me count up the lines associated with new features only: 1.3 thousand. Thus, the part roughly equivalent to the pilot weighs in at 2.2k lines of Kotlin to Java’s 1.9.

In other words, Kotlin, handling edge cases, is only 15% larger than the equivalent Java that ignores edge cases with wild abandon.

My commit log shows that the Java version took about one month, while the Kotlin version took three. That seems pretty dismal: two thousand lines per month in Java and closer to one thousand in Kotlin, but…

The Java pilot had no tests, the real, Kotlin version adds 6.6 thousand lines worth of test code.