Text Collector version one released

Today, Legal Text Collector graduated from beta to production, version 1.0. It’s been a long time coming: I started working on it seven months ago. I made my first notes on the idea about five years ago.

Exactly thirty years ago today, Reagan said, “open this gate. Mr. Gorbachev, tear down this wall,” and in a small way, I think I can sympathize with Gorbachev. The main change from beta to production is that now people can start posting reviews: I’m opening the gates of public criticism.

Google allows public reviews only after a production release. In the safety of beta, Google provides a private feedback option, which nobody used. Many people did, however, send me questions and bug reports via email.

It’s hard to overstate the the value of that feedback. My friend Trish, my aunt Meg and my erstwhile colleague Jon of Sandline were especially helpful in testing the early alpha versions; strangers who stumbled upon later public betas kindly sent me information to resolve some of the final bugs. Through 14 alphas and 16 betas, Text Collector grew from 3.5 thousand lines to 5.5 thousand, a testament to how long the long tail can be.

So it is a dramatic moment. Much remains to do, but at some point I must tear down the wall to public criticism. Private feedback has been overwhelmingly positive, so this doesn’t worry me too much, but only time will tell.

It’s a good day to print text messages

Even though my elevator pitch for Legal Text Collector is, “It lets you print your text messages,” in all the time that I’ve been writing it, I never actually printed any messages.

Text Collector doesn’t (yet) print text messages directly. Instead, it converts them to pdf, and you’re free to copy and print pdf as you please. In development, I just examined output pdf. Many, many times. Printing is a formality.

Still, this gives me an itchy feeling. I find it unsatisfying to leave the final step untried.

So I celebrated the beta 15 release by actually printing messages. On paper. In color and grayscale.

Why celebrate beta 15? I expect it to be the last beta release before 1.0. Every 1.0 feature is in and every 1.0 bug is fixed. In other words, beta 15 is Version One. All that remains is to take a deep breath and bless it with a “production” designation.


Where have my emoji gone?

Bob, what were your six biggest challenges in developing pdf?

Dave, that’s easy. Fonts, fonts, fonts, fonts, fonts and fonts. In that order.

The most noticeable limitation of Text Collector is that when it converts your text messages to pdf, it drops emoticons. Instead of smiley faces, it displays diamond-question mark things, “replacement characters.”

As ever, fonts are the problem.

When you need to display something laid out just so, such that it prints on a page and everything fits correctly, text alone isn’t enough.

Text, nowadays, usually means some encoding of Unicode. Technically, it’s a “Unicode transformation format,” which is where we get the “utf” in utf-8. Impress your friends and family with that.

Unicode represents abstract concepts, like the idea of the letter A or a smile. Emoji, just like letters A through Z are Unicode text, but precisely how letter A or winky face actually gets written to the page depends on your font.


If you want to ensure that your text fits on the page, doesn’t run over or under the things around it, or, indeed displays at all, you must also include a font. In pdf terms, this is called “font embedding,” and it is what Android’s built-in pdf library does.

So far so good: just embed the emoji font and smiles or piles of poo come along for free, right? Wrong.

Most fonts specify letter outlines and leave color of the letters up to the user. Thus, I can say that I want letter A in blue or orange and both are the same font:


Not so, Google’s emoji font. It includes cats, dragons and piles of poo in glorious, and heavyweight, color. Including this font makes for a massive pdf: almost 10 megabytes, on Android 6. It only takes two or three such documents to blow past Gmail’s attachment size limit. Typical collections would run into hundreds of megabytes, or gigabytes. They could easily consume all available space on the device.

This isn’t a new problem, and there’s a well-known solution: font subsetting. That is, only include glyphs for the small handful of characters that you actually use. Android’s built-in pdf library can’t do that.

So, to get emoji, I have to bring in an entirely new pdf library… and I’ve been working on this long enough. Version 1 needs to ship. Emoji will have to wait till version 2.

Why Android threading isn’t good enough

Text Collector exports your text messages organized by conversation. Conversations – also known as message threads or strings – are crucial to understanding the context of text messages. If Text Collector did something more naive: put every message in one big document according to date, perhaps, some messages would find themselves incomprehensibly woven together with other unrelated messages.

Apps on Android access messages through “content providers,” and since Android does include a content provider that lists messages according to conversation, the easy way to get message threads is to ask that provider. It’s not what Text Collector does.

The trouble with Android’s threading implementation is that it allows a message to be part of one, and only one, thread.

Android’s design is sensible enough when there are only two people on a conversation. In that case, messages intuitively belong to just one thread. Suppose Alice is having a conversation with Bob, and separate conversation with Charlie: she simply has two threads, one for Charlie, the other for Bob.

A more nuanced way of looking at message threads might also divide messages by topic. In email, for example, threads can be based on the subject line. For text messages however, a simple categorization by recipients is more intuitive.

Eventually, Alice finds herself discussing the same topic with both her friends, so she copies Bob and Charlie on a message that says, “Hey Bob, Charlie says you’re wrong.” This group message has to go into a thread, but Android doesn’t know whether to put it in the Bob thread or the Charlie thread. And lo, Android creates a third thread.

Intuitively, the group message actually continues both conversations, but since Android can only give it one thread identifier, it can’t represent that. People can join conversations, leave conversations and have related conversations on the side, but Android’s threading model represents none of this.

While using the phone, these broken threads aren’t too confusing, because you read the messages with context still fresh in your mind. They can be very confusing, however, when reading messages long after they were sent. If Text Collector used Android’s threading model, you would often need to cross-reference several documents to understand where messages really belong.

Instead, Text Collector puts the message in both the Bob document and the Charlie document. There’s a bit more to it, but if you want gory details, check the website.

No matter which document you read, Text Collector’s threading strategy makes the whole conversation understandable, and it has another happy effect for legal review: all Bob messages appear in a single document, and same for Charlie. Thus, in the common case where you want to collect Alice’s messages and review everything she discussed with Bob, you don’t have to worry that they might be scattered across a mess of different files.

Mind you, this is not magic. If Bob does something tricky, like change phone numbers, Text Collector won’t know about it.

Build one to throw away

Most software projects inadvertently plan to build a throwaway product for delivery to customers, instead of heeding Fred Brooks:

In most projects, the first system built is barely usable…

Plan one to throw away; you will, anyhow.

Luckily, when building Text Collector, only my own management expectations burdened me, so I had an opportunity to build a real pilot and chuck it.

I wrote the pilot in Java and it included all the major features I knew I would need. Thanks to Steve Yegg popping up for the first time in a few years, I heard about Kotlin, so I switched to Kotlin for the real program.

Now that it’s in alpha, I can reasonably compare the size of the two programs:

Java Kotlin
Lines  1.9k  3.5k
Words 6k  13k
Bytes 71k  146k

By many standards, this is small, and smaller yet when you consider that I’ve counted using simply the “wc” command, so the numbers include whitespace and comments.

The pilot included all major functionality, but ignored most edge cases. It featured mms and sms collection with:

  • Date filtering
  • Message preview (small collections only)
  • Pdf creation
  • Pdf sharing
  • Inline image rendering (no scaling)
  • Zooming (without panning or clamping)

For the real program, I kept all the pilot features except contact filtering, added handling for many edge cases, and added features:

  • Organization by conversation
  • Zip creation
  • Reporting usage statistics and errors
  • Failure diagnostics and debug features
  • Option to cancel collections in progress
  • Pre-calculated layout (allows preview for large collection)

So, the Kotlin app has roughly twice the features, with roughly twice the amount of code. At first glance, it doesn’t seem like that significant a difference, but breaking it down this way lets me count up the lines associated with new features only: 1.3 thousand. Thus, the part roughly equivalent to the pilot weighs in at 2.2k lines of Kotlin to Java’s 1.9.

In other words, Kotlin, handling edge cases, is only 15% larger than the equivalent Java that ignores edge cases with wild abandon.

My commit log shows that the Java version took about one month, while the Kotlin version took three. That seems pretty dismal: two thousand lines per month in Java and closer to one thousand in Kotlin, but…

The Java pilot had no tests, the real, Kotlin version adds 6.6 thousand lines worth of test code.

Qwerty War released

It’s actually been up for a while now at http://www.qwertywar.com/.

As is typical of software, final cleanup took me longer than expected, a week perhaps. One problem still irks me: it’s too slow. While it runs reasonably well on my desktop machines, not so much on my my high-resolution devices. And this after already spending a day or two profiling and optimizing.

Poor performance on the tablets and phones doesn’t bother me too much, since it’s a typing game, and who would play with a touchscreen anyway? My high-dpi Lenovo laptop, however, also does badly. I suspect the problem is partly thanks to poor Nvidia support on Linux.

Still, the performance problems surprised me because it just doesn’t seem like there really is all that much happening on the screen. The game draws a few shapes and renders a handful of different 32×32 sprites in maybe a few dozen different positions.

Sometimes, though, you just need to call something good enough and go play Qwerty War.

Comefrom0x10: unary minus

In the original Comefrome0x10 grammar, I skipped unary minus. In retrospect, its presence has important consequences that I should have considered earlier, so unary minus made me reverse a couple decisions.

Line continuation

Comefrom0x10 uses newlines to delimit statements, which is fairly common in modern languages. What is not so common is that a standalone expression has a side effect: it writes to program output, as in PowerShell. This makes hello world succinct:

'hello world'

Prior to adding the unary minus operator, I allowed line continuation if an operator started a line, so these would be equivalent:

x = 2 + 3

x = 2
  + 3

But unary minus causes problems. These lines could be either “3 - 2” or “print 3; print -2“:


I could continue to allow operators at the end of a line to indicate line continuation, but I’ve never liked end-of-line line markers for continuation as they are too hard to see. Comefrom0x10, therefore, now has no facility at all for line continuation. I will figure it out when I have a burning need.


Unary minus causes a second problem, when interacting with cf0x10’s concatenation behavior.

In cf0x10, space is the concatenation operator: “'foo' 'bar' is 'foobar' # true“. Originally, I allowed the concatenation space to be optional, so “name'@'domain” would be equivalent to “name '@' domain“.

Enter unary minus. In most languages, all three of these expressions mean “a minus b”, but in the original cf0x10 grammar, they could indicate either concatenation with negative b or subtraction:

a -b
a - b

Other languages do allow space concatenation, but normally only for literals. In Python, for example:

>>> 'a' 'b'
>>> a, b = 'a', 'b'
>>> a b # ILLEGAL in Python, but LEGAL in cf0x10
SyntaxError: invalid syntax

In css, space is even an operator and “-” can appear at the beginning of an identifier. The css grammar, avoids ambiguity, however, by making “-” mathematical only when followed by a number, as in “-2px” versus “-moz-example“; “--foo” is illegal. Css gets away with this because it has no variables, making the operand’s type known during parsing.

Since the cf0x10 parser cannot know operand types, Comefrom0x10 has two options. It could decide whether to subtract or concatenate at runtime, or it could make the grammar a bit less forgiving of how you use spaces. Spacing is already significant, so, on balance, I think that making the spacing rules restrictive is the more clear:

a -b # concatenation
a-b # subtraction
a - b # subtraction
a- b # syntax error

The rules are that concatenation requires at least on space and unary minus cannot be followed by a space.

Comefrom0x10: day 5

A few days have passed building a brand new language. About a day and a half getting most of the grammar worked out and another day and a half building the guts of the interpreter. But now cf0x10 programs can basically run, albeit with a limited set of operators and without entirely correctly scope resolution.

Of course, I spent much of the time on the interpreter working out the ideal semantics for the comefrom statement, a much neglected field of computer science. The natural question is, “what happens when more than one statement comes from a single other statement?”

In Intercal, this is simply an error. Though that might seem like a cowardly evasion, it can simply be credited to Intercal’s being a lower-level language: the Intercal programmer needs to handle these situations explicitly by abstaining and so forth.

Comefrom0x10, on the other hand, is a high-level language designed to minimize runtime errors, so Comefrom0x10 has well-defined semantics for all variations of coming from a place.

First, eligible comefroms receive control in the order they appear in the source. Alone, however, this rule causes trouble breaking out of blocks. Consider this program that prints the Fibonacci sequence up to ten thousand:

  b = 1
  comefrom fib
  next_b = a + b
  a = b
  b = next_b

  comefrom fib if a > 10000

If the first comefrom took precedence, the loop would never end. Changing the first line to “comefrom fib if a < 10000” would fix the problem, but require reconfiguring how the variables a and b get swapped. I think the existing version is more clear: last-first wins allows a common idiom of placing your exit conditions at the end of a block.

The first exception to source order, therefore, is that when multiple comefroms are eligible within the same block, the last one wins.

There are other exceptions, but that’s another article.

Comefrom0x10: day 1

Today, I am finally creating a language.

Inspired by Come Here and Come From, Comefrom0x10, pronounced “Come from sixteen”, will be a modern language based entirely around the much-maligned “come from” control structure.

I am writing the first interpreter in Python. I was hoping to have a grammar worked out by the end of day one, but failed to make significant progress. Originally, I thought I would use a parsing expression grammar, but soon reverted to Backus-Naur. The trouble is that I want cf0x10 to include syntactic indentation (as it’s a modern, Pythonic language). This means that I need indent and dedent constructions, which parsing expression grammars can’t easily express.

So, I decide to switch to lalr. As a bonus, using lalr, will, of course, let people easily build million-line cf0x10 programs without worrying about slow compile times.

The Python parser generator landscape is surprisingly bleak. Many of the parser generators available are abandoned or half-assed. Ply seems to be the only real contender. There are two main problems with Ply: first, its mechanism of embedding grammar specification within docstrings makes the parser obnoxiously verbose; second, it doesn’t provide an obvious way to generate a parser that does not require installing Ply.

Ply has a very strong point though: it provides outstanding debug output for diagnosing problems with your grammar. I’ll just live with the verbosity; I might try PlyPlus if it gets too annoying. As far as removing the Ply dependency goes, I’ll wait until the Enterprise customers for cf0x10 want to fork out money to fund writing a build script so I can use the generated parse table without depending on Ply.

Ply’s scanner generator has the usual Lex push and pop state capabilities, plus it makes saving arbitrary lexer state easy. I, therefore, spend a good deal of day one trying to avoid hand-writing a lexer. As usual, this is a waste of time.

The insurmountable problem is that Ply helpfully stops you from providing regexes that match an empty string, presumably assuming they would cause infinite loops. In addition to syntactic indentation, Comefrom0x10 uses syntactically significant blank lines and without either a facility for pushback or empty-matching regex, it is very hard to write a sensible grammar. But that’s another article.

So finally, I just decide to hand-write the lexer and begin to make progress near the end of day one.