Copying antifeatures: quote style

The C language gave different meanings to single quotes and double quotes:

'c' // a byte
"c" // array of bytes

Decades later, a difference persisted but, in most languages, changed its character. Single and double quotes now often change how string interpolation works:

"#{foo}" # interpolate the string value of variable foo
'#{foo}' # literal, no interpolation

As far as I know, this comes from Bourne shell. In the weird world of shell scripts, it actually turns out to be useful, but in the more structured world of most programming languages, it has approximately zero real-world use cases.

Suppose that you do literally want to write “#{” in a string:

'#{foo}' # intentional, but looks like a mistake

Python got this right. It used to be right in JavaScript.

Single quotes should be interchangeable with double quotes. Interchangeability allows you to use the easier-to-type single quotes most of the time, but switch to double quotes for easy single-quote escaping.
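Python, for example, treats the two quote styles identically; a quick sketch:

```python
# In Python, quote style never changes meaning; pick whichever
# style avoids escaping.
a = 'say "cheese"'      # double quotes inside single quotes, no escapes
b = "it's fine"         # apostrophe inside double quotes, no escapes

assert "abc" == 'abc'           # identical strings either way
assert a == "say \"cheese\""
assert b == 'it\'s fine'
```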

Copying antifeatures: multiline strings

Multiline strings are a Good Thing but they usually end up implemented just a little bit wrong.

Delimiters

First, they require special delimiters. Most grammars could easily allow ordinary strings to break across lines, and many languages could even add this in a backward-compatible way. A string that includes line breaks is no harder to parse than a single-line string:

foo = "
Hello
world
"
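To see the point, here is a minimal sketch in Python (ignoring escape sequences) of the scanning loop; a newline in the string body is just another character:

```python
# Minimal string-scanning sketch, ignoring escape sequences:
# the loop is identical whether or not the body spans lines.
def scan_string(src, i):
    assert src[i] == '"'
    j = i + 1
    while src[j] != '"':   # a newline needs no special case
        j += 1
    return src[i + 1:j], j + 1

body, end = scan_string('x = "a\nb"', 4)
assert body == "a\nb"
```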

To the great annoyance of programmers everywhere, languages generally require special delimiters for multiline strings:

foo = """
Hello
world
"""

Thus, when we find ourselves typing a string literal and the line gets uncomfortably long, the language makes us go back to the beginning and change delimiters. Vice versa when shortening strings.

Indentation

Most multiline strings end up being indented. Ruby gets multiline string quote style right, but fails on indentation:

def do_stuff()
  foo = "
    hello
    world
    "
  return foo
end

puts do_stuff() # "\n    hello\n    world\n    "

What are those leading spaces doing? I can’t think of a time when I’ve ever wanted a string literal to maintain its source indentation at runtime.

Almost always, multiline strings fall into two categories:

Indentation is irrelevant:

sql = "
  select *
  from foo
  where bar
    and baz "

Needs trim and dedent:

help = "
   usage: frob [-nicate]
     frob the widget

   -n turn on the n
"

So there are many use cases where trimming and dedenting is warranted, but almost none where it hurts.
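Languages that skip trimming and dedenting push the work to runtime. Python, for instance, ships textwrap.dedent for exactly this cleanup:

```python
import textwrap

# The indentation below exists only to match the surrounding code;
# dedent strips the common prefix and strip() drops the edge newlines.
sql = """
  select *
  from foo
"""
cleaned = textwrap.dedent(sql).strip()
assert cleaned == "select *\nfrom foo"
```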

CoffeeScript almost got both aspects right, but it distinguishes between “multiline strings,” which use ordinary quotes, and “block strings,” which use triple quotes. Ordinary strings collapse line breaks to single spaces; triple-quoted strings trim and dedent correctly.

Next time: the perils of such a subtle distinction in quote style.

Comefrom0x10 released

After a long hiatus while I built Text Collector, last week I finally returned to my paradigm-shifting language, Comefrom0x10. It now has a home page on Read the Docs that features a tutorial, standard library documentation and more.

Except for a couple minor bugs, its implementation was actually functionally complete eight months ago. I hesitated to release it, however, because of rather embarrassing performance problems.

Now, it wouldn’t be fair to say that Cf0x10 is just slow. It’s catastrophically slow. The brainfuck.cf0x10 program takes 10 seconds to run helloworld.b on the laptop I’m using to write this, and gets dramatically worse as the program gets longer.

What went wrong?

It’s not a fundamental problem with the comefrom paradigm, but a consequence of the twisted way the language took on a life of its own during implementation. I started with the idea that I was building a rather ordinary stack-based interpreter, but Cf0x10 would have none of it. As it evolved, the original idea became a disfigured mutant: I can demonstrate with tests that it works, but it’s too convoluted to allow necessary optimizations.

Oh well, as they say: first make it right, then make it fast. Or, in this case, just release it.

Comefrom0x10 patterns: Ayarbtu

Blocks in Comefrom0x10 look like functions in other languages and they create scopes like functions, but they are not functions – they do not take parameters or return values. The scope thing aside, block names are analogous to statement labels. Blocks let you use meaningful names instead of line numbers.

Since cf0x10 block names really are just labels that happen to denote scope, you are free to assign a variable of the same name. For example:

foo = 'bar'
foo
  comefrom if foo
  'I am now in the block called foo.'
  'The block should not be confused with the variable ' foo

This variable is often both the signal for jumping into a block and a value the block modifies, as if the block “foo” were returning by assigning to the variable “foo”.

I call this pattern, “All your assignment are belong to us,” or Ayarbtu for short. The standard library block “car” uses it, so you can get the first character of a string like so:

car = "Hello"
car # prints "H"

Here is how you can roll your own Ayarbtu block:

salute
  comefrom if salute
  salute = "Hello, " salute
  comefrom salute if salute

salute = "world"
salute # prints "Hello, world"
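Since few readers run cf0x10, here is a rough Python analogy of the salute block. The class and method names are invented for illustration; the only point carried over is that assignment to a watched name triggers code:

```python
# Hypothetical analogy, not cf0x10: run a hook whenever a watched
# name is assigned, as if a block "came from" the assignment.
class Ayarbtu:
    def __init__(self):
        object.__setattr__(self, "_hooks", {})

    def watch(self, name, hook):
        self._hooks[name] = hook

    def __setattr__(self, name, value):
        hook = self._hooks.get(name)
        object.__setattr__(self, name, hook(value) if hook else value)

env = Ayarbtu()
env.watch("salute", lambda v: "Hello, " + v)
env.salute = "world"
assert env.salute == "Hello, world"
```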

Ayarbtu is just a convention, but assignment as return value has precedent – in QBasic, for example – and cooperative assignment is easily done in, say, Java. You might know it as “aspect-oriented” programming.

Comefrom0x10: unary minus

In the original Comefrom0x10 grammar, I skipped unary minus. In retrospect, its presence has important consequences that I should have considered earlier, and it forced me to reverse a couple of decisions.

Line continuation

Comefrom0x10 uses newlines to delimit statements, which is fairly common in modern languages. What is not so common is that a standalone expression has a side effect: it writes to program output, as in PowerShell. This makes hello world succinct:

'hello world'

Prior to adding the unary minus operator, I allowed line continuation if an operator started a line, so these would be equivalent:

x = 2 + 3

x = 2
  + 3

But unary minus causes problems. These lines could be either “3 - 2” or “print 3; print -2”:

3
-2

I could instead allow operators at the end of a line to indicate line continuation, but I’ve never liked end-of-line continuation markers, as they are too hard to see. Comefrom0x10, therefore, now has no facility at all for line continuation. I will figure it out when I have a burning need.

Concatenation

Unary minus causes a second problem when it interacts with cf0x10’s concatenation behavior.

In cf0x10, space is the concatenation operator: “'foo' 'bar' is 'foobar' # true”. Originally, I allowed the concatenation space to be optional, so “name'@'domain” would be equivalent to “name '@' domain”.

Enter unary minus. In most languages, all three of these expressions mean “a minus b”, but in the original cf0x10 grammar, they could indicate either concatenation with negative b or subtraction:

a -b
a-b
a - b

Other languages do allow space concatenation, but normally only for literals. In Python, for example:

>>> 'a' 'b'
'ab'
>>> a, b = 'a', 'b'
>>> a b # ILLEGAL in Python, but LEGAL in cf0x10
SyntaxError: invalid syntax

In CSS, space is even an operator and “-” can appear at the beginning of an identifier. The CSS grammar avoids ambiguity, however, by making “-” mathematical only when followed by a number, as in “-2px” versus “-moz-example”; “--foo” is illegal. CSS gets away with this because it has no variables, making the operand’s type known during parsing.

Since the cf0x10 parser cannot know operand types, Comefrom0x10 has two options: it could decide whether to subtract or concatenate at runtime, or it could make the grammar a bit less forgiving of how you use spaces. Spacing is already significant, so, on balance, I think that making the spacing rules restrictive is the clearer choice:

a -b # concatenation
a-b # subtraction
a - b # subtraction
a- b # syntax error

The rules are that concatenation requires at least one space and unary minus cannot be followed by a space.
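The table above can be captured mechanically. A toy classifier in Python (names invented, operands assumed to be plain identifiers) shows that spacing alone decides the parse:

```python
import re

# Toy classifier for the spacing rules: concatenation requires a space
# before the minus sign; unary minus cannot be followed by a space.
def classify(expr):
    if re.fullmatch(r"\w+-\w+", expr):
        return "subtraction"       # a-b
    if re.fullmatch(r"\w+ - \w+", expr):
        return "subtraction"       # a - b
    if re.fullmatch(r"\w+ -\w+", expr):
        return "concatenation"     # a -b  (unary minus on b)
    return "syntax error"          # a- b

assert classify("a -b") == "concatenation"
assert classify("a-b") == "subtraction"
assert classify("a - b") == "subtraction"
assert classify("a- b") == "syntax error"
```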

Comefrom0x10: day 5

A few days have passed building a brand-new language: about a day and a half getting most of the grammar worked out, and another day and a half building the guts of the interpreter. Cf0x10 programs can now basically run, albeit with a limited set of operators and without entirely correct scope resolution.

Of course, I spent much of the time on the interpreter working out the ideal semantics for the comefrom statement, a much-neglected field of computer science. The natural question is, “what happens when more than one statement comes from a single other statement?”

In Intercal, this is simply an error. Though that might seem like a cowardly evasion, it can simply be credited to Intercal’s being a lower-level language: the Intercal programmer needs to handle these situations explicitly by abstaining and so forth.

Comefrom0x10, on the other hand, is a high-level language designed to minimize runtime errors, so Comefrom0x10 has well-defined semantics for all variations of coming from a place.

First, eligible comefroms receive control in the order they appear in the source. Alone, however, this rule causes trouble breaking out of blocks. Consider this program that prints the Fibonacci sequence up to ten thousand:

fib
  b = 1
  comefrom fib
  a
  next_b = a + b
  a = b
  b = next_b

  comefrom fib if a > 10000

If the first comefrom took precedence, the loop would never end. Changing the first line to “comefrom fib if a < 10000” would fix the problem, but would require reconfiguring how the variables a and b get swapped. I think the existing version is clearer: last-one-wins allows the common idiom of placing exit conditions at the end of a block.

The first exception to source order, therefore, is that when multiple comefroms are eligible within the same block, the last one wins.
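The rule itself is small. A sketch in Python (function names invented) of picking among eligible comefroms:

```python
# Sketch of the dispatch rule: among the comefroms whose conditions
# hold in the current block, the one latest in source order wins.
def pick_comefrom(comefroms, env):
    # comefroms: (source_line, predicate) pairs in source order
    eligible = [line for line, pred in comefroms if pred(env)]
    return eligible[-1] if eligible else None

fib_comefroms = [
    (3, lambda env: True),                 # comefrom fib
    (8, lambda env: env["a"] > 10000),     # comefrom fib if a > 10000
]
assert pick_comefrom(fib_comefroms, {"a": 5}) == 3
assert pick_comefrom(fib_comefroms, {"a": 20000}) == 8
```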

There are other exceptions, but that’s another article.

Comefrom0x10: day 1

Today, I am finally creating a language.

Inspired by Come Here and Come From, Comefrom0x10, pronounced “Come from sixteen”, will be a modern language based entirely around the much-maligned “come from” control structure.

I am writing the first interpreter in Python. I was hoping to have a grammar worked out by the end of day one, but failed to make significant progress. Originally, I thought I would use a parsing expression grammar, but soon reverted to Backus-Naur. The trouble is that I want cf0x10 to include syntactic indentation (as it’s a modern, Pythonic language). This means that I need indent and dedent constructions, which parsing expression grammars can’t easily express.
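For the curious, indent and dedent are synthetic tokens that the lexer emits by tracking a stack of indentation widths; a minimal sketch in Python (not the actual cf0x10 lexer):

```python
# Minimal indentation-lexer sketch: compare each line's leading
# spaces against a stack of open indentation levels.
def indent_tokens(lines):
    stack = [0]
    tokens = []
    for line in lines:
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            tokens.append("INDENT")
        while width < stack[-1]:
            stack.pop()
            tokens.append("DEDENT")
        tokens.append(line.strip())
    while len(stack) > 1:       # close any blocks still open at EOF
        stack.pop()
        tokens.append("DEDENT")
    return tokens

assert indent_tokens(["a", "  b", "c"]) == ["a", "INDENT", "b", "DEDENT", "c"]
```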

So, I decide to switch to LALR. As a bonus, using LALR will, of course, let people easily build million-line cf0x10 programs without worrying about slow compile times.

The Python parser generator landscape is surprisingly bleak. Many of the parser generators available are abandoned or half-assed. Ply seems to be the only real contender. There are two main problems with Ply: first, its mechanism of embedding grammar specification within docstrings makes the parser obnoxiously verbose; second, it doesn’t provide an obvious way to generate a parser that does not require installing Ply.

Ply has a very strong point though: it provides outstanding debug output for diagnosing problems with your grammar. I’ll just live with the verbosity; I might try PlyPlus if it gets too annoying. As far as removing the Ply dependency goes, I’ll wait until the Enterprise customers for cf0x10 want to fork out money to fund writing a build script so I can use the generated parse table without depending on Ply.

Ply’s scanner generator has the usual Lex push and pop state capabilities, plus it makes saving arbitrary lexer state easy. I, therefore, spend a good deal of day one trying to avoid hand-writing a lexer. As usual, this is a waste of time.

The insurmountable problem is that Ply helpfully stops you from providing regexes that match an empty string, presumably assuming they would cause infinite loops. In addition to syntactic indentation, Comefrom0x10 uses syntactically significant blank lines, and without either a facility for pushback or empty-matching regexes, it is very hard to write a sensible grammar. But that’s another article.

So finally, I just decide to hand-write the lexer and begin to make progress near the end of day one.

When All You Have Is a Class

I care little for static typing, but when a language gives you static types, use the type system.

Say we have a domain class like

public class User {
  public User(String firstName,
              String lastName,
              String email,
              long balanceInDollars) {
    ...
  }
}

I’m in agreement with Stephan that using String and primitive types here is wimping out. That is weak typing. Weak! Chuck Norris does not approve!

The post goes on to describe a more typesafe and relatively elegant implementation. The type system, though obnoxiously verbose, is not Java’s real problem. The actual deficit that makes Java less expressive than many other languages is that it supports only one modeling paradigm: the ubiquitous class.

Take an interview question as an example:

Given an input string, write a program that lists the top n words. So for the sentence, “Do or do not,” the top 2 tokens would be:

do = 2
not = 1

Sort first by frequency, then lexically by token.

A reasonable solution takes four conceptual steps:

  1. Tokenize
  2. Count and map tokens to frequency
  3. Sort
  4. Display the first n

In Java, sorting causes the most code. The most obvious way to maintain type safety demands an entirely new class: it contains two properties, token and frequency; it implements Comparable, equals and hashCode. The implementation then sticks instances of this class into a sorted data structure.

An alternative implementation could eschew the new class, instead jamming the data into two-element Object arrays and sorting them with a Comparator. This would likely be shorter but would lose type checking along the way.

The first approach feels unnatural because the data and the sorting behavior have no intuitive connection. Word frequency means nothing outside the sentence, while sort order means nothing outside the problem. The two should live apart, but the language foists this artificial marriage on the program. To make the error in this design clearer, consider what happens if the problem specifies that sort order should be interchangeable; for example, sort same-frequency tokens by length instead.

The second approach, though more conceptually pure, loses type safety. Using a value class containing token and frequency improves it, but once again, that class lacks conceptual justification: the token itself is unrelated to its frequency apart from the sentence.

Like many problems, this one involves only scraps of data and a way to sort them, yet both Java programs require class implementations for a problem with no natural need to model anything as a class.
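For contrast, here is the whole problem in Python, where tuples and a sort key stand in for the class. Nothing below is specific to Java’s ecosystem; it is simply the four steps, directly:

```python
from collections import Counter

def top_words(text, n):
    counts = Counter(text.lower().split())          # 1. tokenize, 2. count
    # 3. sort by descending frequency, then ascending token
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]                               # 4. first n entries

assert top_words("Do or do not", 2) == [("do", 2), ("not", 1)]
```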

Not only does Java make you think in classes even when inappropriate, it simply refuses to acknowledge that any other problem model exists. You want collection literals to group data? Forget type safety. First class functions? Better define an interface. Mixins? Sorry, not even multiple inheritance. Declarative domain specific languages? Good luck. When all you have is a class, everything looks like an object.

Groovy, The Gateway Drug

Once upon a time, the only languages I trusted used static typing. This now seems silly, but it is a fairly common attitude, and for good reason. To a developer trained in C++ and Java formality, using Ruby or Python feels like anarchy. And then there is that pejorative term, “scripting language.”

At my first job after college, I built Java Enterprise Web Services on the gruesome Servicemix platform. My task involved deploying the XML beasts and synchronizing our CVS repository with the Devil’s own source control. Just maybe, this was a job for a scripting language. Just a taste couldn’t hurt…

So I found Groovy and got myself a copy of Groovy In Action.

At first, my Groovy scripts looked like Java, semicolons and all, since almost any Java program is also valid Groovy. An established codebase full of closures and duck types would have been alien, but to a programmer tiptoeing at the edge of dynamic languages, Groovy felt very safe. Over time, that same comforting safety made Groovy feel like lugging Java baggage, but I highly recommend it for the programmer monogamous with Java. Just start renaming your Java files with Groovy extensions.

Next, drop a semicolon here and there and those ridiculous “throws Exception” clauses. Then, discover the “each” method; it seems like syntactic sugar, but explore. It involves some wacky thing called a closure. Write a function that takes a closure argument.

Every once in a while, try writing “def foo” instead of “String foo.” Surely dangerous, but promise yourself it will just be this one more time. The examples all do it; how bad could it be?

Ask yourself again, “What’s so great about closures?” Google says they sprout like weeds among the Ruby community. That’s still not a real language, but maybe Groovy is; it works with Java, after all.

Soon enough, you will realize Java feels stifling. “Dicing all this XML sure would be easier in Groovy,” you will think. You will begin to notice, and to understand, the dynamic language zealots. The cool kids mostly run Ruby and Python, and Java virtual machine startup delay is a drag for scripts, so try one of those.

No variable declarations at all? No damn curly brackets? You will be hooked.

Object Disorientation

Ask not what you can do to your objects, but what your objects can do for you.

The object-oriented primer tells us that objects are collections of data and behaviors. Sadly, modern Java de-emphasizes the behaviors, telling us we just need a bunch of beans wired to a few business functions and… tada… object-oriented magic.

This sort of programming tends not to be object-oriented at all. It generally leads to reams of what Martin Fowler calls “getter confetti,” plus a few thousand-line methods that do all the real work in about as procedural a way as an object-oriented language allows.

I nominate this pattern for the name “object-disorientation,” or “object-disoriented.” Yes, someone already called it the blob, but the blob differs in its cause. The blob results from “sloth” and “haste.” Object disorientation results from careful, intentional misapplication of object-oriented principles. Someone says to encapsulate data, and someone carefully encapsulates fields, missing the more subtle encapsulation and allocation of behavior.

I suspect it happens because people think of programs as recipes for solving problems rather than as reflections of the problems themselves. We tend to ask for the steps from here to there rather than heeding Hannibal Lecter and asking of each particular thing, “what is it in itself… what [is] its causal nature… and what is it doing in the world?”

Reluctantly, I admit that I have sometimes created getters, particularly in Groovy, which supports them syntactically, but only immutable properties need apply. In all cases, the class in question must mainly exist to do other useful things.

Instead of beans, if you want a collection of data, use a Collection. Pure mindless beans are not object-oriented; they existed in C too but we called them “structs.”