Some tasks are fun — they are self-contained, easy to conceptualize (though not always easy to build), and challenging. A few weeks ago, I spent my weekend writing a parser, and it was a lot of fun.
I’ve been writing parsers for a very long time. My book about them came out in 2010 and I've been playing with Boo since 2005. The ANTLR book was an interesting read and it taught me a lot about how to approach text parsing.
However, you may have noticed that I've shifted my thinking about a lot of design problems — in particular, performance, the number of allocations, the exceptions thrown during parsing, the readability of errors when getting invalid input, etc.
In particular, the Lucene query parser is a really good example of a really crappy one. It fails on pretty much all of the above points. I worked with a bunch of parser generators and I never found one whose output was something that I really liked. They typically generate unreadable code, and the customization of their behavior is not obvious, to say the least.
As Martin Fowler has said (only slightly out of context):
…it's hard to write a parser.
My most recent parser is the RavenDB JSON parser. You can see the progression of the ideas around that in this series of posts. (That isn’t something that you’ll really want to read without a cup of coffee and some writing instruments to write notes.)
Most non-trivial parsers are composed of at least two pieces: a tokenizer and an builder. The tokenizer goes over the input and breaks it into tokens that the builder uses to build the final format. The JSON scanner in RavenDB is called
UnmanagedJsonParser and the builder is
BlittableJsonDocumentBuilder. Traditionally, they would be called scanner and parser, for the roles they play, but we’ll leave the names as is because it doesn’t really matter. This code is not fun to go through. It has been through multiple performance reviews, each time making it uglier then before, but much faster.
JSON is also one of the simplest possible textual formats. The formal definition of JSON fits a post-it note. The JSON scanner I have for RavenDB is close to 900 lines of code and is only part of the parsing process.
But the parser I built over the weekend wasn’t for JSON. Instead, I wanted to play with a query language, so naturally, I wanted something SQL-like. And that is anything but trivial to do.
The first thing I needed was to actually sit down and figure out what the language would look like. In order to do that, you almost always want to use a BNF notation of some kind. This allows you to specify what your language should look like — not just as a few snippets in a notepad window but in a more structured manner.
More to the point, there are a lot of tools out there to use. I decided to use GOLD Parser. It was last updated in in 2012 (and it shows) but it has the lowest friction of all the parser IDEs that I tried and it has great support for debugging and working with grammars. Why not use ANTLR, which is pretty much the default choice? Put simply, it is usually too much of a hassle to setup ANTLR properly and I didn’t want to get bogged down with the actual details of generating the parsers. I wanted to focus on the grammar.
I actually don’t know how to parse text using the GOLD Parser. It looks like it generates a binary file that you feed to some library that would do it for you, but I’m not sure and it doesn’t matter. What I care about is that I can develop a formal definition and debug it easily. I’m not actually going to use it to generate the parser.
Huh? Why do all this work for no reason?
A formal definition of the language is incredibly helpful when you consider a new syntax because you can verify that you aren’t creating holes and ambiguities in your language. It also gives you pretty clear guidelines on how to implement the language.
I’m going to go into more details about the language itself and building a parser for it in my next post.