Over a million developers have joined DZone.

How to Implement a Ruby Hash Like Grammar in Parslet, Part 1

DZone 's Guide to

How to Implement a Ruby Hash Like Grammar in Parslet, Part 1

In Part 1 of this two-part series, a web developer discusses a problem he hoped to solve using regex, and how he created a (failing) test using Ruby.

· Web Dev Zone ·
Free Resource

Before you can understand how to build a parser using parslet, you need to understand why you might want to. In my case, I have a library called rundoc, it allows anyone to write documentation that can be "run." For example, someone might write docs that had this:

:::>> $ rails -v

Then in your documentation output, you would get this result:

$ rails -v
Rails 5.2.0

Note: If you want, you can skip to the next section for the tutorial.

While this doesn't seem that impressive on the surface - if you have docs that need to be updated frequently, it can save a lot of time spent copying and pasting outputs. Even more importantly, it can catch errors and regressions that you might not catch manually.

I use this library to keep some docs in the Heroku devcenter, such as the getting started with Rails 5 article. When a new version of Rails or Ruby is released I can update the doc, "run it," and then ensure that the output in the documentation matches perfectly with the output someone using the same commands would get. If a command fails, the generation process fails, so the doc acts as a test as well.

The other added benefit of this approach is to the reader. By ensuring consistency of output of commands, the reader can better detect when something has gone wrong. Enough about my project though, this article is about building a parser. Why does rundoc need parslet?

When I wrote rundoc I first implemented it using regexes. They started out simple, but then got more and more gnarly, here's what they look like now:

INDENT_BLOCK       = '(?<before_indent>(^\s*$\n|\A)(^(?:[ ]{4}|\t))(?<indent_contents>.*)(?<after_indent>[^\s].*$\n?(?:(?:^\s*$\n?)*^(?:[ ]{4}|\t).*[^\s].*$\n?)*))'
GITHUB_BLOCK       = '^(?<fence>(?<fence_char>~|`){3,})\s*?(?<lang>\w+)?\s*?\n(?<contents>.*?)^\g<fence>\g<fence_char>*\s*?\n'
COMMAND_REGEX      = ->(keyword) {

This allows the registration of a "command" such as $ which then takes one input, a string. It's pretty flexible, but recently I wanted to allow for a more complex grammar, something that would allow me to use keyword arguments like syntax:

:::>> background.start(command: "rails server", name: "server")

While I could maybe have implemented this in a regex, it seemed like I was using the wrong tool for the job, and that I needed a fully featured parser. Enter Parslet.

Rather than trying to write one regex to rule them all, parslet allows me to build up my grammar piece by piece. These small pieces then fit together to make more and more complex grammars. The output of this parse step is a deeply nested syntax "tree" which we will see in a bit.

Parslet then has the concept of a "transformer" that can be used to turn complex parsed trees into any kind of Ruby code, such as value objects that we want. When you combine parsers and transformers, you can write your own mini-language with whatever syntax rules you want. This is perfect for something like rundoc.

While there are other parser libraries in Ruby, I chose parslet largely because of the amount of examples and documentation it had. The documentation site is great, and there are examples directly in the repo. My only minor-nit is that the readme isn't rendered in markdown in GitHub, but I can get over it.

I also looked at the treetop gem, but I couldn't make much progress. I investigated the racc library which is great if you're familiar with yacc syntax (which I'm not). Yacc is the syntax that Ruby uses to implement its own grammar. If you want to go down that Rabbit hole, Aaron recommended the O'Reilly book on Yacc which seems good but requires more reading than I was willing to put into this project.

Now that you know my problem and my toolset, it's time for a tutorial! I'm writing this, not as an expert in parsers (or in parslet even), but really to convince myself that I understand what I've already done. When you can explain how you did something to someone else, then you actually understand it. Without further ado, here's the tutorial.

Tutorial Setup

In this parslet tutorial, we will build a grammar that can read in a Ruby 1.9 style hash with symbol keys and string values. Something like this:

{hello: "world", iam: "Schneems"}

That might seem like an easy goal, but as Caral Sagan once said, if you want to create a buttermilk biscuit from scratch, you'll first have to invent the universe.

How does this relate to rundoc? Remember how I wanted syntax that matched keyword args? I'll have to implement that same syntax to get Ruby hash literal syntax to work in Parslet.

I generally don't develop using pure TDD methods (i.e. writing tests first, then getting them to pass), but I've found when implementing a grammar, it's much easier to write a test and then implement it second.

To that end, I've implemented a failing test and my code all in one file:

require 'parslet'

class MyParser < Parslet::Parser

class MyTransformer < Parslet::Transform

require 'minitest/autorun'
require 'parslet/convenience'

class MyParserTest < Minitest::Test
  def test_parses_a_comma
    input = %Q{,}
    parser = MyParser.new.comma
    tree = parser.parse_with_debug(input)
    refute_equal nil, tree

I can run this file directly using $ ruby example.rb and see if my tests pass.

My First Failing Test

If you like to see the finished product first, you can jump to the source code to my example parslet app.

In the above code, the comma method on MyParser.new is referring to a "rule" in parslet that I've not implemented yet. This test fails with:

NoMethodError: undefined method `comma' for #<MyParser:0x00007ff5dc074d18>
    example.rb:14:in `test_parses_a_comma'

In parslet, you add a rule by using the rule keyword, giving the rule a symbol name, and then defining the rule inside of a block. Here's a rule that matches our comma:

class MyParser < Parslet::Parser
  rule(:comma) { str(",") }

Inside of the block, the keyword str refers to a literal string match. In this case, we're only passing in a single comma, so it matches the comma in our input string. Pretty straightforward. The tests now pass.

When building a grammar, I've found it easiest to start with very small pieces and build out from there. We can re-use those small pieces to help build more complicated grammars.

While this matches a single comma, it won't match if we add a test to have spaces around it:

def test_parses_a_comma_with_spaces
  input = %Q{ , }
  parser = MyParser.new.comma
  tree = parser.parse_with_debug(input)
  refute_equal nil, tree

Now the test fails:

Expected ",", but got " " at line 1 char 1.

Finished in 0.001192s, 838.9262 runs/s, 838.9262 assertions/s.

  1) Failure:
MyParserTest#test_parses_a_comma_with_spaces [example.rb:19]:
Expected nil to not be equal to nil.

How can we update the grammar to allow for spaces? In addition to str there is also a match keyword that will match via a regex. A space could be matched like this:

rule(:spaces) { match('\s').repeat(1) }

Here the regex \s will match any whitespace character. The call to repeat(1) says that it must be repeated at least once, but has no upper bound.

This means it will match ` (one space) and ` ` (6 spaces) but not `` (no spaces).

While this is a useful rule, we also want to match the case where we don't have spaces in addition to the case where we do. To accomplish that we can add a spaces? rule, that uses the spaces rule and adds to it:

 rule(:spaces?) { spaces.maybe }

Inside of the block, the call to spaces uses our previously defined rule. The maybe method is provided by parslet and indicates that if it matches the spaces rule, great. Also if it doesn't match that rule, it's fine.

We can put all these things together to get our tests to pass by updating the comma rule, here's the full thing:

class MyParser < Parslet::Parser
  rule(:spaces)  { match('\s').repeat(1) }
  rule(:spaces?) { spaces.maybe }
  rule(:comma)   { spaces? >> str(',') >> spaces? }

While we understand spaces? and str(','), what is this >> operator doing? I don't know the term for it, but I mentally named it "and then." I read this rule as "Match spaces (if there are any), and then explicitly match a string of ',' and then match spaces (if there are any)". Now that we have a rule, let's make sure our tests pass:

Finished in 0.001970s, 1015.2284 runs/s, 1015.2284 assertions/s.

2 runs, 2 assertions, 0 failures, 0 errors, 0 skips


Now that we have a comma, what else could our language use? How about a string.

Here's the test:

  def test_parses_a_string
    input = %Q{"hello world"}
    parser = MyParser.new.string
    tree = parser.parse_with_debug(input)
    refute_equal nil, tree

It fails, how will it pass? Let's look at the structure of our string first. It's got a quote character " then it has other characters, those characters can be repeated, and they cannot include another quote. Then, finally, the string terminates with another quote ". How is this represented in parslet syntax?

rule(:string) {
  str('"') >> (
    str('"').absent? >> any
  ).repeat.as(:string) >> str('"')

You've seen everything here before, except for absent?, any, and as. The absent? method checks for the lack of that character. The any keyword will match, well, anything. The any character is shorhand for match('.'). The combination of str("'").absent? >> any is checking each character to see that it does not contain a " and then it will match any other character.

What does the as do? This is our way of telling parslet that we are dealing with a significant part of our grammar. While I don't necessarily need to know how many spaces are around a comma, I'll likely want to know the contents of a string. That's why I added as(:string).

When you run the tests you'll see that it passes. I want to go one step further, though, and actually verify the format of the parsed tree (instead of saying "not nil"). To do that, I'll change the test:

  def test_parses_a_string
    input = %Q{"hello world"}
    parser = MyParser.new.string
    tree = parser.parse_with_debug(input)
    expected = {string: "hello world"}
    assert_equal expected, tree

In parslet, each as produces a hash (and potentially an Array). As we keep going you'll see that they will be deeply nested. While a parser builds a tree, a "transformer" takes a tree as an input and simplifies it to make it smaller.

To understand, we'll try to add a grammar for Ruby's hash. Eventually, we want to parse {hello: "world" }. To start, though, we can match a subset of this, just the inner part hello: "world". This is also similar to keyword args.

Before we start, let's brainstorm the syntax of this. A key is any character that is not a : or a space, followed by a : literal. The value is a string, but, in the future, it could be a number or an array. We can also have multiple keys and multiple values separated by commas.

This is a complicated feature, let's start with the smaller parts and work towards the larger piece. I mentioned that a value can be things other than a string, to allow for this later we can add a value rule:

rule(:value) { string }

If a "number" rule existed then we could use the | operator to specify that a value can be a string | number (string or number). We won't do that here, but know that's why I pre-emptively made a value rule, even if it doesn't look like it's doing anything.

Next up I want a key parser. Here's a test:

def test_key
  input = %Q{ hello: }
  parser = MyParser.new.key
  tree = parser.parse_with_debug(input)
  refute_equal nil, tree

The key rule needs to allow for a space at the beginning and the end. You'll remember that a key is a repetition of characters that are not : or a space, and that ends in a colon, :. Here's what I came up with:

rule(:key) {
  spaces? >> (
    str(':').absent? >> match('\s').absent? >> any
  ).repeat.as(:key) >> str(':') >> spaces?

This gets the tests to pass. Notice, that since we'll care about the contents of the key, it is named here using the as keyword.

A hash can have repeating key-value pairs. Before we match an entire series, let's group a key and a value together. Start with a test:

def test_parses_a_key_value_pair
  input = %Q{ hello: "world" }
  parser = MyParser.new.key_value
  tree = parser.parse_with_debug(input)
  refute_equal nil, tree

It fails, let's make it pass with a key_value rule:

rule(:key_value) {
    key >> value.as(:val)
  ).as(:key_value) >> spaces?

We are really close to finishing our grammar, but before we do, I want to take a look at the output of the tree for this key/value. It looks like this:

{ :key_value => { :key => "hello", :val => { :string => "world" }}}

This hash somewhat makes sense to me. We have a top-level key key_value and that points to another hash. This hash has a key key that points to a "hello" string and a val key that points to yet another hash, {:string=>"world"}. While it's not super complicated, we can make the result of this simpler by using a transformer.

That's all for today's post, tune in tomorrow when we'll discuss transformers, repeated rules, and create our first true data structure. 

web dev ,regex ,regular expressions ,ruby ,documentation generation

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}