Parsing in Java (Part 3): PEG Parsers and Combinators

As we conclude the Parsing in Java series, we examine the variety PEG parsers and parser combinators out there to determine when they should be used.

Jun. 11, 17 · Tutorial

Likes (11)

Comment

Save

21.9K Views

Welcome back to Part 3 of the Parsing in Java series! In case you missed them, take a look at Part 1 and Part 2, which respectively talk about parser structures and CFG parsers. This post will conclude the series with a look at the different PEG parsers in the wild. Just like Part 2, you'll get an overview of the parsers, followed by grammar examples and some advice, if applicable.

Without further delay, here we go.

PEG

Now that we've covered CFG parsers, it is time to see the PEG parsers available in Java.

Canopy

Canopy is a parser compiler targeting Java, JavaScript, Python and Ruby. It takes a file describing a parsing expression grammar and compiles it into a parser module in the target language. The generated parsers have no runtime dependency on Canopy itself.

It also provides easy access to the parse tree nodes.

A Canopy grammar has the neat feature of using action annotations to use custom code in the parser. In practical terms, you just write the name of a function next to a rule, then you implement the function in your source code.

A Canopy grammar with actions:

// the actions are prepended by %
grammar Maps
    map     <-  "{" string ":" value "}" %make_map
    string  <-  "'" [^']* "'" %make_string
    value   <-  list / number
    list    <-  "[" value ("," value)* "]" %make_list
    number  <-  [0-9]+ %make_number

The Java file containing the action code:

[..]
import maps.Actions;
[..]

class MapsActions implements Actions {
    public Pair make_map(String input, int start, int end, List<TreeNode> elements) {
        Text string = (Text)elements.get(1);
        Array array = (Array)elements.get(3);
        return new Pair(string.string, array.list);
    }

    [..]
}

Laja

Laja is a two-phase scannerless, top-down, backtracking parser generator with support for runtime grammar rules.

Laja is a code generator and a parser generator mainly designed to create external DSLs. This means that it has some peculiar features. With Laja, you must specify not only the structure of the data, but also how the data should be mapped into Java structures. These structures are usually objects in a hierarchy or flat organization. In short, it makes very easy to parse data files, but it is less suitable for a generic programming language.

Laja options, like an output directory or input file, are set in a configuration file.

A Laja grammar is divided into a rules section and the data mapping section. It looks like this:

// this example is from the documentation
grammar example {
    s = [" "] + ;
    newline = "\r\n" | "\n";

    letter = "a"..
    "z";
    digit = "0"..
    "9";
    label = letter[digit | letter] + ;
    row = label ":"
    s[!(newline | END) + ]: value[newline];
    example = row + ;

    Row row.setLabel(String label);
    row.setValue(String value);

    Example example.addRow(Row row);
}

Mouse

Mouse is a tool to transcribe PEG into an executable parser written in Java.

Mouse does not use Packrat, and thus it uses less memory than the typical PEG parser (the manual explicitly compares Mouse to Rats!).

It does not have a grammar repository, but there are grammars for Java 6-8 and C.

A Mouse grammar is quite clean. To include custom code, a feature called semantic predicates, you do something similar to what you do in Canopy. You include a name in the grammar and then later, in a Java file, you actually write the custom code.

A Mouse grammar:

// example from the manual
// http://mousepeg.sourceforge.net/Manual.pdf
// the semantics are between {}
Sum = Space Sign Number(AddOp Number) * !_ {
    sum
};
Number = Digits Space {
    number
};
Sign = ("-"
    Space) ? ;
AddOp = [- +] Space;
Digits = [0 - 9] + ;
Space = " " * ;

Rats!

Rats! is a parser generator part of xtc (eXTensible Compiler). It is based on PEG, but it uses “additional expressions and operators necessary for generating actual parsers.” It supports left-recursive productions. It can automatically generate an AST.

It requires Java 6 or later.

The grammar can be quite clean, but you can embed custom code after each production:

// example from Introduction to the Rats! Parser Generator
// http://cs.nyu.edu/courses/fall11/CSCI-GA.2130-001/rats-intro.pdf
/* module intro */
module Simple;
option parser(SimpleParser);

/* productions for syntax analysis */
public String program = e: expr EOF {
    yyValue = e;
};
String expr = t: term r: rest {
    yyValue = t + r;
};
String rest = PLUS t: term r: rest {
    yyValue = t + "+" + r;
}
/ MINUS t:term r:rest { yyValue = t + "-" + r; } /
/*empty*/ {
    yyValue = "";
};
String term = d: DIGIT {
    yyValue = d;
};

/* productions for lexical analysis */
void PLUS = "+";
void MINUS = "-";
String DIGIT = [0 - 9];
void EOF = !;

Parser Combinators

Combinators allow you to create a parser simply with Java code by combining different pattern matching functions that are equivalent to grammar rules. They are generally considered suited for simpler parsing needs. Given they are just Java libraries, you can easily introduce them into your project: You do not need any specific generation step and you can write all of your code in your favorite Java editor. Their main advantage is the possibility of being integrated into your traditional workflow and IDE.

In practice, this means that they are very useful for all the little parsing problems you find. If the typical developer encounters a problem that is too complex for a simple regular expression, these libraries are usually the solution. In short, if you need to build a parser, but you don’t actually want to, a parser combinator may be your best option.

Jparsec

Jparsec is the port of the parsec library of Haskell.

Parser combinators are usually used in one phase — that is to say they are without a lexer. This is simply because it can quickly become too complex to manage all the combinator chains directly in the code. Having said that, Jparsec has a special class to support lexical analysis.

It does not support left-recursive rules, but it provides a special class for the most common use case: managing the precedence of operators.

A typical parser written with Jparsec is similar to this one:

// from the documentation
public class Calculator {

    static final Parser < Double > NUMBER =
        Terminals.DecimalLiteral.PARSER.map(Double::valueOf);

    private static final Terminals OPERATORS =
        Terminals.operators("+", "-", "*", "/", "(", ")");

    [..]

    static final Parser << ? > TOKENIZER =
        Parsers.or(Terminals.DecimalLiteral.TOKENIZER, OPERATORS.tokenizer());

    [..]

    static Parser < Double > calculator(Parser < Double > atom) {
        Parser.Reference < Double > ref = Parser.newReference();
        Parser < Double > unit = ref.lazy().between(term("("), term(")")).or(atom);
        Parser < Double > parser = new OperatorTable < Double > ()
            .infixl(op("+", (l, r) - > l + r), 10)
            .infixl(op("-", (l, r) - > l - r), 10)
            .infixl(Parsers.or(term("*"), WHITESPACE_MUL).retn((l, r) - > l * r), 20)
            .infixl(op("/", (l, r) - > l / r), 20)
            .prefix(op("-", v - > -v), 30)
            .build(unit);
        ref.set(parser);
        return parser;
    }

    public static final Parser < Double > CALCULATOR =
        calculator(NUMBER).from(TOKENIZER, IGNORED);
}

Parboiled

Parboiled provides a recursive descent PEG parser implementation that operates on PEG rules you specify.

The objective of parboiled is to provide an easy-to-use and understand way of creating small DSLs in Java. It put itself in the space between a simple bunch of regular expressions and an industrial-strength parser generator like ANTLR. A parboiled grammar can include actions with custom code, included directly into the grammar code or through an interface.

Here is an example parboiled parser:

// example parser from the parboiled repository
// CalculatorParser4.java

package org.parboiled.examples.calculators;

[..]

@BuildParseTree
public class CalculatorParser4 extends CalculatorParser < CalcNode > {

    @Override
    public Rule InputLine() {
        return Sequence(Expression(), EOI);
    }

    public Rule Expression() {
        return OperatorRule(Term(), FirstOf("+ ", "- "));
    }

    [..]

    public Rule OperatorRule(Rule subRule, Rule operatorRule) {
        Var < Character > op = new Var < Character > ();
        return Sequence(
            subRule,
            ZeroOrMore(
                operatorRule, op.set(matchedChar()),
                subRule,
                push(new CalcNode(op.get(), pop(1), pop()))
            )
        );
    }

    [..]

    public Rule Number() {
        return Sequence(
            Sequence(
                Optional(Ch('-')),
                OneOrMore(Digit()),
                Optional(Ch('.'), OneOrMore(Digit()))
            ),
            // the action uses a default string in case it is run during error recovery (resynchronization)
            push(new CalcNode(Double.parseDouble(matchOrDefault("0")))),
            WhiteSpace()
        );
    }

    //**************** MAIN ****************

    public static void main(String[] args) {
        main(CalculatorParser4.class);
    }
}

It does not build an AST for you, but it provides a parse tree and some classes to make it easier to build it. That is because its authors maintain that the AST is heavily dependent on your exact project needs, so they prefer to offer an “open and flexible approach.” It sounds quite appropriate to the project objective and one of our readers finds the approach better than a straight AST.

The documentation is very good. It explains features, shows examples, and compares the ideas behind Parboiled with the other options. There are some example grammars in the repository, including one for Java.

It is used by several projects, including important ones like Neo4j.

PetitParser

PetitParser combines ideas from scannerless parsing, parser combinators, parsing expression grammars and packrat parsers to model grammars and parsers as objects that can be reconfigured dynamically.

PetitParser is a cross between a parser combinator and a traditional parser generator. All the information is written in the source code, but the source code is divided into two files. In one file, you define the grammar while in the other one, you define the actions corresponding to the various elements. The idea is that it should allow you to dynamically redefine grammars. While it is smartly engineered, it is debatable if it is also smartly designed. You can see that the example JSON grammar it is more lengthy than one expects it to be.

Here is an excerpt from the example grammar file for JSON:

package org.petitparser.grammar.json;

[..]

public class JsonGrammarDefinition extends GrammarDefinition {

    // setup code not shown

    public JsonGrammarDefinition() {
        def("start", ref("value").end());

        def("array", of('[').trim()
            .seq(ref("elements").optional())
            .seq(of(']').trim()));
        def("elements", ref("value").separatedBy(of(',').trim()));
        def("members", ref("pair").separatedBy(of(',').trim()));

        [..]

        def("trueToken", of("true").flatten().trim());
        def("falseToken", of("false").flatten().trim());
        def("nullToken", of("null").flatten().trim());
        def("stringToken", ref("stringPrimitive").flatten().trim());
        def("numberToken", ref("numberPrimitive").flatten().trim());

        [..]
    }
}

And here is an excerpt from the example parser definition file (that defines the actions for the rules) for JSON.

package org.petitparser.grammar.json;

import org.petitparser.utils.Functions;

public class JsonParserDefinition extends JsonGrammarDefinition {

    public JsonParserDefinition() {
        action("elements", Functions.withoutSeparators());
        action("members", Functions.withoutSeparators());
        action("array", new Function < List < List << ? >> , List << ? >> () {
            @Override
            public List << ? > apply(List < List << ? >> input) {
                return input.get(1) != null ? input.get(1) : new ArrayList < > ();
            }
        });

        [..]
    }
}

There is a version written in Java, but there are also versions in Smalltalk, Dart, PHP, and TypeScript.

The documentation is lacking, but there are example grammars available.

Java Libraries That Parse Java: JavaParser

There is one special case that requires some more commentary: The case in which you want to parse Java code in Java. In this case, we have to suggest using a library named JavaParser. Incidentally, we heavily contribute to JavaParser, but that is not the only reason we suggest it. The fact is that JavaParser is a project with dozens of contributors and thousands of users, so it is pretty robust.

A quick list of features:

It supports all versions of Java from 1 to 9.
It supports lexical preservation and pretty printing: It means you can parse Java code, modify it, and print it back either with the original formatting or pretty printed.
It can be used with JavaSymbolSolver, which gives you symbol resolution. It understands which methods are invoked, it knows which declarations references are linked to, it calculates the type of expressions, etc.

Convinced? Still to write your own Java parser for Java?

Summary

Parsing in Java is a broad topic, and the world of parsers is a bit different from the usual world of programmers. You will find the best tools coming directly from academia, which is typically not the case with software. Some tools and libraries have been started for a thesis or a research project. The upside is that tools tend to be easily and freely available. The downside is that some authors prefer to have a good explanation of the theory behind what their tools do, rather than good documentation on how to use them. Also, some tools end up being abandoned as the original authors finish their master's or their Ph.D.

We tend to use parser generators quite a lot: ANTLR is our favorite one and we use JavaCC extensively in our work on JavaParser. We do not use parser combinators very often, though. It is not because they are bad — they have their uses and, in fact, we wrote an article about one in C#. But for the problems we deal with, they typically lead to less maintainable code. However, they could be easier to start with, so you may want to consider those, especially if, until now, you have hacked something terrible using regular expressions and a half-baked parser written by hand.

We cannot really definitely say what software you should use. What it is best for one user might not be the best for somebody else. And we all know that the most technically correct solution might not be ideal in real life with all its constraints. Still, we have searched and tried many similar tools in our work, and something like these articles would have helped us save some time. With that in mind, we wanted to share what we have learned about the best options for parsing in Java.

Parser (programming language) Java (programming language)

Published at DZone with permission of Gabriele Tomassetti. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

Trending