Parsing in Python: Tools and Libraries (Part 3)
In Part 3 of this 8-part series, we'll look at three context-free parser generators: ANTLR, Lark, and lrparsing.
The basic workflow of a parser generator tool is quite simple: you write a grammar that defines the language or document and you run the tool to generate a parser usable from your Python code.
The parser might produce the AST, which you may have to traverse yourself, or you can traverse with additional ready-to-use classes such as listeners or visitors. Some tools offer the ability to embed code inside the grammar to be executed every time a specific rule is matched.
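To make the distinction concrete, here is a minimal sketch in plain Python, not tied to any particular tool; the tuple-based AST is invented for illustration. A visitor controls the traversal itself and returns a value for each subtree, while a listener only reacts as a generic walker passes over each node:

```python
# Hypothetical AST for "1 + 2 * 3": a node is either a plain number
# or an (operator, left, right) tuple.
tree = ("+", 1, ("*", 2, 3))

def visit(node):
    """Visitor style: this function decides how to descend and
    computes a result from each subtree."""
    if isinstance(node, tuple):
        op, left, right = node
        l, r = visit(left), visit(right)
        return l + r if op == "+" else l * r
    return node  # leaf: a plain number

def walk(node, on_enter):
    """Listener style: the walker drives the traversal and simply
    fires a callback at every node it enters."""
    on_enter(node)
    if isinstance(node, tuple):
        for child in node[1:]:
            walk(child, on_enter)

print(visit(tree))  # → 7

seen = []
walk(tree, lambda n: seen.append(n[0] if isinstance(n, tuple) else n))
print(seen)  # → ['+', 1, '*', 2, 3]
```

Real parser generators give you the same two options with generated base classes, but the division of labor is the same: visitors compute, listeners observe.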
Usually, you need a runtime library and/or program to use the generated parser.
Let’s see the tools that generate context-free parsers.
ANTLR is a great parser generator written in Java that can also generate parsers for Python and many other languages. There is also a beta version for TypeScript from the same author who maintains the optimized C# version. ANTLR is based on a new LL algorithm (ALL(*)) developed by the author and described in this paper.
It is quite popular for its many useful features. For instance, version 4 supports direct left-recursive rules. However, the real added value for a vast community is the large number of grammars available.
It provides two ways to walk the AST instead of embedding actions in the grammar: visitors and listeners. The first one is used when you have to manipulate or interact with the elements of the tree, while the second is used when you just have to do something when a rule is matched.
The typical grammar is divided into two parts: lexer rules and parser rules. The division is implicit since all the rules starting with an uppercase letter are lexer rules, while the ones starting with a lowercase letter are parser rules. Alternatively, lexer and parser grammars can be defined in separate files.
A very simple ANTLR grammar:
grammar simple;

basic   : NAME ':' NAME ;

NAME    : [a-zA-Z]+ ;

COMMENT : '/*' .*? '*/' -> skip ;
If you are interested in ANTLR, you can look into this giant ANTLR tutorial we have written.
A modern parsing library for Python, implementing Earley and LALR(1) with an easy interface.
Lark is a parser generator that works as a library. You write the grammar in a string or a file and then use it as an argument to dynamically generate the parser. Lark can use two algorithms: Earley when you need to parse any context-free grammar and LALR(1) when you need speed. Earley can also parse ambiguous grammars. Lark offers the ability to automatically resolve ambiguity by choosing the simplest option, or to report all the alternatives.
Lark grammars are written in EBNF format. They cannot include actions. This means that they are clean and readable but also that you have to traverse the resulting tree yourself — although there is a function that can help with that if you use the LALR algorithm. On the positive side, you can also use specific notations in the grammar to automatically generate an AST by dropping, merging, or transforming certain nodes.
The following example grammar shows a useful feature of Lark: it includes predefined rules for common things like whitespace and numbers.
parser = Lark('''
    ?sum: product
        | sum "+" product -> add
        | sum "-" product -> sub
    ?product: item
        | product "*" item -> mul
        | product "/" item -> div
    ?item: NUMBER  -> number
        | "-" item -> neg
        | "(" sum ")"
    %import common.NUMBER
    %import common.WS
    %ignore WS
''', start='sum')
Lark comes with a tool to convert Nearley grammars into its own format. It also includes a useful function to transform the tree generated by the parser into an image.
lrparsing is an LR(1) parser hiding behind a Pythonic interface.
lrparsing is a parser generator whose grammars are defined as Python expressions. These expressions are attributes of a class and correspond to the rules of a traditional grammar. Parsers are usually generated dynamically, but the library also provides a function to precompile a parse table beforehand.
Because they are plain Python expressions, lrparsing grammars can be easy to read for Python developers, but they are harder to read than a traditional grammar.
# from the documentation
class ExprParser(lrparsing.Grammar):
    #
    # Put Tokens we don't want to re-type in a TokenRegistry.
    #
    class T(lrparsing.TokenRegistry):
        integer = Token(re="[0-9]+")
        integer["key"] = "I'm a mapping!"
        ident = Token(re="[A-Za-z_][A-Za-z_0-9]*")
    #
    # Grammar rules.
    #
    expr = Ref("expr")                # Forward reference
    call = T.ident + '(' + List(expr, ',') + ')'
    atom = T.ident | T.integer | Token('(') + expr + ')' | call
    expr = Prio(                      # If ambiguous choose atom 1st, ...
        atom,
        Tokens("+ - ~") >> THIS,      # >> means right associative
        THIS << Tokens("* / // %") << THIS,
        THIS << Tokens("+ -") << THIS,    # THIS means "expr" here
        THIS << (Tokens("== !=") | Keyword("is")) << THIS)
    expr["a"] = "I am a mapping too!"
    START = expr                      # Where the grammar must start
    COMMENTS = (                      # Allow C and Python comments
        Token(re="#(?:[^\r\n]*(?:\r\n?|\n\r?))") |
        Token(re="/[*](?:[^*]|[*][^/])*[*]/"))
lrparsing also provides some basic functions to print parse trees and grammar rules for debugging purposes.
The documentation is really good: it explains everything you need to know about the library and also provides guidance on creating good grammars (e.g., resolving ambiguities). There are also some quite complex example grammars, like one for SQLite.