Parsing in Python: Tools and Libraries (Part 1)
Parsing in Python: Tools and Libraries (Part 1)
In Part 1 of this 8-part series, we'll look at tools to create parsers and start learning useful things to know about parsers.
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
If you need to parse a language or document from Python, there are fundamentally three ways to do so:
- Use an existing library supporting that specific language: The first option is the best for well-known and supported languages like XML or HTML. A good library usually also includes APIs to programmatically build and modify documents in that language. This is typically more of what you get from a basic parser. The problem is that such libraries are not so common and they support only the most common languages. In other cases, you are out of luck.
- Building your own custom parser by hand: You may need to pick the second option if you have particular needs, whether the language that you need to parse cannot be parsed with traditional parser generators or you have specific requirements that you cannot satisfy using a typical parser generator. For instance, maybe you need the best possible performance or a deep integration between different components.
- A tool or library to generate a parser: In all other cases, the third option should be the default one because it is the most flexible and has a shorter development time. (For example, ANTLR can be used to build parsers for any language.) That is why, in this article, we concentrate on the tools and libraries that correspond to this option.
Note: Any text in a blockquote describing a program comes from the respective documentation.
Tools to Create Parsers
We are going to see:
- Tools that can generate parsers usable by Python (and possibly from other languages).
- Python libraries to build parsers.
Tools that can be used to generate the code for a parser are called parser generators or compilers. Libraries that create parsers are known as parser combinators.
Parser generators (or parser combinators) are not trivial; you need some time to learn how to use them and not all types of parser generators are suitable for all kinds of languages. That is why we have prepared a list of the best known of them with a short introduction for each. We also concentrate on one target language: Python. This also means that (usually) the parser itself will be written in Python.
To list all possible tools and libraries parser for all languages would be kind of interesting, but not that useful. That is because there would be simply too many options and we would get lost in all of them. By concentrating on one programming language, we can provide an apples-to-apples comparison and help you choose one option for your project.
Useful Things to Know About Parsers
To make sure that this list is accessible to all programmers, we have prepared a short explanation of terms and concepts that you may encounter searching for a parser. We are not trying to give you formal explanations, but practical ones.
Structure of a Parser
A parser is usually composed of two parts: a lexer, also known as scanner or tokenizer, and the proper parser. Not all parsers adopt this two-step schema; some parsers do not depend on a lexer. They are called scannerless parsers.
A lexer and a parser work in sequence: the lexer scans the input and produces the matching tokens, while the parser scans the tokens and produces the parsing result.
Let’s look at the following example and imagine that we are trying to parse a mathematical operation.
437 + 734
The lexer scans the text and finds 4, 3, and 7, and then the space. The job of the lexer is to recognize that the first characters constitute one token of type NUM. Then the lexer finds a + symbol, which corresponds to the second token of type PLUS, and lastly, it finds another token of type NUM.
The parser will typically combine the tokens produced by the lexer and group them.
The definitions used by lexers or parsers are called rules or productions. A lexer rule will specify that a sequence of digits corresponding to a token of type NUM, while a parser rule will specify that a sequence of tokens of type NUM, PLUS, NUM corresponds to an expression.
Scannerless parsers are different because they directly process the original text instead of processing a list of tokens produced by a lexer.
It is now typical to find suites that can generate both a lexer and parser. In the past, it was more common to combine two different tools: one to produce the lexer and one to produce the parser. This was, for example, the case of the venerable lex and yacc couple — lex produced the lexer, while yacc produced the parser.
Parse Tree and Abstract Syntax Tree
There are two terms that are related, and sometimes, they are used interchangeably: parse tree and abstract syntax tree (AST).
Conceptually, they are very similar.
They are both trees. There is a root representing the whole piece of code parsed. Then, there are smaller subtrees representing portions of code that become smaller until single tokens appear in the tree.
The difference is the level of abstraction. The parse tree contains all the tokens that appeared in the program and possibly a set of intermediate rules. The AST, instead, is a polished version of the parse tree where the information that could be derived or is not important to understand the piece of code is removed.
In the AST, some information is lost; for instance, comments and grouping symbols (parentheses) are not represented. Things like comments are superfluous for a program and grouping symbols are implicitly defined by the structure of the tree.
A parse tree is a representation of the code closer to the concrete syntax. It shows many details of the implementation of the parser. For instance, usually, a rule corresponds to the type of a node. A parse tree is usually transformed in an AST by the user, possibly with some help from the parser generator.
A graphical representation of an AST looks like this.
Sometimes, you may want to start producing a parse tree and then derive from it an AST. This can make sense because the parse tree is easier to produce for the parser (it is a direct representation of the parsing process) but the AST is simpler and easier to process by following the steps that you typically want to perform on the tree: code validation, interpretation, compilation, etc.
In Part 2, we'll discuss more useful things to know about parsers!
Published at DZone with permission of Gabriele Tomassetti , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.