Parsing in Python: Tools and Libraries (Part 2)
Parsing in Python: Tools and Libraries (Part 2)
In Part 2 of this 8-part series, we'll look at useful things to know about parsers, like left-recursive rules and types of grammars.
Join the DZone community and get the full member experience.Join For Free
The Architect’s Guide to Big Data Application Performance. Get the Guide.
Useful Things to Know About Parsers
A grammar is a formal description of a language that can be used to recognize its structure.
In simple terms, a grammar is a list of rules that define how each construct can be composed. For example, a rule for an "if" statement could specify that it must start with the "if" keyword, followed by a left parenthesis, an expression, a right parenthesis, and a statement.
A rule could reference other rules or token types. In the example of the "if" statement, the keyword "if" and the left and the right parenthesis were token types, while the expression and the statement were references to other rules.
The most-used format to describe grammars is the Backus-Naur form (BNF), which also has many variants, including the extended Backus-Naur form. The extended variant has the advantage of including a simple way to denote repetitions. A typical rule in a Backus-Naur grammar looks like this:
<symbol> ::= __expression__
<simbol> is usually nonterminal, which means that it can be replaced by the group of elements on the right,
__expression__. The element
__expression__ could contain other nonterminal symbols, or terminal ones. Terminal symbols are simply the ones that do not appear as a
<symbol> anywhere in the grammar. A typical example of a terminal symbol is a string of characters, like
In the context of parsers, an important feature is support for left-recursive rules. This means that a rule could start with a reference to itself. This reference could be also indirect.
Consider, for example, arithmetic operations. An addition could be described as two expression(s) separated by the plus (+) symbol, but an expression could also contain other additions.
addition ::= expression '+' expression multiplication ::= expression '*' expression // an expression could be an addition or a multiplication or a number expression ::= addition | multiplication |// a number
This description also match multiple additions like 5 + 4 + 3. That is because it can be interpreted as expression (5) (‘+’) expression (4+3) and then 4 + 3 itself can be divided into its two components.
The problem is that this kind of rule may not be used with some parser generators. The alternative is a long chain of expressions that also take care of the precedence of operators.
Some parser generators support direct left-recursive rules, but not indirect ones.
Types of Languages and Grammars
We care mostly about two types of languages that can be parsed with a parser generator: regular languages and context-free languages. We could give you the formal definition according to the Chomsky hierarchy of languages, but it would not be that useful. Let’s look at some practical aspects instead.
A regular language can be defined by a series of regular expressions, while a context-free one needs something more. A simple rule of thumb is that if a grammar of a language has recursive elements, it is not a regular language. For instance, as we said elsewhere, HTML is not a regular language. In fact, most programming languages are context-free languages.
Usually, a language and grammar correspond to each other. That is to say, there are regular grammars and context-free grammars that correspond respectively to regular and context-free languages. But to complicate matters, there is a relatively new (created in 2004) kind of grammar called the parsing expression grammar (PEG). These grammars are as powerful as context-free grammars, but according to their authors, they describe programming languages more naturally.
The Differences Between PEG and CFG
The main difference between PEG and CFG is that the ordering of choices is meaningful in PEG, but not in CFG. If there are many possible valid ways to parse an input, a CFG will be ambiguous and thus wrong. Instead, with PEG, the first applicable choice will be chosen, which automatically solves some ambiguities.
Another difference is that PEG uses scannerless parsers; they do not need a separate lexer or lexical analysis phase.
Traditionally, both PEG and some CFG have been unable to deal with left-recursive rules, but some tools have found workarounds for this — either by modifying the basic parsing algorithm or by having the tool automatically rewrite a left-recursive rule in a nonrecursive way. Both of these workarounds have their own downsides: either making the generated parser less intelligible or worsening its performance. However, in practical terms, the advantages of easier and quicker development outweigh the drawbacks.
Stay tuned for Part 3, where we'll talk about parser generators!
Published at DZone with permission of Gabriele Tomassetti , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.