A Guide to Parsing: Algorithms and Technology (Part 4)
A Guide to Parsing: Algorithms and Technology (Part 4)
Grammars are a set of rules used to describe a language, so it comes naturally to study the formats of the rules. That's what we're doing in Part 4 of this 9-part series.
Join the DZone community and get the full member experience.Join For Free
Start coding something amazing with the IBM library of open source AI code patterns. Content provided by IBM.
Be sure to check out Part 3 first! If you're encountering this series for the first time, you can find the rest of the posts at the bottom of this article.
Grammars are a set of rules used to describe a language, so it comes naturally to study the formats of the rules. However, there are also several elements of a typical grammar that could use further attention. Some of them are due to the fact that a grammar can also be used to define other duties or to execute some code.
Typical Grammar Issues
First, we are going to talk about some special rules or issues you might encounter in parsing.
The Missing Tokens
If you read grammars, you will probably encounter many in which only a few tokens are defined, and not all of them. Like in this grammar:
NAME : [a-zA-Z]+ greeting : "Hello" NAME
There is no definition for the token
"Hello", but since you know that a parser deals with tokens, you may ask yourself how this is possible. The answer is that some tools generate for you the corresponding token for a string literal to save you some time.
Be aware that this might be possible only under certain conditions. For instance, with ANTLR, you must define all tokens yourself if you define separate lexer and parser grammars.
In the context of parsers, an important feature is support for left-recursive rules. This means that a rule starts with a reference to itself. Sometimes, this reference could also be indirect; that is to say, it could appear in another rule referenced by the first one.
Consider, for example arithmetic, operations. An addition could be described as two expression(s) separated by the plus (
+) symbol, but the operands of the additions could be other additions.
addition : expression '+' expression multiplication : expression '*' expression // an expression could be an addition or a multiplication or a number expression : multiplication | addition | [0-9]+
In this example, expression contains an indirect reference to itself via the rules addition and multiplication.
This description also matches multiple additions like
5 + 4 + 3. That is because it can be interpreted as
expression (5) ('+') expression(4+3) (the rule addition: the first expression corresponds to the option [0-9]+, the second one is another addition). And then
4 + 3 itself can be divided into its two components:
expression(4) ('+') expression(3) (the rule addition: both the first and second expression corresponds to the option [0-9]+) .
The problem is that left-recursive rules may not be used with some parser generators. The alternative is a long chain of expressions that also takes care of the precedence of operators. A typical grammar for a parser that does not support such rules would look similar to this one:
expression : addition addition : multiplication ('+' multiplication)* multiplication : atom ('*' atom)* atom : [0-9]+
As you can see, the expressions are defined in the inverse order of precedence. So, the parser would put the expression with the lower precedence at the lowest level of the three; thus, they would be executed first.
Some parser generators support direct left-recursive rules, but not indirect ones. Notice that usually, the issue is with the parsing algorithm that does not support left-recursive rules. So, the parser generator may transform rules written as left-recursive in the proper way to make it work with its algorithm. In this sense, left-recursive support may be (very useful) syntactic sugar.
How Left-Recursive Rules Are Transformed
The specific way in which the rules are transformed varies from one parser generator to the other; however, the logic remains the same. The expressions are divided into two groups: the ones with an operator and two operands and the atomic ones. In our example, the only atomic expression is a number (
[0-9]+), but it could also be an expression between parentheses (
(5 + 4)). That is because in mathematics, parentheses are used to increase the precedence of an expression.
Once you have these two groups, you maintain the order of the members of the second group and reverse the order of the members of the first group. The reason is that humans reason on a first come, first serve basis: it is easier to write the expressions in their order of precedence.
However, the final form of the parsing is a tree, which operates on a different principle: you start working on the leaves and rise up so that at the end of this process, the root node contains the final result — which means that in the parsing tree the atomic expressions are at the bottom, while the ones with operators appear in the inverse order in which are applied.
Predicates, sometimes called syntactic or semantic predicates, are special rules that are matched only if a certain condition is met. The condition is defined with code in a programming language supported by the tool for which the grammar is written.
Their advantage is that they allow some form of context-sensitive parsing, which is sometimes unavoidable to match certain elements. For instance, they can be used to determine if a sequence of characters that defines a soft keyword is used in a position where it would be a keyword (i.e. the previous token can be followed by the keyword) or it is a simple identifier.
The disadvantages are that they slow down parsing and they make the grammar dependent on said programming language. That is because the condition is expressed in a programming language and must be checked.
Embedded actions identify code that is executed every time the rule is matched. They have the clear disadvantage that make the grammar harder to read since the rules are surrounded by code. Furthermore, just like predicates, they break the separation between a grammar that describes the language and the code that manipulates the results of the parsing.
Actions are frequently used by less sophisticated parsing generators as the only way to easily execute some code when a node is matched. Using these parser generations, the only alternative would be to traverse the tree and execute the proper code yourself. More advanced tools instead allow to use the visitor pattern to execute arbitrary code when needed, and also to govern the traversing of the tree.
They can also be used to add certain tokens or change the generated tree. While ugly, this might be the only practical way to deal with complicated languages like C or specific issues like whitespace in Python.
You can check out the rest of the series below:
Published at DZone with permission of Gabriele Tomassetti , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.