Be sure to check out Part 5 first! If you're encountering this series for the first time, you can find the rest of the posts at the bottom of this article.
In theory, parsing is a solved problem, but it is the kind of problem that keeps being solved again and again. That is to say, there are many different algorithms, each one with strong and weak points, and they are still being improved by academics.
In this section, we are not going to explain how to implement every one of the parsing algorithms; instead, we are going to explain their features. The goal is that you can choose with more awareness which parsing tool to use, or which algorithm to study further and implement for your custom parser.
Let’s start with a global overview of the features and strategies of all parsers.
There are two strategies for parsing: top-down parsing and bottom-up parsing. Both terms are defined in relation to the parse tree generated by the parser. Explained in a simple way:
- A top-down parser tries to identify the root of the parse tree first, then moves down the subtrees until it finds the leaves of the tree.
- A bottom-up parser instead starts from the lowest part of the tree, the leaves, and rises up until it determines the root of the tree.
Let’s see an example, starting with a parse tree.
Example parse tree (from Wikipedia)
The same tree would be generated in a different order by a top-down and a bottom-up parser. In the following images, the number indicates the order in which the nodes are created.
Top-down order of generation of the tree (from Wikipedia)
Bottom-up order of generation of the tree (from Wikipedia)
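The two creation orders in the images above can also be sketched in code: a top-down parser creates nodes in the order of a preorder visit (root first), while a bottom-up parser creates them in postorder (leaves first). The following minimal sketch is not a parser; the tree and its labels are made up purely to illustrate the numbering.

```python
# Minimal sketch (not a real parser): for the same tree, a top-down
# parser creates nodes in preorder (root first), while a bottom-up
# parser creates them in postorder (leaves first).

class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def top_down_order(node, order=None):
    """Record the root first, then descend into the subtrees."""
    if order is None:
        order = []
    order.append(node.label)
    for child in node.children:
        top_down_order(child, order)
    return order

def bottom_up_order(node, order=None):
    """Record the leaves first, then rise up to the root."""
    if order is None:
        order = []
    for child in node.children:
        bottom_up_order(child, order)
    order.append(node.label)
    return order

# A made-up parse tree for "a + b": S -> expr -> a, +, b
tree = Node("S", [Node("expr", [Node("a"), Node("+"), Node("b")])])

print(top_down_order(tree))   # ['S', 'expr', 'a', '+', 'b']
print(bottom_up_order(tree))  # ['a', '+', 'b', 'expr', 'S']
```

The two lists contain the same nodes; only the order in which the parser would have created them differs.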
Traditionally, top-down parsers were easier to build, but bottom-up parsers were more powerful. Now, the situation is more balanced, mostly because of advancements in top-down parsing strategies.
The concept of derivation is closely associated with the strategies. Derivation indicates the order in which the nonterminal symbols appearing on the right side of a rule are expanded to obtain the nonterminal symbol on the left side. Using the BNF terminology, it indicates how the elements that appear in `<expression>` are used to obtain `<symbol>`. The two possibilities are leftmost derivation and rightmost derivation. In the first, the leftmost nonterminal is always expanded first; in the second, the rightmost one is.
A simple example: imagine that you are trying to parse the symbol `result`, which is defined as follows in the grammar:

```
expr_one = .. // stuff
expr_two = .. // stuff
result = expr_one 'operator' expr_two
```

You can apply the rule first for the symbol `expr_one` and then for `expr_two`, or vice versa. In the case of leftmost derivation, you pick the first option, while for rightmost derivation, you pick the second one.
It is important to understand that the derivation is applied depth-first, or recursively. That is to say, it is applied to the starting expression, and then applied again to each intermediate result that is obtained. So, in this example, if after applying the rule corresponding to `expr_one` there is a new nonterminal, that one is transformed first. The nonterminal `expr_two` is expanded only when it becomes the first nonterminal, not following the order of the original rule.
Derivation is associated with the two strategies because for bottom-up parsing, you would apply rightmost derivation (in reverse, since the tree is built from the leaves up), while for top-down parsing, you would choose leftmost derivation. Note that this has no effect on the final parse tree; it just affects the intermediate steps and the algorithm used.
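The two derivation orders can be shown as string rewriting. The following sketch uses a hypothetical toy grammar (the rule names echo the `result` example above, with `number` and `operator` standing in for terminals): at every step, either the leftmost or the rightmost remaining nonterminal is expanded.

```python
# Hypothetical toy grammar; the nonterminals are the keys of GRAMMAR,
# everything else is treated as a terminal.
GRAMMAR = {
    "result": ["expr_one", "operator", "expr_two"],
    "expr_one": ["number"],
    "expr_two": ["number"],
}

def derive(start, leftmost=True):
    """Rewrite one nonterminal per step until only terminals remain."""
    form = [start]
    steps = [list(form)]
    while True:
        # Positions of the nonterminals still present in the sentential form.
        nts = [i for i, sym in enumerate(form) if sym in GRAMMAR]
        if not nts:
            return steps
        i = nts[0] if leftmost else nts[-1]   # leftmost vs rightmost choice
        form = form[:i] + GRAMMAR[form[i]] + form[i + 1:]
        steps.append(list(form))

for step in derive("result", leftmost=True):
    print(step)
# ['result']
# ['expr_one', 'operator', 'expr_two']
# ['number', 'operator', 'expr_two']
# ['number', 'operator', 'number']
```

With `leftmost=False`, the third step becomes `['expr_one', 'operator', 'number']` instead: the final form is the same, only the intermediate steps differ.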
Parsers built with top-down and bottom-up strategies share a few elements that we should talk about.
Lookahead and Backtracking
The terms lookahead and backtracking do not have a different meaning in parsing than the one they have in the rest of computer science. Lookahead indicates the number of elements, following the current one, that are taken into consideration to decide which action to take next.
A simple example: A parser might check the next token to decide which rule to apply now. When the proper rule is matched, the current token is consumed, but the next one remains in the queue.
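This check-before-consume pattern can be sketched as follows. The token strings and rule names here are made up for illustration; real parsers work on richer token objects.

```python
# A sketch of a one-token lookahead decision, with made-up tokens
# and hypothetical rule names.

def parse_statement(tokens, pos):
    """Peek at the next token to choose a rule, without consuming it."""
    lookahead = tokens[pos]              # inspected, not yet consumed
    if lookahead == "if":
        return parse_conditional(tokens, pos)
    return parse_assignment(tokens, pos)

def parse_conditional(tokens, pos):
    # The rule is chosen, so now the 'if' token is actually consumed.
    return ("conditional", pos + 1)

def parse_assignment(tokens, pos):
    # Consume the identifier that starts the assignment.
    return ("assignment", pos + 1)

print(parse_statement(["if", "x", "then"], 0))   # ('conditional', 1)
print(parse_statement(["x", "=", "1"], 0))       # ('assignment', 1)
```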
Backtracking is a general algorithmic technique. It consists of finding a solution to a complex problem by trying partial solutions and pursuing the most promising one. If the one that is currently being tested fails, the parser backtracks (i.e. it goes back to the last position that was successfully parsed) and tries another one.
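A minimal backtracking sketch, under the assumption of a made-up grammar format where a rule is a list of alternatives and each alternative is a sequence of symbols. On a failed alternative, the parser rewinds to the position where it started and tries the next one.

```python
# Hypothetical grammar: 'expr' is either "num + num" or a single "num".
GRAMMAR = {
    "expr": [["num", "+", "num"], ["num"]],
}

def parse(symbol, tokens, pos):
    """Return the position after a successful match, or None on failure."""
    if symbol not in GRAMMAR:            # terminal: must match the token
        if pos < len(tokens) and tokens[pos] == symbol:
            return pos + 1
        return None
    for alternative in GRAMMAR[symbol]:  # try the alternatives in order
        cur = pos                        # remember where this attempt began
        for sym in alternative:
            cur = parse(sym, tokens, cur)
            if cur is None:
                break                    # backtrack: try the next alternative
        else:
            return cur
    return None

print(parse("expr", ["num", "+", "num"], 0))  # 3
print(parse("expr", ["num"], 0))              # 1 (after one failed attempt)
```

In the second call, the first alternative fails at the missing `+`, so the parser rewinds to position 0 and succeeds with the second alternative.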
They are especially relevant to the LL, LR, and LALR parsing algorithms because parsers for languages that need just one lookahead token are easier to build and quicker to run. The number of lookahead tokens used by such algorithms is indicated between parentheses after the name of the algorithm (i.e. LL(1), LR(k)). The notation (*) indicates that the algorithm can use an unbounded number of lookahead tokens, although this might affect its performance.
Chart parsers are a family of parsers that can be bottom-up (e.g. CYK) or top-down (e.g. Earley). Chart parsers essentially try to avoid backtracking, which can be expensive, by using dynamic programming. Dynamic programming, or dynamic optimization, is a general method to break down a larger problem into smaller subproblems.
A common dynamic programming algorithm used by chart parsers is the Viterbi algorithm. The goal of the algorithm is to find the most likely hidden states given the sequence of known events. Basically, given the tokens that we know, we try to find the most probable rules that have produced them.
The name chart parser comes from the fact that the partial results are stored in a structure called a chart (usually, the chart is a table). The particular technique of storing partial results is called memoization. Memoization is also used by other algorithms unrelated to chart parsers, like packrat parsing.
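To make the chart concrete, here is a compact sketch of CYK, the bottom-up chart parser mentioned above, for a hypothetical toy grammar in Chomsky normal form (the nonterminal names `EXPR`, `REST`, `NUM`, and `PLUS` are made up). The chart stores, for every span of the input, the set of nonterminals that can produce it, so no span is ever analyzed twice.

```python
# Hypothetical CNF grammar for strings like "a + a".
UNIT = {"a": {"NUM"}, "+": {"PLUS"}}     # terminal -> nonterminals
BINARY = {                               # (left, right) -> nonterminals
    ("NUM", "REST"): {"EXPR"},
    ("PLUS", "NUM"): {"REST"},
}

def cyk(tokens, start="EXPR"):
    n = len(tokens)
    # chart[i][l] = nonterminals that derive the span tokens[i:i+l]
    chart = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, tok in enumerate(tokens):     # spans of length 1: the tokens
        chart[i][1] = set(UNIT.get(tok, ()))
    for length in range(2, n + 1):       # longer spans, bottom-up
        for i in range(n - length + 1):
            for split in range(1, length):
                for left in chart[i][split]:
                    for right in chart[i + split][length - split]:
                        chart[i][length] |= BINARY.get((left, right), set())
    return start in chart[0][n] if n else False

print(cyk(["a", "+", "a"]))  # True
print(cyk(["a", "+"]))       # False
```

The triple loop fills the table in a fixed order instead of exploring alternatives and backtracking, which is the dynamic-programming idea in a nutshell.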
Before discussing specific parsing algorithms, we would like to talk about the role of automatons in parsing. Automatons are a family of abstract machines, among which is the well-known Turing machine.
When it comes to parsers, you might hear the term (deterministic) pushdown automaton (PDA), and when you read about lexers, you would hear the term deterministic finite automaton (DFA). A PDA is more powerful and complex than a DFA (although still simpler than a Turing machine).
Since they define abstract machines, usually they are not directly related to an actual algorithm. Rather, they describe in a formal way the level of complexity that an algorithm must be able to deal with. If somebody says that to solve problem X you need a DFA, they mean that you need an algorithm as powerful as a DFA.
However, since DFAs are state machines, in the case of a lexer, the distinction is frequently moot. That is because state machines are relatively straightforward to implement (i.e. there are ready-to-use libraries), so most of the time, a DFA is implemented with a state machine. That is why we are going to briefly talk about DFAs and why they are frequently used for lexers.
Lexing With a Deterministic Finite Automaton
A DFA is a finite-state machine, a concept with which we assume you are familiar. Basically, a state machine has a finite number of possible states and a transition function that governs how the machine moves from one state to another in response to an event. When used for lexing, the machine is fed the input characters one at a time until it reaches an accepting state (i.e. it can build a token).
They are used for a few reasons:
- It has been proven that they recognize exactly the set of regular languages; that is to say, they are as powerful as regular grammars.
- There are a few mathematical methods to manipulate and check their properties (e.g. whether they accept all strings, or no strings at all).
- They can work with an efficient online algorithm (see below).
An online algorithm is one that does not need the whole input to work. In the case of a lexer, it means that it can recognize a token as soon as its characters have been seen by the algorithm. The alternative would be an algorithm that needs the whole input before it can identify each token.
In addition to these properties, it is fairly easy to transform a set of regular expressions into a DFA, which makes it possible to specify the rules in a simple way that is familiar to many developers. Then, you can automatically convert them into a state machine that processes them efficiently.
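To illustrate how such a state machine lexes online, here is a minimal DFA-based lexer sketch. The transition table is hand-written for two hypothetical token types (integers and identifiers); a lexer generator would derive such a table automatically from regular expressions.

```python
# Minimal DFA lexer sketch with a hand-written transition table.

def char_class(c):
    """Map a character to the symbol class the DFA transitions on."""
    if c.isdigit():
        return "digit"
    if c.isalpha():
        return "alpha"
    return "other"

# state -> {character class -> next state}
TRANSITIONS = {
    "start": {"digit": "in_int", "alpha": "in_ident"},
    "in_int": {"digit": "in_int"},
    "in_ident": {"alpha": "in_ident", "digit": "in_ident"},
}
ACCEPTING = {"in_int": "INT", "in_ident": "IDENT"}

def lex(text):
    """Feed characters one at a time, emitting the longest match each time."""
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():            # skip whitespace between tokens
            i += 1
            continue
        state, j, last = "start", i, None
        while j < len(text):
            state = TRANSITIONS.get(state, {}).get(char_class(text[j]))
            if state is None:            # dead end: stop extending this token
                break
            j += 1
            if state in ACCEPTING:       # remember the longest accepted match
                last = (j, ACCEPTING[state])
        if last is None:
            raise ValueError(f"unexpected character {text[i]!r}")
        end, kind = last
        tokens.append((kind, text[i:end]))
        i = end                          # resume right after the token
    return tokens

print(lex("count 42"))  # [('IDENT', 'count'), ('INT', '42')]
```

Note that each token is emitted as soon as the machine hits a dead end, without ever looking at the rest of the input: that is exactly the online property described above.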
You can check out the rest of the series below: