A Guide to Parsing: Algorithms and Technology (Part 7)
A Guide to Parsing: Algorithms and Technology (Part 7)
Get a summary of the main information needed to understand and implement a specific parser algorithm — specifically, top-down algorithms.
Join the DZone community and get the full member experience.Join For Free
Be sure to check out Part 6 first! If you're encountering this series for the first time, you can find the rest of the posts at the bottom of this article.
Let's continue our overview of parsing algorithms.
Tables of Parsing Algorithms
The table lists:
- A formal description, to explain the theory behind the algorithm
- A more practical explanation
- One or two implementations, usually one easier and the other a professional parser. Sometimes, though, there is no easier version or a professional one.
To understand how a parsing algorithm works, you can also look at the syntax analytic toolkit. It is an educational parser generator that describes the steps that a generated parser takes to accomplish its objective. It implements an LL and an LR algorithm.
The second table shows a summary of the main features of the different parsing algorithms and for what they are generally used.
The top-down strategy is the most widespread of the two strategies and there are several successful algorithms applying it.
LL (Left-to-right read of the input, Leftmost derivation) parsers are table-based parsers without backtracking, but with lookahead. Table-based means that they rely on a parsing table to decide which rule to apply. The parsing table use as rows and columns nonterminals and terminals, respectively.
To find the correct rule to apply:
- Firstly, the parser looks at the current token and the appropriate amount of lookahead tokens.
- Then, it tries to apply the different rules until it finds the correct match.
The concept of the LL parser does not refer to a specific algorithm, but more to a class of parsers. They are defined in relation to grammars. That is to say, an LL parser is one that can parse a LL grammar. In turn, LL grammars are defined in relation to the number of lookahead tokens that are needed to parse them. This number is indicated between parentheses next to LL, so in the form LL(k).
An LL(k) parser uses k tokens of lookahead and thus it can parse, at most, a grammar that needs k tokens of lookahead to be parsed. Effectively, the concept of the LL(k) grammar is more widely employed than the corresponding parser — which means that LL(k) grammars are used as a meter when comparing different algorithms. For instance, you would read that PEG parsers can handle LL(*) grammars.
The Value of LL Grammars
This use of LL grammars is due both to the fact that LL parser are widely used and that they are a bit restrictive. In fact, LL grammars does not support left-recursive rules. You can transform any left-recursive grammar in an equivalent non-recursive form, but this limitation matters for a couple of reasons: productivity and power.
The loss of productivity depends on the requirement that you have to write the grammar in a special way, which takes time. The power is limited because a grammar that might need one token of lookahead when written with a left-recursive rule might need two or three tokens of lookahead when written in a non-recursive way. So, this limitation is not merely annoying, it is limiting the power of the algorithm, i.e. the grammars it can be used for.
The loss of productivity can be mitigated by an algorithm that automatically transforms a left-recursive grammar in a non-recursive one. ANTLR is a tool that can do that, but, of course, if you are building your own parser, you have to do that yourself.
There are two special kind of LL(k) grammars: LL(1) and LL(*). In the past the first kinds were the only one considered practical, because it is easy to build efficient parsers for them. Up to the point that many computer languages were purposefully designed to be described by a LL(1) grammar. An LL(*), also known as LL-regular, parser can deal with languages using an infinite amount of lookahead tokens.
The Earley parser is a chart parser named after its inventor Jay Earley. The algorithm is usually compared to CYK, another chart parser, that is simpler but also usually worse in performance and memory. The distinguishing feature of the Earley algorithm is that, in addition to storing partial results, it implement a prediction step to decide which rule is going to try to match next.
The Earley parser fundamentally works by dividing a rule in segments, like in the following example.
// an example grammar HELLO : "hello" NAME : [a-zA-Z]+ greeting : HELLO NAME // Earley parser would break up greeting like this // . HELLO NAME // HELLO . NAME // HELLO NAME .
Then, working on this segment that can be connected at the dot (
.), tries to reach a completed state; that is to say. one with the dot at the end.
The appeal of an Earley parser is that it is guaranteed to be able to parse all context-free languages, while other famous algorithms (i.e. LL, LR) can parse only a subset of them. For instance, it has no problem with left-recursive grammars. More generally, an Earley parser can also deal with nondeterministic and ambiguous grammars.
It can do that at the risk of worse performance (O(n3)), in the worst case. However, it has a linear time performance for normal grammars. The catch is that the set of languages parsed by more traditional algorithms are the one we are usually interested in.
There is also a side effect of the lack of limitations: by forcing a developer to write the grammar in certain way the parsing can be more efficient, i.e., building an LL(1) grammar might be harder for the developer, but the parser can apply it very efficiently. With Earley, you do less work, so the parser does more of it.
In short, Earley allows you to use grammars that are easier to write, but that might be suboptimal in terms of performance.
Earley Use Cases
So... Earley parsers are easy to use, but the advantage, in terms of performance, in the average case, might be non-existent. This makes the algorithm great for an educational environment or wherever productivity is more relevant than speed.
The first case is useful, for example, because most of the time, the grammars your users write work just fine. The problem is that the parser will throw at them obscure and seemingly random errors. Of course, the errors are not actually random, but they are due to the limitations of an algorithm that your users do not know or understand. So, you are forcing the user to understand the inner workings of your parser to use it, which should be unnecessary.
An example of when productivity is more important than speed might be a parser generator to implement syntax highlighting for an editor that needs to support many languages. In a similar situation, being able to quickly support new languages might be more desirable than completing the task as soon as possible.
You can check out the rest of the series below:
Opinions expressed by DZone contributors are their own.