A Guide to Parsing: Algorithms and Technology (Part 8)
A Guide to Parsing: Algorithms and Technology (Part 8)
In this post, we continue to dive into top-down parsing algorithms, including Packrat (PEG), recursive descent parser, Pratt parser, and parser combinator.
Join the DZone community and get the full member experience.Join For Free
Start coding something amazing with the IBM library of open source AI code patterns. Content provided by IBM.
Be sure to check out Part 7 first! If you're encountering this series for the first time, you can find the rest of the posts at the bottom of this article.
Packrat is often associated with the formal grammar PEG since they were invented by the same person: Bryan Ford. Packrat was described first in his thesis: Packrat Parsing: A Practical Linear-Time Algorithm With Backtracking. The title says almost everything that we care about: it has a linear time of execution and does not use backtracking.
The other reason for its efficiency it is memoization: the storing of partial results during the parsing process. The drawback, and the reason why the technique was not used until recently, is the quantity of memory it needs to store all the intermediate results. If the memory required exceeds what is available, the algorithm loses its linear time of execution.
Packrat also does not support left-recursive rules, a consequence of the fact that PEG requires to always choose the first option. Actually, some variants can support direct left-recursive rules but at the cost of losing linear complexity.
Packrat parsers can perform with an infinite amount of lookahead, if necessary. This influences the execution time, that in the worst case can be exponential.
Recursive Descent Parser
A recursive descent parser is a parser that works with a set of (mutually) recursive procedures, usually one for each rule of the grammars. Thus, the structure of the parser mirrors the structure of the grammar.
The termpredictive parser is used in a few different ways: some people mean it as a synonym for a top-down parser, some as a recursive descent parser that never backtracks.
The opposite of this second meaning is a recursive descent parser, which does backtracks. That is to say, one that finds the rule that matches the input by trying each one of the rules in sequence and then goes back each time it fails.
Typically, recursive descent parsers have problems parsing left-recursive rules because the algorithm would end up calling the same function again and again. A possible solution to this problem is using tail recursion. Parsers that use this method are called tail recursive parsers.
Tail recursion per se is simply recursion that happens at the end of the function. However, tail recursion is employed in conjunction with transformations of grammar rules. The combination of transforming the grammar rules and putting recursion at the end of the process allows dealing with left-recursive rules.
A Pratt parser is a widely unused, but much appreciated (by the few who know it), parsing algorithm defined by Vaughan Pratt in a paper called Top Down Operator Precedence. The paper itself starts with a polemic on BNF grammars, which the author argues wrongly are the exclusive concerns of parsing studies. This is one of the reasons for the lack of success. In fact, the algorithm does not rely on a grammar but works directly on tokens, which makes it unusual to parsing experts.
The second reason is that traditional top-down parsers work great if you have a meaningful prefix that helps distinguish between different rules. For example, if you get the token
FOR, you are looking at a
for statement. Since this essentially applies to all programming languages and their statements, it is easy to understand why the Pratt parser did not change the parsing world.
Where the Pratt algorithm shines is with expressions. In fact, the concept of precedence makes it impossible to understand the structure of the input simply by looking at the order in which the tokens are presented.
Basically, the algorithm requires you to assign a precedence value to each operator token and a couple of functions that determines what to do according to what is on the left and right of the token. Then, it uses these values and functions to bind the operations together while it traverses the input.
While the Pratt algorithm has not been overtly successful, it is used for parsing expressions. It was also adopted by Douglas Crockford (of JSON fame) for JSLint.
A parser combinator is a higher-order function that accepts parser functions as input and returns a new parser function as output. A parser function usually means a function that accepts a string and output a parse tree.
A parser combinator is modular and easy to build, but they are also slower (they have O(n4) complexity in the worst case) and less sophisticated. They are typically adopted for easier parsing tasks or for prototyping. In a sense, the user of a parser combinator builds the parser partially by hand but relies on the hard work done by whoever created the parser combinator.
Generally, they do not support left recursive rules, but there are more advanced implementations that do just that. See, for example, the paper Parser Combinators for Ambiguous Left-Recursive Grammars, which also manages to describe an algorithm that has a polynomial time of execution.
Many contemporary implementations are called monadic parser combinators since they rely on a structure of functional programming called monad. Monads are a fairly complex concept that we cannot hope to explain here. However, basically, a monad is able to combine functions and actions relying on a data type. The crucial feature is that the data type specifies how its different values can be combined.
The most basic example is the Maybe monad. This is a wrapper around a normal type, like integer, that returns the value itself when the value is valid (i.e. 567), but a special value, Nothing, when it is not (i.e. undefined or divided by zero). Thus, you can avoid using a null value and unceremoniously crashing the program. Instead, the Nothing value is managed normally, like it would manage any other value.
You can check out the rest of the series below:
Published at DZone with permission of Gabriele Tomassetti , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.