A Guide to Parsing: Algorithms and Technology (Part 7)


Get a summary of the main information needed to understand and implement a specific parser algorithm — specifically, top-down algorithms.


Be sure to check out Part 6 first! If you're encountering this series for the first time, you can find the rest of the posts at the bottom of this article.

Let's continue our overview of parsing algorithms.

Parsing Algorithms

Overview

Tables of Parsing Algorithms

The table below summarizes the main information needed to understand and implement a specific parser algorithm. You can find more implementations by reading our articles that present parsing tools and libraries for Java, C#, Python, and JavaScript.

The table lists:

  • A formal description, to explain the theory behind the algorithm
  • A more practical explanation
  • One or two implementations, usually a simpler one and a professional parser. Sometimes, though, there is no simpler version or no professional one.
| Algorithm | Formal Description | Explanation | Example Implementations |
| --- | --- | --- | --- |
| CYK | An Efficient Recognition and Syntax-Analysis Algorithm for Context-Free Languages (PDF) | CKY and Earley Algorithms (PDF) | CYK Algorithm in Golang / Chart-parsers |
| Earley | An Efficient Context-Free Parsing Algorithm (PDF) | Earley Parsing Explained | Nearley |
| LL | LL(*): The Foundation of the ANTLR Parser Generator (PDF) | LL Parser on Wikipedia | LL-parser.js ll(1) parser / ANTLR |
| LR | On the Translation of Languages from Left to Right (PDF) | Compiler Course | Jison / Bison |
| Packrat (PEG) | Packrat Parsing: A Practical Linear-Time Algorithm with Backtracking (PDF) | Packrat Parsing: Simple, Powerful, Lazy, Linear Time / Packrat Parsing in Scala (PDF) | Companion Code of the Thesis / Canopy |
| Parser Combinator | Constructing Natural Language Interpreters in a Lazy Functional Language (PDF) | Parser Combinators Explained | Understanding Parser Combinators / Sprache |
| Pratt | Top Down Operator Precedence | Simple Top-Down Parsing in Python | A Pratt Parser in Python / JSLint |

To understand how a parsing algorithm works, you can also look at the Syntax Analytic Toolkit. It is an educational parser generator that describes the steps that a generated parser takes to accomplish its objective. It implements an LL and an LR algorithm.

The second table summarizes the main features of the different parsing algorithms and what they are generally used for.

[Table: parsing algorithm features and uses]

Top-Down Algorithms

The top-down strategy is the more widespread of the two, and there are several successful algorithms that apply it.

LL Parser

LL (Left-to-right read of the input, Leftmost derivation) parsers are table-based parsers without backtracking, but with lookahead. Table-based means that they rely on a parsing table to decide which rule to apply. The parsing table uses nonterminals as rows and terminals as columns.

To find the correct rule to apply:

  1. First, the parser looks at the current token and the appropriate number of lookahead tokens.
  2. Then, it tries to apply the different rules until it finds the correct match.

The concept of the LL parser does not refer to a specific algorithm, but rather to a class of parsers. They are defined in relation to grammars; that is to say, an LL parser is one that can parse an LL grammar. In turn, LL grammars are defined in relation to the number of lookahead tokens needed to parse them. This number is indicated between parentheses next to LL, in the form LL(k).

An LL(k) parser uses k tokens of lookahead and thus can parse, at most, a grammar that needs k tokens of lookahead to be parsed. Effectively, the concept of the LL(k) grammar is more widely employed than the corresponding parser, which means that LL(k) grammars are used as a yardstick when comparing different algorithms. For instance, you would read that PEG parsers can handle LL(*) grammars.
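To make the table-driven mechanism concrete, here is a minimal LL(1) sketch in Python. The toy grammar, the table, and all names are our own illustration rather than the API of any particular library; the parser keeps a stack of expected symbols and uses a single token of lookahead to pick a production from the table.

# A minimal sketch of a table-driven LL(1) parser for the toy grammar:
#   S -> 'a' S | 'b'
# Rows of the table are nonterminals, columns are lookahead terminals.

PARSING_TABLE = {
    ("S", "a"): ["a", "S"],  # on lookahead 'a', expand S to: a S
    ("S", "b"): ["b"],       # on lookahead 'b', expand S to: b
}

NONTERMINALS = {"S"}

def ll1_parse(tokens):
    """Return True if tokens match the grammar; raise SyntaxError otherwise."""
    stack = ["S"]  # start with the start symbol on the stack
    pos = 0
    while stack:
        top = stack.pop()
        lookahead = tokens[pos] if pos < len(tokens) else None
        if top in NONTERMINALS:
            production = PARSING_TABLE.get((top, lookahead))
            if production is None:
                raise SyntaxError(f"no rule for {top} with lookahead {lookahead!r}")
            # push the production right to left, so it is consumed left to right
            stack.extend(reversed(production))
        else:
            # terminal on the stack: it must match the current token
            if top != lookahead:
                raise SyntaxError(f"expected {top!r}, found {lookahead!r}")
            pos += 1
    return pos == len(tokens)

print(ll1_parse(["a", "a", "b"]))  # True

Note how one token of lookahead is enough to choose between the two productions for S; a grammar where that choice needs more context requires a larger k or a different algorithm.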

The Value of LL Grammars

This use of LL grammars is due both to the fact that LL parsers are widely used and to the fact that they are a bit restrictive. In fact, LL grammars do not support left-recursive rules. You can transform any left-recursive grammar into an equivalent non-recursive form, but this limitation matters for a couple of reasons: productivity and power.

The loss of productivity comes from the requirement to write the grammar in a special way, which takes time. The loss of power comes from the fact that a grammar that might need one token of lookahead when written with a left-recursive rule might need two or three tokens of lookahead when written in a non-recursive way. So, this limitation is not merely annoying; it limits the power of the algorithm, i.e., the grammars it can be used for.

The loss of productivity can be mitigated by an algorithm that automatically transforms a left-recursive grammar into a non-recursive one. ANTLR is a tool that can do that but, of course, if you are building your own parser, you have to do it yourself.
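To see why the transformation is needed in the first place, consider a hand-written recursive-descent parser, the typical executable form of an LL strategy. The following Python sketch is our own illustration: the left-recursive version calls itself before consuming any input, so it would recurse forever, while the rewritten version expresses the same left-associative expressions with a loop.

# Left-recursive rule:            expr : expr '+' term | term
# Equivalent non-recursive form:  expr : term ('+' term)*

def parse_expr_left_recursive(tokens, pos):
    # The rule starts with expr itself, so the function calls itself
    # with the same input before consuming anything: infinite recursion.
    # Shown only for illustration; never call this.
    left, pos = parse_expr_left_recursive(tokens, pos)
    ...

def parse_term(tokens, pos):
    # In this toy example, a term is just a single token.
    return tokens[pos], pos + 1

def parse_expr(tokens, pos):
    # expr : term ('+' term)*  -- the same language, expressed as a loop
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_term(tokens, pos + 1)
        node = ("+", node, right)  # keeps the tree left-associative
    return node, pos

tree, _ = parse_expr(["1", "+", "2", "+", "3"], 0)
print(tree)  # ('+', ('+', '1', '2'), '3')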

There are two special kinds of LL(k) grammars: LL(1) and LL(*). In the past, the first kind was the only one considered practical, because it is easy to build efficient parsers for it. So much so that many computer languages were purposely designed to be described by an LL(1) grammar. An LL(*) parser, also known as an LL-regular parser, can deal with languages using an infinite amount of lookahead tokens.

On StackOverflow you can read a simple comparison between LL parsers and recursive descent parsers or one between LL parsers and LR parsers.

Earley Parser

The Earley parser is a chart parser named after its inventor, Jay Earley. The algorithm is usually compared to CYK, another chart parser, which is simpler but usually worse in both performance and memory use. The distinguishing feature of the Earley algorithm is that, in addition to storing partial results, it implements a prediction step to decide which rule it is going to try to match next.

The Earley parser fundamentally works by dividing a rule into segments, as in the following example.

// an example grammar
HELLO    : "hello"
NAME     : [a-zA-Z]+
greeting : HELLO NAME

// Earley parser would break up greeting like this
// . HELLO NAME
// HELLO . NAME
// HELLO NAME .

Then, working on these segments, which are connected at the dot (.), the parser tries to reach a completed state; that is to say, one with the dot at the end.
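As a rough illustration of how the dot moves, here is a minimal Earley recognizer in Python for the greeting grammar above. It is a sketch under simplifying assumptions (it works on already-lexed tokens and only recognizes, without building a parse tree), and all names are our own.

# The greeting grammar, over the pre-lexed tokens HELLO and NAME.
GRAMMAR = {
    "greeting": [["HELLO", "NAME"]],
}
NONTERMINALS = set(GRAMMAR)

def earley_recognize(tokens, start="greeting"):
    # chart[i] holds items (head, body, dot, origin) valid after i tokens
    chart = [set() for _ in range(len(tokens) + 1)]
    for body in GRAMMAR[start]:
        chart[0].add((start, tuple(body), 0, 0))

    for i in range(len(tokens) + 1):
        changed = True
        while changed:  # repeat until chart[i] stops growing
            changed = False
            for head, body, dot, origin in list(chart[i]):
                if dot < len(body):
                    symbol = body[dot]
                    if symbol in NONTERMINALS:
                        # PREDICTION: a nonterminal is expected here,
                        # so add all of its rules to the current set
                        for rule in GRAMMAR[symbol]:
                            item = (symbol, tuple(rule), 0, i)
                            if item not in chart[i]:
                                chart[i].add(item)
                                changed = True
                    elif i < len(tokens) and tokens[i] == symbol:
                        # SCANNING: the terminal matches, advance the dot
                        chart[i + 1].add((head, body, dot + 1, origin))
                else:
                    # COMPLETION: the rule is done, so advance every item
                    # that was waiting for `head` at the origin position
                    for h, b, d, o in list(chart[origin]):
                        if d < len(b) and b[d] == head:
                            item = (h, b, d + 1, o)
                            if item not in chart[i]:
                                chart[i].add(item)
                                changed = True

    # accept if a completed start rule spans the whole input
    return any(h == start and d == len(b) and o == 0
               for h, b, d, o in chart[len(tokens)])

print(earley_recognize(["HELLO", "NAME"]))  # True
print(earley_recognize(["HELLO"]))          # False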

The appeal of an Earley parser is that it is guaranteed to be able to parse all context-free languages, while other famous algorithms (e.g., LL, LR) can parse only a subset of them. For instance, it has no problem with left-recursive grammars. More generally, an Earley parser can also deal with nondeterministic and ambiguous grammars.

It can do that at the risk of worse performance, O(n³) in the worst case. However, it has linear time performance for normal grammars. The catch is that the set of languages parsed by the more traditional algorithms is the one we are usually interested in.

There is also a side effect of the lack of limitations: by forcing a developer to write the grammar in a certain way, the parsing can be made more efficient. That is, building an LL(1) grammar might be harder for the developer, but the parser can apply it very efficiently. With Earley, you do less work, so the parser does more of it.

In short, Earley allows you to use grammars that are easier to write, but that might be suboptimal in terms of performance.

Earley Use Cases

So, Earley parsers are easy to use, but the advantage in terms of performance, in the average case, might be non-existent. This makes the algorithm great for an educational environment or wherever productivity is more relevant than speed.

The first case is useful, for example, because most of the time, the grammars your users write will work just fine. The problem is that a more restrictive parser would throw obscure and seemingly random errors at them. Of course, the errors are not actually random; they are due to the limitations of an algorithm that your users do not know or understand. So, you would be forcing users to understand the inner workings of your parser just to use it, which should be unnecessary.

An example of when productivity is more important than speed might be a parser generator used to implement syntax highlighting for an editor that needs to support many languages. In such a situation, being able to quickly support new languages might be more desirable than completing the task as quickly as possible.

You can check out the rest of the series below:

  • Part 1
  • Part 2
  • Part 3
  • Part 4
  • Part 5
  • Part 6
  • Part 8
  • Part 9
