Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

A Guide to Parsing: Algorithms and Technology (Part 5)

DZone's Guide to

A Guide to Parsing: Algorithms and Technology (Part 5)

There are two main formats for a grammar: BNF (and its variants) and PEG. Learn about these different formats and when they should be used.

· AI Zone ·
Free Resource

Insight for I&O leaders on deploying AIOps platforms to enhance performance monitoring today. Read the Guide.

Be sure to check out Part 4 first! If you're encountering this series for the first time, you can find the rest of the posts at the bottom of this article.

Formats

There are two main formats for a grammar: BNF (and its variants) and PEG. Many tools implement their own variants of these ideal formats. Some tools use custom formats altogether. A frequent custom format consists of a three-part grammar: options together with custom code, followed by the lexer section and finally the parser one.

Given that a BNF-like format is usually the foundation of a context-free grammar, you could also see it identified in the CFG format.

Backus-Naur Form and Its Variants

Backus-Naur Form (BNF) is the most successful format and was even the basis of PEG. However, it is quite simple, and thus it is not often used in its base form. A more powerful variant is typically used.

To show why these variants were necessary, let’s show an example in BNF: the description of a character.

<letter>    ::= "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
<digit>     ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
<character> ::= <letter> | <digit>

The symbol <letter> can be transformed in any of the letters of the English alphabet, although in our example, only lowercase letters are valid. A similar process happens for <digit>, which can indicate any of the alternative digits. The first issue is that you have to list all the alternatives individually; you cannot use characters classes like you do with regular expressions. This is annoying, but usually manageable, unless, of course, you have to list all Unicode characters.

A much harder problem is that there is no easy way to denote optional elements or repetitions. So if you want to do that you have to rely on boolean logic and the alternative (|) symbol.

<text>      ::= <character> | <text> <character>

This rule states that <text> can be composed of a character or a shorter <text> followed by a <character>. This would be the parse tree for the word dog:

Parse tree for &apos;dog&apos;

BNF has many other limitations: it makes complicate to use empty strings or the symbols used by the format (i.e. ::=) in the grammar, it has a verbose syntax (i.e. you have to include terminals between < and >), etc.

Extended Backus-Naur Form

To solve some of these limitations, Niklaus Wirth created the Extended Backus-Naur Form(EBNF), which includes some concepts from his own Wirth syntax notation.

EBNF is the most used form of contemporary parsing tool, although tools might deviate from the standard notation. EBNF has a much cleaner notation and adopts more operators to deal with concatenation or optional elements.

Let’s see how you would write the previous example in EBNF.

letter    = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z" ;
digit     = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
character = letter | digit ;
text      = character , { character } ;

The text symbol is described more easily with the help of the concatenation operator (,) and the optional one ({ .. }). The resulting parse tree would also be simpler. The standard EBNF still presents some problems, like the unfamiliar syntax. To overcome this issue, most parsing tools adopt a syntax inspired by regular expressions and also support additional elements like characters classes.

If you need an in-depth coverage of the format you can read our article on EBNF.

ABNF

Augmented BNF (ABNF) is another variant of BNF, mainly developed to describe bi-directional communication protocols and formalized by the IETF with several documents (see RFC 5234 and updates).

ABNF can be as productive as EBNF, but it had some quirks that limited its adoption outside of internet protocols. For example, until recently, the standard dictated that strings were matched in a case-insensitive way, which would have been a problem for matching identifiers in many programming languages.

ABNF has a different syntax from EBNF; for example, the alternative operator is the slash (/), and sometimes it is plainly better. For instance, there is no need for a concatenation operator. It also has a few more things than standard EBNF. For instance, it allows you to define numeric ranges, such as %x30-39, which is equivalent to [0-9]. This is also used by the designers themselves to include standard character classes-like basic rules that the end user can use. An example of such rule is ALPHA, which is equivalent to [a-zA-Z].

PEG

Parsing Expression Grammar (PEG) is a format presented by Brian Ford in a 2004 paper. Technically, it derives from an old formal grammar called Top-Down Parsing Language (TDPL). However, a simple way to describe it is EBNF in the real world.

In fact, it looks similar to EBNF but also directly support things widely used, like character ranges (character classes) — although it also has some differences that are not actually that pragmatic, like using the more formal arrow symbol () for assignment instead of the more common equals symbol (=). The previous example could be written this way with PEG.

letter    ← [a-z]
digit     ← [0-9]
character ← letter / digit
text      ← character+

As you can see, this is the most obvious way a programmer would write it: with character classes and regular expression operators. The only anomalies are the alternative operator (/) and the arrow character, and in fact, many implementations of PEG use the "equals" character.

The pragmatic approach is the foundation of PEG formalism: it was created to describe more naturally programming languages. That is because context-free grammar has its origin in the work to describe natural languages. In theory, CFG is a generative grammar, while PEG is an analytic grammar.

The first should be a sort of recipe to generate only the valid sentences in the language described by the grammar. It does not have to be related to the parsing algorithm. Instead, the second kind defines directly the structure and semantics of a parser for that language.

PEG vs. CFG

The practical implications of this theoretical difference are limited: PEG is more closely associated with the packrat algorithm, but that is basically it. For instance, generally, PEG (packrat) does not permit left recursion. Although the algorithm itself can be modified to support it, this eliminates the linear-time parsing property. Also, PEG parsers generally are scannerless parsers.

Probably the most important difference between PEG and CFG is that the ordering of choices is meaningful in PEG, but not in CFG. If there are many possible valid ways to parse an input, a CFG will be ambiguous and thus will return an error. By usually wrong, we mean that some parsers that adopt CFGs can deal with ambiguous grammars; for instance, by providing all possible valid results to the developer and let them sort it out. Instead, PEG eliminates ambiguity altogether because the first applicable choice will be chosen; thus, a PEG can never be ambiguous.

The disadvantage of this approach is that you have to be more careful in listing possible alternatives because otherwise, you could have unexpected consequences. That is to say, some choices could never be matched.

In the following example, doge will never be matched. Since dog comes first, it will be picked each time.

dog ← 'dog' / 'doge'

It is an open area of research whether PEG can describe all grammars that can be defined with CFG, but for all practical purposes, they do.

You can check out the rest of the series below:

  • Part 1
  • Part 2
  • Part 3
  • Part 4
  • Part 6
  • Part 7
  • Part 8
  • Part 9
  • TrueSight is an AIOps platform, powered by machine learning and analytics, that elevates IT operations to address multi-cloud complexity and the speed of digital transformation.

    Topics:
    ai ,tutorial ,grammars ,machine learning ,deep learning ,algorithms

    Published at DZone with permission of

    Opinions expressed by DZone contributors are their own.

    {{ parent.title || parent.header.title}}

    {{ parent.tldr }}

    {{ parent.urlSource.name }}