Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Parsing in JavaScript: Tools and Libraries, Part 1

DZone's Guide to

Parsing in JavaScript: Tools and Libraries, Part 1

In this article, we introduce the idea of parsers, how they work, discuss the concepts underlying parser libraries and tools, and introduce parsing in JS.

· Web Dev Zone
Free Resource

Start coding today to experience the powerful engine that drives data application’s development, brought to you in partnership with Qlik.

If you need to parse a language, or document, from JavaScript there are fundamentally three ways to solve the problem:

  • Use an existing library supporting that specific language: for example a library to parse XML.
  • Building your own custom parser by hand.
  • A tool or library to generate a parser: for example ANTLR, that you can use to build parsers for any language.

Use an Existing Library

The first option is the best for well known and supported languages, like XML or HTML. A good library usually also includes APIs to programmatically build and modify documents in that language. This is typically more of what you get from a basic parser. The problem is that such libraries are not so common and they support only the most common languages. In other cases, you are out of luck.

Building Your Own Custom Parser by Hand

You may need to pick the second option if you have particular needs. Both in the sense that the language you need to parse cannot be parsed with traditional parser generators, or you have specific requirements that you cannot satisfy using a typical parser generator. For instance, because you need the best possible performance or a deep integration between different components.

A Tool or Library to Generate a Parser

In all other cases, the third option should be the default one, because it is the one that is most flexible and has the shortest development time. That is why in this article we concentrate on the tools and libraries that correspond to this option.

Note: text in blockquote describing a program comes from the respective documentation.

Tools to Create Parsers

We are going to see:

  • Tools that can generate parsers usable from JavaScript (and possibly from other languages).
  • JavaScript libraries to build parsers.

Tools that can be used to generate the code for a parser are called parser generators or compiler compiler. Libraries that create parsers are known as parser combinators.

Parser generators (or parser combinators) are not trivial: you need some time to learn how to use them and not all types of parser generators are suitable for all kinds of languages. That is why we have prepared a list of the best known of them, with a short introduction for each of them. We are also concentrating on one target language: JavaScript. This also means that (usually) the parser itself will be written in JavaScript.

To list all possible parser tools and libraries for all languages would be kind of interesting, but not that useful. That is because there will be simply too many options and we would all get lost in them. By concentrating on one programming language we can provide an apples-to-apples comparison and help you choose one option for your project.

Useful Things to Know About Parsers

To make sure that this list is accessible to all programmers, we have prepared a short explanation of terms and concepts that you may encounter searching for a parser. We are not trying to give you formal explanations, but practical ones.

Structure of a Parser

A parser is usually composed of two parts: a lexer, also known as scanner or tokenizer, and the proper parser. Not all parsers adopt this two-step schema: some parsers do not depend on a lexer. They are called scannerless parsers.

A lexer and a parser work in sequence: the lexer scans the input and produces the matching tokens, the parser scans the tokens and produces the parsing result.

Let’s look at the following example and imagine that we are trying to parse a mathematical operation.

437 + 734

The lexer scans the text and finds ‘4’, ‘3’, ‘7’ and then space ‘ ’. The job of the lexer is to recognize that the first characters constitute one token of type NUM. Then the lexer finds a ‘+’ symbol, which corresponds to the second token of the type PLUS, and lastly, it finds another token of type NUM.The input stream is transformed in Tokens and then in an AST by the parser

The parser will typically combine the tokens produced by the lexer and group them.

The definitions used by lexers or parsers are called rules or productions. A lexer rule will specify that a sequence of digits corresponds to a token of type NUM, while a parser rule will specify that a sequence of tokens of type NUM, PLUS, NUM corresponds to an expression.

Scannerless parsers are different because they process the original text directly, instead of processing a list of tokens produced by a lexer.

It is now typical to find suites that can generate both a lexer and parser. In the past, it was instead more common to combine two different tools: one to produce the lexer and one to produce the parser. This was, for example, the case of the venerable lex and yacc couple: lex produced the lexer, while yacc produced the parser.

Parse Tree and Abstract Syntax Tree

There are two terms that are related and sometimes they are used interchangeably: parse tree and Abstract Syntax Tree (AST).

Conceptually they are very similar:

  • they are both trees: there is a root representing the whole piece of code parsed. Then there are smaller subtrees representing portions of code that become smaller until single tokens appear in the tree.
  • the difference is the level of abstraction: the parse tree contains all the tokens which appeared in the program and possibly a set of intermediate rules. The AST instead is a polished version of the parse tree where the information that could be derived or is not important to understand the piece of code is removed.

In the AST, some information is lost, for instance, comments and grouping symbols (parentheses) are not represented. Things like comments are superfluous for a program and grouping symbols are implicitly defined by the structure of the tree.

A parse tree is a representation of the code closer to the concrete syntax. It shows many details of the implementation of the parser. For instance, usually, a rule corresponds to the type of a node. A parse tree is usually transformed in an AST by the user, possibly with some help from the parser generator.

A graphical representation of an AST looks like this.

An abstract syntax tree for the Euclidean algorithm

Sometimes you may want to start producing a parse tree and then derive an AST from this. This can make sense because the parse tree is easier to produce for the parser (it is a direct representation of the parsing process) but the AST is simpler and easier to process by the following steps. By following these steps, we mean all the operations that you may want to perform on the tree: code validation, interpretation, compilation, etc.

Grammar

A grammar is a formal description of a language that can be used to recognize its structure.

In simple terms, this is a list of rules that define how each construct can be composed. For example, a rule for an if statement could specify that it must start with the “if” keyword, followed by a left parenthesis, an expression, a right parenthesis, and a statement.

A rule could reference other rules or token types. In the example of the if statement, the keyword “if,” the left and the right parentheses were token types, while the expression and the statement were references to other rules.

The most used format to describe grammar is the Backus-Naur Form (BNF), which also has many variants, including the Extended Backus-Naur Form. The Extended variant has the advantage of including a simple way to denote repetitions. A typical rule in a Backus-Naur grammar looks like this:

<symbol> ::= __expression__

The <symbol> is usually nonterminal, which means that it can be replaced by the group of elements on the right, __expression__. The element __expression__ could contain other non-terminal symbols or terminal ones. Terminal symbols are simply the ones that do not appear as a <symbol> anywhere in the grammar. A typical example of a terminal symbol is a string of characters, like “class.”

Left-Recursive Rules

In the context of parsers, an important feature is the support for left-recursive rules. This means that a rule could start with a reference to itself. This reference could also be indirect.

Consider for example arithmetic operations. An addition could be described as two expression(s) separated by the plus (+) symbol, but an expression could also contain other additions.

addition       ::= expression '+' expression
multiplication ::= expression '*' expression
// an expression could be an addition or a multiplication or a number
expression     ::= addition | multiplication |// a number

This description also matches multiple additions like 5 + 4 + 3. That is because it can be interpreted as expression (5) (‘+’) expression(4+3). And then 4 + 3 itself can be divided into its two components.

The problem is that this kind of rule may not be used with some parser generators. The alternative is a long chain of expressions that take care also of the precedence of operators.

Some parser generators support direct left-recursive rules, but not indirect ones.

Types of Languages and Grammar

We care mostly about two types of languages that can be parsed with a parser generator: regular languages and context-free languages. We could give you the formal definition according to the Chomsky hierarchy of languages, but it would not be that useful. Let’s look at some practical aspects instead.

A regular language can be defined by a series of regular expressions, while a context-free one needs something more. A simple rule of thumb is that if a grammar of a language has recursive elements it is not a regular language. For instance, as we said elsewhere, HTML is not a regular language. In fact, most programming languages are context-free languages.

Usually, a kind of language corresponds to the same kind of grammar. That is to say, there are regular grammars and context-free grammars that correspond respectively to regular and context-free languages. But to complicate matters, there is a relatively new (created in 2004) kind of grammar, called Parsing Expression Grammar (PEG). These grammars are as powerful as Context-free grammars, but according to their authors, they describe programming languages more naturally.

The Differences Between PEG and CFG

The main difference between PEG and CFG is that the ordering of choices is meaningful in PEG, but not in CFG. If there are many valid ways to parse an input, a CFG will be ambiguous and thus wrong. Instead, with PEG, the first applicable choice will be chosen, and this automatically solves some ambiguities.

Another difference is that PEG uses scannerless parsers: they do not need a separate lexer or lexical analysis phase.

Traditionally, both PEG and some CFG have been unable to deal with left-recursive rules, but some tools have found workarounds for this. Either by modifying the basic parsing algorithm or by having the tool automatically rewrite a left-recursive rule in a nonrecursive way. Either of these ways has downsides: either by making the generated parser less intelligible or by worsening its performance. However, in practical terms, the advantages of easier and quicker development outweigh the drawbacks.

var lexer = new Lexer;

var chars = lines = 0;

lexer.addRule(/\n/, function () {
    lines++;
    chars++;
});

lexer.addRule(/./, function () {
    chars++;
});

lexer.setInput("Hello World!\n Hello Stars!\n!")

lexer.lex();

Create data driven applications in Qlik’s free and easy to use coding environment, brought to you in partnership with Qlik.

Topics:
web dev ,javascript ,parsers

Published at DZone with permission of Gabriele Tomassetti, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

THE DZONE NEWSLETTER

Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

X

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}