Parsing in JavaScript: Tools and Libraries, Part 2

We explore the world of parser generators and parser combinators, to help guide you intrepid web developers through the complex world of JavaScript.

Gabriele Tomassetti

Aug. 15, 17 · Analysis

Likes (5)

Comment

Save

8.0K Views

Parser Generators

The basic workflow of a parser generator tool is quite simple: you write a grammar that defines the language, or document, and you run the tool to generate a parser usable from your JavaScript code.

The parser might produce the AST that you may have to traverse yourself or you can traverse with additional ready-to-use classes, such Listeners or Visitors. Some tools instead offer the chance to embed code inside the grammar to be executed every time the specific rule is matched.

Usually, you need a runtime library and/or program to use the generated parser.

Regular (Lexer)

Tools that analyze regular languages are typically called lexers.

Lexer

Lexer is a lexer that claims to be modeled after flex. Though a more fair description would be a short lexer based upon RegExp. The documentation seems minimal, with just a few examples, but the whole thing is 147 lines of code, so it is actually comprehensive.

However, in a few lines, it manages to support a few interesting things and it appears to be quite popular and easy to use. One thing is its supports RingoJS, a JavaScript platform on top of the JVM. Another one is the integration with Jison, the Bison clone in JavaScript. If you temper your expectations, it can be a useful tool.

There is no grammar, you just use a function to define the RegExp pattern and the action that should be executed when the pattern is matched. So it is a cross between a lexer generator and a lexer combinator. You cannot combine different lexer functions, like in a lexer combinator, but the lexer is only created dynamically at runtime, so it is not a proper lexer generator either.

A lexer example in JavaScript:

var lexer = new Lexer;
 
var chars = lines = 0;
 
lexer.addRule(/\n/, function () {
    lines++;
    chars++;
});
    
lexer.addRule(/./, function () {
    chars++;
});
    
lexer.setInput("Hello World!\n Hello Stars!\n!")
    
lexer.lex();

Context Free

Let’s see the tools that generate Context Free parsers.

ANTLR

ANTLR is a great parser generator written in Java that can also generate parsers for JavaScript and many other languages. There is also a beta version for TypeScript from the same guy that makes the optimized C# version. ANTLR is based on a new LL algorithm developed by the author and described in this paper: Adaptive LL(*) Parsing: The Power of Dynamic Analysis (PDF).

It is quite popular for its many useful features: for instance version 4 supports direct left-recursive rules. However, a real added value is ANTLR's vast community and the large amount of grammars available.

It provides two ways to walk the AST, instead of embedding actions in the grammar: visitors and listeners. The first one is good for when you have to manipulate or interact with the elements of the tree, while the second is useful when you just have to do something when a rule is matched.

The typical grammar is divided into two parts: lexer rules and parser rules. The division is implicit, since all the rules starting with an uppercase letter are lexer rules, while the ones starting with a lowercase letter are parser rules. Alternatively, lexer and parser grammars can be defined in separate files.

A very simple ANTLR grammar

grammar simple;
 
basic   : NAME ':' NAME ;
 
NAME    : [a-zA-Z]* ;
 
COMMENT : '/*' .*? '*/' -> skip ;

If you are interested in ANTLR, you can look into this giant ANTLR tutorial we have written.

APG

APG is a recursive-descent parser using a variation of Augmented BNF, that they call Superset Augmented BNF. ABNF is a particular variant of BNF designed to better support bi-directional communications protocols. APG also supports additional operators, like syntactic predicates and custom user defined matching functions.

It can generate parsers in C/C++, Java, and JavaScript. Support for the last language seems superior and more up to date: it has a few more features and seems more updated. In fact, the documentation says it is designed to have the look and feel of JavaScript RegExp.

Because it is based on ABNF, it is especially well suited to parsing the languages of many Internet technical specifications and, in fact, is the parser of choice for a number of large Telecom companies.

APG grammar is very clean and easy to understand.

APG grammar example:

// example from a tutorial of the author of the tool available here
// https://www.sitepoint.com/alternative-to-regular-expressions/
phone-number = ["("] area-code sep office-code sep subscriber
area-code    = 3digit                       ; 3 digits
office-code  = 3digit                       ; 3 digits
subscriber   = 4digit                       ; 4 digits
sep          = *3(%d32-47 / %d58-126 / %d9) ; 0-3 ASCII non-digits
digit        = %d48-57                      ; 0-9

Jison

Jison generates bottom-up parsers in JavaScript. Its API is similar to Bison’s, hence the name.

Despite the name, Jison can also replace flex, so you do not need a separate lexer. Although you can use one or build your own custom lexer. All you need is an object with the functions setInput and lex. It can best recognize languages described by LALR(1) grammars, though it also has modes for LR(0), SLR(1).

The generated parser does not require a runtime component, so you can use it as a standalone software.

It has a good enough documentation with a few examples and even a section to try your grammar online. Although, at times, it relies on the Bison manual to cover for the lack of its own documentation.

A Jison grammar can be inputted using a custom JSON format or a Bison-styled one. Both require you to use embedded actions if you want to do something when a rule is matched. The following example is in the custom JSON format.

// example from the documentation
{
   "comment": "JSON Math Parser",
   // JavaScript comments also work
 
   "lex": {
      "rules": [
         ["\\s+",                    "/* skip whitespace */"],
         ["[0-9]+(?:\\.[0-9]+)?\\b", "return 'NUMBER'"],
         ["\\*",                     "return '*'"],
         ["\\/",                     "return '/'"],
         ["-",                       "return '-'"],
         ["\\+",                     "return '+'"],
         [..] // skipping some parts
      ]
   },
 
   "operators": [
      ["left", "+", "-"],
      ["left", "*", "/"],
      ["left", "UMINUS"]
      [..] // skipping some parts
   ],
 
   "bnf": {
      "expressions": [["e EOF",   "return $1"]],
 
      "e" :[
         ["e + e",  "$$ = $1+$3"],
         ["e - e",  "$$ = $1-$3"],  
         ["- e",    "$$ = -$2", {"prec": "UMINUS"}],         
         ["NUMBER", "$$ = Number(yytext)"],
         [..] // skipping some parts
      ]
   }
}

It is very popular and used by many projects, including CoffeeScript and Handlebars.js.

nearley

nearley uses the Earley parsing algorithm augmented with Joop Leo’s optimizations to parse complex data structures easily. nearley is über-fast and really powerful. It can parse literally anything you throw at it.

The Earley algorithm is designed to easily handle all grammars, including left-recursive and ambiguous ones. Essentially, its main advantage it is that it should never catastrophically fail. On the other hand, it could be slower than other parsing algorithms. nearley, itself is also able to detect some ambiguous grammars. It can also report multiple results in the case of an ambiguous input.

A nearley grammar is a written in a .ne file that can include custom code. A rule can include an embedded action, which the documentation calls a postprocessing function. You can also use a custom lexer.

An example of nearley grammar in .NET

# from the examples in Nearly
# csv.ne
 
@{%
var appendItem = function (a, b) { return function (d) { return d[a].concat([d[b]]); } };
var appendItemChar = function (a, b) { return function (d) { return d[a].concat(d[b]); } };
var empty = function (d) { return []; };
var emptyStr = function (d) { return ""; };
%}
 
 
file              -> header newline rows             {% function (d) { return { header: d[0], rows: d[2] }; } %}
# id is a special preprocessor function that return the first token in the match
header            -> row                             {% id %}
 
rows              -> row
                   | rows newline row                {% appendItem(0,2) %}
 
row               -> field
                   | row "," field                   {% appendItem(0,2) %}
 
field             -> unquoted_field                  {% id %}
                   | "\"" quoted_field "\""          {% function (d) { return d[1]; } %}
 
quoted_field      -> null                            {% emptyStr %}
                   | quoted_field quoted_field_char  {% appendItemChar(0,1) %}
 
quoted_field_char -> [^"]                            {% id %}
                   | "\"" "\""                       {% function (d) { return "\""; } %}
 
unquoted_field    -> null                            {% emptyStr %}
                   | unquoted_field char             {% appendItemChar(0,1) %}
 
char              -> [^\n\r",]                       {% id %}
 
newline           -> "\r" "\n"                       {% empty %}ca
                   | "\r" | "\n"                     {% empty %}

Another interesting feature is that you could build custom tokens. You can define them using a tokenizing library, a literal or a test function. The test function must return true if the text corresponds to that specific token.

Custom tokens example:

	@{%
var print_tok  = {literal: "print"};
var number_tok = {test: function(x) {return x.constructor === Number; }}
%}
 
 
main -> %print_tok %number_tok

nearley includes tools for debugging and understanding your parser. For instance, Unparser can automatically generate random strings that are considered correct by your parser. This is useful to test your parser against random noise or even to generate data from a schema (e.g. a random email address). It also includes a tool to generate SVG railroad diagrams: a graphical way to represent a grammar.

Railroad diagrams from the documentation demo

A nearley parser requires the nearley runtime.

The nearley documentation is a good overview of what is available and there is also a third-party playground to try grammar online. But you will not find a complete explanation of all the features.

PEG

Now that we've covered the CFG parsers, it is time to see the PEG parsers available for JavaScript.

Canopy

Canopy is a parser compiler targeting Java, JavaScript, Python, and Ruby. It takes a file describing a parsing expression grammar and compiles it into a parser module in the target language. The generated parsers have no runtime dependency on Canopy itself.

It also provides easy access to the parse tree nodes.

A Canopy grammar has the neat feature of using actions annotation to use custom code in the parser. In practical terms, you just write the name of a function next to a rule and then you implement the function in your source code.

A Canopy grammar with actions:

// the actions are prepended by %
grammar Maps
  map     <-  "{" string ":" value "}" %make_map
  string  <-  "'" [^']* "'" %make_string
  value   <-  list / number
  list    <-  "[" value ("," value)* "]" %make_list
  number  <-  [0-9]+ %make_number

The JavaScript file containing the action code.

var maps = require('./maps');
 
var actions = {
  make_map: function(input, start, end, elements) {
    var map = {};
    map[elements[1]] = elements[3];
    return map;
  },
  [..]
};

Ohm

Ohm is a parser generator consisting of a library and a domain-specific language.

Ohm grammars are defined in a custom format that can be put in a separate file or in a string. However, the parser is generated dynamically and not with a separate tool. In that sense, it works like a parser library more than a traditional parser generator.

A helper function to create an AST is included among the extras.

In Ohm, a grammar defines a language, and semantic actions specify what to do with valid inputs in that language.

A grammar is completely separated from semantic actions. This simplifies portability and readability and allows one to support different languages with the same grammar. At the moment, Ohm only supports JavaScript, but the team is planning on adding more languages in the future. The typical Ohm grammar is clean and readable.

arithmetic.ohm

Arithmetic {
  Exp
    = AddExp
 
  AddExp
    = AddExp "+" PriExp  -- plus
    | AddExp "-" PriExp  -- minus
    | PriExp
 
  PriExp
    = "(" Exp ")"  -- paren
    | number
 
  number
    = digit+
}

Semantic actions can be implemented using a simple API. In practical terms, this ends up working like the visitor pattern, with the difference that is easier to define more groups of semantic actions.

Example of Ohm semantic actions in JavaScript:

var semantics = g.createSemantics().addOperation('eval', {
  Exp: function(e) {
    return e.eval();
  },
  AddExp: function(e) {
    return e.eval();
  },
  AddExp_plus: function(left, op, right) {
    return left.eval() + right.eval();
  },
  AddExp_minus: function(left, op, right) {
    return left.eval() - right.eval();
  },
  PriExp: function(e) {
    return e.eval();
  },
  PriExp_paren: function(open, exp, close) {
    return exp.eval();
  },
  number: function(chars) {
    return parseInt(this.sourceString, 10);
  },
});
var match = g.match('1 + (2 - 3) + 4');
console.log(semantics(match).eval());

To support debugging, Ohm has a text trace and (work in progress) graphical visualizer. You can see the graphical visualizer at work and test a grammar in the interactive editor.

The documentation is not that bad, though you have to go under the doc directory to find it. There is no tutorial, but there are a few examples and a reference. In particular, the documentation suggests reading a well commented Math example.

PEG.js

PEG.js is a simple parser generator for JavaScript that produces fast parsers with excellent error reporting.

PEG.js can work as a traditional parser generator and create a parser with a tool or can generate one using a grammar defined in the code. It supports different module loaders (e.g. AMD, CommonJS, etc.).

PEG.js has a neat online editor that allows you to write a grammar, test the generated parser, and download it.

A PEG.js grammar is usually written in a .pegjs file and can include embedded actions, custom initializing code, and semantic predicates. That is to say, functions that determine if a specific match is activated or not.

Example of math.pegjs grammar:

// example from the documentation
{
  function makeInteger(o) {
    return parseInt(o.join(""), 10);
  }
}
 
start
  = additive
 
additive
  = left:multiplicative "+" right:additive { return left + right; }
  / multiplicative
 
multiplicative
  = left:primary "*" right:multiplicative { return left * right; }
  / primary
 
primary
  = integer
  / "(" additive:additive ")" { return additive; }
 
integer "integer"
  = digits:[0-9]+ { return makeInteger(digits); }

The documentation is good enough, there are a few example grammars, but there are no tutorials available. The popularity of the project has led to the development of third-party tools, like one to generate railroad diagrams, and plugins, like one to generate TypeScript parsers.

Waxeye

Waxeye is a parser generator based on parsing expression grammars (PEGs). It supports C, Java, Javascript, Python, Ruby, and Scheme.

Waxeye can facilitate the creation of an AST by defining nodes in the grammar that will not be included in the generated tree. That is quite useful, but a drawback of Waxeye is that it only generates an AST. In the sense that there is no way to automatically execute an action when you match a node. You have to traverse and execute what you need manually.

One positive side-effect of this limitation is that grammars are easily readable and clean. They are also independent of any language.

Calc.waxeye

// from the manual
 
calc  <- ws sum
 
sum   <- prod *([+-] ws prod)
 
prod  <- unary *([*/] ws unary)
 
unary <= '-' ws unary
       | :'(' ws sum :')' ws
       | num
 
num   <- +[0-9] ?('.' +[0-9]) ws
 
ws    <: *[ \t\n\r]

A particular feature of Waxeye is that it provides some help to compose different grammars together and then it facilitates modularity. For instance, you could create a common grammar for identifiers that are usually similar in many languages.

Waxeye has a great documentation in the form of a manual that explains basic concepts and how to use the tool for all the languages it supports. There are a few example grammars.

Waxeye seems to be maintained, but it is not actively developed.

Parser Combinators

They allow you to create a parser simply with JavaScript code, by combining different pattern matching functions, that are equivalent to grammar rules. They are generally considered best suited for simpler parsing needs. Given that they are just JavaScript libraries, you can easily introduce them into your project: you do not need any specific generation steps and you can write all of your code in your favorite editor. Their main advantage is the possibility of being integrated into your traditional workflow and IDE.

In practice, this means that they are very useful for all the little parsing problems you find. If the typical developer encounters a problem that is too complex for a simple regular expression, these libraries are usually the solution. In short, if you need to build a parser, but you don’t actually want to, a parser combinator may be your best option.

Bennu, Parjs, and Parsimmon

Bennu is a JavaScript parser combinator library based on Parsec. The Bennu library consists of a core set of parser combinators that implement Fantasy Land interfaces. More advanced functionality, such as detailed error messaging, custom parser state, memoization, and running unmodified parsers incrementally, is also supported.

All libraries are inspired by Parsec. Bennu and Parsimmon are the oldest and Bennu the more advanced between the two. They also all have an extensive documentation, but they have no tutorial. Bennu seems to be maintained, but it is not actively developed.

Here's an example of Bennu example in JavaScript from the documentation:

var parse = require('bennu').parse;
var text = require('bennu').text;
 
var aOrB = parse.either(
    text.character('a'),
    text.character('b'));
 
parse.run(aOrB, 'b'); // 'b'

Parsimmon is a small library for writing big parsers made up of lots of little parsers. The API is inspired by parsec and Promises/A+.

Parsimmon is the most popular among the three, it is stable and updated.

The following is a part of the JSON example.

Parsimmon JSON example in JavaScript:

let whitespace = P.regexp(/\s*/m);
 
let JSONParser = P.createLanguage({
  // This is the main entry point of the parser: a full JSON value.
  value: r =>
    P.alt(
      r.object,
      r.string,
      [..]
    ).thru(parser => whitespace.then(parser)),
 
  [..]
 
  // Regexp based parsers should generally be named for better error reporting.
  string: () =>
    token(P.regexp(/"((?:\\.|.)*?)"/, 1))
      .map(interpretEscapes)
      .desc('string'),
 
  [..]
 
  object: r =>
    r.lbrace
      .then(r.pair.sepBy(r.comma))
      .skip(r.rbrace)
      .map(pairs => {
        let object = {};
        pairs.forEach(pair => {
          let [key, value] = pair;
          object[key] = value;
        });
        return object;
      }),
});
 
///////////////////////////////////////////////////////////////////////
 
let text = // JSON Object
 
let ast = JSONParser.value.tryParse(text);

Parjs is a JavaScript library of parser combinators, similar in principle and in design to the likes of Parsec and in particular its F# adaptation FParsec.

It’s also similar to the Parsimmon library, but intends to be superior to it.

Parjs is only a few months old, but it is already quite developed. It also has the advantage of begin written in TypeScript.

All the libraries have good documentation, but Parjs is great: it explained how to use the parser and also how to design good parsers with it. There are a few examples, including the following on string formatting.

Parjs String Format example:

/**
 * Created by lifeg on 04/04/2017.
 */
import "../setup";
import {Parjs, LoudParser} from "../src";
//+ DEFINING THE PARSER
 
//Parse an identifier, an asciiLetter followed by an asciiLetter or digit, e.g. a12b but not 1ab.
let ident = Parjs.letter.then(Parjs.letter.or(Parjs.digit).many()).str;
 
//Parse a format token, an `ident` between `{` and `}`. Return the result as a Token object.
let formatToken = ident.between(Parjs.string("{"), Parjs.string("}")).map(x => ({token: x}));
 
//Parse an escaped character. This parses "`{a}" as the text "{a}" instead of a token.
//Also escapes the escaped char, parsing "``" as "`".
//Works for arbitrary characters like `a being parsed as a.
let escape = Parjs.string("`").then(Parjs.anyChar).str.map(x => ({text: x.substr(1)}));
 
//Parse text which is not an escape character or {.
let text = Parjs.noCharOf("`{").many(1).str.map(x => ({text: x}));
 
//The parser itself. Parses either a formatToken, e.g. {abc} or an escaped combo `x, or text that doesn't contain `{.
//Parses many times.
let formatParser = formatToken.or(escape, text).many();
 
//This prints a sequence of tokens {text: "hello, my name is "}, {token: name}, {text: " and I am "}, {token: " years old}, ...
console.log(formatParser.parse("hello, my name is {name} and I am {age} years old. This is `{escaped}. This is double escaped: ``{name}."));

That's it for today! Tune in tomorrow when we'll wrap up this series on JavaScript Parsers by talking about some JavaScript specific libraries related to parsing.

JavaScript library Parser (programming language) Documentation

Published at DZone with permission of Gabriele Tomassetti, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

Trending