Parsing in C#: All the Tools and Libraries You Can Use (Part 2)
In today's post, we take a look at the best parser generators out there for working with C#, taking a look at the code you need to use them as well.
Join the DZone community and get the full member experience.
Join For FreeWelcome back! If you missed Part 1, you can check it out here.
Parser Generators
The basic workflow of a parser generator tool is quite simple: you write a grammar that defines the language, or document, and you run the tool to generate a parser usable from your C# code.
The parser might produce the AST, that you may have to traverse yourself or you can traverse with additional ready-to-use classes, such as Listeners or Visitors. Some tools, instead, offer the chance to embed code inside the grammar to be executed every time the specific rule is matched.
Usually, you need a runtime library and/or program to use the generated parser.
Regular (Lexer)
Tools that analyze regular languages are typically called lexers.
Gardens Point LEX
Gardens Point LEX (GPLEX) is a lexical analyzer (lexer) generator based upon finite state automata. The input language is similar to the original LEX, but it also implement some extensions of FLEX. Obviously, it has better Unicode support. GPLEX can generate a C# lexer from a grammar file, .lex
.
The typical grammar is divided in three parts, separated by '%%':
- Options and C# definitions.
- Rules with embedded actions.
- Usercode.
As you can see in the following example, in standalone programs the usercode section contains the Main
function, so a .lex file can generate a complete functioning program. Although this makes it quite messy and hard to read for the untrained reader.
// example from the documentation
%namespace LexScanner
%option noparser nofiles
alpha [a-zA-Z]
%%
foo |
bar Console.WriteLine("keyword " + yytext);
{alpha}{3} Console.WriteLine("TLA " + yytext);
{alpha}+ Console.WriteLine("ident: " + yytext);
%%
public static void Main(string[] argp) {
Scanner scnr = new Scanner();
for (int i = 0; i < argp.Length; i++) {
Console.WriteLine("Scanning \"" + argp[i] + "\"");
scnr.SetSource(argp[i], 0);
scnr.yylex();
}
}
It can be used as a standalone tool, but, being a lexer generator, it can also work with parser generators: it is designed to work with Gardens Point Parser Generator, however, it has also been used with COCO/R and custom parsers.
Context Free
Let's see the tools that generate Context Free parsers.
ANTLR
ANTLR is a great parser generator written in Java that can also generate parsers for C# and many other languages. A particularity of the C# target is that there are actually two versions: the original by sharwell and the new standard runtime. The original defined itself as C# optimized, while the standard one is included in the general distribution of the tool. Neither is a fork, since the authors work together and both are mentioned in the official website. It is more of a divergent path. The grammars are compatible, but the generated parsers are not.
If you are unsure of which one to pick, I suggest the standard one, since it is slightly more updated. In fact, the standard has a release version supporting .NET Core, while the original only has a pre-release.
ANTLR is based on a new LL algorithm developed by the author and described in this paper: Adaptive LL(*) Parsing: The Power of Dynamic Analysis (PDF).
It is quite popular for its many useful features: for instance version 4 supports direct left-recursive rules. However, a real added value of a vast community is the large amount of grammars available.
It provides two ways to walk the AST, instead of embedding actions in the grammar: visitors and listeners. The first one is suited for when you have to manipulate or interact with the elements of the tree, while the second is useful when you just have to do something when a rule is matched.
The typical grammar is divided into two parts: lexer rules and parser rules. The division is implicit, since all the rules starting with an uppercase letter are lexer rules, while the ones starting with a lowercase letter are parser rules. Alternatively, lexer and parser grammars can be defined in separate files.
grammar simple;
basic : NAME ':' NAME ;
NAME : [a-zA-Z]* ;
COMMENT : '/*' .*? '*/' -> skip ;
If you are interested in learning how to use ANTLR, you can look into this giant ANTLR tutorial we have written.
Coco/R
Coco/R is a compiler generator that takes an attributed grammar and generates a scanner and a recursive descent parser. Attributed grammar means that the rules, that are written in an EBNF variant, can be annotated in several ways to change the methods of the generated parser.
The scanner includes support for dealing with things like compiler directives, called pragmas. They can be ignored by the parser and handled by custom code. The scanner can also be suppressed and substituted with one built by hand.
Technically, all the grammars must be LL(1), that is to say, the parser must be able to choose the correct rule only looking one symbol ahead. But Coco/R provides several methods to bypass this limitation, including semantic checks, which are basically custom functions that must return a boolean value. The manual also provides some suggestions for refactoring your code to respect this limitation.
A Coco/R grammar looks like this.
[Imports]
// ident is the name of the grammar
"COMPILER" ident
// this includes arbitrary fields and method in the target language (eg. Java)
[GlobalFieldsAndMethods]
// ScannerSpecification
CHARACTERS
[..]
zero = '0'.
zeroToThree = zero + "123" .
octalDigit = zero + "1234567" .
nonZeroDigit = "123456789".
digit = '0' + nonZeroDigit .
[..]
TOKENS
ident = letter { letter | digit }.
[..]
// ParserSpecification
PRODUCTIONS
// just a rule is shown
IdentList =
ident <out int x> (. int n = 1; .)
{',' ident (. n++; .)
} (. Console.WriteLine("n = " + n); .)
.
// end
"END" ident '.'
Coco/R has a good documentation, with several examples grammars. It supports several languages including Java, C#, and C++.
Gardens Point Parser Generator
There are some adaptations to make it work with C# and its tools (e.g. MPF). However, a particular feature of GPPG is the possibility of also generating an HTML report of the structure of the generated parser. This is meant to simplify understanding and analysis of the parser and grammar.
It is designed to work with its brother GPLEX, however, it has also been used with COCO/R and custom lexers. The structure of the grammar is similar to the one of the brother, but instead of .lex
it has the extension .y
.
Grammatica
The description on the Grammatica website is itself a good representation of Grammatica: simple to use, well-documented, with a good amount of features. You can build a listener by subclassing the generated classes, but not a visitor. There is a good reference, but not many examples.
A typical grammar of Grammatica is divided in three sections: header, tokens, and production. It is also clean, almost as much as an ANTLR one. It is also based on a similar Extended BNF, although the format is slightly different.
%header%
GRAMMARTYPE = "LL"
[..]
%tokens%
ADD = "+"
SUB = "-"
[..]
NUMBER = <<[0-9]+>>
WHITESPACE = <<[ \t\n\r]+>> %ignore%
%productions%
Expression = Term [ExpressionTail] ;
ExpressionTail = "+" Expression
| "-" Expression ;
Term = Factor [TermTail] ;
[..]
Atom = NUMBER
| IDENTIFIER ;
Hime Parser Generator
Hime Parser Generator is a parser generator for .NET and Java, with a modern implementation of GLR with the RNGLR algorithm. It is a pragmatic tool with everything you need and nothing more. It does not reinvent the wheel, but it does improve it.
The grammar uses a custom language based on BNF with some enhancement. For instance, it supports context-sensitive lexing (useful for soft keywords), template rules to avoid repetition of similar rules, and features to transform the parse tree in an AST. These features are called tree actions: drop and promote. One drops the node from the tree, the other substitutes the node with its children.
Instead of embedding code in a Hime grammar, you can annotate a rule with something called semantic action in the documentation. In practical terms, you just write the name of a function next to a rule and then you implement the function in your source code.
The grammar is put in a file with the .gram
extension. The structure of the file resembles the classic one with the three sections: options, terminals, and rules. But it is much cleaner and looks like a C# class.
grammar MathExp
{
options
{
Axiom = "exp";
Separator = "SEPARATOR";
}
terminals
{
WHITE_SPACE-> U+0020 | U+0009 | U+000B | U+000C ;
SEPARATOR-> WHITE_SPACE+;
INTEGER-> [1-9] [0-9]* | '0' ;
REAL-> INTEGER? '.' INTEGER (('e' | 'E') ('+' | '-')? INTEGER)?
| INTEGER ('e' | 'E') ('+' | '-')? INTEGER ;
NUMBER-> INTEGER | REAL ;
}
rules
{
exp_atom-> NUMBER^ @OnNumber // @ is a semantic action
| '('! exp^ ')'! ; // the final ! drop the node from the final tree
exp_op0-> exp_atom^
| exp_op0 '*'^ exp_atom // the ^ drop the node and changes with its children
| exp_op0 '/'^ exp_atom ;
exp_op1-> exp_op0^
| exp_op1 '+'^ exp_op0
| exp_op1 '-'^ exp_op0 ;
exp-> exp_op1^ ;
}
}
The documentation is concise, but complete: there are tutorials and recipes to explain practical usage of the tool. There is an equally good enough grammar repository. It has debug features, like the generation of DOT files.
LLLPG
So, LLLPG is a LL(k) parser generator that is not actually a standalone parser generator, but a part of a larger project called Enhanced C#. It is not really usable standalone, because it does not even generate a complete class, as the tool only translates the parts of the input file that it recognizes.
It also bizarrely claims to be better than ANTLR2 (released in 2006), despite not being updated until recently. But, we have mentioned it because, for the very narrow objective of building a custom language on .NET, it is a good tool designed just for that objective.
PEG
After the CFG parsers, is time to look at the PEG parsers available for C#.
IronMeta
Despite the potentially perplexing reference to being a programming language, IronMeta is a PEG parser generator that works just like any other.
IronMeta improves upon its base, OMeta, by allowing direct and indirect left recursion. On the other hand, error reporting and documentation are limited.
An IronMeta grammar can contain embedded actions and conditions. A condition is a boolean expression that controls the activation of the following rule. If the condition is true, the rule activates.
// from the documentation
// IronMeta Calculator Example
using System;
using System.Linq;
ironmeta Calc<char, int> : IronMeta.Matcher.CharMatcher<int>
{
Expression = Additive;
Additive = Add | Sub | Multiplicative;
Add = BinaryOp(Additive, '+', Multiplicative) -> { return _IM_Result.Results.Aggregate((total, n) => total + n); };
Sub = BinaryOp(Additive, '-', Multiplicative) -> { return _IM_Result.Results.Aggregate((total, n) => total - n); };
Multiplicative = Multiply | Divide;
Multiplicative = Number(DecimalDigit);
Multiply = BinaryOp(Multiplicative, "*", Number, DecimalDigit) -> { return _IM_Result.Results.Aggregate((p, n) => p * n); };
Divide = BinaryOp(Multiplicative, "/", Number, DecimalDigit) -> { return _IM_Result.Results.Aggregate((q, n) => q / n); };
BinaryOp :first :op :second .?:type = first:a KW(op) second(type):b -> { return new List<int> { a, b }; };
Number :type = Digits(type):n WS* -> { return n; };
Digits :type = Digits(type):a type:b -> { return a*10 + b; };
Digits :type = type;
DecimalDigit = .:c ?( (char)c >= '0' && (char)c <= '9' ) -> { return (char)c - '0'; };
KW :str = str WS*;
WS = ' ' | '\n' | '\r' | '\t';
}
Pegasus
Pegasus is a simple parser generator with equally sparse documentation. It supports the formal definition of PEG and does have basic features to simplify the management of indentation and debugging.
A Pegasus grammar is written in a .peg
file that is quite clean, but can also includes embedded actions.
@namespace MyProject
@classname ExpressionParser
additive <double> -memoize
= left:additive "+" right:multiplicative { left + right }
/ left:additive "-" right:multiplicative { left - right }
/ multiplicative
multiplicative <double> -memoize
= left:multiplicative "*" right:power { left * right }
/ left:multiplicative "/" right:power { left / right }
/ power
power <double>
= left:primary "^" right:power { Math.Pow(left, right) }
/ primary
primary <double> -memoize
= decimal
/ "-" primary:primary { -primary }
/ "(" additive:additive ")" { additive }
decimal <double>
= value:([0-9]+ ("." [0-9]+)?) { double.Parse(value)
That's all for Part 2! Tune back in tomorrow when we'll take a look at parser combinators, the best way to part C# code, and some great tools.
Published at DZone with permission of Gabriele Tomassetti, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments