Parsing in Java (Part 2): Diving Into CFG Parsers
Parsing in Java is a broad topic, so let's cover the various techniques, tools, and libraries out there and see which works best where and when.
Join the DZone community and get the full member experience.
Join For FreeIn case you missed Part 1, we covered the basics of how parsing works, then dove a bit deeper into the architecture of parsers, considering overviews the different types available, parse trees vs. ASTs, and more. Here in Part 2, we'll touch on parser generators before diving into context-free grammars. Part 3 will then follow up with a look at parsing expression grammars (PEGs).
Parser Generators
The basic workflow of a parser generator tool is quite simple: You write a grammar that defines the language, or document, and you run the tool to generate a parser usable from your Java code.
The parser might produce the AST, which you may have to traverse yourself, or you can traverse with additional ready-to-use classes, such Listeners or Visitors. Some tools instead offer the chance to embed code inside the grammar to be executed every time the specific rule is matched.
Usually, you need a runtime library and/or program to use the generated parser.
Regular (Lexer)
Tools that analyze regular languages are typically lexers.
We are not going to talk about it because it is very basic, but Java includes a library to parse data with numbers and simple patterns: java.util.Scanner. It could be defined as a smart library to read streams of data. It might be worth it to check it out if you need to quickly parse some data.
JFlex
JFlex is a lexical analyzer (lexer) generator based upon deterministic finite automata (DFA). A JFlex lexer matches the input according to the defined grammar (called spec) and executes the corresponding action (embedded in the grammar).
It can be used as a standalone tool, but being a lexer generator, it is designed to work with parser generators: Typically, it is used with CUP or BYacc/J. It can also work with ANTLR.
The typical grammar (spec) is divided into three parts, separated by ‘%%’:
- Usercode, which will be included in the generated class
- Options/macros
- And finally the lexer rules.
A JFlex spec file:
// taken from the documentation
/* JFlex example: partial Java language lexer specification */
import java_cup.runtime.*;
%%
// second section
%class Lexer
%unicode
%cup
[..]
LineTerminator = \r|\n|\r\n
%%
// third section
/* keywords */
<YYINITIAL> "abstract" { return symbol(sym.ABSTRACT); }
<YYINITIAL> "boolean" { return symbol(sym.BOOLEAN); }
<YYINITIAL> "break" { return symbol(sym.BREAK); }
<STRING> {
\" { yybegin(YYINITIAL);
return symbol(sym.STRING_LITERAL,
string.toString()); }
[..]
}
/* error fallback */
[^] { throw new Error("Illegal character <"+
yytext()+">"); }
Context Free
Let’s see the tools that generate Context Free parsers.
ANTLR
ANTLR is probably the most used parser generator for Java. ANTLR is based on a new LL algorithm developed by the author and described in this paper: Adaptive LL(*) Parsing: The Power of Dynamic Analysis (PDF).
It can output parsers in many languages, but the real added value of a vast community is the large number of grammars available. Version 4 supports direct left-recursive rules.
It provides two ways to walk the AST, instead of embedding actions in the grammar: visitors and listeners. The first one is suited for when you have to manipulate or interact with the elements of the tree, while the second is useful when you just have to do something when a rule is matched.
The typical grammar is divided into two parts: lexer rules and parser rules. The division is implicit, since all the rules starting with an uppercase letter are lexer rules, while the ones starting with a lowercase letter are parser rules. Alternatively, lexer and parser grammars can be defined in separate files.
A very simple ANTLR grammar:
grammar simple;
basic : NAME ':' NAME ;
NAME : [a-zA-Z]* ;
COMMENT : '/*' .*? '*/' -> skip ;
If you are interested in ANTLR, you can look into this giant ANTLR tutorial we have written.
APG
APG is a recursive-descent parser using a variation of Augmented BNF, that they call Superset Augmented BNF. ABNF is a particular variant of BNF designed to better support a bi-directional communications protocol. APG also support additional operators, like syntactic predicates and custom user-defined matching functions.
It can generate parsers in C/C++, Java, and JavaScript. Support for the last language seems superior and more up to date: It has a few more features and seems more updated. In fact, the documentation says it is designed to have the look and feel of JavaScript RegExp.
Because it is based on ABNF, it is especially well-suited to parsing the languages of many Internet technical specifications and, in fact, is the parser of choice for a number of large Telecom companies.
An APG grammar is very clean and easy to understand:
// example from a tutorial of the author of the tool available here
// https://www.sitepoint.com/alternative-to-regular-expressions/
phone-number = ["("] area-code sep office-code sep subscriber
area-code = 3digit ; 3 digits
office-code = 3digit ; 3 digits
subscriber = 4digit ; 4 digits
sep = *3(%d32-47 / %d58-126 / %d9) ; 0-3 ASCII non-digits
digit = %d48-57 ; 0-9
BYACC/J
BYACC is Yacc that generates Java code. That is the whole idea, and it defines its advantages and disadvantages. It is well-known, and it allows easier conversion of a Yacc and C program to a Java program — although you obviously still need to convert all the C code embedded in semantic actions into Java code. Another advantage is that you do not need a separate runtime. The generated parser it is all you need.
On the other hand, it is old, and the parsing world has made many improvements. If you are an experienced Yacc developer with a code base to upgrade, it is a good choice. Otherwise, there are many more modern alternatives you should consider.
The typical grammar is divided into three sections, separated by ‘%%’: DECLARATIONS, ACTIONS, and CODE. The second one contains the grammar rules and the third one has the custom user code.
A BYACC grammar:
// from the documentation
%{
import java.lang.Math;
import java.io.*;
import java.util.StringTokenizer;
%}
/* YACC Declarations */
%token NUM
%left '-' '+'
%left '*' '/'
%left NEG /* negation--unary minus */
%right '^' /* exponentiation */
/* Grammar follows */
%%
input: /* empty string */
| input line
;
line: '\n'
| exp '\n' { System.out.println(" " + $1.dval + " "); }
;
%%
public static void main(String args[])
{
Parser par = new Parser(false);
[..]
}
Coco/R
Coco/R is a compiler generator that takes an attributed grammar and generates a scanner and a recursive descent parser. Attributed grammar means that the rules, written in an EBNF variant, can be annotated in several ways to change the methods of the generated parser.
The scanner includes support for dealing with things like compiler directives, called pragmas. They can be ignored by the parser and handled by custom code. The scanner can also be suppressed and substituted with one built by hand.
Technically, all the grammars must be LL(1). That is to say that the parser must be able to choose the correct rule only looking one symbol ahead. But Coco/R provides several methods to bypass this limitation, including semantic checks, which are basically custom functions that must return a boolean value. The manual also provides some suggestions for refactoring your code to respect this limitation.
A Coco/R grammar:
[Imports]
// ident is the name of the grammar
"COMPILER" ident
// this includes arbitrary fields and method in the target language (eg. Java)
[GlobalFieldsAndMethods]
// ScannerSpecification
CHARACTERS
[..]
zero = '0'.
zeroToThree = zero + "123" .
octalDigit = zero + "1234567" .
nonZeroDigit = "123456789".
digit = '0' + nonZeroDigit .
[..]
TOKENS
ident = letter { letter | digit }.
[..]
// ParserSpecification
PRODUCTIONS
// just a rule is shown
IdentList =
ident <out int x> (. int n = 1; .)
{',' ident (. n++; .)
} (. Console.WriteLine("n = " + n); .)
.
// end
"END" ident '.'
Coco/R has a good documentation, with several example grammars. It supports several languages, including Java, C#, and C++.
CookCC
CookCC is a LALR (1) parser generator written in Java. Grammars can be specified in three different ways:
- In Yacc format: It can read grammars defined for Yacc.
- In its own XML format.
- In Java code, by using specific annotations.
A unique feature is that it can also output a Yacc grammar. This can be useful if you need to interact with a tool that supports a Yacc grammar, like some old C program with which you must maintain compatibility.
It requires Java 7 to generate the parser, but it can run on earlier versions.
A typical parser defined with annotations will look like this:
// required import
import org.yuanheng.cookcc.*;
@CookCCOption (lexerTable = "compressed", parserTable = "compressed")
// the generated parser class will be a parent of the one you define
// in this case it will be "Parser"
public class Calculator extends Parser
{
// code
// a lexer rule
@Shortcuts ( shortcuts = {
@Shortcut (name="nonws", pattern="[^ \\t\\n]"),
@Shortcut (name="ws", pattern="[ \\t]")
})
@Lex (pattern="{nonws}+", state="INITIAL")
void matchWord ()
{
m_cc += yyLength ();
++m_wc;
}
// a typical parser rules
@Rule (lhs = "stmt", rhs = "SEMICOLON")
protected Node parseStmt ()
{
return new SemiColonNode ();
}
}
For the standard of parser generators, using Java annotations it is a peculiar choice. Compared to an alternative like ANTLR, there is certainly a less clear division between the grammar and the actions. This could make the parser harder to maintain for complex languages. Also, porting to another language could require a complete rewrite.
On the other hand, this approach permits you to mix grammar rules with the actions to perform when you match them. Furthermore, it has the advantage of being integrated into the IDE of your choice, since it is just Java code.
CUP
CUP is the acronym of Construction of Useful Parsers and is a LALR parser generator for Java. It just generates the proper parser part, but it is well-suited to work with JFlex — although, obviously, you can also build a lexer by hand to work with CUP. The grammar has a syntax similar to Yacc and it allows you to embed code for each rule.
It can automatically generate a parse tree, but not an AST.
It also has an Eclipse plugin to aid you in the creation of a grammar, so effectively it has its own IDE.
The typical grammar is similar to YACC.
A CUP grammar:
// example from the documentation
// CUP specification for a simple expression evaluator (w/ actions)
import java_cup.runtime.*;
/* Preliminaries to set up and use the scanner. */
init with {: scanner.init(); :};
scan with {: return scanner.next_token(); :};
/* Terminals (tokens returned by the scanner). */
terminal SEMI, PLUS, MINUS, TIMES, DIVIDE, MOD;
terminal UMINUS, LPAREN, RPAREN;
terminal Integer NUMBER;
/* Non-terminals */
non terminal expr_list, expr_part;
non terminal Integer expr;
/* Precedences */
precedence left PLUS, MINUS;
precedence left TIMES, DIVIDE, MOD;
precedence left UMINUS;
/* The grammar */
expr_list ::= expr_list expr_part
|
expr_part;
expr_part ::= expr:e
{: System.out.println("= " + e); :}
SEMI
;
[..]
Grammatica
Grammatica is a C# and Java parser generator (compiler compiler). It reads a grammar file (in an EBNF format) and creates well-commented and readable C# or Java source code for the parser. It supports LL(k) grammars, automatic error recovery, readable error messages, and a clean separation between the grammar and the source code.
The description on the Grammatica website is itself a good representation of Grammatica: simple to use, well-documented, with a good number of features. You can build a listener by subclassing the generated classes, but not a visitor. There is a good reference, but not many examples.
A typical grammar of Grammatica is divided into three sections: header, tokens, and productions. It is also clean, almost as much as an ANTLR grammar. It is also based on a similar Extended BNF, although the format is slightly different.
A Grammatica grammar:
% header %
GRAMMARTYPE = "LL"
[..]
%
tokens %
ADD = "+"
SUB = "-" [..]
NUMBER = << [0 - 9] + >>
WHITESPACE = << [\t\ n\ r] + >> % ignore %
%
productions %
Expression = Term[ExpressionTail];
ExpressionTail = "+"
Expression
|
"-"
Expression;
Term = Factor[TermTail];
[..]
Atom = NUMBER |
IDENTIFIER;
Jacc
Jacc is similar to BYACC/J, except that is written in Java and thus it can run wherever your program can run. As a rule of thumb, it is developed as a more modern version of Yacc. The author describes small improvements in areas like error messages, modularity, and debugging support.
If you know Yacc and you do not have any code base to upgrade, it might be a great choice.
JavaCC
JavaCC is the other widely used parser generator for Java. The grammar file contains actions and all the custom code needed by your parser.
Compared to ANTLR, the grammar file is much less clean and includes a lot of Java source code.
A JavaCC grammar:
javacc_options
// "PARSER_BEGIN" "(" <IDENTIFIER> ")"
PARSER_BEGIN(SimpleParser)
public final class SimpleParser { // Standard parser class setup...
public static void main(String args[]) {
SimpleParser parser;
java.io.InputStream input;
}
PARSER_END(SimpleParser)
// the rules of the grammar
// token rules
TOKEN: { <
#DIGIT: ["0" - "9"] >
|
< #LETTER: ["A" - "Z", "a" - "z"] >
|
< IDENT: < LETTER > ( < LETTER > | < DIGIT > ) * > [..]
}
SKIP: {
" " | "\t" | "\n" | "\r"
}
// parser rules
[..]
void IdentDef(): {} { <
IDENT > ("*" | "-") ?
}
Thanks to its long history, it is used in important projects, like JavaParser. At the same time, this has left some quirks in the documentation and usage. For instance, technically, JavaCC itself does not build an AST, but it comes with a tool that does it, JTree, so for practical purposes, it does.
There is a grammar repository, but it does not have many grammars in it. It requires Java 5 or later.
ModelCC
ModelCC is a model-based parser generator that decouples language specification from language processing [..]. ModelCC receives a conceptual model as input, along with constraints that annotate it.
In practical terms, you define a model of your language, that works as a grammar, in Java, using annotations. Then you feed the model to ModelCC to obtain a parser.
With ModelCC, you define your language in a way that is independent of the parsing algorithm used. Instead, it should be the best conceptual representation of the language — although, under the hood, it uses a traditional parsing algorithm. So the grammar per se uses a form that is independent of any parsing algorithm, but ModelCC does not use magic and produces a normal parser.
There is a clear description of the intentions of the authors of the tools, but limited documentation. Nonetheless, there are examples available, including the following model for a calculator partially shown here.
public abstract class Expression implements IModel {
public abstract double eval();
}
[..]
public abstract class UnaryOperator implements IModel {
public abstract double eval(Expression e);
}
[..]
@Pattern(regExp = "-")
public class MinusOperator extends UnaryOperator implements IModel {
@Override public double eval(Expression e) {
return -e.eval();
}
}
@Associativity(AssociativityType.LEFT_TO_RIGHT)
public abstract class BinaryOperator implements IModel {
public abstract double eval(Expression e1, Expression e2);
}
[..]
@Priority(value = 2)
@Pattern(regExp = "-")
public class SubtractionOperator extends BinaryOperator implements IModel {
@Override public double eval(Expression e1, Expression e2) {
return e1.eval() - e2.eval();
}
}
[..]
SableCC
SableCC is a parser generator created for a thesis and aims to be easy to use and to offer a clean separation between grammar and Java code. Version 3 should also offer an included ready-to-use way to walk the AST through using a visitor. But that is all in theory because there is virtually no documentation and we have no idea how to use any of these things.
Also, a version 4 was started in 2015 and apparently lies abandoned.
UrchinCC
Urchin(CC) is a parser generator that allows you to define a grammar, called an Urchin parser definition. Then, you generate a Java parser from it. Urchin also generates a visitor from the UPD.
There is an exhaustive tutorial that is also used to explain how Urchin works and its limitations, but the manual is limited.
A UPD is divided into three sections: terminals, token, and rules.
A UPD file:
terminals {
Letters ::= 'a'..'z', 'A'..'Z';
Digits ::= '0'..'9';
}
token {
Space ::= [' ', #8, #9]*1;
EOLN ::= [#10, #13];
EOF ::= [#65535];
[..]
Identifier ::= [Letters] [Letters, Digits]*;
}
rules {
Variable ::= "var", Identifier;
Element ::= Number | Identifier;
PlusExpression ::= Element, '+', Expression;
[..]
}
Stay Tuned
Now that we've covered CFG parsers, it's time to dive into the world of PEG parsers. We'll cover a comprehensive list of those parsers in our next installment, which will cap off this series.
Published at DZone with permission of Gabriele Tomassetti, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments