
ANTLR Tutorial: a Peek Into the Theory


Demystify grammar with this down-to-earth analysis.


There seems to be a perennial problem with how universities teach the theory behind programming. Most curricula put the theory in the first two years, when students cannot yet code and do not understand how to apply what the theory teaches them.

Courses often do not contain a sufficient number of hands-on examples and case studies. That’s why this knowledge is often treated as lifeless theory that students mindlessly memorize to pass exams only to forget it by the time they graduate. As a result, when a programmer actually faces theory in industry, they often do not even know how to apply it.


I’ll take a different route: I’ll explain only what is relevant to practical work or necessary to make sense of a hands-on example. We will return to more complex concepts later, once you have a sufficient understanding of why we need them at all.

Grammar

He had complimented me on how I spoke Italian, and we talked together very easily. One day I had said that Italian seemed such an easy language to me that I could not take a great interest in it; everything was so easy to say. "Ah, yes," the major said. "Why, then, do you not take up the use of grammar?" So we took up the use of grammar, and soon Italian was such a difficult language that I was afraid to talk to him until I had the grammar straight in my mind.
— Ernest Hemingway

There is a multitude of definitions of grammar. For now, I’ll assume that only ANTLR’s is important.

So, a context-free (generative, or production) grammar is nothing more than a set of productions: expressions of the form X → Y, where X is the name of a rule and Y is a chain of symbols and rule names. A grammar has a starting rule, for instance S.

If we can obtain a string from the starting rule by repeatedly substituting rules with their right-hand sides, the string corresponds to the grammar; if we cannot, it does not.

Example time! Suppose we define the grammar of expressions that consist exclusively of opening and closing brackets. Here is what it looks like:

S → (S)

S → ε, where ε is a special symbol meaning emptiness.

Clearly, we can obtain an expression like () or (()), while we can’t obtain (( or (a). This way, for any sequence of symbols, we can tell whether it corresponds to the given grammar or not. If it doesn’t, we need to report an error.
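To make the membership check concrete, here is a minimal hand-written sketch of a recognizer for the bracket grammar. The class `BracketGrammar` and its methods are names made up for this illustration; this is not code that ANTLR generates, just one function per rule, written by hand.

```java
// Recognizer for the bracket grammar: S -> ( S ) | empty.
// BracketGrammar and its methods are illustrative names, not part of ANTLR.
final class BracketGrammar {

    // Tries to match rule S starting at position pos.
    // Returns the position right after the matched text, or -1 on failure.
    private static int parseS(String input, int pos) {
        // Alternative 1: S -> ( S )
        if (pos < input.length() && input.charAt(pos) == '(') {
            int afterInner = parseS(input, pos + 1);
            if (afterInner >= 0
                    && afterInner < input.length()
                    && input.charAt(afterInner) == ')') {
                return afterInner + 1;
            }
            return -1; // an opening bracket without its closing partner
        }
        // Alternative 2: S -> empty, which consumes nothing
        return pos;
    }

    // A string belongs to the grammar only if S consumes all of it.
    static boolean matches(String input) {
        return parseS(input, 0) == input.length();
    }
}
```

With this sketch, `BracketGrammar.matches("(())")` returns true, while `BracketGrammar.matches("((")` and `BracketGrammar.matches("(a)")` return false, which is exactly the point where a real tool would report an error.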

Okay, what is a grammar, then? From a practical point of view, it is a set of rules that help us decide whether a given string is admissible or not.

This definition is non-standard and works for only one type of grammar, but this is fine with us. In my following articles, I’ll present a more rigorous definition that will be helpful when we bump into practical problems.

Special Characters and Broad Definitions of Grammars

Just as in conventional programming, we use special symbols when writing a grammar to make the job more convenient. These symbols are also handy because they often spare us from writing the empty string explicitly.

Alternative

If we want to say that rule A generates either chain B or chain C, we write it as follows:

A → B

A → C

Alternatively, we can combine the two productions with the “|” symbol:

A → B|C

Enumeration

If we want to show that a chain repeats one or more times, we use the “+” symbol; if zero or more times, we use “*”; if zero times or one time, we use “?”.

Example: Suppose we want to create a grammar describing a chain of letters “a”.

If there has to be at least one letter, we put it down as follows:

S→a+

If we want the chain with no letters to correspond to the grammar as well, we put it down as follows:

S→a*

In the case where the letter “a” can appear zero times or one time, we have this:

S→a?
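Java's regular expressions happen to use the same three symbols with the same meaning, so they make a quick way to try the operators out. The sketch below is only an analogy (ANTLR does not process grammars with `java.util.regex`), and the class name `RepetitionOperators` is made up for the example.

```java
import java.util.regex.Pattern;

// Tries the +, * and ? operators using Java regular expressions,
// which share the notation. RepetitionOperators is an illustrative name.
final class RepetitionOperators {

    // Returns true when the whole input matches the right-hand side of S.
    static boolean accepts(String rightHandSide, String input) {
        return Pattern.matches(rightHandSide, input);
    }

    public static void main(String[] args) {
        String[] rules = {"a+", "a*", "a?"};
        String[] inputs = {"", "a", "aaa"};
        for (String rule : rules) {
            for (String input : inputs) {
                System.out.println("S -> " + rule + " accepts \"" + input + "\": "
                        + accepts(rule, input));
            }
        }
    }
}
```

The empty string is accepted by `a*` and `a?` but not by `a+`, and `aaa` is accepted by `a+` and `a*` but not by `a?`.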

Now, let us get back to the grammar from the previous article:

actions:action+;

action:moveTo|lineTo;

moveTo: moveToName='MoveTo' '(' x=INT ',' y=INT ')';

lineTo: lineToName='LineTo' '(' x=INT ',' y=INT ')';

INT :'-'? DIGIT+ ;

fragment DIGIT : [0-9];

As we can see, ANTLR uses a colon instead of an arrow (perhaps because an arrow is a special symbol that is not found on most keyboards) and ends every rule with a semicolon. Otherwise, the special symbols are used exactly as defined above. Yet, there are some blind spots in this grammar description, so it’s a great time to give some explanations.

Lexical Analysis Versus Parsing

There is no separate lexical or syntax analysis in automata theory, on which grammars are based. Instead, a data structure called an “automaton” is analyzed. The symbols of the analyzed string are fed into it one by one, and at the end of the process we can tell whether the string corresponds to the grammar or not.

In practice, however, it is more convenient to split a string not into letters but into bigger chunks. Just as natural languages consist of sentences that consist of words, which in turn consist of letters, sentences in artificial languages consist of lexemes, which consist of letters and symbols.

Rules consist of lexemes and other rules. The process of dividing text into lexemes is known as lexical analysis. Building bigger rules from lexemes is known as syntax analysis, or parsing. The classes that perform these analyses are called the lexer and the parser, respectively.

Please pay attention to the grammar from the previous article. There are rules starting with a capital letter. These are lexemes. Those that start with a lowercase letter are rules. The split into lexemes and rules is fairly arbitrary: it’s hard to pinpoint what we should call a lexeme and what we should call a rule. ANTLR draws one major distinction: rules may consist of other rules, lexemes, and strings; lexemes, however, consist only of strings, other lexemes, and fragments.

What is a fragment? It is a part of a grammar that can only be a part of a lexeme; it cannot be a part of a rule. In other words, you can have this:

INT :'-'? DIGIT+ ;

fragment DIGIT : [0-9];

While you absolutely cannot have this:

int :'-'? DIGIT+ ;

fragment DIGIT : [0-9];
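To make the split into lexemes concrete, here is a hand-written sketch of lexical analysis for the MoveTo/LineTo language. `MiniLexer` is a hypothetical class, far simpler than the lexer ANTLR generates. Skipping whitespace is an assumption of the sketch, since the grammar above has no whitespace rule. The private `isDigit` helper plays the role of the DIGIT fragment: it is usable inside the lexer only.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written sketch of lexical analysis for the MoveTo/LineTo language.
// MiniLexer is a hypothetical class, not what ANTLR generates.
final class MiniLexer {

    // Plays the role of the DIGIT fragment: a helper visible only to the lexer.
    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Splits the text into lexemes: 'MoveTo', 'LineTo', '(', ')', ',' and INT.
    static List<String> tokenize(String text) {
        List<String> lexemes = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // assumption: whitespace is skipped
            } else if (c == '(' || c == ')' || c == ',') {
                lexemes.add(String.valueOf(c));
                i++;
            } else if (isDigit(c)
                    || (c == '-' && i + 1 < text.length() && isDigit(text.charAt(i + 1)))) {
                // INT : '-'? DIGIT+
                int start = i;
                i++; // consume the sign or the first digit
                while (i < text.length() && isDigit(text.charAt(i))) {
                    i++;
                }
                lexemes.add(text.substring(start, i));
            } else if (text.startsWith("MoveTo", i) || text.startsWith("LineTo", i)) {
                lexemes.add(text.substring(i, i + 6));
                i += 6;
            } else {
                throw new IllegalArgumentException(
                        "Unexpected character '" + c + "' at position " + i);
            }
        }
        return lexemes;
    }
}
```

For the input `MoveTo(10,-5)` the sketch produces six lexemes: `MoveTo`, `(`, `10`, `,`, `-5`, and `)`. The parser then works with these chunks instead of individual characters.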

Left Recursive Grammars

A wise man once said that after the phrase «a wise man once said» people unload some bullshit.

Like any other type of software, ANTLR did not come out of the blue. It was written by someone, and a certain algorithm was used. It belongs to the LL family of top-down parsing algorithms (ANTLR 4 uses an adaptive variant called ALL(*)) and is implemented as recursive descent from left to right. You are welcome to google this if you are curious about the full algorithm.

It is curious, by the way, that the algorithm cannot cope with a rule that refers to itself at the very start of its right-hand side, that is, before moving along the string. In other words, while processing this rule

A→Aabc

it’s going to loop forever. In a nutshell, the algorithm analyzes the grammar recursively, from left to right. So, when processing the rule above, the analysis of rule A is invoked recursively, then once more for A, and then again and again, until the stack overflows.
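The loop is easy to reproduce with a naive, hand-written recursive descent function for A → Aabc. The sketch below is illustrative only: `LeftRecursionDemo` and the `MAX_DEPTH` guard are inventions of this example, added so that the program fails with a readable exception instead of a real stack overflow.

```java
// Naive recursive descent for the left-recursive rule A -> A a b c.
// LeftRecursionDemo and MAX_DEPTH are illustrative; a real parser has no such guard.
final class LeftRecursionDemo {

    private static final int MAX_DEPTH = 1000;

    // Meant to return the position after A; in practice it never gets that far.
    static int parseA(String input, int pos, int depth) {
        if (depth > MAX_DEPTH) {
            throw new IllegalStateException(
                    "parseA re-entered at position " + pos + " without consuming input");
        }
        // The first thing on the right-hand side is A itself,
        // so we recurse before consuming a single character.
        int afterA = parseA(input, pos, depth + 1);
        // Never reached: the call above never returns normally.
        return input.startsWith("abc", afterA) ? afterA + 3 : -1;
    }
}
```

Calling `LeftRecursionDemo.parseA("abc", 0, 0)` throws the `IllegalStateException`, and the position in the message is still 0: the function kept calling itself without ever moving along the string.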

If we move from abstract theory to ANTLR specifically, there is some good news and some bad news.

First, the good news: ANTLR 4 already supports left-recursive grammars. The bad news is that it works for direct left recursion only; it won’t be able to process indirect left recursion. Indirect left recursion has a formal definition, of course, but it is rather complicated, and we don’t need it here.

Instead, let’s analyze the following grammar:

Value → [0-9]+ | '(' Expr ')'

Product → Expr (('*' | '/') Expr)*

Sum → Expr (('+' | '-') Expr)*

Expr → Product | Sum | Value

As can be seen, when we invoke Expr, we immediately analyze Product without moving along the string; when we invoke Product, we analyze Expr, again without moving. Each rule immediately invokes the other, and we end up in an endless loop.

More good news is that there is an algorithm for eliminating both direct and indirect left recursion from grammar. You can find it by simply googling “elimination of left recursion.”
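As a small worked example of the idea for the direct case, take the rule A → Aabc from above and assume it also has a non-recursive alternative, say A → d (an assumption of this example; without one the rule cannot produce any string at all). Every string it generates is a d followed by zero or more copies of abc, so the rule can be rewritten as A → d (abc)*, which consumes input before it repeats. The class `LeftRecursionRemoved` is a hand-written sketch of that rewritten rule, not the output of any tool.

```java
// Recognizer for the rewritten rule A -> 'd' ('abc')*,
// equivalent to the left-recursive pair A -> A 'abc' | 'd'.
// The alternative 'd' is an assumption made for this example.
final class LeftRecursionRemoved {

    // Returns the position after A, or -1 if A does not match at pos.
    static int parseA(String input, int pos) {
        if (!input.startsWith("d", pos)) {
            return -1; // the non-recursive alternative must come first
        }
        pos += 1;
        // The former left recursion has become a loop that consumes input.
        while (input.startsWith("abc", pos)) {
            pos += 3;
        }
        return pos;
    }

    static boolean matches(String input) {
        return parseA(input, 0) == input.length();
    }
}
```

Here `matches("d")` and `matches("dabcabc")` return true, while `matches("abc")` and `matches("dab")` return false, and no call can recurse forever because every step of the loop moves along the string.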

What Code ANTLR Generates

As I mentioned in my earlier article, we can use two essentially different methods for handling a grammar: visitor and listener.

Let us take a look inside the code generated by ANTLR. What is actually produced? We’ll use the grammar from the previous article.

Listener

For the listener, ANTLR generates an interface that extends ParseTreeListener. It contains an enter method and an exit method for every rule or label in the grammar (we’ll get back to labels later).

This is what it looks like:

// Generated from CuttingLanguage.g4 by ANTLR 4.4
package org.newlanguageservice.ch1;

import org.antlr.v4.runtime.misc.NotNull;
import org.antlr.v4.runtime.tree.ParseTreeListener;

/**
 * This interface defines a complete listener for a parse tree produced by
 * {@link CuttingLanguageParser}.
 */
public interface CuttingLanguageListener extends ParseTreeListener {
    /**
     * Enter a parse tree produced by {@link CuttingLanguageParser#action}.
     * @param ctx the parse tree
     */
    void enterAction(@NotNull CuttingLanguageParser.ActionContext ctx);

    /**
     * Exit a parse tree produced by {@link CuttingLanguageParser#action}.
     * @param ctx the parse tree
     */
    void exitAction(@NotNull CuttingLanguageParser.ActionContext ctx);

    /**
     * Enter a parse tree produced by {@link CuttingLanguageParser#lineTo}.
     * @param ctx the parse tree
     */
    void enterLineTo(@NotNull CuttingLanguageParser.LineToContext ctx);

    /**
     * Exit a parse tree produced by {@link CuttingLanguageParser#lineTo}.
     * @param ctx the parse tree
     */
    void exitLineTo(@NotNull CuttingLanguageParser.LineToContext ctx);

    /**
     * Enter a parse tree produced by {@link CuttingLanguageParser#actions}.
     * @param ctx the parse tree
     */
    void enterActions(@NotNull CuttingLanguageParser.ActionsContext ctx);

    /**
     * Exit a parse tree produced by {@link CuttingLanguageParser#actions}.
     * @param ctx the parse tree
     */
    void exitActions(@NotNull CuttingLanguageParser.ActionsContext ctx);

    /**
     * Enter a parse tree produced by {@link CuttingLanguageParser#moveTo}.
     * @param ctx the parse tree
     */
    void enterMoveTo(@NotNull CuttingLanguageParser.MoveToContext ctx);

    /**
     * Exit a parse tree produced by {@link CuttingLanguageParser#moveTo}.
     * @param ctx the parse tree
     */
    void exitMoveTo(@NotNull CuttingLanguageParser.MoveToContext ctx);
}


To implement a listener, we need to create a class that implements each of these methods. Alternatively, we can extend the stub class that ANTLR generates alongside the interface (CuttingLanguageBaseListener), which implements every method with an empty body. If we inherit from it, we can override only the methods we need, which saves coding time.

Visitor

Here we work with a similar pattern. The interface extends ParseTreeVisitor and is parameterized, which means that every visitor method returns a value of the parameter type. Unlike the listener, which has a pair of enter and exit methods for each rule, the visitor has a single visit method per rule, and it is invoked only once for each node.
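The difference between the two patterns can be sketched without ANTLR at all. Everything in the block below (`Node`, `Listener`, `Visitor`, `walk`, `trace`) is a hand-written stand-in invented for this illustration. These are not ANTLR's classes, and the real interfaces have one method per rule rather than one generic method.

```java
import java.util.ArrayList;
import java.util.List;

// Self-contained imitation of the two patterns; these are NOT ANTLR's classes.
final class ListenerVsVisitor {

    // A tiny stand-in for a parse tree node: a rule name plus children.
    static final class Node {
        final String rule;
        final List<Node> children = new ArrayList<>();

        Node(String rule, Node... kids) {
            this.rule = rule;
            for (Node kid : kids) {
                children.add(kid);
            }
        }
    }

    // Listener style: callbacks return nothing, and a walker drives the traversal.
    interface Listener {
        void enter(Node node);
        void exit(Node node);
    }

    static void walk(Node node, Listener listener) {
        listener.enter(node);
        for (Node child : node.children) {
            walk(child, listener);
        }
        listener.exit(node);
    }

    // Visitor style: one call per node that returns a value;
    // the visitor itself decides how to descend into the children.
    interface Visitor<T> {
        T visit(Node node);
    }

    // Counts nodes by combining the values returned for the children.
    static final class CountingVisitor implements Visitor<Integer> {
        @Override
        public Integer visit(Node node) {
            int count = 1;
            for (Node child : node.children) {
                count += visit(child);
            }
            return count;
        }
    }

    // Records the order in which the walker fires the listener callbacks.
    static List<String> trace(Node root) {
        final List<String> events = new ArrayList<>();
        walk(root, new Listener() {
            @Override public void enter(Node node) { events.add("enter " + node.rule); }
            @Override public void exit(Node node) { events.add("exit " + node.rule); }
        });
        return events;
    }
}
```

For a tree with an `actions` root and a single `moveTo` child, `trace` records `enter actions`, `enter moveTo`, `exit moveTo`, `exit actions`: two callbacks per node. The counting visitor is called once per node and returns 2.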

Grammar Labels

Suppose we have the following grammar:

grammar Arithmetic;

expr: expr '+' expr #plus

| expr '-' expr #minus

| expr '*' expr #mul

| expr '/' expr #div

| INT #int;

INT :[0-9]+;

WS: [ \t\r\n]+ -> skip;

It contains direct left recursion, but as mentioned above, ANTLR copes with this just fine. Yet how can we arrange things so that there is a separate action for every arithmetic operation without changing the structure of the grammar? We add labels, written with a hash sign, to the alternatives of the rule, as in the listing above. As a result, the generated listener and visitor will have handler methods for the labels as well.

It looks like this:  

// Generated from Arithmetic.g4 by ANTLR 4.7.2
import org.antlr.v4.runtime.tree.ParseTreeVisitor;

/**
 * This interface defines a complete generic visitor for a parse tree produced
 * by {@link ArithmeticParser}.
 *
 * @param <T> The return type of the visit operation. Use {@link Void} for
 * operations with no return type.
 */
public interface ArithmeticVisitor<T> extends ParseTreeVisitor<T> {
    /**
     * Visit a parse tree produced by the {@code div}
     * labeled alternative in {@link ArithmeticParser#expr}.
     * @param ctx the parse tree
     * @return the visitor result
     */
    T visitDiv(ArithmeticParser.DivContext ctx);

    /**
     * Visit a parse tree produced by the {@code minus}
     * labeled alternative in {@link ArithmeticParser#expr}.
     * @param ctx the parse tree
     * @return the visitor result
     */
    T visitMinus(ArithmeticParser.MinusContext ctx);

    /**
     * Visit a parse tree produced by the {@code mul}
     * labeled alternative in {@link ArithmeticParser#expr}.
     * @param ctx the parse tree
     * @return the visitor result
     */
    T visitMul(ArithmeticParser.MulContext ctx);

    /**
     * Visit a parse tree produced by the {@code int}
     * labeled alternative in {@link ArithmeticParser#expr}.
     * @param ctx the parse tree
     * @return the visitor result
     */
    T visitInt(ArithmeticParser.IntContext ctx);

    /**
     * Visit a parse tree produced by the {@code plus}
     * labeled alternative in {@link ArithmeticParser#expr}.
     * @param ctx the parse tree
     * @return the visitor result
     */
    T visitPlus(ArithmeticParser.PlusContext ctx);
}


Take a look: now we’ve got visit methods named after the labels, even though the grammar does not contain rules with such names.
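What one method per label buys us can be sketched with a hand-built tree. The types below (`Expr`, `Plus`, `Mul`, `Int`, `Visitor`, `Evaluator`) are hypothetical stand-ins for the generated context classes and visitor interface, reduced to three of the five labels to keep the example short. With real ANTLR code, the parser would build the tree from the input instead.

```java
// Imitation of labeled alternatives: one visit method per label.
// All types here are hand-written stand-ins, not ANTLR-generated classes.
final class LabelDispatchSketch {

    interface Visitor<T> {
        T visitPlus(Plus ctx);
        T visitMul(Mul ctx);
        T visitInt(Int ctx);
    }

    // Every alternative of expr is a node type that knows which visit method to call.
    interface Expr {
        <T> T accept(Visitor<T> visitor);
    }

    static final class Plus implements Expr {
        final Expr left;
        final Expr right;
        Plus(Expr left, Expr right) { this.left = left; this.right = right; }
        @Override public <T> T accept(Visitor<T> visitor) { return visitor.visitPlus(this); }
    }

    static final class Mul implements Expr {
        final Expr left;
        final Expr right;
        Mul(Expr left, Expr right) { this.left = left; this.right = right; }
        @Override public <T> T accept(Visitor<T> visitor) { return visitor.visitMul(this); }
    }

    static final class Int implements Expr {
        final int value;
        Int(int value) { this.value = value; }
        @Override public <T> T accept(Visitor<T> visitor) { return visitor.visitInt(this); }
    }

    // One small action per arithmetic operation, which is what the labels make possible.
    static final class Evaluator implements Visitor<Integer> {
        @Override public Integer visitPlus(Plus ctx) {
            return ctx.left.accept(this) + ctx.right.accept(this);
        }
        @Override public Integer visitMul(Mul ctx) {
            return ctx.left.accept(this) * ctx.right.accept(this);
        }
        @Override public Integer visitInt(Int ctx) {
            return ctx.value;
        }
    }

    static int evaluate(Expr expr) {
        return expr.accept(new Evaluator());
    }
}
```

Building the tree for 2 + 3 * 4 by hand as `new Plus(new Int(2), new Mul(new Int(3), new Int(4)))` and passing it to `evaluate` gives 14. Without labels, a single method for the whole expr rule would have to work out which alternative it was looking at.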

There you have it all in a nutshell. We will talk about it in more detail later when we need it.
My next article will deal with a review of ANTLR development tools.

