{{announcement.body}}
{{announcement.title}}

ANTLR: An Informal Introduction

DZone 's Guide to

ANTLR: An Informal Introduction

Learn more about this powerful framework and your code.

· Java Zone ·
Free Resource

In this article, I am going to introduce you to ANTLR's powerful framework. Equipped with this framework, we will also write a relatively simple language that coordinates the process of shearing a metal sheet (or any other sheet). The first steps in the language are going to be relatively simple, but in the next articles, more and more details will emerge. In the end, we will have something fully-fledged and effectively functioning. So, without further ado, let's take a closer look at how ANTLR works. 

What Is ANTLR? And How Can You Use It?

ANTLR (ANother Tool for Language Recognition), according to Terence Parr, "is a powerful parser generator for reading, processing, executing, or translating structured text or binary files. It’s widely used to build languages, tools, and frameworks. From grammar, ANTLR generates a parser that can build and walk parse trees."

With this in mind, ANTLR is also used to generate text parsers. Text parsers allow us to put text into pieces we understand — whether or not this text holds up to the predetermined rules. If it does, then we conduct our computations or proceed to another task.

Initially, ANTLR was created for Java and written using Java; however, the up-to-date version 4.7.2 allows parser generation for С#, Python 2 and 3, JavaScript, Go, C++, and Swift. Meaning, a parser can be generated for these languages as well. In this particular article, we will refer to Java, but everything mentioned below can be easily applied to other languages.

So, What Can You Actually Do With ANTLR?

  1. Write interpreters and compilers for new programming languages
  2. Analyze logs
  3. Create utilities for generating images from text descriptions (this is exactly what we are going to do in this article)

Text Analysis Stages

How do we analyze a structured text? Why structured? It seems like to understand any text is an open problem. We assume that we deal with a subset that complies with a number of specific rules known as grammar. I will dwell on that in my next few articles. I will give an extensive definition and share exhaustive information for its practical application. Now, it is enough to know that in the very beginning, we need to write the grammar of a language (or borrow one) that we will then develop.

Here is what we are going to do with grammar:

  1. We take the text to pieces or constituents that are called «lexemes». In other words, we will perform the lexical analysis. Lexemes are the words in a text. It will also allow us to see whether our systems comply with the rules described.
  2. We use lexemes to form more complex structures, and again, we should check whether they comply with the rules.
  3. What we obtain is a grammar parse tree. This is the tree where the main rule lies in the root, interbedded rules are in the nodes, and specific lexemes are found in the leaves.
  4. We can also check (optional) whether our text complies with the rules that can not be complied with even using grammar. Its what we call semantic analysis-parser.
  5. If we detect mistakes we process them.
  6. If there are no mistakes or mistakes are not critical we analyze the parse tree.

How ANTLR Actually Works

You must have got the picture - text analysis is not a piece of cake. You’ve got to have proper tools. The Dragon Book (‘Compilers: Principles, Techniques, and Tools’) is considered a detailed go-to guide. The thing is its almost 1200 pages! You’ll spend years of your life studying it. And this is where ANTLR comes to the rescue. In terms of grammar it works magic:

  1. It generates a class that takes a text to lexemes.
  2. It generates syntax parser.
  3. It detects mistakes (if any) in lexemes or syntax of the text.
  4. It helps to make a circuit of the parse tree containing the syntactic analysis and it generates visitor and listener classes for this. Meantime we redefine the useful methods.

Well, we are almost done! On a bigger scale, the most critical aspect in ANTLR lies in creating a grammar for the language you choose to work with, the framework does the rest of the things. Yet, as ANTLR has a huge community...

Let’s Build Grammar

Early in the game, the language is simple. A laser burner (or any other) has only two basic options:

  • Move to point A when off
  • Cut a straight line segment in the sheet.

These two simple operations allow cutting a model of any complexity out of a metal piece because the contour of any curve can be adjusted with the help of whatever number of straight-line segments.

Thus, we need only two functions, MoveTo(x, y) and  LineTo(x, y), where x and y are the coordinates of the corresponding point. Below, you can find the code of our grammar; let’s analyze it.

grammar CuttingLanguage;

actions:action+;

action:moveTo|lineTo;

moveTo: moveToName='MoveTo' '(' x=INT ',' y=INT ')';

lineTo: lineToName='LineTo' '(' x=INT ',' y=INT ')';

INT :'-'? DIGIT+ ;

fragment DIGIT : [0-9];WS : [ \n\r\t] -> skip ;


Code always begins with the grammar name. Then rules follow. Grammar has no sense without rules; therefore, it is impossible.

A rule can consist of two rules and symbols connecting them. For example, «+» symbol in the rule «actions: action+;» means that in actions rule the action rule repeats one or more times. If it was «*» instead of «+», it would mean that the rule is repeated 0 or more times. In the given situation, it makes no sense.

Let’s now analyze the action rule. The vertical line means «or». In other words, the action rule is either moveTo or lineTo.

The moveTo rule consists of the keyword MoveTo, its braces mark and variables x and y of INT type. The INT rule begins with a capital letter and differs from the rules mentioned above. This is a lexeme. We mark this and will later return to the analysis of the differences between the lexemes and the rules. INT consists of the option sign «-» signified by the question mark that follows it and one or more DIGIT rules.

We have two curious techniques in the case with DIGIT. First, it is an enumeration in the brackets, which means there can be any symbol from 0 to 9. Second, DIGIT is a fragment. It can be a part of the rule but not the rule itself.

Finally, the last rule — WS; the arrow and skip mean that we redirect the symbols to a certain channel. We will speak about it in detail later on.

Now, it is important to remember that the gap, tabulation, and line folding are ignored in the given grammar.

Visitor

Grammar alone is not enough. The program text has to be processed, and we have two ways of doing it:

  1. We can compute (handle) rules on entry and on exit. This method is called the listener
  2. We can handle the rule on entry and return the value that will be processed by the rule higher in the hierarchy. Finally, the handler of the main rule will return the search value.

In this paragraph, we are going to describe the implementation of the second rule, while in the next paragraph, we will work with the first one.

Let’s take a look at the code.

The visitor is derived from the class generated by a parser-generator. In our case, it is a parameterized class CuttingLanguageBaseVisitor. The constructor assumes the initial position of a point and GraphicsContext, which we will draw with; I used JavaFX for this. There is a code in the GitHub project, yet I have to warn that this article is not about JavaFX. «Please do not shoot the pianist.

He is doing his best.» ©.

There are only two methods in this class: visitMoveTo and visitLineTo. The first receives MoveToContext, which is also the class generated by the parser. It has fields x and у. We have stipulated it in grammar: x = INT; y = INT. x and y are type-tags. The same time they are the respective lexemes and their text is returned with the self-evident method g etText(), and the code is handled by the Integer.parseInt method.

Next, we move the turning point of the cutter into the point (X, Y) and save the current position using  gs.save(). Now, we return the cutter position point with the renewed coordinates.

ThelineTo method works in a similar way but draws a straight line in the point (X, Y) before saving.

package org.newlanguageservice.antlrtutorial;



import org.newlanguageservice.ch1.CuttingLanguageBaseVisitor;

import org.newlanguageservice.ch1.CuttingLanguageParser.LineToContext;

import org.newlanguageservice.ch1.CuttingLanguageParser.MoveToContext;



import javafx.scene.canvas.GraphicsContext;



public class CuttingLanguageVisitor extends CuttingLanguageBaseVisitor {

private Point point;

private GraphicsContext gc;

public CuttingLanguageVisitor(Point point, GraphicsContext gc) {

this.point = point;

this.gc=gc;

}



@Override

public Point visitMoveTo(MoveToContext ctx) {

int x = Integer.parseInt(ctx.x.getText());

int y = Integer.parseInt(ctx.y.getText());

gc.moveTo(x, y);

gc.save();

return point.setX(x).setY(y);

}



@Override

public Point visitLineTo(LineToContext ctx) {



int x = Integer.parseInt(ctx.x.getText());

int y = Integer.parseInt(ctx.y.getText());



gc.strokeLine(point.getX(), point.getY(), x, y);

gc.save();

return point.setX(x).setY(y);

}

}

Now let’s take a look at how to make it work. Here is the code transmitted to the start button listener:

CuttingLanguageLexer lexer

= new CuttingLanguageLexer(new ANTLRInputStream(textArea.getText()));

lexer.removeErrorListeners();

CuttingLanguageParser parser

= new CuttingLanguageParser(new CommonTokenStream(lexer));

parser.removeErrorListeners();



lexer.addErrorListener(new CuttingErrorListener());

visitor = new CuttingLanguageVisitor(new Point(0, 0), gc);

ParseTree tree = parser.actions();

Point point = visitor.visit(tree);

System.out.println(point);


First, we create the lexer. This is the class that divides the program test into lexemes (I talked about them briefly above), while creating the parser out of lexemes in the second line.

Next, we create a visitor.

We derive the root of the parse tree, which is the entry point into the decision tree — the actions rule. We could take any other rule of grammar (not a lexeme), and the analysis would start from this rule. We use listener to go around the entire tree so that in the outcome we get the point where the cutter will come when it finishes the cutting procedure. As you see, it’s not rocket science.

Listener

You don't need to do computations every time you analyze the parse tree. For instance, if you want to color the keywords, it is enough to obtain their coordinates. Let’s use a listener for this. The work of this class somehow resembles XML round check with the help of SAX: The corresponding class method responds on entry and on exit from the rule.

The code class is below:

package org.newlanguageservice.antlrtutorial;

import java.util.ArrayList;

import java.util.List;

import org.newlanguageservice.ch1.CuttingLanguageBaseListener;

import org.newlanguageservice.ch1.CuttingLanguageParser.LineToContext;

import org.newlanguageservice.ch1.CuttingLanguageParser.MoveToContext;

public class CuttingLanguageListener extends CuttingLanguageBaseListener {

List lexemes=new ArrayList<>();

@Overridepublic void enterLineTo(LineToContext ctx) {

lexemes.add(new LexemeWithCoords(new Point(ctx.lineToName.getStartIndex(),ctx.lineToName.getStopIndex()+1), "LineTo"));

}

@Overridepublic void enterMoveTo(MoveToContext ctx) {

lexemes.add(new LexemeWithCoords(new Point(ctx.moveToName.getStartIndex(),ctx.moveToName.getStopIndex()+1), "MoveTo"));

}

public List getLexemes() {

return lexemes;

}

}


As in the previous case, we succeed from the generated ANTLR of the base class and override the two methods: enterLineTo and enterMoveTo, where we receive the coordinates of the keywords in a text.

Then, already in the editor, we color it the following way:

List lexemes = listener.getLexemes();

lexemes.forEach(lexeme->textArea.setStyleClass(lexeme.getCoords().getX(),

lexeme.getCoords().getY(), "keywords"));


Handling Errors

Sometimes, code may contain errors, and in such cases, ANTLR can help detect those lexical and syntax ones. When the code is grammatically correct but does not produce the desired outcome, it means there are logic/semantic errors which can not be detected: framework can't know what you want — yet!

Lexer and parser that we generated produce the error message. All we need to do is to attach the listener to them that will simply send the message onto the console. The listener realises the interface ANTLRErrorListener. It has four methods, but in our case, we only need one — syntaxError.

It looks the following way:

public class CuttingErrorListener implements ANTLRErrorListener {

@Overridepublic void syntaxError(Recognizer recognizer,

Object offendingSymbol, int line,

int charPositionInLine,String msg, RecognitionException e) {

System.out.println(msg);

}


The complete text of the program is available on GitHub.

I can't wait to work on my next article where I will share an extended description of the theory essential for understanding the ANTLR operation principle. Stay tuned!

Topics:
antlr ,parsers ,dsl ,java ,framework ,textparser ,grammar ,language ,text ,parser

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}