Parsing Any Language in Java in 5 Minutes Using ANTLR

There are several parser-generators out there and most of them are good enough for most goals you may have. Among them I tend to use ANTLR more than others: it is mature, it is supported, and it is fast. Read on for the full explanation.

Feb. 29, 16 · Analysis

Likes (11)

Comment

Save

50.6K Views

I like processing code for several purposes, like static analysis or automated refactoring. The interesting part to me is to reason on the models you build from the Abstract Syntax Tree (AST). To get there you need a way to get the AST from your source files. This can be done easily using ANTLR and the collection of complete grammars available here: https://github.com/antlr/grammars-v4

Thank you folks for all of the grammars!

We are just going to take the one for Python 3, which should work fine for Python 2. If we need to do any minor adjustments we can easily do that starting from this base.

Getting the Grammar

First things first—let’s get the grammar.

Just visit https://github.com/antlr/grammars-v4 and take the grammar you need. Most grammars have a very permissive license.

There are tens of grammars for languages such as R, Scala, Python, Swift, PHP, and many others. There is also one for Java, but for Java you probably prefer to use JavaParser... right?

Just copy the grammar into your new project, under src/main/antlr

Setting Up the Project Using Gradle

Now, we are going to set up a build script with Gradle.

We will use the ANTLR4 plugin from melix because I find it more flexible than the one described in the official documentation.

We will generate the code in a specific package (me.tomassetti.pythonast.parser) and therefore in a directory derived from that package (build/generated-src/me/tomassetti/pythonast/parser).

buildscript {
    repositories {
        maven {
            name 'JFrog OSS snapshot repo'
            url  'https://oss.jfrog.org/oss-snapshot-local/'
        }
        jcenter()
    }

    dependencies {
        classpath 'me.champeau.gradle:antlr4-gradle-plugin:0.1.1-SNAPSHOT'
    }
}

repositories {
    mavenCentral()
    jcenter()
}

apply plugin: 'java'
apply plugin: 'me.champeau.gradle.antlr4'

antlr4 {
    source = file('src/main/antlr')
    output = file('build/generated-src/me/tomassetti/pythonast/parser')
    extraArgs = ['-package', 'me.tomassetti.pythonast.parser']
}

compileJava.dependsOn antlr4

sourceSets.main.java.srcDirs += antlr4.output

configurations {
    compile.extendsFrom antlr4
}

task fatJar(type: Jar) {
    manifest {
        attributes 'Implementation-Title': 'Python-Parser',
                   'Implementation-Version': '0.0.1'
    }
    baseName = project.name + '-all'
    from { configurations.compile.collect { it.isDirectory() ? it : zipTree(it) } }
    with jar
}

I also added a fatJar task. That task produces a JAR containing all of the dependencies. I use it to import the parser into Jetbrains MPS more easily.

To generate the parser from the grammar you can just run gradle antlr4.

You then have to explain to your IDE that it should consider the code under build/generated-src.

How to Invoke the Parser

Now, let’s see how we can invoke the parser.

public class ParserFacade {

    private static String readFile(File file, Charset encoding) throws IOException {
        byte[] encoded = Files.readAllBytes(file.toPath());
        return new String(encoded, encoding);
    }

    public Python3Parser.File_inputContext parse(File file) throws IOException {
        String code = readFile(file, Charset.forName("UTF-8"));
        Python3Lexer lexer = new Python3Lexer(new ANTLRInputStream(code));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        Python3Parser parser = new Python3Parser(tokens);
        return parser.file_input();
    }
}

Our ParserFacade has just one public method named parse. It gets a File and it returns an AST. It could hardly be simpler than that.

Let’s Look at Some ASTs

Let’s take a simple file:

def sum(a, b):
    return a + b

print("The sum of %i and %i is %i" % (5, 3, sum(5, 3)))

And now, get the AST. We can print it using this code:

public class AstPrinter {

    public void print(RuleContext ctx) {
        explore(ctx, 0);
    }

    private void explore(RuleContext ctx, int indentation) {
        String ruleName = Python3Parser.ruleNames[ctx.getRuleIndex()];
        for (int i=0;i<indentation;i++) {
            System.out.print("  ");
        }
        System.out.println(ruleName);
        for (int i=0;i<ctx.getChildCount();i++) {
            ParseTree element = ctx.getChild(i);
            if (element instanceof RuleContext) {
                explore((RuleContext)element, indentation + 1);
            }
        }
    }

}

If we parse the simple example and print it with ASTPrinter, we get a super complex AST. The first lines look like:

file_input
  stmt
    compound_stmt
      funcdef
        parameters
          typedargslist
            tfpdef
            tfpdef
        suite
          stmt
            simple_stmt
              small_stmt
                flow_stmt
                  return_stmt
                    testlist
                      ...

For the way the parser is built, there are a lot of annidated rules. That makes sense while parsing, but it produces a very polluted AST. I think there are two different ASTs: as a parsing AST which is easy to produce, and a logic AST that it is easy to reason about. Luckily, we can transform the first one in the latter without too much effort.

One simple way is to list all the rules that are just wrappers and skip them, taking their only child instead. We might have to refine this, but as a first approximation, let’s just skip the nodes which have just one child which is another parser rule (no terminals.)

In this way, we go from 164 nodes to 28. The resulting logic AST is:

file_input
  funcdef
    parameters
      typedargslist
        tfpdef
        tfpdef
    suite
      simple_stmt
        return_stmt
          arith_expr
            atom
            atom
  simple_stmt
    power
      atom
      trailer
        term
          string
          atom
            testlist_comp
              integer
              integer
              power
                atom
                trailer
                  arglist
                    integer
                    integer

In this tree, everything should be mapped to a concept we understand, with no artificial nodes in the way (nodes just created for parsing reasons.)

Conclusions

Writing parsers is not where we are able to produce the most value. We can easily reuse existing grammars, generate parsers, and build our smart applications using those parsers.

There are several parser-generators out there and most of them are good enough for most goals you may have. Among them I tend to use ANTLR more than others: it is mature, it is supported, and it is fast. The ASTs it produces can be navigated both using hereogeneous APIs (we have single classes generated for each kind of node) and homogeneous APIs (we can ask each node which rule it represents and the list of its children).

Another great benefit of ANTLR is the presence of available grammars. Building grammars requires experience and some work, especially for complex GPLs like Java or Python. It also requires extensive testing. We are still finding minor issues with the Java 8 grammars behind JavaParser even after parsing literally hundreds of thousands of files using it. This is a good reason to write your own grammar if you can avoid that.

By the way, all the code is available on GitHub: python-ast

Question: are you interested in parsing? Do you prefer other parser generators to ANTLR?

ANTLR Java (programming language) Parser (programming language)

Published at DZone with permission of Federico Tomassetti. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

Trending