Over a million developers have joined DZone.

Building and Testing a Parser With ANTLR and Kotlin

After building a lexer, the second part of this series shows how to build a parser.

· Java Zone

Check out this 8-step guide to see how you can increase your productivity by skipping slow application redeploys and by implementing application profiling, as you code! Brought to you in partnership with ZeroTurnaround.

This post is part of a series on how to create a useful language and all the supporting tools.

  • Building a lexer
  • Code

    The code is available on GitHub. The code described in this post is associated to the tag 02_parser

    The Parser

    The parser is simply defined as an ANTLR grammar. We have previously built a separate lexer. Here we reuse those terminals (NEWLINE, VAR, ID, etc.) to build rules such as statement, assignment, etc.

    Image title

    Here it is, our new ANTLR grammar:

    parser grammar SandyParser;
    
    options { tokenVocab=SandyLexer; }
    
    sandyFile : lines=line+ ;
    
    line      : statement (NEWLINE | EOF) ;
    
    statement : varDeclaration # varDeclarationStatement
              | assignment     # assignmentStatement ;
    
    varDeclaration : VAR assignment ;
    
    assignment : ID ASSIGN expression ;
    
    expression : left=expression operator=(DIVISION|ASTERISK) right=expression # binaryOperation
               | left=expression operator=(PLUS|MINUS) right=expression        # binaryOperation
               | LPAREN expression RPAREN # parenExpression
               | ID                #varReference
               | MINUS expression  #minusExpression
               | INTLIT # intLiteral
               | DECLIT # decimalLiteral ;          
    • We reuse the existing lexer (tokenVocab=SandyLexer)
    • We start by defining the rule representing the whole file: sandyFile. It is defined as a list of at least one line.
    • Each line is composed by a statement terminated either by a new line or the end of the file.
    • A statement can be a varDeclaration or an assignment.
    • An expression can be defined in many different ways. The order is important because it determines the operator precedence. So the multiplication comes before the sum.

    To build it we simply run  ./gradlew generateGrammarSource. Please refer to the build.gradle file in the repository or take a look at the previous post in the series.

    Testing

    We defined our parser, so now we need to test it. In general, I think we need to test a parser in three ways:

    • Verify that all the code we need to parse is parsed without errors
    • Ensure that code containing errors is not parsed
    • Verify that the the shape of the resulting AST is the one we expect

    In practice, the first point is the one on which I tend to insist on the most. If you are building a parser for an existing language, the best way to test your parser is to try parsing as much code as you can, verifying that all the errors found correspond to actual errors in the original code and not errors in the parser. Typically I iterate over this step multiple times to complete my grammars.

    The second and third points are refinements on which I work once I am sure my grammar can recognize everything.

    In this simple case, we will write simple test cases to cover the first and the third points: We will verify that some examples are parsed and we will verify that the AST produced is the one we want.

    It is a bit cumbersome to verify that the AST produced is the one you want. There are different ways to do that but in this case I chose to generate a string representation of the AST and verify it is the same as the one expected. It is an indirect way of testing the AST is the one I want but it is much easier for simple cases like this one.

    This is how we produce a string representation of the AST:

    package me.tomassetti.sandy
    
    import org.antlr.v4.runtime.ParserRuleContext
    import org.antlr.v4.runtime.tree.TerminalNode
    import java.util.*
    
    // Each AST element is wrapped as an element
    // We can dump each element to a String
    abstract class ParseTreeElement {
        abstract fun multiLineString(indentation : String = ""): String
    }
    
    // To dump a leaf (which corresponds to a Terminal) we just write
    // T[...] and inside the square brackets we write the text corresponding
    // to the terminal
    class ParseTreeLeaf(val text: String) : ParseTreeElement() {
        override fun toString(): String{
            return "T[$text]"
        }
    
        override fun multiLineString(indentation : String): String = "${indentation}T[$text]\n"
    }
    
    // For nodes things are slightly more complex: 
    // we need to first print the name of the node, then in the next lines
    // we print the children, recursively. While printing the children
    // we increase the indentation
    class ParseTreeNode(val name: String) : ParseTreeElement() {
        val children = LinkedList<ParseTreeElement>()
        fun child(c : ParseTreeElement) : ParseTreeNode {
            children.add(c)
            return this
        }
    
        override fun toString(): String {
            return "Node($name) $children"
        }
    
        override fun multiLineString(indentation : String): String {
            val sb = StringBuilder()
            sb.append("${indentation}$name\n")
            children.forEach { c -> sb.append(c.multiLineString(indentation + "  ")) }
            return sb.toString()
        }
    }
    
    // Given an AST node we wrap all the parts as elements:
    // the terminals will be Leaf elements and the non-terminals
    // will be Node elements.
    // Once we have wrapped those elements we can produce a string for them
    fun toParseTree(node: ParserRuleContext) : ParseTreeNode {
        val res = ParseTreeNode(node.javaClass.simpleName.removeSuffix("Context"))
        node.children.forEach { c ->
            when (c) {
                is ParserRuleContext -> res.child(toParseTree(c))
                is TerminalNode -> res.child(ParseTreeLeaf(c.text))
            }
        }
        return res
    }

    And these are some test cases:

    package me.tomassetti.sandy
    
    import me.tomassetti.langsandbox.SandyLexer
    import me.tomassetti.langsandbox.SandyParser
    import org.antlr.v4.runtime.ANTLRInputStream
    import org.antlr.v4.runtime.CommonTokenStream
    import java.io.*
    import java.util.*
    import org.junit.Test as test
    import kotlin.test.*
    
    class SandyParserTest {
    
        fun lexerForResource(resourceName: String) = SandyLexer(ANTLRInputStream(this.javaClass.getResourceAsStream("/${resourceName}.sandy")))
    
        fun parseResource(resourceName: String) : SandyParser.SandyFileContext = SandyParser(CommonTokenStream(lexerForResource(resourceName))).sandyFile()
    
        @test fun parseAdditionAssignment() {
            assertEquals(
    """SandyFile
      Line
        AssignmentStatement
          Assignment
            T[a]
            T[=]
            BinaryOperation
              IntLiteral
                T[1]
              T[+]
              IntLiteral
                T[2]
        T[<EOF>]
    """,
                    toParseTree(parseResource("addition_assignment")).multiLineString())
        }
    
        @test fun parseSimplestVarDecl() {
            assertEquals(
    """SandyFile
      Line
        VarDeclarationStatement
          VarDeclaration
            T[var]
            Assignment
              T[a]
              T[=]
              IntLiteral
                T[1]
        T[<EOF>]
    """,
                    toParseTree(parseResource("simplest_var_decl")).multiLineString())
        }
    
        @test fun parsePrecedenceExpressions() {
            assertEquals(
    """SandyFile
      Line
        VarDeclarationStatement
          VarDeclaration
            T[var]
            Assignment
              T[a]
              T[=]
              BinaryOperation
                BinaryOperation
                  IntLiteral
                    T[1]
                  T[+]
                  BinaryOperation
                    IntLiteral
                      T[2]
                    T[*]
                    IntLiteral
                      T[3]
                T[-]
                IntLiteral
                  T[4]
        T[<EOF>]
    """,
                    toParseTree(parseResource("precedence_expression")).multiLineString())
        }
    
    
    }

    Simple, isn’t it?

    Conclusions

    We have seen how to build a simple lexer and a simple parser. Many tutorials stop there. We are instead going to move on and build more tools from our lexer and parser. We laid the foundations, we now have to move to the rest of the infrastructure. Things will start to get interesting.

    In the next post we will see how to build an editor with syntax highlighting for our language.

    The Java Zone is brought to you in partnership with ZeroTurnaround. Check out this 8-step guide to see how you can increase your productivity by skipping slow application redeploys and by implementing application profiling, as you code!

    Topics:
    parsers ,kotlin ,antlr

    Published at DZone with permission of Federico Tomassetti, DZone MVB. See the original article here.

    Opinions expressed by DZone contributors are their own.

    The best of DZone straight to your inbox.

    SEE AN EXAMPLE
    Please provide a valid email address.

    Thanks for subscribing!

    Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.
    Subscribe

    {{ parent.title || parent.header.title}}

    {{ parent.tldr }}

    {{ parent.urlSource.name }}