Creating a New JVM Language, Part 1: The Lexer

First step: generate an abstract syntax tree. First part of the first step: the lexer.

Feb. 04, 16 · Analysis

Likes (9)

Comment

Save

14.1K Views

I recently started working on a new programming language named Turin. A working compiler for an initial version of the language is available on GitHub. I am currently improving the language and working on a Maven and an IntelliJ plugins. Here and in the next posts I will go over the different components of the compiler and related tools.

Structure of the Compiler

The compiler needs to do several things:

Get the source code and generate an abstract syntax tree (AST)
Translate the AST through different stages to simplify processing. We basically want to move from a representation very close to the syntax of a representation easier to process. For example, we could “desugarize” the language, representing several (apparently) different constructs as variants of the same construct. An example? The Java compiler translates string concatenations into calls to StringBuffer.append
Perform semantic checks. For example, we want to check if all the expressions are using acceptable types (we do not want to sum characters, right?)
Generate bytecode

The first step requires building two components: a lexer and a parser. The lexer operates on the text and produces a sequence of tokens while the parser composes tokens into constructs (a type declaration, a statement, an expression, etc.) creating the AST. For writing the lexer and the parser I have used ANTLR.

In the rest of this post, we look into the lexer. The parser and the other components of the compiler will be treated in future posts.

Why Using ANTLR?

ANTLR is a very mature tool for writing lexer and parsers. It can generate code for several languages and has decent performance. It is well maintained and I was sure it had all the features I could possibly need to handle all the corner cases I could meet. In addition to that, ANTLR 4 makes it possible to write simple grammars because it solves left recursive definition for you. So you do not have to write many intermediate node types for specifying precedence rules for your expressions. More on this when we will look into the parser.

ANTLR is used by Xtext (which I have used a lot) and I have used ANTLR while building a framework for Model-driven development for the .NET platform (a sort of EMF for .NET). So I know and trust ANTLR and I have no reason for looking into alternatives.

The Current Lexer Grammar

This is the current version of the lexer grammar.

lexer grammar TurinLexer;

@header {

}

@lexer::members {
    public static final int WHITESPACE = 1;
    public static final int COMMENTS = 2;
}

// It is suggested to define the token types reused in different mode.
// See mode in-interpolation below
tokens { VALUE_ID, TYPE_ID, INT, LPAREN, RPAREN, COMMA, RELOP, AND_KW, OR_KW, NOT_KW }

// Of course keywords has to be defined before the rules for identifiers
NAMESPACE_KW        : 'namespace';
PROGRAM_KW          : 'program';
PROPERTY_KW         : 'property';
TYPE_KW             : 'type';
VAL_KW              : 'val';
HAS_KW              : 'has';
ABSTRACT_KW         : 'abstract';
SHARED_KW           : 'shared';
IMPORT_KW           : 'import';
AS_KW               : 'as';
VOID_KW             : 'Void';
RETURN_KW           : 'return';
FALSE_KW            : 'false';
TRUE_KW             : 'true';
IF_KW               : 'if';
ELIF_KW             : 'elif';
ELSE_KW             : 'else';

// For definitions reused in mode in-interpolation we define and refer to fragments
AND_KW              : F_AND;
OR_KW               : F_OR;
NOT_KW              : F_NOT;

LPAREN              : '(';
RPAREN              : ')';
LBRACKET            : '{';
RBRACKET            : '}';
LSQUARE             : '[';
RSQUARE             : ']';
COMMA               : ',';
POINT               : '.';
COLON               : ':';
// We use just one token type to reduce the number of states (and not crash Antlr...)
// https://github.com/antlr/antlr4/issues/840
EQUAL               : '==' -> type(RELOP);
DIFFERENT           : '!=' -> type(RELOP);
LESSEQ              : '<=' -> type(RELOP);
LESS                : '<'  -> type(RELOP);
MOREEQ              : '>=' -> type(RELOP);
MORE                : '>'  -> type(RELOP);
// ASSIGNMENT has to comes after EQUAL
ASSIGNMENT          : '=';
// Mathematical operators cannot be merged in one token type because
// they have different precedences
ASTERISK            : '*';
SLASH               : '/';
PLUS                : '+';
MINUS               : '-';

PRIMITIVE_TYPE      : F_PRIMITIVE_TYPE;
BASIC_TYPE          : F_BASIC_TYPE;

VALUE_ID            : F_VALUE_ID;
// Only for types
TYPE_ID             : F_TYPE_ID;
INT                 : F_INT;

// Let's switch to another mode here
STRING_START        : '"' -> pushMode(IN_STRING);

WS                  : (' ' | '\t')+ -> channel(WHITESPACE);
NL                  : '\r'? '\n';

COMMENT             : '/*' .*? '*/' -> channel(COMMENTS);

LINE_COMMENT        : '//' ~[\r\n]* -> channel(COMMENTS);

mode IN_STRING;

STRING_STOP         : '"' -> popMode;
STRING_CONTENT      : (~["\\#]|ESCAPE_SEQUENCE|SHARP)+;
INTERPOLATION_START : '#{' -> pushMode(IN_INTERPOLATION);

mode IN_INTERPOLATION;

INTERPOLATION_END   : '}' -> popMode;
I_PRIMITIVE_TYPE    : F_PRIMITIVE_TYPE -> type(PRIMITIVE_TYPE);
I_BASIC_TYPE        : F_BASIC_TYPE -> type(BASIC_TYPE);
I_FALSE_KW          : 'false' -> type(FALSE_KW);
I_TRUE_KW           : 'true' -> type(TRUE_KW);
I_AND_KW            : F_AND -> type(AND_KW);
I_OR_KW             : F_OR -> type(OR_KW);
I_NOT_KW            : F_NOT -> type(NOT_KW);
I_IF_KW             : 'if' -> type(IF_KW);
I_ELSE_KW           : 'else' -> type(ELSE_KW);
I_VALUE_ID          : F_VALUE_ID   -> type(VALUE_ID);
I_TYPE_ID           : F_TYPE_ID -> type(TYPE_ID);
I_INT               : F_INT -> type(INT);
I_COMMA             : ',' -> type(COMMA);
I_LPAREN            : '(' -> type(LPAREN);
I_RPAREN            : ')' -> type(RPAREN);
I_LSQUARE           : '[' -> type(LSQUARE);
I_RSQUARE           : ']' -> type(RSQUARE);

I_ASTERISK          : '*' -> type(ASTERISK);
I_SLASH             : '/' -> type(SLASH);
I_PLUS              : '+' -> type(PLUS);
I_MINUS             : '-' -> type(MINUS);

I_POINT             : '.' -> type(POINT);
I_EQUAL             : '==' -> type(RELOP);
I_DIFFERENT         : '!=' -> type(RELOP);
I_LESSEQ            : '<=' -> type(RELOP);
I_LESS              : '<'  -> type(RELOP);
I_MOREEQ            : '>=' -> type(RELOP);
I_MORE              : '>'  -> type(RELOP);
I_STRING_START      : '"' -> type(STRING_START), pushMode(IN_STRING);
I_WS                : (' ' | '\t')+ -> type(WS), channel(WHITESPACE);

fragment F_AND            : 'and';
fragment F_OR             : 'or';
fragment F_NOT            : 'not';
fragment F_VALUE_ID       : ('_')*'a'..'z' ('A'..'Z' | 'a'..'z' | '0'..'9' | '_')*;
// Only for types
fragment F_TYPE_ID        : ('_')*'A'..'Z' ('A'..'Z' | 'a'..'z' | '0'..'9' | '_')*;
fragment F_INT            : '0'|(('1'..'9')('0'..'9')*);
fragment F_PRIMITIVE_TYPE : 'Byte'|'Int'|'Long'|'Boolean'|'Char'|'Float'|'Double'|'Short';
fragment F_BASIC_TYPE     : 'UInt';

fragment ESCAPE_SEQUENCE  : '\\r'|'\\n'|'\\t'|'\\"'|'\\\\';
fragment SHARP            : '#'{ _input.LA(1)!='{' }?;

A few choices I have done:

there are two different types of ID: VALUE_ID and TYPE_ID. This permits to have less ambiguity in the grammar because values and types can be easily distinguished. In Java instead, when (foo) is encountered we do not know if it is an expression (a reference to the value represented by foo between parenthesis) or a cast to the type foo. We need to look at what follows to understand it. In my opinion, this is rather stupid because in practice everyone is using capitalized identifiers for types only, but because this is not enforced by the language the compiler cannot take advantage from it
newlines are relevant in Turin, so we have tokens for them we basically want to have statements terminated by newlines but we accept optional newlines after commas
whitespaces (but newlines) and comments are captured in their own channels, so that we can ignore them in the parser grammar but we can retrieve them when needed. For example, we need them for syntax highlighting and in general, for the IntelliJ plugin because it requires defining tokens for each single character in the source file, without gaps
the trickiest part is parsing string interpolations à la Ruby such as “my name is #{user.name}”. We use modes: when we encounter a string start (“) we switch to lexer mode IN_STRING. While in mode IN_STRING if we encounter the start of an interpolated value (#{) we move to lexer mode IN_INTERPOLATION. While in mode IN_INTERPOLATION we need to accept most of the tokens used in expressions (and that sadly means a lot of duplication in our lexer grammar).
I had to collapse the relational operators in one single token type so that the number of states of the generated lexer is not too big. It means that I will have to look into the text of RELOP tokens to figure out which operation need to be executed. Nothing too awful but you have to know how to fix these kinds of issues.

Testing the Lexer

I wrote a bunch of tests specifically for the lexer. In particular, I tested the most involved part: the one regarding string interpolation.

An example of a few tests:

    @Test
    public void parseStringWithEmptyInterpolation() throws IOException {
        String code = "\"Hel#{}lo!\"";
        verify(code, TurinLexer.STRING_START, TurinLexer.STRING_CONTENT, TurinLexer.INTERPOLATION_START, TurinLexer.INTERPOLATION_END, TurinLexer.STRING_CONTENT, TurinLexer.STRING_STOP);
    }

    @Test
    public void parseStringWithInterpolationContainingID() throws IOException {
        String code = "\"Hel#{foo}lo!\"";
        verify(code, TurinLexer.STRING_START, TurinLexer.STRING_CONTENT, TurinLexer.INTERPOLATION_START,
                TurinLexer.VALUE_ID,
                TurinLexer.INTERPOLATION_END, TurinLexer.STRING_CONTENT, TurinLexer.STRING_STOP);
    }

    @Test
    public void parseStringWithSharpSymbol() throws IOException {
        String code = "\"Hel#lo!\"";
        verify(code, TurinLexer.STRING_START, TurinLexer.STRING_CONTENT, TurinLexer.STRING_STOP);
    }

    @Test
    public void parseMethodDefinitionWithExpressionBody() throws IOException {
        String code = "Void toString() = \"foo\"";
        verify(code, TurinLexer.VOID_KW, TurinLexer.VALUE_ID, TurinLexer.LPAREN, TurinLexer.RPAREN, TurinLexer.ASSIGNMENT, TurinLexer.STRING_START, TurinLexer.STRING_CONTENT, TurinLexer.STRING_STOP);
    }

As you can see I have just tested the tokenizer on a string and verified it produces the correct list of tokens. Easy and straight to the point.

Conclusions

My experience with ANTLR for this language has not been perfect: there are issues and limitations. Having to collapse several operators in a single token type is not nice. Having to repeat several token definitions for different lexer models is bad. However ANTLR proved to be a tool usable in practice: it does all that it needs to do and for each problem there is an acceptable solution. The solution is maybe not ideal, maybe not elegant as desired but there is one. So I can use it and move on to more interesting parts of the compiler.

Java (programming language) Java virtual machine

Published at DZone with permission of Federico Tomassetti, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

Trending