DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

The software you build is only as secure as the code that powers it. Learn how malicious code creeps into your software supply chain.

Apache Cassandra combines the benefits of major NoSQL databases to support data management needs not covered by traditional RDBMS vendors.

Generative AI has transformed nearly every industry. How can you leverage GenAI to improve your productivity and efficiency?

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workloads.

Related

  • Java Virtual Threads and Scaling
  • Java’s Next Act: Native Speed for a Cloud-Native World
  • The Energy Efficiency of JVMs and the Role of GraalVM
  • Understanding Root Causes of Out of Memory (OOM) Issues in Java Containers

Trending

  • Article Moderation: Your Questions, Answered
  • Building Resilient Identity Systems: Lessons from Securing Billions of Authentication Requests
  • Unlocking the Potential of Apache Iceberg: A Comprehensive Analysis
  • How to Perform Custom Error Handling With ANTLR
  1. DZone
  2. Coding
  3. Java
  4. Creating a New JVM Language, Part 1: The Lexer

Creating a New JVM Language, Part 1: The Lexer

First step: generate an abstract syntax tree. First part of the first step: the lexer.

By 
Federico Tomassetti user avatar
Federico Tomassetti
·
Feb. 04, 16 · Analysis
Likes (9)
Comment
Save
Tweet
Share
14.1K Views

Join the DZone community and get the full member experience.

Join For Free

I recently started working on a new programming language named Turin. A working compiler for an initial version of the language is available on GitHub. I am currently improving the language and working on a Maven and an IntelliJ plugins. Here and in the next posts I will go over the different components of the compiler and related tools.

Structure of the Compiler

The compiler needs to do several things:

  1. Get the source code and generate an abstract syntax tree (AST)
  2. Translate the AST through different stages to simplify processing. We basically want to move from a representation very close to the syntax of a representation easier to process. For example, we could “desugarize” the language, representing several (apparently) different constructs as variants of the same construct. An example? The Java compiler translates string concatenations into calls to StringBuffer.append
  3. Perform semantic checks. For example, we want to check if all the expressions are using acceptable types (we do not want to sum characters, right?)
  4. Generate bytecode

The first step requires building two components: a lexer and a parser. The lexer operates on the text and produces a sequence of tokens while the parser composes tokens into constructs (a type declaration, a statement, an expression, etc.) creating the AST. For writing the lexer and the parser I have used ANTLR.

In the rest of this post, we look into the lexer. The parser and the other components of the compiler will be treated in future posts.

Why Using ANTLR?

ANTLR is a very mature tool for writing lexer and parsers. It can generate code for several languages and has decent performance. It is well maintained and I was sure it had all the features I could possibly need to handle all the corner cases I could meet. In addition to that, ANTLR 4 makes it possible to write simple grammars because it solves left recursive definition for you. So you do not have to write many intermediate node types for specifying precedence rules for your expressions. More on this when we will look into the parser.

ANTLR is used by Xtext (which I have used a lot) and I have used ANTLR while building a framework for Model-driven development for the .NET platform (a sort of EMF for .NET). So I know and trust ANTLR and I have no reason for looking into alternatives.

The Current Lexer Grammar

This is the current version of the lexer grammar.

lexer grammar TurinLexer;

@header {

}

@lexer::members {
    public static final int WHITESPACE = 1;
    public static final int COMMENTS = 2;
}

// It is suggested to define the token types reused in different mode.
// See mode in-interpolation below
tokens { VALUE_ID, TYPE_ID, INT, LPAREN, RPAREN, COMMA, RELOP, AND_KW, OR_KW, NOT_KW }

// Of course keywords has to be defined before the rules for identifiers
NAMESPACE_KW        : 'namespace';
PROGRAM_KW          : 'program';
PROPERTY_KW         : 'property';
TYPE_KW             : 'type';
VAL_KW              : 'val';
HAS_KW              : 'has';
ABSTRACT_KW         : 'abstract';
SHARED_KW           : 'shared';
IMPORT_KW           : 'import';
AS_KW               : 'as';
VOID_KW             : 'Void';
RETURN_KW           : 'return';
FALSE_KW            : 'false';
TRUE_KW             : 'true';
IF_KW               : 'if';
ELIF_KW             : 'elif';
ELSE_KW             : 'else';

// For definitions reused in mode in-interpolation we define and refer to fragments
AND_KW              : F_AND;
OR_KW               : F_OR;
NOT_KW              : F_NOT;

LPAREN              : '(';
RPAREN              : ')';
LBRACKET            : '{';
RBRACKET            : '}';
LSQUARE             : '[';
RSQUARE             : ']';
COMMA               : ',';
POINT               : '.';
COLON               : ':';
// We use just one token type to reduce the number of states (and not crash Antlr...)
// https://github.com/antlr/antlr4/issues/840
EQUAL               : '==' -> type(RELOP);
DIFFERENT           : '!=' -> type(RELOP);
LESSEQ              : '<=' -> type(RELOP);
LESS                : '<'  -> type(RELOP);
MOREEQ              : '>=' -> type(RELOP);
MORE                : '>'  -> type(RELOP);
// ASSIGNMENT has to comes after EQUAL
ASSIGNMENT          : '=';
// Mathematical operators cannot be merged in one token type because
// they have different precedences
ASTERISK            : '*';
SLASH               : '/';
PLUS                : '+';
MINUS               : '-';

PRIMITIVE_TYPE      : F_PRIMITIVE_TYPE;
BASIC_TYPE          : F_BASIC_TYPE;

VALUE_ID            : F_VALUE_ID;
// Only for types
TYPE_ID             : F_TYPE_ID;
INT                 : F_INT;

// Let's switch to another mode here
STRING_START        : '"' -> pushMode(IN_STRING);

WS                  : (' ' | '\t')+ -> channel(WHITESPACE);
NL                  : '\r'? '\n';

COMMENT             : '/*' .*? '*/' -> channel(COMMENTS);

LINE_COMMENT        : '//' ~[\r\n]* -> channel(COMMENTS);

mode IN_STRING;

STRING_STOP         : '"' -> popMode;
STRING_CONTENT      : (~["\\#]|ESCAPE_SEQUENCE|SHARP)+;
INTERPOLATION_START : '#{' -> pushMode(IN_INTERPOLATION);

mode IN_INTERPOLATION;

INTERPOLATION_END   : '}' -> popMode;
I_PRIMITIVE_TYPE    : F_PRIMITIVE_TYPE -> type(PRIMITIVE_TYPE);
I_BASIC_TYPE        : F_BASIC_TYPE -> type(BASIC_TYPE);
I_FALSE_KW          : 'false' -> type(FALSE_KW);
I_TRUE_KW           : 'true' -> type(TRUE_KW);
I_AND_KW            : F_AND -> type(AND_KW);
I_OR_KW             : F_OR -> type(OR_KW);
I_NOT_KW            : F_NOT -> type(NOT_KW);
I_IF_KW             : 'if' -> type(IF_KW);
I_ELSE_KW           : 'else' -> type(ELSE_KW);
I_VALUE_ID          : F_VALUE_ID   -> type(VALUE_ID);
I_TYPE_ID           : F_TYPE_ID -> type(TYPE_ID);
I_INT               : F_INT -> type(INT);
I_COMMA             : ',' -> type(COMMA);
I_LPAREN            : '(' -> type(LPAREN);
I_RPAREN            : ')' -> type(RPAREN);
I_LSQUARE           : '[' -> type(LSQUARE);
I_RSQUARE           : ']' -> type(RSQUARE);

I_ASTERISK          : '*' -> type(ASTERISK);
I_SLASH             : '/' -> type(SLASH);
I_PLUS              : '+' -> type(PLUS);
I_MINUS             : '-' -> type(MINUS);

I_POINT             : '.' -> type(POINT);
I_EQUAL             : '==' -> type(RELOP);
I_DIFFERENT         : '!=' -> type(RELOP);
I_LESSEQ            : '<=' -> type(RELOP);
I_LESS              : '<'  -> type(RELOP);
I_MOREEQ            : '>=' -> type(RELOP);
I_MORE              : '>'  -> type(RELOP);
I_STRING_START      : '"' -> type(STRING_START), pushMode(IN_STRING);
I_WS                : (' ' | '\t')+ -> type(WS), channel(WHITESPACE);

fragment F_AND            : 'and';
fragment F_OR             : 'or';
fragment F_NOT            : 'not';
fragment F_VALUE_ID       : ('_')*'a'..'z' ('A'..'Z' | 'a'..'z' | '0'..'9' | '_')*;
// Only for types
fragment F_TYPE_ID        : ('_')*'A'..'Z' ('A'..'Z' | 'a'..'z' | '0'..'9' | '_')*;
fragment F_INT            : '0'|(('1'..'9')('0'..'9')*);
fragment F_PRIMITIVE_TYPE : 'Byte'|'Int'|'Long'|'Boolean'|'Char'|'Float'|'Double'|'Short';
fragment F_BASIC_TYPE     : 'UInt';

fragment ESCAPE_SEQUENCE  : '\\r'|'\\n'|'\\t'|'\\"'|'\\\\';
fragment SHARP            : '#'{ _input.LA(1)!='{' }?;

A few choices I have done:

  • there are two different types of ID: VALUE_ID and TYPE_ID. This permits to have less ambiguity in the grammar because values and types can be easily distinguished. In Java instead, when (foo) is encountered we do not know if it is an expression (a reference to the value represented by foo between parenthesis) or a cast to the type foo. We need to look at what follows to understand it. In my opinion, this is rather stupid because in practice everyone is using capitalized identifiers for types only, but because this is not enforced by the language the compiler cannot take advantage from it
  • newlines are relevant in Turin, so we have tokens for them we basically want to have statements terminated by newlines but we accept optional newlines after commas
  • whitespaces (but newlines) and comments are captured in their own channels, so that we can ignore them in the parser grammar but we can retrieve them when needed. For example, we need them for syntax highlighting and in general, for the IntelliJ plugin because it requires defining tokens for each single character in the source file, without gaps
  • the trickiest part is parsing string interpolations à la Ruby such as “my name is #{user.name}”. We use modes: when we encounter a string start (“) we switch to lexer mode IN_STRING. While in mode IN_STRING if we encounter the start of an interpolated value (#{) we move to lexer mode IN_INTERPOLATION. While in mode IN_INTERPOLATION we need to accept most of the tokens used in expressions (and that sadly means a lot of duplication in our lexer grammar).
  • I had to collapse the relational operators in one single token type so that the number of states of the generated lexer is not too big. It means that I will have to look into the text of RELOP tokens to figure out which operation need to be executed. Nothing too awful but you have to know how to fix these kinds of issues.

Testing the Lexer

I wrote a bunch of tests specifically for the lexer. In particular, I tested the most involved part: the one regarding string interpolation.

An example of a few tests:

    @Test
    public void parseStringWithEmptyInterpolation() throws IOException {
        String code = "\"Hel#{}lo!\"";
        verify(code, TurinLexer.STRING_START, TurinLexer.STRING_CONTENT, TurinLexer.INTERPOLATION_START, TurinLexer.INTERPOLATION_END, TurinLexer.STRING_CONTENT, TurinLexer.STRING_STOP);
    }

    @Test
    public void parseStringWithInterpolationContainingID() throws IOException {
        String code = "\"Hel#{foo}lo!\"";
        verify(code, TurinLexer.STRING_START, TurinLexer.STRING_CONTENT, TurinLexer.INTERPOLATION_START,
                TurinLexer.VALUE_ID,
                TurinLexer.INTERPOLATION_END, TurinLexer.STRING_CONTENT, TurinLexer.STRING_STOP);
    }

    @Test
    public void parseStringWithSharpSymbol() throws IOException {
        String code = "\"Hel#lo!\"";
        verify(code, TurinLexer.STRING_START, TurinLexer.STRING_CONTENT, TurinLexer.STRING_STOP);
    }

    @Test
    public void parseMethodDefinitionWithExpressionBody() throws IOException {
        String code = "Void toString() = \"foo\"";
        verify(code, TurinLexer.VOID_KW, TurinLexer.VALUE_ID, TurinLexer.LPAREN, TurinLexer.RPAREN, TurinLexer.ASSIGNMENT, TurinLexer.STRING_START, TurinLexer.STRING_CONTENT, TurinLexer.STRING_STOP);
    }

As you can see I have just tested the tokenizer on a string and verified it produces  the correct list of tokens. Easy and straight to the point.

Conclusions

My experience with ANTLR for this language has not been perfect: there are issues and limitations. Having to collapse several operators in a single token type is not nice. Having to repeat several token definitions for different lexer models is bad. However ANTLR proved to be a tool usable in practice: it does all that it needs to do and for each problem there is an acceptable solution. The solution is maybe not ideal, maybe not elegant as desired but there is one. So I can use it and move on to more interesting parts of the compiler.

Java (programming language) Java virtual machine

Published at DZone with permission of Federico Tomassetti, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Java Virtual Threads and Scaling
  • Java’s Next Act: Native Speed for a Cloud-Native World
  • The Energy Efficiency of JVMs and the Role of GraalVM
  • Understanding Root Causes of Out of Memory (OOM) Issues in Java Containers

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends: