Building a Language: Generating Bytecode

Federico Tomassetti's series on building your own language reaches bytecode.

Sep. 18, 16 · Tutorial


In this post we are going to see how to generate bytecode for our language. So far we have seen how to build a language to express what we want, how to validate that language, and how to build an editor for that language, but we still cannot actually run the code. Time to fix that. By compiling for the JVM, our code will be able to run on all sorts of platforms. That sounds pretty great to me!

This post is part of the series on building your own language.

Code is available on GitHub under the tag 08_bytecode.

Adding a print statement

Before jumping into the bytecode generation, let’s just add a print statement to our language. It is fairly easy: we just need to change a few lines in the lexer and parser definitions and we are good to go.

// Changes to lexer
PRINT              : 'print';

// Changes to parser
statement : varDeclaration # varDeclarationStatement
          | assignment     # assignmentStatement
          | print          # printStatement ;

print : PRINT LPAREN expression RPAREN ;

The general structure of our compiler

Let’s start from the entry point of our compiler. We take the code either from standard input or from a file (specified as the first parameter). Once we have the code, we try to build an AST and check for lexical and syntactical errors. If there are none, we validate the AST and check for semantic errors. If there are still no errors, we go on with the bytecode generation.

import java.io.File
import java.io.FileInputStream
import java.io.FileOutputStream
import java.io.InputStream

fun main(args: Array<String>) {
    // read the code from standard input or from the file given as first argument
    val code : InputStream? = when (args.size) {
        0 -> System.`in`
        1 -> FileInputStream(File(args[0]))
        else -> {
            System.err.println("Pass 0 arguments or 1")
            System.exit(1)
            null
        }
    }
    // lexical and syntactical errors
    val parsingResult = SandyParserFacade.parse(code!!)
    if (!parsingResult.isCorrect()) {
        println("ERRORS:")
        parsingResult.errors.forEach { println(" * L${it.position.line}: ${it.message}") }
        return
    }
    val root = parsingResult.root!!
    println(root)
    // semantic errors
    val errors = root.validate()
    if (errors.isNotEmpty()) {
        println("ERRORS:")
        errors.forEach { println(" * L${it.position.line}: ${it.message}") }
        return
    }
    // no errors: generate the bytecode and save it into a class file
    val bytes = JvmCompiler().compile(root, "MyClass")
    val fos = FileOutputStream("MyClass.class")
    fos.write(bytes)
    fos.close()
}

Note that in this example we are always producing a class file named MyClass. Later we will probably want a way to specify the name of the class file, but for now this is good enough.
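One possible way to lift that limitation, sketched here as an assumption rather than something the Sandy compiler does, is to derive the class name from the name of the source file. The helper `classNameFor` and its naming rules are hypothetical.

```kotlin
import java.io.File

// Hypothetical helper: derives a JVM class name from a source file path,
// e.g. "examples/my-program.sandy" becomes "MyProgram".
fun classNameFor(path: String): String {
    val base = File(path).nameWithoutExtension
    val name = base.split(Regex("[^A-Za-z0-9]+"))
        .filter { it.isNotEmpty() }
        .joinToString("") { part -> part.replaceFirstChar { it.uppercase() } }
    // a class name cannot be empty or start with a digit, so we add a prefix in that case
    return if (name.isEmpty() || name[0].isDigit()) "Sandy$name" else name
}

fun main() {
    println(classNameFor("examples/my-program.sandy")) // MyProgram
}
```

The result could then be passed as the second argument of `compile` and reused to name the output file.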

Using ASM to generate bytecode

Now, let’s dive into the fun part. The compile method of JvmCompiler is where we produce the bytes that we will later save into a class file. How do we produce those bytes? With some help from ASM, which is a library for producing bytecode. We could generate the byte array ourselves, but that would involve some boring tasks like generating the constant pool structures. ASM does that for us. We still need some understanding of how the JVM is structured, but we can survive without being experts on the nitty-gritty details.

import org.objectweb.asm.ClassWriter
import org.objectweb.asm.Label
import org.objectweb.asm.Opcodes.*

class JvmCompiler {

    fun compile(root: SandyFile, name: String) : ByteArray {
        // this is how we tell ASM that we want to start writing a new class. We ask it to calculate some values for us
        val cw = ClassWriter(ClassWriter.COMPUTE_FRAMES or ClassWriter.COMPUTE_MAXS)
        // here we specify that the class is in the format introduced with Java 8 (so it would require a JRE >= 8 to run)
        // we also specify the name of the class, the fact it extends Object and it implements no interfaces
        cw.visit(V1_8, ACC_PUBLIC, name, null, "java/lang/Object", null)
        // our class will have just one method: the main method. We have to specify its signature
        // this string just says that it takes an array of Strings and returns nothing (void)
        val mainMethodWriter = cw.visitMethod(ACC_PUBLIC or ACC_STATIC, "main", "([Ljava/lang/String;)V", null, null)
        mainMethodWriter.visitCode()
        // labels are used by ASM to mark points in the code
        val methodStart = Label()
        val methodEnd = Label()
        // with this call we indicate to what point in the method the label methodStart corresponds
        mainMethodWriter.visitLabel(methodStart)

        // Variable declarations:
        // we find all variable declarations in our code and we assign to them an index value
        // our vars map will tell us which variable name corresponds to which index
        var nextVarIndex = 0
        val vars = HashMap<String, Var>()
        root.specificProcess(VarDeclaration::class.java) {
            val type = it.type(vars)
            val index = nextVarIndex
            // on the JVM a double occupies two local variable slots, an int just one
            nextVarIndex += if (type == DecimalType) 2 else 1
            vars[it.varName] = Var(type, index)
            mainMethodWriter.visitLocalVariable(it.varName, type.jvmDescription, null, methodStart, methodEnd, index)
        }

        // time to generate bytecode for all the statements
        root.statements.forEach { s ->
            when (s) {
                is VarDeclaration -> {
                    // we look up the type of the variable (more details later)
                    val type = vars[s.varName]!!.type
                    // the JVM is a stack-based machine: it operates on values we have put on the stack
                    // so as first thing when we meet a variable declaration we put its value on the stack
                    s.value.pushAs(mainMethodWriter, vars, type)
                    // now, depending on the type of the variable we use different operations to store the value
                    // we put on the stack into the variable. Note that we refer to the variable using its index, not its name
                    when (type) {
                        IntType -> mainMethodWriter.visitVarInsn(ISTORE, vars[s.varName]!!.index)
                        DecimalType -> mainMethodWriter.visitVarInsn(DSTORE, vars[s.varName]!!.index)
                        else -> throw UnsupportedOperationException(type.javaClass.canonicalName)
                    }
                }
                is Print -> {
                    // this means that we access the field "out" of "java.lang.System" which is of type "java.io.PrintStream"
                    mainMethodWriter.visitFieldInsn(GETSTATIC, "java/lang/System", "out", "Ljava/io/PrintStream;")
                    // we push the value we want to print on the stack
                    s.value.push(mainMethodWriter, vars)
                    // we call the method println of System.out to print the value. It will take its parameter from the stack
                    // note that we have to tell the JVM which variant of println to call. To do that we describe the signature of the method,
                    // depending on the type of the value we want to print. If we want to print an int we will produce the signature "(I)V",
                    // we will produce "(D)V" for a double
                    mainMethodWriter.visitMethodInsn(INVOKEVIRTUAL, "java/io/PrintStream", "println", "(${s.value.type(vars).jvmDescription})V", false)
                }
                is Assignment -> {
                    val type = vars[s.varName]!!.type
                    // this code is the same we have seen for variable declarations
                    s.value.pushAs(mainMethodWriter, vars, type)
                    when (type) {
                        IntType -> mainMethodWriter.visitVarInsn(ISTORE, vars[s.varName]!!.index)
                        DecimalType -> mainMethodWriter.visitVarInsn(DSTORE, vars[s.varName]!!.index)
                        else -> throw UnsupportedOperationException(type.javaClass.canonicalName)
                    }
                }
                else -> throw UnsupportedOperationException(s.javaClass.canonicalName)
            }
        }

        // we just say that here is the end of the method
        mainMethodWriter.visitLabel(methodEnd)
        // and we add the return instruction
        mainMethodWriter.visitInsn(RETURN)
        // visitMaxs has to come before visitEnd; its arguments are ignored because we asked ASM to compute them
        mainMethodWriter.visitMaxs(-1, -1)
        mainMethodWriter.visitEnd()
        cw.visitEnd()
        return cw.toByteArray()
    }

}
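One detail worth keeping in mind when assigning indexes to variables: on the JVM an int occupies one local variable slot, while a double occupies two consecutive slots. The sketch below shows one way to allocate slots accordingly. `SlotType` and `allocateSlots` are hypothetical stand-ins, not classes from the Sandy code base.

```kotlin
// Hypothetical stand-ins for the types of the language; not the Sandy classes themselves.
enum class SlotType(val width: Int) {
    INT(1),      // an int occupies one local variable slot
    DECIMAL(2)   // a double occupies two consecutive slots
}

// Assigns a starting slot to each variable, in declaration order.
fun allocateSlots(declarations: List<Pair<String, SlotType>>, firstFreeSlot: Int = 0): Map<String, Int> {
    val slots = LinkedHashMap<String, Int>()
    var next = firstFreeSlot
    for ((name, type) in declarations) {
        slots[name] = next
        next += type.width
    }
    return slots
}

fun main() {
    val slots = allocateSlots(listOf("a" to SlotType.INT, "b" to SlotType.DECIMAL, "c" to SlotType.INT))
    println(slots) // {a=0, b=1, c=3}
}
```

Here `c` lands on slot 3 rather than 2, because `b` is a double and uses slots 1 and 2.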

About types

We have seen that our code uses types. This is needed because, depending on the type, we need to use different instructions. For example, to put a value in an integer variable we use istore, while to put a value in a double variable we use dstore. When we call System.out.println on an integer we need to specify the signature (I)V, while when we call it to print a double we specify (D)V.

To be able to do so we need to know the type of each expression. In our super simple language we use just int and double for now. In a real language we may want to use more types, but this will be enough to show you the principles.
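To make the descriptor strings less mysterious, here is a minimal sketch of how signatures such as (I)V and (D)V are composed from the descriptors of the parameter types and the return type. The names `JvmType`, `JvmObject`, `JvmArray`, and `methodDescriptor` are hypothetical and exist only for this illustration.

```kotlin
// Hypothetical mini-model of JVM type descriptors; the names are illustrative only.
interface JvmType { val descriptor: String }
object JvmInt : JvmType { override val descriptor: String get() = "I" }
object JvmDouble : JvmType { override val descriptor: String get() = "D" }
object JvmVoid : JvmType { override val descriptor: String get() = "V" }
data class JvmObject(val internalName: String) : JvmType {
    // object types are written as L<internal name>;
    override val descriptor: String get() = "L$internalName;"
}
data class JvmArray(val element: JvmType) : JvmType {
    // array types are written as [ followed by the element descriptor
    override val descriptor: String get() = "[" + element.descriptor
}

// A method descriptor is "(" + parameter descriptors + ")" + return descriptor.
fun methodDescriptor(params: List<JvmType>, returnType: JvmType): String =
    params.joinToString(separator = "", prefix = "(", postfix = ")") { it.descriptor } + returnType.descriptor

fun main() {
    println(methodDescriptor(listOf(JvmInt), JvmVoid))    // (I)V
    println(methodDescriptor(listOf(JvmDouble), JvmVoid)) // (D)V
    println(methodDescriptor(listOf(JvmArray(JvmObject("java/lang/String"))), JvmVoid)) // ([Ljava/lang/String;)V
}
```

The last line reproduces the descriptor of the main method: it takes an array of String and returns void.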

interface SandyType {
    // given a type we want to get the corresponding string used in the JVM
    // for example: int -> I, double -> D, Object -> Ljava/lang/Object; String[] -> [Ljava/lang/String;
    val jvmDescription: String
}

object IntType : SandyType {
    override val jvmDescription: String
        get() = "I"
}

object DecimalType : SandyType {
    override val jvmDescription: String
        get() = "D"
}

fun Expression.type(vars: Map<String, Var>) : SandyType {
    return when (this) {
        // an int literal has type int. Easy :)
        is IntLit -> IntType
        is DecLit -> DecimalType
        // the result of a binary expression depends on the type of the operands
        is BinaryExpression -> {
            val leftType = left.type(vars)
            val rightType = right.type(vars)
            if (leftType != IntType && leftType != DecimalType) {
                throw UnsupportedOperationException()
            }
            if (rightType != IntType && rightType != DecimalType) {
                throw UnsupportedOperationException()
            }
            // an operation on two integers produces an integer
            if (leftType == IntType && rightType == IntType) {
                return IntType
            // if at least a double is involved the result is a double
            } else {
                return DecimalType
            }
        }
        // when we refer to a variable the type is the type of the variable
        is VarReference -> vars[this.varName]!!.type
        // when we cast a value to a type, the resulting value has that type :)
        is TypeConversion -> this.targetType.toSandyType()
        else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
    }
}
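The promotion rule used for binary expressions (int with int gives int, anything involving a double gives a double) can be tried out in isolation. The following is a self-contained sketch on a simplified, hypothetical AST (`Expr`, `IntLiteral`, `DecLiteral`, `Ref`, `Binary`, `Ty`), not the real Sandy classes.

```kotlin
// Hypothetical, simplified AST; the real Sandy classes are richer.
enum class Ty { INT, DECIMAL }

sealed class Expr
data class IntLiteral(val text: String) : Expr()
data class DecLiteral(val text: String) : Expr()
data class Ref(val name: String) : Expr()
data class Binary(val left: Expr, val right: Expr) : Expr()

// Computes the type of an expression given the types of the known variables.
fun typeOf(e: Expr, vars: Map<String, Ty>): Ty = when (e) {
    is IntLiteral -> Ty.INT
    is DecLiteral -> Ty.DECIMAL
    is Ref -> vars[e.name] ?: throw IllegalStateException("unknown variable ${e.name}")
    // two ints give an int, otherwise the result is promoted to decimal
    is Binary ->
        if (typeOf(e.left, vars) == Ty.INT && typeOf(e.right, vars) == Ty.INT) Ty.INT else Ty.DECIMAL
}

fun main() {
    val vars = mapOf("a" to Ty.INT, "b" to Ty.DECIMAL)
    println(typeOf(Binary(Ref("a"), IntLiteral("2")), vars)) // INT
    println(typeOf(Binary(Ref("a"), Ref("b")), vars))        // DECIMAL
}
```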

Expressions

As we have seen, the JVM is a stack-based machine. So every time we want to use a value we push it onto the stack and then perform some operations on it. Let’s see how we can push values onto the stack:

import org.objectweb.asm.MethodVisitor
import org.objectweb.asm.Opcodes.*

// Push the value and convert it to the desired type, if needed
fun Expression.pushAs(methodWriter: MethodVisitor, vars: Map<String, Var>, desiredType: SandyType) {
    push(methodWriter, vars)
    val myType = type(vars)
    if (myType != desiredType) {
        if (myType == IntType && desiredType == DecimalType) {
            methodWriter.visitInsn(I2D)
        } else if (myType == DecimalType && desiredType == IntType) {
            methodWriter.visitInsn(D2I)
        } else {
            throw UnsupportedOperationException("Conversion from $myType to $desiredType")
        }
    }
}

fun Expression.push(methodWriter: MethodVisitor, vars: Map<String, Var>) {
    when (this) {
        // we push integer and double constants with the ldc instruction
        is IntLit -> methodWriter.visitLdcInsn(Integer.parseInt(this.value))
        is DecLit -> methodWriter.visitLdcInsn(java.lang.Double.parseDouble(this.value))
        // to push a sum we first push the two operands and then invoke an operation which
        // depends on the type of the operands (do we sum integers or doubles?)
        is SumExpression -> {
            left.pushAs(methodWriter, vars, this.type(vars))
            right.pushAs(methodWriter, vars, this.type(vars))
            when (this.type(vars)) {
                IntType -> methodWriter.visitInsn(IADD)
                DecimalType -> methodWriter.visitInsn(DADD)
                else -> throw UnsupportedOperationException("Summing ${this.type(vars)}")
            }
        }
        is SubtractionExpression -> {
            left.pushAs(methodWriter, vars, this.type(vars))
            right.pushAs(methodWriter, vars, this.type(vars))
            when (this.type(vars)) {
                IntType -> methodWriter.visitInsn(ISUB)
                DecimalType -> methodWriter.visitInsn(DSUB)
                else -> throw UnsupportedOperationException("Subtracting ${this.type(vars)}")
            }
        }
        is DivisionExpression -> {
            left.pushAs(methodWriter, vars, this.type(vars))
            right.pushAs(methodWriter, vars, this.type(vars))
            when (this.type(vars)) {
                IntType -> methodWriter.visitInsn(IDIV)
                DecimalType -> methodWriter.visitInsn(DDIV)
                else -> throw UnsupportedOperationException("Dividing ${this.type(vars)}")
            }
        }
        is MultiplicationExpression -> {
            left.pushAs(methodWriter, vars, this.type(vars))
            right.pushAs(methodWriter, vars, this.type(vars))
            when (this.type(vars)) {
                IntType -> methodWriter.visitInsn(IMUL)
                DecimalType -> methodWriter.visitInsn(DMUL)
                else -> throw UnsupportedOperationException("Multiplying ${this.type(vars)}")
            }
        }
        // to push a variable we look up its index in the symbol table and load the value from that local variable
        is VarReference -> {
            val type = vars[this.varName]!!.type
            when (type) {
                IntType -> methodWriter.visitVarInsn(ILOAD, vars[this.varName]!!.index)
                DecimalType -> methodWriter.visitVarInsn(DLOAD, vars[this.varName]!!.index)
                else -> throw UnsupportedOperationException(type.javaClass.canonicalName)
            }
        }
        // the pushAs operation takes care of conversions, as needed
        is TypeConversion -> {
            this.value.pushAs(methodWriter, vars, this.targetType.toSandyType())
        }
        else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
    }
}
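To see why pushing the operands first and emitting the operation afterwards gives the right result, it can help to simulate a tiny stack machine. The sketch below uses a hypothetical AST (`Num`, `Sum`, `Product`) and a hypothetical instruction set (`Ldc`, `IAdd`, `IMul`) that only mimics a few JVM instructions; it does not use ASM and does not produce real bytecode.

```kotlin
// Hypothetical AST and instruction set, loosely mirroring ldc, iadd and imul.
sealed class Node
data class Num(val value: Int) : Node()
data class Sum(val left: Node, val right: Node) : Node()
data class Product(val left: Node, val right: Node) : Node()

sealed class Insn
data class Ldc(val value: Int) : Insn()
object IAdd : Insn()
object IMul : Insn()

// Post-order traversal: operands first, then the operator.
fun emit(node: Node, out: MutableList<Insn>) {
    when (node) {
        is Num -> out.add(Ldc(node.value))
        is Sum -> { emit(node.left, out); emit(node.right, out); out.add(IAdd) }
        is Product -> { emit(node.left, out); emit(node.right, out); out.add(IMul) }
    }
}

// Executes the instructions on an operand stack and returns the value left on top.
fun execute(program: List<Insn>): Int {
    val stack = ArrayDeque<Int>()
    for (insn in program) {
        when (insn) {
            is Ldc -> stack.addLast(insn.value)
            is IAdd -> { val r = stack.removeLast(); val l = stack.removeLast(); stack.addLast(l + r) }
            is IMul -> { val r = stack.removeLast(); val l = stack.removeLast(); stack.addLast(l * r) }
        }
    }
    return stack.removeLast()
}

fun main() {
    // 1 + 2 * 3
    val ast = Sum(Num(1), Product(Num(2), Num(3)))
    val program = mutableListOf<Insn>()
    emit(ast, program)
    println(program.size)     // 5
    println(execute(program)) // 7
}
```

The emitted sequence is Ldc 1, Ldc 2, Ldc 3, IMul, IAdd: each operation finds its operands already on the stack and replaces them with its result.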

Gradle

We can also create a Gradle task to compile source files:

task compileSandyFile(type:JavaExec) {
    main = "me.tomassetti.sandy.compiling.JvmKt"
    // args expects a list; sourceFile is assumed to be supplied as a project property
    args = ["$sourceFile"]
    classpath = sourceSets.main.runtimeClasspath
}

Conclusions

We did not go into much detail and we sort of rushed through the code. My goal here is just to give you an overview of the general strategy to use to generate bytecode. Of course, if you want to build a serious language you will need to do some studying and understand the internals of the JVM; there is no escape from that. I just hope that this brief introduction was enough to show you that this is not as scary or complicated as most people think.


Published at DZone with permission of Federico Tomassetti, DZone MVB. See the original article here.
