Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Writing a Compiler in C#: Parsing, Part 3

DZone's Guide to

Writing a Compiler in C#: Parsing, Part 3

·
Free Resource

Last time we left off on the brink of finishing the parser for Jack expressions. We need only fill in the blanks for parsing subroutine calls.

There are three forms of subroutine calls allowed in Jack:

class C {
constructor C new() { return this; }
function void f() {
var C c;
var D d;
var int i;
let c = C.new(); //ctor call
let d = D.new(); //ctor call
let i = c.m(); //method call on another instance
let i = d.m(); //method call on another instance
let i = D.f(); //function call on another instance
}
method int m() {
var int i;
let i = x(); //method call on this instance
return i;
}
method int x() { return 42; }
}
class D {
constructor D new() { return this; }
method int m() { return 42; }
}

The easiest form is the dotless method call on the same instance, e.g. “x()”. This thing is characterized by the lack of the dot. The next two forms are slightly more difficult to tell apart—either a method call on a variable, e.g. “c.m()” or a function call on a class, e.g. “D.m()”.

To tell the latter two forms apart, we need to see if the first identifier is a variable name or not. For that, we need a symbol table, discussed later. To tell the first form apart from the latter two forms, we need to see if the next identifier after the first one is a dot or an opening parenthesis. This is the only case in the Jack grammar when double look-ahead is required, making Jack an LL(2) language. Fortunately, we can code a special case for parsing sub-call so that we don’t need to introduce backtracking of the look-ahead token.

When the ParseTerm method calls the ParseSubCall method, it passes to it the token already eaten by ParseTerm (which could be a method name, a variable name, or a class name). The look-ahead token is then the second token in the BNF definition of sub-call, so the ParseSubCall method needs to treat it accordingly:

private void ParseSubCall(Token firstPart)
{
if (firstPart.Type != TokenType.Ident)
{
ThrowCompilationException(...);
}
string classNameOfSubroutine = null;
string subroutineName = null;
bool isInstanceMethod = false;
Symbol classOrVarName = GetClosestSymbol(firstPart.Value);
if (classOrVarName == null)
{
if (LookAheadToken.Value == "(")
{
...

classNameOfSubroutine = _currentClassName;
subroutineName = firstPart.Value;
isInstanceMethod = true;
}
else if (LookAheadToken.Value == ".")
{
Match(new Token(TokenType.Symbol, "."));
Token subName = NextToken();
if (subName.Type != TokenType.Ident)
{
ThrowCompilationException(...);
}
classNameOfSubroutine = firstPart.Value;
subroutineName = subName.Value;
isInstanceMethod = false;
}
else
{
ThrowCompilationException(...);
}
}
else
{
//We must be able to tell the class name by looking
//at the variable's type.
classNameOfSubroutine = classOrVarName.Type;
if (Syntax.IsKeyword(classNameOfSubroutine))
{
ThrowCompilationException(
"Can't call methods on built-in types");
}
Match(new Token(TokenType.Symbol, "."));
Token subName = NextToken();
if (subName.Type != TokenType.Ident)
{
ThrowCompilationException(...);
}
subroutineName = subName.Value;
isInstanceMethod = true;
}
Match(new Token(TokenType.Symbol, "("));
//If this is an instance method, we need to supply
//'this' as the first parameter to be extracted on
//the other side.
if (isInstanceMethod)
{
if (classNameOfSubroutine == _currentClassName)
{ //'this' is actually our very own instance,
//as we're calling an instance method on ourselves.
_codeGenerator.This();
}
else
{ //'this' is actually the variable on which
//we're making the call, so we need that pushed.
_codeGenerator.VariableRead(firstPart, false);
}
}
ParseExpressionList();
Match(new Token(TokenType.Symbol, ")"));
_codeGenerator.Call(classNameOfSubroutine, subroutineName);

}
private void ParseExpressionList()
{
if (LookAheadToken.Value == ")")
{
return;//This is an empty list
}
while (true)
{
ParseBLOCKED EXPRESSION;
if (LookAheadToken.Value != ",")
{
break;
}
Match(new Token(TokenType.Symbol, ","));
}
}

Now we must come back to the subject of a symbol table.

When compiling the method call “c.m()” where c is a variable of class C, we need the type of c to be available to the compiler to emit the appropriate method call. In other words, we need a repository that maps variable names to their types.

In a similar vein, when compiling the method call “C.m()” where C is a class name, we need the compiler to know that C is not a variable name in any active scope, and therefore it’s a class name. Again, we need a repository that contains the set of variable names in the current scope.

The repository that takes the latter two responsibilities is called a symbol table, and in our case is implemented using a simple Dictionary<string,Symbol> where Symbol is a class with properties Name, Type, and Kind (local, parameter, field, or static).

Clearly, some part of the compiler has to populate the symbol table. In most cases, it’s the parser’s responsibility (although some smart lexical analyzers may also help). In our compiler, the parser populates the symbol table when it processes local variable declarations in methods and class-level variable declarations.

Unfortunately, one symbol table is not enough. Consider the following legal Jack program fragment:

class C {
static int i;
function void f() {
var int i;
let i = 15;
}
function void g() {
let i = 10;
do C.f();
do System.printInt(i);
}
}

After invoking C.g, the output should be 10, not 15! In other words, we need at least two symbol tables to parse properly a Jack program—a class-level symbol table and a method-level symbol table. Fortunately, we don’t allow nested blocks in Jack, so two tables are indeed enough and we don’t have to deal with scope lookup for an identifier.

One final note: the error-handling capabilities of our parser are significantly limited by the fact that Jack does not require subroutines to be declared prior to their use, and the fact that our compiler is a one-pass compiler. In other words, we might happily parse the following program, only to determine at a later time that the method calls can’t be resolved:

class C {
function void f() {
do C.questionable_method();
do D.questionable_method();
}
}

One alleviation is to require subroutines to be declared prior to their use, i.e. something along the lines of:

class C {
function void C.questionable_method() FORWARD;
function void D.questionable_method() FORWARD;
function void f() {
do C.questionable_method();
do D.questionable_method();
}
}

This will at the very least ensure that the developer really meant to invoke the declared method.

Another approach is to perform two passes over the source code. In the first pass, the compiler doesn’t look inside subroutine bodies—it looks only at the class and method declarations to know “what’s available”. In the second pass, parsing can proceed with full knowledge of all class and method declarations and can flag any errors that arise. (Incidentally, the former approach is taken by the C++ compiler, and the latter approach taken by the C# compiler. Which one is friendlier to the developer?)

Next time, the full BNF of a Jack program and the final parser implementation.

Topics:

Published at DZone with permission of Sasha Goldshtein, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

THE DZONE NEWSLETTER

Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

X

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}