Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Writing a Compiler in C#: Parsing, Part 1

DZone's Guide to

Writing a Compiler in C#: Parsing, Part 1

·
Free Resource

In the previous installment we saw the core of a lexical analyzer, a module that generates from a stream of characters a set of tokens for symbols, identifiers, keywords, integer constants, and string constants. Today, we move to parsing.

The parser’s job is to give semantic structure to the syntactic tokens bestowed upon it by the lexical analyzer. There are, as always, automatic tools like yacc that create from a BNF grammar a program that parses tokens in a certain language. However, it is often more efficient and certainly more educational to write a parser by hand.

We’ll be dealing with a very special kind of parser—a recursive descent predictive parser. As you might remember from the last installment, Jack is almost an LL(1) language, meaning that a single look-ahead token is sufficient for parsing almost all Jack statements and expressions. (This predictive nature is a property of deterministic finite automata.)

When you look at a Jack program fragment (such as the following one, taken from the previous installment), you probably see a certain structure, but translating that structure into a set of parsing procedures is usually non-trivial without some sort of grammar representation.

let prime = true;
while (i < numbers[j] & prime) {
    if (numbers[j] % i = 0) {
        do System.printInt(numbers[j]);
        do System.print(" is composite.");
        do System.println();
        let prime = false;
    }
    let i = i + 1;
}

One such convenient representation is Backus-Naur Form (BNF).

If you’re familiar with regular expressions, you should have no difficulty understanding the following simplified BNF definitions of some of Jack’s grammar:

stmt       ::= while-stmt | if-stmt | let-stmt
while-stmt ::= while '(' expr ')' { stmt* }
if-stmt    ::= if '(' expr ')' { stmt* }
let-stmt   ::= let var ( [ expr ] )? = expr ;
var        ::= identifier
identifier ::= letter_ ( letter_ | digit )*
letter_    ::= A | B | … | Z | a | b | … | z | _
digit      ::= 0 | 1 | … | 9

The entire Jack grammar can be expressed in less than a single printed page of BNF definitions. I hope you’re convinced by now that BNF is a concise form of expression; it remains for me to convince you that it’s also a convenient form.

Let’s take a simple BNF grammar for summation expressions, where the terms are separated by + and there are no parentheses:

summation ::= term ( '+' term )* ;
term      ::= digit+

Assuming that the tokenizer hands us the terms (so that we don’t have to parse individual digits), the parser almost writes itself from the BNF:

procedure ParseSummation
    ParseTerm()
    while LookaheadToken <> ";"
        Match("+")
        ParseTerm()
        //placeholder (1)
    end while
end procedure

procedure ParseTerm
    term := NextToken()
    //placeholder (2)
end procedure

One thing this parser could do is convert infix summation expressions to postfix summation expressions. (Later, postfix will be an excellent choice for actually compiling a Jack expression to assembly language.) To accomplish that, we need to replace the first placeholder with “print +” and the second placeholder with “print term”.

We can parse more complex expressions in a very similar fashion; in the next installment we’ll deal with the full BNF form of a Jack expression.

Topics:

Published at DZone with permission of Sasha Goldshtein, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}