
I Built a Query Parser Over a Weekend (Part 2)


I ended up with something that is easy to work with. We can compare it to the formal grammar definition to guide the parsing, and the resulting code is readable and performs quite well.


In the previous post, I talked about what I wanted to achieve and how I decided to use the GOLD parser to formally define the language grammar. However, I had no intention of actually generating the parser; I wanted to write it myself, because I have had a lot of bad experiences with the results of parser generators. They generate horrible code that manages to be both unreadable and something you are forced to dig into frequently enough to be maddening.

In particular, I haven't been able to find anything generated that matches a hand-rolled parser in terms of performance, readability of the code, and the quality of the errors it produces. The last point might sound strange, but a key part of the usefulness of any language is the kind of errors you get when you feed it invalid input.

An important consideration is that I intend to accept input over HTTP, so I can assume my input arrives already wrapped in a nice System.String package, which avoids a lot of the complications you usually have to deal with when the input is a streaming medium. Initially, I tried to combine the scanner and parser in the same class, but that ended up being a bad idea; it was too complex to handle. Instead, I split the responsibility into separate scanner and parser classes.
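
The post doesn't show the scanner's declaration, but the snippets below reference a handful of its fields. As a rough sketch of the state involved (the field names come from the code; the class shape and Init method are my assumptions):

public class QueryScanner
{
    private string _q;  // the full query text, already a System.String
    private int _pos;   // current position within _q

    public int TokenStart;   // where the last matched token begins
    public int TokenLength;  // length of the last matched token
    public int Line;         // line/column tracking, used for
    public int Column;       // producing good error messages
    public int EscapeChars;  // escape characters seen in string literals

    public void Init(string q)
    {
        _q = q;
        _pos = 0;
        Line = Column = 1;
    }
}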

The scanner is responsible for scanning the string and finding the next token, but I decided to make it reactive: the caller tells it what to expect, and it checks whether the next bit of input matches. Scanners typically scan the input and produce a stream of tokens. That works, but it means you need a lot more state in the scanner, and it can be more complex. I decided to simplify it as much as I possibly could.
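
The parser drives the scanner through calls like TryScan("FROM"), asking whether the next token is a specific keyword. The post doesn't include that method, but a minimal sketch of the idea, assuming the case-insensitive matching described further down, might look like this:

// Sketch: check whether the next token matches the expected keyword.
// Matching is case-insensitive; on failure, _pos is left untouched.
// A production version would also verify the match ends on a token
// boundary, so that "FROMx" does not match "FROM".
public bool TryScan(string match)
{
    if (SkipWhitespace() == false)
        return false;

    if (_pos + match.Length > _q.Length)
        return false;

    if (string.Compare(_q, _pos, match, 0, match.Length,
            StringComparison.OrdinalIgnoreCase) != 0)
        return false;

    _pos += match.Length;
    Column += match.Length;
    return true;
}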

Here is how we find identifiers in the scanner:

// Matches: [_, A-Z] [A-Z, 0-9, _]*
public bool Identifier(bool skipWhitespace = true)
{
    // SkipWhitespace returns false at the end of the input,
    // so the indexing below is safe.
    if (SkipWhitespace(skipWhitespace) == false)
        return false;
    // An identifier must start with a letter or an underscore.
    if (char.IsLetter(_q[_pos]) == false &&
            _q[_pos] != '_')
        return false;
    TokenStart = _pos;
    _pos++;
    // The rest may be letters, digits, or underscores.
    for (; _pos < _q.Length; _pos++)
        if (char.IsLetterOrDigit(_q[_pos]) == false &&
            _q[_pos] != '_')
            break;
    TokenLength = _pos - TokenStart;
    Column += TokenLength;
    return true;
}
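
The Identifier method leans on SkipWhitespace, which isn't shown in the post either. Presumably it advances past whitespace and reports whether any input is left; a sketch under that assumption:

// Sketch: skip whitespace (unless told not to) and report whether
// there is still input left to scan. Returning false at the end of
// the input lets callers bail out before indexing into _q.
private bool SkipWhitespace(bool skip = true)
{
    if (skip == false)
        return _pos < _q.Length;

    for (; _pos < _q.Length; _pos++)
    {
        if (char.IsWhiteSpace(_q[_pos]) == false)
            return true;

        if (_q[_pos] == '\n')
        {
            Line++;
            Column = 1;
        }
        else
        {
            Column++;
        }
    }
    return false;
}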

You might notice that the code is pretty simple; it runs over the input string (_q) and checks whether the next token is an identifier according to our rules. That makes the parser code easier to handle. Let's consider how we can parse a FROM clause. The formal definition is:

<From> ::= 'FROM INDEX' <From Source> | 'FROM' <From Source>
<From Source> ::= <Simple Field> 'AS' <Simple Field> | <Simple Field>

<Simple Field> ::= Identifier | StringLiteral

If you are familiar with BNF, this is quite readable: a FROM clause is the keyword FROM, optionally followed by INDEX, then an identifier or string literal as the source, optionally aliased with AS (so both FROM Users and FROM INDEX 'Users/ByName' AS u are valid). Now, how do we parse this? Here is the actual parser code:

private (FieldToken From, FieldToken Alias, bool Index) FromClause()
{
    // A query must open with a FROM clause.
    if (Scanner.TryScan("FROM") == false)
        ThrowParseException("Expected FROM clause");

    FieldToken field;
    bool index = false;
    if (Scanner.TryScan("INDEX"))
    {
        // FROM INDEX <identifier | string literal>
        if (!Scanner.Identifier() && !Scanner.String())
            ThrowParseException("Expected FROM INDEX source");

        field = new FieldToken
        {
            TokenLength = Scanner.TokenLength,
            TokenStart = Scanner.TokenStart,
            EscapeChars = Scanner.EscapeChars
        };

        index = true;
    }
    else
    {
        // FROM <identifier | string literal>
        if (!Scanner.Identifier() && !Scanner.String())
            ThrowParseException("Expected FROM source");

        field = new FieldToken
        {
            TokenLength = Scanner.TokenLength,
            TokenStart = Scanner.TokenStart,
            EscapeChars = Scanner.EscapeChars
        };
    }

    // Optional alias: ... AS <identifier | string literal>
    FieldToken alias = null;
    if (Scanner.TryScan("AS"))
    {
        if (!Scanner.Identifier() && !Scanner.String())
            ThrowParseException("Expected ALIAS after AS in FROM");

        alias = new FieldToken
        {
            TokenLength = Scanner.TokenLength,
            TokenStart = Scanner.TokenStart,
            EscapeChars = Scanner.EscapeChars
        };
    }

    return (field, alias, index);
}

As you can see, we start by requiring a FROM token (the scanner is case-insensitive, so we don't need to worry about casing), then check whether it is followed by the INDEX term, then get the actual identifier or string literal to use, and finally a possible alias.

In this manner, we also get something that is pretty easy to work with, and we can compare it to the formal definition to guide us in the parsing. At the same time, we get code that is readable and has quite good performance.
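
To make the behavior concrete, here is roughly how the FROM parser might be exercised (the QueryParser class name and Init entry point are my assumptions, and FromClause itself is private in the snippet above, so assume a public wrapper in practice):

var parser = new QueryParser();
parser.Init("from index 'Users/ByName' as u");

var (source, alias, isIndex) = parser.FromClause();

// source  -> token covering 'Users/ByName' (a string literal)
// alias   -> token covering u
// isIndex -> true, because the INDEX keyword was present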

