This post is going to be about language processing. Language processing could be anything like an arithmetic expression evaluator, a SQL parser or even a compiler or interpreter. Many times when we build user facing products, we give users a new language to interact with the product. Say, if you had used JIRA for project management, it gives you a Jira Query Language. Google also has a language to search as documented here – https://support.google.com/websearch/answer/136861?hl=en. Splunk has it’s own language called SPL. How to build such a system is what we will see in this post.
A test use case
I always believe to learn something we need to have a problem to solve that can serve as a use case. Let’s say I want to come up with a new language that’s simpler than SQL. Say I want the user to be able to key in the below text:
Abishek AND (country=India OR city=NY) LOGIN 404 | show name city
And this should fetch name and city fields from a table where the text matches “Abishek” and Abishek could be either in some city in India or gone to New York. We also need to filter results that contain the text LOGIN and 404 as we are trying to trace what happened when Abishek was trying to login but landed with some error codes. Say the data is in a database, what we need here is a language parser to understand the input and then a translator that can translate to SQL so that we can run the query on DB.
What is ANTLR?
From antlr.org, ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files. And ANTLR can greatly help solve our use case pretty quickly. There are few more similar tools like javacc etc, but I found ANTLR to the well documented and top project in this space.
The first step: Grammar
When we want a parser, an approach many take is to go and write the parser from scratch. I remember doing so in an interview where I was asked to write an arithmetic expression evaluator. Though this approach works – it’s not the best choice when you have complex operators, keywords and many choices. Choices are an interesting thing – If you know Scala you will realise 5 + 3 is the same as 5.+(3). Usually there is more than one way to do things, in our example we could either say “LOGIN AND 404” or just say “LOGIN 404”. Grammar involves identifying these choices, sequences and tokens.
ANTLR uses a variant of the popular LL(*) parsing technique (http://en.wikipedia.org/wiki/LL_parser) which takes a top down approach. So we define the grammar from top down – fist look at what the input is – Say a file input can have a set of statements – statements can be classified into different statement types based on identifying patterns and tokens. Then statements can be broken down in different types of expressions and expressions can contain operators and operands.
In this approach a quick grammar I came up for our use case is like below:
grammar Simpleql; statement : expr command* ; expr : expr ('AND' | 'OR' | 'NOT') expr # expopexp | expr expr # expexp | predicate # predicexpr | text # textexpr | '(' expr ')' # exprgroup ; predicate : text ('=' | '!=' | '>=' | '<=' | '>' | '<') text ; command : '| show' text* # showcmd | '| show' text (',' text)* # showcsv ; text : NUMBER # numbertxt | QTEXT # quotedtxt | UQTEXT # unquotedtxt ; AND : 'AND' ; OR : 'OR' ; NOT : 'NOT' ; EQUALS : '=' ; NOTEQUALS : '!=' ; GREQUALS : '>=' ; LSEQUALS : '<=' ; GREATERTHAN : '>' ; LESSTHAN : '<' ; NUMBER : DIGIT+ | DIGIT+ '.' DIGIT+ | '.' DIGIT+ ; QTEXT : '"' (ESC|.)*? '"' ; UQTEXT : ~[ ()=,<>!\r\n]+ ; fragment DIGIT : [0-9] ; fragment ESC : '\\"' | '\\\\' ; WS : [ \t\r\n]+ -> skip ;
Going by top down approach:
- We can see than in my case, my input is a statement.
- A statement comprises of an expression part and a command part.
- Expression has multiple patterns – it can be two expressions connected by an expression operator.
- Expression can be internally two expression without an explicit operator between them.
- An expression can be predicate – A predicate is of the patter <text> <operator> <text>
- An expression can be just a text. Eg: We just want to do full text search on “LOGIN”.
- An expression can be an expression inside brackets for grouping.
- A command has a command starting with a pipe, then a command like “show” followed by arguments.
Creating the Lexer, Parser and Listener
With ANTLR, once you come up with the grammar, you are close to done! ANTLR generates the lexer, parser and listener code for us. Lexer helps with breaking our input into tokens. We usually don’t deal with the Lexer. What we will use is the Parser – The Parser can give us a parsed expression tree like shown below.
ANTLR also gives you a tree walker than can traverse the tress and gives you a base listener with methods that get called when the traverser is navigating the tree. All I had to implement the translator was to extend the listener and overwrite the methods for the nodes I am interested in and use a stack push the translations at each node. And that’s all, my robust translator was ready pretty fast. I am not going to post about ANTLR setup and running guide here, because that’s quite clear in their documentation. But feel free to reach out to me incase of an clarifications!