Getting Started With ANTLR in C#
Integrating an ANTLR grammar in a C# project is very easy with the provided Visual Studio extensions and NuGet packages.
ANTLR is a great tool to quickly create parsers that help you work with a known language or create your own DSL. While the tool itself is written in Java, it can also be used to generate parsers in several other languages like Python, C#, or JavaScript (with more languages supported by the recently released 4.6 version).
If you want to use C#, you can integrate ANTLR in your favorite IDE as long as that IDE is a recent edition of Visual Studio. The runtime itself also works on Mono and can be used as a standalone. You can look at the issues for the official C# target for ANTLR 4 to see if you can make it work with other setups, but the easiest way is to use Visual Studio and the provided extension to integrate the generation of the grammar into your C# project.
Setup
The first step is to install the ANTLR Language Support extension for Visual Studio. You just have to search for it in Visual Studio by going to Tools > Extensions and Updates. The extension lets you integrate ANTLR into your workflow by automatically generating the parser and, optionally, listeners and visitors from your grammar. Once it is installed, you can add a new ANTLR 4 Combined Grammar or an ANTLR 4 Lexer/Parser in the same way you add any other new item. Then, for each one of your projects, you must add the NuGet package for Antlr4. If you want to manage options and, for instance, disable the visitor/listener generation, check out the official GitHub project.
Create the Grammar
For our simple project, we are going to create a grammar that parses two lines of text that represent a chat between two people. This could be the basis for a chat program or for a game in which whoever says the shortest word gets beaten up with a thesaurus. This is not relevant for the grammar itself, because it handles only the recognition of the various elements of the input. What you choose to do with these elements is managed through normal code. Add a new ANTLR 4 Combined Grammar with the name Speak. You will see that there is already some text in the new file; delete all of it and replace it with the following text.
grammar Speak;
/*
* Parser Rules
*/
chat : line line EOF ;
line : name SAYS word NEWLINE;
name : WORD ;
word : WORD ;
/*
* Lexer Rules
*/
fragment A : ('A'|'a') ;
fragment S : ('S'|'s') ;
fragment Y : ('Y'|'y') ;
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
SAYS : S A Y S ;
WORD : (LOWERCASE | UPPERCASE)+ ;
WHITESPACE : (' '|'\t')+ -> skip ;
NEWLINE : ('\r'? '\n' | '\r')+ ;
While you may create separate lexer and parser grammars, for a simple project you will want to use a combined grammar, with the parser rules placed before the lexer rules by convention. The order of the lexer rules matters: the ANTLR lexer picks the rule that matches the longest piece of input, and when two rules match the same text, the one defined first wins. So, it's important to put the more specific tokens first and the generic ones, like WORD or ID, later. In this example, if we had inverted SAYS and WORD, SAYS would have been hidden by WORD. Another thing to notice is that you can't use fragments outside of lexer rules.
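To make the ordering rule concrete, here is a minimal sketch that imitates it in plain C# with regular expressions. It is not ANTLR's implementation and does not use the ANTLR runtime; the class LexerOrderSketch, the method Tokenize, and the regex versions of SAYS and WORD are hypothetical names made up for this illustration. It only shows the idea of "longest match wins, and on a tie the first rule wins".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical illustration only: imitates the lexer's choice between rules,
// it is not how ANTLR is implemented.
public static class LexerOrderSketch
{
    // rules: token name -> regex pattern, listed in grammar order.
    // Returns the names of the tokens recognized in the input.
    public static List<string> Tokenize(string input, IList<KeyValuePair<string, string>> rules)
    {
        var tokens = new List<string>();
        int pos = 0;
        while (pos < input.Length)
        {
            string bestName = null;
            int bestLength = 0;
            foreach (var rule in rules)
            {
                // \G anchors the match at the current position.
                Match m = new Regex(@"\G(?:" + rule.Value + ")").Match(input, pos);
                // A strictly longer match replaces the current best;
                // on a tie the rule listed first is kept.
                if (m.Success && m.Length > bestLength)
                {
                    bestName = rule.Key;
                    bestLength = m.Length;
                }
            }
            if (bestName == null)
            {
                pos++; // no rule matches here (e.g. a space): skip the character
                continue;
            }
            tokens.Add(bestName);
            pos += bestLength;
        }
        return tokens;
    }
}
```

With the rules listed as SAYS ([Ss][Aa][Yy][Ss]) and then WORD ([A-Za-z]+), the input "john says hello" produces WORD, SAYS, WORD. With the two rules swapped, the same input produces WORD, WORD, WORD, because WORD matches "says" with the same length and now comes first.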
The lexer part is pretty straightforward: we identify a SAYS, which can be written in uppercase or lowercase, a WORD, which can be composed of any uppercase or lowercase letters, and a NEWLINE. Any text that is WHITESPACE, meaning spaces and tabs, is simply ignored. While this is clearly a simple case, lexer rules are rarely much more complicated than this. Usually, the worst thing that can happen is having to use semantic predicates. These are essentially statements that evaluate to true or false, and when they are false, they disable the rule they guard. For instance, you may want to use a "/" as the beginning of a comment only if it is the first character of a line; otherwise, it should be considered an arithmetic operator.
The parser is usually where things get more complicated, although that's not the case this time. Every document given to the Speak grammar must contain a chat, which in turn consists of two line rules followed by an end-of-file marker. A line must contain a name, the SAYS keyword, and a word, followed by a NEWLINE. Name and word are identical rules, but they have different names because they correspond to different concepts, and they could easily change in a real program.
Visiting the Tree
Just like we have seen for Roslyn, ANTLR will automatically create a tree and base visitor (and/or listener). We can create our own visitor class and change what we need. Let’s see an example.
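public class SpeakVisitor : SpeakBaseVisitor<object>
{
    // List<T> requires: using System.Collections.Generic;
    public List<SpeakLine> Lines = new List<SpeakLine>();

    // Called for every "line" node found in the parse tree.
    public override object VisitLine(SpeakParser.LineContext context)
    {
        // The context types are nested inside the generated SpeakParser class.
        SpeakParser.NameContext name = context.name();
        SpeakParser.WordContext word = context.word();
        SpeakLine line = new SpeakLine() { Person = name.GetText(), Text = word.GetText() };
        Lines.Add(line);
        return line;
    }
}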
The class declaration shows how to create a class that inherits from the SpeakBaseVisitor class, which is automatically generated by ANTLR. If you need to, you can restrict the type parameter; for a calculator grammar, for instance, you could use something like int or double. SpeakLine (not shown) is a custom class that contains two properties: Person and Text. The VisitLine method shows how to override the function that visits the specific type of node you want. You just need to use the appropriate type for the context, which contains the information provided by the parser generated by ANTLR. At the end of the method we return the SpeakLine object that we just created. This is unusual, but it's useful for the unit tests that we will create later. Usually, you would want to return base.VisitLine(context) so that the visitor can continue its journey across the tree.
This code simply populates a list of SpeakLine objects that hold the name of the person and the word they have spoken. The Lines field will be used by the main program.
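Since SpeakLine is not shown, here is a minimal sketch of what it could look like. It assumes nothing beyond the two properties described above; the use of string for both and of auto-implemented properties is an assumption, not necessarily the author's actual class.

```csharp
// Hypothetical sketch of the SpeakLine class used by the visitor:
// a plain data holder with the two properties described in the text.
public class SpeakLine
{
    // Name of the person who is speaking.
    public string Person { get; set; }

    // The word that person said.
    public string Text { get; set; }
}
```

The visitor fills it with an object initializer, for example new SpeakLine() { Person = "john", Text = "hello" }.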
Putting It All Together
// Requires: using System; using System.Text; using Antlr4.Runtime;
private static void Main(string[] args)
{
    try
    {
        string input = "";
        StringBuilder text = new StringBuilder();
        Console.WriteLine("Input the chat.");

        // to type the EOF character and end the input: use CTRL+D, then press <enter>
        // ReadLine returns null when the input stream is closed, so we stop in that case too
        while ((input = Console.ReadLine()) != null && input != "\u0004")
        {
            text.AppendLine(input);
        }

        AntlrInputStream inputStream = new AntlrInputStream(text.ToString());
        SpeakLexer speakLexer = new SpeakLexer(inputStream);
        CommonTokenStream commonTokenStream = new CommonTokenStream(speakLexer);
        SpeakParser speakParser = new SpeakParser(commonTokenStream);

        // chat is the starting rule of the grammar
        SpeakParser.ChatContext chatContext = speakParser.chat();
        SpeakVisitor visitor = new SpeakVisitor();
        visitor.Visit(chatContext);

        foreach (var line in visitor.Lines)
        {
            Console.WriteLine("{0} has said \"{1}\"", line.Person, line.Text);
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine("Error: " + ex);
    }
}
As you can see, there is nothing particularly complicated. The lines that create the AntlrInputStream, SpeakLexer, CommonTokenStream, and SpeakParser show how to set up the lexer and the parser, which then builds the tree. The subsequent lines show how to launch the visitor that you have created: you have to get the context for whichever starting rule you use, in our case chat, and then tell the visitor to visit the tree from that node.
The program itself simply outputs the information contained in the tree. It would be trivial to modify the grammar to allow any number of lines (for example, by using line+ instead of line line in the chat rule). Neither the visitor nor the main program would need to be changed.
Unit Testing
Testing is useful in all cases, but it is absolutely crucial when you are creating a grammar, to check that everything is working correctly. If you are creating a grammar for an existing language, you probably want to check many working source files, but in any case, you want to start by unit testing the single rules. Luckily, since the creation of the Community Edition of Visual Studio, there is a free version of Visual Studio that includes a unit testing framework. All you have to do is create a new Test Project, add all the necessary NuGet packages, and add a reference to the project assembly you need to test.
using Antlr4.Runtime;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class ParserTest
{
    // Builds a parser for the given text; each test picks the rule to start from.
    private SpeakParser Setup(string text)
    {
        AntlrInputStream inputStream = new AntlrInputStream(text);
        SpeakLexer speakLexer = new SpeakLexer(inputStream);
        CommonTokenStream commonTokenStream = new CommonTokenStream(speakLexer);
        SpeakParser speakParser = new SpeakParser(commonTokenStream);
        return speakParser;
    }

    [TestMethod]
    public void TestChat()
    {
        SpeakParser parser = Setup("john says hello \n michael says world \n");
        SpeakParser.ChatContext context = parser.chat();
        SpeakVisitor visitor = new SpeakVisitor();
        visitor.Visit(context);
        // Assert.AreEqual takes the expected value first, then the actual one.
        Assert.AreEqual(2, visitor.Lines.Count);
    }

    [TestMethod]
    public void TestLine()
    {
        SpeakParser parser = Setup("john says hello \n");
        SpeakParser.LineContext context = parser.line();
        SpeakVisitor visitor = new SpeakVisitor();
        SpeakLine line = (SpeakLine) visitor.VisitLine(context);
        Assert.AreEqual("john", line.Person);
        Assert.AreEqual("hello", line.Text);
    }

    [TestMethod]
    public void TestWrongLine()
    {
        SpeakParser parser = Setup("john sayan hello \n");
        var context = parser.line();
        Assert.IsInstanceOfType(context, typeof(SpeakParser.LineContext));
        Assert.AreEqual("john", context.name().GetText());
        Assert.AreEqual("sayan", context.word().GetText());
        Assert.AreEqual("john<missing SAYS>sayanhello\n", context.GetText());
    }
}
There is nothing unexpected in these tests. One observation is that we can create a test to check the single line visitor, or we can test the matching of the rule itself. You obviously should do both. You may wonder how the last test works, since we are trying to match input that doesn't match the rule, but we still get the correct type of context as a return value and some correct matching values. This happens because ANTLR's error recovery is quite robust and we are checking only one rule, which has no alternatives. Since the input starts the correct way, it is considered a match, although a partial one.
Conclusion
Integrating an ANTLR grammar in a C# project is quite easy with the provided Visual Studio extensions and NuGet packages, making it the best way to quickly create a parser for your DSL. While there will be no more piles of fragile RegEx(s), you still can't forget the tests.
Published at DZone with permission of Federico Tomassetti, DZone MVB. See the original article here.