We have tried to be practical in this article. Our goal is to help practitioners, not to explain the full theory. We just explain what you need to know to understand and build parsers.
Definition of Parsing
Parsing is defined as "the analysis of an input to organize the data according to the rules of a grammar."
There are a few ways to define parsing. However, the gist remains the same: parsing means to find the underlying structure of the data we are given.
In a way, parsing can be considered the inverse of templating: parsing identifies the structure and extracts the data, while templating starts from a structure and fills it with data. In parsing, you have to determine the model from the raw representation; in templating, you combine the data with the model to create the raw representation. The raw representation is usually text, but it can also be binary data.
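A tiny sketch in Python can make the symmetry concrete. The template string and the field names here are purely illustrative:

```python
import re

# Templating: structure + data -> raw text
template = "{name} is {age} years old"
text = template.format(name="Ada", age=36)
assert text == "Ada is 36 years old"

# Parsing: raw text -> data (here, via a regular expression that
# mirrors the structure of the template)
match = re.fullmatch(r"(\w+) is (\d+) years old", text)
data = {"name": match.group(1), "age": int(match.group(2))}
assert data == {"name": "Ada", "age": 36}
```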
Fundamentally, parsing is necessary because different entities need the data in different forms. Parsing transforms data into a form that a specific piece of software can understand. The obvious example is programs: they are written by humans, but they must be executed by computers. So, humans write them in a form that they can understand, and then software transforms them into a form that a computer can use.
However, parsing might be necessary even when passing data between two pieces of software that have different needs. For instance, it is needed when you have to serialize or deserialize a class.
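For example, a JSON library does exactly this kind of parsing for you. A minimal sketch using Python's standard json module (the data itself is just an illustration):

```python
import json

# An in-memory structure to be exchanged between two programs.
point = {"x": 1, "y": 2}

# Serialization: data -> raw representation (text).
raw = json.dumps(point)

# Deserialization: the library parses the text back into a structure.
restored = json.loads(raw)
assert restored == {"x": 1, "y": 2}
```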
The Big Picture
In this section, we are going to describe the fundamental components of a parser. We are not trying to give you formal explanations, but practical ones.
Regular Expressions
A regular expression is a sequence of characters that defines a pattern, i.e., the set of strings it can match.
Regular expressions are often touted as the thing that you should not use for parsing. This is not strictly correct because you can use regular expressions for parsing simple input. The problem is that some programmers only know regular expressions, so they use them to try to parse everything — even the things that they should not. The result is usually a series of regular expressions hacked together that are very fragile.
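For simple, non-nested input, a regular expression is in fact a perfectly reasonable tool. A minimal sketch parsing a flat "key = value" format (the format and the keys are illustrative):

```python
import re

config_text = """
host = example.com
port = 8080
"""

# One rule per line: a name, an equals sign, a value. No nesting,
# so a regular expression is enough.
pattern = re.compile(r"^\s*(\w+)\s*=\s*(\S+)\s*$")

config = {}
for line in config_text.splitlines():
    m = pattern.match(line)
    if m:
        config[m.group(1)] = m.group(2)

assert config == {"host": "example.com", "port": "8080"}
```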
You can use regular expressions to parse some simpler languages, but this excludes most programming languages — even the ones that look simple enough, like HTML. In fact, languages that can be parsed with just regular expressions are called regular languages. There is a formal mathematical definition, but that is beyond the scope of this article.
One important consequence of the theory is that regular languages can also be parsed or expressed by a finite state machine. That is to say, regular expressions and finite state machines are equally powerful. This is the reason they are used to implement lexers, as we are going to see later.
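To see the equivalence in practice, here is a hand-written two-state machine that accepts exactly the same strings as the regular expression [0-9]+ (the state names are illustrative):

```python
import re

def fsm_is_number(text):
    # States: START (nothing read yet), IN_NUMBER (read at least one digit).
    state = "START"
    for ch in text:
        if ch.isdigit():
            state = "IN_NUMBER"
        else:
            return False  # any non-digit rejects the input
    return state == "IN_NUMBER"

# The FSM and the regular expression agree on every input.
for s in ["437", "4a7", ""]:
    assert fsm_is_number(s) == bool(re.fullmatch(r"[0-9]+", s))
```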
A regular language can be defined by a series of regular expressions, while more complex languages need something more. A simple rule of thumb is that if a grammar of a language has recursive, or nested, elements, it is not a regular language. For instance, HTML can contain an arbitrary number of tags inside any tag; therefore, it is not a regular language and it cannot be parsed using solely regular expressions, no matter how clever.
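The intuitive reason is that recognizing arbitrarily deep nesting requires counting, and a finite state machine has no memory to count with. A sketch of what the recognizer actually needs, using parentheses as a stand-in for nested tags:

```python
def balanced(text):
    # Track the nesting depth; a finite state machine has only finitely
    # many states, so it cannot track an unbounded depth.
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a closer with no matching opener
                return False
    return depth == 0

assert balanced("(()())")
assert not balanced("(()")
```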
Regular Expressions in Grammar
The familiarity of a typical programmer with regular expressions lends them to be often used to define the grammar of a language. More precisely, their syntax is used to define the rules of a lexer or a parser. For example, the Kleene star (*) is used in a rule to indicate that a particular element can be present zero or more times.
The definition of the rule should not be confused with how the actual lexer or parser is implemented. You can implement a lexer using the regular expression engine provided by your language. However, the regular expressions defined in the grammar are usually converted to a finite state machine to gain better performance.
Structure of a Parser
Having clarified the role of regular expressions, we can look at the general structure of a parser. A complete parser is usually composed of two parts: a lexer, also known as a scanner or tokenizer, and the parser proper. The parser needs the lexer because it does not work directly on the text but on the output produced by the lexer. Not all parsers adopt this two-step scheme; some do not depend on a separate lexer and combine the two steps. These are called scannerless parsers.
A lexer and a parser work in sequence: the lexer scans the input and produces the matching tokens; the parser then scans the tokens and produces the parsing result.
Let’s look at the following example and imagine that we are trying to parse addition.
437 + 734
The lexer scans the text and finds 4, 3, and 7, and then a space ( ). The job of the lexer is to recognize that the characters 437 constitute one token of type NUM. Then the lexer finds a + symbol, which corresponds to a second token of type PLUS, and lastly, it finds another token of type NUM.
The parser will typically combine the tokens produced by the lexer and group them.
The definitions used by lexers and parsers are called rules or productions. In our example, a lexer rule will specify that a sequence of digits corresponds to a token of type NUM, while a parser rule will specify that a sequence of tokens of type NUM, PLUS, NUM corresponds to a sum expression.
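Both steps can be sketched in a few lines of Python. The rule names (NUM, PLUS, SKIP) mirror the example above; everything else is an illustrative implementation, not how a real generated parser works internally:

```python
import re

# Lexer rules: each token type is defined by a regular expression.
TOKEN_RULES = [("NUM", r"[0-9]+"), ("PLUS", r"\+"), ("SKIP", r"\s+")]

def lex(text):
    # The lexer scans the text and produces (type, value) tokens.
    tokens, pos = [], 0
    while pos < len(text):
        for kind, pattern in TOKEN_RULES:
            m = re.match(pattern, text[pos:])
            if m:
                if kind != "SKIP":  # whitespace is recognized but discarded
                    tokens.append((kind, m.group()))
                pos += m.end()
                break
        else:
            raise SyntaxError(f"unexpected character {text[pos]!r}")
    return tokens

def parse_sum(tokens):
    # The parser rule: a NUM, PLUS, NUM sequence is a sum expression.
    (k1, v1), (k2, _), (k3, v3) = tokens
    if (k1, k2, k3) != ("NUM", "PLUS", "NUM"):
        raise SyntaxError("expected NUM PLUS NUM")
    return ("sum", int(v1), int(v3))

tokens = lex("437 + 734")
assert tokens == [("NUM", "437"), ("PLUS", "+"), ("NUM", "734")]
assert parse_sum(tokens) == ("sum", 437, 734)
```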
It is now typical to find suites that can generate both a lexer and parser. In the past, it was instead more common to combine two different tools: one to produce the lexer and one to produce the parser. For example, this was the case of the venerable lex and yacc couple: using lex, it was possible to generate a lexer, while using yacc, it was possible to generate a parser.
Scannerless parsers operate differently because they process directly the original text instead of processing a list of tokens produced by a lexer. That is to say, a scannerless parser works as a lexer and a parser combined.
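For contrast, here is a sketch of the scannerless approach for the same addition: there is no token list, and the parser groups the digits itself while reading characters (the function and its structure are illustrative):

```python
def parse_sum_scannerless(text):
    pos = 0

    def skip_spaces():
        nonlocal pos
        while pos < len(text) and text[pos] == " ":
            pos += 1

    def read_number():
        # The parser itself groups consecutive digits; no separate lexer.
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos].isdigit():
            pos += 1
        if start == pos:
            raise SyntaxError("expected a digit")
        return int(text[start:pos])

    left = read_number()
    skip_spaces()
    if pos >= len(text) or text[pos] != "+":
        raise SyntaxError("expected '+'")
    pos += 1
    skip_spaces()
    right = read_number()
    return ("sum", left, right)

assert parse_sum_scannerless("437 + 734") == ("sum", 437, 734)
```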
While it is certainly important to know for debugging purposes whether a particular parsing tool is a scannerless parser, in many cases this is not relevant when defining a grammar. That is because you usually still define the patterns that group sequences of characters to create (virtual) tokens; then, you combine them to obtain the final result. This is simply because it is more convenient to do so. In other words, the grammar of a scannerless parser looks very similar to one for a tool with separate steps. Again, you should not confuse how things are defined for your convenience with how things work behind the scenes.