The Do's and Don'ts of Using Regexes With Big Data
The Do's and Don'ts of Using Regexes With Big Data
This article covers how regular expressions have a perennial place in any software engineer's toolkit.
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
Regular expressions have been around for decades. They predate almost all of the tools and technologies we talk about in today’s world of Big Data, scalability, slick UIs, and machine learning. Commonly viewed as cryptic, frustrating, and difficult to learn, many developers scoff at the thought of using regular expressions for anything besides validating an e-mail address. Regular expressions are extremely powerful, however, as well as elegant, simple, and universal.
Even in today’s world of Big Data, regular expressions have a perennial place in any software engineer’s toolkit. According to formal language theory, regular expressions are as fundamental to computer science as any programming language or machine-readable data format. They are the mechanism by which unstructured data becomes structured. They turn chaos into order.
When faced with a text processing challenge, few tools are better suited to the task than regular expressions. At a certain scale however, there are performance caveats to be aware of. To explore those caveats, let’s first dive into a bit of computer science theory.
What exactly is a regular expression? A regular expression formally is the textual encoding of a nite state machine. Because of this, their expressive power is limited, but they process input extremely efficiently. They cannot, for example, determine whether or not an input is a palindrome, whereas the regular expression’s more powerful cousin, the context-free grammar, can determine whether a string is a palindrome. Regular expressions are strictly less powerful than context-free grammars, which in turn are less powerful than Turing complete languages, like C or Java. Thus, anything you could do with a regular expression, you could technically also do with a standard programming language.
Strictly speaking, regular expressions are not the same thing as the regex we know now. The two terms are often conflated, but over the years regex has evolved to be more powerful than formal regular expression, starting with Perl’s Regex engine. Features such as capture group back references are not technically possible with actual regular expressions. True regular expressions can be processed quickly because they represent nite state machines and the engines that support these regular expressions use the extremely efficient Thompson NFA algorithm (e.g. SED, AWK, and GREP). When regexes stopped representing formal regular expressions, a new algorithm was devised: the recursive backtracking algorithm. The advent of this new algorithm introduced a slew of regex performance gotchas, which are now almost a defining characteristic of regexes.
To illustrate this, let’s consider an example use case: parsing log data. Due to their predictable format, log lines can easily be matched against regexes to extract useful pieces of information. Let’s say we use the following example regex to parse log lines:
The example log lines might look something like this:
The information we’re interested in gathering is
web.1, which are extracted via the first and second capture groups. Although the above regex is simple, readable, and sufficient for extracting the needed information from the input text, a far more efficient regex would look like this:
Both regular expressions extract the same data from the same input text. The second regular expression, however, is far more specific about the format of the data it expects to match. It first encodes the format of the timestamp of the expected log lines, which will almost instantly rule out log lines that do not match. It also uses negated character classes rather than
.*’s which eliminates almost all backtracking. The inefficient regular expression, because of all of its
.*’s, backtracks excessively, which is what causes it to match so much more slowly. Since it begins with a
.*, it will consume as much text as it can, zooming down the input until it reaches the end. Then it will search backwards until it reaches the next character it’s looking for, which in this case is a space character. Once it encounters the first space from the end, it again consumes all characters until the end of the input and backtracks again in search of the next character, which is an open square bracket. It will reach the end and it won’t nd an open bracket, so it goes back to looking for spaces, hoping that if it had consumed less text in the first place, maybe it could still match the input. This process continues until either the input matches or there are no more possibilities to try. This is the recursive backtracking process that causes the exponential runtime characteristics. The efficient regular expression on the other hand, does none of this backtracking and will basically consume characters from left to right.
A simple benchmark where each regex is fed the sample input 10,000 times shows that the inefficient regex takes about 20 times longer to match than the efficient regex. This is pretty astounding since both regexes extract the same information from the same input. If you are using regexes to parse huge amounts of data, a few small adjustments in the regex definition can make the difference between needing 20 machines to do the processing, and needing 1 machine. This, obviously, translates to serious cost and latency savings.
Opinions expressed by DZone contributors are their own.