Over a million developers have joined DZone.

The Do's and Don'ts of Using Regexes With Big Data

This article covers how regular expressions have a perennial place in any software engineer's toolkit.

· Big Data Zone

Hortonworks DataFlow is an integrated platform that makes data ingestion fast, easy, and secure. Download the white paper now.  Brought to you in partnership with Hortonworks

This article is featured in the new DZone Guide to  Big Data Processing, Volume III. Get your free copy for more insightful articles, industry statistics, and more.

Regular expressions have been around for decades. They predate almost all of the tools and technologies we talk about in today’s world of Big Data, scalability, slick UIs, and machine learning. Commonly viewed as cryptic, frustrating, and difficult to learn, many developers scoff at the thought of using regular expressions for anything besides validating an e-mail address. Regular expressions are extremely powerful, however, as well as elegant, simple, and universal.

Even in today’s world of Big Data, regular expressions have a perennial place in any software engineer’s toolkit. According to formal language theory, regular expressions are as fundamental to computer science as any programming language or machine-readable data format. They are the mechanism by which unstructured data becomes structured. They turn chaos into order.

When faced with a text processing challenge, few tools are better suited to the task than regular expressions. At a certain scale however, there are performance caveats to be aware of. To explore those caveats, let’s first dive into a bit of computer science theory.

What exactly is a regular expression? A regular expression formally is the textual encoding of a nite state machine. Because of this, their expressive power is limited, but they process input extremely efficiently. They cannot, for example, determine whether or not an input is a palindrome, whereas the regular expression’s more powerful cousin, the context-free grammar, can determine whether a string is a palindrome. Regular expressions are strictly less powerful than context-free grammars, which in turn are less powerful than Turing complete languages, like C or Java. Thus, anything you could do with a regular expression, you could technically also do with a standard programming language.

Strictly speaking, regular expressions are not the same thing as the regex we know now. The two terms are often conflated, but over the years regex has evolved to be more powerful than formal regular expression, starting with Perl’s Regex engine. Features such as capture group back references are not technically possible with actual regular expressions. True regular expressions can be processed quickly because they represent nite state machines and the engines that support these regular expressions use the extremely efficient Thompson NFA algorithm (e.g. SED, AWK, and GREP). When regexes stopped representing formal regular expressions, a new algorithm was devised: the recursive backtracking algorithm. The advent of this new algorithm introduced a slew of regex performance gotchas, which are now almost a defining characteristic of regexes.

With recursive backtracking based regex engines, it is possible to craft regular expressions that match in exponential time with respect to the length of the input, whereas the Thompson NFA algorithm will always match in linear time. As the name would imply, the slower performance of the recursive backtracking algorithm is caused by the backtracking involved in processing input. This backtracking has serious consequences when working with regexes at a high scale because an inefficient regex can take orders of magnitude longer to match than an efficient regex. The standard regex engines in most modern languages, such as Java, Python, Perl, PHP, and JavaScript, use this recursive backtracking algorithm, so almost any modern solution involving regexes will be vulnerable to poorly performing regexes. Fortunately, though, in almost all cases, an inefficient regex can be optimized to be an efficient regex, potentially resulting in enormous savings in terms of CPU cycles.

To illustrate this, let’s consider an example use case: parsing log data. Due to their predictable format, log lines can easily be matched against regexes to extract useful pieces of information. Let’s say we use the following example regex to parse log lines:

Image title

The example log lines might look something like this:

Image title

The information we’re interested in gathering is app and web.1, which are extracted via the first and second capture groups. Although the above regex is simple, readable, and sufficient for extracting the needed information from the input text, a far more efficient regex would look like this:

Image title

Both regular expressions extract the same data from the same input text. The second regular expression, however, is far more specific about the format of the data it expects to match. It first encodes the format of the timestamp of the expected log lines, which will almost instantly rule out log lines that do not match. It also uses negated character classes rather than .*’s which eliminates almost all backtracking. The inefficient regular expression, because of all of its .*’s, backtracks excessively, which is what causes it to match so much more slowly. Since it begins with a .*, it will consume as much text as it can, zooming down the input until it reaches the end. Then it will search backwards until it reaches the next character it’s looking for, which in this case is a space character. Once it encounters the first space from the end, it again consumes all characters until the end of the input and backtracks again in search of the next character, which is an open square bracket. It will reach the end and it won’t nd an open bracket, so it goes back to looking for spaces, hoping that if it had consumed less text in the first place, maybe it could still match the input. This process continues until either the input matches or there are no more possibilities to try. This is the recursive backtracking process that causes the exponential runtime characteristics. The efficient regular expression on the other hand, does none of this backtracking and will basically consume characters from left to right.

A simple benchmark where each regex is fed the sample input 10,000 times shows that the inefficient regex takes about 20 times longer to match than the efficient regex. This is pretty astounding since both regexes extract the same information from the same input. If you are using regexes to parse huge amounts of data, a few small adjustments in the regex definition can make the difference between needing 20 machines to do the processing, and needing 1 machine. This, obviously, translates to serious cost and latency savings.

For more insights on machine learning, neural nets, data health, and more get your free copy of the new DZone Guide to Big Data Processing, Volume III!

Learn how you can modernize your data warehouse with Apache Hadoop. View an on-demand webinar now. Brought to you in partnership with Hortonworks.

big data,regex,regular expression

The best of DZone straight to your inbox.

Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}