DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones AWS Cloud
by AWS Developer Relations
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones
AWS Cloud
by AWS Developer Relations
The Latest "Software Integration: The Intersection of APIs, Microservices, and Cloud-Based Systems" Trend Report
Get the report
  1. DZone
  2. Data Engineering
  3. Big Data
  4. The Do's and Don'ts of Using Regexes With Big Data

The Do's and Don'ts of Using Regexes With Big Data

This article covers how regular expressions have a perennial place in any software engineer's toolkit.

Liz Bennett user avatar by
Liz Bennett
·
Sep. 06, 16 · Opinion
Like (12)
Save
Tweet
Share
15.66K Views

Join the DZone community and get the full member experience.

Join For Free

This article is featured in the new DZone Guide to  Big Data Processing, Volume III. Get your free copy for more insightful articles, industry statistics, and more.

Regular expressions have been around for decades. They predate almost all of the tools and technologies we talk about in today’s world of Big Data, scalability, slick UIs, and machine learning. Commonly viewed as cryptic, frustrating, and difficult to learn, many developers scoff at the thought of using regular expressions for anything besides validating an e-mail address. Regular expressions are extremely powerful, however, as well as elegant, simple, and universal.

Even in today’s world of Big Data, regular expressions have a perennial place in any software engineer’s toolkit. According to formal language theory, regular expressions are as fundamental to computer science as any programming language or machine-readable data format. They are the mechanism by which unstructured data becomes structured. They turn chaos into order.

When faced with a text processing challenge, few tools are better suited to the task than regular expressions. At a certain scale however, there are performance caveats to be aware of. To explore those caveats, let’s first dive into a bit of computer science theory.

What exactly is a regular expression? A regular expression formally is the textual encoding of a nite state machine. Because of this, their expressive power is limited, but they process input extremely efficiently. They cannot, for example, determine whether or not an input is a palindrome, whereas the regular expression’s more powerful cousin, the context-free grammar, can determine whether a string is a palindrome. Regular expressions are strictly less powerful than context-free grammars, which in turn are less powerful than Turing complete languages, like C or Java. Thus, anything you could do with a regular expression, you could technically also do with a standard programming language.

Strictly speaking, regular expressions are not the same thing as the regex we know now. The two terms are often conflated, but over the years regex has evolved to be more powerful than formal regular expression, starting with Perl’s Regex engine. Features such as capture group back references are not technically possible with actual regular expressions. True regular expressions can be processed quickly because they represent nite state machines and the engines that support these regular expressions use the extremely efficient Thompson NFA algorithm (e.g. SED, AWK, and GREP). When regexes stopped representing formal regular expressions, a new algorithm was devised: the recursive backtracking algorithm. The advent of this new algorithm introduced a slew of regex performance gotchas, which are now almost a defining characteristic of regexes.

With recursive backtracking based regex engines, it is possible to craft regular expressions that match in exponential time with respect to the length of the input, whereas the Thompson NFA algorithm will always match in linear time. As the name would imply, the slower performance of the recursive backtracking algorithm is caused by the backtracking involved in processing input. This backtracking has serious consequences when working with regexes at a high scale because an inefficient regex can take orders of magnitude longer to match than an efficient regex. The standard regex engines in most modern languages, such as Java, Python, Perl, PHP, and JavaScript, use this recursive backtracking algorithm, so almost any modern solution involving regexes will be vulnerable to poorly performing regexes. Fortunately, though, in almost all cases, an inefficient regex can be optimized to be an efficient regex, potentially resulting in enormous savings in terms of CPU cycles.

To illustrate this, let’s consider an example use case: parsing log data. Due to their predictable format, log lines can easily be matched against regexes to extract useful pieces of information. Let’s say we use the following example regex to parse log lines:

Image title

The example log lines might look something like this:

Image title

The information we’re interested in gathering is app and web.1, which are extracted via the first and second capture groups. Although the above regex is simple, readable, and sufficient for extracting the needed information from the input text, a far more efficient regex would look like this:

Image title

Both regular expressions extract the same data from the same input text. The second regular expression, however, is far more specific about the format of the data it expects to match. It first encodes the format of the timestamp of the expected log lines, which will almost instantly rule out log lines that do not match. It also uses negated character classes rather than .*’s which eliminates almost all backtracking. The inefficient regular expression, because of all of its .*’s, backtracks excessively, which is what causes it to match so much more slowly. Since it begins with a .*, it will consume as much text as it can, zooming down the input until it reaches the end. Then it will search backwards until it reaches the next character it’s looking for, which in this case is a space character. Once it encounters the first space from the end, it again consumes all characters until the end of the input and backtracks again in search of the next character, which is an open square bracket. It will reach the end and it won’t nd an open bracket, so it goes back to looking for spaces, hoping that if it had consumed less text in the first place, maybe it could still match the input. This process continues until either the input matches or there are no more possibilities to try. This is the recursive backtracking process that causes the exponential runtime characteristics. The efficient regular expression on the other hand, does none of this backtracking and will basically consume characters from left to right.

A simple benchmark where each regex is fed the sample input 10,000 times shows that the inefficient regex takes about 20 times longer to match than the efficient regex. This is pretty astounding since both regexes extract the same information from the same input. If you are using regexes to parse huge amounts of data, a few small adjustments in the regex definition can make the difference between needing 20 machines to do the processing, and needing 1 machine. This, obviously, translates to serious cost and latency savings.

For more insights on machine learning, neural nets, data health, and more get your free copy of the new DZone Guide to Big Data Processing, Volume III!

Big data

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Steel Threads Are a Technique That Will Make You a Better Engineer
  • When Should We Move to Microservices?
  • Strategies for Kubernetes Cluster Administrators: Understanding Pod Scheduling
  • Cloud Performance Engineering

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: