Using Zero-Width Assertions in Regular Expressions

Explore anchors, lookahead, and lookbehind assertions, which allow you to manage which characters will be included in a match and more.

By Peter Kankowski · Jul. 08, 24 · Tutorial
Anchors ^ $ \b \A \Z

Anchors in regular expressions allow you to specify the context in a string where your pattern should be matched. There are several types of anchors:

  • ^ matches the start of a line (in multiline mode) or the start of the string (by default).
  • $ matches the end of a line (in multiline mode) or the end of the string (by default).
  • \A matches the start of the string.
  • \Z or \z matches the end of the string.
  • \b matches a word boundary (before the first letter of a word or after the last letter of a word).
  • \B matches a position that is not a word boundary (between two letters or between two non-letter characters).

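These anchors are supported, with some variations, in Java, PHP, Python, Ruby, C#, and Go. For example, Python's \Z matches only at the very end of the string (like the strict \z described below), and Go supports \z but not \Z. In JavaScript, \A and \Z are not supported, but you can use ^ and $ instead; just remember to keep the multiline mode disabled.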

For example, the regular expression ^abc matches the letters "abc" only at the start of a string. In multiline mode, the same regex also matches these letters at the beginning of any line. You can use anchors in combination with other regular expression elements to create more complex matches. For example, ^From: (.*) matches a line starting with From: and captures the rest of that line.
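As a small illustration of the modes described above, here is a sketch using Python's re module. The sample text and variable names are made up for this example.

```python
import re

# Hypothetical sample text with three lines.
text = "From: alice\nTo: bob\nFrom: carol"

# Default mode: ^ matches only at the start of the string.
default_hits = re.findall(r"^From: (.*)", text)

# Multiline mode: ^ also matches right after each newline.
multiline_hits = re.findall(r"^From: (.*)", text, re.MULTILINE)

# \A ignores multiline mode and always means "start of the string".
start_only_hits = re.findall(r"\AFrom: (.*)", text, re.MULTILINE)

print(default_hits)     # ['alice']
print(multiline_hits)   # ['alice', 'carol']
print(start_only_hits)  # ['alice']
```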

The difference between \Z and \z is that \Z matches at the end of the string but also skips a possible newline character at the end. In contrast, \z is more strict and matches only at the end of the string.
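The lenient and strict end-of-string behaviors can be seen in Python, with one caveat: in Python's re module, it is $ (without multiline mode) that tolerates a trailing newline, while \Z is the strict form. The sketch below assumes Python's re module only; other engines name these anchors differently.

```python
import re

text = "abc\n"  # a string that ends with a newline

# $ matches at the end of the string or just before a trailing newline.
lenient = re.search(r"abc$", text)

# Python's \Z matches only at the very end, so the newline gets in the way.
strict = re.search(r"abc\Z", text)

# Including the newline explicitly satisfies the strict anchor.
strict_with_newline = re.search(r"abc\n\Z", text)

print(lenient is not None)              # True
print(strict is not None)               # False
print(strict_with_newline is not None)  # True
```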

If you have read the previous article, you may wonder if the anchors add any additional capabilities that are not supported by the three primitives (alternation, parentheses, and the star for repetition). The answer is that they do not, but they change what is captured by the regular expression. You can match a line starting with abc by explicitly adding the newline character: \nabc, but in this case, you will also match the newline character itself. When you use ^abc, the newline character is not consumed.

In a similar way, ing\b matches ing at the end of a word. You can replace the anchor with a character class containing non-word characters (such as spaces or punctuation): ing\W, but in this case, the regular expression will also consume the space or punctuation character, and it will not match at the very end of the string.
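The difference between consuming a character and merely asserting a position can be checked directly. This is a minimal sketch in Python; the sample text is invented for the example.

```python
import re

text = "first\nabc\nsing, ring king"

# Matching the newline explicitly makes it part of the match.
with_newline = re.search(r"\nabc", text).group()

# The ^ anchor (in multiline mode) matches the same place but consumes nothing extra.
with_anchor = re.search(r"^abc", text, re.MULTILINE).group()

# \b asserts a word boundary without consuming the following character.
boundary_hits = re.findall(r"ing\b", text)

# \W consumes the following character and needs one to exist,
# so "king" at the very end of the string is missed.
non_word_hits = re.findall(r"ing\W", text)

print(repr(with_newline))  # '\nabc'
print(repr(with_anchor))   # 'abc'
print(boundary_hits)       # ['ing', 'ing', 'ing']
print(non_word_hits)       # ['ing,', 'ing ']
```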

If the regular expression starts with ^ so that it only matches at the start of the string, it's called anchored. In some programming languages, you can do an anchored match instead of a non-anchored search without using ^. For example, in PHP (PCRE), you can use the A modifier.
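Python offers a comparable choice between an anchored match and a non-anchored search: re.match tries the pattern only at the start of the string, while re.search scans forward. A brief sketch, with made-up sample strings:

```python
import re

pattern = re.compile(r"abc")  # note: no ^ in the pattern itself

# search() scans the string and finds "abc" wherever it occurs.
search_hit = pattern.search("xxabc")

# match() is anchored: it only tries position 0.
anchored_miss = pattern.match("xxabc")
anchored_hit = pattern.match("abcxx")

print(search_hit.start())     # 2
print(anchored_miss is None)  # True
print(anchored_hit.group())   # abc
```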

So the anchors don't add any new capabilities to the regular expressions, but they allow you to manage which characters will be included in the match or to match only at the beginning or end of the string. The matched language is still regular.

Zero-Width Assertions (?= ) (?! ) (?<= ) (?<! )

Zero-width assertions (also called lookahead and lookbehind assertions) allow you to check that a pattern occurs in the subject string without including any of its characters in the match. This can be useful when you want to check for a pattern without moving the match pointer forward.

There are four types of lookaround assertions:

(?=abc) The next characters are “abc” (a positive lookahead)
(?!abc) The next characters are not “abc” (a negative lookahead)
(?<=abc) The previous characters are “abc” (a positive lookbehind)
(?<!abc) The previous characters are not “abc” (a negative lookbehind)

Zero-width assertions are generalized anchors. Just like anchors, they don't consume any character from the input string. Unlike anchors, they allow you to check anything, not only line boundaries or word boundaries. So you can replace an anchor with a zero-width assertion, but not vice versa. For example, ing\b could be rewritten as ing(?=\W|$).
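The four assertion types, and the equivalence of ing\b and ing(?=\W|$), can be tried out in Python. The sample strings below are invented for illustration; the positions printed are the offsets where each pattern matches.

```python
import re

text = "abcdef xdef abcxyz"

def starts(pattern):
    """Return the start offsets of all matches of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

print(starts(r"abc(?=def)"))   # [0]   "abc" followed by "def"
print(starts(r"abc(?!def)"))   # [12]  "abc" not followed by "def"
print(starts(r"(?<=abc)def"))  # [3]   "def" preceded by "abc"
print(starts(r"(?<!abc)def"))  # [8]   "def" not preceded by "abc"

# The anchor and the lookahead agree on each of these sample words.
samples = ["sing", "sing!", "singer", "ing_x"]
with_anchor = [bool(re.search(r"ing\b", s)) for s in samples]
with_lookahead = [bool(re.search(r"ing(?=\W|$)", s)) for s in samples]
print(with_anchor == with_lookahead)  # True
```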

Zero-width lookahead and lookbehind are supported in PHP, JavaScript, Python, Java, and Ruby. Unfortunately, they are not supported in Go.

Just like anchors, zero-width assertions still match a regular language, so from a theoretical point of view, they don't add anything new to the capabilities of regular expressions. They just make it possible to exclude certain things from the matched string, so you only check for their presence but don't consume them.

Checking Strings After and Before the Expression

The positive lookahead checks that there is a subexpression after the current position. For example, you need to find all div selectors with the footer ID and remove the div part:

Search for | Replace to | Explanation
div(?=#footer) | (empty) | “div” followed by “#footer”

(?=#footer) checks that there is the #footer string here, but does not consume it. In div#footer, only div will match. A lookahead is zero-width, just like the anchors.

In div#header, nothing will match, because the lookahead assertion fails.

Of course, this can be solved without any lookahead:

Search for | Replace to | Explanation
div#footer | #footer | A simpler equivalent

Generally, any lookahead after the expression can be rewritten by copying the lookahead text into a replacement or by using backreferences.
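The three approaches mentioned above (lookahead, copying the text into the replacement, and a backreference) can be compared side by side. This is a sketch in Python with a made-up CSS snippet.

```python
import re

css = "div#footer { margin: 0 } div#header { margin: 0 }"

# Lookahead: only "div" is matched and removed; "#footer" is checked but kept.
with_lookahead = re.sub(r"div(?=#footer)", "", css)

# No lookahead: match the whole text and put "#footer" back by hand.
with_copy = re.sub(r"div#footer", "#footer", css)

# No lookahead: capture "#footer" and restore it with a backreference.
with_backref = re.sub(r"div(#footer)", r"\1", css)

print(with_lookahead)  # #footer { margin: 0 } div#header { margin: 0 }
print(with_lookahead == with_copy == with_backref)  # True
```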

In a similar way, a positive lookbehind checks that there is a subexpression before the current position:

Search for | Replace to | Explanation
(?<=<a href=")news/ | blog/ | Replace “news/” preceded by “<a href="” with “blog/”
<a href="news/ | <a href="blog/ | The same replacement without lookbehind

The positive lookahead and lookbehind lead to a shorter regex, but you can do without them in this case. However, these were just basic examples. In some of the following regular expressions, the lookaround will be indispensable.
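The lookbehind replacement can be sketched in Python as follows. The HTML fragment is invented for the example; note that the img tag is left alone because it is not preceded by the anchor prefix.

```python
import re

html = '<a href="news/today">Today</a> <img src="news/pic.png">'

# Lookbehind: replace "news/" only when it directly follows '<a href="'.
with_lookbehind = re.sub(r'(?<=<a href=")news/', "blog/", html)

# The same result without lookbehind: repeat the prefix in the replacement.
without_lookbehind = re.sub(r'<a href="news/', '<a href="blog/', html)

print(with_lookbehind)
# <a href="blog/today">Today</a> <img src="news/pic.png">
print(with_lookbehind == without_lookbehind)  # True
```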

Testing the Same Characters for Multiple Conditions

Sometimes you need to test a string for several conditions.

For example, you want to find a consonant without listing all of them. It may seem simple at first: [^aeiouy]. However, this regular expression also finds spaces and punctuation marks, because it matches anything except a vowel, whereas you want to match any letter except a vowel. So you also need to check that the character is a letter.

(?=[a-z])[^aeiouy] | A consonant
[bcdfghjklmnpqrstvwxz] | Without lookahead

There are two conditions applied to the same character here:

[Figure: Two conditions applied to the same character]

After (?=[a-z]) is checked, the current position is moved back because a lookahead has a width of zero: it does not consume characters, but only checks them. Then, [^aeiouy] matches (and consumes) one character that is not a vowel. For example, it could be H in HTML (with case-insensitive matching).

The order is important: the regex [^aeiouy](?=[a-z]) will match a character that is not a vowel, followed by any letter. Clearly, it's not what is needed.
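Both the intended pattern and the wrongly ordered one can be run in Python to see the difference. This sketch uses case-insensitive matching so that uppercase letters satisfy [a-z]; the sample strings are made up.

```python
import re

# Lookahead first: the SAME character must be a letter and not a vowel.
consonants = re.findall(r"(?=[a-z])[^aeiouy]", "HTML 5!", re.IGNORECASE)

# Lookahead last: a non-vowel character FOLLOWED BY a letter.
# The space before "cat" qualifies, which is clearly not a consonant.
wrong_order = re.findall(r"[^aeiouy](?=[a-z])", "a cat", re.IGNORECASE)

print(consonants)   # ['H', 'T', 'M', 'L']
print(wrong_order)  # [' ', 'c']
```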

This technique is not limited to testing one character for two conditions; there can be any number of conditions of different lengths:

border:(?=[^;}]*\<solid\>)(?=[^;}]*\<red\>)(?=[^;}]*\<1px\>)[^;}]* Find a CSS declaration that contains the words solid, red, and 1px in any order.

This regex has three lookahead conditions. In each of them, [^;}]* skips any number of any characters except ; and } before the word. After the first lookahead, the current position is moved back and the second word is checked, etc.

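The anchors \< and \> match the start and the end of a word, so they check that the whole word matches. Without them, 1px would match in 21px. In engines that lack these anchors (such as PCRE, Python, and JavaScript), use \b instead.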

The last [^;}]* consumes the CSS declaration (the previous lookaheads only checked the presence of words, but didn't consume anything).

This regular expression matches {border: 1px solid red}, {border: red 1px solid;}, and {border:solid green 1px red} (different order of words; green is inserted), but doesn't match {border:red solid} (1px is missing).
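Here is a sketch of the same check in Python. Python's re module has no \< and \> anchors, so this version substitutes \b for both; the name BORDER_RE and the extra 21px sample are additions for this example.

```python
import re

# Three lookaheads test the same declaration for three words, in any order.
# \b stands in for the \< and \> word anchors.
BORDER_RE = re.compile(
    r"border:"
    r"(?=[^;}]*\bsolid\b)"
    r"(?=[^;}]*\bred\b)"
    r"(?=[^;}]*\b1px\b)"
    r"[^;}]*"  # finally consume the declaration itself
)

samples = [
    "{border: 1px solid red}",
    "{border: red 1px solid;}",
    "{border:solid green 1px red}",
    "{border:red solid}",        # 1px is missing
    "{border: 21px solid red}",  # 21px must not count as 1px
]
results = [bool(BORDER_RE.search(s)) for s in samples]

print(results)  # [True, True, True, False, False]
print(BORDER_RE.search(samples[0]).group())  # border: 1px solid red
```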

Simulating Overlapped Matches

If you need to remove repeating words (e.g., replace the the with just the), you can do it in two ways, with and without lookahead:

Search for | Replace to | Explanation
\<(\w+)\s+(?=\1\>) | (empty) | Replace the first of repeating words with an empty string
\<(\w+)\s+\1\> | \1 | Replace two repeating words with the first word

The regex with lookahead works like this: the first parentheses capture the first word; the lookahead checks that the next word is the same as the first one.

[Figure: Regex with lookahead]

The two regular expressions look similar, but there is an important difference. When replacing 3 or more repeating words, only the regex with lookahead works correctly. The regex without lookahead replaces every two words. After replacing the first two words, it moves to the next two words because the matches cannot overlap:

[Figure: Regex without lookahead]

However, you can simulate overlapped matches with lookaround. The lookahead will check that the second word is the same as the first one. Then, the second word will be matched against the third one, etc. Every word that has the same word after it will be replaced with an empty string:

[Figure: Simulate overlapped matches with lookaround]

The correct regex without lookahead is \<(\w+)(\s+\1)+\> with \1 as the replacement. It matches any number of repeating words (not just two of them).
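The three variants can be compared on a string with three repeated words. This is a sketch in Python, where \b replaces the \< and \> anchors; the sample sentence is made up.

```python
import re

text = "the the the cat"

# Lookahead: each word followed by a copy of itself is removed,
# so the matches effectively overlap.
with_lookahead = re.sub(r"\b(\w+)\s+(?=\1\b)", "", text)

# No lookahead, pairs only: the first two words are consumed together,
# so the third copy is never compared with the second.
pairs_only = re.sub(r"\b(\w+)\s+\1\b", r"\1", text)

# No lookahead, repeated group: matches any number of repetitions at once.
repeated_group = re.sub(r"\b(\w+)(\s+\1)+\b", r"\1", text)

print(with_lookahead)  # the cat
print(pairs_only)      # the the cat
print(repeated_group)  # the cat
```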

Checking Negative Conditions

The negative lookahead checks that the next characters do NOT match the expression in parentheses. Just like a positive lookahead, it does not consume the characters. For example, (?!toves) checks that the next characters are not “toves” without including them in the match.

<\?(?!php) “<?” without “php” after it

This pattern will match <? in <?echo 'text'?> or in <?xml.
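A quick way to check this pattern is to run it against a few opening tags. The sketch below uses Python; the sample strings extend the ones mentioned above with a regular PHP tag.

```python
import re

SHORT_TAG_RE = re.compile(r"<\?(?!php)")  # "<?" not followed by "php"

samples = ["<?php echo 1; ?>", "<?echo 'text'?>", '<?xml version="1.0"?>']
results = [bool(SHORT_TAG_RE.search(s)) for s in samples]

print(results)  # [False, True, True]

# The lookahead is zero-width: only "<?" is part of the match.
print(SHORT_TAG_RE.search(samples[1]).group())  # <?
```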

Another example is an anagram search. To find anagrams for “mate”, check that the first character is one of M, A, T, or E. Then, check that the second character is one of these letters and is not equal to the first character. After that, check the third character, which has to be different from the first and the second one, etc.

\<([mate])(?!\1)([mate])(?!\1)(?!\2)([mate])(?!\1)(?!\2)(?!\3)([mate])\> Anagram for “mate”

The sequence (?!\1)(?!\2) checks that the next character is not equal to the first subexpression and is not equal to the second subexpression.

The anagrams for “mate” are: meat, team, and tame. Certainly, there are special tools for anagram search, which are faster and easier to use.
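The anagram pattern can be tested in Python against a small word list. In this sketch \b replaces \< and \>, and the word list is made up. Note that the pattern accepts any arrangement of the four letters, so the original word "mate" matches as well.

```python
import re

# Each new letter must differ from every letter captured before it.
ANAGRAM_RE = re.compile(
    r"\b([mate])"
    r"(?!\1)([mate])"
    r"(?!\1)(?!\2)([mate])"
    r"(?!\1)(?!\2)(?!\3)([mate])\b"
)

words = ["meat", "team", "tame", "mate", "matt", "teem", "metal"]
hits = [w for w in words if ANAGRAM_RE.search(w)]

print(hits)  # ['meat', 'team', 'tame', 'mate']
```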

A lookbehind can be negative, too, so it's possible to check that the previous characters do NOT match some expression:

\w+(?<!ing)\b A word that does not end with “ing” (the negative lookbehind)
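The negative lookbehind can be tried in Python as written, because (?<!ing) has a fixed length of three characters. The sample sentence is invented for the example.

```python
import re

text = "sing a song running fast"

# A whole word whose last three characters are not "ing".
not_ing = re.findall(r"\w+(?<!ing)\b", text)

print(not_ing)  # ['a', 'song', 'fast']
```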

In many regex engines (for example, PCRE and Python's re module), a lookbehind must have a fixed length: you can use character lists and classes ([a-z] or \w), but not repetitions such as * or +. Aba Search and Replace is free from this limitation. You can go back by any number of characters; for example, you can find files not containing a word and insert some text at the end of such files.

Search for | Replace to | Explanation
(?<!Table of contents.*)$$ | <a href="/toc">Contents</a> | Insert the link at the end of each file not containing the words “Table of contents”
^^(?!.*Table of contents) | <a href="/toc">Contents</a> | Insert it at the beginning of each file not containing the words

However, you should be careful with this feature because an unlimited-length lookbehind can be slow.
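In an engine with fixed-length lookbehind, the same task can be approached with a lookahead anchored at the start of the text instead. The sketch below assumes Python's re module, which rejects the variable-length lookbehind at compile time; the helper names accepts_pattern and append_toc_link are hypothetical, and \A with DOTALL stands in for the ^^ start-of-file anchor.

```python
import re

LINK = '<a href="/toc">Contents</a>'

def accepts_pattern(pattern: str) -> bool:
    """Report whether Python's re module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

def append_toc_link(text: str) -> str:
    """Append the link when the phrase appears nowhere in the text."""
    # \A pins the check to the start; with DOTALL, .* can cross newlines,
    # so the negative lookahead scans the whole text.
    if re.search(r"\A(?!.*Table of contents)", text, re.DOTALL):
        return text + LINK
    return text

print(accepts_pattern(r"(?<!Table of contents.*)$"))  # False
print(accepts_pattern(r"(?<!ing)\b"))                 # True
print(append_toc_link("Intro\nChapter 1"))
# Intro
# Chapter 1<a href="/toc">Contents</a>
```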

Controlling Backtracking

A lookahead and a lookbehind do not backtrack; that is, when they have found a match and another part of the regular expression fails, they don't try to find another match. It's usually not important, because lookaround expressions are zero-width. They consume nothing and don't move the current position, so you cannot see which part of the string they match.

However, you can extract the matching text if you use a subexpression inside the lookaround. For example:

Search for | Replace to | Explanation
(?=\<(\w+)) | \1 | Repeat each word
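A sketch of this replacement in Python, with \b in place of \<. The match itself is empty, so the captured word is inserted in front of the original word; the sample string is made up.

```python
import re

text = "hello world"

# The lookahead matches at the start of each word without consuming it,
# but the group inside it still captures the word.
doubled = re.sub(r"(?=\b(\w+))", r"\1", text)

print(doubled)  # hellohello worldworld
```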

Since lookarounds don't backtrack, this regular expression never matches:

(?=(\N*))\1\N | A regex that doesn't backtrack and always fails
\N*\N | A regex that backtracks and succeeds on non-empty lines

The subexpression (\N*) matches the whole line. \1 consumes the previously matched subexpression and \N tries to match the next character. It always fails because the next character is a newline.

A similar regex without lookahead succeeds because when the engine finds that the next character is a newline, \N* backtracks. At first, it has consumed the whole line (“greedy” match), but now it tries to match fewer characters. And it succeeds when \N* matches all but the last character of the line and \N matches the last character.
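The two patterns can be compared in Python. Python's re module has no \N shorthand, so this sketch uses . (which excludes newlines by default) as a stand-in; the sample text is made up.

```python
import re

text = "abc\ndef"

# The lookahead captures the rest of the line once and never retries,
# so after \1 consumes it there is no non-newline character left for ".".
no_backtrack = re.search(r"(?=(.*))\1.", text)

# Without the lookahead, .* gives back one character so "." can match.
backtracks = re.search(r".*.", text)

print(no_backtrack)        # None
print(backtracks.group())  # abc
```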

It's possible to prevent excessive backtracking with a lookaround, but it's easier to use atomic groups for that.

In a negative lookaround, subexpressions are meaningless because if a regex succeeds, negative lookarounds in it must fail. So, the subexpressions are always equal to an empty string. It's recommended to use a non-capturing group instead of the usual parentheses in a negative lookaround.

(?!(a))\1 A regex that always fails: (not A) and A
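This behavior depends on how an engine treats a backreference to a group that never matched. In Python's re module such a backreference fails, which agrees with the claim above; JavaScript, by contrast, treats it as an empty match, so results may differ there. A minimal sketch, assuming Python:

```python
import re

samples = ["abc", "bcd", "", "a a"]

# Wherever the negative lookahead succeeds, group 1 is unset, so \1 fails.
with_group = [re.search(r"(?!(a))\1", s) for s in samples]

# A non-capturing group states the intent without a useless capture.
non_capturing = re.search(r"(?!(?:a))\w", "abc")

print(all(m is None for m in with_group))  # True
print(non_capturing.group())               # b
```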

Published at DZone with permission of Peter Kankowski. See the original article here.
