Regular Expressions 101
All regular expression features can be expressed with three primitives (alternation, parentheses, and the Kleene star). Everything else is syntactic sugar.
With regular expressions, you can describe patterns that are similar to each other. For example, suppose you have multiple <img> tags and want to move all these images to the images folder:
<img src="9.png"> → <img src="images/9.png">
<img src="10.png"> → <img src="images/10.png">
and so on
You can easily write a regular expression that matches all file names that are numbers and then replace all such tags at once.
Basic Syntax
If you need to match one of the alternatives, use an alternation (vertical bar). For example:
Regex | Meaning |
a|img|h1|h2 | either a, or img, or h1, or h2 |
When using alternation, you often need to group characters together; you can do this with parentheses. For example, if you want to match an HTML tag, this approach won't work:
Regex | Meaning |
<h1|h2|b|i> | <h1 or h2 (without the angle brackets) or b or i> |
This is because < applies to the first alternative only and > applies to the last one only. To apply the angle brackets to all alternatives, you need to group the alternatives together:
<(h1|h2|b|i)>
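The difference is easy to check by hand. The following is a minimal sketch that assumes Python's re module (the article is not tied to any language) and uses re.fullmatch so that the whole string has to match:

```python
import re

ungrouped = r"<h1|h2|b|i>"    # four alternatives: "<h1", "h2", "b", "i>"
grouped = r"<(h1|h2|b|i)>"    # the angle brackets apply to every alternative

def matches_whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

print(matches_whole(ungrouped, "<b>"))   # False
print(matches_whole(ungrouped, "h2"))    # True: a bare "h2" is accepted
print(matches_whole(grouped, "<b>"))     # True
print(matches_whole(grouped, "h2"))      # False
```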
The last primitive (star) allows you to repeat anything zero or more times. You can apply it to one character, for example:
Regex | Meaning |
a* | an empty string, a, aa, aaa, aaaa, etc. |
You can also apply it to multiple characters in parentheses:
Regex | Meaning |
(ab)* | an empty string, ab, abab, ababab, abababab, etc. |
Note that if you remove the parentheses, the star will apply to the last character only:
Regex | Meaning |
ab* | a, ab, abb, abbb, abbbb, etc. (the a is required, so an empty string does not match) |
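To see which strings each of the three patterns accepts, here is a small sketch, again assuming Python's re module. The helper name accepts and the candidate list are made up for this illustration. Note that ab* requires the leading a, so it does not accept the empty string:

```python
import re

def accepts(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ["", "a", "ab", "abb", "abab", "aab"]

print(accepts(r"a*", candidates))     # ['', 'a']
print(accepts(r"(ab)*", candidates))  # ['', 'ab', 'abab']
print(accepts(r"ab*", candidates))    # ['a', 'ab', 'abb']
```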
The star is named the Kleene star after the American mathematician Stephen Kleene, who invented regular expressions in the 1940s-1950s. It can match an empty string as well as any number of repetitions.
These three primitives (alternation, parentheses, and the star for repetition) are enough to write any regular expression, but the syntax may be verbose. For example, you can now write a regex for matching the file names that are numbers in an <img> tag:
Regex | Meaning |
(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)* | one or more digits |
(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)* | a positive integer number (don't allow zero as the first character) |
The parentheses may be nested without a limit, for example:
Regex | Meaning |
(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*(,(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*)* | one or more positive integer numbers, separated with commas |
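Long patterns like this one are easier to follow when they are assembled from named pieces. The sketch below builds the same regex in Python using only alternation, parentheses, and the star; the variable and function names are hypothetical and chosen for this example only:

```python
import re

# Build the verbose pattern from the three primitives only.
digit = "(" + "|".join("0123456789") + ")"     # (0|1|...|9)
nonzero = "(" + "|".join("123456789") + ")"    # (1|2|...|9)
integer = nonzero + digit + "*"                # a positive integer
int_list = integer + "(," + integer + ")*"     # integers separated with commas

def is_int_list(text):
    """Return True if text is one or more positive integers separated with commas."""
    return re.fullmatch(int_list, text) is not None

print(is_int_list("7"))          # True
print(is_int_list("10,205,3"))   # True
print(is_int_list("10,,3"))      # False: empty item
print(is_int_list("007"))        # False: leading zero
```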
Convenient Shortcuts for Character Classes
You can write any regex with the three primitives, but it quickly becomes hard to read, so a few shortcuts were invented. When you need to match any one of the listed characters, put them into square brackets:
Regex | Shorter regex | Meaning |
a|e|i|o|u|y | [aeiouy] | a vowel |
0|1|2|3|4|5|6|7|8|9 | [0123456789] | a digit |
0|1|2|3|4|5|6|7|8|9 | [0-9] | a digit |
a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z | [a-z] | a lowercase letter |
As you can see, it's possible to specify only the first and the last allowed characters if you put a dash between them. There may be several such ranges inside square brackets:
Regex | Meaning |
[a-z0-9] | a lowercase letter or a digit |
[a-z0-9_] | a lowercase letter, a digit, or the underscore character |
[a-f0-9] | a hexadecimal digit (lowercase only) |
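As a small illustration of ranges, the sketch below uses the [a-f0-9] class to check a six-digit color code. The color-code format and the function name is_hex_color are assumptions made for this example, and Python's re module is used only for demonstration. Because the class lists lowercase letters only, uppercase digits are rejected:

```python
import re

hex_digit = "[a-f0-9]"

def is_hex_color(text):
    """Check for '#' followed by exactly six lowercase hexadecimal digits."""
    # The class is simply repeated six times, as in copy-and-paste repetition.
    return re.fullmatch("#" + hex_digit * 6, text) is not None

print(is_hex_color("#ff8800"))   # True
print(is_hex_color("#FF8800"))   # False: A-F is not in the class
print(is_hex_color("#ff88"))     # False: only four digits
```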
There are some predefined character classes that are even shorter to write:
Regex | Meaning |
\s | a whitespace character: the space, the tab character, the new line, or the carriage return |
\d | a digit |
\w | a word character (a letter, a digit, or the underscore character) |
. | any character (in most dialects, any character except a line break by default) |
In Aba Search and Replace, these character classes include Unicode characters such as accented letters or Unicode line breaks. In many other regex dialects, they include ASCII characters only (at least by default), so \d is typically the same as [0-9] and \w is the same as [a-zA-Z0-9_].
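Dialects differ on this point, so it is worth checking the one you use. As one data point, Python's re module treats \d and \w as Unicode-aware for text patterns by default and restricts them to ASCII only when the re.ASCII flag is passed. The sample characters below are chosen for this sketch:

```python
import re

arabic_three = "\u0663"   # ARABIC-INDIC DIGIT THREE
e_acute = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE

unicode_digit = re.fullmatch(r"\d", arabic_three) is not None
ascii_digit = re.fullmatch(r"\d", arabic_three, flags=re.ASCII) is not None
unicode_word = re.fullmatch(r"\w", e_acute) is not None
ascii_word = re.fullmatch(r"\w", e_acute, flags=re.ASCII) is not None

print(unicode_digit, ascii_digit)   # True False
print(unicode_word, ascii_word)     # True False
```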
Character classes don't add any new capabilities to regular expressions; you could just list all allowed characters with an alternation, but a character class is much shorter to write. We can now write a shorter version of the regex mentioned before:
Regex | Meaning |
[1-9][0-9]*(,[1-9][0-9]*)* | one or more positive integer numbers, separated with commas |
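A spot check can confirm that the short form accepts the same strings as the verbose one. This sketch assumes Python's re module and a hand-picked sample list, so it is a sanity check and not a proof. The short pattern keeps the star inside the group ([0-9]*), so numbers of any length are allowed after a comma:

```python
import re

verbose = ("(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*"
           "(,(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*)*")
short = r"[1-9][0-9]*(,[1-9][0-9]*)*"

samples = ["1", "42,7", "100,200,300", "", "0", "1,", "1,02", "12a"]

def verdicts(pattern):
    """Return one True/False per sample: does the pattern match it in full?"""
    return [re.fullmatch(pattern, s) is not None for s in samples]

print(verdicts(short) == verdicts(verbose))   # True
print(verdicts(short))
# [True, True, True, False, False, False, False, False]
```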
Repetitions
A Kleene star means "repeating zero or more times," but you often need another number of repetitions. As shown before, you can just copy-and-paste a regex to repeat it twice or three times, but there is a shorter notation for that:
Regex | Shorter regex | Meaning |
\d\d* | \d+ | one or more digits |
(0|1)(0|1)* | [01]+ | any binary number (consisting of zeros and ones) |
(\s|) | \s? | either a space character or nothing |
http(s|) | https? | either http or https |
(-|\+|) | [-+]? | the minus sign, the plus sign, or nothing |
[a-z][a-z] | [a-z]{2} | two lowercase letters |
[a-z][a-z]((([a-z]|)[a-z]|)[a-z]|) | [a-z]{2,5} | from two to five lowercase letters |
[a-z][a-z][a-z]* | [a-z]{2,} | two or more lowercase letters |
So there are the following repetition operators:
- A Kleene star * means repeating zero or more times, so it may not match at all, or it can match once, twice, three times, etc.;
- A plus sign + means repeating one or more times, so it must match at least once;
- A question mark ? marks an optional part: zero times or once;
- Curly brackets {m,n} mean repeating from m to n times; {n} means exactly n times, and {m,} means m or more times.
Note that you can express any repetition with the curly brackets, so these operators partially duplicate each other. For example:
Regex | Shorter regex | Meaning |
\d{0,} | \d* | nothing or some digits |
\d{1,} | \d+ | one or more digits |
\s{0,1} | \s? | either a space character or nothing |
Just like the Kleene star, the other repetition operators can apply to parentheses, so you can nest them indefinitely.
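The equivalences from the tables above can be spot-checked by comparing each longer form with its shorter form on a handful of strings. This is a sketch assuming Python's re module; the sample strings and the helper name same_on_samples are made up for the illustration, and agreeing on samples is not a proof of equivalence:

```python
import re

# Pairs of (longer form, shorter form) taken from the tables above.
pairs = [
    (r"\d\d*", r"\d+"),
    (r"http(s|)", r"https?"),
    (r"(-|\+|)", r"[-+]?"),
    (r"[a-z][a-z]((([a-z]|)[a-z]|)[a-z]|)", r"[a-z]{2,5}"),
    (r"[a-z][a-z][a-z]*", r"[a-z]{2,}"),
    (r"\d{0,}", r"\d*"),
]
samples = ["", "7", "2024", "http", "https", "httpss", "-", "+", "+-",
           "a", "ab", "abcde", "abcdef"]

def same_on_samples(long_form, short_form):
    """Return True if both patterns agree on every sample string."""
    return all(
        bool(re.fullmatch(long_form, s)) == bool(re.fullmatch(short_form, s))
        for s in samples
    )

for long_form, short_form in pairs:
    print(short_form, same_on_samples(long_form, short_form))   # True for each pair
```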
Escaping
If you need to match any of the special characters like parentheses, vertical bar, plus, or star, you must escape them by adding a backslash \ before them. For example, to find a number in parentheses, use \(\d+\).
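Here is a short sketch of the difference between escaped and unescaped parentheses, assuming Python's re module and a made-up sample sentence. It also shows re.escape, a helper in that module that adds the backslashes when a piece of text is meant literally:

```python
import re

text = "See note (12) and footnote (345)."

# Escaped parentheses match the literal "(" and ")" characters.
literal = re.findall(r"\(\d+\)", text)
print(literal)   # ['(12)', '(345)']

# Unescaped parentheses only group; findall then returns the group contents.
grouped = re.findall(r"(\d+)", text)
print(grouped)   # ['12', '345']

# re.escape builds a pattern that matches the given text character by character.
special = "(a|b)*"
escaped_ok = re.fullmatch(re.escape(special), special) is not None
print(escaped_ok)   # True
```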
A common mistake is to forget the backslash before a dot. Note that a dot means any character, so if you write example.com in a regular expression, it will also match examplexcom or something similar, which may even cause a security issue in your program. Now, we can write a regex to match the <img> tags:
<img src="\d+\.png">
This matches any filename consisting of digits, and we correctly escaped the dot.
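To complete the example from the introduction, the sketch below performs the replacement in Python. It relies on two things the article has not covered: parentheses also capture the text they match, and \1 in the replacement string of re.sub inserts that captured text. The sample HTML is made up for the illustration:

```python
import re

html = '<img src="9.png"> <img src="10.png"> <img src="logo.png">'

# The parentheses capture the file name; \1 in the replacement inserts it back.
pattern = r'<img src="(\d+\.png)">'
moved = re.sub(pattern, r'<img src="images/\1">', html)
print(moved)
# <img src="images/9.png"> <img src="images/10.png"> <img src="logo.png">

# Without the backslash, the dot accepts any character in that position.
unescaped_hit = re.fullmatch(r'<img src="\d+.png">', '<img src="9xpng">') is not None
escaped_hit = re.fullmatch(pattern, '<img src="9xpng">') is not None
print(unescaped_hit, escaped_hit)   # True False
```

Only the numeric file names are rewritten; logo.png is left alone because it does not match \d+.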
Other Features
Modern regex engines add more features, such as backreferences or conditional subpatterns. Mathematically speaking, these features go beyond regular expressions; they can describe non-regular languages, so you cannot replace them with the three primitives.
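A classic illustration is a backreference that forces two parts of the string to be identical. In the sketch below (Python's re module, with a pattern chosen for this example), \1 must repeat exactly what the first group matched, so the pattern accepts the same number of a characters on both sides of the b, which is a textbook non-regular language:

```python
import re

# (a*) captures a run of a's; \1 requires the same run again after the b.
balanced = r"(a*)b\1"

def is_balanced(text):
    """Return True if text is n a's, then b, then exactly n a's."""
    return re.fullmatch(balanced, text) is not None

print(is_balanced("aabaa"))    # True
print(is_balanced("b"))        # True (zero a's on each side)
print(is_balanced("aabaaa"))   # False: two a's before, three after
```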
Next time, we will discuss anchors and zero-width assertions.
Published at DZone with permission of Peter Kankowski. See the original article here.