Over a million developers have joined DZone.

Table Those Regular Expressions...Periodically

Here's a neat example of regular expressions, writing a regex to match chemical elements.

· Java Zone

Navigate the Maze of the End-User Experience and pick up this APM Essential guide, brought to you in partnership with CA Technologies

Here’s a frivolous exercise in regular expressions: Write a regex to match any chemical element symbol.

Here’s one solution.

A[cglmrstu]|B[aehikr]?|C[adeflmnorsu]?|D[bsy]|E[rsu]|F[elmr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airuv]|M[dgnot]|N[abdeiop]?|Os?|P[abdmortu]|R[abefghnu]|S[bcegimnr]?|T[abcehilm]|U(u[opst])?|V|W|Xe|Yb?|Z[nr]

Making it More Readable

Here’s the same expression in more readable form:

/
A[cglmrstu]     | 
B[aehikr]?      | 
C[adeflmnorsu]? | 
D[bsy]          | 
E[rsu]          | 
F[elmr]?        | 
G[ade]          | 
H[efgos]?       | 
I[nr]?          | 
Kr?             | 
L[airuv]        | 
M[dgnot]        | 
N[abdeiop]?     | 
Os?             | 
P[abdmortu]     | 
R[abefghnu]     | 
S[bcegimnr]?    | 
T[abcehilm]     | 
U(u[opst])?     | 
V               | 
W               | 
Xe              | 
Yb?             | 
Z[nr]
/x

The /x option in Perl says to ignore white space. Other regular expression implementations have something similar. Python has two such options, X for similarity with Perl, and VERBOSE for readability. Both have the same behavior.

Regex Syntax

The regular expression says that a chemical element symbol may start with A, followed by c, g, l, m, r, s, t, or u; or a B, optionally followed by a, e, h, i, k, or r; or …

The most complicated part of the regex is the part for symbols starting with U. There’s Uranium whose symbols is simply U, and there are the elements who have temporary names based on their atomic numbers: Ununtrium, Ununpentium, Ununseptium, and Ununoctium. These are just Latin for one-one-three, one-one-five, one-one-seven, and one-one-eight. The symbols are U, Uut, Uup, Uus, and Uuo. The regex U(u[opst])? can be read “U, optionally followed by u and one of o, p, s, or t.”

Note that the regex will match any string that contains a chemical element symbol, but it could match more. For example, it would match “I’ve never been to Boston in the fall” because that string contains B, the symbol for boron. Exercise: Modify the regex to only match chemical element symbols.

Regex Golf

There may be clever ways to use fewer characters at the cost of being more obfuscated. But this is for fun anyway, so we’ll indulge in a little regex golf.

There are five elements whose symbols start with I or Z: I, In, Ir, Zn, and Zr. You could write [IZ][nr] to match four of these. The regex I|[IZ][nr] would represent all five with 10 characters while I[nr]?|Z[nr] uses 12. Two characters saved! Can you cut out any more?

Regex Resources

Thrive in the application economy with an APM model that is strategic. Be E.P.I.C. with CA APM.  Brought to you in partnership with CA Technologies.

Topics:
regex ,perl ,python ,humor

Published at DZone with permission of John Cook, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

SEE AN EXAMPLE
Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.
Subscribe

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}