Proper String Normalization for Comparison Purposes

This article exposes some of the most important points to take into consideration when comparing strings from different systems.

Bertrand Florat

Updated Jan. 22, 21 · Tutorial

Likes (3)

Comment

Save

15.1K Views

TL;DR

In Java, do:

    Java
   
xxxxxxxxxx

String normalizedString = Normalizer.normalize(originalString,Normalizer.Form.NFKD)
.replaceAll("[^\\p{ASCII}]", "").toLowerCase().replaceAll("\\s{2,}", " ").trim();

Introduction to String Normalization

Nowadays, most strings are Unicode-encoded and we are able to work with many different native characters with diacritical signs/accents (like ö, é, À) or ligatures (like æ or ʥ). Characters can be stored in UTF-8 (for instance) and associated glyphs can be displayed properly if the font supports them.

However, we often observe recurring difficulties when comparing strings issued from different information systems and/or initially typed by humans.

The human brain is a machine to fill gaps. Hence it has absolutely no problem reading or typing 'e' instead of 'ê'.

But what if the word 'tête' ('head' in French) is correctly stored in an UTF-8 encoded database, but you have to compare it with text created by an end-user that's the missing accent?

We also have to deal with legacy systems or modern ones filled up with legacy data that doesn't support the Unicode standard.

Another simple illustration of this problem is the use of ligatures. Imagine a product database storing various items with an ID and a description. Some items contain ligatures (a combination of several letters joined together to create a single character like ’Œuf’ - egg in French). Like most French people, I have no idea of how to produce such a character, even using a French keyboard. I would search for the items descriptions using oeuf. Obviously, our code has to take care of ligatures if we want to return a useful result containing ’Œuf’.

How can we fix that mess?

Rule #1: Don't Ever Compare Human Text if You Can

When you can, never compare strings to heterogeneous systems. It is surprisingly tricky to do it properly (even if it is possible to handle most cases like we will see below). Instead, compare sequences, UUID or any other ASCII-based strings without spaces or ‘special’ characters. Strings that come from different information systems will probably store data differently (lower/upper case, with/without diacritics, etc.). On the contrary, good ids are made up of plain ASCII strings and are thus free from encoding issues.

Example:

System 1 : {"id":"8b286f72-b366-47a4-9537-59d39411979a","desc":"Œeuf brouillé"}

System 2 : {"id":"8b286f72-b366-47a4-9537-59d39411979a","desc":"OEUF BROUILLE"}

If you compare ids, everything is simple and you can go home early. If you compare descriptions, you'll have to normalize it as a prerequisite or you'll be in big trouble.

Character normalization is the action of computing a canonical form of a string. To avoid false positives when comparing strings coming from several information systems, normalize both strings and compare the result of their normalization.

In the previous example, we would compare normalize("Œeuf brouillé") with normalize("OEUF BROUILLE"). Using a proper normalization function, we should then compare 'oeuf brouille' with 'oeuf brouille' but if the normalization function is buggy or partial, the strings would mismatch. For example, if the normalize() function doesn't handle ligatures properly, you would get a false positive by comparing 'œuf brouille' with 'oeuf brouille'.

Rule #2: Normalize in Memory

It is better to compare strings at the last possible moment and to do so in memory and not to normalize strings at storage time. There are at least two reasons for this:

If you only store a normalized version of your string, you lose information. You may need proper diacritics later for display purposes or other reasons. As an IT professional, one of your tasks is to never lose information humans provided you.
What if some items have been stored before the normalization routine has been set up? What if the normalization function changed over time?

To avoid these common pitfalls, simply compare in memory normalize(<data system 1>) with normalize(<data system 2>). The CPU overhead should be negligible if you don't compare thousands of items per second.

Rule #3: Always Trim Externally and Internally

Another common trap when dealing with strings typed by humans is the presence of spaces at the beginning or in the middle of a sequence of characters.

As an example, look at these strings: ' Wiliam' (note the space at the beginning), 'Henry ' (note the space at the end), 'Gates III' (see the double space in the middle of this family name, did you notice it at first?).

Appropriate solution:

Trim the text to remove spaces at the beginning and at the end of the text.
Remove superfluous spaces in the middle of the string.

In Java, one of the ways to achieve this is:

    Java
   
          x
         
s = s.replaceAll("\\s{2,}", " ").trim();

Rule #4: Harmonize Letter Casing

This is the most well-known and straightforward normalization method: simply put every letter into lower or upper case. As far as I know, there is no preference for one or the other choice. Most developers (myself included) use lower case.

In Java, just use toLowerCase():

    Java
   
xxxxxxxxxx

s = s.toLowerCase();

Rule #5: Transform Characters With Diacritical Signs to ASCII

When typed out, diacritical signs are often omitted in favor of their ASCII version. For example, one can type the German word 'schon' instead of 'schön'.

Unicode proposes four Normalization forms that may be used for that purpose (NFC, NFD, NFKD, and NFKC). Check-out this enlightening illustration.

Detailing all these forms would go beyond the scope of this article but, basically, some Unicode characters can be encoded either as a single combined character or in a decomposed form. For instance, 'é' can be encoded as \u00e9 code point or as the decomposed form '\u0065' ('e' letter) + '\u0301' (the diacritic '◌́'') afterward.

We will perform an NFD ("Canonical Decomposition") normalization method on the initial text to make sure that every character with an accent is converted to its decomposed form. Then, all we have to do is to drop the diacritics and only keep the 'base' simple characters.

In Java, both operations can be done this way:

    Java
   
xxxxxxxxxx

s = Normalizer.normalize(s, Normalizer.Form.NFD)
    .replaceAll("[^\\p{ASCII}]", "");

Note: even if the code covers this issue, I prefer the NFKD transformation to deal with ligatures as well (see below).

Rule #6: Decompose Ligatures to a Set of ASCII Characters

The other thing to understand is that Unicode maintains some compatibility mapping between about 5000 ‘composite’ characters (like ligatures or precomposed Roman numerals) and a list of regular characters. Characters supporting this feature are documented (check the 'decomposition' attribute in the Unicode characters documentation).

For instance; the Roman numeral Ⅻ (U+216B) can be decomposed with NFKD normalization as an 'X' and two 'I's. Likewise, the ĳ (U+0133) character (like in 'fĳn' - 'nice' in Dutch) can be decomposed into an 'i' and a 'j'.

For these kinds of 'Siamese twin' characters, we have to apply the NFKD ("Compatibility Decomposition") normalization form that both decomposes the characters (see 'Rule #5' previously) but also maps ligatures to several 'base' characters. You can then drop the remaining diacritics.

In Java, use:

    Java
   
          x
         
s = Normalizer.normalize(s, Normalizer.Form.NFKD)
    .replaceAll("[^\\p{ASCII}]", "");

Now the bad news: for obscure reasons, Unicode doesn't support the decomposition equivalence of some widely used ligatures like the French 'œ' and 'æ' or the German eszett 'ß'. If you need to handle them, you will have to write your own replacements before applying the NFKD normalization:

    Java
   
xxxxxxxxxx

    s = s.replaceAll("œ", "oe");
    s = s.replaceAll("æ", "ae");
    s = Normalizer.normalize(s, Normalizer.Form.NFKD)
    .replaceAll("[^\\p{ASCII}]", "");

Rule #7: Beware Punctuation

This is more of a minor issue, but, according to the context, you may want to normalize some special punctuation characters as well.

For example, in a literary context, like a text-revision software, it may be a good idea to map the em/long dash ('—') character to the regular ASCII hyphen ('-').

As far as I know, Unicode doesn't provide mapping for that, so you'll have to do it yourself the old fashioned way:

    Java
   
xxxxxxxxxx

s = s.replaceAll("—", "-");

Final Word

String normalization is very helpful when comparing strings issued from different systems or to perform appropriate comparisons. Even fully English localized projects can benefit from it, for instance to take care of casing or trailing spaces or when dealing with foreign words with accents.

This article exposes some of the most important points to take into consideration but it is far from exhaustive. For instance, we omitted Asian character manipulation or cultural normalization of semantically equivalents items (like the 'St' abbreviation of 'Saint') but I hope it is a good start for most projects.

Strings Data Types Java (programming language) Comparison (grammar) Space (architecture)

Published at DZone with permission of Bertrand Florat. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

Trending