Searching for Names in all the Wrong Places
Zone Leader Tim Spann shows us how to use the Soundex library with Elasticsearch to find people's names based on phonetic similarities and variations (Jim vs. Jimmy, etc.).
Algorithms for Searching for People
As you can imagine, searching for people's names is not trivial. Besides the usual text issues of mixed case, you have name variations, nicknames, and the like. Phonetic searching, which matches names by how they sound rather than how they are spelled, helps here.
You want to find Catherine when you search for Katherine. And searching for James should find you Jim and Jimmy.
There are a number of algorithms and libraries to help you do this more advanced name matching.
Soundex is the standard and is very commonly used. It's in most of the major databases and is reasonably good at finding matches. Soundex was used to index US census records, so it had a pretty good test data set.
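To make the encoding concrete, here is a minimal sketch of American Soundex in plain Java. The class name `SoundexSketch` is made up for illustration. It is a simplified stand-in, not the Apache Commons Codec implementation, and it assumes plain A to Z input (any other character is skipped).

```java
import java.util.Locale;

class SoundexSketch {
    // Digit for each letter A..Z; '0' marks letters that are not coded
    // (vowels plus H, W, and Y).
    private static final String CODES = "01230120022455012623010202";

    static String soundex(String name) {
        StringBuilder key = new StringBuilder();
        char lastCode = '0';
        for (char raw : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') {
                continue; // assumption: ignore anything outside A..Z
            }
            char code = CODES.charAt(raw - 'A');
            if (key.length() == 0) {
                key.append(raw); // the first letter is kept as-is
                lastCode = code;
            } else {
                // Append a digit unless it repeats the previous code.
                if (code != '0' && code != lastCode) {
                    key.append(code);
                }
                // H and W do not separate two letters with the same code;
                // vowels do, because they reset lastCode to '0'.
                if (raw != 'H' && raw != 'W') {
                    lastCode = code;
                }
            }
            if (key.length() == 4) {
                break;
            }
        }
        while (key.length() < 4) {
            key.append('0'); // pad short keys with zeros
        }
        return key.toString();
    }

    public static void main(String[] args) {
        System.out.println(soundex("Timothy"));   // T530
        System.out.println(soundex("Tim"));       // T500
        System.out.println(soundex("Catherine")); // C365
        System.out.println(soundex("Katherine")); // K365
    }
}
```

Because Soundex keeps the first letter, Catherine and Katherine get different keys (C365 and K365) even though the rest of the code is identical. That is one reason to look at the other algorithms as well.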
For those lucky enough to have Elasticsearch, there are a number of heavy-duty options. The examples I have been talking about relate to US/English names. Obviously, other languages and countries have their own algorithms that make more sense for them.
So how do I put this cool searching into practice? Here is a list of some common Java solutions:
import org.apache.commons.codec.language.Soundex;
import org.apache.commons.codec.language.Nysiis;
import org.apache.commons.codec.language.DoubleMetaphone;
// ...
Soundex soundex = new Soundex();
String soundexTimothy = soundex.encode("Timothy");
String soundexTim = soundex.encode("Tim");
String soundexTimmy = soundex.encode("Timmy");
// Timothy = T530, Tim = T500, Timmy = T500
Nysiis nysiis = new Nysiis();
String nysiisTimothy = nysiis.encode("Timothy");
// Timothy = TANATY, Tim = TAN, Timmy = TANY
DoubleMetaphone doubleMetaphone = new DoubleMetaphone();
String metaphoneTimothy = doubleMetaphone.encode("Timothy");
// Timothy = TM0, Tim = TM, Timmy = TM
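Levenshtein distance, which counts the single-character edits needed to turn one string into another, is another option, or at least an enhancer for phonetic matching.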
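As a rough illustration of the edit-distance idea, here is a minimal sketch of Levenshtein distance in plain Java using the usual dynamic-programming table, kept to two rows. The class name `LevenshteinSketch` is made up for this example, and the comparison is case-sensitive, so you would probably normalize case first.

```java
class LevenshteinSketch {
    // Returns the number of single-character insertions, deletions,
    // and substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into ""
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            int[] tmp = prev; // reuse the two rows instead of a full table
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("Catherine", "Katherine")); // 1
        System.out.println(distance("Jim", "Jimmy"));           // 2
    }
}
```

One way to use it as an enhancer is to take the candidates that share a phonetic key and rank them by this distance, so the closest spellings come first.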
A slightly better alternative to Soundex is NYSIIS. NYSIIS is implemented in Java by the Apache Commons Codec library.
NYSIIS is also pretty simple to implement on your own.
Metaphone is also very good, and it too is included in Apache Commons Codec, the Swiss army knife of text searching.
In my example code, for my name, it seems Double Metaphone is the best. For really advanced queries you may need to use multiple algorithms. Since Apache Commons Codec has them all and they all expose the same encode method, you should have no issues integrating this into your Java 8, Spring, Hadoop, or Spark code. It would be really easy to write a REST service in Spring Boot that looks up names and similar names with Apache Commons Codec, running in a Cloud Foundry instance.
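Here is one hedged sketch of what combining multiple algorithms could look like. `MultiEncoderMatcher`, `prefixKey`, and `consonantKey` are hypothetical names, and the two stand-in encoders are deliberately crude so the example runs without any dependency. With Apache Commons Codec on the classpath, you would presumably pass method references to the real encoders' `encode` methods instead.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

class MultiEncoderMatcher {
    private final List<UnaryOperator<String>> encoders;

    MultiEncoderMatcher(List<UnaryOperator<String>> encoders) {
        this.encoders = encoders;
    }

    // Counts how many encoders produce the same key for both names.
    int agreement(String a, String b) {
        int count = 0;
        for (UnaryOperator<String> encoder : encoders) {
            if (encoder.apply(a).equals(encoder.apply(b))) {
                count++;
            }
        }
        return count;
    }

    // Two names match when at least minAgreement encoders agree.
    boolean matches(String a, String b, int minAgreement) {
        return agreement(a, b) >= minAgreement;
    }

    // Hypothetical stand-in encoder: first three letters, uppercased.
    static String prefixKey(String name) {
        String upper = name.toUpperCase(Locale.ROOT);
        return upper.substring(0, Math.min(3, upper.length()));
    }

    // Hypothetical stand-in encoder: uppercase with A, E, I, O, U, Y removed.
    static String consonantKey(String name) {
        return name.toUpperCase(Locale.ROOT).replaceAll("[AEIOUY]", "");
    }

    // Builds a matcher from the two stand-ins above (assumption: in real
    // code these would be replaced by phonetic encoders).
    static MultiEncoderMatcher withStandInEncoders() {
        List<UnaryOperator<String>> standIns = List.of(
                MultiEncoderMatcher::prefixKey,
                MultiEncoderMatcher::consonantKey);
        return new MultiEncoderMatcher(standIns);
    }

    public static void main(String[] args) {
        MultiEncoderMatcher matcher = withStandInEncoders();
        System.out.println(matcher.agreement("Tim", "Timmy")); // 1
        System.out.println(matcher.matches("Tim", "tim", 2));  // true
    }
}
```

Requiring agreement from more encoders tightens the match, while accepting any single agreement widens it. Which threshold works best depends on your data.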