
Searching for Names in all the Wrong Places


Zone Leader Tim Spann shows us how to use the Soundex library with Elasticsearch to find people's names based on phonetic similarities and variations (Jim vs. Jimmy, etc.).


Algorithms for Searching for People

As you can imagine, searching for people's names is not trivial. Besides the usual text issues of mixed case, you have name variations, nicknames, and the like. Phonetic searching, which matches names by how they sound rather than how they are spelled, helps with many of these.

You want to find Catherine when you search for Katherine. And searching for James should find you Jim and Jimmy.

There are a number of algorithms and libraries to help you do this more advanced name matching.

Soundex is the standard and is very commonly used. It's in most of the major databases and is reasonably good at finding matches. Soundex was used to index US census records, so it had a pretty good test data set.
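To make the idea concrete, here is a minimal sketch of the classic American Soundex rules in plain Java. The class name `SimpleSoundex` is my own for illustration; it is not the implementation shipped by any database or library, and real implementations may treat edge cases differently.

```java
// Illustrative sketch of American Soundex; SimpleSoundex is a hypothetical name.
final class SimpleSoundex {
    // Soundex digit for each letter A..Z; '0' marks letters with no code
    // (vowels plus H, W, and Y).
    private static final String CODES = "01230120022455012623010202";

    static String encode(String name) {
        // Keep only the letters A-Z, upper-cased.
        StringBuilder letters = new StringBuilder();
        for (char c : name.toUpperCase().toCharArray()) {
            if (c >= 'A' && c <= 'Z') {
                letters.append(c);
            }
        }
        if (letters.length() == 0) {
            return "";
        }

        // The first letter is kept as-is; its code still suppresses an
        // immediately following letter with the same code.
        StringBuilder out = new StringBuilder();
        out.append(letters.charAt(0));
        char last = CODES.charAt(letters.charAt(0) - 'A');

        for (int i = 1; i < letters.length() && out.length() < 4; i++) {
            char c = letters.charAt(i);
            if (c == 'H' || c == 'W') {
                continue; // H and W do not separate letters with equal codes
            }
            char code = CODES.charAt(c - 'A');
            if (code != '0' && code != last) {
                out.append(code);
            }
            last = code; // a vowel resets 'last', so a repeated code after it counts again
        }

        // Pad with zeros to the fixed length of four characters.
        while (out.length() < 4) {
            out.append('0');
        }
        return out.toString();
    }
}
```

With this sketch, `SimpleSoundex.encode("Tim")` and `SimpleSoundex.encode("Timmy")` both give `T500`, while `"Timothy"` gives `T530`. Note that Soundex always keeps the first letter, so `"Catherine"` (`C365`) and `"Katherine"` (`K365`) do not match under Soundex alone.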

For those lucky enough to have Elasticsearch, there are a number of heavy-duty options. The examples I have been talking about relate to US/English names. Obviously, there are other languages and countries that have their own algorithms that make more sense for them.

So how do I put this cool searching into practice? Here is a list of some common Java solutions:

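import org.apache.commons.codec.language.DoubleMetaphone;
import org.apache.commons.codec.language.Nysiis;
import org.apache.commons.codec.language.Soundex;

// ...

// Soundex: Timothy = T530, Tim = T500, Timmy = T500
Soundex soundex = new Soundex();
String soundexTimothy = soundex.encode("Timothy");
String soundexTim = soundex.encode("Tim");
String soundexTimmy = soundex.encode("Timmy");

// NYSIIS: Timothy = TANATY, Tim = TAN, Timmy = TANY
Nysiis nysiis = new Nysiis();
String nysiisTimothy = nysiis.encode("Timothy");
String nysiisTim = nysiis.encode("Tim");
String nysiisTimmy = nysiis.encode("Timmy");

// Double Metaphone: Timothy = TM0, Tim = TM, Timmy = TM
DoubleMetaphone doubleMetaphone = new DoubleMetaphone();
String metaphoneTimothy = doubleMetaphone.encode("Timothy");
String metaphoneTim = doubleMetaphone.encode("Tim");
String metaphoneTimmy = doubleMetaphone.encode("Timmy");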


Levenshtein Distance is another option or at least an enhancer.
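Levenshtein distance counts the single-character insertions, deletions, and substitutions needed to turn one string into another. Here is a minimal sketch in plain Java using the standard two-row dynamic programming approach. The class name `Levenshtein` and its `distance` method are my own for illustration and do not come from a library.

```java
// Illustrative sketch; the class and method names are hypothetical.
final class Levenshtein {
    /** Returns the number of single-character edits needed to turn a into b. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty prefix of a into the first j characters of b
        // takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // i deletions to reach the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For example, `Levenshtein.distance("Catherine", "Katherine")` is 1 and `Levenshtein.distance("Tim", "Timmy")` is 2. One way to use it as an enhancer is to take the candidates that share a phonetic code and rank them by edit distance from the search term, lower-casing both sides first since the comparison here is case-sensitive.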

A slightly better alternative to Soundex is NYSIIS, which is implemented in Java by the Apache Commons Codec library.

NYSIIS is also pretty simple to implement on your own.

Metaphone is also very good, and it too is in the Swiss Army knife of text searching, Apache Commons Codec.

In my example code, for my name, it seems Double Metaphone is the best. For really advanced queries, you may need to use multiple algorithms. Since Apache Commons Codec has them all and they all expose the same encode method, you should have no issues integrating this into your Java 8, Spring, Hadoop, or Spark code. It would be really easy to write a REST service that looks up names and similar names in Spring Boot with Apache Commons Codec running in a Cloud Foundry instance.
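As a rough sketch of what using multiple algorithms could look like, the class below treats each encoder as a plain `Function<String, String>` and counts how many of them give two names the same key. The `MultiEncoderMatcher` name, its methods, and the voting threshold are all my own assumptions for illustration, not part of Apache Commons Codec.

```java
import java.util.List;
import java.util.function.Function;

// Illustrative sketch; MultiEncoderMatcher and its methods are hypothetical.
final class MultiEncoderMatcher {
    private final List<Function<String, String>> encoders;

    MultiEncoderMatcher(List<Function<String, String>> encoders) {
        this.encoders = encoders;
    }

    /** Counts how many encoders produce the same non-empty key for both names. */
    int agreeingEncoders(String a, String b) {
        int votes = 0;
        for (Function<String, String> encoder : encoders) {
            String keyA = encoder.apply(a);
            String keyB = encoder.apply(b);
            if (keyA != null && !keyA.isEmpty() && keyA.equals(keyB)) {
                votes++;
            }
        }
        return votes;
    }

    /** Treats two names as a match when at least minVotes encoders agree. */
    boolean matches(String a, String b, int minVotes) {
        return agreeingEncoders(a, b) >= minVotes;
    }
}
```

With Apache Commons Codec on the classpath, each encoder could be supplied as a lambda such as `s -> soundex.encode(s)`, with similar lambdas for the NYSIIS and Double Metaphone encoders. Requiring one vote favors recall; requiring all of them favors precision.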


Topics:
java, algorithm, data, search algorithms
