
Searching for Names in all the Wrong Places


Zone Leader Tim Spann shows us how to use the Soundex library with Elasticsearch to find people's names based on phonetic similarities and variations (Jim vs. Jimmy, etc.).

· Big Data Zone ·

Algorithms for Searching for People

As you can imagine, searching for people's names is not trivial. Besides the usual text issues of mixed case, you have name variations, nicknames, and the like. Phonetic searching, which matches names by how they sound rather than how they are spelled, helps with many of these cases.

You want to find Catherine when you search for Katherine. And searching for James should find you Jim and Jimmy.

There are a number of algorithms and libraries to help you do this more advanced name matching.

Soundex is the standard and is very commonly used. It's in most of the major databases and is reasonably good at finding matches. Soundex was created for the US census, so its designers had a pretty good test data set.
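To make the idea concrete, here is a minimal sketch of the classic American Soundex rules in plain Java: keep the first letter, turn the following consonants into digits, collapse adjacent repeats, and pad to four characters. It is a simplified illustration rather than the Commons Codec implementation, and the class name SoundexSketch is made up for this example.

```java
import java.util.Locale;

// Illustrative only: a hand-rolled American Soundex, not the library version.
final class SoundexSketch {
    // Digit for each letter A..Z; '0' marks vowels and other letters with no code.
    // Letters:                          ABCDEFGHIJKLMNOPQRSTUVWXYZ
    private static final String CODES = "01230120022455012623010202";

    static String encode(String name) {
        // Keep ASCII letters only, upper-cased.
        StringBuilder letters = new StringBuilder();
        for (char c : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (c >= 'A' && c <= 'Z') {
                letters.append(c);
            }
        }
        if (letters.length() == 0) {
            return "";
        }

        StringBuilder out = new StringBuilder();
        out.append(letters.charAt(0)); // the first letter is kept as-is
        char previous = CODES.charAt(letters.charAt(0) - 'A');
        for (int i = 1; i < letters.length() && out.length() < 4; i++) {
            char c = letters.charAt(i);
            if (c == 'H' || c == 'W') {
                continue; // H and W do not separate letters with the same code
            }
            char code = CODES.charAt(c - 'A');
            if (code != '0' && code != previous) {
                out.append(code);
            }
            previous = code; // a vowel resets this, so a repeated code after it counts again
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }
}
```

With this sketch, SoundexSketch.encode("Timothy") gives T530, while "Tim" and "Timmy" both give T500. Note that the first letter is kept verbatim, so names that sound alike but start with different letters get different codes.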

For those lucky enough to have Elasticsearch, there are a number of heavy-duty options. The examples I have been talking about relate to US/English names; obviously, other languages and countries have their own algorithms that make more sense for them.

So how do I put this cool searching into practice? Here is a list of some common Java solutions:

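import org.apache.commons.codec.language.DoubleMetaphone;
import org.apache.commons.codec.language.Nysiis;
import org.apache.commons.codec.language.Soundex;

// ...

// Soundex: first letter plus three digits
Soundex soundex = new Soundex();
String soundexTimothy = soundex.encode("Timothy");
String soundexTim = soundex.encode("Tim");
String soundexTimmy = soundex.encode("Timmy");
// Timothy = T530, Tim = T500, Timmy = T500

// NYSIIS: letter-based code
Nysiis nysiis = new Nysiis();
String nysiisTimothy = nysiis.encode("Timothy");
String nysiisTim = nysiis.encode("Tim");
String nysiisTimmy = nysiis.encode("Timmy");
// Timothy = TANATY, Tim = TAN, Timmy = TANY

// Double Metaphone: primary code
DoubleMetaphone doubleMetaphone = new DoubleMetaphone();
String metaphoneTimothy = doubleMetaphone.encode("Timothy");
String metaphoneTim = doubleMetaphone.encode("Tim");
String metaphoneTimmy = doubleMetaphone.encode("Timmy");
// Timothy = TM0, Tim = TM, Timmy = TM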


Levenshtein distance, which counts the single-character edits needed to turn one string into another, is another option, or at least an enhancer.
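As a rough illustration, Levenshtein distance can be computed with a small dynamic-programming loop. The sketch below is plain Java with no library dependency; the class and method names are made up for this example, and the comparison is case-sensitive, so you would normally lower-case both names first.

```java
// Illustrative only: classic two-row dynamic-programming Levenshtein distance.
final class LevenshteinSketch {
    // Returns the number of single-character insertions, deletions, and
    // substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // turning "" into the first j chars of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // turning the first i chars of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = current[j - 1] + 1;
                int delete = previous[j] + 1;
                int substitute = previous[j - 1] + cost;
                current[j] = Math.min(Math.min(insert, delete), substitute);
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }
}
```

Here distance("Catherine", "Katherine") is 1 and distance("Tim", "Timmy") is 2. One way to use it as an enhancer is to rank the candidates that a phonetic match returns, putting the smallest distances first.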

A slightly better alternative to Soundex is NYSIIS, which is implemented in Java by the Apache Commons Codec library.

NYSIIS is also pretty simple to implement on your own.

Metaphone is also very good, and it too is in the Swiss Army knife of text searching, Apache Commons Codec.

In my example code, for my name, it seems Double Metaphone is the best. For really advanced queries, you may need to use multiple algorithms. Since Apache Commons Codec has them all and they all expose the same encode(String) method, you should have no issues integrating this into your Java 8, Spring, Hadoop, or Spark code. It would be really easy to write a REST service in Spring Boot that looks up names and similar names with Apache Commons Codec, running in a Cloud Foundry instance.
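One possible shape for the lookup behind such a service is an in-memory index that files each name under its phonetic code, once per algorithm, and returns every name that shares a code with the query. The sketch below is an assumption about how you might wire it, not code from the original example: PhoneticNameIndex and crudeKey are hypothetical names, and crudeKey is only a stand-in encoder so the block runs without Commons Codec on the classpath.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical helper, not part of Commons Codec: groups names by phonetic code.
final class PhoneticNameIndex {
    private final List<Function<String, String>> encoders;
    private final Map<String, Set<String>> namesByKey = new HashMap<>();

    PhoneticNameIndex(List<Function<String, String>> encoders) {
        this.encoders = new ArrayList<>(encoders);
    }

    // Files the name under one key per encoder; the key is prefixed with the
    // encoder's position so codes from different algorithms never collide.
    void add(String name) {
        for (int i = 0; i < encoders.size(); i++) {
            String key = i + ":" + encoders.get(i).apply(name);
            namesByKey.computeIfAbsent(key, k -> new TreeSet<>()).add(name);
        }
    }

    // Returns every indexed name that shares a code with the query under at
    // least one encoder, in alphabetical order.
    Set<String> similarTo(String query) {
        Set<String> matches = new TreeSet<>();
        for (int i = 0; i < encoders.size(); i++) {
            Set<String> names = namesByKey.get(i + ":" + encoders.get(i).apply(query));
            if (names != null) {
                matches.addAll(names);
            }
        }
        return matches;
    }

    // Stand-in encoder for illustration only: the first letter plus the later
    // consonants, with adjacent repeats collapsed. Not a real phonetic algorithm.
    static String crudeKey(String name) {
        StringBuilder key = new StringBuilder();
        for (char c : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (c < 'A' || c > 'Z') {
                continue;
            }
            boolean first = key.length() == 0;
            boolean skipped = "AEIOUYHW".indexOf(c) >= 0;
            boolean repeat = !first && key.charAt(key.length() - 1) == c;
            if (first || (!skipped && !repeat)) {
                key.append(c);
            }
        }
        return key.toString();
    }
}
```

With only crudeKey registered and the names Tim, Timmy, Timothy, and Tom added, similarTo("Tim") returns Tim, Timmy, and Tom. In real code you would likely replace the stand-in with method references to the Commons Codec encoders, for example new Soundex()::encode and new DoubleMetaphone()::encode, since their encode(String) methods fit the same Function shape.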
