Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Implementing Soundex Using LINQ (With Help From OzCode)

DZone's Guide to

Implementing Soundex Using LINQ (With Help From OzCode)

Soundex is like the party game charades. A lot of words sound similar but have very different meanings. Here's a remarkably simple algorithm that's been used for years to disambiguate misspellings and mispronunciations.

· Big Data Zone
Free Resource

Need to build an application around your data? Learn more about dataflow programming for rapid development and greater creativity. 

A while ago I came across the very interesting Soundex algorithm. It’s a way to find similarity between words based on how they sound – I’ll let Wikipedia explain:

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter. Soundex is the most widely known of all phonetic algorithms (in part because it is a standard feature of popular database software such as DB2, PostgreSQL, MySQL, Ingres, MS SQL Server and Oracle) and is often used (incorrectly) as a synonym for "phonetic algorithm". Improvements to Soundex are the basis for many modern phonetic algorithms.

So basically Soundex can help you fix spelling mistakes by finding the word you meant to use based on how the words sound — so if you accidentally searched the Internet for “Drawer Hellber” you’ll still be able find my blog:

image

Actually you won’t, but you get the point

It’s fairly easy to follow the steps of the algorithm (as defined by Wikipedia):

  1. Retain the first letter of the name and drop all other occurrences of a, e, I, o, u, y, h, w.
  2. Replace consonants with digits as follows (after the first letter):
    • b, f, p, v → 1
    • c, g, j, k, q, s, x, z → 2
    • d, t → 3
    • l → 4
    • m, n → 5
    • r → 6
  3. If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by 'h' or 'w' are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
  4. If you have too few letters in your word that you can't assign three numbers, append with zeros until there are three numbers. If you have more than 3 letters, just retain the first 3 numbers.

In a clear case of “When you have an hammer everything looks like a nail,” I thought to myself, "why not implement this algorithm in LINQ?" And so I came up with the following code:

public string Encode(string word)
{
    if (IsNullOrEmpty(word))
    {
        return Empty;
    }

    return word
        .Select((ch, index) => EncodeCharacter(word, ch, index))
        .Where((encodedChar, index) => 
                    encodedChar.IsValidEncoding && encodedChar.Curr != encodedChar.Prev)
        .Select(arg => arg.Curr)
        .Concat(Enumerable.Repeat("0", MaxEncodedLength))
        .Take(MaxEncodedLength)
        .Aggregate((I, j) => I + j);
}

If you’re interested the entire Encoder.cs code file, it can be found here.

Although the algorithm seems simple enough, I had four bugs in my initial implementation but they were quickly squashed using OzCode with it’s new LINQ debugging feature (in the Early Access Preview). Now that I’ve got it working, I’m going to use OzCode to show you how the Soundex algorithm processes “Drawer” and “Dror” (one of which is not my name) and check that they both provide the same results.

Let’s start with “Drawer” — it should be encoded as D660.

Stopping the debugger at the beginning of the LINQ query shows us that indeed the result of the LINQ query is “D660”, but it also shows us numbers at the end of each operator – those numbers indicate how many items were returned from each LINQ operator.

image

Looking at these numbers, I can tell that from the 6 characters in the beginning, only three were left after Where. The rest were either Invalid or the same as the letter before (rules 1-3). Then we concatenated 4 ‘0’ and then taken the first four characters (Take) and from there it was a simple case of aggreagating all of the letters into a single string.

Let see it step by step using OzCode’s Detailed LINQ tool window:

Step 1: First letter saved and encode the rest of the letters according to rule #2 (‘*’ means invalid encoding):

image

As you can see, ‘D’ was kept and the two ‘r’s were encoded as ‘6’s while ‘a’, ‘w’ and ‘e’ were replaced with stars (invalid character).

Next we get rid of invalid and duplicate letters:

image

As you can tell by the red X’s, all of the invalid characters were thrown away.

image

The Select is a simple conversion from my result type into simple strings.

With your permission I’ll jump directly to Aggregate where I stitch the strings together:

image

Simple.

In the case of “Dror” the result is similar

DrorSoundex

How cool is that?

If you want to find out more about OzCode and LINQ debugging – try the EAP, or better yet come to visit us at the OzCode booth at NDC Oslo – we’ve just landed and we plan on having a great week. 

Check out the Exaptive data application Studio. Technology agnostic. No glue code. Use what you know and rely on the community for what you don't. Try the community version.

Topics:
search algorithms ,soundex ,nlp

Published at DZone with permission of Dror Helper, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

THE DZONE NEWSLETTER

Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

X

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}