Saturday, April 30, 2011

Soundex for String Matching

Soundex is a useful function for performing data matching

While you can use a Soundex function to help identify potential duplicate strings, I don't recommend it. Here's why:

  • The algorithm keeps the first letter as-is and encodes the consonants that follow as digits

  • A vowel is not encoded unless it is the first letter

  • Adjacent consonants that share a digit are coded only once

  • Similar-sounding consonants share the same digit

  • C, G, J, K, Q, S, X, and Z are all encoded with the same digit


To illustrate the impact of this type of encoding, let's look at an example of Soundex codes for variations of my first name, William.
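
For anyone who wants to reproduce the comparison, here is a minimal Python sketch of the standard American Soundex rules (one letter plus three digits). Treat it as an illustration under assumptions rather than the exact function shipped in any product: the name `soundex` and the `CODES` table are hypothetical, and database implementations can differ in how they treat H, W, and non-letter characters.

```python
# Illustrative sketch of American Soundex; not tied to any specific product.
# Letter groups and the digit each group shares.
CODES = {}
for letters, digit in (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return a 4-character Soundex code, or "" if name has no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    first = letters[0]
    result = first                  # the first letter is kept, never encoded
    prev = CODES.get(first, "")     # digit of the previous coded consonant

    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code:
            if code != prev:        # repeated digits collapse into one
                result += code
            prev = code
        elif ch not in "HW":
            prev = ""               # a vowel (or Y) separates repeated digits
        # H and W are skipped without separating repeated digits

    return (result + "000")[:4]     # zero-pad or truncate to 4 characters


if __name__ == "__main__":
    for variation in ("William", "Will", "Bill", "Billy"):
        print(f"{variation:<8} {soundex(variation)}")
```

Under these rules the sketch yields W450 for William, W400 for Will, and B400 for both Bill and Billy.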



As you can see from the brief example above, Soundex codes fall short of matching like strings. One of my biggest issues with Soundex can be illustrated by comparing the typical nicknames for William. Only Billy and Bill are coded alike, while Will is not coded similarly to Bill or William. Because the first letter is kept as-is, a name starting with W can never share a code with one starting with B.

I plan to dig deeper into Soundex functions and their applicability in a future blog post. In the meantime, I wanted to get this observation of mine out there for public consumption.

Comments:

  1. Yes, I agree. On top of that, Soundex is an algorithm that has since been improved upon by Metaphone and Double Metaphone.

  2. Soundex has been pretty much deprecated in favor of newer algorithms such as Double Metaphone. At Aware Research we often use multiple algorithms for phonetic similarity and then run a second pass to select and rank the most probable matches.

  3. I think it is important that people realize the real limitations of Soundex. It is embedded in a lot of technologies, specifically Microsoft products like Dynamics CRM and SQL Server.

  4. Two-pass validation is often the most thorough process. Thanks for stopping by and commenting, Justin.

