While you can use a Soundex function in the process of identifying potential duplicate strings, I don't recommend it. Here's why ...
- The algorithm encodes consonants
- Vowels are not encoded; a vowel is kept only when it is the first letter
- The code is capped at the first letter plus three digits, so consonants beyond that point are not coded
- Similar sounding consonants share the same digit
- C, G, J, K, Q, S, X, and Z are all encoded with the same digit (2)
To illustrate the impact of this type of encoding, let's look at an example of Soundex codes for variations of my first name, William.
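As a rough sketch of the rules listed above, the following Python implements the common American Soundex variant and prints the codes for William and three of its nicknames. The function name `soundex`, the lookup table `CODES`, and the choice of names are illustrative assumptions, not the original example from this post, and other Soundex variants (for instance the one built into a given database) may differ in edge cases.

```python
# Minimal American Soundex sketch (illustrative; names here are assumptions).
CODES = {}
for digit, letters in {
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return a 4-character Soundex code, or "" if name has no A-Z letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]                 # the first letter is kept as-is
    prev = CODES.get(letters[0], "")    # its digit still suppresses a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not separate repeated digits
        code = CODES.get(ch, "")        # vowels and Y have no digit
        if code and code != prev:
            result += code
        prev = code                     # a vowel resets prev, allowing a repeat
    return (result + "000")[:4]         # pad with zeros, cap at four characters


for name in ["William", "Will", "Bill", "Billy"]:
    print(name, soundex(name))
# William W450
# Will W400
# Bill B400
# Billy B400
```

Because the first letter is carried over unchanged, no code beginning with W can ever equal a code beginning with B, however similar the rest of the name is.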
As you can see from the brief example above, Soundex codes fall short of matching like strings. One of my biggest issues with Soundex can be illustrated by comparing the typical nicknames for William. Only Billy and Bill receive the same code, while Will is not coded similarly to Bill or William.
I plan to dig deeper into Soundex functions and their applicability in a future blog post. In the meantime, I wanted to get this observation of mine out there for public consumption.
Yes, I agree. On top of that, Soundex is an algorithm that has since been improved upon by Metaphone and Double Metaphone.
Soundex has been pretty much deprecated in favor of newer algorithms such as double-metaphone. At Aware Research we often use multiple algorithms for phonetic similarity and then run a second pass to select and rank the most probable match(es).
I think it is important that people realize the real limitations of Soundex. It is embedded in a lot of technologies, specifically Microsoft products like Dynamics CRM and SQL Server.
Two-pass validation is often the most thorough process. Thanks for stopping by and commenting, Justin.