The Data Quality Chronicle: Hamming Distance Matching Algorithm

Introduction

Last month we examined, at a high level, each of the matching algorithms available in Informatica's data quality tool, Data Quality Workbench or IDQ. In this month's edition we'll dive deeper into one of those options, the Hamming distance algorithm.

As we touched on last month, the Hamming algorithm score measures the minimum number of substitutions required to change one string into the other. Often this algorithm is used when analyzing strings which represent numeric strings.

A Little Background

The Hamming distance is named after Richard Hamming. Hamming was an American mathematician whose accomplishments include many advancements in Information Science. Not the least of which was his work on The Manhattan Project in 1945. One notable contribution of his work is his philosophy on scientific computing which appeared as the preface to his book in 1962 on numerical methods (Numerical Methods for Scientists and Engineers) which read:

The purpose of computing is insight, not numbers

A Little about the algorithm

Perhaps as a result of Hamming's time at Bell Laboratories, the Hamming distance algorithm is most often associated with the analysis of telephone numbers. However the advantages of the algorithm are applicable to various types of strings and is not limited to numeric strings.

There is one condition that needs to be adhered to when using this algorithm that is worth noting. The strings analyzed need to be of the same length. Since the Hamming distance algorithm is based on the "cost" of transposing one string into another, strings of unequal length will result in high penalties for transposition. If you are using the Hamming algorithm to analyze telephone numbers it is critically important to cleanse the data before analyzing it. For instance, if not all telephone numbers include an area code than the area codes that are in the data need to be parsed out before analysis.

Hamming Distance based checks

Hamming distance based checks determine the number of errors between two strings. If we want to detect the number of errors (x) in a string we can map every string (y) into a bigger y+x+1 string so that the minimum Hamming distance between each valid mapping is x+1.

Hamming Distance in Informatica Data Quality Workbench

In Data Quality workbench implementing the Hamming Distance algorithm is very simple and is done via adding the Hamming component from the Component Palette. Refer to the figure below for a look at the Hamming icon.

[caption id="attachment_388" align="alignnone" width="76" caption="Informatica Data Quality Workbench Hamming Distance icon"]

[/caption]

Once the Hamming component has been added the your IDQ plan it is time to configure it. Configuration options include the specification of Inputs, Parameters, & Outputs. Refer to the figure below for a look at the configuration options.

[caption id="attachment_389" align="alignleft" width="792" caption="Configuration Options of the Hamming Distance Component"]

[/caption]

As the name indicates the Inputs tab is where to specify what data elements will be matched in the component. It is important to note that when configuring the inputs there is a requirement to select two data elements for each match desired. Without going into extensive details when the data is grouped Informatica IDQ formats the data so that each element will have two instances. One instance is labeled "x_1" and the other is labeled "x_2" where x is the data element name. Both the x_1 and the x_2 data element need to be selected.

The parameters tab is where you configure some important options. The Reverse Hamming option is a check box that can be selected that configures the component to read the string from right to left instead of the default left to right. The remaining two options dictate what the match score will be when null values exist in the pair of strings. The Single Null Match Value setting indicates the match score when one field in the pair of matched values is null. The Both Null Match Value setting indicates the match score when both fields are null.

The Outputs tab is where you can define the name of the output value. Pretty straight forward there.

Now for the good part, the results! Refer to the to the figure below for a look at what results look like when using the Hamming distance component.

[caption id="attachment_396" align="alignleft" width="780" caption="Results of the Hamming distance match analysis in Informatica Data Quality workbench"]

[/caption]

I've masked the results to ensure data privacy however it is still useful in describing the results. As you can see in the sample above, row two resulted in a Hamming distance score of 1 because both values are identical. However the data in row one is not identical and consequently resulted in a Hamming distance score of approximately 0.93. This is due to the transposition "cost" of turning "ZBL ZSSXCNZTES" into "ZBT ZSSXCNZTES". The penalty of 0.07 was due entirely to the "ZBL" into "ZBT".

Summary

Among Richard Hamming's many accomplishments is the development of an algorithm to compare various types of strings of the same length to determine how different they are. Due to the requirement of equal length, the algorithm is primarily used to detect differences in numeric strings but can be used with textual data as well.

Informatica has incorporated the Hamming algorithm into the data quality workbench tool in order to produce a match score. The Hamming component requires the selection of at least two inputs, it can be configured to handle data with nulls and will output a match score. In IDQ a Hamming match score of one (1) indicates a perfect match while a Hamming match score of zero (0) indicates that there was no correlation between the two values being analyzed.

I've used the Hamming component in IDQ to analyze match possibilities in telephone numbers and postal codes. I've found it to be reliable in detecting true positive matches and sensitive enough to detect even slight differences (as indicated in the sample data above). I hope this review will help those of you interested in using the Hamming component in IDQ or those just interested in developing knowledge of the algorithm.

Thank you for reading this month's edition of The Data Quality Chronicle. Stop by again next month when I detail the Jaro-Winkler algorithm and it's implementation in Informatica IDQ.

The Data Quality Chronicle

Monday, January 4, 2010

Hamming Distance Matching Algorithm