Deterministic matching is a rules-based approach that uses "fuzzy" matching algorithms, which score similarity by the number of changes required to make two or more strings equivalent. This technique, one of my favorites, is based on the theory that data entry can be flawed, and it accounts for those flaws by measuring the edits needed to reconcile two values.
For example, deterministic matching will identify Smith and Smyth as a possible match because only one edit is required to make Smyth equal to Smith (the substitution of an i for the y).
Talend offers the following deterministic algorithms:
- Soundex
- Metaphone
- Levenshtein
- Jaro-Winkler
I have covered Soundex in a previous post, and while it has its applications, I personally do not recommend it for matching. Soundex assigns a phonetic code to each string, and matching is done on the codes assigned.
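To see why those codes can be too coarse, here is a minimal sketch using the Soundex implementation from Apache Commons Codec (the library choice is mine for illustration; it implements the same algorithm Talend exposes):

```java
import org.apache.commons.codec.language.Soundex;

public class SoundexDemo {
    public static void main(String[] args) {
        Soundex soundex = new Soundex();
        // Smith and Smyth collapse to the same four-character code...
        System.out.println(soundex.soundex("Smith")); // S530
        System.out.println(soundex.soundex("Smyth")); // S530
        // ...but so does a much more distant name, which is why
        // I avoid Soundex for matching
        System.out.println(soundex.soundex("Sand"));  // S530
    }
}
```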
Metaphone does compensate for the shortfalls of Soundex; however, its matches are more likely to include false positives than edit-based algorithms like Levenshtein and Jaro-Winkler. Unlike Soundex, Metaphone (in its Double Metaphone form, which is the variant that does this) generates two codes for every string, one primary and one alternate, and both are used in the matching analysis.
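A brief sketch of that two-code behavior, again using Apache Commons Codec (my choice of library; the class and method names below are that library's, not Talend's):

```java
import org.apache.commons.codec.language.DoubleMetaphone;

public class MetaphoneDemo {
    public static void main(String[] args) {
        DoubleMetaphone dm = new DoubleMetaphone();
        // Primary and alternate codes for a name with competing pronunciations
        System.out.println(dm.doubleMetaphone("Schmidt"));       // primary code
        System.out.println(dm.doubleMetaphone("Schmidt", true)); // alternate code
        // Two values match if either pair of codes agrees,
        // so the alternate code gives a second chance at a match
        System.out.println(dm.isDoubleMetaphoneEqual("Smith", "Smyth")); // true
    }
}
```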
Levenshtein is an edit-distance algorithm that measures the number of deletions, insertions, or substitutions required to make two or more strings equivalent. I like the deletion aspect of this algorithm, which is something that distinguishes it from measures that only consider character substitutions.
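A minimal sketch using the LevenshteinDistance class from Apache Commons Text (the library is my pick for illustration; Talend's Levenshtein option counts edits the same way):

```java
import org.apache.commons.text.similarity.LevenshteinDistance;

public class LevenshteinDemo {
    public static void main(String[] args) {
        LevenshteinDistance distance = LevenshteinDistance.getDefaultInstance();
        // One substitution (y -> i) separates the two spellings
        System.out.println(distance.apply("Smyth", "Smith"));  // 1
        // A deletion is also a single edit, e.g. dropping the doubled i
        System.out.println(distance.apply("Smiith", "Smith")); // 1
    }
}
```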
Jaro-Winkler is another edit-based similarity algorithm, and its distinguishing factor is that it weights the characters at the beginning of the string more heavily than those at the end. Most solutions allow you to define what is considered the beginning of the string by setting the number of leading characters to consider. This is particularly useful when trying to compensate for data entry issues: data entry is often accurate for the first few characters and tends to wane toward the end of the string.
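The prefix weighting is easy to see in a sketch with Apache Commons Text's JaroWinklerSimilarity (my library choice; the scores shown are approximate):

```java
import org.apache.commons.text.similarity.JaroWinklerSimilarity;

public class JaroWinklerDemo {
    public static void main(String[] args) {
        JaroWinklerSimilarity jw = new JaroWinklerSimilarity();
        // Both pairs differ by exactly one character, but the shared
        // leading characters of Smith/Smyth earn a prefix bonus...
        System.out.println(jw.apply("Smith", "Smyth")); // ~0.89
        // ...while a mismatch at the very start gets no bonus at all
        System.out.println(jw.apply("Smith", "Bmith")); // ~0.87
    }
}
```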
Probabilistic matching employs statistical and algorithmic analysis to determine, based on the frequency of values in the data set, which values are more unique, and it weights those unique values higher in the matching process.
For example, a common name like Smith would be weighted lower than a more unique name like Winkler. This weighting is then applied to the match analysis and directly affects whether two records carrying these values are identified as a match. The complexity of the algorithms is beyond the focus of this post, but it is something I will cover in future posts.
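To make the weighting idea concrete, here is a toy sketch of inverse-frequency weighting. This illustrates the principle only; the helper name and the log formula are my own choices, not Talend's actual scoring:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class FrequencyWeights {
    // Rarer values get higher weight: weight = log(total / frequency)
    static Map<String, Double> weights(List<String> values) {
        double total = values.size();
        Map<String, Long> counts = values.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        return counts.entrySet().stream()
                .collect(Collectors.toMap(Map.Entry::getKey,
                        e -> Math.log(total / e.getValue())));
    }

    public static void main(String[] args) {
        List<String> surnames = List.of("Smith", "Smith", "Smith", "Jones", "Winkler");
        // "Winkler" (1 of 5) outweighs "Smith" (3 of 5) in the match score
        System.out.println(weights(surnames));
    }
}
```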
One key factor to note when using probabilistic matching is that, due to the statistical analysis required, it demands significantly more time and resources than deterministic matching.
In order to overcome this resource requirement, it is beneficial, and highly recommended, that matching solutions include a logical grouping of the records that will be compared to each other. This approach is highly dependent on the intent and the data included in the matching. For example, customer records are often grouped by name and residence, while products are often grouped by code and product class. The key is to devise a grouping strategy that groups records based on their inherent similarities, and this is often the part of matching that is more art than science.
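As a sketch of what such a grouping might look like, here is a hypothetical blocking key built from the first letter of the surname plus a postal code; the record shape and key format are illustrative only:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class BlockingDemo {
    record Customer(String name, String zip) {}

    // Hypothetical blocking key: first letter of the surname plus ZIP code.
    // Only records sharing a key are compared pairwise, shrinking the workload.
    static String blockKey(Customer c) {
        return c.name().substring(0, 1).toUpperCase() + "|" + c.zip();
    }

    public static void main(String[] args) {
        List<Customer> customers = List.of(
                new Customer("Smith", "30301"),
                new Customer("Smyth", "30301"),
                new Customer("Jones", "60601"));
        Map<String, List<Customer>> blocks =
                customers.stream().collect(Collectors.groupingBy(BlockingDemo::blockKey));
        // Smith and Smyth share block "S|30301" and get compared; Jones does not
        blocks.forEach((key, group) -> System.out.println(key + " -> " + group));
    }
}
```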