The Data Quality Chronicle: Informatica Data Quality Workbench Matching Algorithms

Thursday, December 10, 2009

Informatica Data Quality Workbench Matching Algorithms

I'd like to begin a multi-part series of postings where I detail the various matching algorithms available in Informatica Data Quality (IDQ) Workbench. In this post I'll start with a quick overview of the algorithms and some typical uses for each. In subsequent postings I'll go into more detail and outline the math behind each algorithm. Finally, I'd like to finish up with some baseline comparisons using a single set of data.

IDQ Workbench lets the data quality professional choose from several algorithms to perform matching analysis. Each serves a different purpose or is tailored toward a specific type of matching. The algorithms are:

Hamming Distance

Jaro-Winkler

Edit Distance

Bigram or Bigram frequency

Let's look at the differences and main purpose for each of these algorithms.

The Hamming Distance algorithm is particularly useful when the position of the characters in the string is important. Examples of such strings are telephone numbers, dates and postal codes. The algorithm measures the minimum number of substitutions required to change one string into the other, which is the same as the number of positions at which the two strings differ. In its classic form it is defined only for strings of equal length.
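To make the idea concrete, here is a minimal Python sketch of the textbook Hamming distance. It is not IDQ's implementation; the function names and the 0-to-1 similarity score are my own assumptions for illustration.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def hamming_similarity(a: str, b: str) -> float:
    """Hypothetical 0..1 score: the share of positions that agree."""
    if not a and not b:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - hamming_distance(a, b) / len(a)


# Two phone numbers with the last two digits swapped differ in 2 positions.
print(hamming_distance("555-1234", "555-1243"))    # 2
print(hamming_similarity("555-1234", "555-1243"))  # 0.75
```

Because every position is compared with its counterpart, a single inserted or dropped character shifts everything after it, which is why this measure fits fixed-format fields better than free text.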

The Jaro-Winkler algorithm is well suited for matching strings where the prefix of the string is of particular importance. Examples include strings like company names (xyz associates vs. abc associates). The algorithm measures how similar two strings are from the number of matching characters and the number of transpositions required, and then raises the score for pairs that share a common prefix.
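The following Python sketch shows the standard published form of Jaro-Winkler. It is not necessarily what IDQ computes internally. The prefix scale of 0.1 and the four-character prefix cap are the commonly cited defaults, and I am assuming them here.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters only count as matching if they are within this distance.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro score boosted for a shared prefix (assumed default parameters)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

Note that the prefix bonus only helps when the strings start the same way; two names that differ in their first characters fall back to the plain Jaro score.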

The Edit Distance algorithm is an implementation of the Levenshtein distance algorithm where matches are calculated based on the minimum number of operations needed to transform one string into the other. These operations can include an insertion, deletion, or substitution of a single character. This algorithm is well suited for matching fields containing a short text string such as a name or short address field.
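As an illustration, here is a short Python sketch of the classic Levenshtein calculation. The normalized similarity score is my own assumption; I am not claiming that IDQ scales its scores this way.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Hypothetical 0..1 score: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))   # 3
print(edit_similarity("Smith", "Smyth"))  # 0.8
```

The work grows with the product of the two string lengths, which is one reason this measure is usually applied to short fields.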

The Bigram algorithm is one of my favorites due to its thorough decomposition of a string. The bigram algorithm matches data based on the occurrence of consecutive characters in both data strings in a matching pair, looking for pairs of consecutive characters that are common to both strings. The greater the number of common identical pairs between the strings, the higher the match score. This algorithm is useful in the comparison of long text strings, such as free format address lines.
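Here is a small Python sketch of one common way to score bigram overlap, the Dice coefficient. The choice of Dice, the lowercasing step and the function names are my assumptions; IDQ's exact scoring formula is not described in this post.

```python
from collections import Counter


def bigrams(text: str) -> Counter:
    """Count each pair of consecutive characters, ignoring case."""
    normalized = text.lower()
    return Counter(normalized[i:i + 2] for i in range(len(normalized) - 1))


def bigram_similarity(a: str, b: str) -> float:
    """Dice coefficient: 2 * shared bigrams / total bigrams in both strings."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Both strings are shorter than two characters.
        return 1.0 if a == b else 0.0
    shared = sum((grams_a & grams_b).values())  # multiset intersection
    return 2 * shared / total


# "night" and "nacht" share only the bigram "ht": 2 * 1 / (4 + 4)
print(bigram_similarity("night", "nacht"))  # 0.25
```

Because the score depends on which pairs occur rather than where they occur, reordered words in an address line still share most of their bigrams.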

Informatica provides several options for matching data out of the box with Data Quality (IDQ) Workbench. Although some will argue that another algorithm detects matches better, Informatica has provided robust methods for matching various types of strings. This flexibility lets the data quality professional handle various types of data elements in their match routines. As with any tool, it is not a replacement for the research required to use the right method in the right way. This is one of the aspects I'll cover in the subsequent postings, where we look at each algorithm in more detail.

Drop by next month for more about the Hamming distance algorithm and some real-world examples of how it can be implemented!

17 comments:

Per Olsson, December 10, 2009 at 7:02 AM
This will be fun to follow, thanks for an interesting post!
wesharp, December 10, 2009 at 7:45 AM
Per Olsson: Glad you are interested in following the series! I love the enthusiasm you express by using the word "fun"! I'll try to meet that expectation!
Dalton Cervo, December 10, 2009 at 12:22 PM
That sounds very interesting! I'll sure be following it too.

Thanks!
wesharp, December 10, 2009 at 4:17 PM
Dalton,
Glad you are interested. I'll try to keep the postings spicy! Thanks for the comment. It is energizing!
Peter Jaumann, December 16, 2009 at 4:49 PM
Interesting! We've implemented most of these plus more, but not 'Bigrams' yet. I will be curious to see further exposition on this. How is validation done, and what are the results from that? We use decision tree (DT) validation.
wesharp, December 17, 2009 at 3:37 AM
Peter -- Glad you enjoyed the post! I'll be sure to send you an alert when it is time to expand on Bigram algorithm matching and its benefits. As for validation, are you interested in match validation or address validation? Thanks for the comment! It is always beneficial to hear from readers!
Peter Jaumann, December 17, 2009 at 4:39 PM
wesharp,
I should have been more specific: match validation/analysis on both records that matched and records that didn't match.
wesharp, December 17, 2009 at 6:41 PM
To the best of my knowledge there is no automated way to do this. I typically facilitate this exercise with predefined use cases of known duplicates. I load my match results into a table and use SQL to analyze their validity.
Sue Corwin, December 28, 2009 at 2:46 PM
Nice post. I don't have any experience with IDQ, but I've done quite a bit of matching work using UTL_MATCH and Jaro-Winkler in Oracle. Very interested in learning more about the DQ tools and how they simplify this work.
wesharp, December 28, 2009 at 4:10 PM
Sue - Thanks for your comment. I am not familiar with the process in Oracle but I'd be happy to discuss the process using Informatica with you in depth. As we continue in this series, please feel free to ask specific questions.
vijji, February 18, 2011 at 5:42 AM
Hi wesharp,

Thanks for your replies.
I am working on an IDQ plan to eliminate duplicate records coming from the source.
I am using the following components in my plan and am not able to export the plan into Informatica PowerCenter Designer as a mapplet.
Components: Group Source and Group Target.

Can you please advise me on this?

Thanks very much
William Sharp, February 18, 2011 at 6:16 AM
Thanks for the comment, vijji. I hate to answer a question with a question, but I am afraid I need some additional information before I can give you a firm answer.
What version of IDQ and PowerCenter are you using?
Have you tried to validate your IDQ plan?
Are you using an IDQ mapping or a mapplet?

Looking forward to your answers!
Regards,
William
