Thursday, December 10, 2009

Informatica Data Quality Workbench Matching Algorithms

I'd like to begin a multi-part series of postings were I detail the various algorithms available in Informatica Data Quality (IDQ) Workbench.  In this post I'll start by giving a quick overview of the algorithms available and some typical uses for each.  In subsequent postings I'll get more detailed and outline the math behind the algorithm.  Finally I'd like to finish up with some baseline comparisons using a single set of data.

IDQ Workbench enables the data quality professional to select from several algorithms in order to perform matching analysis.  Each of these serve a different purpose or are tailored toward a specific type of matching.   These algorithms include the following:

  1. Hamming Distance

  2. Jaro-Winkler

  3. Edit Distance

  4. Bigram or Bigram frequency


 Let's look at the differences and main purpose for each of these algorithms.

The Hamming distance algorithm, for instance, is particularly useful when the position of the characters in the string are important.  Examples of such strings are telephone numbers, dates and postal codes.  The Hamming Distance algorithm measures the minimum number of substitutions required to change one string into the other, or the number of errors that transformed one string into the other.

The Jaro-Winkler algorithm is well suited for matching strings where the prefix of the string is of particular importance.  Examples include strings like company names (xyz associates vs. abc associates).   The Jaro-Winkler algorithm is a measure of how similar two strings are by calculating the number of matching characters and number of transpositions required.

The Edit Distance algorithm is an implementation of the Levenshtein distance algorithm where matches are calculated based on  the minimum number of operations needed to transform one string into the other.  These operations can include an insertion, deletion, or substitution of a single character.  This algorithm is well suited for matching fields containing a short text string such as a name or short address field.

The Bigram algorithm is one of my favorites due to its thorough decomposition of a string.  The bigram algorithm matches data based on the occurrence of consecutive characters in both data strings in a matching pair, looking for pairs of consecutive characters that are common to both strings. The greater the number of common identical pairs between the strings, the higher the match score.  This algorithm is useful in the comparison of long text strings, such as free format address lines.

Informatica provides several options for matching data out-of-box with Data Quality (IDQ) Workbench.  Although some will argue the ability of another algorithm to detect with greater strength, Informatica has provided some very robust methods to match various types of strings.  With this flexibility the data quality professional is enabled to handle various types of data elements in their match routines.  As with any tool, it is not a replacement for the research required to use the right method in the right way.  This is one of the aspects I'll cover in the subsequent postings where we take each algorithm and get more detailed. 

Drop by next month for more about the Hamming distance algorithm and some real word examples of how it can be implemented!

17 comments:

  1. This be fun to follow, thanks for an interesting post!

    ReplyDelete
  2. Per Olsson: Glad you are interested in following the series! I love the enthusiasm you express by using the word "fun"! I'll try to meet that expectation!

    ReplyDelete
  3. That sounds very interesting! I'll sure be following it too.

    Thanks!

    ReplyDelete
  4. Dalton,
    Glad you are interested. I'll try and keep the postings spicy! Thanks for the comment. It is energizing!

    ReplyDelete
  5. Interesting! We've implemented most of these plus more but not
    'Bigrams' yet. Will be curious to see further expos on this.
    How is validation done and waht are the results from that?
    We use decision tree (DT) validation

    ReplyDelete
  6. Peter -- Glad you enjoyed the post! I'll be sure to send you an alert when it is time to expand on Bigram algorithm matching and it's benefits. As for validation, are you interested in match validation or address validation? Thanks for the comment! It is always beneficial to hear from readers!

    ReplyDelete
  7. wesharp,
    I should have been more specific.....match validation/analysis on both, records that matched and records that didn't match.

    ReplyDelete
  8. To the best of my knowledge there is no automated way to do this. I typically facilitate this exercise with predefined use cases of known duplicates. I load my match results into a table and use SQL to analyze their validity.

    ReplyDelete
  9. Nice post. I don't have any experience with IDQ, but I've done quite a bit of matching work using the UTL_MATCH and Jaro-Winkler in Oracle. Very interested in learning more about the DQ tools and how they simplify this work.

    ReplyDelete
  10. Sue - Thanks for your comment. I am not familiar with the process in Oracle but I'd be happy to discuss the process using Informatica with you in depth. As we continue in this series, please feel free to ask specific questions.

    ReplyDelete
  11. [...] Informatica Data Quality Workbench Matching Algorithms is part of a series of postings were William details the various algorithms available in Informatica Data Quality (IDQ) Workbench. In this post William start by giving a quick overview of the algorithms available and some typical uses for each. The subsequent postings gets more detailed and outline the math behind the algorithm and will finally be finished up with some baseline comparisons using a single set of data. [...]

    ReplyDelete
  12. [...] Informatica Data Quality Workbench Matching Algorithms December 2009 12 comments 3 [...]

    ReplyDelete
  13. Hi wesharp,

    Thanks for your replies.
    I am working on a IDQ plan to eliminate duplicate records coming coming from source.
    I am usingg the following components in my plan and not able to export the plan into informatica power center designer as a maplet.
    components : Group source and Group target.

    Can you please advice me on this.

    Thanks very much

    ReplyDelete
  14. Thanks for the cooment, vijji. I hate to answer a question with a question, but I am afraid I need some additional information before I can give you a firm answer.
    What version of IDQ and PowerCenter are you using?
    Have you tried to validate your IDQ plan?
    Are you using an IDQ mapping or a mapplet?

    Looking forward to your answers!
    Regards,
    William

    ReplyDelete
  15. Mastech InfoTrellis - Data and Analytics Consulting Company extending premier services in Master Data Management, Big Data and Data Integration.

    Visit for More : http://www.infotrellis.com/integrate-informatica-data-quality-idq-informatica-mdm/

    ReplyDelete

What data quality is (and what it is not)

Like the radar system pictured above, data quality is a sentinel; a detection system put in place to warn of threats to valuable assets. ...