Thursday, June 17, 2010

Overly conservative? Does Enterprise Data Quality need to push the boundary more?

Recently I had coffee with Dr. John Talburt of the University of Arkansas at Little Rock's Information Quality program. During the conversation we exchanged experiences from implementing data quality solutions, especially with regard to matching. One of the prevailing topics was the promotion of a master record from a set of duplicates. That's when Dr. Talburt shared a perspective with me that I had not considered.

He asserted that the focus of duplicate validation is most often placed on confirming true positive results. In other words, data quality analysts concentrate on confirming that the duplicates they identify really are duplicates. That sounds straightforward enough, and I confess I am guilty of this approach. However, Dr. Talburt proposed that at least one more validation process be added to the results analysis: the validation of true negative results, or verifying that the records not identified as duplicates truly are not duplicates.
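
To make the terminology concrete, the outcomes of a matching run can be tallied into four buckets once a sample of record pairs has been manually reviewed. The sketch below is purely illustrative; the pair list and function name are mine and not taken from any particular tool.

from collections import Counter

def tally_outcomes(pairs):
    # Each pair is (flagged_as_duplicate, confirmed_duplicate_on_review).
    counts = Counter()
    for flagged, confirmed in pairs:
        if flagged and confirmed:
            counts["true positive"] += 1   # flagged and really a duplicate
        elif flagged and not confirmed:
            counts["false positive"] += 1  # flagged, but not actually a duplicate
        elif not flagged and confirmed:
            counts["false negative"] += 1  # a missed duplicate, Dr. Talburt's concern
        else:
            counts["true negative"] += 1   # correctly left unmatched
    return counts

print(tally_outcomes([(True, True), (False, True), (False, False)]))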

Granted, this approach is more difficult and certainly more time-consuming, particularly for large datasets. However, it can be a more valuable exercise for an organization. After all, leaving duplicates unidentified in an enterprise dataset compounds the cost of the data by effectively under-utilizing the investment in the de-duplication project.

With this in mind, I have begun compiling some methodologies to assure that my efforts to eliminate duplicates have not left any stone unturned. Here is a list of what I have been able to identify as effective and worth the extra time and resources.

Post Matching Group Analysis


As I stated earlier, after my primary matching runs I analyze the results to be sure I have positively identified duplicates. As if with blinders on, I have not traditionally analyzed the results to determine whether possible duplicates were left behind. After all, I went through extensive efforts to cleanse and standardize the data in order to increase matching accuracy. Why then would I analyze the records that were not identified as potential duplicates? The answer lies in capitalizing on the investment of the data quality initiative. It is time to take the blinders off and look at the transactions not identified as duplicates.

On CDI projects, this process involves reviewing non-duplicate transactions that share, at a minimum, similar last name or address values. On PIM projects, it includes non-duplicate transactions that share similar product descriptions.
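
As a rough illustration of that kind of review, the sketch below re-groups the records a match run left unlinked on a coarse key (last name plus ZIP here); any key shared by two or more unmatched records becomes a candidate false negative worth a manual look. The field names and sample data are hypothetical.

from collections import defaultdict

def candidate_false_negatives(records):
    groups = defaultdict(list)
    for rec in records:
        if rec.get("matched"):
            continue  # only examine records the match run left behind
        key = (rec["last_name"].strip().upper(), rec["zip"][:5])
        groups[key].append(rec)
    # keep only keys shared by two or more unmatched records
    return {key: recs for key, recs in groups.items() if len(recs) > 1}

sample = [
    {"last_name": "Smith",  "zip": "72201", "matched": False},
    {"last_name": "SMITH ", "zip": "72201", "matched": False},
    {"last_name": "Jones",  "zip": "90210", "matched": True},
]
print(candidate_false_negatives(sample))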

Multiple Matching Runs


Running multiple matching passes with varied match thresholds is another way to help ensure that all duplicate transactions are identified. A match threshold sets the minimum similarity score at which a pair is considered a positive match and is usually expressed as a percentage. For instance, I typically set an initial threshold of 0.9, or 90%. In light of my conversations with Dr. Talburt, I will start running at least one more match pass with a lower threshold of 0.75-0.80 to identify duplicates missed in my initial run.
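
Here is a minimal sketch of the two-pass idea, using Python's difflib as a stand-in for a real matching engine; the names and thresholds are illustrative only.

from difflib import SequenceMatcher
from itertools import combinations

def match_pairs(names, threshold):
    # Return the pairs whose similarity score clears the given threshold.
    hits = set()
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            hits.add((a, b))
    return hits

names = ["Jon Smith", "John Smyth", "Mary Jones"]
strict = match_pairs(names, 0.90)  # initial, conservative pass
loose = match_pairs(names, 0.78)   # second, more liberal pass
print(loose - strict)              # pairs only the lower threshold caught: review candidates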

Summary


It is always good to gain new perspectives, especially on something that you do frequently. My eyes have definitely been opened to another potential outcome of my matching runs. I feel strongly about the need for data quality. As a result, I also feel strongly about delivering a return on the investment in data quality initiatives. To do this I now firmly believe that I need to include a process to analyze and, if necessary, remediate false negatives in my future matching routines.

20 comments:

  1. Excellent post, William.

    I believe that it is the fear of False Positives (records that matched, but should not have been matched), which makes many implementations perform very little analysis for False Negatives (records that did not match, but should have been matched).

    Fundamentally, the primary business problem being solved by data matching is the reduction of false negatives – the identification of records within and across existing systems not currently linked that are preventing the enterprise from understanding the true data relationships that exist in their information assets.

    However, the pursuit to reduce false negatives carries with it the risk of creating false positives.

    In my experience, I have found that clients are far more concerned about the potential negative impact on business decisions caused by false positives in the records automatically linked by data matching software, than they are about the false negatives not linked – after all, those records were not linked before investing in the data matching software.

    Not solving an existing problem is commonly perceived to be not as bad as creating a new problem.

    However, I agree with Dr. Talburt that the validation of True Negatives is essential.

    Best Regards,

    Jim

    ReplyDelete
  2. Henrik Liliendahl Sørensen, June 17, 2010 at 7:13 PM

    Really nice one William. Paraphrasing Shakespeare: To find false negatives or not, that’s the question. Strangely, the ability to do so is usually not among the criteria when data quality tools are weighted in industry reports.

    ReplyDelete
  3. Great post William.

    I think this makes sense. There is often a one-pass strategy, but perhaps this will spark a debate on the value of an iterative strategy. I can definitely see the merit in this given the often critical nature of the data.

    ReplyDelete
  4. Interesting article! I've worked with several data quality tool vendors, and multiple match passes is something that was introduced to me early as part of their methodology, to avoid missing out on potential true positives. Although I must say that not all vendors have a methodology explaining this part. Some vendors (unfortunately) adopt a black box marketing approach to matching and de-duplication. The fact that you must work quite a lot with your cleansing and match passes to get the best results is nothing you want to share with your customers.

    ReplyDelete
  5. Jim,
    Thanks, as usual, for taking the time to read and comment on the post. I feel the best part of writing is the exchange!
    On your comments, I agree 100% that most customers are concerned with linkages that they should have but do not. That should always be the primary goal of a de-duplication effort. I also agree with Dr. Talburt that the incremental effort of checking for false negatives - records that should be matched but are not - is a worthwhile cause. Essentially, it is a slight tweak to the matching process and a union of datasets. Not a large price to pay for a possible increase in the return and potential gain of duplicate consolidation.
    Again, thanks for taking the time to read and comment!

    ReplyDelete
  6. Thanks Henrik! I think we are at the dawn of data quality tool maturity and as we evolve further, so too will the industry analysis.
    Thanks for taking the time to read and comment on the post!

    ReplyDelete
  7. Thanks for taking the time to read and comment on the post! I must admit I like the idea of sparking a debate :)

    The thing about the strategy is that it is not a great deal of development to produce. The analysis is probably more involved, as you'd likely be going through a lot of false positives. However, if we believe in the value of consolidation, ROI and potential revenue increases that follow, then it is justifiable in some cases to perform the extra work.

    ReplyDelete
  8. Dario,
    Thanks for taking the time to read and comment on the post. I can see why vendors position it this way though. I think there is a certain set of skills that each practitioner must have that is not part of the tool, but is part of knowing how to use the tool.
    For example, you can use a hammer to keep from falling off a roof by driving the nail-removal end into the roof; however, hammer manufacturers don't typically advertise that "feature".
    In all trades there are some things that are gained through experience and I think this is one of them. This is also why I wanted to write about it and share it with the DQ community.

    ReplyDelete
  9. Hi William - Thanks for sharing your thoughts. This is a worthwhile post that points out that the tools will only work as well as the people who develop or run them. Not meant to bash any developer or DQ Analyst - but there is always room for improvement.

    I always encourage manual review before finalizing a CDI project. This is especially important since our "customer" will, at some point, manually review records -- somewhere in the downstream supply chain.

    While manual review and analysis is always the most time-consuming and expensive, it is necessary to ensure maximum thresholds are maintained.

    ReplyDelete
  10. Hi William,

    One of the things we often find about false negatives and false positives is that the value of detecting them varies with the data that is being matched.

    For example, if it is a marketing list in a B2C environment where the data is being de-duplicated to avoid sending multiple direct mail offers, the cost of false negatives and positives is low. Either you don't send the mailer to someone that should have received it or you send the mailer twice to the same person. Low impact.

    However, in other situations, the costs can be extremely high. You mention false positives as being the focus and there are many examples of where those can cause problems such as privacy and other compliance issues.

    In the case of false negatives, one common example that comes to mind is credit risk. Imagine a situation where you are doing business with a global organization with multiple subsidiaries. Trying to understand your total exposure from an accounts receivable standpoint requires that you collect and aggregate all outstanding debts. If the matching process misses a few transactions, then the finance department may extend credit to one of the subsidiaries beyond what is specified by the corporate policy. A false negative in this situation can have a big impact to the organization.

    The answer to that is exactly as you indicate -- use probabilistic approaches to identify the potential duplicates and use multiple matching strategies to identify the candidates. However, the software still won't be able to make 100% of the correct matches automatically (maybe only 97% or 98%). The way to deal with those that fall into the middle (not sure) is to have a data stewardship environment that allows the business users who know and understand the data to make the final determination on those suspect records. Having an integrated, web-based environment with the ability to apply multiple match rules and route exceptions to the business users is the key to making this an easy and economical approach.

    ReplyDelete
  11. Mary,
    Thanks for taking the time to read and comment on the post! I strongly agree with your statement regarding the need for manual review. I typically identify several cases that define a successful outcome prior to running match routines. This helps the developer and the data consumer have confidence that the desired outcomes are being accomplished. I also strongly agree with your statement regarding tools and feel that they have limits and are not to be viewed as a "black box".
    Thanks again!
    William

    ReplyDelete
  12. Michael,
    Thanks for taking time to read and comment on the post! I agree with your statements on the risk exposure example. However, my last two project sponsors were marketing folks, so I'd have to say that the impact of false positives/negatives is high for them. After all, it was the driver behind the initiative. Not delivering on it would be deemed a project failure to some extent.
    I think your statements around the data steward program are valid and important for data consumers to realize. As I have stated previously, matching tools are not a black box and require skill, from many perspectives, to fully deliver the return.
    Thanks again,
    William

    ReplyDelete
  13. At this point DQ tools with matching capabilities depend on the craftsmanship of a data quality professional. I believe that matching hit rates can vary between people and even between days or data domains. I think that the future of matching in data quality tools will bring A.I. and data mining into the matching functionality to automate a match strategy based on a number of identified possible match combinations, with a press of a "BUILD BEST MATCH STRATEGY" button... at least I hope this will happen!

    ReplyDelete
  14. Dario,
    Sounds interesting. I am ramping up on some tools that have some of those capabilities (Global IDs).
    I am interested in the other end of the matching chain - the reporting of match results - as well as the integration of match results. Informatica has a set of tools that delivers on these functions with profiling, matching, reporting and consolidation routines built into Informatica v9.
    Thanks again for stopping by and joining the discussion
    William

    ReplyDelete
  15. William -

    Great post and discussion!

    If you want to see the advancement of balancing FP and FN you should check out our Master Data Service. It implements "likelihood theory" and excels at applying probabilistic matching to a variety of business use cases. If you want to see FP and FN management in action you should contact us. It not only produces higher quality matches than a simple one-dimensional deterministic match, it out-performs and scales way beyond other matching techniques. It weeds out everything below a configurable FN threshold (known non-matches) and automatically links everything above a configurable FP threshold ("obvious" matches), leaving a set of probable matches between the FN and FP thresholds for data stewards to take care of. These thresholds can be set to be very conservative (no auto-matches unless identical records) or very liberal (last name and address is good enough). In fact, within a single org, you can have several definitions of "what is unique" at the same time.

    Enjoy!

    ReplyDelete
  16. Mark,
    Thanks for taking time to read and comment on the post! I'd love to see the features you've listed above. I think it would be a great idea to post who "our" and "us" refer to so that readers can gain insight into "your" offerings.
    I am interested in performing a review of the software in the hopes of writing about these features you have described.
    Thanks again!

    ReplyDelete
  17. @Jim Harris. False Positives are truly feared in some areas. A recent finance project almost ground to a halt over bickering about what constituted a match.

    And this was for a marketing data mart that was only to be used for mailshots.

    There were no financial or legal impacts and yet meetings dragged on for hours over the matching of names and initials at exactly the same addresses.

    At the end of the day the argument was won by the simple point that all we put on the address label was the initial anyway. Outside of the debate, there was no difference in practice.

    >>Not solving an existing problem is commonly perceived to be not as bad as creating a new problem.

    Now that's a good point!

    ReplyDelete
  18. [...] Overly conservative? Does Enterprise Data Quality need to push the boundary more? June 2010 17 comments [...]

    ReplyDelete
  19. [...] Overly conservative? Does Enterprise Data Quality need to push the boundary more? -… http://thedataqualitychronicle.org/… #dataquality [...]

    ReplyDelete
  20. Ah, the FP/FN discussion. Talk about a remediation and workflow nightmare. But it is so important. One way to overcome it is to allow a period of time during which FNs and general suspect records are put through manual merge. Log the records that merge, their score, and the end result (the survivor record). First, this puts pressure on the business to divert resources to something that can be very time-consuming, making it necessary to address quickly. Second, it allows you to analyze the manual results and tune the match rules.



    In general, I'll add that it is bad practice to do "one and done" for data quality - whether it is cleansing or matching. A black box limits visibility, so don't implement matching and then create a black box once you deploy. Continue to profile and monitor match results, tuning based on outcomes and feedback from the business.

    ReplyDelete
