Wednesday, February 2, 2011

Data Quality Basic Training

Recently a reader asked me if I had any posts on "data quality basics". Turns out, I didn't. So I've decided to put together a series of posts that covers what I feel are the basic essentials of a data quality program.

The many sides of data quality


It is generally accepted in the data quality world that there are  seven categories by which data quality can be analyzed.  These include the following:

  1. Conformity

  2. Consistency

  3. Completeness

  4. Accuracy

  5. Integrity

  6. Duplication

  7. Timeliness



  • Conformity - Analyzing data for conformity measures adherence to the data definition standards. This can include determining whether data is of the correct type and length (a small sketch of such a check appears after this list)

  • Consistency - Analyzing data for consistency measures whether data is uniformly represented across systems. This can involve comparing the same attribute across the various systems in the enterprise

  • Completeness - Analyzing data for completeness measures whether or not required data is populated.  This can involve one or more elements and is usually tightly coupled with required field validation rules

  • Accuracy - Analyzing data for accuracy measures whether data is nonstandard or out-of-date. This can involve comparing data against standards like USPS deliverability and ASCII code references

  • Integrity - Analyzing data for integrity measures the data references that link information like customers and their addresses. Using that example, this analysis would determine which addresses are not associated with a customer

  • Duplication - Analyzing data for duplication measures the pervasiveness of redundant records. This involves determining the pieces of information that uniquely define a record and identifying the extent to which redundant records are present

  • Timeliness - Analyzing data for timeliness measures the availability of data.  This involves analyzing the creation, update or deletion of data and its dependent business processes
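
To make a couple of these dimensions concrete, here is a minimal sketch in Python of what conformity and completeness checks might look like. The record layout and rules are made up for illustration; they are not from any particular system.

```python
# Hypothetical data definition standard for a customer record: each field
# has an expected type, a maximum length, and a required flag.
CUSTOMER_STANDARD = {
    "customer_id": {"type": int, "max_length": 10, "required": True},
    "name":        {"type": str, "max_length": 50, "required": True},
    "state":       {"type": str, "max_length": 2,  "required": False},
}

def check_conformity(record):
    """Return fields that violate the type or length rules (conformity)."""
    violations = []
    for field, rule in CUSTOMER_STANDARD.items():
        value = record.get(field)
        if value is None:
            continue  # a missing value is a completeness issue, not conformity
        if not isinstance(value, rule["type"]):
            violations.append((field, "wrong type"))
        elif len(str(value)) > rule["max_length"]:
            violations.append((field, "too long"))
    return violations

def check_completeness(record):
    """Return required fields that are missing or empty (completeness)."""
    return [f for f, rule in CUSTOMER_STANDARD.items()
            if rule["required"] and not record.get(f)]

record = {"customer_id": 12345, "name": "Acme Corp", "state": "Texas"}
print(check_conformity(record))    # [('state', 'too long')]
print(check_completeness(record))  # []
```

The other dimensions follow the same basic pattern: a rule, a measurement against that rule, and a list of exceptions.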



Data Quality In Motion


There is a certain set of core activities that defines data quality. In my opinion, data quality is about change. As a result, many of the data quality activities have to do with changes to data. The following is my list of the basic data quality activities:

  1. Data Profiling

  2. Data Cleansing

  3. Data Standardization

  4. Data Deduplication

  5. Data Consolidation

  6. Data Validation



  • Data Profiling - during data profiling the main goal is to define the nature and magnitude of the data quality issues

  • Data Cleansing - during data cleansing the main goal is to transform data from its original value, which contains inappropriate values or errors, to one that is free of error

  • Data Standardization - during data standardization the main goal is to transform data from the original value to a defined enterprise standard

  • Data Deduplication - during data deduplication the main goal is to identify the redundant data present

  • Data Consolidation - during data consolidation the main goal is to reduce the redundancy present through survivorship rules

  • Data Validation - during data validation the main goal is to verify that the cleansing, standardization, duplicate identification and consolidation have been conducted in a manner that complies with business requirements


Not all data quality efforts require all six steps; however, there are two activities mentioned above that, in my opinion, should be included in every effort: data profiling and data validation.

Without a data profiling step, the effort is reduced to the proverbial dog chasing its tail. Profiling gives the effort direction and an accurate estimate of when the destination can be reached.

Without data validation, you cannot be confident that the direction gleaned from the profiling is the right direction. The list may be somewhat misleading in that data validation really happens after each of those activities, not just at the end of the effort.

Basic basics


There are some routine processes within each data quality activity. These processes seem to cut across data quality efforts in various industries and systems.

Data profiling often involves investigating data types and lengths, as well as the pervasiveness of non-alphanumeric characters in the data. These types of errors often cause issues with data migration efforts. From a data quality perspective, these errors fall into the data conformity domain. They could also affect the data timeliness domain as a result of data migration failures. As you may have noticed, data quality domains and their issues are closely related and often have a ripple effect.
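
As a rough illustration, here is a minimal profiling sketch in Python. It assumes a column of string values and simply reports the length distribution and any values containing non-alphanumeric characters; real profiling tools go much further.

```python
import re
from collections import Counter

def profile_column(values):
    """Summarize value lengths and flag values with non-alphanumeric characters."""
    lengths = Counter(len(v) for v in values)
    suspects = [v for v in values if re.search(r"[^A-Za-z0-9 ]", v)]
    return {
        "row_count": len(values),
        "length_distribution": dict(lengths),
        "non_alphanumeric_values": suspects,
    }

names = ["ACME CORP", "Smith & Sons", "J@NE DOE", "GLOBEX"]
print(profile_column(names))
# {'row_count': 4,
#  'length_distribution': {9: 1, 12: 1, 8: 1, 6: 1},
#  'non_alphanumeric_values': ['Smith & Sons', 'J@NE DOE']}
```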

Data cleansing typically focuses on the removal of things like special characters from data but can also include conversion from one data type to another. If you think of dirty data as data that does not look the way it should, then data cleansing makes a little more sense.
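
A small sketch of that idea, assuming the target standard allows only letters, digits and spaces, and that a numeric field sometimes arrives as formatted text:

```python
import re

def cleanse_text(value):
    """Strip characters outside the assumed allowed set (letters, digits, spaces)."""
    return re.sub(r"[^A-Za-z0-9 ]", "", value).strip()

def to_int_or_none(value):
    """Convert a numeric string to an int; return None if it cannot be converted."""
    try:
        return int(cleanse_text(value))
    except ValueError:
        return None

print(cleanse_text("ACME, Inc.#"))  # 'ACME Inc'
print(to_int_or_none("1,024 "))     # 1024
```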

Data standardization is the process of converting data to a commonly accepted norm, or standard.  Typically this is done with various reference files or tables.  For instance, there are commonly accepted abbreviations for things like state and country codes.  Data standardization is the process by which conversion from the original source value to the defined standard occurs.
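
For example, a standardization step for state codes might look like the sketch below. The reference table here is deliberately tiny and made up; in practice it would be a maintained reference file or table.

```python
# Hypothetical reference table mapping source spellings to the enterprise
# standard two-letter state code.
STATE_STANDARD = {
    "TEXAS": "TX",
    "TEX": "TX",
    "CALIFORNIA": "CA",
    "CALIF": "CA",
}

def standardize_state(value):
    """Map a raw state value to the standard code; leave unknown values unchanged."""
    key = value.strip().upper().rstrip(".")
    return STATE_STANDARD.get(key, value)

print(standardize_state("texas"))    # 'TX'
print(standardize_state("Calif."))   # 'CA'
print(standardize_state("Ontario"))  # 'Ontario' (not in the reference table)
```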

Data deduplication typically focuses on either the customer or product domains.  Often organizations that grow through merger and acquisition end up having multiple records for the same customer and/or products.  Being able to gauge the magnitude of this issue and remediate it is one of the most compelling offerings in the data quality space.
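
A bare-bones version of duplicate identification might group records on a match key built from the fields assumed to uniquely define a customer, as in this sketch. The field names and matching rules are illustrative only; real matching is usually much fuzzier.

```python
from collections import defaultdict

def find_duplicates(records, key_fields):
    """Group records by a normalized match key; return groups with more than one record."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(str(rec.get(f, "")).strip().upper() for f in key_fields)
        groups[key].append(rec)
    return {k: v for k, v in groups.items() if len(v) > 1}

customers = [
    {"id": 1, "name": "Acme Corp",  "postal_code": "75201"},
    {"id": 2, "name": "ACME CORP ", "postal_code": "75201"},
    {"id": 3, "name": "Globex",     "postal_code": "10001"},
]
dupes = find_duplicates(customers, ["name", "postal_code"])
print(list(dupes.keys()))  # [('ACME CORP', '75201')]
```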

Data consolidation is performed in conjunction with duplicate identification and is the process by which the duplicates are reduced.  Often this process is dependent on "survivorship" rules that define which record is kept and which is archived.  Sometimes there are rules which condense the two records into a master record.  Simple examples of survivorship rules include most recent, most complete, or source system based rules.  It is crucial to note that the non-surviving record is archived and not discarded.
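
Continuing the example above, a "most complete" survivorship rule could be sketched like this. Again, this is illustrative only; real survivorship logic is usually a mix of rules and may merge fields into a master record.

```python
def most_complete(records):
    """Survivorship rule: keep the record with the most populated fields."""
    return max(records, key=lambda r: sum(1 for v in r.values() if v not in (None, "")))

def consolidate(duplicate_group):
    """Pick a surviving record; the rest are returned for archiving, never discarded."""
    survivor = most_complete(duplicate_group)
    archived = [r for r in duplicate_group if r is not survivor]
    return survivor, archived

group = [
    {"id": 1, "name": "Acme Corp", "phone": "",         "updated": "2010-06-01"},
    {"id": 2, "name": "ACME CORP", "phone": "555-0100", "updated": "2011-01-15"},
]
survivor, archived = consolidate(group)
print(survivor["id"], [r["id"] for r in archived])  # 2 [1]
```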

Data validation is the process by which the data remediation steps taken and their results are reviewed with the data owner. Typically it is not possible to go through all the records involved, so various use cases are often selected prior to remediation and reviewed once the work is completed. Use cases involve data or records that are known to be an issue. This process is effectively the before-and-after snapshot and provides the data owner with a comfort level that the changes implemented are valid for all the data or records in question.
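
One way to picture this before-and-after review is the sketch below, which checks a handful of pre-selected records against the values the data owner expects to see after remediation. The record ids and expected values are made up for illustration.

```python
def validate_scenarios(before, after, scenario_ids, expectations):
    """Compare before/after snapshots of selected records against expected values."""
    results = {}
    for rec_id in scenario_ids:
        expected = expectations[rec_id]
        actual = after.get(rec_id, {})
        results[rec_id] = {
            "before": before.get(rec_id),
            "after": actual,
            "passed": all(actual.get(field) == value for field, value in expected.items()),
        }
    return results

before = {1: {"state": "Texas"}, 2: {"state": "Calif."}}
after = {1: {"state": "TX"}, 2: {"state": "CA"}}
expectations = {1: {"state": "TX"}, 2: {"state": "CA"}}
results = validate_scenarios(before, after, [1, 2], expectations)
print(all(r["passed"] for r in results.values()))  # True
```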

Off to battle!


I think that covers a great deal of what data quality efforts involve.  I try to revisit these basics from time to time and drill into more detail when I come across examples.  Hopefully this post contains enough information for those of you that are interested to go out and win some data quality battles.

12 comments:

  1. Thanks William,

    This is really very informative !!

  2. Henrik Liliendahl Sørensen, February 2, 2011 at 8:54 PM

    Nice 101 William. One of my many pet peeves is with dimension 6 that you and many others call duplication. I like the term uniqueness as this is the positive outcome like accuracy, consistency etc.

  3. Hi William

    A very comprehensive and well structured lesson on how to put things right after they have gone wrong (with the exception of "validation", which I shall return to).

    The Achilles Heel of data quality in its current state of (im)maturity is that it is all about sorting out data errors, as opposed to preventing them occurring. Sadly, a whole industry has grown up around this flawed approach - an industry that some might argue has a vested interest in not preventing errors occurring as it would become redundant.

    The validation that you referred to cannot be brought about by Use Case. OO and UML techniques are not effective for preventing data errors at a business level. The only effective agent for this is the Business Function.

    All data in an enterprise is created or transformed by Business Functions. So, by correctly executing Business Functions (either manually or electronically), an enterprise can prevent data errors arising in the first place.

    Why would any enterprise want to create data errors and then spend time, effort and money trying to find and remedy them?

    Regards
    John

  4. John,
    I may have used the term use case out of context. I wasn't referring to the typical systematic use cases seen in UML modeling but more data-centric use cases that sample known transactions and determine the effect of the remediation.

    I have to disagree with you that data quality is purely reactive and not focused on preventing errors. There are many SaaS data service offerings available now that can take a remediation effort and turn it into a prevention effort.

    I'm not sure I agree that ALL data is created by business functions either. Not to mention, it is difficult to develop every possible scenario in which data can be corrupted and engineer a solution. Data timeliness is one of those data quality issues that is just a fact. As data ages it becomes less accurate. Like address data. People move. Buildings are demolished. The United States Postal Service rezones postal codes.

    Point is, some things you have to react to and some things you can prevent. It is important to mix both solutions into a data quality program.

    Thanks for stopping by and commenting on the post!

  5. William

    Great post. I have to agree with Henrik - it always pays to accentuate the positive aspects of metrics and targets so the words we use to describe them are important.

    My take on the maturity/immaturity of Information/Data quality is that it is actually more to do with the immaturity of people. My experience in the phone company days was that if we fixed the 'hot potato' problem, senior management breathed a sigh of relief and then handed us another potato. Because in their heads the problem was fixed: the thing they were being beaten up about was fixed. It took me about 2 yrs, an analysis of the root cause of why we had problems (>85% stemmed from poor product design leading to billing errors) and a change in legislation to get management to take the prevention mantra seriously.
    So, we need a maturing in management away from the "quick fix" towards an emphasis on prevention.

  6. kenoconnordataconsul, February 3, 2011 at 12:19 PM

    Nice list William, and always good to see differing opinions in the comments,

    Rgds Ken

  7. Thanks! I agree. A healthy debate is always best. John's right. They really aren't use cases. That's just what I call them. They are more of a "test scenario".

  8. Daragh,
    I've had very similar experiences. Which kind of proves John's point.

  9. Henrik,

    I'd have to agree. I think I may refer to it as uniqueness from now on. Low uniqueness is bad. High uniqueness is good. I like it. Thanks for the fresh perspective!

  10. A good list of data quality criteria and techniques to begin with, amongst others. The most comprehensive list I have seen is in the book Managing Information Quality by Martin Eppler, in which he lists 70 typical information quality criteria compiled from various other sources (and referenced).

    With respect to a comprehensive data quality program, however, if you do not address data modeling and metadata management (including semantics, controlled vocabularies, and naming and definition standards such as ISO 11179), you will remain in the mode of reacting and fixing rather than preventing data errors. The error is already there, so you haven't prevented its effect on the business. In any quality initiative the primary goal is to prevent the error.

  11. Great points, Richard! Thanks for stopping by and commenting. I did want to stick to the basics in this post. I'll have to check out the book you've referenced.

  12. Very nice post, William. A couple of additional points for DQ activities....
    Perhaps we need to add Data Enrichment as well. The idea is to use third-party data providers for data elements that are not available within the organization. Not everybody does it, but it is common enough.
    Another is manual review and remediation. Even with the best data quality firewalls and processes in place, data errors sneak in. And even with the best data remediation tools in place, not all errors can be corrected automatically by software. Manual intervention is necessary. A Data Steward or domain expert needs to look at the data and bring it to acceptable quality.

