Sunday, April 17, 2011

Know Your Data - Data Profiling

Having data issues is a lot like having any other issue; you just want to get it fixed and move on with life.  I talk with a lot of business users who feel this way. 

They have stories for me when I arrive.  From 62 different state codes to 5 different versions of one of their highest-profile customers, they tell me these stories and I patiently listen. Well, for the most part.  When they finish I know a few things I didn’t before I walked into their office.  I have a good idea of the relative impact of the data issues, I have an approximate idea of the data domain(s) I need to concentrate on, and I know a lot about my client and their passion for the data issues.  One thing I don’t know much about is the data itself. 

In one of my previous posts, Know Your Data, I talked about the importance of knowing data from the perspective of its relationships.  Another important aspect of knowing data is its structure and content; sort of from the inside out.  The way you get this perspective is to profile the data.

There are several types of profiling, but I am specifically referring to metadata profiling and data profiling.  Metadata profiling provides details about the data’s structure, and data profiling provides details about the actual data values.  While at first this seems obvious, it is a hard practice to implement.  It’s hard for two main reasons:

  1. After listening to business users describe their issues and the urgency around getting them fixed, the tendency is to get to work on the resolution

  2. Lack of tools that provide adequate profiling functionality


However, when data quality practitioners are equipped with tools that can perform these profiling functions, they can reach more accurate and useful conclusions.  In a recent blog post Dalton Cervo highlighted the importance of performing data analysis.  In it he mentions how a data quality team engaged in multiple discussions regarding how duplicate records would be consolidated.  However, without profiling the data, as he points out, there was no way of determining how duplicates would be identified.  In other words, without profiling the data there is no way to know what data you can use to identify duplicates, so there is no point in engineering a solution until you know your data.
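To make the two kinds of profiling concrete, here is a minimal sketch in plain Python.  It is not how any particular tool works; the function names, the sample customer rows and the choice of statistics are all assumptions made for illustration.  For each column it reports a little structure (the value types seen and the maximum length) and a little content (null rate, distinct count and most common values), which is roughly the information you need before deciding which columns could help identify duplicates.

```python
from collections import Counter


def profile_column(values):
    """Return simple structure and content statistics for one column."""
    # Treat None and the empty string as "not populated" (an assumption).
    non_null = [v for v in values if v not in (None, "")]
    return {
        # Structure: which Python types appear and how long values get.
        "inferred_types": sorted({type(v).__name__ for v in non_null}),
        "max_length": max((len(str(v)) for v in non_null), default=0),
        # Content: how often the column is empty and how varied it is.
        "null_rate": 1 - len(non_null) / len(values) if values else 0.0,
        "distinct": len(set(non_null)),
        "top_values": Counter(non_null).most_common(3),
    }


def profile_table(rows):
    """Profile every column of a list of dict rows."""
    columns = rows[0].keys() if rows else []
    return {c: profile_column([r.get(c) for r in rows]) for c in columns}


# Hypothetical sample data, invented for this example.
customers = [
    {"id": 1, "name": "Acme Corp", "state": "NY", "email": "ap@acme.example"},
    {"id": 2, "name": "ACME Corporation", "state": "N.Y.", "email": "ap@acme.example"},
    {"id": 3, "name": "Globex", "state": "", "email": None},
    {"id": 4, "name": "Initech", "state": "TX", "email": "info@initech.example"},
]

report = profile_table(customers)
for column, stats in report.items():
    print(column, stats)
```

In this made-up sample the email column is only 75% populated and has fewer distinct values than rows, which hints that it might help match the two Acme records but cannot be the only matching attribute.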

Typically I use Informatica Data Quality Developer and Analyst to profile data if I know exactly what data I am to concentrate on.  If I do not have that advantage, I use Global IDs’ data discovery functionality.  The combination of these two powerful tools allows me to gain a lot of knowledge in a short period of time. 

Once armed with the details of my insight, I begin to have clarity around the following:

  1. A high-level idea around the conformity of the data.  That is to say, is the data defined correctly?

  2. A high-level idea around the completeness of the data.  That is to say, are the required data elements present and populated?

  3. A high-level idea around the integrity of the data.  That is to say, are there orphaned records?

  4. A high-level idea around the consistency of the data.  That is to say, are the same attributes in separate systems represented uniformly?

  5. What types of remediation need to be performed to bring the data into a higher state of conformity

  6. What types of remediation need to be performed to bring the data into a higher state of completeness

  7. What types of remediation need to be performed to bring the data into a higher state of integrity

  8. What data I have at my disposal to determine data accuracy and duplication

  9. A high-level idea of the data taxonomies present

  10. A high-level idea of the data attributes present that should be incorporated into a canonical model
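The first four questions in the list above can each be reduced to a small check.  The following is a rough sketch under stated assumptions rather than a description of any tool: the function names, the two-letter state code pattern and the sample rows from a customer, order and billing system are all hypothetical.

```python
import re


def conformity(rows, column, pattern):
    """Share of populated values that match the expected format."""
    values = [r[column] for r in rows if r.get(column)]
    matches = [v for v in values if re.fullmatch(pattern, v)]
    return len(matches) / len(values) if values else 1.0


def completeness(rows, column):
    """Share of rows in which the column is populated."""
    populated = [r for r in rows if r.get(column) not in (None, "")]
    return len(populated) / len(rows) if rows else 1.0


def orphans(child_rows, fk, parent_rows, pk):
    """Child rows whose foreign key has no matching parent row."""
    parent_keys = {p[pk] for p in parent_rows}
    return [c for c in child_rows if c[fk] not in parent_keys]


def inconsistent(system_a, system_b, key, column):
    """Keys present in both systems whose values for the column differ."""
    b_by_key = {r[key]: r[column] for r in system_b}
    return [
        r[key]
        for r in system_a
        if r[key] in b_by_key and r[column] != b_by_key[r[key]]
    ]


# Hypothetical sample data, invented for this example.
customers = [
    {"id": 1, "state": "NY"},
    {"id": 2, "state": "N.Y."},
    {"id": 3, "state": ""},
    {"id": 4, "state": "TX"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 9},
]
billing = [
    {"id": 1, "state": "NY"},
    {"id": 4, "state": "Texas"},
]

state_conformity = conformity(customers, "state", r"[A-Z]{2}")
state_completeness = completeness(customers, "state")
orphaned_orders = orphans(orders, "customer_id", customers, "id")
state_mismatches = inconsistent(customers, billing, "id", "state")

print(state_conformity, state_completeness, orphaned_orders, state_mismatches)
```

Numbers like these do not fix anything by themselves, but they tell you which kind of remediation (conformity, completeness or integrity) deserves attention first.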


That’s a lot of information from a few fairly straightforward exercises.  Again, these exercises are made simple through the power of the aforementioned tools.  So, rather than jumping into remediation, I suggest you get to know your data by performing data profiling.
