Sunday, October 12, 2014

No budget, No problem! - Data Quality on the cheap

While most organizations have data quality issues, not every organization has a budget for software to monitor, report and remedy data quality. However just because there is a lack of funding that doesn't mean there needs to be a lack of quality.  

With some SQL skills and creativity, a data quality program can thrive. It can even have slick dashboards that measure quality trends.
What you really need
In order to build any data quality program only one thing is absolutely necessary; a business rule that differentiates good quality data from bad.  Preferably, and inevitably, there would be many rules, but the point is still the same.  
The main components of a data quality program are business rules

Walking the walk
So now that we've talked the talk, let's walk the walk.  As I mentioned, all you really need is a rule.
RuleEvery bill must be associated with a customer
To understand the rest of this article we need to quickly go over the framework of a data quality scorecard.
  1. Data Quality Dimension: Integrity is the dimension we will use in our example
  2. Metric: That's the name we give to the rule
  3. Rule: Our essential component
  4. Total Records: In our example this would be the total number of bill records
  5. Total Records Violating the rule: In our example this would be the total number of bill records not associated to a customer
  6. % Violating: Violations / Total Records
  7. Target: Number of violating records we are willing to accept
  8. Target Description: Explanation of the target derivation
  9. Count Trend: Are the number of violating records increasing or decreasing
  10. % Trend: Is the percentage of violating records increasing or decreasing
Step 2. Step one was get the rule from the business.  Step two involves writing the SQL code that will give you numbers to plug into the scorecard framework above.

Total Records SQL

-------------------------------------------------------- Total BILL COUNT
SELECT COUNT(*) AS TOTAL_BILL_COUNT FROM dbo.TBL_Bill_Detail AS C

Total Records Violating Rule SQL
-------------------------------------------------------- TOTAL Bill Orphans
SELECT COUNT(*) AS TOTAL_BILL_Orphans FROM dbo.TBL_Bill_Detail AS C WHERE C.Customer_Key IS NULL
% Violating SQL
-------------------------------------------------------- ORPHANED BILL % VIOLATING
DECLARE @TOTAL_BILLING AS decimal (38,2)
SELECT @TOTAL_BILLING = COUNT(*) FROM dbo.TBL_Bill_Detail AS C
DECLARE @TOTAL_BILLING_VIOLATING AS decimal (38,2)
SELECT @TOTAL_BILLING_VIOLATING = COUNT(*) FROM dbo.TBL_Bill_Detail AS C WHERE C.Customer_Key IS NULL 
SELECT SUM(@TOTAL_BILLING_VIOLATING / @TOTAL_BILLING*100) AS ORPHAN_BILLING_PERCENTAGE_VIOLATING
Target SQL
-------------------------------------------------------- ORPHANED BILL TARGET (HARD CODED AT 10% FOR TESTING PURPOSES)
DECLARE @TARGET_PERCENT AS NUMERIC(2,2)
SELECT @TARGET_PERCENT = 0.10
SELECT COUNT(*) * @TARGET_PERCENT as TOTAL_BILL_TARGET FROM dbo.TBL_Bill_Detail AS C
Now that we have our SQL coded, let's take a look at the Excel formulas for determining the trends.  One thing to keep in mind is that trends are a comparison of the current total and percent violating against the a previous total and percent violating measurement.  In this case I have chosen a week over week measurement, or this week's execution against last week's execution.
Here is what it looks like in Excel to the user:
Here are the Excel formulas
=SUM(E15-'DQ SCORECARD 8292014'!E15)
In this example, E15 is the current week count violating the rule and 'DQ SCORECARD 8292014'!E15 is the previous week count violating the rule.
I use conditional formatting to create the traffic light icon using a rule like this:

Once I have the SQL and Excel formulas setup, I simply take the query results and plug them into the Excel spreadsheet.  
The end result looks like this ...


Interpreting the results
  1. Only 0.13 % of Bills are not associated to a customer.  This provides a measure of the severity of the issue relative to the number of bills 
  2. The count trend is heading in the wrong direction indicated by the red traffic light icon.  59 bills were processed this week that are not associated with a customer
  3. The % trend is headed in the right direction in that more bills are being processed each week and the overall percentage of bills not associated with a customer is not growing
  4. In the end, there is an issue.  It is a small one.  The issue is not growing relative to the growth in bills processed

With this structure you can quickly add metrics by writing new SQL and then leveraging the Excel formulas to expand your data quality program and scorecards.
That's how you do data quality on the cheap!

No comments:

Post a Comment

What data quality is (and what it is not)

Like the radar system pictured above, data quality is a sentinel; a detection system put in place to warn of threats to valuable assets. ...