Thursday, December 26, 2019

What data quality is (and what it is not)

Like the radar system pictured above, data quality is a sentinel; a detection system put in place to warn of threats to valuable assets. Keeping with the analogy, strategic leaders, not the tactical resources, decide where and when to place these systems. Data quality is strategic, not tactical. It has tactical applications, but it is very much a calculated implementation of a system used to guard critical assets.
Prior to that implementation, organizations decide which assets are critical, when to start detecting threats, and where detection will be most effective. That is data governance: a strategic outline of what is critical (note: not important, critical), when the need for warning is greatest, and where it is most likely to provide ample warning.
Data quality supports data governance. And since you cannot support that which does not exist, you need data governance before you can have effective data quality. In other words, you need a well-defined strategy before you can have a tactical plan that supports that strategy.
I have read many articles, whitepapers, blog posts, and so forth that get into the minutiae of data quality. I have written about the minutiae myself. But until you understand that data quality is the implementation of, and early warning detection system for, a more comprehensive and strategic data governance plan, all that detail will likely be misplaced and end up guarding a non-critical asset.
If your governance plan centers on data security, point your early warning system at the entry points and endpoints that threaten that security. If your data governance plan centers on data accuracy, patrol the landscape of systems that threaten accuracy.
Hopefully, and subtly, I have demonstrated that a governance plan needs to be centered on a theme. Too many alarms going off in too many directions provide no clearer direction than no alarms at all. That is another reason why strategy matters more than tactics: find out what threatens the end goal the most and focus on that.
In short, data quality is not a plan. It is an implementation of a plan; an early warning detection system to monitor threats to the plan.

Monday, March 23, 2015

Data Validation: Data Quality's Doorman

Assumptions and Issues Abound

Often data quality seems to be an assumption. The business assumes that IT has controls in place, IT assumes that the data supplied is in order, and everyone assumes they'd know about it if these assumptions weren't correct.
The fact is, most often, these assumptions are false and there are data quality issues. The costs, both tangible and intangible, associated with data quality issues are extremely high, so you'd think people would assume less and attack data quality root causes more often. But for some reason data quality remains something people assume is either inherent or already addressed.
Maybe this lack of diligence is because data quality root causes are so varied in nature and often tough to trace. I know of many instances where the source produced issues and there was nothing the data consumer could do to drive down, much less stop, the constant production of erroneous data.
If this situation sounds familiar and you are looking for relief, there is an important step that can give you back control of your data, stop the proliferation of issues, and define root causes for these issues.

Check out the rest of this article on LinkedIn here.

Monday, February 16, 2015

Data Quality (Data Profiling) Tutorial: Talend Open Studio for Data Qual...



Talend is an interesting vendor.  I've watched them grow and harden all their product offerings over the past few years.

I wanted to share this tutorial from YouTube as an introduction to their data quality offering.

Personally, I have used the profiler, and the no-configuration, out-of-the-box reports are hard to beat.

Enjoy!

Sunday, October 12, 2014

No budget, No problem! - Data Quality on the cheap

While most organizations have data quality issues, not every organization has a budget for software to monitor, report on, and remedy data quality. However, just because there is a lack of funding doesn't mean there needs to be a lack of quality.

With some SQL skills and creativity, a data quality program can thrive. It can even have slick dashboards that measure quality trends.
What you really need
In order to build any data quality program, only one thing is absolutely necessary: a business rule that differentiates good quality data from bad.  Preferably, and inevitably, there would be many rules, but the point is still the same.
The main components of a data quality program are business rules

Walking the walk
So now that we've talked the talk, let's walk the walk.  As I mentioned, all you really need is a rule.
Rule: Every bill must be associated with a customer
To understand the rest of this article we need to quickly go over the framework of a data quality scorecard.
  1. Data Quality Dimension: Integrity is the dimension we will use in our example
  2. Metric: That's the name we give to the rule
  3. Rule: Our essential component
  4. Total Records: In our example this would be the total number of bill records
  5. Total Records Violating the rule: In our example this would be the total number of bill records not associated to a customer
  6. % Violating: Violations / Total Records
  7. Target: Number of violating records we are willing to accept
  8. Target Description: Explanation of the target derivation
  9. Count Trend: Is the number of violating records increasing or decreasing?
  10. % Trend: Is the percentage of violating records increasing or decreasing?
Step 2. Step one was getting the rule from the business.  Step two involves writing the SQL code that will give you the numbers to plug into the scorecard framework above.

Total Records SQL

-------------------------------------------------------- Total BILL COUNT
SELECT COUNT(*) AS TOTAL_BILL_COUNT FROM dbo.TBL_Bill_Detail AS C

Total Records Violating Rule SQL
-------------------------------------------------------- TOTAL Bill Orphans
SELECT COUNT(*) AS TOTAL_BILL_Orphans FROM dbo.TBL_Bill_Detail AS C WHERE C.Customer_Key IS NULL
% Violating SQL
-------------------------------------------------------- ORPHANED BILL % VIOLATING
DECLARE @TOTAL_BILLING AS decimal (38,2)
SELECT @TOTAL_BILLING = COUNT(*) FROM dbo.TBL_Bill_Detail AS C
DECLARE @TOTAL_BILLING_VIOLATING AS decimal (38,2)
SELECT @TOTAL_BILLING_VIOLATING = COUNT(*) FROM dbo.TBL_Bill_Detail AS C WHERE C.Customer_Key IS NULL
-- NULLIF guards against a divide-by-zero error if the table is empty
SELECT @TOTAL_BILLING_VIOLATING / NULLIF(@TOTAL_BILLING, 0) * 100 AS ORPHAN_BILLING_PERCENTAGE_VIOLATING
Target SQL
-------------------------------------------------------- ORPHANED BILL TARGET (HARD CODED AT 10% FOR TESTING PURPOSES)
DECLARE @TARGET_PERCENT AS NUMERIC(2,2)
SELECT @TARGET_PERCENT = 0.10
SELECT COUNT(*) * @TARGET_PERCENT as TOTAL_BILL_TARGET FROM dbo.TBL_Bill_Detail AS C
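If you would rather pull all of the scorecard inputs in one pass, here is a minimal sketch that rolls the counts, the percentage, and the target into a single query. It assumes the same dbo.TBL_Bill_Detail table and the same hard-coded 10% target used above.
-------------------------------------------------------- ORPHANED BILL SCORECARD ROW (ONE-PASS SKETCH)
SELECT
    COUNT(*) AS TOTAL_BILL_COUNT,
    SUM(CASE WHEN C.Customer_Key IS NULL THEN 1 ELSE 0 END) AS TOTAL_BILL_ORPHANS,
    -- NULLIF avoids a divide-by-zero error on an empty table
    CAST(SUM(CASE WHEN C.Customer_Key IS NULL THEN 1 ELSE 0 END) AS decimal(38,2))
        / NULLIF(COUNT(*), 0) * 100 AS ORPHAN_BILLING_PERCENTAGE_VIOLATING,
    COUNT(*) * 0.10 AS TOTAL_BILL_TARGET
FROM dbo.TBL_Bill_Detail AS C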
Now that we have our SQL coded, let's take a look at the Excel formulas for determining the trends.  One thing to keep in mind is that trends compare the current total and percent violating against a previous total and percent violating measurement.  In this case I have chosen a week-over-week measurement, or this week's execution against last week's execution.
Here is what it looks like in Excel to the user:
Here are the Excel formulas
=E15-'DQ SCORECARD 8292014'!E15
In this example, E15 is the current week count violating the rule and 'DQ SCORECARD 8292014'!E15 is the previous week count violating the rule.
I use conditional formatting to create the traffic light icon using a rule like this:
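The screenshot of that rule is not reproduced here, so treat the following as an illustrative assumption rather than the exact setup: compute the week-over-week change in its own cell, then either apply a formula-based rule (Excel's "Use a formula to determine which cells to format") such as =E15-'DQ SCORECARD 8292014'!E15>0 with a red format and a companion <=0 rule with a green format, or apply a three-icon set with the icon order reversed so that an increasing violation count shows the red light.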

Once I have the SQL and Excel formulas set up, I simply take the query results and plug them into the Excel spreadsheet.
The end result looks like this ...


Interpreting the results
  1. Only 0.13% of bills are not associated with a customer.  This provides a measure of the severity of the issue relative to the number of bills.
  2. The count trend is heading in the wrong direction, as indicated by the red traffic light icon: 59 bills processed this week are not associated with a customer.
  3. The % trend is headed in the right direction, in that more bills are being processed each week and the overall percentage of bills not associated with a customer is not growing.
  4. In the end, there is an issue.  It is a small one, and it is not growing relative to the growth in bills processed.

With this structure you can quickly add metrics by writing new SQL and then leveraging the Excel formulas to expand your data quality program and scorecards.
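For example, here is a sketch of a second metric under the completeness dimension. The rule (every bill must have an amount) and the Bill_Amount column are hypothetical; swap in whatever rule and columns the business actually gives you.
-------------------------------------------------------- MISSING BILL AMOUNT (HYPOTHETICAL SECOND METRIC)
SELECT
    COUNT(*) AS TOTAL_BILL_COUNT,
    SUM(CASE WHEN C.Bill_Amount IS NULL THEN 1 ELSE 0 END) AS TOTAL_MISSING_AMOUNT,
    CAST(SUM(CASE WHEN C.Bill_Amount IS NULL THEN 1 ELSE 0 END) AS decimal(38,2))
        / NULLIF(COUNT(*), 0) * 100 AS MISSING_AMOUNT_PERCENTAGE_VIOLATING
FROM dbo.TBL_Bill_Detail AS C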
That's how you do data quality on the cheap!

Thursday, September 4, 2014

Graph DB Comparison



OrientDB vs Neo4j


I'm glad I read this comparison and I think you will be too!

It was beneficial in that it outlined many of the enterprise limitations of Neo4j. 

Here are a few ...
1. A single master server, which creates a bottleneck on write operations when distributed
2. Cypher. Easy to learn, but who needs to constantly convert SQL to Cypher? (See the sketch after this list.)
3. Inability to reclaim storage space on deletes without a restart
4. Schema-less architecture, which inhibits design for a lot of use cases
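
To illustrate the kind of translation point 2 refers to, here is a hypothetical pair of equivalent queries (my own example, not taken from the article); the table and property names are assumptions.

-- SQL: which orders each customer has placed
SELECT c.name, o.order_number
FROM customer c
JOIN orders o ON o.customer_id = c.customer_id;

// Cypher: the same question expressed as a pattern match
MATCH (c:Customer)-[:PLACED]->(o:Order)
RETURN c.name, o.orderNumber;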

I'm looking forward to working with Orient some more.  I'll write about that soon!

Sunday, June 22, 2014

Graph Database Observations

I have been ramping up (maybe a little late to the dance?) on graph database technologies, specifically Neo4j and Titan.

I really like modeling for graph databases.  It is much more intuitive and direct in terms of moving toward a solution.  I can see where graph dbs fit into the Agile development methodology really well (about time something fit into Agile well).
Even as a novice, data modeling for a graph database feels intuitive, fast and fun!

I have seen some promise in Talend's Big Data Integration tool for migrating data into Neo4j.  I think that is a HUGE plus!  Way to go, Talend!!

However, what I do see, at least initially, is a large amount of coding to migrate data from a relational database to a graph database.  Maybe this is a good thing?  Graph dbs seem to shrink the design time, document in an intuitive manner, and require more time for developers to code the solution.  For years I have felt strongly that too much time is wasted designing and documenting solutions and not enough time is spent actually coding them.  Graph databases seem to tip the scale in favor of the developer with regard to this dynamic.
Graph databases seem to reduce design and documentation durations and give that time back to the developer!

Here are five things I have learned so far that are worth passing on:

  1. Nodes = vertices = records (in the relational world)

  2. Relationships = edges = constraints (in the relational world)

  3. Nodes/Vertices = records = you will have a lot of nodes/vertices (learn how to create 'em, you'll need it)

  4. Nodes/Vertices have properties which hold values (kind of like table attributes)

  5. Relationships/edges have direction and properties, allowing the developer to encode the strength of a relationship and make it uni- or bi-directional (see the short Cypher sketch below)
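
Here is a minimal Cypher sketch tying points 3 through 5 together; the labels, property names, and strength value are just illustrative.

// create two nodes with properties, then a directed relationship that carries its own property
CREATE (c:Customer {name: 'Bill Smith'})
CREATE (o:Order {orderNumber: 47, orderDate: '2013-10-07'})
CREATE (c)-[:PLACED {strength: 1.0}]->(o)
RETURN c, o;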


I mentioned modeling for graph databases so I will share my first graph model.  This model is just a draft and a high level depiction of a customer centric solution.  However I feel it is a good example of how intuitive they are to read, develop from and iterate through.  This only took about 15 minutes to work out.

[Figure: high level graph model of a customer centric solution]

That's it for now.  I'll be adding to this topic as I develop more graph skills!

Helpful links

http://nosql.mypopescu.com/post/10162152437/what-is-a-graph-database#about-blog

http://thinkaurelius.github.io/titan/

https://github.com/thinkaurelius/titan/wiki/Getting-Started

http://www.neo4j.org/

http://www.neo4j.org/learn

Wednesday, April 16, 2014

Eventual Consistency Models: Release Consistency

As a data quality professional, I find eventual consistency a concern. Don't get me wrong, it is a necessary and impressive piece of engineering.  However, the fact that a system can store different values for the same attribute is concerning.

Before we get into the gritty details let's define some key terms.

  • Node: In a distributed system, a node represents hardware that contains data and code

  • Memory Object: A medium for data storage. It could be a JSON document, a graph node, or a cached memory location


In this post, let's look at the release consistency model.  This model enforces, or perhaps more accurately implements, consistency through two main operations: acquire and release.

In order to write data, a node must acquire a memory object.  When the write is complete, the node releases the memory object. 
The model is implemented by distributing all the writes across the system before another node can acquire the memory object.

There are two methods to implement release consistency: eager and lazy.  In the eager implementation, writes are distributed after the node releases the memory object.  In the lazy implementation, writes are distributed when another node acquires the memory object.

Eager release consistency is (potentially) more costly to the system in that it distributes data that other nodes may never request.  The upside of eager release consistency is that when a node does request the data, latency is low.

Lazy release consistency essentially represents the reverse scenario.  Only required data is distributed, conserving system resources; however, request latency can be high.

Hope you enjoyed the post!  I enjoyed writing it.  Stay tuned for more write-ups on eventual consistency models.

Finding Orphaned Nodes in a Graph Database

The Problem



Orphaned records represent missed opportunity



Regardless of the data store, unrelated data cannot be mined or analyzed and often goes undetected until there is a problem.

The Solution



Profile your data



Data profiling is probably the most underutilized practice in all of data management.  I can't think of a data-related issue where the resolution methodology doesn't start with profiling the data.  Detecting and resolving orphaned data is no exception.


In the relational database world there are many ways to profile data and detect orphans.  From packaged solutions with dashboard reports to simple SQL queries and Excel exports, profiling relational data is a mature practice.


But what if you are working with graph database technologies?  Prepackaged solutions for graph technology are a growing market, but one not as mature as the RDBMS market, and those SQL queries are not exactly going to translate well into the graph world.



The Situation (no, not some abdominal-flashing kid from Jersey; I'm referring to a scenario)


Let's say you have a graph database which you are using to analyze which of your customers are placing orders.  You will likely have a graph model that looks something like the figure below.



In this model you have a customer node (in graph terms, a node is a record) and an order node, related to each other by the PLACED relationship (in graph terms, a relationship joins two nodes and describes how they are related).

Using this model you can determine (quickly) what orders a customer has placed with very little code.

Cypher Query


 match (c:Customer), (o:Order) where (c)-[:PLACED]->(o) return c,o;

 Graph Result




Here we can see that customer 0 placed orders 1133 and 1132.  Not insightful enough?  Here is the result in a more descriptive table form.



So Bill Smith placed order number 47 on 10/7/2013 and order number 90 on 10/9/2013.

This is great when the nodes are all related correctly; however, this post is not about when everything is in fine working order.  This post is about orphans, right?

Orders without Customers


Let's look at the code required to find which orders are not related to a customer.
// an order is orphaned when no customer has a PLACED relationship to it
match (o:Order) where not ((:Customer)-[:PLACED]->(o)) return o;

Here we are saying: give me the orders that are not placed by any customer.



Here are the results in the more readable table layout


Summary


So by using the relationship between customers and orders and specifying where it does not exist, we can find the 'orphaned' orders.

As an aside, here is the Cypher query to find customers without orders (technically a customer should be someone who has placed an order of some kind; otherwise they'd be a prospect):

match (c:Customer) where not ((c)-[:PLACED]->(:Order)) return c;

 
