The Data Quality Chronicle
My thoughts on data
Thursday, December 26, 2019
What data quality is (and what it is not)
Monday, March 23, 2015
Data Validation: Data Quality's Doorman
Assumptions and Issues Abound
Check out the rest of this article on LinkedIn here
Monday, February 16, 2015
Data Quality (Data Profiling) Tutorial: Talend Open Studio for Data Quality
Talend is an interesting vendor. I've watched them grow and harden all their product offerings over the past few years.
I wanted to share this tutorial from YouTube as an introduction to their data quality offering.
Personally, I have used the profiler, and its no-configuration, out-of-the-box reports are hard to beat.
Enjoy!
Sunday, October 12, 2014
No budget, No problem! - Data Quality on the cheap
With some SQL skills and creativity, a data quality program can thrive. It can even have slick dashboards that measure quality trends.
The main components of a data quality program are business rules.
So now that we've talked the talk, let's walk the walk. As I mentioned, all you really need is a rule. In our example, the rule is that every bill must be associated with a customer, and the components below turn that rule into a measurable metric:
- Data Quality Dimension: Integrity is the dimension we will use in our example
- Metric: That's the name we give to the rule
- Rule: Our essential component
- Total Records: In our example this would be the total number of bill records
- Total Records Violating the rule: In our example this would be the total number of bill records not associated to a customer
- % Violating: Violations / Total Records
- Target: Number of violating records we are willing to accept
- Target Description: Explanation of the target derivation
- Count Trend: Is the number of violating records increasing or decreasing
- % Trend: Is the percentage of violating records increasing or decreasing
- Only 0.13% of bills are not associated with a customer. This provides a measure of the severity of the issue relative to the number of bills
- The count trend is heading in the wrong direction, as indicated by the red traffic light icon: 59 bills were processed this week that are not associated with a customer
- The % trend is headed in the right direction, in that more bills are being processed each week and the overall percentage of bills not associated with a customer is not growing
- In the end, there is an issue, but it is a small one, and it is not growing relative to the growth in bills processed
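To make this concrete, here is a minimal SQL sketch of how a metric like this might be calculated. The bill and customer table and column names are hypothetical placeholders, not from the original dashboard; swap in your own schema.

SQL Query

-- Rule (Integrity dimension): every bill must be associated with a customer
-- Hypothetical tables: bill(bill_id, customer_id) and customer(customer_id)
select
  count(*) as total_records,
  sum(case when c.customer_id is null then 1 else 0 end) as records_violating,
  round(100.0 * sum(case when c.customer_id is null then 1 else 0 end) / count(*), 2) as pct_violating
from bill b
left join customer c on b.customer_id = c.customer_id;

Run something like this on a schedule, store each run's counts, and the count trend and % trend come from comparing the latest run to the previous ones.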
Thursday, September 4, 2014
Graph DB Comparison
OrientDB vs Neo4j
Several Neo4j limitations stood out in this comparison:
1. One master server, which creates a bottleneck on write operations when distributed
2. Cypher. Easy to learn, but who needs to constantly convert SQL to Cypher?
3. Inability to reclaim storage space on deletes without a restart
4. Schema-less architecture, which inhibits design for a lot of use cases
Sunday, June 22, 2014
Graph Database Observations
I really like modeling for graph databases. It is much more intuitive and direct in terms of moving toward a solution. I can see where graph dbs fit into the Agile development methodology really well (about time something fit into Agile well).
Even as a novice, data modeling for a graph database feels intuitive, fast and fun!
I have seen some promise in Talend's Big Data integration tool for migrating data into Neo4j. I think that is a HUGE plus! Way to go Talend!!
However, what I do see, at least initially, is a large amount of coding to migrate data from a relational database to a graph database. Maybe this is a good thing? Graph databases seem to shrink the design time, document in an intuitive manner, and require more time for developers to code the solution. For years I have felt strongly that too much time is wasted designing and documenting solutions and not enough time is spent actually coding them. Graph databases seem to tip the scale in favor of the developer with regard to this dynamic.
Graph databases seem to reduce design and documentation durations and give that time back to the developer!
Here are five things I have learned so far that are worth passing on:
- Nodes = vertices = records (in the relational world)
- Relationships = edges = constraints (in the relational world)
- Nodes/Vertices = records = you will have a lot of nodes/vertices (learn how to create 'em, you'll need it)
- Nodes/Vertices have properties which hold values (kind of like table attributes)
- Relationships/edges have direction and properties, allowing the developer to program the strength of the relationship and make it uni- or bi-directional
I mentioned modeling for graph databases so I will share my first graph model. This model is just a draft and a high level depiction of a customer centric solution. However I feel it is a good example of how intuitive they are to read, develop from and iterate through. This only took about 15 minutes to work out.
[Figure: high-level graph model of a customer-centric solution]
That's it for now. I'll be adding to this topic as I develop more graph skills!
Helpful links
http://nosql.mypopescu.com/post/10162152437/what-is-a-graph-database#about-blog
http://thinkaurelius.github.io/titan/
https://github.com/thinkaurelius/titan/wiki/Getting-Started
http://www.neo4j.org/
http://www.neo4j.org/learn
Wednesday, April 16, 2014
Eventual Consistency Models: Release Consistency
Before we get into the gritty details let's define some key terms.
- Node: In a distributed system a node represents hardware which contains data and code
- Memory Object: Medium for data storage. Could be a JSON document, graph node or cached memory location
In this post, let's look at the release consistency model. This model enforces, or maybe implements is the better word, consistency through two main operations: acquire and release.
In order to write data, a node must acquire a memory object. When the write is complete, the node releases the memory object.
The model is implemented by distributing all the writes across the system before another node can acquire the memory object.
There are two methods to implement release consistency: eager and lazy. In the eager implementation, writes are distributed as soon as the node releases the memory object. In the lazy implementation, writes are distributed only when another node acquires the memory object.
Eager release consistency is (potentially) more costly to the system, in that it distributes data not necessarily required by the other nodes. The upside of eager release consistency is that when a node does request the data, latency is low.
Lazy release consistency essentially represents the reverse scenario. Only required data is distributed, conserving system resources; however, request latency can be high.
Hope you enjoyed the post! I enjoyed writing it. Stay tuned for more write ups on eventual consistency models.
Finding Orphaned Nodes in a Graph Database
The Problem
Orphaned records represent missed opportunity
Regardless of the data store, unrelated data cannot be mined or analyzed and often goes undetected until there is a problem.
The Solution
Profile your data
Data profiling is probably the most underutilized practice in all of data management. I can't think of a data-related issue where the resolution methodology doesn't start with profiling the data. Detecting and resolving orphaned data is no exception.
In the relational database world there are many ways to profile data and detect orphans. From packaged solutions with dashboard reports to simple SQL queries and Excel exports, profiling relational data is a mature practice.
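For reference, orphan detection on the relational side can be as simple as the query below. The orders and customers tables here are hypothetical, purely for illustration; keep it in mind as a point of comparison for the graph version later in this post.

SQL Query

-- Hypothetical relational schema: orders(order_id, customer_id), customers(customer_id)
-- Orders whose customer_id does not resolve to an existing customer record
select o.order_id, o.customer_id
from orders o
left join customers c on o.customer_id = c.customer_id
where c.customer_id is null;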
But what if you are working with graph database technologies? Prepackaged solutions for graph technology are a growing market, but not as mature as the RDBMS market, and those SQL queries are not exactly going to translate well into the graph world.
The Situation [ref]No, not some abdominal flashing kid from Jersey. I'm referring to a scenario.[/ref]
Let's say you have a graph database which you are using to analyze which of your customers are placing orders. You will likely have a graph model that looks something like the figure below.
In this model you have a customer node [ref] in graph terms a node is a record [/ref] and an order node, related to each other by the PLACED relationship [ref] in graph terms a relationship joins two nodes and describes their relationship [/ref].
Using this model you can determine (quickly) what orders a customer has placed with very little code.
Cypher Query
match (c:Customer), (o:Order) where (c)-[:PLACED]->(o) return c,o;
Graph Result
Here we can see that customer 0 placed orders 1133 and 1132. Not insightful enough? Here is the result in a more descriptive table form.
So Bill Smith placed order number 47 on 10/7/2013 and order number 90 on 10/9/2013.
This is great when the nodes are all related correctly. However, this post is not about when everything is in fine working order. This post is about orphans, right?
Orders without Customers
Let's look at the code required to find what orders are not related to a customer.
match (o:Order) where not ((:Customer)-[:PLACED]->(o)) return distinct o;
Here we are saying give me the distinct list of orders which are not placed by a customer.
Here are the results in the more readable table layout
Summary
So by using the relationship between customers and orders and specifying where it does not exist, we can find the 'orphaned' orders. [ref] Here is the Cypher query to find customers without orders.
match (c:Customer) where not ((c)-[:PLACED]->(:Order)) return distinct c;
Technically a customer should be someone who places an order of some kind (otherwise they'd be a prospect). [/ref]