Wednesday, April 16, 2014

Finding Orphaned Nodes in a Graph Database

The Problem



Orphaned records represent missed opportunity



Regardless of the data store, unrelated data cannot be mined or analyzed and often goes undetected until there is a problem.

The Solution



Profile your data



Data profiling is probably the most under utilized practice is all of data management.  I can't think of a data related issue where the resolution methodology doesn't start with profiling the data.  Detecting and resolving orphaned data is no exception.


In the relational database world there are many ways to profile data and detect orphans.  From packaged solutions with dashboard reports to simple SQL queries and Excel exports, profiling relational data is a mature practice.


But what if you are working with graph database technologies?  Prepackaged solutions for graph technology is a growing market, but not as mature as the RDMS market and those SQL queries are not exactly going to translate well into the graph world.



The Situation [ref]No, not some abdominal flashing kid from Jersey.  I'm referring to a scenario.[/ref]


Let's say you have a graph database which you are using to analyze which of your customers are placing orders.  You will likely have a graph model that looks something like the figure below.



In this model you have a customer node [ref] in graph terms a node is a record [/ref]  and an order node, related to each other by the placed relationship [ref] in graph terms a relationship joins to nodes and describes their relationship [/ref].

Using this model you can determine (quickly) what orders a customer has placed with very little code.

Cypher Query


 match (c:Customer), (o:Order) where (c)-[:PLACED]->(o) return c,o;

 Graph Result




Here we can see that customer 0 placed orders 1133 and 1132.  Not insightful enough?  Here is the result in a more descriptive table form.



So Bill Smith placed order number 47 on 10/7/2013 and order number 90 on 10/9/2013.

This is great when the nodes are all related correctly, however, this post is about when everything is in fine working order.  This post is about orphans, right?

Orders without Customers


Let's look at the code required to find what orders are not related to a customer.
match (o:Order), (c:Customer) where not ((c)-[:PLACED]->(o)) return distinct o;

Here we are saying give me the distinct list of orders which are not placed by a customer.



Here is the more readable table layout results


Summary


So by using the relationship between customers and orders and specifying where it does not exist, we can find the 'orphaned' orders. [ref] Here is the cypher query in order to find customers without orders.

match (o:Order), (c:Customer) where not ((c)-[:PLACED]->(o)) return distinct c;

Technically a customer should be someone who places an order of some kind (otherwise they'd be a prospect). [/ref]

 

No comments:

Post a Comment

What data quality is (and what it is not)

Like the radar system pictured above, data quality is a sentinel; a detection system put in place to warn of threats to valuable assets. ...