What's an Orphan?
An orphan transaction is a transaction in a "child" table without an associated transaction in a "parent" table. For instance, an address record in the Address table without a link to a customer record in the Customer table. Orphaned records lead to various issues in business operations like marketing and business analytics.
Challenges orphans pose to the business
An address without an associated customer record will pose a challenge to marketing initiatives when trying to engage the customer. How do you address the marketing mailer, determine previous purchasing history or determine gender based segmentation?
In order to confront this issue it is crucial to understand if you have applications that are creating orphans. There are various ways to conduct this analysis manually with SQL or in an automated fashion using data quality tools.
Orphan Analysis Methods
In order to manually investigate orphans a series of SQL queries need to be created. Since these queries need to be executed periodically, it is most likely best if they are setup as stored procedures. For the purposes of this post, let's just concentrate on the substance of the queries.
Orphans are all about linkages, so the basic information required is to find what unique identifiers are present in the child table and not present in the parent table. Using the customer/address example, this translates into finding what customer ids are in the address table but not in the customer table.
The figure illustrates a straight forward relationship where addresses are linked to customers using the Cust ID field. An orphaned address would then be defined as a record where the Address CustID exists in Address but not in the Customer table. In SQL you can write a query that would determine how many of these tranactions exists by coding something like this:
Select Address.AddressID, Address.CustID, Address.CreateDate, Address.ModifiedDatefrom Addresswhere Address.CustID not in (select CustID from Customer)
Technically you don't need fields other than the unique identifiers, but it helps to have create and modified dates to audit trail why the orphan exists. In fact you may want to pull all the attributes of an address as part of the audit exercise.
To do this type of analysis with a data quality tool is a bit more graphical and includes some nice reports. I use Informatica v9 to do this and I'll explain the steps required to accomplish this. Let's assume you already have the customer and address tables defined. Simply follow the steps below to generate the analysis, report, and result set (a subtle value add of using the tool is that you get the report and results with less coding and all the re-usability):
- Right click on the Profiles folder
- Select New Profile
- Give your profile a name
- Add data sources (Customer and Address)
- Select Join Analysis
- Click Next
- Select the fields you want to include from each source (be sure to include the unique identifiers for each source so you can audit the orphans)
- Define the relationship required to join the two tables (Address.CustID and Customer.CustID in the case of our example)
- Click Finish to execute the analysis
In Informatica a report such as the one depicted below is produced.
From the report you can see that it has identified 333,211 Addresses without an associated Customer record and 632 Customers without an Address record, along with 858,222 records that are correctly associated. Armed with this type of analysis, as well as graphical report, an analyst can quickly and accurately convey this message to the business and begin to define remediation plans.
From this analysis you can draw the following conclusions:
- If you are attempting to do a customer consolidtion effort, using name and address methods will not satisfy your needs 100% so you'll need another strategy like name and phone number to augment the consolidation (you may even be able to use the name/phone analysis to remedy the broken address linkages)
- If you are performing a data migration effort, this analysis will explain testing anomolies like record count discrepencies and possibly why some migration jobs fail to load
- If you are looking at data management strategy, you may look into the current status of constraints on the Address table
- If you are looking into a Data Governance program, you may look at this and determine a need for governance process to deal with Customer and Addressd management
Summary
Whether you use a tool or manual SQL efforts, looking into orphaned transactions is a best practice for many different types of data management initiatives. From customer consolidation to data governance, it is important to know the state of transactions in your operational systems. It is my opinion that a data quality tool, like Informatica v9, offers a rich set of reports that helps translate the results to business users in a digestable manner.
I'll examine some remediation methods in the next post covering orphan analysis.
Great article. We face orphan issues in insurance, but the ones I see mostly are orphan claims - those claims that don't have a corresponding policy associated with them. I am hoping to explore the data issues associated with it in a future article myself.
ReplyDeletegood one RT @dqchronicle Data Migration Best Practice: Orphan Analysis http://t.co/ZrYVP7f #dataquality #dqchronicle
ReplyDelete@dqchronicle [blog post]: Data Migration Best Practice: Orphan Analysis - What's an Orphan? An orphan... http://t.co/7YRbq0K #dataquality
ReplyDelete