Monday, June 13, 2011

The role of data quality in ETL design: DQETL

Introduction


Data integration is nothing new.  Since the advent of data warehousing, data integration has been a major initiative for most large organizations.  One of the most common obstacles to integrating data into a warehouse has been that assumptions about the state of the source data have been either false or flawed at best.  One reason for this is that very little investigation, or data profiling, is performed on the source data prior to the design and execution of the data extraction routines.

With all there is to plan for and deliver on data warehousing projects, this oversight is understandable.  However, I believe it is time for data quality to assume the role of reporting on and remediating the current state of all source data to be migrated into a warehouse.

Turning assumptions into facts


If source data is profiled, what were once assumptions about the data can be transformed into facts about its state.  Data profiling consists of scanning data and typically delivers measures such as the frequency of nulls, the extent to which data is unique, and the ranges of actual values within each field included.  With data quality tools such as Informatica's Data Quality v9, Global ID's data profiler, or Talend's data profiler, these basic reports can be compiled with a few clicks of the mouse.  Furthermore, these products offer portals where dashboards detailing the current state of the data can be delivered to both technical and business-oriented audiences.
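
The tools above deliver these measures through a GUI, but the basic metrics are easy to sketch.  Here is a minimal example in Python with pandas; the file name and columns are hypothetical assumptions, just to show the kind of output a profile delivers.

import pandas as pd

# Minimal profiling sketch: null frequency, uniqueness, and value ranges.
# The file name and column names here are hypothetical.
df = pd.read_csv("customers.csv")

profile = pd.DataFrame({
    "pct_null": df.isna().mean() * 100,          # frequency of nulls per field
    "pct_unique": df.nunique() / len(df) * 100,  # extent to which data is unique
    "min_value": df.min(numeric_only=True),      # lower bound of actual values
    "max_value": df.max(numeric_only=True),      # upper bound of actual values
})

print(profile)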

Data Profiling 101


As I mentioned, creating a data profile can be done with a few mouse clicks.  Typically the steps are as follows:

  1. Define a connection to the data source

  2. Define the data source

  3. Define which fields are to be included in the profile

  4. Define any business related rules to be included in the profile

  5. Schedule the profile for execution


Connections


Defining the connection to the data source usually involves a few simple steps.  Connections are typically either to a database or to a file.  While connecting to a file includes parameters such as delimiters, field names, and data types and lengths, connecting to a database usually involves location parameters like host and authentication credentials like username and password.

Whether you connect directly to the database or use a flat file extract typically depends on circumstances like resources for a test environment and the ability to procure the required credentials.  Either way, the lion's share of the work is setting up the connection.
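
To make the two connection styles concrete, here is a rough sketch in Python with pandas; the file path, delimiter, column names, and database file are all assumptions for illustration, not the syntax of any particular profiling tool.

import sqlite3
import pandas as pd

# File connection: delimiters, field names, and data types are declared up front.
# The path, column names, and dtypes below are illustrative assumptions.
orders_file = pd.read_csv(
    "extracts/orders.txt",
    sep="|",
    names=["order_id", "customer_id", "order_date", "ship_date"],
    dtype={"order_id": "Int64", "customer_id": "Int64"},
)

# Database connection: location and authentication parameters instead.
# sqlite3 needs only a path; a client/server database would also take
# host, port, username, and password.
conn = sqlite3.connect("warehouse_staging.db")
orders_db = pd.read_sql_query("SELECT * FROM orders", conn)
conn.close()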

Candidates


The next logical step in creating a data profile is to define which fields to include.  Within the context of validating ETL for data integration, this step depends heavily on the fields included in the migration, and even more so on the fields requiring transformation.

Because not all data is migrated, data profiling is best limited to the tables that are required.  Primary and foreign keys are almost always included to ensure uniqueness, as are timestamp fields to ensure completeness and date format conformity.
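
As a rough illustration, the sketch below limits a profile to a handful of candidate fields and checks key uniqueness and timestamp completeness; the table and column names are hypothetical.

import pandas as pd

# Limit the profile to fields that are actually migrated or transformed.
# The source file and column names are hypothetical.
candidate_fields = ["order_id", "customer_id", "order_date", "ship_date"]
orders = pd.read_csv("extracts/orders.txt", sep="|")[candidate_fields]

# Primary/foreign keys: confirm uniqueness and presence.
pk_is_unique = orders["order_id"].is_unique
fk_null_pct = orders["customer_id"].isna().mean() * 100

# Timestamp fields: confirm completeness before checking format conformity.
date_null_pct = orders[["order_date", "ship_date"]].isna().mean() * 100

print(f"order_id unique: {pk_is_unique}")
print(f"customer_id null %: {fk_null_pct:.1f}")
print(date_null_pct)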

Rules


When we talk about data profiling, most of the metrics are technical (percentage of uniqueness, percentage of nulls, etc.).  However, one of the most beneficial practices is to include rules in the data profile that are based on business rules.  Some examples are validating that certain chronological events are in order (e.g., that a ship date does not precede an order date) and that logical assumptions (e.g., that individuals recorded as male do not have positive pregnancy outcomes) are indeed valid.

Constructing business rules often involves the participation of a business domain subject matter expert; however, they can also be formed from the conceptual design of the ETL.  By reverse engineering the transformation logic, it is possible to derive at least one rule that needs to be tested.  After all, transformation logic would be negatively affected by things such as nulls, data type nonconformity, and values outside the expected range.
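
A quick sketch of what such a rule looks like in practice follows, using the ship date versus order date example; the source file and column names are again assumptions for illustration.

import pandas as pd

# Business-rule check layered on top of the technical profile.
# The file, delimiter, and column names are illustrative assumptions.
orders = pd.read_csv(
    "extracts/orders.txt",
    sep="|",
    parse_dates=["order_date", "ship_date"],
)

# Chronological rule: the ship date must not precede the order date.
out_of_order = orders[orders["ship_date"] < orders["order_date"]]

print(f"{len(out_of_order)} rows violate ship_date >= order_date")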

The upside of yet another step


Because data integration and migration are already complex, multi-step efforts, adding another step is rarely a welcomed one.  However, once viewed within the context of reducing ETL redesign and increasing transfer success rates, it is frequently, albeit begrudgingly, accepted.

Including data profiling in the data migration suite of operations can indeed deliver these desired outcomes.  When data is profiled prior to ETL design and execution, data states that would otherwise cause ETL loads to fail can be identified and remedied.  An example can be found in the all-important date-related fields.  A simple data profile can detect date formats that are not supported by the target repository.  Dates in data warehousing are crucial for tracking transaction lineage, and if not handled correctly they can be the undoing of an ETL design.
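
As a rough example, the check below flags date values that would not parse under an assumed target format; the format string, file, and column name are illustrative, not taken from any particular repository.

import pandas as pd

# Detect date values the target repository's expected format could not accept.
# The expected format and column name here are assumptions for illustration.
orders = pd.read_csv("extracts/orders.txt", sep="|", dtype=str)

parsed = pd.to_datetime(orders["order_date"], format="%Y-%m-%d", errors="coerce")
bad_dates = orders.loc[parsed.isna() & orders["order_date"].notna(), "order_date"]

print(f"{len(bad_dates)} order_date values do not match %Y-%m-%d")
print(bad_dates.unique()[:10])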

Summary


While I have just scratched the surface of the role of data profiling in data integration, it is nonetheless an important concept to adopt.  Increases in successful loads and decreases in ETL testing and troubleshooting will save time and resources and paint a more positive image of the data integration team and its capabilities.

