In my last post I made the statement that one of the uses for data discovery was to produce better ETL design. I wanted to backup that statement with a follow up post on why I feel this way, some supporting research and how to go about achieving this enhanced design.

Why data discovery leads to better ETL design

Let’s start with why I feel this way. Before I’d even heard of data quality I was doing it on a daily basis. You see I spent several years as an ETL developer on many data warehousing implementation projects.

Typically after a couple of briefing meetings, I’d start developing ETL mappings. Like any development effort that was followed by some unit testing where I would discover that although my ETL was written to specifications, the load didn’t “look right”.

After some digging I usually found the culprit was the fact that the source data did not match the expectations I had going into the development effort and it was time to, at the very least, add some transforms to the mapping to accommodate for the discrepancies. In effect, I was performing two critical functions left out of the original development plan, data profiling and enhancement. I feel strongly that had these two processes not been left out, I would have had a more complete and accurate ETL development experience from the get-go.

Unfortunately this was not an isolated event and, in fact, happened on almost every ETL project. First hand experience is why I feel so strongly that data discovery leads to a better development process and, ultimately, outcome.

Supporting Research for Data Discovery in ETL Design

In a fairly recent polling exercise the Passioned Group, an analyst and consultancy company, based in The Netherlands, specializing in Business Intelligence, Data Integration and ETL tools, conducted a polling of 2,000 participants where they ask what they thought were the most important requirements when choosing an ETL tool. The results demonstrated just how important data discovery is to ETL developers.

As you can see aside from performance, data profiling was the most important feature. My intuition tells me that the people who responded to the poll had similar experiences to mine when developing ETL solutions.

Way back in 2001 William Laurent of Information Management wrote a piece entitled, Best Practices for Data Warehouse Database Developers. The number one best practice was make sure you are provided with a usable data dictionary before starting heavy-duty development. Data discovery can help build that data dictionary without relying on assumptions and assertions made by business analysts and database administrators.

In defining what ETL is the Passioned Group mentions data profiling by explaining how it can help build a system that

that is robust and has a clear structure.

The Data Warehouse Information organization , a site “Powered by "DWH Professionals", "DWH Enthusiasts" and People alike” graphically depicts data profiling in their recommended ETL design process.

Here is an important statement they make about the benefits of data profiling during ETL design.

Data Profiling is a process that familiarizes you with the data you will be loading into the warehouse/mart

So how do I use data discovery to achieve a better ETL design?

As I mentioned in my previous post, I recommend starting with the following question:

What are the critical data domains we are looking to integrate into the target?

The reason I start with this seemingly basic question is so that you can build true discovery processes into the ETL design. True discovery finds data unbeknownst to the data consumer that also needs to be included in the target. To me, this is one of the most value added services that the ETL team can provide to the data consumers. Here is an example, taken from my previous experience, that demonstrates what I mean.

I had a marketing client that was looking to build a repository from which they could perform campaign management and analytics. They had done a fair mount of quality due diligence and identified what they felt were the required sources.

When I asked my generic question there was a fair amount of dissent in the room and some even pointed to the source to target matrix (STTM) as my source of information. However, I pressed on and discovered that some of the more executive users of the analytics were interested in performing analysis on customers were were marketed to but the address of record, for which the source systems was included in the STTM, was not deliverable (or was returned by the USPS).

As it turns out, this information was not stored in a source system but rather kept in a spreadsheet (of course) by one of the marketing administrators. Of course knowing this allowed me to incorporate the spreadsheet in the ETL sources but it also help us build in another process which discovered and profiled address data in critical business applications which were then included in an enrichment process so that undeliverable addresses could be updated with the proper addresses (where applicable).

Data discovery is a simple process once you know where to point the discovery tool. This focus is obtained by asking the general but effective question I mentioned above. Data domains, like address, help you ask more intelligent and specific questions like …

what critical applications store, collect or consume address data?

Once this is uncovered, data discovery works much the same way that data profiling works. You define the source, build a connection, define and execute the profile jobs and decipher the results.

Data Discovery for ETL Tips

Here are a few tricks I use when performing data discovery for an ETL design proof of concept.

Profile early and often

Translate data profiles into a metadata dictionary

Identify data anomalies

Never develop an ETL map from a specification, do it based on profile results

Communicate where metadata and data distributions do not match the businesses expectations and look for the root cause

I know this list seems basic, but you’d be surprised how often it does not happen and how much rework and cost is incurred as a result.

Your thoughts?

The Data Quality Chronicle

Monday, April 9, 2012

Data Discovery: a path to better ETL development

Why data discovery leads to better ETL design

So how do I use data discovery to achieve a better ETL design?

Data Discovery for ETL Tips

No comments:

Post a Comment

What data quality is (and what it is not)