Why data discovery leads to better ETL design
Let’s start with why I feel this way. Before I’d even heard of data quality I was doing it on a daily basis. You see I spent several years as an ETL developer on many data warehousing implementation projects.
Typically after a couple of briefing meetings, I’d start developing ETL mappings. Like any development effort that was followed by some unit testing where I would discover that although my ETL was written to specifications, the load didn’t “look right”.
After some digging I usually found the culprit was the fact that the source data did not match the expectations I had going into the development effort and it was time to, at the very least, add some transforms to the mapping to accommodate for the discrepancies. In effect, I was performing two critical functions left out of the original development plan, data profiling and enhancement. I feel strongly that had these two processes not been left out, I would have had a more complete and accurate ETL development experience from the get-go.
Unfortunately this was not an isolated event and, in fact, happened on almost every ETL project. First hand experience is why I feel so strongly that data discovery leads to a better development process and, ultimately, outcome.
Supporting Research for Data Discovery in ETL Design
In a fairly recent polling exercise the Passioned Group, an analyst and consultancy company, based in The Netherlands, specializing in Business Intelligence, Data Integration and ETL tools, conducted a polling of 2,000 participants where they ask what they thought were the most important requirements when choosing an ETL tool. The results demonstrated just how important data discovery is to ETL developers.
As you can see aside from performance, data profiling was the most important feature. My intuition tells me that the people who responded to the poll had similar experiences to mine when developing ETL solutions.
Way back in 2001 William Laurent of Information Management wrote a piece entitled, Best Practices for Data Warehouse Database Developers. The number one best practice was make sure you are provided with a usable data dictionary before starting heavy-duty development. Data discovery can help build that data dictionary without relying on assumptions and assertions made by business analysts and database administrators.
In defining what ETL is the Passioned Group mentions data profiling by explaining how it can help build a system that
that is robust and has a clear structure.
The Data Warehouse Information organization , a site “Powered by "DWH Professionals", "DWH Enthusiasts" and People alike” graphically depicts data profiling in their recommended ETL design process.
Here is an important statement they make about the benefits of data profiling during ETL design.
Data Profiling is a process that familiarizes you with the data you will be loading into the warehouse/mart
So how do I use data discovery to achieve a better ETL design?
As I mentioned in my previous post, I recommend starting with the following question:
What are the critical data domains we are looking to integrate into the target?
The reason I start with this seemingly basic question is so that you can build true discovery processes into the ETL design. True discovery finds data unbeknownst to the data consumer that also needs to be included in the target. To me, this is one of the most value added services that the ETL team can provide to the data consumers. Here is an example, taken from my previous experience, that demonstrates what I mean.
I had a marketing client that was looking to build a repository from which they could perform campaign management and analytics. They had done a fair mount of quality due diligence and identified what they felt were the required sources.
When I asked my generic question there was a fair amount of dissent in the room and some even pointed to the source to target matrix (STTM) as my source of information. However, I pressed on and discovered that some of the more executive users of the analytics were interested in performing analysis on customers were were marketed to but the address of record, for which the source systems was included in the STTM, was not deliverable (or was returned by the USPS).
As it turns out, this information was not stored in a source system but rather kept in a spreadsheet (of course) by one of the marketing administrators. Of course knowing this allowed me to incorporate the spreadsheet in the ETL sources but it also help us build in another process which discovered and profiled address data in critical business applications which were then included in an enrichment process so that undeliverable addresses could be updated with the proper addresses (where applicable).
Data discovery is a simple process once you know where to point the discovery tool. This focus is obtained by asking the general but effective question I mentioned above. Data domains, like address, help you ask more intelligent and specific questions like …
what critical applications store, collect or consume address data?
Once this is uncovered, data discovery works much the same way that data profiling works. You define the source, build a connection, define and execute the profile jobs and decipher the results.
Data Discovery for ETL Tips
Here are a few tricks I use when performing data discovery for an ETL design proof of concept.
- Profile early and often
- Translate data profiles into a metadata dictionary
- Identify data anomalies
- Never develop an ETL map from a specification, do it based on profile results
- Communicate where metadata and data distributions do not match the businesses expectations and look for the root cause
I know this list seems basic, but you’d be surprised how often it does not happen and how much rework and cost is incurred as a result.
Your thoughts?
No comments:
Post a Comment