The Data Quality Chronicle: The many uses of data discovery

Bloor Research defines data discovery as …

the discovery of relationships between data elements, regardless of where the data is stored.

If you expand your mind beyond the conventional relational database meaning of relationships, I agree with this definition. Relationships in this context, or rather the context I choose to apply, mean much more than a primary key to foreign key relationship.

In this context, a relationship is defined as commonality. This commonality can be of data type, value pattern, or business use. If you can profile data and understand these relationships, you can set yourself up for more efficient data management practices in the areas of ETL, MDM, and application lifecycle (or application retirement). Let’s take each of these and examine how data discovery can increase the quality of the effort.
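To make "commonality of value pattern" concrete, here is a minimal sketch of how a profiler might spot two related columns that share no key and no name. The column names and sample values are hypothetical, and real discovery tools use far richer pattern libraries than this.

```python
import re
from collections import Counter

def value_pattern(value: str) -> str:
    """Reduce a value to its shape: digits become 9, letters become A."""
    shape = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", shape)

def dominant_pattern(values):
    """Return the most common shape in a column and the share of values matching it."""
    shapes = Counter(value_pattern(v) for v in values)
    shape, count = shapes.most_common(1)[0]
    return shape, count / len(values)

# Hypothetical columns from two systems that share no foreign key.
crm_phone = ["555-123-4567", "555-987-6543", "555-000-1111"]
billing_contact = ["555-222-3333", "555-444-5555", "n/a"]

crm_shape, crm_share = dominant_pattern(crm_phone)
billing_shape, billing_share = dominant_pattern(billing_contact)

# The columns are candidates for a relationship when their dominant shapes agree.
related = crm_shape == billing_shape
```

The share returned alongside the shape is worth keeping: a column where only two thirds of the values fit the dominant pattern is telling you something about its quality as well as its relationships.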

ETL and Data Discovery

Classical ETL takes data from a source and loads it into a target. If you perform data discovery profiling on the sources before you build the ETL mapping, you can achieve the following:

a more accurate picture of the required data type of the attribute
- By examining the profile you can determine if the assigned data type is most appropriate for the data element

a more accurate specification for the type of transform required
- If the data and metadata are not fully consistent with each other, you can build transforms to accommodate the differences

identification of data anomalies and outliers which require further investigation for possible remediation prior to the migration of data
- this leads to more robust error and exception handling

the identification of data, previously unknown, that meets the business requirements and needs to be migrated
- discovery can lead to uncovering data that was previously undefined or unobtainable for data migrations
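The first and third points above can be sketched in a few lines. This is only an illustration under my own assumptions: values arrive as strings, dates use the `YYYY-MM-DD` format, and outliers are flagged by distance from the median (a median absolute deviation test), which is one of several reasonable choices. The function names and sample data are hypothetical.

```python
from datetime import datetime
from statistics import median

def infer_type(values):
    """Return the narrowest type that every non-empty value fits."""
    non_empty = [v for v in values if v.strip()]

    def all_parse(parser):
        for v in non_empty:
            try:
                parser(v)
            except ValueError:
                return False
        return True

    if not non_empty:
        return "unknown"
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "decimal"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "string"

def outliers(numbers, threshold=5.0):
    """Flag values more than `threshold` median absolute deviations from the median."""
    mid = median(numbers)
    mad = median(abs(n - mid) for n in numbers)
    if mad == 0:
        return []  # no spread to measure against
    return [n for n in numbers if abs(n - mid) / mad > threshold]

# Hypothetical source column declared as VARCHAR in the source metadata.
order_qty = ["3", "5", "4", "6", "5", "900"]

qty_type = infer_type(order_qty)
qty_outliers = outliers([int(v) for v in order_qty])
```

Here the profile suggests the column is really an integer despite its declared type, and that the value 900 deserves a look before migration.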

As discovery tools mature, it may also be possible to generate ETL mappings directly from the tool. If the target is more richly defined in the discovery tool and the sources are more accurately identified, it makes sense to me that a discovery tool can build a better ETL mapping.

This will require tight coupling between the discovery tool and the ETL tool. However, there are vendors in the market with this type of coupling available to them.
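As a rough idea of what mapping generation could look like, the sketch below proposes a source column for each target column by requiring the profiled types to agree and then scoring name similarity. It is not how any particular vendor does it; the profiles, column names, and the 0.5 threshold are all assumptions for illustration.

```python
from difflib import SequenceMatcher

def propose_mapping(source_profile, target_profile, min_score=0.5):
    """Map each target column to the best type-compatible source column, or None."""
    mapping = {}
    for target_col, target_type in target_profile.items():
        best, best_score = None, min_score
        for source_col, source_type in source_profile.items():
            if source_type != target_type:
                continue  # a profiled type mismatch rules the pair out
            score = SequenceMatcher(None, source_col.lower(), target_col.lower()).ratio()
            if score > best_score:
                best, best_score = source_col, score
        mapping[target_col] = best
    return mapping

# Hypothetical profiles: column name -> type inferred during discovery.
source = {"cust_name": "string", "cust_dob": "date", "order_total": "decimal"}
target = {"customer_name": "string", "birth_date": "date", "total": "decimal"}

mapping = propose_mapping(source, target)
```

Targets left as `None` (here `birth_date`, whose name looks nothing like `cust_dob`) are the ones a person still has to map by hand, which is where knowledge of business use comes back in.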

MDM and Data Discovery

In the same way that data discovery can aid ETL, so too can it aid the efforts of an MDM implementation. Since MDM implementations are so dependent on ETL, the same leverage is available and can lead to a better MDM hub definition and ETL specification.

Here too, a feature that generates a data mapping can be particularly useful. With so much configuration required for match and merge rules, cutting some development from the scope of the effort would only add benefit.

Another particularly interesting feature would be the ability to generate candidate schemas for the MDM hub based on the data and metadata obtained in the profiles of the sources.
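A candidate schema generator could be as simple as merging the profiles of each source and emitting DDL from the result. The sketch below assumes each profile records an inferred type, a maximum observed length, and whether nulls were seen; those field names, the table name, and the type mapping are my own inventions, not any tool's output format.

```python
SQL_TYPES = {"integer": "INTEGER", "decimal": "DECIMAL(18, 4)", "date": "DATE"}

def merge_profiles(source_profiles):
    """Combine per-source column profiles into one profile per attribute."""
    merged = {}
    for profile in source_profiles:
        for name, stats in profile.items():
            current = merged.setdefault(name, dict(stats))
            if current["type"] != stats["type"]:
                current["type"] = "string"  # sources disagree, so fall back to text
            current["max_length"] = max(current["max_length"], stats["max_length"])
            current["has_nulls"] = current["has_nulls"] or stats["has_nulls"]
    return merged

def candidate_ddl(table, merged):
    """Render a CREATE TABLE statement from merged profiles."""
    lines = []
    for name, stats in merged.items():
        if stats["type"] == "string":
            sql_type = f"VARCHAR({stats['max_length']})"
        else:
            sql_type = SQL_TYPES[stats["type"]]
        null_clause = "" if stats["has_nulls"] else " NOT NULL"
        lines.append(f"    {name} {sql_type}{null_clause}")
    return f"CREATE TABLE {table} (\n" + ",\n".join(lines) + "\n);"

# Hypothetical profiles of the same attributes in two source systems.
crm = {
    "customer_id": {"type": "integer", "max_length": 6, "has_nulls": False},
    "full_name": {"type": "string", "max_length": 40, "has_nulls": False},
}
billing = {
    "customer_id": {"type": "integer", "max_length": 8, "has_nulls": False},
    "full_name": {"type": "string", "max_length": 60, "has_nulls": True},
}

ddl = candidate_ddl("customer_hub", merge_profiles([crm, billing]))
```

The result is only a candidate: it sizes the hub to fit what the sources actually contain, and a modeler would still decide keys, names, and survivorship.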

Application Lifecycle and Data Discovery

Finally, during a data discovery investigation it is possible to segment data by date ranges derived from last create / update dates. This can be leveraged to perform application and/or data lifecycle management, which would basically archive data older than a certain cutoff date or retire an application that has not been accessed since a predetermined, business-driven date.

Here is another use for dynamically generated data mappings, which would migrate the retired data to a target or archive destination.
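The date-based segmentation is easy to picture in code. In this minimal sketch the row layout, the `last_updated` field name, the application names, and the cutoff date are all hypothetical; the cutoff itself would come from the business.

```python
from datetime import date

def split_by_cutoff(rows, cutoff, date_field="last_updated"):
    """Separate rows into (active, archive) around a business-defined cutoff date."""
    active, archive = [], []
    for row in rows:
        target = archive if row[date_field] < cutoff else active
        target.append(row)
    return active, archive

def retirement_candidates(last_access_by_app, cutoff):
    """List applications whose most recent access predates the cutoff."""
    return sorted(app for app, last in last_access_by_app.items() if last < cutoff)

# Hypothetical profile output.
orders = [
    {"id": 1, "last_updated": date(2005, 3, 14)},
    {"id": 2, "last_updated": date(2011, 11, 2)},
    {"id": 3, "last_updated": date(2008, 12, 31)},
]
app_access = {"legacy_crm": date(2007, 6, 1), "billing": date(2012, 3, 30)}

cutoff = date(2009, 1, 1)
active, archive = split_by_cutoff(orders, cutoff)
to_retire = retirement_candidates(app_access, cutoff)
```

The archive list is what a generated mapping would then move to the archive destination.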

Discovery is only the first step

As you can see from this quick summary, there are many uses for data discovery, and as the tools mature there will be many more ways to leverage a discovery effort.

Your thoughts?

Monday, April 2, 2012