
Friday, April 13, 2012

Model Citizen: Should Data Discovery Tools include modeling functionality?

Data models: Data Management 101


In my years as a consultant implementing data management solutions, my first question to a client would be …
Can I see the data model?

I have long felt that gaining a better understanding of an organization’s data landscape involves two primary artifacts: a data model and a data profile.  This is because, in a lot of cases, these two artifacts represent the two most fundamental states of an organizational data landscape:
what the data should look like and what it actually looks like

With knowledge of these two states, I felt armed with the ability to quickly and easily identify areas of conformity and areas of anomaly.  These two perspectives tended to be the basis of most Information Technology questions I was there to solve.  In this way, I looked at a data model as the starter kit to a data management strategy, or a 101 crash course to an organization’s data management state.

If there were a lot of anomalies, I knew they would require a lot of data quality strategy and remediation, as well as a robust data governance initiative.  If there was a lot of conformity, I knew the organization was mature enough to handle new data management initiatives like Master Data Management or Big Data implementation.

The sad reality is that most organizations either did not have data models for critical applications or felt that the data model was so out-of-date that it was not going to be very helpful to me in my quest for understanding.

Lack of Data Models: Data Management 100


Without a viable data model, I was unable to reach these valuable conclusions quickly and was forced to, in a sense, reverse engineer the model from profiling results, which was time-consuming and rested on some brash assumptions.

Here are some activities that helped me mitigate the risks of this educated guessing:

  1. Perform Orphan Analysis

    1. Analyzing orphans can help you determine the validity of a data model, how users are adding or deleting data, and whether referential integrity constraints are even in place in production (missing constraints happen more often than most will admit during interviews). A minimal sketch of this check appears after this list.



  2. Analyze Documented versus Actual Data Types

    1. Again, this addresses the validity of the design and how users are entering data (very often data is entered pre-formatted, entry fields are used for purposes other than their original intention, and developers build architectures without really understanding the scenarios the application must support)



  3. Analyze most and least commonly occurring values

    1. This can help create a profile of how often standards are conformed to, which word-of-mouth work-arounds are in place, and which areas conform and do not need attention (just as valuable as identifying the areas that do)
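To make the first check concrete, here is a minimal sketch of an orphan analysis using pandas. The table and column names (orders, customers, customer_id) are hypothetical, and in practice the frames would be pulled from the production database rather than defined in-line.

    import pandas as pd

    # In practice these frames would come from the source database, e.g. pd.read_sql(...)
    orders = pd.DataFrame({"order_id": [1, 2, 3, 4], "customer_id": [10, 11, 99, None]})
    customers = pd.DataFrame({"customer_id": [10, 11, 12]})

    # Left anti-join: keep only orders whose customer_id has no match in the parent table
    merged = orders.merge(customers, on="customer_id", how="left", indicator=True)
    orphans = merged[merged["_merge"] == "left_only"]

    print(f"{len(orphans)} orphaned order rows out of {len(orders)}")
    print(orphans[["order_id", "customer_id"]])

A high orphan count is a strong hint that referential integrity is being enforced in the application layer, if at all.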




Data Modeling Profiles: Data Management 102?


Having been through this many times and knowing how much time and effort it requires (often not accounted for in project plans), I feel strongly that someone should develop a tool that can turn data profiles into a data model.  Most of the functionality is there already; someone just needs to make a case for it (just call me somebody).

Such a solution could take profile results, which include actual data types and inferred relationships, and create a data model that supports data management best practices like data governance and data quality.
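As a rough illustration of what that could look like, here is a sketch that turns a hand-built profile result into candidate DDL. The profile structure and the type-mapping rules are my own assumptions, not the output format of any particular profiling tool.

    # Profile results as a simple dict: inferred type, max length and nullability per column
    profile = {
        "customer": {
            "customer_id": {"inferred_type": "int", "max_length": None, "nullable": False},
            "email":       {"inferred_type": "str", "max_length": 120,  "nullable": True},
            "state_code":  {"inferred_type": "str", "max_length": 2,    "nullable": True},
        }
    }

    def column_ddl(name, stats):
        sql_type = "INTEGER" if stats["inferred_type"] == "int" else f"VARCHAR({stats['max_length']})"
        null_clause = "" if stats["nullable"] else " NOT NULL"
        return f"    {name} {sql_type}{null_clause}"

    for table, columns in profile.items():
        cols = ",\n".join(column_ddl(c, s) for c, s in columns.items())
        print(f"CREATE TABLE {table} (\n{cols}\n);")

A real implementation would also emit the inferred relationships as foreign key candidates, which is exactly where a profile-to-model feature would earn its keep.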

In addition, a profiling-to-model function could go a long way in reducing the amount of time and error involved in building an MDM hub.  After all, profiling all the contributing sources is one of the best practices in defining an MDM hub, so why not take the next step and bake that in?

I completely agree that there are going to be cases where design decisions were driven by performance and that a profile is not always the most accurate source for a model’s design, but there are many cases where a profile-to-model function would increase accuracy and decrease the error and time required to model data landscapes.

What do you think?


Monday, April 9, 2012

Data Discovery: a path to better ETL development

In my last post I made the statement that one of the uses for data discovery was to produce better ETL design.  I wanted to back up that statement with a follow-up post on why I feel this way, some supporting research, and how to go about achieving this enhanced design.

Why data discovery leads to better ETL design


Let’s start with why I feel this way.  Before I’d even heard of data quality I was doing it on a daily basis.  You see, I spent several years as an ETL developer on many data warehousing implementation projects.

Typically, after a couple of briefing meetings, I’d start developing ETL mappings.  Like any development effort, that was followed by some unit testing, where I would discover that although my ETL was written to specification, the load didn’t “look right”.

After some digging I usually found the culprit: the source data did not match the expectations I had going into the development effort, and it was time to, at the very least, add some transforms to the mapping to accommodate the discrepancies.  In effect, I was performing two critical functions left out of the original development plan: data profiling and enhancement.  I feel strongly that had these two processes not been left out, I would have had a more complete and accurate ETL development experience from the get-go.

Unfortunately, this was not an isolated event; in fact, it happened on almost every ETL project.  First-hand experience is why I feel so strongly that data discovery leads to a better development process and, ultimately, a better outcome.

Supporting Research for Data Discovery in ETL Design

In a fairly recent polling exercise, the Passioned Group, an analyst and consultancy company based in The Netherlands specializing in Business Intelligence, Data Integration and ETL tools, surveyed 2,000 participants about what they thought were the most important requirements when choosing an ETL tool.  The results demonstrated just how important data discovery is to ETL developers.

As you can see, aside from performance, data profiling was the most important feature.  My intuition tells me that the people who responded to the poll had experiences similar to mine when developing ETL solutions.

Way back in 2001, William Laurent of Information Management wrote a piece entitled Best Practices for Data Warehouse Database Developers.  The number one best practice was to make sure you are provided with a usable data dictionary before starting heavy-duty development.  Data discovery can help build that data dictionary without relying on assumptions and assertions made by business analysts and database administrators.

In defining what ETL is, the Passioned Group mentions data profiling, explaining how it can help build a system
that is robust and has a clear structure.

The Data Warehouse Information organization, a site “Powered by DWH Professionals, DWH Enthusiasts and People alike”, graphically depicts data profiling in its recommended ETL design process.


Here is an important statement they make about the benefits of data profiling during ETL design.
Data Profiling is a process that familiarizes you with the data you will be loading into the warehouse/mart

So how do I use data discovery to achieve a better ETL design?


As I mentioned in my previous post, I recommend starting with the following question:
What are the critical data domains we are looking to integrate into the target?

The reason I start with this seemingly basic question is so that you can build true discovery processes into the ETL design.  True discovery finds data, previously unknown to the data consumer, that also needs to be included in the target.  To me, this is one of the most value-added services the ETL team can provide to data consumers.  Here is an example, taken from my previous experience, that demonstrates what I mean.

I had a marketing client that was looking to build a repository from which they could perform campaign management and analytics.  They had done a fair amount of due diligence and identified what they felt were the required sources.

When I asked my generic question, there was a fair amount of dissent in the room, and some even pointed to the source-to-target matrix (STTM) as my source of information.  However, I pressed on and discovered that some of the more executive users of the analytics were interested in analyzing customers who were marketed to but whose address of record, though its source system was included in the STTM, was not deliverable (or was returned by the USPS).

As it turns out, this information was not stored in a source system but rather kept in a spreadsheet (of course) by one of the marketing administrators.  Knowing this allowed me to incorporate the spreadsheet into the ETL sources, but it also helped us build another process that discovered and profiled address data in critical business applications; that data was then fed into an enrichment process so that undeliverable addresses could be updated with the proper addresses (where applicable).

Data discovery is a simple process once you know where to point the discovery tool.  This focus is obtained by asking the general but effective question I mentioned above.  Data domains, like address, help you ask more intelligent and specific questions like …
what critical applications store, collect or consume address data?

Once this is uncovered, data discovery works much the same way that data profiling works.  You define the source, build a connection, define and execute the profile jobs and decipher the results.
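For readers who want to see those steps stripped down to their essence, here is a minimal sketch using SQLAlchemy and pandas. The connection string and table name are placeholders; a commercial discovery tool wraps the same basic loop with far more sophistication.

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:password@host/db")   # define the source and build a connection
    df = pd.read_sql_table("customer_address", engine)             # pull the data to be profiled

    # Execute a simple profile job: completeness, cardinality and the most common value per column
    for col in df.columns:
        s = df[col]
        top = s.mode().iloc[0] if not s.mode().empty else "n/a"
        print(f"{col}: {s.notna().mean():.0%} populated, {s.nunique()} distinct, top value = {top}")

Deciphering the results is the part that still takes judgment: deciding which patterns are standards and which are anomalies.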

Data Discovery for ETL Tips


Here are a few tricks I use when performing data discovery for an ETL design proof of concept.

  1. Profile early and often

  2. Translate data profiles into a metadata dictionary (a sketch of this step follows below)

  3. Identify data anomalies

  4. Never develop an ETL map from a specification alone; do it based on profile results

  5. Communicate where metadata and data distributions do not match the business’s expectations and look for the root cause


I know this list seems basic, but you’d be surprised how often it is not followed and how much rework and cost are incurred as a result.
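To make tip 2 a little more concrete, here is a sketch of rolling a table’s profile up into a simple metadata dictionary. The column statistics chosen here are illustrative, not any tool’s standard output.

    import pandas as pd

    def build_metadata_dictionary(df: pd.DataFrame, table_name: str) -> pd.DataFrame:
        """One row of descriptive metadata per column of the profiled table."""
        rows = []
        for col in df.columns:
            s = df[col]
            rows.append({
                "table": table_name,
                "column": col,
                "observed_dtype": str(s.dtype),
                "pct_populated": round(s.notna().mean() * 100, 1),
                "distinct_values": s.nunique(),
                "max_length": int(s.dropna().astype(str).str.len().max()) if s.notna().any() else 0,
            })
        return pd.DataFrame(rows)

    # Usage (source_df would come from the profiled source system):
    # metadata = build_metadata_dictionary(source_df, "customer")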

Your thoughts?

Monday, April 2, 2012

The many uses of data discovery

Bloor Research defines data discovery as …
the discovery of relationships between data elements, regardless of where the data is stored.

If you expand your mind beyond the conventional relational database meaning of relationships, I agree with this definition.  Relationships in this context, or rather the context I choose to apply, mean much more than a primary key–foreign key relationship.

In this context, a relationship is defined as commonality.  This commonality can be of a data type, value pattern, or business use.  If you can profile data and understand the relationships, you can set yourself up for more efficient data management practices in the areas of ETL, MDM, and application lifecycle (or application retirement).  Let’s take each of these and examine how data discovery can increase the quality of the effort.

ETL and Data Discovery


Classical ETL takes data from a source and loads it to a target.  If you perform data discovery profiling on the sources before you build the ETL mapping you can achieve the following:

  • a more accurate picture of the required data type of the attribute

    • By examining the profile you can determine whether the assigned data type is the most appropriate one for the data element (a sketch of this check follows the list)



  • a more accurate specification for the type of transform required

    • If the data and metadata are not 100% coordinated you can build transforms to accommodate this



  • identification of data anomalies and outliers which require further investigation for possible remediation prior to the migration of data

    • this leads to a more robust error handling and exception handling process



  • the identification of data, previously unknown, that meets the business requirements and needs to be migrated

    • discovery can lead to uncovering data that was previously undefined or unobtainable for data migrations
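Here is a small sketch of the first two bullets: compare each column’s declared type with what the values actually support, and flag the columns that will need a transform. The declared types and sample values are hypothetical.

    import pandas as pd

    declared = {"customer_id": "int", "signup_date": "date", "zip_code": "int"}

    source = pd.DataFrame({
        "customer_id": ["1001", "1002", "1003"],
        "signup_date": ["2012-01-15", "2012-02-30", "2012-03-01"],   # one impossible date
        "zip_code":    ["08540", "10001", "K1A0B1"],                 # one non-numeric value
    })

    for col, expected in declared.items():
        values = source[col]
        if expected == "int":
            bad = pd.to_numeric(values, errors="coerce").isna().sum()
        elif expected == "date":
            bad = pd.to_datetime(values, errors="coerce").isna().sum()
        else:
            bad = 0
        if bad:
            print(f"{col}: declared {expected}, {bad} value(s) do not conform -> plan a transform or a data fix")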




As discovery tools mature, it may also be possible to generate ETL mappings directly from the tool.  If the target is more richly defined in the discovery tool and the sources are more accurately identified, it makes sense to me that a discovery tool could build a better ETL mapping.

This will require a tight coupling between the discovery and ETL tools; however, there are vendors in the market with this type of coupling available.

MDM and Data Discovery


In the same way that data discovery can aid ETL, so too can it aid the efforts of an MDM implementation.  Since MDM implementations are so dependent on ETL, the same leverage is available and can lead to a better MDM hub definition and ETL specification.

Here, too, a feature to generate a data mapping could be particularly useful.  With so much configuration required for match and merge rules, cutting some development from the scope of the effort would only add benefit.

Another particularly interesting feature would be the ability to generate candidate schemas for the MDM hub based on the data and metadata obtained in the profiles of the sources.

Application Lifecycle and Data Discovery


Finally, during a data discovery investigation it is possible to segment data by date ranges derived from last create/update dates.  This can be leveraged to perform application and/or data lifecycle management, which would basically archive data past a certain cutoff date or retire an application that has not been accessed within a predetermined, business-driven period.

Here is another use for dynamically generated data mappings which would migrate the retired data to a target or archive destination.

Discovery is only the first step


As you can see from this quick summary, there are many uses for data discovery and as the tools mature there are many more things that can be done to leverage a discovery effort.

Your thoughts?

Thursday, March 22, 2012

The Data Explosion: Opportunities and Challenges Abound

It is an interesting time to be in data management.  There are more sources of data in so many varied formats than ever before.  There are new tools continuously evolving at light speed.  There is the promise of opportunity and with it enormous challenges.

With regard to the opportunities, one of the most interesting things I see developing is increased access to customers.  From traditional to mobile platforms, there are more avenues to interact with customers, giving product and service providers new ways to measure their effectiveness.  I have started researching things like sentiment analysis, an example of how this access, combined with the data explosion, provides insight into product and service perception.

With regard to the challenges, performing analysis on this data requires tools, methodologies, and resources that are unconventional.  For most organizations, it will take some time to align the resources needed to perform meaningful analysis.  That is not even taking into account the budget that needs to be set aside for this activity.

While the technology industry is thrilled with its new story filled with magical elephants and all the promise of a new reality, the boots-on-the-ground in data management must feel like a deer caught in the headlights of an oncoming 18-wheeler doing 90 mph.  To some the data explosion must feel like fireworks against the warm summer sky; to others the explosion must feel like the pounding of cannon fire against the office wall.

What I think is very important to realize is that this data explosion is really both at the same time.  We need to be realistic and remember that while there is a lot of data out there, and with it the promise of new insights, it presents significant challenges to organizations in just how they are going to roll it into the mix of things they already need to do.

I intend to keep my eye on which organizations come out winners and, maybe even more so, which come out losers in this new data frontier.  One of the things I intend to pay particular attention to is ROI: what it costs to do this well and what it produces.

Until I see what that looks like, I am going to hold off on getting giddy about big data / NoSQL … what about you?  Are you “all in” or waiting to see how this goes?

Wednesday, April 20, 2011

Data Discovery. The first step toward data management.

Introduction


Recently, on a data discovery project, I observed something that I wanted to share.  Data discovery efforts, and the tools that support them, are well suited to organizations that have experienced explosive data growth.  With this kind of growth the data landscape expands to the point where in-depth knowledge of the data, and more importantly the metadata, becomes unobtainable.  This is where a product suite like the Global IDs data transparency suite can enable effective data management strategies.

Data Transparency


 What's in a data transparency suite? 

The GIDS Data Transparency Product Suite is a suite of 15 applications that provides companies with a broad set of capabilities to scan and inventory their data landscape. Using these applications, organizations can perform the following tasks.

  1. Scan their data environment (structured data, unstructured data, semi-structured data)

  2. Create and populate a Metadata Repository that can be searched by business and technical users

  3. Profile their structured databases to create a semantic understanding of their data


Source: Global IDs, http://www.globalids.com/products/product-suites/ddp

I can speak from experience when I say that these three functions present a complete picture of a data landscape.  With metadata, profiling results and semantic taxonomies, a master data management / data quality / data governance solution is within reach.


Now I'm no stranger to metadata or data profiles.  I understand the value in them.  In fact, I am delivering value to organizations most weeks of the year with these reports.  However, to me, the thing that sets Global IDs apart is its ability to auto-generate semantic taxonomies.
Looking for privacy data across 300 databases?

This product suite can aggregate privacy data elements, such as social security numbers, from throughout your enterprise.  I've used this feature and was able to create a customer data hub from 30 databases in support of a major MDM initiative.

Timing is everything 


What I took away from my experiences with Global IDs is that it is best used in a complex data environment, as I mentioned in the intro.  It can turn confusion and disarray into organized information with relatively little development.  On the flip side, it may be overkill for a less complex, more defined environment.
In other words, if you are Halliburton, GIDS is for you.  If you are Smith's Accounting Service, GIDS is probably not for you.

Some Examples of Global IDs reports


 http://www.globalids.com/ddp/screenshots

Sunday, April 17, 2011

Know Your Data - Data Profiling

Having data issues is a lot like having any other issue; you just want to get it fixed and move on with life.  I talk with a lot of business users who feel this way. 

They have stories for me when I arrive.  From 62 different state codes to 5 different versions of one of their most high-profile customers, they tell me these stories and I patiently listen. Well, for the most part.  When they finish I know a few things I didn’t before I walked into their office.  I have a good idea of the relative impact of the data issues, an approximate idea of the data domain(s) I need to concentrate on, and a lot about my client and their passion for the data issues.  One thing I don’t know is very much about the data itself. 

In one of my previous posts, Know Your Data, I talked about the importance of knowing data from the perspective of their relationships.  Another important aspect of knowing data is its structure and content; sort of from the inside out.  The way you get this perspective is to profile the data.

There are several types of profiling, but I am specifically referring to metadata and data profiling.  Metadata profiling provides details about the data’s structure, and data profiling provides details about the actual data values.  While at first this seems obvious, it is a hard practice to implement.  It’s hard for two main reasons:

  1. After listening to business users describe their issues and the urgency around getting them fixed, the tendency is to get to work on the resolution

  2. Lack of tools that provide adequate profiling functionality


However, when data quality practitioners are equipped with tools that can perform these profiling functions, they can reach more accurate and useful conclusions.  In a recent blog post Dalton Cervo highlighted the importance of performing data analysis.  In it he mentions how a data quality team engaged in multiple discussions regarding how duplicate records would be consolidated.  However, without profiling the data, as he points out, there was no way of determining how duplicates would be identified.  In other words, without profiling the data there is no way to know what data you can use to identify duplicates, so there is no point in engineering a solution until you know your data.
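As a rough sketch of that point, the check below profiles a few candidate matching fields for completeness and cardinality before any match rule is designed. The column names and sample rows are made up for illustration.

    import pandas as pd

    customers = pd.DataFrame({
        "name":  ["Acme Corp", "ACME Corporation", "Globex", None],
        "email": ["info@acme.com", "info@acme.com", None, None],
        "phone": ["555-0100", "5550100", "555-0199", "555-0199"],
    })

    for col in ["name", "email", "phone"]:
        s = customers[col]
        populated = s.notna().mean()
        distinct = s.nunique(dropna=True) / max(int(s.notna().sum()), 1)
        print(f"{col}: {populated:.0%} populated, {distinct:.0%} distinct among populated rows")

    # Sparsely populated or low-cardinality fields make poor match keys on their own.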

Typically I use Informatica Data Quality Developer and Analyst to profile data if I know exactly what data I am to concentrate on.  If I do not have that advantage, I use Global IDs data discovery functionality.  The combination of these two powerful tools allows me to gain a lot of knowledge in a short period of time. 

Once armed with these details, I begin to have clarity around the following:

  1. A high level idea around the conformity of the data.  That is to say, is the data defined correctly?

  2. A high level idea around the completeness of the data.  That is to say, are the required data elements present and populated?

  3. A high level idea around the integrity of the data.  That is to say, are there orphaned records?

  4. A high level idea around the consistency of the data.  That is to say, are the same attributes in separate systems represented uniformly?

  5. What types of remediation need to be performed to bring the data into a higher state of conformity

  6. What types of remediation need to be performed to bring the data into a higher state of completeness

  7. What types of remediation need to be performed to bring the data into a higher state of integrity

  8. What data I have at my disposal to determine data accuracy and duplication

  9. A high level idea of the data taxonomies present

  10. A high level idea of the present data attributes to be incorporated into a canonical model


That’s a lot of information from a few fairly straightforward exercises.  Again, these exercises are made simple through the power of the aforementioned tools.  However, rather than jumping into remediation, I suggest you get to know your data by performing data profiling.
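As a rough illustration of the first two items in that list, here is a sketch that rolls profile checks up into a per-column completeness and conformity score. The state-code rule and the sample data are illustrative only.

    import re
    import pandas as pd

    df = pd.DataFrame({"state_code": ["NJ", "New Jersey", "NY", None],
                       "customer_name": ["Acme", "Globex", None, "Initech"]})

    # Conformity rules per column; only columns with a known standard get one
    rules = {"state_code": lambda v: bool(re.fullmatch(r"[A-Z]{2}", str(v)))}

    for col in df.columns:
        completeness = df[col].notna().mean()
        line = f"{col}: completeness={completeness:.0%}"
        if col in rules:
            populated = df[col].dropna()
            if len(populated):
                line += f", conformity={populated.apply(rules[col]).mean():.0%}"
        print(line)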

Saturday, November 6, 2010

Data Discovery: What it is and what it isn't

What Data Discovery is ...


On a recent engagement I was tasked with performing extensive data discovery on a large amount of data in various systems.  While the normal practice of a data quality initiative is to work toward an established business goal, in this case we were not immediately sure what that goal would include.  The impact of that condition is that we were effectively "fishing" to determine where the data stood.  In essence, we were building a "current state" definition of the data which could then be used to determine what types of data quality goals needed to be established.



Data Discovery is useful in determining a current state of the data environment

As a result of the discovery process, we were able to build profiles of tables that included various aspects of the attributes like data types, field lengths and patterns, and uniqueness.  With standard patterns, lengths and types established, outlier reports were created identifying which attributes required data cleansing and more stringent data governance.  From this analysis, the framework of a data quality program began taking shape.
Data Discovery is useful in developing a data quality framework
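To show what such an outlier report can look like in miniature, here is a sketch that reduces each value to a character pattern, finds the dominant pattern, and flags the deviations. The digit-to-9 / letter-to-A generalization is a common profiling convention, but the exact rules here are my own.

    import pandas as pd

    def generalize(value) -> str:
        """Reduce a value to a character pattern, e.g. '08540-1234' -> '99999-9999'."""
        return "".join("9" if ch.isdigit() else "A" if ch.isalpha() else ch for ch in str(value))

    zips = pd.Series(["08540", "10001", "10001-2345", "ABC12", "60614"])
    patterns = zips.map(generalize)

    dominant = patterns.mode().iloc[0]          # the standard pattern for the attribute
    outliers = zips[patterns != dominant]       # candidates for cleansing or governance attention

    print(f"Dominant pattern: {dominant}")
    print("Values that deviate from it:")
    print(outliers.to_string())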

What Data Discovery is not ...


Even though the data discovery played an essential role in the development of the current state assessment and the framework of the data quality program, it's important to realize that it did not provide everything required to develop these.  While data discovery can describe numerous aspects of the current state of the data, it cannot determine what the optimal state should be.  Discovery does not have a vision of what should be; it can only describe what is.  Data discovery should not be viewed as a way to replace engaging the business about how they want the data to look.  It is a data tool that can help data quality practitioners get up to speed quickly on the current state of the data landscape.  At best, it helps the data quality practitioner make suggestions about what types of data cleansing might be required.
Data Discovery is not able to determine the optimal state of the data environment

Data Discovery is not a replacement for business knowledge and vision

Conclusion


While this post is not a particularly detailed one, I feel it is an important topic to cover and discuss.  If you listen to the sales hype, it is easy to get the impression that data discovery can replace meetings with the business to build data quality goals.  This is such a dangerous prospect that I have begun to state this at the beginning of all my data discovery conversations.  Don't get me wrong, I value data discovery tools.  I recognize their importance.  However, I'm a data quality guy and not a business owner. 

Does a business owner care about data patterns and lengths?  I doubt it.  It's critical to be able to present this type of information in a business context that means something to a business owner or business user.  Ultimately this conversation starts with some type of business goal.  For instance, when performing data discovery on email addresses, it is more effective to explain that direct marketing will suffer because 10% of the data cannot be used to send electronic marketing materials than to report that 10,000 values do not contain an "@" symbol.
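Here is a tiny sketch of that translation, turning the raw profiling count into the business statement above. Every figure in it (contact counts, response rate, order value) is invented purely for illustration.

    total_contacts = 100_000
    values_missing_at_symbol = 10_000     # the raw profiling finding
    assumed_response_rate = 0.02          # assumption: typical campaign response rate
    assumed_avg_order_value = 150.00      # assumption: average order value

    unreachable_pct = values_missing_at_symbol / total_contacts
    lost_responses = values_missing_at_symbol * assumed_response_rate
    lost_revenue = lost_responses * assumed_avg_order_value

    print(f"{unreachable_pct:.0%} of contacts cannot receive electronic marketing material")
    print(f"Estimated {lost_responses:,.0f} lost responses per campaign, roughly ${lost_revenue:,.0f} in forgone revenue")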

In summary, data discovery is useful for gaining insight into the current state of the data landscape; however, it is not a "silver bullet" for data quality, master data management or data governance initiatives.

 
