What Data Discovery is ...
On a recent engagement I was tasked with performing extensive data discovery on a large amount of data in various systems. While the normal practice of a data quality initiative is to work toward an established business goal, in this case we were not immediately sure what that goal would include. As a result, we were effectively "fishing" to determine where the data stood. In essence, we were building a "current state" definition of the data that could then be used to determine what types of data quality goals needed to be established.
Data Discovery is useful in determining a current state of the data environment
As a result of the discovery process, we were able to build profiles of tables that included various aspects of the attributes like data types, field lengths and patterns, and uniqueness. With standard patterns, lengths and types established, outlier reports were created identifying which attributes required data cleansing and more stringent data governance. From this analysis, the framework of a data quality program began taking shape.
Data Discovery is useful in developing a data quality framework
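To make the profiling step a little more concrete, here is a minimal sketch of how a single attribute might be profiled for lengths, patterns, and uniqueness, with values that break the dominant pattern flagged as outliers. This is not the tooling used on the engagement; the function names (`pattern_of`, `profile_column`) and the sample postal codes are hypothetical and exist only for illustration.

```python
from collections import Counter


def pattern_of(value: str) -> str:
    """Map letters to 'A' and digits to '9'; keep every other character as-is."""
    return "".join(
        "A" if ch.isalpha() else "9" if ch.isdigit() else ch for ch in value
    )


def profile_column(values):
    """Summarize lengths, patterns, and uniqueness for one attribute."""
    non_null = [v for v in values if v is not None and v != ""]
    patterns = Counter(pattern_of(v) for v in non_null)
    lengths = Counter(len(v) for v in non_null)
    # Treat the most frequent pattern as the "standard" for this attribute.
    dominant = patterns.most_common(1)[0][0] if patterns else None
    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "is_unique": len(set(non_null)) == len(non_null),
        "lengths": dict(lengths),
        "patterns": dict(patterns),
        "dominant_pattern": dominant,
        # Outlier report: values that do not follow the dominant pattern.
        "outliers": [v for v in non_null if pattern_of(v) != dominant],
    }


# Hypothetical sample data, not taken from any real system.
postal_codes = ["30301", "30302", "3030A", "30301", None, "303-02"]
profile = profile_column(postal_codes)
print(profile["dominant_pattern"], profile["outliers"])
```

Note that the "standard" here is simply whatever pattern occurs most often. Whether that pattern is actually the right one is a question the profile cannot answer by itself.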
What Data Discovery is not ...
Even though data discovery played an essential role in developing the current state assessment and the framework of the data quality program, it's important to realize that it did not provide everything required to develop them. While data discovery can describe numerous aspects of the current state of the data, it cannot determine what the optimal state should be. Discovery has no vision of what should be; it can only describe what is. Data discovery should not be viewed as a way to replace engaging the business about how they want the data to look. It is a tool that can help data quality practitioners get up to speed quickly on the current state of the data landscape. At best, it helps the data quality practitioner make suggestions about what types of data cleansing might be required.
Data Discovery is not able to determine the optimal state of the data environment
Data Discovery is not a replacement for business knowledge and vision
Conclusion
While this post is not particularly detailed, I feel it is an important topic to cover and discuss. If you listen to the sales hype, it is easy to get the impression that data discovery is a substitute for meeting with the business to build data quality goals. This is such a dangerous prospect that I have begun to state this at the beginning of all my data discovery conversations. Don't get me wrong, I value data discovery tools. I recognize their importance. However, I'm a data quality guy and not a business owner.
Does a business owner care about data patterns and lengths? I doubt it. It's critical to be able to present this type of information in a business context that means something to a business owner or business user. Ultimately this conversation starts with some type of business goal. For instance, when performing data discovery on email addresses, it would be more effective to explain that direct marketing will be affected because 10% of the data cannot be used to send electronic marketing materials than to report that 10,000 values do not contain an "@" symbol.
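As a rough illustration of the email example, the sketch below turns the raw profiling finding (values with no "@" symbol) into a statement about marketing reach. The function names and sample addresses are hypothetical, and the "@" check is only the simple rule mentioned in this post, not a full email validation.

```python
def unusable_email_share(emails):
    """Return (invalid_count, share), where invalid means no "@" in the value."""
    invalid = sum(1 for e in emails if "@" not in (e or ""))
    share = invalid / len(emails) if emails else 0.0
    return invalid, share


def business_summary(emails):
    """Phrase the profiling result in terms a business owner cares about."""
    invalid, share = unusable_email_share(emails)
    return (
        f"{share:.0%} of customer records ({invalid} of {len(emails)}) "
        "cannot receive electronic marketing materials."
    )


# Hypothetical sample: nine usable addresses and one missing the "@".
emails = [f"user{i}@example.com" for i in range(9)] + ["user9.example.com"]
print(business_summary(emails))
```

The data is the same either way; only the framing changes, and the framing is what gets the conversation with the business started.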
In summary, data discovery is useful for gaining insight into the current state of the data landscape. However, it is not a "silver bullet solution" for data quality, master data management or data governance initiatives.
While agreeing with the points you make, I feel that the range of information you describe is not "data discovery" but instead is "data profiling".
I wouldn't say this is "fishing", as all of it can be supplied "out of the box" by a good profiling tool.
Data Discovery is more about ad-hoc investigation to understand the relationships that exist in the data, based on equal values, similarly formatted values, embedded values, and so on. All this is necessary to develop a DQ framework.
By associating the rule which uncovers the error with a "measure" it is also possible to provide "objective" business context.
In your example the invalid email could be associated with the "average sales per customer" to calculate the potential lost opportunity.
Derek,
Thanks for stopping by and commenting on the post! I agree 100% with your assertion that profiling is required for a solid DQ framework.
I think there is a gray area in the industry as to what exactly the difference is between "discovery" & "profiling". To me, they are one and the same. If you are profiling data for specifics, you will discover, hopefully, some new insight.
The tone of the post, and specifically the "fishing" comment, is heavily influenced by the common use of discovery and profiling as a substitution for business insight into the data.
That's my focal point that I want to communicate. No amount of discovery and profiling will be adequate alone to form a DQ framework. You still need business insight into what you've discovered.
Thanks again,
William
Great subject - and I like the length of the post to be just enough to trigger a good discussion. However, I hate the use of "task" as a verb.
Hi William,
Interesting post.
Interesting use of the term "Data Discovery". I thought you meant technology that would "discover" data stores, processes that update / reference them etc. A precursor to the start of the data profiling process.
Rgds Ken
Hi Ken,
Thanks for the comment! In a way I do mean discovering data stores. You "point" the discovery software at a server and it will profile metadata and data, as well as build data attribute taxonomies.
Check out www.globalids.com for more information!
Data Discovery: What it is and what it isn't. #in http://t.co/KQE8SaDX #data_discovery #data_quality
+1 "@DataMgmtWonk: Older article, but still very relevant - "Data Discovery: What it is and what it isn’t" dqchronicle http://t.co/3gGdFGFV