
Thursday, September 26, 2013

Would you read my Big Data book?

As you may know, I am pretty critical of the "big data" hype in the information technology media.  One of my criticisms is the very term "big data".  Frankly, I hate the term as much as I hate the outrageous claims that are out there.

I prefer to talk about distributed systems, which are really the differentiator in this space, in my humble (degrees of humility can easily be argued) opinion.

As I researched distributed systems (DS), I came across a topic that struck a chord with my passion for data quality.  The topic was the CAP theorem and the notion of eventual consistency.  At this point, I am becoming obsessed with that notion.
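If you haven't bumped into the term before, here is a toy sketch of the idea (my own illustration in Python, not tied to any particular platform): two replicas accept writes independently and only agree once they exchange updates, using a simple last-write-wins merge.

# Toy illustration of eventual consistency: two replicas accept writes
# independently and converge once they exchange updates, keeping the
# newest write per key (last-write-wins).
class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}          # key -> (timestamp, value)

    def write(self, key, value, ts):
        self.store[key] = (ts, value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None

    def sync_from(self, other):
        # Merge the other replica's entries, keeping the newest write per key.
        for key, (ts, value) in other.store.items():
            if key not in self.store or ts > self.store[key][0]:
                self.store[key] = (ts, value)

a, b = Replica("A"), Replica("B")
a.write("customer_42_email", "old@example.com", ts=1)
b.write("customer_42_email", "new@example.com", ts=2)

print(a.read("customer_42_email"))  # "old@example.com" -- replicas disagree

a.sync_from(b)
b.sync_from(a)
print(a.read("customer_42_email"))  # "new@example.com" -- now consistent

The unsettling part for a data quality guy is that window between the two reads: for a while, the answer you get depends on which replica you ask.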

As a result, I am convinced I need to write a book on the topic and its implications for deploying distributed systems and for the business decisions they support.

However, before I get too far down the line, I want to know one thing ...

would you read my book about eventual consistency and how it impacts a distributed system's ability to aid in the process of making business decisions?



Please vote and let me know!
[polldaddy poll=7427832]

Thursday, March 24, 2011

Data Quality Polls: Troubled domains and what to fix

[caption id="attachment_1047" align="aligncenter" width="630" caption="With which data domain do you have the most quality issues?"][/caption]

As expected, customer data quality remains at the top of the list with regard to having the most issues.  Ironically, this domain has been at the forefront of the data quality industry since its inception.
One reason for the proliferation of concerns about customer data quality could be its direct link to revenue generation.
Whatever the reason, this poll seems to indicate that services built around the improvement of customer data quality will be well founded.

[caption id="attachment_1049" align="aligncenter" width="630" caption="What would you improve about your data?"][/caption]

Once again, there are no surprises when looking at what data improvements are desired.  Data owners seem to be interested in a centralized, synchronized, single view of their data, most notably customer data.

The good news that can be gathered from these polls is that, as an industry, data quality is focused on the right data and the right functionality.  Most data quality solutions are built around the various aspects of customer data quality and ways to improve it so there is a master-managed, single version of a customer.  The bad news is we've had that focus for quite some time and data owners are still concerned.

In my opinion, this is due to the nature of customer data.  Customer data is at the core of every business.  It is constantly changing both in definition and scope, it is continuously used in new and complex ways, and it is the most valuable asset that an organization manages.

One thing not openly reflected in these polls is that the same issues and concerns present in the customer domain are likely also present in the employee and contact domains.  However, they tend not to "bubble up" to the top of the list due to the lack of a linkage to revenue and profit.

I'd encourage comments and feedback on this post.  If we all weigh in on topics like this, we can all learn something valuable.  Please let me know your thoughts on the poll results and my interpretation of them.

Wednesday, February 2, 2011

Data Quality Basic Training

Recently a reader asked me if I had any posts on "data quality basics".  Turns out, I didn't.  So I've decided to put together a series of posts that covers what I feel are the basic essentials to a data quality program.

The many sides of data quality


It is generally accepted in the data quality world that there are seven categories by which data quality can be analyzed.  These include the following (a small code sketch of a few of these checks follows the definitions below):

  1. Conformity

  2. Consistency

  3. Completeness

  4. Accuracy

  5. Integrity

  6. Duplication

  7. Timeliness



  • Conformity - Analyzing data for conformity measures adherence to data definition standards.  This can include determining whether data is of the correct type and length

  • Consistency - Analyzing data for consistency measures whether data is uniformly represented across systems.  This can involve comparing the same attribute across the various systems in the enterprise

  • Completeness - Analyzing data for completeness measures whether or not required data is populated.  This can involve one or more elements and is usually tightly coupled with required-field validation rules

  • Accuracy - Analyzing data for accuracy measures whether data is nonstandard or out-of-date.  This can involve comparing data against standards like USPS deliverability and ASCII code references

  • Integrity - Analyzing data for integrity measures the data references that link information like customers and their addresses.  Using our example, this analysis would determine which addresses are not associated with customers

  • Duplication - Analyzing data for duplication measures the pervasiveness of redundant records.  This involves determining which pieces of information uniquely define a record and identifying the extent to which redundancy is present

  • Timeliness - Analyzing data for timeliness measures the availability of data.  This involves analyzing the creation, update, or deletion of data and its dependent business processes
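To make a few of these concrete, here is a minimal Python sketch of what conformity, completeness and duplication checks might look like.  The field names, rules and sample records are made up for illustration; they are not part of any standard.

# Tiny, hand-rolled checks for three of the dimensions above against a
# made-up customer list: conformity, completeness and duplication.
import re

customers = [
    {"id": 1, "first": "John",  "last": "Smith", "zip": "30301", "email": "john@example.com"},
    {"id": 2, "first": "john",  "last": "smith", "zip": "30301", "email": ""},
    {"id": 3, "first": "Maria", "last": "Lopez", "zip": "ABCDE", "email": "maria@example.com"},
]

# Conformity: does the zip code adhere to the assumed 5-digit definition?
nonconforming = [c["id"] for c in customers if not re.fullmatch(r"\d{5}", c["zip"])]

# Completeness: is the (assumed) required email field populated?
incomplete = [c["id"] for c in customers if not c["email"].strip()]

# Duplication: do any records share the same name-plus-zip identity key?
seen, duplicates = {}, []
for c in customers:
    key = (c["first"].lower(), c["last"].lower(), c["zip"])
    if key in seen:
        duplicates.append(c["id"])
    else:
        seen[key] = c["id"]

print(nonconforming)  # [3]
print(incomplete)     # [2]
print(duplicates)     # [2]

Real profiling tools do this across millions of rows and every attribute, but the questions they ask are exactly these.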


Tuesday, May 4, 2010

On Cloud 9!

Apologies ...


I've been in the clouds lately, in more ways than one.  I've been on the road performing another data quality assessment on an island in the Pacific.  That means I'm gaining status on multiple airlines and becoming increasingly appreciative of noise-canceling technology.

I'm also gaining an appreciation for another technology: cloud-based data quality solutions!  I am leveraging Informatica's latest data quality platform, IDQ v9.  IDQ v9 brings to mind a favorite 80's commercial of mine where peanut butter and chocolate are combined into one tasty treat!  For sure there is a little PowerCenter in your Data Quality and a little Data Quality in your PowerCenter ...

The 50,000-foot view


I won't try to cover the upgrade in one post, but rather just whet your appetite on what's inside the wrapper.  We can binge on the details in the coming months.  For now, let me just highlight what I feel are the most significant enhancements of v9.

  1. Quick start solution: the cloud-based solution of v9 almost eliminates the previous installation requirements of IDQ 8.x

  2. Data Explorer and Data Quality are now one product: this cuts installation and repository management by 50% (at least)

  3. PowerCenter integration means ETL and DQ have tied the knot: now you can stop using TCL and SQL scripting and leverage PowerCenter's integration components.  This includes the consolidation component, a particularly important component for master data management and customer data integration initiatives (a tool-agnostic sketch of what consolidation does follows this list)

  4. Inline data viewing: now you can unit test your transforms without a full run of your mappings, saving time and increasing productivity

  5. Inline data profiling: now you can report on data quality processing without leaving your development environment and share it with clients via a URL!
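To give you a flavor of what a consolidation step actually does, independent of any particular tool, here is a hand-rolled Python sketch.  To be clear, this is not Informatica's consolidation component or its API; it just illustrates building a "golden" record from a cluster of duplicates by keeping the newest non-empty value per field.

# Generic survivorship sketch: given a cluster of records already
# identified as duplicates, build one golden record by letting the most
# recently updated non-empty value win for each field.
from datetime import date

cluster = [
    {"first": "John", "last": "Smith", "phone": "",             "updated": date(2009, 1, 5)},
    {"first": "john", "last": "Smith", "phone": "212-555-0100", "updated": date(2010, 3, 2)},
]

def consolidate(records, fields=("first", "last", "phone")):
    golden = {}
    # Walk records oldest-to-newest so newer non-empty values overwrite older ones.
    for rec in sorted(records, key=lambda r: r["updated"]):
        for field in fields:
            value = rec[field].strip()
            if value:
                golden[field] = value
    return golden

print(consolidate(cluster))
# {'first': 'john', 'last': 'Smith', 'phone': '212-555-0100'}

Doing this with hand-written SQL across dozens of attributes is exactly the kind of nightmare the integrated consolidation component spares you.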


As I mentioned, I'll dive deeper when I'm not in the middle seat of row 42 somewhere over the Pacific.  For now, my big takeaways are the time-saving features, which are almost everywhere, and the PowerCenter integration, which takes data quality to the next level.

No more afternoons installing clients and scripting repositories.  No more SQL development to validate and analyze results.  No more TCL scripts (I love that!).  No more SQL consolidation nightmares!

Up to speed


As for the learning curve on the new look and feel?  It took me a few hours, most of which were productive hours I was able to put toward real work.  Hopefully I can translate what I've learned and cut your learning curve down with posts in the coming months.  For now, the drink cart is coming my way and I've got cash handy!

Check back next month when I go over how I was able to deliver more analysis in a shorter time frame and look good doing it!

Friday, October 2, 2009

Data Quality and Microsoft Dynamics CRM

First I'd like to thank all of those who submitted blog entries to the August edition of the IAIDQ Festival del IDQ Bloggers.  I enjoyed reviewing the submissions and look forward to hosting more IAIDQ related material.

This month I'd like to talk about my recent experiences with some of the data quality features of Microsoft's Dynamics CRM package and how to put them to use in the typical enterprise environment.

One of my favorite features is one that attempts to prevent duplication from occurring by executing a match scan when a record is created or updated.  One of the reasons this is a favorite of mine is that it is proactive and aims to prevent duplication where every data quality expert recommends: at the beginning.  One of the issues I have seen with this feature is that with high data volumes it causes significant delays in record creation.

The feature can be enabled by checking the "When a record is created or updated" option in the Duplicate Detection Settings panel of the Data Management console.

[caption id="attachment_276" align="alignleft" width="380" caption="Data Management Console"]Data Management Console[/caption]

Configuration setting for preventing the dupliction of data upon create or update

 

In the event that duplicates do get created in your data, Microsoft Dynamics CRM also has some reactive features that allow you to identify these records and consolidate them.

One of these features is duplicate detection rules, which are sets of criteria used for matching records.  For instance, it is very common for organizations to accumulate more than one record for a single customer.  In this example, a duplicate detection rule can be built to identify all customer transactions that match on the customer's first and last name as well as their zip code.  As you may already know, I am a diligent advocate of requiring more than first and last name to truly identify an individual.  Based on research regarding change of address (COA), it makes practical sense to limit these criteria to a geographic area like those from which postal codes are generated.  You may also want to throw in more qualifying data elements, such as street name and number, but for the purposes of this posting, we'll stick to a customer's full name and zip code.

This rule will identify each set of records that share the same values for these three data elements.  It is possible to require an exact match on the data values or a match on a substring of the values.  Again, I am an advocate of using partial matches on data like last name due to the frequency of data entry errors.
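To make the matching idea concrete, here is a hand-rolled Python sketch of that kind of rule.  To be clear, this is not Dynamics CRM's rule engine or API; it just illustrates combining exact matches on first name and zip code with a partial (prefix) match on last name, where the prefix length is an assumed, tunable parameter.

# Illustration of exact-vs-partial matching criteria on first name,
# last name, and zip code; candidates are flagged for review, not merged.
def is_candidate_duplicate(rec_a, rec_b, last_name_prefix_len=4):
    same_first = rec_a["first"].strip().lower() == rec_b["first"].strip().lower()
    # Partial (prefix) match on last name tolerates minor data entry errors.
    same_last = (rec_a["last"].strip().lower()[:last_name_prefix_len]
                 == rec_b["last"].strip().lower()[:last_name_prefix_len])
    same_zip = rec_a["zip"].strip() == rec_b["zip"].strip()
    return same_first and same_last and same_zip

a = {"first": "Robert", "last": "Johnson", "zip": "98101"}
b = {"first": "robert", "last": "Johnsen", "zip": "98101"}  # typo in last name
print(is_candidate_duplicate(a, b))  # True -- flagged despite the misspelling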

Once you create your match criteria, you'll want to initialize, or publish, the rule.  You can publish a rule, with the proper permissions, by clicking on the green arrowed icon labeled "Publish" on the toolbar after the rule is saved.  Be forewarned: some rules take quite a long time and a lot of resources to publish, so you may want to perform this action as part of your off-hours operations.

[Screenshot: the Publish icon]


Once the rule is published, it is time to schedule when it will be utilized.  This is done by building a duplicate detection job, which includes the start time, a setting for execution recurrence, and an option to provide an email address for notification when the job finishes.  The following is a snapshot of the interface for developing detection jobs.

[caption id="attachment_280" align="aligncenter" width="463" caption="Duplicate Detection Job "]clip_image010_2[/caption]

Once you have your rules and jobs created, you have completed the basic steps to remove duplicates from your data.  After the job completes and you receive your email, you'll want to review the duplicate matches.

This can be done by opening the duplicate detection job from the System Jobs queue and double-clicking on it.  Once the job is open, you'll see an option labeled "View Duplicates".

[Screenshot: View Duplicates interface]

Next month, I'll dive deeper into the details on how to remove the duplicates with a posting on the merge feature.  I hope this was informative and enough to get most of you started.  I'll address detailed questions if you have them, so please feel free to comment!

Tuesday, August 4, 2009

The DQ Two Step!

Howdy folks!  I'd like to sing you a tune about matching customer data, if you don't mind.  It's called the "Data Quality Two Step".

Heck!  You can even grab your partner if the mood strikes you!

[caption id="attachment_203" align="alignleft" width="85" caption="Feels good!"]Feels good![/caption]

I've recently cleansed some customer data for a great bunch of folks that I love calling my client. 

We had a consolidation ratio of 1:6, or around 17%, which equated to roughly a million duplicates removed.  That's a lot of savings on postage stamps for the marketing department, so they were psyched!  We validated over 90% of the addresses and built reports to identify those that did not meet the requirements for a valid address.  Not too shabby, if I do say so myself!

Now that we've deployed the data into User Acceptance Testing (UAT), I find myself in a familiar place: the business logic.

[caption id="attachment_202" align="alignleft" width="100" caption="You missed something"]What's this?[/caption]

You can spend all the time you need, or even care to, on rules for consolidation, but it usually is not until the data hits the screen that the ramifications are easily understood by the average business user.

Case in point: I recently received an email from a stakeholder asking me to look over some data with him.  I was curious what I'd find when I reached his office, as I had analyzed this data and the processing code more than a few times by now.  On my walk over, I went through many possible scenarios in my head.

Was it something I missed?  Surely not.  I had performed several test runs in order to validate the business logic.  With my curiosity piqued, I rounded the corner and into his office I went.

[caption id="attachment_204" align="alignleft" width="150" caption=""Good point!""]"Good point!"[/caption]

After a little chit-chat (like I said, I love this client), we got down to business.  He proceeded to type a few parameters into the search utility, and I waited with anticipation.

However, after a second, maybe less, my anticipation was replaced with relief and more than a little disbelief.  I'd been over this a time or two, which is why I was in a state of shock.  Not to mention my client was not someone who needed "Data for Dummies".

With identities masked to protect the innocent, below is a sample of the records he was concerned about and wanted me to see.



[caption id="attachment_208" align="aligncenter" width="468" caption="Duplicate?"]Duplicate?[/caption]


So if you've been wondering about this two-step thing, here it comes.

In order to positively identify an individual whose name is not unique, you need to pair their name with an additional piece of identifying information, usually an address.

In other words, it is a two-part match on name and address that can, with a relatively high confidence level, identify a true duplicate.

If we only used a match on name to identify duplicates, we'd consolidate all the John Smiths in the dataset into one customer.  Talk about lost opportunity!  This approach could turn millions of customers into thousands in an instant.

One brief glance in the local phone directory will be enough to demonstrate how non-unique names really are.  



[caption id="attachment_211" align="aligncenter" width="468" caption="They've been through this before!"]They've been through this before![/caption]


Go a step further and ask your local DBA to run some counts on first-last name combinations and you'll be surprised at the results.
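If you'd rather not bug your DBA, here is a rough Python version of that exercise over an assumed list of customer records.  The sample data and field names are made up, but comparing name-only keys against name-plus-zip keys tells the story.

# Count how many records would collapse under a name-only match versus
# the two-step (name + zip) match described above.
from collections import Counter

customers = [
    {"first": "John", "last": "Smith", "zip": "30301"},
    {"first": "John", "last": "Smith", "zip": "73301"},   # different person, same name
    {"first": "John", "last": "Smith", "zip": "30301"},   # true duplicate
    {"first": "Ana",  "last": "Silva", "zip": "10001"},
]

name_only = Counter((c["first"].lower(), c["last"].lower()) for c in customers)
name_zip  = Counter((c["first"].lower(), c["last"].lower(), c["zip"]) for c in customers)

print(sum(n - 1 for n in name_only.values()))  # 2 -- overmatching: merges distinct people
print(sum(n - 1 for n in name_zip.values()))   # 1 -- only the true duplicate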

Just in case this little story wasn't sufficient to remind you, here is that tune I promised you:

 

The two step matching ditty goes a little like this ...

 

Grab your partner's name and twirl it around

Make sure the nickname's proper equal is found

Then grab you their address and scrub it with care

Make sure their mail can be delivered there

Don't get rid of your partner until you are sure

That you've got a match on more

Than the name or the door

lyrics by Data Pickins

music by YouToo?

What data quality is (and what it is not)

Like the radar system pictured above, data quality is a sentinel: a detection system put in place to warn of threats to valuable assets. ...