The Data Quality Chronicle: The DQ Two Step!

Tuesday, August 4, 2009

The DQ Two Step!

Howdy Folks! I'd like to sing you a tune about matching customer data, if you don't mind. It's called the "data quality two step".

Heck! You can even grab your partner if the mood strikes you.

[Image: "Feels good!"]

I've recently cleansed some customer data for a great bunch of folks that I love calling my client.

We had a consolidation ratio of 1:6, or around 17%, which equated to roughly a million duplicates removed. That's a lot of savings on postage stamps for the marketing department, so they were psyched! We validated over 90% of the addresses and built reports to identify those that did not meet the requirements for a valid address. Not too shabby, if I do say so myself!

Now that we've deployed the data into User Acceptance Testing (UAT), I find myself in a familiar place: the business logic.

[Image: "You missed something"]

You can spend all the time you need, or even care to, on rules for consolidation, but it usually isn't until the data hits the screen that the ramifications are easily understood by the average business user.

Case in point, I recently received an email from a stakeholder asking me to look over some data with him. I was curious what I'd find when I reached his office, as I had analyzed this data and the processing code more than a few times by then. On my walk over I went through many possible scenarios in my head.

Was it something I missed? Surely not. I'd performed several test runs in order to validate the business logic. With my curiosity piqued, I rounded the corner and into his office I went.

[Image: "Good point!"]

After a little chit-chat (like I said, I love this client), we got down to business. He proceeded to type a few parameters into the search utility, and I waited with anticipation.

However, after a second, maybe less, my anticipation was replaced with relief and more than a little disbelief. I'd been over this a time or two, which is why I was in a state of shock. Not to mention, my client was not someone who needed "Data for Dummies".

With identities masked to protect the innocent, below is a sample of the records he was concerned about and wanted me to see.

[Image: sample records, captioned "Duplicate?"]

So if you've been wondering about this two step thing, here it comes.

In order to positively identify an individual whose name is not unique, you need to pair their name with an additional piece of identifying information, usually an address.

In other words, it is a two-part match on name and address that can, with a relatively high confidence level, identify a true duplicate.

If we only used a match on name to identify duplicates, we'd consolidate all the John Smiths in the dataset to one customer. Talk about lost opportunity! This approach could turn millions of customers into thousands in an instant.
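To make the difference concrete, here is a minimal sketch in Python. The records, the field names, and the `match_key` and `group_duplicates` helpers are all made up for illustration; they are not the client's data or the actual processing code.

```python
from collections import defaultdict

# Hypothetical customer records (made up for illustration).
customers = [
    {"id": 1, "name": "John Smith", "address": "12 Oak St, Dallas TX"},
    {"id": 2, "name": "John Smith", "address": "12 Oak St, Dallas TX"},
    {"id": 3, "name": "John Smith", "address": "98 Elm Ave, Austin TX"},
    {"id": 4, "name": "Mary Jones", "address": "12 Oak St, Dallas TX"},
]


def match_key(record, fields):
    """Build a case-insensitive, whitespace-normalized key from the given fields."""
    return tuple(" ".join(record[f].lower().split()) for f in fields)


def group_duplicates(records, fields):
    """Group record ids that share the same key on the chosen fields."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record, fields)].append(record["id"])
    # Keep only groups with more than one record: the candidate duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]


name_only = group_duplicates(customers, ["name"])
two_step = group_duplicates(customers, ["name", "address"])

print(name_only)  # [[1, 2, 3]]
print(two_step)   # [[1, 2]]
```

Matching on name alone would fold the Austin John Smith into the Dallas one. Adding the address keeps him as a separate customer.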

One brief glance in the local phone directory will be enough to demonstrate how non-unique names really are.

[Image: "They've been through this before!"]

Go a step further and ask your local DBA to run some counts on first-last name combinations and you'll be surprised at the results.
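If a DBA isn't handy, a rough version of that count can be sketched in Python. The name pairs below are invented stand-ins for a customer table, and `name_counts` is a hypothetical helper. It does roughly what grouping by first and last name with a count would do in SQL.

```python
from collections import Counter

# Hypothetical (first, last) pairs standing in for a customer table.
names = [
    ("John", "Smith"), ("john", "smith"), ("Mary", "Jones"),
    ("John", "Smith"), ("Maria", "Garcia"), ("Mary", "Jones"),
]


def name_counts(pairs):
    """Count each first-last combination, ignoring case and stray spaces."""
    return Counter(
        (first.strip().lower(), last.strip().lower()) for first, last in pairs
    )


counts = name_counts(names)

# Keep only the combinations that occur more than once, most frequent first.
repeated = [(name, n) for name, n in counts.most_common() if n > 1]
print(repeated)  # [(('john', 'smith'), 3), (('mary', 'jones'), 2)]
```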

Just in case this little story wasn't enough to remind you, here is that tune I promised you:

The two step matching ditty goes a little like this ...

Grab your partner's name and twirl it around

Make sure the nickname's proper equal is found

Then grab you their address and scrub with the care

Make sure their mail can be delivered there

Don't get rid of your partner until you are sure

That you've got a match on more

Than the name or the door

lyrics by Data Pickins

music by YouToo?
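For anyone who would rather read code than lyrics, the ditty's steps can be sketched in Python. Everything here is an assumption made for illustration: the tiny `NICKNAMES` and `SUFFIXES` tables, the field names, and the helper functions are hypothetical, and the deliverability check from the fourth line of the song is left out because it needs a postal reference database.

```python
import re

# Tiny, made-up nickname table; a real one would be far larger.
NICKNAMES = {"bill": "william", "bob": "robert", "jim": "james", "liz": "elizabeth"}

# A few common street-suffix abbreviations (illustrative, not a full standard).
SUFFIXES = {"street": "st", "avenue": "ave", "drive": "dr", "road": "rd"}


def standardize_name(name):
    """Lowercase the name and swap a leading nickname for its proper form."""
    parts = re.sub(r"[^a-z\s]", "", name.lower()).split()
    if parts:
        parts[0] = NICKNAMES.get(parts[0], parts[0])
    return " ".join(parts)


def scrub_address(address):
    """Lowercase, drop punctuation, and abbreviate common street suffixes."""
    words = re.sub(r"[^a-z0-9\s]", "", address.lower()).split()
    return " ".join(SUFFIXES.get(word, word) for word in words)


def is_two_step_match(a, b):
    """Call it a match only when both the name and the address agree."""
    same_name = standardize_name(a["name"]) == standardize_name(b["name"])
    same_door = scrub_address(a["address"]) == scrub_address(b["address"])
    return same_name and same_door


bill = {"name": "Bill Jones", "address": "12 Oak Street"}
william = {"name": "William Jones", "address": "12 Oak St."}
other = {"name": "William Jones", "address": "99 Elm Drive"}

print(is_two_step_match(bill, william))   # True
print(is_two_step_match(william, other))  # False
```

Even a match on both steps is a candidate duplicate rather than a certainty, so a sketch like this is a starting point for review, not a final verdict.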

6 comments:

Henrik Liliendahl Sørensen, August 4, 2009 at 6:15 PM
Awesome post.

About James Bond. I have a way of categorising party data here.

I think the 2 rows are not 2 ‘C’ duplicates but 2 separate instances of type ‘I’.

My Data Quality 2.0 system worked that out by mimicking a human.

The name ‘James Bond’ was found in the table with ‘Comic names’ with the possibility weight 33% (‘Donald Duck’ is 100%).

Neither ‘Secret Place’ nor ‘Hidden Drive’ (with stated numbers) in ‘Brooklyn’ or ‘New York’ was found in the table ‘US Thoroughfares’ or ‘All addresses of the World’.

Some 3.0 day the system will also recognize a misplaced connection between the name ‘James Bond’ and the number ‘007’ and street element ‘Secret’.
Daragh O Brien, August 5, 2009 at 1:57 AM
Great post William. It clearly illustrates the fun we can have trying to match entities. I tend to look for at least two other "facts" in a match (same telephone number, same date of birth) to build as robust a picture as possible of why two things are the same thing.

Acceptable error rates in matching can vary between industries. At the first Information Quality conference we held in Dublin back in 2005, a presenter was talking about error rates in telco and financial services matching. A delegate from Healthcare stood up and challenged the "acceptableness" of the thresholds being talked about. His point: at those error rates his team would kill 300 people a year.
Jim Harris, August 5, 2009 at 3:14 AM
Excellent post William,

I always enjoy a data quality song!

I do have one minor criticism, however.

I have an issue with a two part match being referred to as a true duplicate, especially when the two parts are name and postal address.

In my Data Quality Pro article Identifying Duplicate Customers (Part 3): False Positives, I show a few examples of common scenarios specific to personal name matching (see Keys 431-433 and Keys 441-443) where nearly identical names at the exact same address are not duplicates.

Furthermore, as someone who has been the victim of identity theft – where using only my name and my postal address incorrectly matched me to thousands of dollars of fraudulent medical expenses by someone who has the same name as me and looked up my postal address in the phone book – I know firsthand that even a two part exact match on name and postal address is not necessarily a true duplicate.

Even when more parts are available (telephone number, date of birth, tax identifiers), the possibility of a false positive still exists. In fact, in my identity theft, some of the fraudulent medical records had all of my information because a hospital worker performed duplicate consolidation using only personal name and postal address between my actual medical records and my identity thief.

Best Regards…

Jim
dqchronicle (author), August 5, 2009 at 1:05 PM
Excellent point, Jim. I should have noted that it needs to be at least two pieces of identifying information and that the possibility of false positives always exists.
