Monday, January 25, 2010

Data Quality Tips & Tricks: Using delimiters to your advantage

Introduction

While doing research for my next matching algorithm post, on the Jaro-Winkler algorithm, I have decided to throw together some of my favorite "lessons learned" from my practice with Informatica Data Quality (IDQ) Workbench. This eclectic bunch of tricks has helped me carry out tasks such as more comprehensive data profiling and more accurate matching. Some of the tips are generic and apply to any data quality software; others require detailed knowledge of IDQ.
If you'd like more detail about the tips below, please feel free to leave a comment and I'll follow up with you directly.

Data profiling and the use of delimiters


One of the most valuable profiling and analysis components Informatica supplies is the Token Labeler. This component categorizes the data format of the values in a given field, and it does so through the use of tokens and delimiters. The tokens are essentially labels assigned to each substring in a field, where the substrings are separated by delimiters. For example, a value such as 123 Main St might come back with a pattern along the lines of number-word-word, with the spaces acting as the delimiters.


The significance of the delimiter is that it is often an indication of the "noise" characters that need to be removed from or converted in the data. For instance, on a recent engagement I was tasked with performing duplicate consolidation on a company's accounts data. During analysis I identified some inconsistencies in how the word "and" was represented in the data. At times an account was entered as Slapping Zombie & Fish Co., and at other times as Slapping Zombie and Fish Co. In order to gauge the extent of the variability, I wanted to profile how pervasive the ampersand (&) was in the data.
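
As a quick sanity check outside of IDQ, the same question can be asked of the data directly in SQL. The query below is only a rough proxy for what the Token Labeler reports; the accounts table and account_name column are made-up names for the example.

    -- Rough profile of "&" versus the word "and" in account names.
    -- Table and column names (accounts, account_name) are placeholders;
    -- point this at the actual source table being profiled.
    SELECT
        SUM(CASE WHEN account_name LIKE '%&%'     THEN 1 ELSE 0 END) AS with_ampersand,
        SUM(CASE WHEN account_name LIKE '% and %' THEN 1 ELSE 0 END) AS with_word_and,
        COUNT(*)                                                     AS total_accounts
    FROM accounts;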


There was only one small issue: the ampersand character is not one of the twelve standard delimiters built into IDQ. In fact, with 32 possible special characters, odds are you will eventually need to add a delimiter of your own when using IDQ to profile your data.


The first thing you need is access to the Informatica Data Quality repository, which is a relational database named idq_repo. Once you have access, all you need to do is write some SQL to add the delimiter to the t_delimiters table in the repository. In figure 1 below I have illustrated what this looks like when Microsoft SQL Server hosts the IDQ repository.





Figure 1 Snapshot of the idq_repo repository & the t_delimiters table
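
Before adding anything, it is worth taking a quick look at what the table already holds. The query itself is trivial; the point is to confirm the column layout of t_delimiters, which varies between IDQ repository versions, before writing your insert.

    -- Review the delimiters currently registered in the IDQ repository.
    -- Confirm the column layout here before attempting any insert.
    USE idq_repo;

    SELECT *
    FROM t_delimiters;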



The required SQL is very straightforward once you know which table the delimiters are stored in and what the structure of that table is. In Figure 2 below I illustrate the SQL code used to add the ampersand to the list of available delimiters.





Figure 2 SQL Code to add a delimiter to IDQ
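
In essence the statement is a single-row INSERT into t_delimiters. If the screenshot is hard to make out, the sketch below shows the general shape; the column names used here (delimiter_char, delimiter_name) are placeholders, so substitute the actual columns you found when reviewing the table.

    -- Register the ampersand as an additional delimiter in IDQ.
    -- Column names are illustrative placeholders; replace them with the
    -- actual columns of t_delimiters in your repository version.
    INSERT INTO t_delimiters (delimiter_char, delimiter_name)
    VALUES ('&', 'Ampersand');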



Once you run this statement, you'll be able to add the ampersand to your IDQ plan immediately, without needing to close the application. As you can see in figure 3 below, I have added the ampersand along with several other delimiters to assist in my analysis, standardization, and matching data quality routines.





Figure 3 An example of delimiters added in the IDQ Token Labeler



I hope this helps those of you who use IDQ in your data quality routines and who are looking for a way to analyze and standardize special characters not already available in the list provided by Informatica.


Data Parsing in IDQ


This next tip also involves the use of delimiters; however, this time we use a delimiter that is already built in, together with the Token Parser component, to parse data that has been combined into one data element. In my experience it is not uncommon to see customer first and last name collected and stored in a single data element. For various reasons this poses a challenge when standardizing and matching the data, so the data needs to be parsed into two separate elements. This can be done using the space delimiter. In the sample data in figure 4 below you can see customer name data stored in one data element.





Figure 4 Name data stored in one data element
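
For readers who think in SQL, the transformation the Token Parser performs here is roughly equivalent to the query below. The customer_names table and full_name column are invented for illustration, and the split is on the first space only, with everything after it treated as the last name.

    -- Illustrative split of a combined name field on the first space.
    -- customer_names and full_name are made-up names for the example.
    SELECT
        full_name,
        LEFT(full_name, CHARINDEX(' ', full_name + ' ') - 1)  AS first_name,
        LTRIM(SUBSTRING(full_name,
                        CHARINDEX(' ', full_name + ' ') + 1,
                        LEN(full_name)))                       AS last_name
    FROM customer_names;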



Using a simple standardization plan in IDQ, we can transform this data into two elements so that further standardization and more accurate matching can occur. In figure 5 the Token Parser is used to parse the name components into two data elements.





Figure 5 IDQ plan using the Token Parser



Let's take a closer look at how the Token Parser needs to be configured to achieve this result. First, an input value needs to be selected on the Inputs tab by checking the box next to the data label, as shown in figure 6 below.





In the Components pane, double-clicking the generic Token Parser # entry gives you the opportunity to rename that instance of the Token Parser. Naming each instance after the data element it parses keeps the plan organized and is a beneficial practice.





In the Parameters pane, you can select the delimiter that will be used to parse the input data. You can select more than one delimiter from the list, and the data will be parsed whenever any of the selected delimiters is encountered. As mentioned in the previous tip, the delimiters are stored in the t_delimiters table of idq_repo and can be edited with a simple SQL statement.





Lastly, outputs are added on the Output tab by right-clicking in the pane for the appropriate output type. In our sample case we are looking for a text-based output, so we right-click in that pane and select the add option to include an output of that type. Notice that there is a predefined selection for an output overflow value. In our example, this value holds any leftover data encountered after the space delimiter (an unexpected extra name part, for instance). It is a particularly useful placeholder, and it is good practice to rename it to something meaningful so that you can easily track what data lands in it.





Once the Token Parser is configured and the IDQ plan has been executed, you get an output that looks something like the output file depicted in figure 9 below.





Figure 9 Output of IDQ plan which parses out First and Last Name



While this example focuses on parsing customer first and last name, another useful implementation of this method is parsing a combined address line into street number and street name, which enables grouping customers by street name during matching analysis. Other similar implementations use the dash (-) delimiter to parse the Postal Code data element into its ZIP Code and ZIP+4 components, as sketched below. These implementations all follow the same general process documented above.
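
In plain SQL terms, the postal code variant would look something like this. The addresses table and postal_code column are placeholders, and the values are assumed to be either a five-digit ZIP Code or a ZIP+4 with a dash; in IDQ itself, the Token Parser configured with the dash delimiter does the equivalent work.

    -- Illustrative split of a postal code on the dash delimiter.
    -- addresses and postal_code are made-up names; values are assumed
    -- to be either '12345' or '12345-6789'.
    SELECT
        postal_code,
        LEFT(postal_code, CHARINDEX('-', postal_code + '-') - 1) AS zip_code,
        SUBSTRING(postal_code,
                  CHARINDEX('-', postal_code + '-') + 1,
                  LEN(postal_code))                               AS zip_plus_4
    FROM addresses;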

Summary


These are two simple methods that help you leverage delimiters in your data profiling and cleansing routines. By expanding the set of special characters you can profile, you gain greater insight into the types of cleansing you need to perform. By using delimiters in parsing routines, you increase data conformity, consistency, integrity, and accuracy, and you can enrich the data by aligning it more properly. The most advantageous benefits of these methods are increased match volumes and accuracy, which ultimately lead to reduced operational costs.

I hope these little tidbits were valuable and help someone out there struggling to find a way to better cleanse their data. I'll be posting more of these in the future, so keep your eyes peeled!

7 comments:

  1. Excellent. This was very useful.

  2. Brian,

    Thanks for stopping by and commenting on the post! I am glad you enjoyed the tips. I know they came in handy for me and I am happy to pass them along. Feel free to contact me on Twitter (@dqchronicle) if you have any other questions!

  3. [...] Data Quality Tips & Tricks: Using delimiters to your advantage January 2010 2 comments 4 [...]

  4. Thanks William for sharing this. It's very useful and informative.

    Is there any article of yours where you have discussed data quality basics?

    Thanks Again !!

  5. [...] Tips & Tricks: Using delimiters to your advantage – Introduction While I am… http://thedataqualitychronicle.org/… #dataquality 6 seconds [...]

  6. Thanks. Very good article. Quick question: I am not able to see the snapshots/figures.

  7. Yeah, I lost the images in a WordPress migration and it is way too much work to remediate at this point. Thanks for reading and commenting on the post! If you have questions or suggestions for follow-up articles, please feel free to comment again.

