
Sunday, July 3, 2011

Data Migration Best Practice: Orphan Analysis

What's an Orphan?


An orphan transaction is a transaction in a "child" table without an associated transaction in a "parent" table.  For instance, an address record in the Address table without a link to a customer record in the Customer table.  Orphaned records lead to various issues in business operations like marketing and business analytics.

Challenges orphans pose to the business


An address without an associated customer record will pose a challenge to marketing initiatives when trying to engage the customer.  How do you address the marketing mailer, determine previous purchasing history, or build gender-based segmentation?

In order to confront this issue it is crucial to understand if you have applications that are creating orphans.  There are various ways to conduct this analysis manually with SQL or in an automated fashion using data quality tools.

Orphan Analysis Methods


In order to manually investigate orphans, a series of SQL queries needs to be created.  Since these queries need to be executed periodically, it is most likely best if they are set up as stored procedures.  For the purposes of this post, let's just concentrate on the substance of the queries.

Orphans are all about linkages, so the basic information required is to find what unique identifiers are present in the child table and not present in the parent table.  Using the customer/address example, this translates into finding what customer ids are in the address table but not in the customer table.

 

The figure illustrates a straightforward relationship where addresses are linked to customers using the Cust ID field.  An orphaned address would then be defined as a record where the CustID exists in the Address table but not in the Customer table.  In SQL you can write a query to determine how many of these transactions exist by coding something like this:

Select Address.AddressID, Address.CustID, Address.CreateDate, Address.ModifiedDate
from Address
where Address.CustID not in (select CustID from Customer)


Technically you don't need fields other than the unique identifiers, but it helps to have create and modified dates to build an audit trail for why the orphan exists.  In fact, you may want to pull all the attributes of an address as part of the audit exercise.
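The same pattern also works in reverse to find parents without children.  Here is a minimal sketch against the same Customer and Address tables; NOT EXISTS is used instead of NOT IN because a single null CustID in the subquery would cause NOT IN to return no rows at all:

-- Customers with no address on file (the reverse orphan check);
-- add create/modified dates here too if the Customer table carries them
Select Customer.CustID
from Customer
where not exists (select 1 from Address where Address.CustID = Customer.CustID)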

Doing this type of analysis with a data quality tool is a bit more graphical and includes some nice reports.  I use Informatica v9, and I'll explain the steps required to accomplish it.  Let's assume you already have the customer and address tables defined.  Simply follow the steps below to generate the analysis, report, and result set (a subtle value add of using the tool is that you get the report and results with less coding and all the re-usability):

  1. Right click on the Profiles folder

  2. Select New Profile

  3. Give your profile a name

  4. Add data sources (Customer and Address)

  5. Select Join Analysis

  6. Click Next

  7. Select the fields you want to include from each source (be sure to include the unique identifiers for each source so you can audit the orphans)

  8. Define the relationship required to join the two tables (Address.CustID and Customer.CustID in the case of our example)

  9. Click Finish to execute the analysis


In Informatica a report such as the one depicted below is produced.



From the report you can see that it has identified 333,211 Addresses without an associated Customer record and 632 Customers without an Address record, along with 858,222 records that are correctly associated.  Armed with this type of analysis, as well as a graphical report, an analyst can quickly and accurately convey this message to the business and begin to define remediation plans.
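If you want to reproduce those three figures without the tool, a full outer join gives a rough equivalent of the join analysis.  This is only a sketch: it assumes the Cust ID relationship from the example and a database that supports full outer joins, and because it counts joined rows the exact figures can differ from the tool's where one customer has many addresses:

-- Approximate the join analysis: orphaned addresses, customers without
-- addresses, and correctly associated records
Select
  sum(case when c.CustID is null then 1 else 0 end) as AddressesWithoutCustomer,
  sum(case when a.CustID is null then 1 else 0 end) as CustomersWithoutAddress,
  sum(case when a.CustID is not null and c.CustID is not null then 1 else 0 end) as Associated
from Address a
full outer join Customer c on a.CustID = c.CustID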

From this analysis you can draw the following conclusions:

  1. If you are attempting a customer consolidation effort, using name and address methods will not satisfy your needs 100%, so you'll need another strategy, like name and phone number, to augment the consolidation (you may even be able to use the name/phone analysis to remedy the broken address linkages)

  2. If you are performing a data migration effort, this analysis will explain testing anomalies like record count discrepancies and possibly why some migration jobs fail to load

  3. If you are looking at data management strategy, you may look into the current status of constraints on the Address table

  4. If you are looking into a Data Governance program, you may look at this and determine a need for a governance process to deal with Customer and Address management


Summary


Whether you use a tool or manual SQL efforts, looking into orphaned transactions is a best practice for many different types of data management initiatives.  From customer consolidation to data governance, it is important to know the state of transactions in your operational systems.  It is my opinion that a data quality tool, like Informatica v9, offers a rich set of reports that helps translate the results to business users in a digestible manner.

I'll examine some remediation methods in the next post covering orphan analysis.

Monday, June 13, 2011

The role of data quality in ETL design: DQETL

Introduction


Data integration is nothing new.  Since the advent of data warehousing, data integration has been a major initiative for most large organizations.  One of the most common obstacles to integrating data into a warehouse has been that assumptions about the state of the source data were either false or flawed at best.  One of the reasons for this is that very little investigation, or data profiling, is performed on the source data prior to the design and execution of the data extraction routines.

With all there is to plan for and deliver on data warehousing projects, this oversight is understandable.  However, I believe it is time for data quality to assume the role of reporting on and remediating the current state of all source data to be migrated into a warehouse.

Turning assumptions into facts


If source data is profiled, what were once assumptions about the data can be transformed into facts about its state.  Data profiling consists of scanning data and typically delivers measures such as the frequency of nulls, the extent to which data is unique, and the ranges of actual values within each field included.  With data quality tools such as Informatica's Data Quality v9, Global ID's data profiler or Talend's data profiler, these basic reports can be compiled with a few clicks of the mouse.  Furthermore, these products offer portals where dashboards detailing the current state of the data can be delivered to both a technical and a business-oriented audience.
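To make those measures concrete, here is a minimal sketch of the kind of column-level statistics a profile produces, written as plain SQL against a hypothetical Customer table and Email column (the tools above compute the equivalent for every selected column automatically):

-- Basic profile measures for one column: null frequency, uniqueness, value range
Select
  count(*) as TotalRows,
  sum(case when Email is null then 1 else 0 end) as NullCount,
  count(distinct Email) as DistinctValues,
  min(Email) as MinValue,
  max(Email) as MaxValue
from Customer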

Data Profiling 101


As I mentioned, creating a data profile can be done with a few mouse clicks.  Typically the steps are as follows:

  1. Define a connection to the data source

  2. Define the data source

  3. Define which fields are to be included in the profile

  4. Define any business related rules to be included in the profile

  5. Schedule the profile for execution


Connections


Defining the connection to the data source usually involves a few simple steps.  Connections are typically made either to a database or to a file.  While connecting to a file involves parameters such as delimiters, field names, and data types and lengths, connecting to a database usually involves location parameters like host and authentication credentials like username and password.

Whether you connect directly to the database or use a flat file extract typically depends on circumstances like resources for a test environment and ability to procure the required credentials.  Either way, the lion's share of the work is setting up the connection.

Candidates


The next logical step in creating a data profile is to define which fields to include.  Within the context of validating ETL for data integration, this step depends heavily on the fields included in the migration, and even more so on the fields requiring transformation.

Because not all data is migrated, data profiling is best limited to the tables that are required.  Primary and foreign keys are almost always included to ensure uniqueness, as are timestamp fields to ensure completeness and date format conformity.
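As a small illustration of the key checks, duplicate primary key values can be flagged with a grouped query; the Address table and AddressID column are carried over from the earlier example:

-- Primary key values that occur more than once (should return no rows)
Select AddressID, count(*) as Occurrences
from Address
group by AddressID
having count(*) > 1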

Rules


When we talk about data profiling, most of the metrics are technical (percentage of uniqueness, percentage of nulls, etc.); however, one of the most beneficial practices is to include rules in the data profile that are based on business rules.  Some examples are validating that certain chronological events are in order (i.e. ship date does not precede order date) and that logical assumptions (i.e. that individuals who are indicated as male do not have positive pregnancy outcomes) are indeed valid.
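A rule like the chronological check above translates directly into a violation query.  A minimal sketch, assuming a hypothetical Orders table with OrderDate and ShipDate columns; a non-empty result means the rule fails for those rows:

-- Rows violating the rule "ship date must not precede order date"
Select OrderID, OrderDate, ShipDate
from Orders
where ShipDate < OrderDate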

Constructing business rules often involves the participation of a business domain subject matter expert; however, they can also be formed from the conceptual design of the ETL.  By reverse engineering the transformation logic, it is possible to derive at least one rule that needs to be tested.  After all, transformation logic would be negatively affected by things such as nulls, data type nonconformity and values outside the expected range.

The upside of yet another step


Because data integration and migration are already complex, multi-step efforts, adding another step is rarely welcomed.  However, once viewed within the context of reducing ETL redesign and increasing transfer success rates, it is frequently, albeit begrudgingly, accepted.

Including data profiling in the data migration suite of operations can indeed deliver these desired outcomes.  When data is profiled prior to ETL design and execution, data states that would otherwise cause ETL loads to fail can be identified and remedied.  An example can be found in the all-important date related fields.  A simple data profile can detect date formats that are not supported by the target repository.  Dates in data warehousing are crucial to track transaction lineage and, if not handled correctly, can be the undoing of an ETL design.
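As a hedged example of that date check: when dates land in staging as text, non-conforming values can be flagged before the load.  The sketch below assumes SQL Server (TRY_CONVERT returns null when a value cannot be converted) and hypothetical staging table and column names:

-- Text dates that will not convert to the target date type
Select OrderID, OrderDateText
from Staging_Orders
where OrderDateText is not null
  and try_convert(date, OrderDateText) is null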

Summary


While I have just touched the surface of the role of data profiling in data integration, it is nonetheless an important concept to adopt.  Increases in successful loads and decreases in ETL testing and troubleshooting will save time and resources and paint a more positive image of the data integration team and its capabilities.

Monday, January 4, 2010

Hamming Distance Matching Algorithm

Introduction


Last month we examined, at a high level, each of the matching algorithms available in Informatica's data quality tool, Data Quality Workbench or IDQ.  In this month's edition we'll dive deeper into one of those options, the Hamming distance algorithm.

As we touched on last month, the Hamming algorithm score measures the minimum number of substitutions required to change one string into the other.  This algorithm is often used when analyzing strings that represent numeric values.

A Little Background


The Hamming distance is named after Richard Hamming.  Hamming was an American mathematician whose accomplishments include many advancements in information science, not the least of which was his work on the Manhattan Project in 1945.  One notable contribution is his philosophy on scientific computing, which appeared as the preface to his 1962 book on numerical methods (Numerical Methods for Scientists and Engineers) and which read:
The purpose of computing is insight, not numbers

A Little about the algorithm


Perhaps as a result of Hamming's time at Bell Laboratories, the Hamming distance algorithm is most often associated with the analysis of telephone numbers.  However, the advantages of the algorithm are applicable to various types of strings and are not limited to numeric strings.

There is one condition worth noting that needs to be adhered to when using this algorithm: the strings analyzed need to be of the same length.  Since the Hamming distance algorithm is based on the "cost" of substituting characters to turn one string into another, strings of unequal length will result in high penalties.  If you are using the Hamming algorithm to analyze telephone numbers, it is critically important to cleanse the data before analyzing it.  For instance, if not all telephone numbers include an area code, then the area codes that are present in the data need to be parsed out before analysis.
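To see the distance itself outside the tool, the sketch below compares two already-cleansed, ten-digit telephone numbers position by position.  Everything here is an assumption for illustration: the PhonePairs table, the Phone_1 and Phone_2 columns (named to echo IDQ's x_1/x_2 convention described below), and SQL Server-style recursive CTE syntax (other databases may require WITH RECURSIVE):

-- Hamming distance = number of positions where the two strings differ
with positions (pos) as (
  select 1
  union all
  select pos + 1 from positions where pos < 10
)
Select p.RecordID,
  sum(case when substring(p.Phone_1, pos, 1) <> substring(p.Phone_2, pos, 1)
      then 1 else 0 end) as HammingDistance
from PhonePairs p
cross join positions
group by p.RecordID

Dividing the distance by the string length and subtracting the result from one yields a normalized score in the same zero-to-one range as the match scores discussed below.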

Hamming Distance based checks


Hamming distance based checks determine the number of errors between two strings.  If we want to be able to detect up to x errors, we can map every string y into a longer code word such that the minimum Hamming distance between any two valid code words is x + 1.

Hamming Distance in Informatica Data Quality Workbench


In Data Quality Workbench, implementing the Hamming distance algorithm is very simple and is done by adding the Hamming component from the Component Palette.  Refer to the figure below for a look at the Hamming icon.

[caption id="attachment_388" align="alignnone" width="76" caption="Informatica Data Quality Workbench Hamming Distance icon"][/caption]

Once the Hamming component has been added to your IDQ plan, it is time to configure it.  Configuration options include the specification of Inputs, Parameters, and Outputs.  Refer to the figure below for a look at the configuration options.

[caption id="attachment_389" align="alignleft" width="792" caption="Configuration Options of the Hamming Distance Component"][/caption]

As the name indicates, the Inputs tab is where you specify which data elements will be matched in the component.  It is important to note that when configuring the inputs, two data elements must be selected for each match desired.  Without going into extensive detail: when the data is grouped, Informatica IDQ formats the data so that each element has two instances, one labeled "x_1" and the other labeled "x_2", where x is the data element name.  Both the x_1 and the x_2 data elements need to be selected.

The parameters tab is where you configure some important options.  The Reverse Hamming option is a check box that can be selected that configures the component to read the string from right to left instead of the default left to right.  The remaining two options dictate what the match score will be when null values exist in the pair of strings.  The Single Null Match Value setting indicates the match score when one field in the pair of matched values is null.  The Both Null Match Value setting indicates the match score when both fields are null.

The Outputs tab is where you can define the name of the output value.  Pretty straightforward there.

Now for the good part, the results!  Refer to the figure below for a look at what the results look like when using the Hamming distance component.

[caption id="attachment_396" align="alignleft" width="780" caption="Results of the Hamming distance match analysis in Informatica Data Quality workbench"][/caption]

I've masked the results to ensure data privacy; however, the sample is still useful for describing the output.  As you can see in the sample above, row two resulted in a Hamming distance score of 1 because both values are identical.  The data in row one is not identical, however, and consequently resulted in a Hamming distance score of approximately 0.93.  This is due to the substitution "cost" of turning "ZBL ZSSXCNZTES" into "ZBT ZSSXCNZTES".  The penalty of 0.07 was due entirely to the "ZBL" becoming "ZBT", which is consistent with one differing character out of the fourteen in each string (1/14 ≈ 0.07).

Summary


Among Richard Hamming's many accomplishments is the development of an algorithm to compare various types of strings of the same length to determine how different they are.  Due to the requirement of equal length, the algorithm is primarily used to detect differences in numeric strings but can be used with textual data as well.

Informatica has incorporated the Hamming algorithm into the data quality workbench tool in order to produce a match score.  The Hamming component requires the selection of at least two inputs, it can be configured to handle data with nulls and will output a match score.  In IDQ a Hamming match score of one (1) indicates a perfect match while a Hamming match score of zero (0) indicates that there was no correlation between the two values being analyzed.

I've used the Hamming component in IDQ to analyze match possibilities in telephone numbers and postal codes.  I've found it to be reliable in detecting true positive matches and sensitive enough to detect even slight differences (as indicated in the sample data above).  I hope this review will help those of you interested in using the Hamming component in IDQ or those just interested in developing knowledge of the algorithm.

Thank you for reading this month's edition of The Data Quality Chronicle.  Stop by again next month when I detail the Jaro-Winkler algorithm and its implementation in Informatica IDQ.

Thursday, December 10, 2009

Informatica Data Quality Workbench Matching Algorithms

I'd like to begin a multi-part series of postings where I detail the various algorithms available in Informatica Data Quality (IDQ) Workbench.  In this post I'll start by giving a quick overview of the algorithms available and some typical uses for each.  In subsequent postings I'll get more detailed and outline the math behind each algorithm.  Finally, I'd like to finish up with some baseline comparisons using a single set of data.

IDQ Workbench enables the data quality professional to select from several algorithms in order to perform matching analysis.  Each of these serves a different purpose or is tailored toward a specific type of matching.  These algorithms include the following:

  1. Hamming Distance

  2. Jaro-Winkler

  3. Edit Distance

  4. Bigram or Bigram frequency


 Let's look at the differences and main purpose for each of these algorithms.

The Hamming distance algorithm, for instance, is particularly useful when the position of the characters in the string is important.  Examples of such strings are telephone numbers, dates and postal codes.  The Hamming distance algorithm measures the minimum number of substitutions required to change one string into the other, or the number of errors that transformed one string into the other.

The Jaro-Winkler algorithm is well suited for matching strings where the prefix of the string is of particular importance.  Examples include strings like company names (xyz associates vs. abc associates).   The Jaro-Winkler algorithm is a measure of how similar two strings are by calculating the number of matching characters and number of transpositions required.

The Edit Distance algorithm is an implementation of the Levenshtein distance algorithm where matches are calculated based on  the minimum number of operations needed to transform one string into the other.  These operations can include an insertion, deletion, or substitution of a single character.  This algorithm is well suited for matching fields containing a short text string such as a name or short address field.

The Bigram algorithm is one of my favorites due to its thorough decomposition of a string.  The bigram algorithm matches data based on the occurrence of consecutive characters in both data strings in a matching pair, looking for pairs of consecutive characters that are common to both strings. The greater the number of common identical pairs between the strings, the higher the match score.  This algorithm is useful in the comparison of long text strings, such as free format address lines.

Informatica provides several options for matching data out of the box with Data Quality (IDQ) Workbench.  Although some will argue that other algorithms detect matches with greater strength, Informatica has provided some very robust methods to match various types of strings.  With this flexibility the data quality professional is equipped to handle various types of data elements in their match routines.  As with any tool, it is not a replacement for the research required to use the right method in the right way.  This is one of the aspects I'll cover in the subsequent postings where we take each algorithm and get more detailed.

Drop by next month for more about the Hamming distance algorithm and some real-world examples of how it can be implemented!

Sunday, May 24, 2009

GUI or command line? Where to run an IDQ plan.

Recently on a data quality project I stumbled across an anomaly that I thought I'd share with the data quality / Informatica community.  It involves the use of Informatica Data Quality (IDQ) and the use of certain types of queries.  I discovered this anomaly when I received the following error message:


[caption id="attachment_124" align="aligncenter" width="414" caption="Image 1: Failure to initialise plan"]Failure to initialise[/caption]

One of the root causes of this error is a "behind-the-scenes" process that involves the import of data into an IDQ plan.  When importing data into an IDQ plan via SQL inside the source component, IDQ wraps a select count() around the original query.  At times this causes IDQ to fail while initialising and return the error message above.  Most commonly this error occurs when using a query that goes beyond a standard select statement.  I was using a union in my query when I got this error.



If you are using Oracle, you have the option of editing your SQL or running the plan from the command line.  However, if you are using Microsoft SQL Server, you'll need to go to the command line.

The SQL edit in Oracle is simple and straightforward, so let's address that first.  If you get an error where the IDQ plan fails to initialise, try wrapping the SQL as follows (a concrete example is sketched after the bullet):


  • Select * from (<your query here>)
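As a concrete instance of that wrapper (with hypothetical table names), a union query that triggers the error can be rewritten so the union sits inside an inline view.  Oracle accepts the unaliased inline view; other databases may require an alias on it:

Select * from (
  Select CustID, Name from Customer_US
  Union
  Select CustID, Name from Customer_EU
)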



As for Microsoft SQL Server, the command line is the only option, but it is very straightforward.  You could even argue that the command line is more robust for the enterprise, since the jobs can then be scheduled and thus integrated into existing data services more easily.


Regardless of the perspective, here is how you deploy the plan in real time using the command line.  An IDQ plan can be executed from one of two locations: the IDQ repository or the file location of the XML version of the plan.  On Windows, the executable file for implementing runtime functionality is Athanor-RT.exe, located in the bin folder of the Data Quality Server installation.  It is also a requirement to specify the location of the plan in the command line utility.


To deploy a plan from an XML file the -f switch is used.  So the command to deploy an XML version of a plan, call it myplan, in the C:/idqplans directory would be as follows:





  • athanor-rt -f c:/idqplans/myplan.xml



To deploy a plan from the repository the -p switch is used.  Consequently the command to deploy the myplan dq plan from the repository would be as follows:





  • athanor-rt -p c:/idqplans/myplan



There are other useful switches to utilize while deploying your IDQ plans at the command line. 


For instance to run an IDQ plan along with a parameter file, which allows you to override plan variables with those in the parameter file, use the -c switch.  The command for executing a plan along with a parameter file would be as follows:





  • athanor-rt -c <parameter file name> c:/idqplans/myplan



Another helpful switch allows you to receive feedback on the status of the plan execution.  This switch is -i.



The -i switch will provide feedback in the command line window.   The feedback occurs at the specified interval in the command line.  For instance, the command -i 200 will provide feedback every 200 records for each field in the process.  The following is an example of the feedback provided in the command line with the -i switch:

[Figure: Example of the command line feedback provided by the -i switch]

With these basic switches you can deploy any IDQ plan regardless of the query required to source the data.  I hope this post helps someone avoid hours of debugging!

Check back later for more details about scheduling IDQ plans using the command line.
