The Data Quality Chronicle: opinion

Sunday, March 9, 2014

Why should I use an existing ETL vs writing my own in Python for my data warehouse needs?

Answer by William Sharp:

Unless you want to enter the ETL software market, you will spend a lot of time writing software instead of migrating / managing data (which I am assuming is your real job). If this tool is going to be used by many developers in an enterprise setting, you will spend a lot of time writing code to manage the ETL code. Probably more than you think.

This is classic buy over build ... buy because there are people dedicated to solving issues you haven't even thought of yet. Build if your needs are so unique that there isn't a solution out there (extremely rare).

View Answer on Quora

What things does a data scientist do?

Answer by William Sharp:

Put succinctly, it is data management and statistics. Collect, clean and migrate data (just like everybody else does) and perform statistics on that data. Data science is a new marketing term that goes along with big data. I'm not downplaying statistics. I love stats. The 'science' part is a little (a lot) pretentious imho.

View Answer on Quora

Wednesday, October 23, 2013

relationship fail

trust me

this is not a relationship strategy

Friday, August 23, 2013

comprehension

If you can't explain it simply, you don't understand it well enough.

Thursday, May 23, 2013

how to find the "bad for business" rows

In my previous post I made the statement ...

Data quality is about finding rows that are bad for business

Super. So how does one go about doing that?

Prior to this week I probably would have taken a different approach to explaining this. However, I learned something this week that I want to pass along. Essentially ...

Teach someone how to spot a counterfeit by making them an expert on the real thing

- Brad Melton (@BradEMelton)

Following that logic, the best way to determine how to find the "bad for business" rows is to know what the "good for business" rows look like. Translating this to the dimensions of data quality, these rows would be the most complete, consistent, conformed, accurate, integral and least duplicated rows in the enterprise.

A good example, one that I often use, comes from a very successful data quality project in which I participated. (The project directly related to marketing, and its rules supported a mailing campaign.) As long as you have a postal code, house number and street name, you can mail something to someone. So ... a complete address for a mail-based marketing campaign is one with a house number, street name and postal code.

Therefore, a row in an address table with, at least, these columns present is a complete address row and is "good for business". Extending this example, validating the accuracy of the address information (@Loqate maybe?) is another way of measuring how good that row is for business (probably even more so).
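To make the completeness rule concrete, here is a minimal sketch in Python. The column names (house_number, street_name, postal_code) and the sample rows are assumptions made up for illustration, not the schema from the actual project. It only checks that the three values are present; it does not validate accuracy.

```python
# Hypothetical column names; adjust to match your own address table.
REQUIRED_FIELDS = ("house_number", "street_name", "postal_code")


def is_present(value):
    """A value counts as present when it is not None and not blank."""
    return value is not None and str(value).strip() != ""


def is_complete_address(row):
    """Return True when every required field is present in the row."""
    return all(is_present(row.get(field)) for field in REQUIRED_FIELDS)


def split_rows(rows):
    """Separate rows into 'good for business' and 'bad for business'."""
    good = [row for row in rows if is_complete_address(row)]
    bad = [row for row in rows if not is_complete_address(row)]
    return good, bad


# Made-up sample data for illustration only.
sample_rows = [
    {"house_number": "12", "street_name": "Main St", "postal_code": "30301"},
    {"house_number": "", "street_name": "Oak Ave", "postal_code": "30302"},
    {"house_number": "7", "street_name": "Elm Rd", "postal_code": None},
]

good_rows, bad_rows = split_rows(sample_rows)
print(len(good_rows), "complete,", len(bad_rows), "incomplete")
```

The point of writing the rule this way is that the definition of "good" lives in one place (REQUIRED_FIELDS), so the business can change it without touching the logic that finds the bad rows.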

Moral of the story, you ask?

Before you set out to fix the broken information in your enterprise, become learned on all aspects of the "unbroken" information.

Tuesday, May 21, 2013

Communication is ...

Communication is the key to customer satisfaction. And communication is 50% listening and 50% empathy.

Monday, April 1, 2013

Data Management Poll Results

Here is a summary of the results from each of the polls open on the sound off page ... disagree? vote here

Thursday, March 29, 2012

Justin Bieber abuses Twitter but proves how similar phone numbers are

Look, I can't believe I have managed to work Justin Bieber into a data management blog, but I have.

When Bieber tweeted 9 digits of his phone number and asked his Twitter followers to guess the 10th digit and call him, he set two unsuspecting victims' phones on fire. He also proved how similar phone numbers are and why they are bad candidates for match strategies.

Using phone numbers in match strategies is, in my opinion, a waste of time. You are only going to increase your chances of generating false positives (unless you do an exact match). My issue with the exact matches is that it is very easy to make a "fat-finger" error and still identify a false positive.
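A small Python sketch of the problem, under stated assumptions: the nine-digit prefix below is a fictional number I made up, and "fuzzy" here simply means "differs by at most one digit", which is only one of many possible match rules.

```python
def digit_differences(a, b):
    """Count the positions at which two equal-length numbers differ."""
    if len(a) != len(b):
        raise ValueError("phone numbers must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Fictional nine known digits; the tenth digit is the unknown one.
known_prefix = "555010012"
candidates = [known_prefix + str(d) for d in range(10)]

# Pretend the record we want is the candidate ending in 3.
target = candidates[3]

# A fuzzy rule that tolerates one differing digit matches all ten numbers.
fuzzy_matches = [c for c in candidates if digit_differences(c, target) <= 1]

# An exact rule matches only one, but a fat-finger slip defeats it:
# typing 4 instead of 3 is an exact match on someone else's record.
records = {candidates[3]: "Customer A", candidates[4]: "Customer B"}
typed_number = candidates[4]  # meant to type the number ending in 3
matched_customer = records[typed_number]

print(len(fuzzy_matches), "fuzzy matches;", "exact match found:", matched_customer)
```

Ten different households sit within one keystroke of each other, so neither rule gives you much confidence on its own.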

If you care, here is the link to the Bieber event. If you are a Bieber fan and found your way to my blog by some search engine failure, my apologies (please don't terrorize me).

Justin Bieber abuses Twitter with phone gag, may get sued - Technolog on msnbc.com.

Monday, March 26, 2012

Clues to a Great Business Intelligence (Data) Story

I highly recommend reading Cindy Harder's interesting piece on telling stories with data. Even though she was specifically referencing BI stories, she does allude to doing the same with data analysis and data discovery in the article.

I was just thinking about this topic of making data discovery fun last night so this article really spoke to me. I think Cindy is dead-on when she reminds us to make sure we engage our audience with a storyboard around data analysis that makes the user want to know the ending.

This is often forgotten, albeit challenging, with regard to presenting data analysis. I struggle, at times, with how to make presentations interesting. And let's face it, snazzy icons in a PowerPoint deck do not count as entertaining.

What Cindy emphasizes in the article is to engage users with data stories using these five basic principles when writing your data story:

1. Refresh your data often

2. Build a complete dashboard, but don't overcomplicate it (tricky and important!)

3. Encourage further investigation with data discoveries (my favorite!)

4. Analyzing data is fun, not just a job

5. Draw conclusions with your analysis that are accurate and meaningful

To me, points 3 & 5 are the really important concepts here. I think you can facilitate further investigation with meaningful results. And the key lies in the term meaningful. To do this effectively, you need to bear in mind your audience. Meaningful to a controller is not the same as it is to a DBA. However, data discovery activities can support both these roles and you need to be sure to deliver something they care about in your story. I think, if you do, that individual will be compelled to conduct further investigations, which is where Cindy's point of being accurate is important. Make sure you are on point in your analysis!

Read the referenced article here:

Clues to a Great Business Intelligence Story | Visual Data Group.

Check out Cindy on Twitter: @CindyBHarder

Monday, March 12, 2012

Next Hadoop confirms data as a platform | Business Intelligence - InfoWorld

I just finished reading this brief article and was quite interested in the implications of Hadoop maturing into a platform for data-driven applications.

In the article, Brian Proffitt of IT World details Hadoop VP Arun Murthy's Strata presentation, which describes how Hadoop is expanding the types of applications that Hadoop's MapReduce will be able to support.

This expansion was compared to an operating system. While I can't quite see that analogy all the way through, I do see the strategic impact of being able to build big data applications against a Hadoop framework.

This type of expansion could free software developers to concentrate on the more front-end, user-facing aspects of application features, something I have long thought would be a significant challenge.

I am eager to see this framework in action. I'll hold off on judgement until then ...

Next Hadoop confirms data as a platform | Business Intelligence - InfoWorld.

Tuesday, March 6, 2012

Which technology looks promising to tackle Big Data Challenges?

Which technology looks promising to tackle Big Data Challenges? 2 answers on Quora

Monday, December 26, 2011

Can Big Data and Social Media help HR in efficient/quality hiring?

Here is a great question I found and answered on Quora. I thought I would post it so we can stay up-to-date on the answers:
Can Big Data and Social Media help HR in efficient/quality hiring?

Sunday, December 18, 2011

Big Data: Big Topic, Big Confusion

Big Data Confusion

Big Data has been marketed as the solution to increase profits and aid in the discovery of all kinds of new associations, from fraud detection to patient health care. Big Data seems to be linked to websites like Amazon, Google and Twitter. Big Data also seems to be linked to solutions like Hadoop and BigTable. Big Data has a very different approach to data storage, which includes the lack of a set schema. Big Data not only encompasses traditional data, but also includes the storage of unstructured data like documents and web page content. Because of all this, the message of Big Data seems to get diluted, possibly even lost.

Questions seem to be more abundant than answers. Questions like:

Is Big Data only suitable for web based solutions?

Is Big Data only for social media?

How can a "traditional business model" leverage Big Data?

Who can I engage for the purchase of my Big Data software?

How do I leverage unstructured data?

Big Data Breakdown

Big Data is such a complex topic that I feel the need to start with some basic concepts and break down the details in several posts. First, I will attempt to define the Big Data concept, then move on to the basic principles. Last, I will finish with how all these concepts drive the implementation. Keep in mind that each of these topics could cover several pages of content. Who wants to read that much? I know I don't, so I'll try to strip these components down to the bare necessities. Clarity with brevity!

Big Data 101

Simply put, Big Data is a catch-all phrase. Like most catch-all phrases, Big Data is vague and often applied differently from person to person without being very descriptive. So let me define how I use the term Big Data. I try to stay close to Gartner's definition, which is:

Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. These data growth challenges are described as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in/out), and variety (range of data types, sources).

Big Data 102

There are three main principles to a big data solution. These are depicted in the illustration to the left. These principles are:

Tolerance

Availability

Consistency

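Tolerance refers to the ability to add more resources, i.e., hardware, to the solution. (In the CAP theorem itself, the term is partition tolerance: the solution keeps operating even when its nodes cannot communicate with each other.) Data growth is well documented and is a condition that needs to be incorporated into any Big Data solution.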

Availability refers to the fact that Big Data solutions need to be able to guarantee that each request receives a response. Big Data solutions are often applied to web-based applications which tend to have millions of concurrent users.

Consistency refers to the concept that each of the data stores in a Big Data solution contains data in the same state. With millions of concurrent users and the demand for availability, a consistent data state is probably the most difficult principle to ensure.

Big Data 103

These three characteristics drive the way that a Big Data solution is architected. In order to handle the volume and velocity, Big Data solutions need to be distributed over many environments with independent but identical hardware and software resources. This architecture is commonly referred to as massively parallel processing, or MPP. Distributed architectures spread data over many different data stores according to which environment is available.

This leads to one of Big Data's most controversial consequences: eventual consistency.

Eventual consistency boils down to the fact that, with a Big Data solution, there are points in time when the data in the environments are not consistent. Environments are eventually synchronized when resources can be made available to do so.
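Here is a toy Python simulation of that behavior. It is a sketch under simplifying assumptions, not how any particular product works: the Replica and Cluster classes, the environment names and the key are all hypothetical, writes always land on the first environment, and synchronization is a manual step rather than a background process.

```python
class Replica:
    """One environment holding its own copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    """A set of replicas where writes reach the other copies later."""

    def __init__(self, names):
        self.replicas = [Replica(name) for name in names]
        self.pending = []  # updates not yet applied: (replica, key, value)

    def write(self, key, value):
        # Apply the write to the first replica right away and queue it
        # for the others, so the caller gets an immediate response.
        first, *others = self.replicas
        first.data[key] = value
        for replica in others:
            self.pending.append((replica, key, value))

    def read_all(self, key):
        """Return what each replica currently holds for the key."""
        return [replica.data.get(key) for replica in self.replicas]

    def is_consistent(self, key):
        return len(set(self.read_all(key))) == 1

    def synchronize(self):
        # Runs "when resources can be made available".
        for replica, key, value in self.pending:
            replica.data[key] = value
        self.pending.clear()


cluster = Cluster(["env-a", "env-b", "env-c"])
cluster.write("customer:42:city", "Atlanta")

before = cluster.read_all("customer:42:city")
consistent_before = cluster.is_consistent("customer:42:city")

cluster.synchronize()

after = cluster.read_all("customer:42:city")
consistent_after = cluster.is_consistent("customer:42:city")

print(before, consistent_before)
print(after, consistent_after)
```

Between the write and the synchronize call, a reader who happens to hit env-b or env-c gets a stale answer. That window is where data quality questions come in.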

Eventual consistency is a topic that I plan to elaborate on in future posts, as I feel it is the tie between Big Data and data quality.

Summary

Big Data is not about a website, vendor or social media alone. It is a unique way to store and continuously deliver massive amounts of diverse data. It is predicated on three principles, often referred to as CAP, which ensure that the solution is scalable and available but tend to sacrifice data consistency. In order to do this, Big Data solutions require a distributed set of resources. While this may sound like a watered-down explanation, there are many details that I will elaborate on in future posts.