Big Data Confusion
Big Data has been marketed as the solution for increasing profits and discovering new associations in everything from fraud detection to patient health care. It is linked to websites like Amazon, Google, and Twitter, and to products like Hadoop and BigTable. Big Data takes a very different approach to data storage, including the lack of a set schema, and it encompasses not only traditional data but also unstructured data like documents and web page content. Because of all this, the message of big data seems to get diluted, possibly even lost.
Questions seem to be more abundant than answers. Questions like:
- Is Big Data only suitable for web based solutions?
- Is Big Data only for social media?
- How can a "traditional business model" leverage Big Data?
- Whom can I engage to purchase Big Data software?
- How do I leverage unstructured data?
Big Data Breakdown
Big data is such a complex topic that I feel the need to start with some basic concepts and break down the details in several posts. First, I will attempt to define the big data concept, then move on to the basic principles. Last, I will finish with how all these concepts drive the implementation. Keep in mind that each of these topics could cover several pages of content. Who wants to read that much? I know I don't, so I'll try to strip these components down to the bare necessity. Clarity with brevity!
Big Data 101
Simply put, Big Data is a catch-all phrase. Like most catch-all phrases, it is vague and often applied differently from person to person without being very descriptive. So let me define how I use the term. I try to stay close to Gartner's definition, which is:
Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. These data growth challenges are characterized as three-dimensional, i.e., increasing volume (amount of data), velocity (speed of data in/out), and variety (range of data types and sources).
Big Data 102
There are three main principles to a big data solution, collectively known as CAP. These principles are:
- Partition tolerance
- Availability
- Consistency
Partition tolerance refers to the ability of the solution to keep operating even when some of its nodes cannot communicate. In practice it is achieved by distributing data across many nodes, which also makes it possible to add more resources, i.e., hardware, to the solution. Data growth is well documented and is a condition that needs to be incorporated into any big data solution.
Availability refers to the guarantee that each request receives a response. Big Data solutions are often applied to web-based applications, which tend to have millions of concurrent users.
Consistency refers to the concept that each of the data stores in a big data solution contains data in the same state. With millions of concurrent users and the demand for availability, a consistent data state is probably the most difficult principle to ensure.
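A partitioned system can honor at most two of these guarantees at once. Below is a minimal sketch of that trade-off, assuming a toy store with two replicas; the class and method names are hypothetical, not drawn from any real product. When the link between replicas is down, a read must either fail (choosing consistency) or risk returning a stale value (choosing availability).

```python
import time

class Replica:
    """One copy of the data store."""
    def __init__(self, name):
        self.name = name
        self.data = {}          # key -> (value, timestamp)

class TwoReplicaStore:
    """Toy store with two replicas and a flag simulating a network partition."""
    def __init__(self):
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False

    def write(self, key, value):
        record = (value, time.time())
        self.a.data[key] = record
        if not self.partitioned:        # replication only works while the link is up
            self.b.data[key] = record

    def read(self, key, prefer_consistency=False):
        if self.partitioned and prefer_consistency:
            # CP choice: refuse to answer rather than risk a stale value.
            raise RuntimeError("partition in progress; cannot guarantee consistency")
        # AP choice: always answer, even if replica b has missed writes.
        record = self.b.data.get(key)
        return record[0] if record else None

store = TwoReplicaStore()
store.write("balance", 100)
store.partitioned = True        # the link between replicas goes down
store.write("balance", 250)     # only replica a sees this write
print(store.read("balance"))    # prints 100: available, but stale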
Big Data 103
These three characteristics drive the way that a Big Data solution is architected. To handle the volume and velocity, Big Data solutions need to be distributed over many environments with independent but identical hardware and software resources. This architecture is commonly referred to as massively parallel processing, or MPP. Distributed architectures spread data over many different data stores according to which environment is available.
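To make the distribution concrete, here is a minimal sketch of hash-based sharding, one simple way a distributed store can decide which environment holds a given record. The node names and keys are made up for illustration; production systems typically use richer schemes such as consistent hashing or range partitioning.

```python
import hashlib

# Hypothetical node names; any identifiers would do.
NODES = ["node-a", "node-b", "node-c", "node-d"]

def node_for(key: str) -> str:
    """Pick the environment that stores a given key.

    Hashing the key spreads records evenly across nodes, so both
    the data (volume) and the request load (velocity) are divided
    among independent, identical machines.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

for key in ("order:1001", "order:1002", "user:42"):
    print(key, "->", node_for(key))
```

Because the hash function is deterministic, any node can compute where a record lives without consulting a central coordinator, which is what lets the architecture scale by simply adding machines.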
This leads to one of Big Data's most controversial consequences: eventual consistency.
Eventual consistency boils down to the fact that, with a Big Data solution, there are points in time when the data in the environments are not consistent. Environments are eventually synchronized when resources can be made available to do so.
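Here is a minimal sketch of that behavior, assuming a last-write-wins merge rule and a periodic anti-entropy pass that synchronizes replicas; all names are hypothetical. A read against a replica that has not yet synchronized returns stale (or no) data, but once the sync runs, every replica converges to the same state.

```python
import itertools

clock = itertools.count()       # a simple logical clock standing in for timestamps

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}          # key -> (value, version)

    def write(self, key, value):
        self.data[key] = (value, next(clock))

    def read(self, key):
        record = self.data.get(key)
        return record[0] if record else None

def anti_entropy(replicas):
    """Background sync: every replica adopts the newest version of each
    key (last-write-wins). After this runs, all replicas agree."""
    merged = {}
    for r in replicas:
        for key, (value, version) in r.data.items():
            if key not in merged or version > merged[key][1]:
                merged[key] = (value, version)
    for r in replicas:
        r.data = dict(merged)

r1, r2 = Replica("r1"), Replica("r2")
r1.write("status", "shipped")   # the write lands on r1 only
print(r2.read("status"))        # None: r2 has not seen it yet (inconsistent window)
anti_entropy([r1, r2])          # resources become available; replicas sync
print(r2.read("status"))        # "shipped": the replicas have converged
```

The window between the write and the sync is exactly the inconsistent state described above, and it is where data quality concerns enter the picture.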
Eventual consistency is a topic I plan to elaborate on in future posts, as I feel it is the tie between big data and data quality.
Summary
Big Data is not about a website, vendor or social media alone. It is a unique way to store and continuously deliver massive amounts of diverse data. It is predicated on three principles, often referred to as CAP, which make the solution scalable and available but tend to sacrifice strict data consistency. In order to do this, Big Data solutions require a distributed set of resources. While this may sound like a watered-down explanation, there are many details that I will elaborate on in future posts.