25 May 2012

As with any new technology, great inventions pose great questions, and big data is no different. One question that comes up a lot these days sounds a bit like “why would we need big data, we are not Google”. Assuming you are indeed not Google, it might seem like overkill to implement a solution created by the best-known search company out there. I hope this article gives you some more insight into why big data might be interesting for you and your company.

The nature of information.

Companies are (or should be) interested in improving themselves, and one way to learn how is to trace back your steps and try to discover which decisions led to which outcomes. To do so, historical information is stored in what are called data warehouses, and powerful algorithms are unleashed on that data to gain insight into how well the company is doing or how the market is behaving.

A data warehouse is an important tool within the domain of business intelligence (BI), where a BI expert models the data and decides which algorithms to unleash on which sets of data. Since the warehouse needs some kind of data model, the experts have to decide which data to store and which data to ignore. Ignored data is lost forever, so the decision had better be a good one. If at some point a new metric is added to the data warehouse, only the information collected from the moment the metric was added can be taken into account. Let me explain that with an example.

Company A has been doing business for the last decade and has a BI division in charge of analyzing the business. That BI division uses a data warehouse to analyze the data and generate reports. They have decided to take the number of sold items as the key metric for measuring their success. Several months later it turns out that this metric doesn’t provide much information, so the team decides to use a new key metric: the total value of the sold items. Since the team only now starts measuring this metric, no historical information will be available for it.

Big data to the rescue

So how can big data help? Well, to start, big data allows us to store not only aggregated results but all information, in an unstructured way. So instead of processing information as it comes in and storing only the result, we store the raw information and extract from it only when we need to. Are we mad to do so? It will generate enormous amounts of data! Shouldn’t we aggregate the information?

Well, it is called big data for a reason, and please, please don’t aggregate your master data. Aggregating causes precious information to get lost, limiting your analysis capabilities right from the start. Instead, collect as much information as possible so your view on things is as clear as possible. Before the age of big data, storing all information just wasn’t an option, since it would require heavy, very expensive servers backed by an even more expensive SAN. Big data environments thrive on the idea of using commodity hardware to build large clusters at a fraction of the price. Sure, you can build a cluster out of very beefy servers, but the point is you don’t have to. Any low- to mid-range server will do well as a node.
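To make the Company A example a bit more concrete, here is a minimal Python sketch of the difference between storing an aggregate and storing raw records. The sales events, field names and function names are all invented for illustration; a real setup would read the records from a cluster rather than from a list in memory.

```python
from collections import defaultdict

# Hypothetical raw sales events, invented for this illustration only.
RAW_SALES = [
    {"month": "2012-01", "item": "widget", "quantity": 3, "unit_price": 10.0},
    {"month": "2012-01", "item": "gadget", "quantity": 1, "unit_price": 250.0},
    {"month": "2012-02", "item": "widget", "quantity": 5, "unit_price": 10.0},
]


def items_sold_per_month(events):
    """The original key metric: number of items sold per month."""
    totals = defaultdict(int)
    for event in events:
        totals[event["month"]] += event["quantity"]
    return dict(totals)


def revenue_per_month(events):
    """The metric added later: total value of the items sold per month."""
    totals = defaultdict(float)
    for event in events:
        totals[event["month"]] += event["quantity"] * event["unit_price"]
    return dict(totals)


if __name__ == "__main__":
    # Because the raw events were kept, both metrics cover the full history.
    print(items_sold_per_month(RAW_SALES))
    print(revenue_per_month(RAW_SALES))
```

Had only the output of the first function been stored, the second metric could not be computed for past months, because the prices were thrown away at aggregation time.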

A new way of looking at data

Big data allows us to look at data differently. Where in the past it was the business intelligence expert who decided which metrics were important, we can now let the data speak for itself: use all available information to discover relationships instead of assuming them.

Nathan Marz has written a truly magnificent article on how to beat the CAP theorem, which allows us to look at data in a totally different way.

Of course I am exaggerating a bit when I say any server will do. Depending on which services you want to run, you will need to focus on different aspects of your hardware. For example, to satisfy HBase’s memory hunger, you will need a good amount of memory for your region servers. On the other hand, running lots of MapReduce jobs puts a lot of pressure on your I/O controllers. But as an example, we use a 4-node cluster in which each node has one i7, 16 GB of memory and two 1 TB SATA-3 disks. We can easily do anything we want with it.
