petabytes - JF Bélisle

This post is the first of a series of posts related to Big Data, since I thought it was worth going in-depth with this topic. Big Data is a big word, a big buzzword, some might even call it big bullshit, since many components revolving around Big Data, and especially the ones on the analytics/methodology side, that we can label Big Data Analytics, have been around since more than a decade.

Monday June 18th 2012, I went to the Big Data Montreal event #5 as it is written on my Foursquare feed (yes, I used it sometimes!). The event involved presentations mainly on programming and on what where the best software frameworks to use to correctly tabulate all of these data. The conversation was about software frameworks such as Apache Hadoop, Pig, HiveQL and Ejjaberd, all software frameworks I’ve never programmed with, and that have for objective of cleaning the mess in unstructured data. Personally, this is a part of Big Data I’m less familiar with, and what I’m better at is what follows these steps in the process, “Big Data Analytics”. But what is really Big Data?

As defined in “Big Data: The next frontier for innovation, competition, and productivity”, a 143-page report that will become a classic report, the well-respected consulting firm McKinsey suggests that “Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze” (p.1). So what does this mean? It means that Big Data is only a term that refers to a big dataset, and what revolves around this database are only supporting concepts to Big Data.

Why Should We Care About Big Data?

Yes, Big Data are everywhere, similarly to cheesy teenager’s pop bands that all sounds the same. But do you remember the sentence: “You can’t manage what you can’t measure” by management Professor Peter Drucker? If you don’t, then you should from now on. However, in this Big Data era, the competitive advantage should emerge from the following sentence: “you can manage what you can measure with the right method and the right software”. A little longer and less sexy than the one by Drucker, but at least it is a great follow-up.

Theorizing vs Observing?

Is Big Data killing science? Is it killing theory? Psychologists create and develop theories by testing on small sample sizes. Big Data analysis is based on the model of Physics which suggests that a different pattern may emerge from any Big Data, which means that there is no point of having a new theory, what we care about is the pattern that is specific to a particular case. In 2008, in a provocative article entitled “The End of Theory”, Chris Anderson, Wired editor-in-chief, made a statement about how Big Data are becoming more and more important in many fields of study. Related to this point of view, I completely agree even though Chris Anderson might be biased since he’s a physicist by training. In anyways, I think that a pattern that emerges from Big Data might be explained by a theory or a series of theories, which would reconcile both points of views.

Conclusion

Big Data may sound simply like a buzzword for many of us. However, even if many of the components that revolve around the concept have been around since many years, as previously mentioned, I agree that the software and programming software used for big data extraction are way much newer than the analytics methods associated with Big Data. It’s so hot at home, I’ll enjoy a big glass of water and prepare for big work tomorrow. I raise my big glass to Big Data!

Cheer up!

Jean-Francois