This post is the first in a series on Big Data, since I thought the topic was worth going in-depth with. Big Data is a big word, a big buzzword, and some might even call it big bullshit, since many of the components revolving around Big Data, especially those on the analytics/methodology side that we can label Big Data Analytics, have been around for more than a decade.
On Monday, June 18th, 2012, I went to the Big Data Montreal event #5, as written on my Foursquare feed (yes, I used it sometimes!). The event involved presentations mainly on programming and on which software frameworks are best for correctly tabulating all of these data. The conversation covered frameworks such as Apache Hadoop, Pig, HiveQL and ejabberd, all software I've never programmed with, and whose objective is to clean up the mess in unstructured data. Personally, this is a part of Big Data I'm less familiar with; what I'm better at is what follows these steps in the process, "Big Data Analytics". But what really is Big Data?
In "Big Data: The next frontier for innovation, competition, and productivity", a 143-page report that will become a classic, the well-respected consulting firm McKinsey suggests that "Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze" (p. 1). So what does this mean? It means that Big Data is simply a term that refers to a big dataset, and everything that revolves around that dataset consists of supporting concepts to Big Data.
Why Should We Care About Big Data?
Yes, Big Data is everywhere, much like cheesy teen pop bands that all sound the same. But do you remember the sentence "You can't manage what you can't measure", attributed to management professor Peter Drucker? If you don't, then you should from now on. However, in this Big Data era, the competitive advantage should emerge from the following sentence: "you can manage what you can measure with the right method and the right software". A little longer and less sexy than Drucker's, but at least it is a great follow-up.
Theorizing vs Observing?
Is Big Data killing science? Is it killing theory? Psychologists create and develop theories by testing on small sample sizes. Big Data analysis follows the model of physics, which suggests that a distinct pattern may emerge from any Big Data; in that view, there is no point in devising a new theory, since what we care about is the pattern specific to a particular case. In 2008, in a provocative article entitled "The End of Theory", Chris Anderson, Wired's editor-in-chief, made a statement about how Big Data is becoming more and more important in many fields of study. I largely agree with this point of view, even though Chris Anderson might be biased since he's a physicist by training. In any case, I think that a pattern emerging from Big Data might still be explained by a theory or a series of theories, which would reconcile both points of view.
Big Data may sound like just a buzzword to many of us. However, even if many of the components that revolve around the concept have been around for many years, as previously mentioned, I agree that the software and programming frameworks used for big data extraction are much newer than the analytics methods associated with Big Data. It's so hot at home, I'll enjoy a big glass of water and prepare for big work tomorrow. I raise my big glass to Big Data!
I spent the last few weeks digging deeper into time series methods and data mining methods. For this post, I have decided to write a broad introduction to the latter (data mining), since it may look more practical and also trendier than time series (but watch out for the emergence of Particle Filtering and other Sequential Monte Carlo (SMC) methods).
Introduction to Data Mining and Criss Angel
So what is data mining? Data mining can be defined as the process, but also the "art", of discovering/mining patterns, meaning and insights in large datasets by using statistical and computational methods. In other words, a data miner is like a Criss Angel (you can pick any other magician here!) who will conjure, out of your messy ocean of data, insights that will be valuable to your company and may give you a competitive advantage over your competitors; simply read Tom Davenport's bestselling book "Competing on Analytics: The New Science of Winning" if you're not yet convinced of the power of analytics and, by extension, of data mining. Furthermore, data mining tasks are also considered part of a more general process called Knowledge Discovery in Databases (KDD), which includes the "art" of collecting the right data as well as organizing and cleaning these data, all extremely important tasks prior to analyzing the data.
Some Brief History and a Link to Business Intelligence
Data mining methods can be divided up in multiple ways. However, most books on the topic, especially those related to marketing and business intelligence, will generally divide data mining methods into two types: those related to supervised learning and those related to unsupervised learning.
Supervised learning is more often associated with scientific research, as it includes tasks where the data miner needs to describe or predict the relationship between a set of independent variables (also referred to as inputs or features) and a dependent variable (also referred to as the outcome, output or target variable). The dependent variable can be categorical (e.g. churn or classes of customers) or continuous (e.g. money earned from a customer), while the independent variables may be of any type but need to be coded properly (e.g. dividing categorical variables into separate binary variables). From a marketing and business intelligence perspective, I will divide supervised learning into two interrelated tasks: (1) supervised classification tasks and (2) predictive analysis.
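As a minimal sketch of that coding step, and using made-up category labels of my own, here is how a categorical variable can be split into separate binary (dummy) variables in plain Python:

```python
# Minimal sketch, with made-up data: split a categorical variable
# into separate binary (dummy) variables, one column per category.
segments = ["loyal", "churner", "new", "loyal"]  # hypothetical customer labels

categories = sorted(set(segments))
dummies = [[1 if s == cat else 0 for cat in categories] for s in segments]

print(categories)  # ['churner', 'loyal', 'new']
print(dummies)     # one row of 0/1 indicators per customer
```

Each customer ends up with exactly one 1 across the binary columns, which is the format most supervised methods expect for categorical inputs.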
Supervised Classification tasks: Supervised classification tasks arise when you want to correctly predict which class/category (the dependent variable) new observations (e.g. customers) belong to, based on results from an already known training dataset. Generally, you will achieve this by using: (1) a training dataset, (2) a validation dataset and (3) a test dataset. The best-known methods I use for these tasks are the following:
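The three-dataset setup described above can itself be sketched in a few lines; the 60/20/20 proportions here are my own assumption, not a fixed rule:

```python
import random

# Minimal sketch: randomly partition made-up records into
# training, validation and test sets (assumed 60/20/20 split).
random.seed(42)                      # reproducible shuffle
records = list(range(100))           # stand-in for 100 customer records
random.shuffle(records)

train = records[:60]
validation = records[60:80]
test = records[80:]

print(len(train), len(validation), len(test))  # 60 20 20
```

You fit on the training set, tune on the validation set, and only touch the test set once, for the final accuracy estimate.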
Predictive Analysis: I've decided to include the expression "predictive analysis" here, since it's a buzzword in the web community nowadays. Any task related to supervised classification already involves a so-called "predictive analysis". However, "predictive analysis" is a broader expression that also includes tasks related to predicting a continuous dependent variable rather than a categorical one. Some methods that can't be used to conduct classification analyses may be used for predictive analyses with continuous variables, and vice versa.
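As a toy illustration of predicting a continuous dependent variable (the numbers below are invented), a simple least-squares regression line can be fitted by hand:

```python
# Minimal sketch, with made-up data: least-squares fit of a
# continuous target (e.g. customer spend) on one input variable.
xs = [1.0, 2.0, 3.0, 4.0]        # e.g. number of past purchases
ys = [10.0, 20.0, 30.0, 40.0]    # e.g. dollars spent

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

prediction = slope * 5.0 + intercept  # predicted spend for a new customer
print(slope, intercept, prediction)   # 10.0 0.0 50.0
```

The same fit-then-predict loop underlies any "predictive analysis" with a continuous target, whatever method sits in the middle.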
Unsupervised learning is when the data miner's task is to detect patterns based only on the independent variables. It is generally presented in an algorithmic fashion rather than a purely statistical one. Well-known methods applied to marketing include: (1) Market Basket Analysis and (2) Clustering.
Market Basket Analysis: Market basket analysis (also abbreviated MBA, to confuse you even more) is certainly one of the best-known and easiest tasks linking data mining and marketing. It is considered more a typical marketing application than a data mining method per se. It can be simplified as a basic Amazon-style recommendation algorithm producing an association rule such as "the probability that customers who bought item A also bought item B is 56%". The classic urban legend about market basket analysis is the "beer and diapers" association, where a large supermarket chain, most people will say Walmart, analyzed customers' buying habits and found an association between beer purchases and diaper purchases. It was theorized that fathers were stopping off at Walmart to buy diapers for their babies, and since they could no longer go to bars and pubs as often as before, they would buy beer as well. As a result of this finding, the chain's managers placed the diapers next to the beer in the aisles, resulting in increased sales for both products.
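The kind of rule quoted above ("customers who bought item A also bought item B") boils down to a support and a confidence computed over transaction baskets. A minimal sketch with made-up baskets:

```python
# Minimal sketch, with made-up baskets: support and confidence
# of the association rule "diapers -> beer".
baskets = [
    {"beer", "diapers"},
    {"beer", "diapers", "chips"},
    {"beer"},
    {"diapers"},
    {"beer", "chips"},
]

antecedent, consequent = "diapers", "beer"
with_a = [b for b in baskets if antecedent in b]
with_both = [b for b in with_a if consequent in b]

support = len(with_both) / len(baskets)    # P(A and B) = 0.4
confidence = len(with_both) / len(with_a)  # P(B | A) ~= 0.67
print(support, confidence)
```

Real market basket tools (e.g. the Apriori algorithm) simply search for all item pairs or sets whose support and confidence exceed chosen thresholds.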
Clustering: Clustering is defined as the assignment of a set of observations (customers) into subsets (clusters), such that customers within a cluster are similar to each other while being different from customers in other clusters. Clustering is often used in marketing for segmentation tasks. However, even though segmentation may be achieved through "clustering", more modern model-based methods such as Bayesian mixture models, which I must say are not really part of the data mining field, are used by the few practitioners who actually understand how to program them (this is one method I am programming these days). For more about segmentation, I would refer anyone to the book "Market Segmentation: Conceptual and Methodological Foundations" by Michel Wedel and Wagner A. Kamakura, both professors and well-known authorities on the topic.
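To make the clustering idea concrete, here is a toy k-means sketch on invented one-dimensional spend figures (this is plain k-means, not the Bayesian mixture models mentioned above):

```python
# Minimal sketch, with made-up data: one-dimensional k-means, k = 2.
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]   # e.g. yearly spend in $000s
centers = [0.0, 12.0]                         # rough initial guesses

for _ in range(10):                           # a few refinement passes
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)           # assign to nearest center
    centers = [sum(c) / len(c) for c in clusters]  # recenter on the mean

print(centers)  # [1.5, 10.5]: two well-separated customer segments
```

The alternating assign-and-recenter loop is the whole algorithm; real implementations add multiple random restarts and a rule for empty clusters.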
Some Top References
I must say, without a doubt, that the best book I know on data mining is "The Elements of Statistical Learning" by Stanford professors Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman, which covers nearly every type of method you can use to conduct data mining tasks. I must admit, however, that this book focuses on the statistics behind the methods (but it's extremely clear) rather than on the software tools (no, it's not a cookbook) you could use to conduct these analyses, and it may also lack marketing applications for a marketer. Furthermore, to keep up with the data mining world, KDnuggets, administered by Gregory Piatetsky-Shapiro, is THE reference.
Some Top Software
Here is a description of some software I would recommend for data mining tasks; feel free to propose your own in the comments section:
1. R: R is actually my favorite software. I have been using it for nearly all of my statistics-related tasks for the last two years. It's free and open source, it has an extensive and very knowledgeable community, it's extremely intuitive, and it can be learned more easily if you already know languages such as C++, Python and/or GAUSS. Furthermore, there are a lot of useful packages available to facilitate the coding. However, I must say that compared to C++ or SAS, R can sometimes be slow for data mining tasks involving a heavy load of data.
2. rattle: rattle, which stands for the R Analytical Tool To Learn Easily, is a "point and click" data mining interface built on top of R and developed by Graham Williams of Togaware. Frankly, I must admit that this software rocks, even though I generally don't like "point and click" software. It's extremely complete and quite easy to use.
3. SAS Enterprise Miner: SAS Enterprise Miner, a module in SAS, was the first software I used for performing data mining tasks. It is extremely fast and user-friendly. However, I must admit that it reminds me of software like Amos, now included in PASW (formerly SPSS) for Structural Equation Modeling (SEM) tasks, where you move the "little truck" around to build your model and don't really understand what you're doing at the end of the day. Furthermore, it is expensive, but to my knowledge, SAS is the only software platform integrating data mining tasks with web analytics and social media analytics.
4. RapidMiner: RapidMiner, formerly known as YALE (Yet Another Learning Environment), is considered by many data miners to be THE software for conducting data mining tasks. Like R, it is open source and free of charge in its "Community" version. I haven't made the switch from R to RapidMiner yet, and I am currently testing the software in depth.
5. Salford Systems: I must confess that I have never used Salford Systems software and know them only by reputation, so I can't offer a clear personal opinion. However, statisticians working at Salford Systems are presenting workshops on data mining at the next Joint Statistical Meeting (JSM) in Miami at the end of July 2011, which I might attend.
Waiting Time and Conclusion
Whatever software you're using, data mining tasks will always be demanding on your computer's memory. Data mining in marketing and business intelligence, and more broadly KDD, is an art that requires strong statistical skills but also a solid comprehension of marketing problems. So while you're waiting for your data mining computations, feel free to come by and read my other cool posts on your other computer! In any case, enjoy data mining, and as one of my friends would say, "show some respect to the machine", but even more to the data miner!
Convergence, convergence and convergence, but what the heck are you talking about? Convergence can take many forms, and since it is a buzzword for managers – along with words such as "viral", "word-of-mouth", "social media", "sustainability" or "economic crisis" – many individuals still have difficulty defining what convergence really is. According to the Merriam-Webster online dictionary, "convergence" refers to "the merging of distinct technologies, industries, or devices into a unified whole". In relation to this definition, the problem with the word "convergence" is precisely that it can take many forms. To be more precise, it can take at least three forms that are useful for e-marketers to know.
The first form of convergence concerns technological tools. Most useful (and even useless) electronic devices are now integrated into smartphones. In fact, most smartphones include (or will include) the features of "traditional" cell phones, but also of devices like cameras, computers (desktop or laptop), electronic agendas, GPS units, MP3 players and video game consoles. Moreover, it seems like only a matter of time (though perhaps lots of time) before everyone has his or her own smartphone.
The second form of convergence is reflected in the increasing number of technological tools and means of transportation that converge to the Internet. Nowadays, it is possible to access the Internet in any transportation vehicle (airplanes, cars, boats and trains), as well as via many technological tools (cell phones, computers, interactive digital televisions, interactive kiosks and smartphones). Linking this second form of convergence with the first leads me to predict that the convergence of technological tools into smartphones will also result in an explosion in the number of smartphone kits available for every other type of technological tool, similar to the iPod car kit.
Finally, the third form is the convergence of media content to the Internet. More and more media, such as advertising billboards, magazines, newspapers, radio stations, SMS, and television networks, produce content that includes an expression such as "visit our website at …", referring to a specific website. By linking this form of convergence with the other two, media such as advertising billboards, radio stations, SMS, and television networks will be able (in the near future) to converge instantaneously to the Internet through smartphones. In the case of magazines and newspapers, it is still hard to predict what will happen, but the decreasing number of subscribers who actually read them will tend to push these two media to concentrate their efforts on niche markets.
In conclusion, if the "future is friendly", as the Telus catchphrase proposes, then within the next few years everyone will end up with a smartphone they can plug in everywhere using some sort of kit. So, do you think the future is that friendly?
The famous expression "Big Brother is watching you", taken directly from George Orwell's visionary book 1984, written in 1948 and published in 1949, hits the field of marketing one more time. And this time, in-store advertisers are the "evil" marketers involved. In-store advertisers have started to use facial recognition software incorporated into displays to gather consumer information such as gender, age and ethnicity (through skin pigmentation) in order to (with a millisecond lag time) target these consumers with personalized interactive ads. This in-store practice is labeled "ad targeting" and parallels the commonly used online practice of "behavioral targeting", but in an offline store context. This new practice can also be seen as an extension of the detailed procedure described in the acclaimed 1999 business bestseller "Why We Buy: The Science of Shopping" by Paco Underhill, in which the analyst hides near the consumer and notes on a track sheet every single characteristic and movement he, the consumer, makes.
A good visual example of in-store displays that target consumers with personalized ads appears in a scene of Steven Spielberg's 2002 movie "Minority Report", where the character played by Tom Cruise is a fugitive running through a shopping mall populated by interactive ads targeted directly at him. Real-life examples include, as noted in Emily Steel's Wall Street Journal article, the American restaurant chain Dunkin' Donuts, where people ordering a coffee in the morning are exposed, based on their characteristics, to ads at the cash register promoting, for instance, the chain's hash browns or breakfast sandwiches. Moreover, this practice has also been used in Japan, where interactive billboards have played the same role for consumers as in-store displays.
The practice of in-store ad targeting is a new dream come true for marketers, as behavioral targeting on the Internet was at the beginning of the millennium. However, the most relevant question for managers is how consumers will react to this technology – which can help some consumers save time and money, but which can also be perceived as intrusive by others. If computerized in-store displays become part of your everyday life, how would you, as a consumer, react to this new technology?