Predicting the future with big data

Telling digital fortunetelling apart from science will never be easy.

"The octopus chose the box of food with the 'correct' flag with 100 percent success rate on the seven matches that Germany played, something that has a 1 in 128 chance of happening by accident," writes Carlos Castillo [Reuters]

In today’s world we are surrounded by predictions. For instance, during political elections the main focus of the media and the public is not on the differences between the candidates’ positions, but rather on the “horse race” aspect of the competition. Issues at stake are secondary compared to the main question: who is going to win?

Professional pollsters bombard us with predictions, all of them carrying footnotes about stratified sampling, confidence levels and error margins. Unfortunately, for most people, math classes are not the fondest memory of their school years, and for many, even remembering math exams still brings a sort of physical discomfort. It is not surprising that all the “details” about statistics are often overlooked.
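
Those footnotes are less arcane than they sound. As a minimal sketch, assuming a simple random sample and the usual normal approximation, and using made-up poll numbers, the quoted margin of error comes down to a one-line formula:

```python
import math

def margin_of_error(p, n, z=1.96):
    """95 percent margin of error for a proportion p estimated from n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical poll: 52 percent support among 1,000 respondents
moe = margin_of_error(0.52, 1000)
print(f"52% +/- {100 * moe:.1f} percentage points")  # roughly +/- 3.1 points
```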

However, it is important that we pay more attention to the details of the statistics we are being shown. For instance, do you see anything wrong with this chart from late 2011? (Hint: compare March and November).

Problems with statistics and predictions are not limited to graphical representation; in fact, they can be more complicated and challenging, especially with the advent of Big Data and its use in making projections.

Data trends

Five years ago Google launched Flu Trends, showing that the search volume of certain terms was correlated with levels of flu activity. They found that the increase in the usage of flu-related terms happened days before healthcare authorities reported an increase in flu cases. This study provided a template for many that followed: data mining of online content could help predict real-world outcomes.
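
As a rough illustration of the kind of signal such studies look for, and not Google's actual method, one can check whether a flu-related search series leads the official case counts by a week or two. The weekly numbers below are invented:

```python
import numpy as np

# Hypothetical weekly series: normalised search volume for flu-related terms,
# and flu cases reported by health authorities in the same weeks.
searches = np.array([1.0, 1.2, 1.8, 2.9, 4.1, 3.6, 2.2, 1.4])
cases    = np.array([90, 95, 110, 160, 240, 310, 280, 190])

# Does this week's search volume track the cases reported `lag` weeks later?
n = len(searches)
for lag in range(3):
    r = np.corrcoef(searches[:n - lag], cases[lag:])[0, 1]
    print(f"searches vs. cases {lag} week(s) later: Pearson r = {r:.2f}")
```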

Researchers turned their attention to data from online searches, blogs, news articles, tweets, etc. Successful predictions were made about the stock market, the job and car markets, the box-office revenue of movies, music and video game sales, and more. And of course, predictions were made about the results of political elections.

Scientists were quick to point out, however, that there are many limitations to these predictions. First, for media products (e.g. music), there are sometimes public data sources, such as Billboard Top 100 lists, that yield predictions of record sales similar to, or better than, those obtained with big data. Second, in the political arena, applying the same successful methodology to a different country or a different election can produce completely wrong results.

A general problem is that online data sets are big, but there is no guarantee that they are representative of the overall population. Data from social media often over-represents an urban, educated and privileged population. And even if you could predict elections or the stock market, your accurate predictions would change the object of your study in such a way as to render them meaningless.

Digital fortunetelling

Telling good predictions apart from bad predictions is a very difficult matter. In 2010 Paul the Octopus rose to fame by predicting the outcome of many matches during the FIFA World Cup. The octopus chose the box of food with the “correct” flag with 100 percent success rate on the seven matches that Germany played, something that has a 1 in 128 chance of happening by accident.

However, most of the press overlooked the fact that Paul was not the only non-human animal making “predictions” at the time in Germany: at least a porcupine, a pygmy hippopotamus and a tamarin monkey were playing the same game with less success.

This is a typical pattern noted by epistemologist Nassim N. Taleb. If you start with a large group making predictions at random, for instance about the stock market, after a while most will have a mix of successes and mistakes, but inevitably, a few will have a very good record. This set of privileged people will exist independently of whether they are truly making an informed prediction, or just guessing.
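
Paul's 1-in-128 figure is simply (1/2)^7, and Taleb's point can be checked with a toy simulation: give a large enough crowd of coin-flipping "oracles" seven matches to predict, and a handful of flawless records appear by chance alone. The crowd below is hypothetical:

```python
import random

random.seed(42)

GUESSERS = 1000   # hypothetical crowd of animals and pundits guessing at random
MATCHES = 7       # matches "predicted", each an even-odds guess

perfect = sum(
    all(random.random() < 0.5 for _ in range(MATCHES))
    for _ in range(GUESSERS)
)

# Each guesser is flawless with probability (1/2) ** 7 = 1/128,
# so roughly 1000 / 128, about 8, perfect records are expected by luck alone.
print(f"{perfect} of {GUESSERS} random guessers got all {MATCHES} matches right")
```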

To put it succinctly: the success at predicting past events is evidence that you are not bad at making predictions, but does not prove that you are good at it.

Statistical truths

However difficult the prediction business can be, we actually use statistical truths in our daily lives. Take, for instance, the “rare side effects” usually listed in the package inserts of medicines. Aspirin may cause mild side effects such as nausea and stomach pain, but it can also cause hallucinations and seizures.

Lists of side effects are almost invariably compiled not from things that may happen, but from things that actually have happened to someone, during clinical trials or later on. After reading the insert, it is the patient’s decision to think about what “rare” means. Or not to think about it at all.

There are many methods for measuring the extent of statistical truths. For predictions in particular, what researchers typically look for is the extent to which the observed “causes” inform a prediction that agrees with reality, accompanied by a plausible mechanism explaining this agreement. In the case of Paul the Octopus, for instance, the agreement between predictions and outcomes may be there, but a plausible mechanism clearly isn’t.

To complicate things further, evaluating predictions of rare events usually requires a different yardstick. Accuracy is a pretty good way of conveying information about predictions of common events, for instance the batting average of a baseball or cricket player, but it can be very misleading for rare events.

For instance, one could predict whether it will rain tomorrow in Doha with 97.8 percent accuracy, which sounds very good! However, it rains in Doha on average only 8 days a year. The chance of rain is so low (8/365 = 2.2%) that always saying “it won’t rain” is enough for this level of accuracy.

Instead, we should ask about false positives and false negatives: How often did we say it will rain, but it didn’t? How often did it rain, but we said it wouldn’t? When dealing with rare events, accuracy as a measure is meaningless.
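
A minimal sketch of the Doha example, with a fabricated year of daily outcomes, shows the problem: the lazy forecaster who always says “it won’t rain” scores an impressive accuracy while catching none of the rainy days.

```python
# Fabricated year of daily outcomes: 8 rainy days out of 365, as in the Doha example.
DAYS = 365
rain_days = {10, 45, 46, 120, 300, 330, 331, 360}   # hypothetical dates
actual = [day in rain_days for day in range(DAYS)]

predicted = [False] * DAYS   # the lazy forecaster: always "it won't rain"

true_pos  = sum(a and p for a, p in zip(actual, predicted))
false_pos = sum(p and not a for a, p in zip(actual, predicted))
false_neg = sum(a and not p for a, p in zip(actual, predicted))
accuracy  = sum(a == p for a, p in zip(actual, predicted)) / DAYS

print(f"accuracy: {accuracy:.1%}")        # about 97.8 percent, which sounds impressive
print(f"rainy days caught: {true_pos}")   # 0: every rainy day is missed
print(f"false positives: {false_pos}")    # 0: we never cried wolf, but only because we never predicted rain
print(f"false negatives: {false_neg}")    # 8: all the rain we failed to predict
```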

It is hard to make predictions

“Big Data” is a fairly recent buzzword, but its field of study has been around for quite some time, as attested by international scientific conferences on data mining and machine learning that have existed for 20 and 30 years respectively, as well as newer venues such as IEEE Big Data. What is new is the level of attention that data science is receiving from the public, which is not a bad thing in itself. However, as with every other scientific discipline, scientists, the media and the public share the responsibility of communicating these advances in an honest and clear manner.

For many decades now, Big Data has been successfully used to make predictions in marketing, insurance, retail and many other sectors. Online data is being added to the mix with promising results, which, like everything else, have to be examined with care and scepticism until they have been tested exhaustively.

The late baseball coach of the New York Yankees, Yogi Berra, said once that “it is hard to make predictions … especially about the future.” Interpreting those predictions may be just as hard. Hard, but not impossible.

Carlos Castillo is a Senior Scientist at the Qatar Computing Research Institute in Doha. He is a web miner with a background in information retrieval, and has been influential in the areas of adversarial web search and web content quality and credibility.

Follow him on Twitter: @ChaToX