
data science

Big data: are we making a big mistake?
Economist, journalist and broadcaster Tim Harford delivered the 2014 Significance
lecture at the Royal Statistical Society International Conference. In this article,
republished from the Financial Times, Harford warns us not to forget the statistical
lessons of the past as we rush to embrace the big data future.

Five years ago, a team of researchers from Google announced a remarkable achievement in one of the world’s top scientific journals, Nature. Without needing the results of a single medical check-up, they were nevertheless able to track the spread of influenza across the US. What’s more, they could do it more quickly than the Centers for Disease Control and Prevention (CDC). Google’s tracking had only a day’s delay, compared with the week or more it took for the CDC to assemble a picture based on reports from doctors’ surgeries. Google was faster because it was tracking the outbreak by finding a correlation between what people searched for online and whether they had flu symptoms.

Not only was “Google Flu Trends” quick, accurate and cheap, it was theory-free. Google’s engineers didn’t bother to develop a hypothesis about what search terms – “flu symptoms” or “pharmacies near me” – might be correlated with the spread of the disease itself. The Google team just took their top 50 million search terms and let the algorithms do the work.
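
Google has not published the details of its model, but the recipe described above (screen a huge pool of candidate search terms for correlation with the official flu counts, keep the strongest, and fit a predictive model to them) is easy to sketch. The snippet below is purely illustrative: the data are synthetic and the sizes are tiny compared with Google’s 50 million terms.

```python
# Purely illustrative sketch of "theory-free" correlation mining, in the spirit
# of the approach described above. All data here are synthetic; the real search
# logs and the real Google Flu Trends model are not public.
import numpy as np

rng = np.random.default_rng(0)
n_weeks, n_terms, n_keep = 150, 5_000, 45           # hypothetical sizes

cdc_flu = np.abs(np.sin(np.arange(n_weeks) / 8)) + 0.1 * rng.standard_normal(n_weeks)
searches = rng.standard_normal((n_weeks, n_terms))  # weekly frequency of each search term
searches[:, :10] += cdc_flu[:, None]                # a few terms genuinely track flu

# Step 1: screen every candidate term by its correlation with the official series.
corr = np.array([np.corrcoef(searches[:, j], cdc_flu)[0, 1] for j in range(n_terms)])
keep = np.argsort(-np.abs(corr))[:n_keep]           # keep the most correlated terms

# Step 2: fit a linear model to the selected terms and "nowcast" flu activity.
X = np.column_stack([searches[:, keep], np.ones(n_weeks)])
beta, *_ = np.linalg.lstsq(X, cdc_flu, rcond=None)
nowcast = X @ beta

print(f"in-sample correlation with the official series: {np.corrcoef(nowcast, cdc_flu)[0, 1]:.2f}")
```

Nothing in the selection step uses any medical knowledge, which is the sense in which the approach is theory-free; it is also, as the story below suggests, the source of its fragility.
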
The success of Google Flu Trends became emblematic of the hot new trend in business, technology and science: “Big Data”. What, excited journalists asked, can science learn from Google?

As with so many buzzwords, “big data” is a vague term, often thrown around by people with something to sell. Some emphasise the sheer scale of the data sets that now exist – the Large Hadron Collider’s computers, for example, store 15 petabytes a year of data, equivalent to about 15,000 years’ worth of your favourite music.

But the “big data” that interests many companies is what we might call “found data”, the digital exhaust of web searches, credit card payments and mobiles pinging the nearest phone mast. Google Flu Trends was built on found data and it’s this sort of data that interests me here. Such data sets can be even bigger than the LHC data – Facebook’s is – but just as noteworthy is the fact that they are cheap to collect relative to their size, they are a messy collage of data points collected for disparate purposes and they can be updated in real time. As our communication, leisure and commerce have moved to the internet and the internet has moved into our phones, our cars and even our glasses, life can be recorded and quantified in a way that would have been hard to imagine just a decade ago.

Cheerleaders for big data have made four exciting claims, each one reflected in the success of Google Flu Trends: that data analysis produces uncannily accurate results; that every single data point can be captured, making old statistical sampling techniques obsolete; that it is passé to fret about what causes what, because statistical correlation tells us what we need to know; and that scientific or statistical models aren’t needed because, to quote “The End of Theory”, a provocative essay published in Wired in 2008, “with enough data, the numbers speak for themselves”.

Unfortunately, these four articles of faith are at best optimistic oversimplifications. At worst, according to David Spiegelhalter, Winton Professor of the Public Understanding of Risk at Cambridge University, they can be “complete bollocks. Absolute nonsense.”

The data exhaust

Found data underpin the new internet economy as companies such as Google, Facebook and Amazon seek new ways to understand our lives through our data exhaust. Since Edward Snowden’s leaks about the scale and scope of US electronic surveillance it has become apparent that security services are just as fascinated with what they might learn from our data exhaust, too.

Consultants urge the data-naive to wise up to the potential of big data. A recent report from the McKinsey Global Institute reckoned that the US healthcare system could save $300bn a year – $1,000 per American – through better integration and analysis of the data produced by everything from clinical trials to health insurance transactions to smart running shoes.

But while big data promise much to scientists, entrepreneurs and governments, they are doomed to disappoint us if we ignore some very familiar statistical lessons. “There are a lot of small data problems that occur in big data,” says Spiegelhalter. “They don’t disappear because you’ve got lots of the stuff. They get worse.”

Four years after the original Nature paper was published, Nature News had sad tidings to convey: the latest flu outbreak had claimed an unexpected victim: Google Flu Trends. After reliably providing a swift and accurate account of flu outbreaks for several winters, the theory-free, data-rich model had lost its nose for where flu was going. Google’s model pointed to a severe outbreak but when the slow-and-steady data from the CDC arrived, they showed that Google’s estimates of the spread of flu-like illnesses were overstated by almost a factor of two.

The problem was that Google did not know – could not begin to know – what linked the search terms with the spread of flu. Google’s engineers weren’t trying to figure out what caused what. They were merely finding statistical patterns in the data. They cared about correlation rather than causation. This is common in big data analysis. Figuring out what causes what is hard (impossible, some say). Figuring out what is correlated with what is much cheaper and easier. That is why, according to Viktor Mayer-Schönberger and Kenneth Cukier’s book, Big Data, “causality won’t be discarded, but it is being knocked off its pedestal as the primary fountain of meaning”.

But a theory-free analysis of mere correlations is inevitably fragile. If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down. One explanation of the Flu Trends failure is that the news was full of scary stories about flu in December 2012 and that these stories provoked internet searches by people who were healthy. Another possible explanation is that Google’s own search algorithm moved the goalposts when it began automatically suggesting diagnoses when people entered medical symptoms.

Google Flu Trends will bounce back, recalibrated with fresh data – and rightly so. There are many reasons to be excited about the broader opportunities offered to us by the ease with which we can gather and analyse vast data sets. But unless we learn the lessons of this episode, we will find ourselves repeating it.

Statisticians have spent the past 200 years figuring out what traps lie in wait when we try to understand the world through data. The data are bigger, faster and cheaper these days – but we must not pretend that the traps have all been made safe. They have not.

In 1936, the Republican Alfred Landon stood for election against President Franklin Delano Roosevelt. The respected magazine, The Literary Digest, shouldered the responsibility of forecasting the result. It conducted a postal opinion poll of astonishing ambition, with the aim of reaching 10 million people, a quarter of the electorate. The deluge of mailed-in replies can hardly be imagined but the Digest seemed to be relishing the scale of the task. In late August it reported, “Next week, the first answers from these ten million will begin the incoming tide of marked ballots, to be triple-checked, verified, five-times cross-classified and totalled.”

After tabulating an astonishing 2.4 million returns as they flowed in over two months, The Literary Digest announced its conclusions: Landon would win by a convincing 55 per cent to 41 per cent, with a few voters favouring a third candidate.

The election delivered a very different result: Roosevelt crushed Landon by 61 per cent to 37 per cent. To add to The Literary Digest’s agony, a far smaller survey conducted by the opinion poll pioneer George Gallup came much closer to the final vote, forecasting a comfortable victory for Roosevelt. Mr Gallup understood something that The Literary Digest did not. When it comes to data, size isn’t everything.

Opinion polls are based on samples of the voting population at large. This means that opinion pollsters need to deal with two issues: sample error and sample bias. Sample error reflects the risk that, purely by chance, a randomly chosen sample of opinions does not reflect the true views of the population. The “margin of error” reported in opinion polls reflects this risk and the larger the sample, the smaller the margin of error. A thousand interviews is a large enough sample for many purposes and Mr Gallup is reported to have conducted 3,000 interviews.

But if 3,000 interviews were good, why weren’t 2.4 million far better? The answer is that sampling error has a far more dangerous friend: sampling bias. Sampling error is when a randomly chosen sample doesn’t reflect the underlying population purely by chance; sampling bias is when the sample isn’t randomly chosen at all. George Gallup took pains to find an unbiased sample because he knew that was far more important than finding a big one.

The Literary Digest, in its quest for a bigger data set, fumbled the question of a biased sample. It mailed out forms to people on a list it had compiled from automobile registrations and telephone directories – a sample that, at least in 1936, was disproportionately prosperous. To compound the problem, Landon supporters turned out to be more likely to mail back their answers. The combination of those two biases was enough to doom The Literary Digest’s poll. For each person George Gallup’s pollsters interviewed, The Literary Digest received 800 responses. All that gave them for their pains was a very precise estimate of the wrong answer.
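
The arithmetic behind that “very precise estimate of the wrong answer” is easy to reproduce. The simulation below is a toy version of 1936: the electorate, the sample sizes and the response rates are all invented, and a single response probability stands in for the combined effect of the Digest’s two biases (the prosperous mailing list and the keener Landon supporters).

```python
# Toy illustration of sampling error versus sampling bias; every number here is
# invented for the purpose of the example.
import numpy as np

rng = np.random.default_rng(1936)
N = 10_000_000
roosevelt = rng.random(N) < 0.61                  # True = Roosevelt voter (true share 61%)

# A Gallup-style poll: small, but close to a random draw from the electorate.
polled = rng.choice(N, size=3_000, replace=False)
print(f"truth:            {roosevelt.mean():.1%} for Roosevelt")
print(f"3,000 at random:  {roosevelt[polled].mean():.1%} (margin of error roughly 2 points)")

# A Digest-style poll: enormous, but Landon supporters are far more likely to
# be on the mailing list and to reply.
replied = rng.random(N) < np.where(roosevelt, 0.15, 0.40)
print(f"{replied.sum():,} mailed back: {roosevelt[replied].mean():.1%} (precise, and wrong)")
```

Piling up more biased returns only narrows the interval around the wrong number; it does nothing to move the estimate towards the truth.
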
History repeating?

The big data craze threatens to be The Literary Digest all over again. Because found data sets are so messy, it can be hard to figure out what biases lurk inside them – and because they are so large, some analysts seem to have decided the sampling problem isn’t worth worrying about. It is.

Professor Viktor Mayer-Schönberger of Oxford’s Internet Institute, co-author of Big Data, told me that his favoured definition of a big data set is one where “N = All” – where we no longer have to sample, but we have the entire background population. Returning officers do not estimate an election result with a representative tally: they count the votes – all the votes. And when “N = All” there is indeed no issue of sampling bias because the sample includes everyone.

But is “N = All” really a good description of most of the found data sets we are considering? Probably not. “I would challenge the notion that one could ever have all the data,” says Patrick Wolfe, a computer scientist and professor of statistics at University College London.

An example is Twitter. It is in principle possible to record and analyse every message on Twitter and use it to draw conclusions about the public mood. (In practice, most researchers use a subset of that vast “fire hose” of data.) But while we can look at all the tweets, Twitter users are not representative of the population as a whole. (According to the Pew Research Internet Project, in 2013, US-based Twitter users were disproportionately young, urban or suburban, and black.)

There must always be a question about who and what is missing, especially with a messy pile of found data. Kaiser Fung, a data analyst and author of Numbersense, warns against simply assuming we have everything that matters. “N = All is often an assumption rather than a fact about the data,” he says.

Consider Boston’s Street Bump smartphone app, which uses a phone’s accelerometer to detect potholes without the need for city workers to patrol the streets. As citizens of Boston download the app and drive around, their phones automatically notify City Hall of the need to repair the road surface. Solving the technical challenges involved has produced, rather beautifully, an informative data exhaust that addresses a problem in a way that would have been inconceivable a few years ago. The City of Boston proudly proclaims that the “data provides the City with real-time information it uses to fix problems and plan long term investments.”

Yet what Street Bump really produces, left to its own devices, is a map of potholes that systematically favours young, affluent areas where more people own smartphones. Street Bump offers us “N = All” in the sense that every bump from every enabled phone can be recorded. That is not the same thing as recording every pothole. As Microsoft researcher Kate Crawford points out, found data contain systematic biases and it takes careful thought to spot and correct for those biases. Big data sets can seem comprehensive but the “N = All” is often a seductive illusion.

Who cares about causation or sampling bias, though, when there is money to be made? Corporations around the world must be salivating as they contemplate the uncanny success of the US discount department store Target, as famously reported by Charles Duhigg in The New York Times in 2012. Duhigg explained that Target has collected so much data on its customers, and is so skilled at analysing that data, that its insight into consumers can seem like magic.

Duhigg’s killer anecdote was of the man who stormed into a Target near Minneapolis and complained to the manager that the company was sending coupons for baby clothes and maternity wear to his teenage daughter. The manager apologised profusely and later called to apologise again – only to be told that the teenager was indeed pregnant. Her father hadn’t realised. Target, after analysing her purchases of unscented wipes and magnesium supplements, had.

Statistical sorcery? There is a more mundane explanation. “There’s a huge false positive issue,” says Kaiser Fung, who has spent years developing similar approaches for retailers and advertisers. What Fung means is that we didn’t get to hear the countless stories about all the women who received coupons for babywear but who weren’t pregnant.

Hearing the anecdote, it’s easy to assume that Target’s algorithms are infallible – that everybody receiving coupons for onesies and wet wipes is pregnant. This is vanishingly unlikely. Indeed, it could be that pregnant women receive such offers merely because everybody on Target’s mailing list receives such offers. We should not buy the idea that Target employs mind-readers before considering how many misses attend each hit.
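
Counting the misses is a base-rate calculation, and three invented numbers are enough to make the point (Target has never published its hit rates): how common pregnancy is on the mailing list, how often the model spots it, and how often it flags someone who is not pregnant.

```python
# Invented figures for illustration only: suppose 2% of the mailing list is
# pregnant, the model flags 80% of those who are, and 5% of everyone else.
prevalence, sensitivity, false_positive_rate = 0.02, 0.80, 0.05

true_positives  = prevalence * sensitivity
false_positives = (1 - prevalence) * false_positive_rate
share_pregnant  = true_positives / (true_positives + false_positives)

print(f"share of coupon recipients actually pregnant: {share_pregnant:.0%}")
# About 25%: roughly three out of four "uncanny" mailings go to women who
# are not pregnant, even though the model itself looks quite accurate.
```
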
In Charles Duhigg’s account, Target mixes in random offers, such as coupons for wine glasses, because pregnant customers would feel spooked if they realised how intimately the company’s computers understood them.

Fung has another explanation: Target mixes up its offers not because it would be weird to send an all-baby coupon-book to a woman who was pregnant but because the company knows that many of those coupon books will be sent to women who aren’t pregnant after all.

None of this suggests that such data analysis is worthless: it may be highly profitable. Even a modest increase in the accuracy of targeted special offers would be a prize worth winning. But profitability should not be conflated with omniscience.

The multiple-comparisons problem

In 2005, John Ioannidis, an epidemiologist, published a research paper with the self-explanatory title, “Why Most Published Research Findings Are False”. The paper became famous as a provocative diagnosis of a serious issue. One of the key ideas behind Ioannidis’s work is what statisticians call the “multiple-comparisons problem”.

It is routine, when examining a pattern in data, to ask whether such a pattern might have emerged by chance. If it is unlikely that the observed pattern could have emerged at random, we call that pattern “statistically significant”.

The multiple-comparisons problem arises when a researcher looks at many possible patterns. Consider a randomised trial in which vitamins are given to some primary schoolchildren and placebos are given to others. Do the vitamins work? That all depends on what we mean by “work”. The researchers could look at the children’s height, weight, prevalence of tooth decay, classroom behaviour, test scores, even (after waiting) prison record or earnings at the age of 25. Then there are combinations to check: do the vitamins have an effect on the poorer kids, the richer kids, the boys, the girls? Test enough different correlations and fluke results will drown out the real discoveries.

There are various ways to deal with this but the problem is more serious in large data sets, because there are vastly more possible comparisons than there are data points to compare. Without careful analysis, the ratio of genuine patterns to spurious patterns – of signal to noise – quickly tends to zero.
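
How quickly the noise takes over is easy to demonstrate. The simulation below runs a version of the vitamin trial described above on entirely artificial data in which the treatment affects nothing at all, then tests it against a thousand outcomes; at the conventional 5% threshold, dozens of spurious “discoveries” appear anyway.

```python
# Entirely artificial data: a "treatment" with no real effect, tested against
# many independent outcomes. Spurious "significant" findings pile up anyway.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_children, n_outcomes = 200, 1_000

vitamins = np.arange(n_children) < n_children // 2         # first half get the vitamins
outcomes = rng.standard_normal((n_children, n_outcomes))   # nothing depends on treatment

_, p_values = stats.ttest_ind(outcomes[vitamins], outcomes[~vitamins])
print(f"'significant' at p < 0.05: {(p_values < 0.05).sum()} of {n_outcomes} outcomes")
# Expect about 50 spurious hits (5% of the comparisons) despite zero real effect.
```
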
Worse still, one of the antidotes to the multiple-comparisons problem is transparency, allowing other researchers to figure out how many hypotheses were tested and how many contrary results are languishing in desk drawers because they just didn’t seem interesting enough to publish. Yet found data sets are rarely transparent. Amazon and Google, Facebook and Twitter, Target and Tesco – these companies aren’t about to share their data with you or anyone else.

New, large, cheap data sets and powerful analytical tools will pay dividends – nobody doubts that. And there are a few cases in which analysis of very large data sets has worked miracles. David Spiegelhalter of Cambridge points to Google Translate, which operates by statistically analysing hundreds of millions of documents that have been translated by humans and looking for patterns it can copy. This is an example of what computer scientists call “machine learning”, and it can deliver astonishing results with no preprogrammed grammatical rules. Google Translate is as close to a theory-free, data-driven algorithmic black box as we have – and it is, says Spiegelhalter, “an amazing achievement”. That achievement is built on the clever processing of enormous data sets.

But big data do not solve the problem that has obsessed statisticians and scientists for centuries: the problem of insight, of inferring what is going on, and figuring out how we might intervene to change a system for the better.

“We have a new resource here,” says Professor David Hand of Imperial College London. “But nobody wants ‘data’. What they want are the answers.”

To use big data to produce such answers will require large strides in statistical methods. “It’s the wild west right now,” says Patrick Wolfe of UCL. “People who are clever and driven will twist and turn and use every tool to get sense out of these data sets, and that’s cool. But we’re flying a little bit blind at the moment.”

Statisticians are scrambling to develop new methods to seize the opportunity of big data. Such new methods are essential but they will work by building on the old statistical lessons, not by ignoring them.

Recall big data’s four articles of faith. Uncanny accuracy is easy to overrate if we simply ignore false positives, as with Target’s pregnancy predictor. The claim that causation has been “knocked off its pedestal” is fine if we are making predictions in a stable environment but not if the world is changing (as with Flu Trends) or if we ourselves hope to change it. The promise that “N = All”, and therefore that sampling bias does not matter, is simply not true in most cases that count. As for the idea that “with enough data, the numbers speak for themselves” – that seems hopelessly naive in data sets where spurious patterns vastly outnumber genuine discoveries.

“Big data” has arrived, but big insights have not. The challenge now is to solve new problems and gain new answers – without making the same old statistical mistakes on a grander scale than ever.

From The Financial Times © 2014 The Financial Times Ltd. All rights reserved

How can statisticians rise to the big data challenge?

At the conclusion of his 2014 Significance lecture, Tim Harford was asked for his view on what statisticians need to do to help users of data avoid falling into the big data traps.

“One of the things we have to do is demonstrate examples where mistakes have been made, and explain how, with the appropriate statistical tools, preparation, wisdom and insight, those mistakes would not have been made,” he said.

Proving the value of statistics would also come from interdisciplinary working; from statisticians “teaming up with computer scientists, astronomers, the bioinformatics people – anybody else who is working with these large data sets – and showing them that statistics has a tremendous amount to offer”.

He concluded: “Statistics has never been cooler; it’s never been more useful. It just seems to me to be a wonderful time to be a statistician.”

Brian Tarran

