
2. Embracing Real-World Messiness


In the third chapter of the book, authors Viktor Mayer-Schönberger and Kenneth
Cukier put forward the idea that one of the fundamental shifts in moving from
small data to big data is treating erroneous or corrupted data not as a problem
to be eliminated but as an unavoidable part of real-world data.
While recognizing the historical importance of precision and exactitude, the
authors claim that imprecision, or messiness, can actually be a positive feature
rather than a shortcoming when it comes to Big Data. They argue that accepting
messiness in exchange for a much larger dataset is the core idea behind Big
Data; in this context, they put forward the notion that "more trumps better."
Although errors could often be overcome by throwing enough resources at them,
the authors claim that in many cases it is more fruitful to tolerate error than
to work at preventing it.
They provide an example from the field of natural language processing, where
the accuracy of a grammar-checking algorithm based on machine learning improved
from 75% to 95% when the data fed to it increased from 10 million words to a
billion words.
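The effect the authors describe can be sketched with a toy experiment. This is entirely synthetic and not the study they cite: a deliberately crude "model" that just takes a majority vote over noisy observations of a word's correct form, which gets more reliable as the (still messy) sample grows.

```python
import random
from collections import Counter

def noisy_samples(n, true_form="their", error_rate=0.3, seed=0):
    """Generate n noisy observations of a word's correct form.

    error_rate of the observations are replaced by a wrong variant,
    mimicking messy real-world text.
    """
    rng = random.Random(seed)
    wrong_forms = ["there", "they're"]
    return [true_form if rng.random() > error_rate else rng.choice(wrong_forms)
            for _ in range(n)]

def majority_vote(samples):
    """A deliberately simple 'model': predict the most frequent form seen."""
    return Counter(samples).most_common(1)[0][0]

# With tiny samples the noise can dominate; with lots of messy data,
# even this crude model converges on the correct form.
for n in (5, 50, 5000):
    print(n, majority_vote(noisy_samples(n)))
```

The point mirrors the chapter's claim: no cleaning step was added and the model stayed trivial, yet sheer volume of imperfect data is enough to wash out the errors.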
The chapter also presents an example of how a bigger, messier dataset can be
more effective than a smaller, cleaner one. To support this argument, the
authors compare the language-translation attempts of two different companies.
IBM used ten years' worth of Canadian parliamentary transcripts published in
French and English, a dataset containing about three million sentence pairs.
Google took a different approach to the problem of language translation: it
availed itself of a larger but also much messier dataset, the entire global
Internet and more, amounting to 95 billion English sentences. Despite the
messiness of the input, the writers assert that Google's translation works
best.

3. Correlation > Causation


In the fourth chapter of the book, titled "Correlation," the writers argue that
knowing what, not why, is good enough. Mayer-Schönberger and Cukier give the
example of how a computer might not know why a customer who read Ernest
Hemingway might also like to buy F. Scott Fitzgerald (on Amazon), but that
didn't matter.
Getting insights from Big Data doesn't necessarily require exploring causes;
the authors write that correlations "let us analyze a phenomenon not by
shedding light on its inner workings but by identifying a useful proxy for it."
Making sense of why the data says what it says is not nearly as important as
what it is actually saying. The writers support this argument with an example
of how correlation-based analysis revealed that the vital signs of a premature
baby become unusually constant prior to a serious infection. This surprising
finding comes from software that captures and processes patient data in real
time. While Big Data does not explain the reason behind this correlation, it
allows human caregivers to do what they do best.
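The idea of a correlation serving as a useful proxy can be illustrated with a small sketch. The numbers below are invented for illustration only (they are not the real patient data the authors describe): we compute a Pearson correlation coefficient between a vital-sign variability series and a later infection marker.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Synthetic hourly readings: how much a vital sign fluctuates, and a
# later infection marker -- invented numbers, purely illustrative.
variability = [0.9, 0.8, 0.7, 0.4, 0.2, 0.1]  # signs become eerily steady
infection   = [0.0, 0.1, 0.2, 0.5, 0.8, 0.9]  # infection risk rises

r = pearson(variability, infection)
print(round(r, 3))  # strongly negative: steadiness tracks trouble
```

A strongly negative coefficient says nothing about *why* steadying vital signs precede infection, but, as the chapter argues, the correlation alone is enough to act on: the variability series works as a proxy that lets caregivers intervene early.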
