
Big Data and Science: Myths and Reality

Muhammad Umar
Department of Computer Science & Information Technology, University of Sargodha
University Road, Sargodha, Punjab, Pakistan
Email:

Abstract

As Big Data draws unrelenting attention from every part of society, it has also suffered from many characterizations that are simply incorrect. This article explores common myths about Big Data and exposes the underlying truths. Big Data affects nearly every aspect of our modern society, including business, government, health care, and research in almost every discipline: life sciences, engineering, natural sciences, arts, and humanities. As it has drawn much attention and become economically important, many people have preferred angles on the interpretation of Big Data. At the same time, since many have been exposed to the term with little prior knowledge of computing or technology, they are easily swayed by the "experts". The term Big Data is widely used in ways that are inappropriate but self-serving. In many cases these erroneous interpretations have then been taken up and amplified by others, including even technically sophisticated people. In this article I discuss some of the more common myths.

Introduction

Big Data is data of huge size. The term describes a collection of data that is huge in volume and keeps growing over time: data so large and complex that none of the traditional data management tools can store or process it efficiently. For example, the New York Stock Exchange generates about one terabyte of new trade data per day. Statistics show that more than 500 terabytes of new data are ingested into the databases of the social media site Facebook every day, mainly in the form of photo and video uploads, message exchanges, likes, and comments.

Types of Big Data

1. Structured
2. Unstructured
3. Semi-structured

Structured Data
Any data that can be stored, accessed, and processed in a fixed format is termed structured data.

Unstructured Data
Any data with unknown form or structure is classified as unstructured data. In addition to its sheer size, unstructured data poses multiple challenges when it comes to processing it to derive value.

Semi-structured Data
Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it is not actually defined with, e.g., a table definition as in a relational DBMS.
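To make the three types concrete, here is a small illustrative Python sketch; the records below are my own hypothetical examples, not data from the article.

```python
import json

# Structured: every record has the same fixed fields, as in a relational table.
structured_rows = [
    (1, "Alice", "2024-01-05"),
    (2, "Bob", "2024-01-06"),
]

# Semi-structured: self-describing (e.g. JSON), but fields vary per record
# and no table definition fixes the schema in advance.
semi_structured = [
    json.loads('{"id": 1, "name": "Alice"}'),
    json.loads('{"id": 2, "name": "Bob", "likes": ["photos", "videos"]}'),
]

# Unstructured: no predefined form at all, e.g. the free text of a message.
unstructured = "Hey! Loved the video you uploaded yesterday :-)"

print(semi_structured[1]["likes"])
```

Note that the second JSON record carries a field the first one lacks: the structure is there, but it is not declared anywhere up front.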
1. Big Data Myth: Size is all that matters

The word "Big" highlights the size. It is also the case that measures of size are very easily conveyed. We have all heard statements about how high a stack of phone books would be required to store the data that is easily kept on one disk drive. So it is not surprising that for many lay people, Big Data is all about size. One would think that technical people would know better. Unfortunately, size also lends itself to easy measurement. It is straightforward to count the number of bytes in some data store, and equally easy to plot a sequence of such measurements on a chart showing exponential growth. In fact, such charts have become so common that even many lay people get the concept. What this leads to, among other things, is serious people apologetically saying that they only have a few hundred gigabytes of data and so are not sure that they really have a Big Data problem. This is sad, because it puts off so many people we ought to be able to help.

In spite of the points made above, I believe that better sense would have prevailed in our understanding of Big Data if it were not for the economic imperatives of the IT industry. We have today a huge ecosystem of Big Data systems. These systems are, for the most part, innovative: collectively, they constitute a whole new paradigm of scaling. There are many who have problems that require this scale and are amenable to these new architectures. These facts have led to the creation of a new industry segment and have benefited many, all of which is good. But the tremendous progress made in this space has also sucked the oxygen out of the air for everything else, as it were. Industry wants to talk about volume, for economic reasons. And money speaks.

Several years ago, the Gartner group noticed this undue attention focused on size, and proposed the now famous "3Vs" of Big Data [5]. IBM then pushed for adding a fourth V [6], and this has been accepted by most. So, theoretically, most technical people will tell you that Big Data raises issues of Volume, Velocity, Variety, and Veracity (or at least the first three of these). But then they will immediately go on to discuss how many petabytes there are in some problem. I have discussed above why Volume (or size) gets undue attention. Let me turn now to why I think Variety and Veracity do not get the attention they deserve.

One major reason for this lack of attention is that there is no well-accepted measure for either. If there is no measure, it is hard to track progress. If I have a company and develop an innovative system that can handle a slightly larger volume than the competition, I can show this off with measurements against some benchmark. If I am an academic and develop an algorithm that scales better than the competition, I know exactly how to compare my algorithm against the competition and persuade skeptical reviewers. In contrast, consider Variety. If I have a product that makes handling variety a little easier, what technical claim can I make that doesn't sound like marketing hype? If I write a paper about a data model that is better at handling variety than the current state of the art, I have to think very hard about how I will compare against the competition and establish the goodness of my idea. Progress is hard in things you cannot measure, in both industry and academia. Variety may be the hardest of the 4Vs to address, but it is the one that people are least motivated to speak about.

Veracity suffers from most of the same problems as Variety. Under very simplistic models, we can at least begin to measure some things, establish some probabilities and some distributions, and so forth. But everyone recognizes that these measures are based on unrealistically simple models: for instance, ones that assume independence when we know that is not true. Therefore, such measures are taken with a grain of salt, and Veracity is scarcely easier to address than Variety. To conclude, Volume and Velocity are indeed challenging, but Variety and Veracity are far more challenging. It is time we focused the conversation around Big Data appropriately.
2. Big Data Myth: The central challenge with Big Data is that of devising new computing algorithms and architectures

If we consciously think about Big Data in terms of the 4Vs, we immediately face the question of determining what the thresholds are for calling something "Big". For Variety and Veracity we know this is not even an answerable question, because we do not have measures in the first place. So let us just consider Volume and Velocity. The threshold, for some people, is at the limit of what we know how to handle. Obviously, this is a moving target, but it has the advantage of being inspirational. The fatal (in my opinion) drawback is that it limits the size of the market to one: there is only one largest deployment in the world at any time (barring ties). Increasing the size of this deployment is definitely a worthwhile challenge, but not one that an entire industry can be built around or an entire academic field developed upon.

The threshold, in some definitions, then becomes fixed, based on the dominant architecture at some point in time, say 2010. So a data set qualifies, in terms of Volume, as Big Data if it is larger than can be handled using the "standard" architectures in use at the beginning of the Big Data era. With the ever-growing popularity of MapReduce-style computation, and the plethora of systems and tools in the "Big Data ecosystem", we then have a definition that is specific, even if it is both circular and self-serving: a Big Data problem is one that is best addressed using elements drawn from the Big Data "toolbox". This definition is specific because there is general agreement about what tools are in the Big Data toolbox: most tool producers self-categorize themselves appropriately. The definition is circular because it does not really define what goes into the toolbox; if we did not have an explicit listing, we would be defining a Big Data tool as a software system that addresses at least some aspects of a Big Data problem, or some similar statement. The definition is self-serving because it anoints a set of tools and a style of system architecture as "the solution" to the Big Data problem. And this definition is wrong because almost everything in the Big Data toolbox is focused on Volume (frequently in conjunction with Velocity), with very little consideration given to Variety and Veracity challenges. I believe that the cloud, and what is today considered the "Big Data ecosystem", has its place in the constellation of relevant technologies, but it is neither a complete solution in itself nor a required piece of every solution.

My own threshold for Big Data is more (along any of the four axes) than we know how to handle in context. The scientist (or manager) faces a Big Data problem when she has too much data to be able to process using the spreadsheet program she knows. The solution in this case may be as simple as moving to a database. But even such an apparently simple transition can have many hidden issues: the spreadsheet's current design may not be suitable for a relational table (for example, a new column may be added every month), there may be interdependencies with other components of some complex workflow, and so on. Identifying and eliminating such barriers is legitimate Big Data work. See, for example, the National Academies report on "Frontiers in Massive Data Analysis" [4].
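As an illustration of one such hidden issue (a hypothetical sketch of my own, not from the article): a spreadsheet that grows a new column every month does not map onto a fixed relational schema, but it can be unpivoted into a long table with a fixed (city, month, value) shape.

```python
# Hypothetical spreadsheet rows: a new month column appears every month,
# so the set of columns keeps changing and does not fit a fixed schema.
spreadsheet = [
    {"city": "Lahore", "2024-01": 120, "2024-02": 135},
    {"city": "Sargodha", "2024-01": 80, "2024-02": 95, "2024-03": 101},
]

def unpivot(rows, key="city"):
    """Reshape wide spreadsheet rows into fixed-schema (key, month, value) tuples."""
    long_rows = []
    for row in rows:
        for col, value in row.items():
            if col != key:
                long_rows.append((row[key], col, value))
    return long_rows

table = unpivot(spreadsheet)
print(len(table))  # 5 rows: each (city, month) pair becomes one tuple
```

After unpivoting, new months simply become new rows, which a relational table handles naturally.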
It is also worth noting that we can buy bigger systems, more machines, faster CPUs, and larger disks. But human ability does not scale! Moreover, the sizes that become challenging for humans are often very small for computers. For example, consider a graph with just 40 nodes and 200 edges. Try plotting it on screen with your favorite graph-drawing program and then look for patterns: even such a small graph is likely to be at the limit of what we can manage with the technology of today. Big Data poses huge challenges for human interaction, and many of the most interesting problems in the Big Data space deal with facilitating this human interaction.
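The 40-node example is easy to reproduce; the sketch below (my own, using a simple uniform random graph rather than any particular tool) generates such a graph with the standard library, and the comment notes why a drawing of it overwhelms a human viewer while remaining trivial for a machine.

```python
import random

def random_graph(n_nodes, n_edges, seed=0):
    """Sample a simple undirected graph with the given node and edge counts."""
    rng = random.Random(seed)
    possible = [(u, v) for u in range(n_nodes) for v in range(u + 1, n_nodes)]
    return rng.sample(possible, n_edges)  # distinct edges, no self-loops

edges = random_graph(40, 200)
# Average degree is 2 * 200 / 40 = 10: a node-link drawing of this graph
# (e.g. with a graph-drawing library) is already a hairball to a human eye,
# yet the structure is tiny by computational standards.
print(len(edges))
```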
3. Big Data Myth: Analytics is the central problem with Big Data

It is fully understandable that many lay people picture a Big Data system as a magic piece of software that takes Big Data as input and produces deep insights as output. Unfortunately, this misperception suits many companies, and even some academics, very well. This way, someone who builds a Big Data system (in the sense described above) can create the illusion of solving the whole problem from soup to nuts, even if they are focused on just one piece of it. The same goes for someone who develops a novel analysis algorithm. But Big Data is most definitely not machine learning on MapReduce. A group of leading researchers from across the United States wrote a white paper to address this misperception; see [1]. A shorter version, making the same main points, appeared in CACM, July 2014 [2]. Fig. 1 is reproduced from this white paper.

The main point it makes is that there are many steps in the Big Data analysis pipeline, with crucial decisions required at each step and many challenges to address in each. The first decision is what data to record or acquire, and how to make the best of data that is imperfect. Then decisions must be made about how to represent the data in a manner suitable for analysis, possibly after extraction, cleaning, and integration with other data sources. Even in the analysis phase, which has received much attention, there are poorly understood complexities in the context of multi-tenanted clusters where several users' programs run concurrently. The final interpretation step is perhaps the most crucial, because it cannot be delegated: someone is responsible for making decisions based on the result of the data analysis, and this person has to understand and trust the results first. Gaining this confidence will often require provenance and explanation, may need visualization, and may even need sensitivity analyses of various types. All of these have to be planned for and performed effectively for the Big Data analysis to produce any real value.

Phases of the Big Data life cycle

[Fig. 1] The Big Data analysis pipeline. The major steps in the analysis of Big Data are shown in the top half of the figure; note the possible feedback loops at all stages. The bottom half of the figure shows the Big Data characteristics that make these steps challenging.
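The pipeline stages described above (acquisition, extraction and cleaning, analysis, interpretation) can be caricatured as a chain of small functions. This is an illustrative toy with hypothetical data, not the pipeline of the white paper; its point is only that insight emerges from the whole chain, not from the analysis step alone.

```python
def acquire():
    # Decide what to record; raw data is often imperfect (note the bad row).
    return ["temp=21.5", "temp=bad", "temp=23.0"]

def extract_and_clean(raw):
    # Extraction and cleaning: parse what we can, drop what we cannot.
    values = []
    for record in raw:
        try:
            values.append(float(record.split("=")[1]))
        except ValueError:
            pass  # a real pipeline would record provenance for dropped rows
    return values

def analyze(values):
    return sum(values) / len(values)

def interpret(result):
    # Interpretation cannot be delegated: a person must be able to trust this.
    return f"mean temperature: {result:.2f}"

report = interpret(analyze(extract_and_clean(acquire())))
print(report)  # mean temperature: 22.25
```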
4. Big Data Myth: Data reuse is low-hanging fruit

We have to collect data for some purpose. It should be possible to use it for a different purpose as well, thereby eliminating the substantial cost of collecting the data a second time. In fact, reuse may be unavoidable in many cases: if the second analysis is performed at a later time, there is no possibility of going back in time to collect historical data again. While this is a compelling opportunity, exploiting it requires addressing multiple challenges.

First, the original data set has to be found at the time of the desired reuse. It is relatively easy to tag data sets (or even make use of existing labels in the data set, such as attribute and table names) to find data sets that are in the topic area of interest. But, in a large universe of data sets, there could be hundreds of data sets that are somehow related to the topic of interest, with only very few that actually have data on the relationship of interest measured under the conditions of interest. We are only now beginning to think about how to characterize data sets to make them findable.

Second, data sets must be understood and interpreted for them to be reusable. Obviously, this requires adequate metadata. Unfortunately, the word "adequate" in the preceding sentence is often ignored. If we know the creator and date, and the schema declaration, that is insufficient metadata in most cases. It is quite likely that it matters precisely under what conditions the data were obtained, using what instruments, after what kind of sample preparation. There is active work on metadata standards in many communities, and adhering to these standards will definitely move us forward substantially. However, we also need to address the issue of incentives, at least in the scientific community: why will a scientist spend time recording careful metadata? Why not just do the bare minimum required by the publication venue or funding agency? Furthermore, there remains sufficient diversity even within any one academic sub-discipline that many of these metadata standards do not require details that may be crucial in some specific case, even if they are not generally applicable. Efforts to establish a culture of data citation are crucial to addressing these problems.

Third, data sets that are found are often not quite in the right form for the desired use. Sometimes this is simply a question of performing a mapping, but often more substantial mismatches have to be resolved. One problem that I am currently addressing has to do with administrative data, which tend to be reported rolled up by administrative jurisdiction. When such data are reused, they need to be compared to (or joined with) data rolled up according to a different administrative hierarchy. If the two hierarchies differ, such matching is not immediately possible. For example, it is not straightforward to compare data reported by school district with data reported by county. Our approach to this problem is to develop innovative interpolation methods.

Data reuse is critical to address and holds out great promise. But it also poses many challenging questions, which are only now being given the required attention.
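As a toy illustration of the hierarchy-mismatch problem: the districts, counties, and weights below are hypothetical, and this simple proportional allocation is only a stand-in for the kind of reconciliation involved, not the interpolation methods mentioned above.

```python
# Hypothetical district-level counts to be compared with county-level data.
district_counts = {"district_A": 100, "district_B": 60}

# overlap_weights[d][c]: fraction of district d's population living in county c.
overlap_weights = {
    "district_A": {"county_X": 0.7, "county_Y": 0.3},
    "district_B": {"county_X": 0.2, "county_Y": 0.8},
}

def to_counties(counts, weights):
    """Proportionally reallocate district totals onto the county hierarchy."""
    county_counts = {}
    for district, total in counts.items():
        for county, fraction in weights[district].items():
            county_counts[county] = county_counts.get(county, 0.0) + total * fraction
    return county_counts

result = to_counties(district_counts, overlap_weights)
print(result)  # county_X receives about 70 + 12, county_Y about 30 + 48
```

Note that the allocation preserves the overall total; the hard part in practice is obtaining trustworthy overlap weights in the first place.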
5. Big Data Myth: Data Science is the same as Big Data

The ability to collect and analyze massive amounts of data is revolutionizing the way scientific research is conducted [3]. The Sloan Digital Sky Survey [9] has transformed astronomy from a field where taking pictures of the sky was a large part of an astronomer's job to one where the focus is on discovering interesting objects and phenomena from the databases.
In the biological sciences, there is now a well-established tradition of depositing scientific data into public repositories, and of creating public databases for use by other scientists. The size and the number of experimental data sets in many applications are increasing exponentially. Consider, for example, the advent of Next-Generation Sequencing (NGS) [7]. The output of current NGS methods is growing faster than CPU performance is increasing on the SPECint benchmark, which represents the growth in computational power due to Moore's law. Both the volume and the velocity of these data require new approaches to data management and analysis. For example, the raw image data sets in NGS are so large that it is impractical today to even consider storing them; rather, the images are analyzed on the fly to produce the sequence data.

Many people use the two terms "Data Science" and "Big Data" interchangeably, applying both to all of the examples listed above. This is not completely inappropriate: the primary difference between the two terms is their perspective. "Big Data" begins with the data characteristics (and works up from there), whereas "Data Science" begins with data use (and works down from there). However, their formal definitions differ in more than just perspective. The National Consortium for Data Science, an industry and academic partnership established at UNC Chapel Hill in 2013, defines data science as "the systematic study of digital data using scientific techniques of observation, theory development, systematic analysis, hypothesis testing, and rigorous validation." A key purpose of data science is [8] to use data to describe, explain, and predict natural and social phenomena by creating knowledge about the properties of large and dynamic data sets; developing methods to share, manage, and analyze digital data; and optimizing data processes for factors such as accuracy, latency, and cost.

Comparing this definition of Data Science with the Gartner definition of Big Data we saw previously, we immediately notice that it is possible to do Data Science without doing Big Data, and vice versa. Of course, nothing stops Data Science from involving Big Data, and it frequently does. However, restricting our attention to the intersection of the two is needlessly limiting. Another point to note is that Data Science tasks usually involve data analysis by a domain expert with limited database expertise. If the domain expert is to succeed, the data must be usable. Unfortunately, database systems are very hard to use. There is even an urban legend about some vendors intentionally keeping them hard to use because they make so much money from consulting and support fees. In addition to the systems themselves, there are also the analysis tasks: often, we have statistically unsophisticated users making unsupported assumptions about the data at hand, e.g. regarding independence or randomness, or regarding how representative a data set is. If we do not help people make intelligent use of their data, they will get burned, and they will become opponents of all the good that our technology can bring. Database and data analytics usability research is crucial.
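A tiny hypothetical illustration of the independence trap mentioned above: when two events are correlated, multiplying their marginal probabilities, as an unwary analyst might, badly misestimates their joint probability.

```python
# Two perfectly correlated binary events: B always equals A (hypothetical data).
samples = [(a, a) for a in [0, 1, 0, 1, 1, 0, 1, 1]]

p_a = sum(a for a, _ in samples) / len(samples)                  # P(A=1)
p_b = sum(b for _, b in samples) / len(samples)                  # P(B=1)
p_joint = sum(1 for a, b in samples if a and b) / len(samples)   # P(A=1, B=1)

# What an analyst assuming independence would compute:
assumed_independent = p_a * p_b

print(p_joint, assumed_independent)  # 0.625 versus 0.390625
```

The true joint probability is 0.625, while the independence assumption yields 0.390625: a large error from a single unsupported assumption.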
6. Big Data Myth: Big Data is all hype

Data analysis has been around for quite a while. Databases too. So what has changed? Why is now the time to get excited about Big Data? Is this merely some hype cooked up by breathless journalists? Given the tremendous attention being paid to Big Data, this is a fair question to ask. But we see that data collection is cheap today, due to ubiquitous digitization, business process automation, the web, and sensor networks, in a way that it never was before. Data storage is cheap too, due to falling media prices. In consequence, nearly every field of endeavor is transitioning from "data poor" to "data rich". So it is not surprising that everywhere around us we have people asking about the potential of Big Data. At the same time, we have a growing social understanding of the consequences of Big Data. We are only beginning to scratch the surface in our characterization of data privacy, and our appreciation of the ethics of data analysis is also in its infancy. Mistakes and overreach in this regard can very quickly lead to a backlash that could close many things down. But barring such mishaps, it is safe to say that Big Data may be hyped, but there is more than enough substance there for it to deserve our attention.

References

[1] Challenges and Opportunities with Big Data. A community white paper, available at http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf
[2] H.V. Jagadish, Johannes Gehrke, Alexandros Labrinidis, Yannis Papakonstantinou, Jignesh M. Patel, Raghu Ramakrishnan, Cyrus Shahabi. Big data and its technical challenges. Commun. ACM, 57 (7) (July 2014), pp. 86-94, 10.1145/2611567
[3] Advancing Discovery in Science and Engineering. Computing Community Consortium (Spring 2011)
[4] Frontiers in Massive Data Analysis. National Academies Press (2013)
[5] Pattern-Based Strategy: Getting Value from Big Data. Gartner Group press release, available at http://www.gartner.com/it/page.jsp?id=1731916 (July 2011)
[6] The 4 V's of Big Data. http://www.ibmbigdatahub.com/tag/587
[7] Scott D. Kahn. On the future of genomic data. Science, 331 (February 2011), pp. 728-729
[8] Establishing a National Consortium for Data Science. Available at http://data2discovery.org/dev/wp-content/uploads/2012/09/NCDS-Consortium-Roadmap_July.pdf (2012)
[9] SDSS-III: Massive Spectroscopic Surveys of the Distant Universe, the Milky Way Galaxy, and Extra-Solar Planetary Systems. Available at http://www.sdss3.org/collaboration/description.pdf (Jan. 2008)
