International Journal of Engineering and Technical Research (IJETR)
ISSN: 2321-0869, Volume-3, Issue-6, June 2015
Three Vs of Big Data
Volume of data: Volume refers to the amount of data. The volume of data stored in enterprise repositories has grown from megabytes and gigabytes to petabytes. Variety of data: the different types and sources of data. Data variety has exploded from the structured and legacy data stored in enterprise repositories to unstructured and semi-structured data such as audio, video and XML. Velocity of data: Velocity refers to the speed of data processing. For time-sensitive processes such as catching fraud, big data must be processed as it streams into the enterprise in order to maximize its value. There is not only more data; it also comes from a wider variety of sources and formats. As described in the report by the President's Council of Advisors on Science and Technology, some data is "born digital", meaning that it is created specifically for digital use by a computer or data processing system. Examples include email, web browsing, or GPS location. Other data is "born analog", meaning that it emanates from the physical world but increasingly can be converted into digital format. Examples of analog data include voice or visual information captured by phones, cameras or video recorders, or physical activity data, such as heart rate or perspiration monitored by wearable devices. With the rising capabilities of data fusion, which brings together disparate sources of data, big data can lead to some remarkable insights.

III. BIG DATA MINING
The term Big Data appeared for the first time in 1998 in a Silicon Graphics (SGI) slide deck by John Mashey with the title "Big Data and the Next Wave of InfraStress" [9]. Big Data mining was relevant from the beginning, as the first book mentioning Big Data is a data mining book that also appeared in 1998, by Weiss and Indurkhya [34]. However, the first academic paper with the words Big Data in the title appeared a bit later, in 2000, in a paper by Diebold [8]. The origin of the term Big Data lies in the fact that we create a huge amount of data every day. Usama Fayyad [11], in his invited talk at the KDD BigMine'12 workshop, presented striking numbers about internet usage, among them the following: each day Google receives more than 1 billion queries, Twitter more than 250 million tweets, Facebook more than 800 million updates, and YouTube more than 4 billion views. The data produced nowadays is estimated to be in the order of zettabytes, and it is growing by around 40% every year. A new large source of data is being generated by mobile devices, and big companies such as Google, Apple, Facebook, Yahoo and Twitter are starting to look carefully at this data to find useful patterns that improve the user experience. Alex "Sandy" Pentland, in his Human Dynamics Laboratory at MIT, is doing research on finding patterns in mobile data about what users actually do, not what they say they do [28].

IV. BIG DATA SOLUTIONS
Hadoop is a programming framework used to support the processing of large data sets in a distributed computing environment. Hadoop was derived from Google's MapReduce, a software framework in which an application is broken down into various parts. The current Apache Hadoop ecosystem consists of the Hadoop kernel, MapReduce, HDFS and a number of additional components such as Apache Hive, HBase and ZooKeeper. HDFS and MapReduce are explained in the following points.

HDFS Architecture
Hadoop includes a fault-tolerant storage system called the Hadoop Distributed File System, or HDFS. HDFS is able to store huge amounts of information, scale up incrementally, and survive the failure of significant parts of the storage infrastructure without losing data. Hadoop creates clusters of machines and coordinates work among them. Clusters can be built with inexpensive computers; if one fails, Hadoop continues to operate the cluster without losing data or interrupting work, by shifting work to the remaining machines in the cluster. HDFS manages storage on the cluster by breaking incoming files into pieces, called blocks, and storing each of the blocks redundantly across the pool of servers. In the common case, HDFS stores three complete copies of each file by copying each piece to three different servers.

MapReduce Architecture
The processing pillar in the Hadoop ecosystem is the MapReduce framework. The framework allows an operation to be specified over a huge data set, dividing the problem and the data and running the parts in parallel. From an analyst's point of view, this can occur on multiple dimensions. For example, a very large dataset can be reduced into a smaller subset where analytics can be applied. In a traditional data warehousing scenario, this might entail applying an ETL operation on the data to produce something usable by the analyst. In Hadoop, these kinds of operations are written as MapReduce jobs in Java. There are a number of higher-level languages, such as Hive and Pig, that make writing these programs easier. The outputs of these jobs can be written back to HDFS or placed in a traditional data warehouse. There are two functions in MapReduce: map, which takes key/value pairs as input and generates an intermediate set of key/value pairs, and reduce, which merges all the intermediate values associated with the same intermediate key.

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation. Hadoop was inspired by Google's MapReduce programming paradigm [16]. Hadoop is a highly scalable compute and storage platform.
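The HDFS storage scheme described above, breaking an incoming file into blocks and keeping three copies of every block on different servers, can be sketched in a few lines. This is a minimal Python illustration, not HDFS itself: the tiny block size, the DataNode names and the round-robin placement are assumptions for the example (real HDFS uses large blocks and a rack-aware placement policy).

```python
BLOCK_SIZE = 4           # bytes per block, kept tiny for the example
REPLICATION = 3          # HDFS's common-case replica count
DATANODES = ["dn1", "dn2", "dn3", "dn4", "dn5"]  # hypothetical server names

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Break an incoming file into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, datanodes=DATANODES, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes (round-robin).
    A simplification: real HDFS placement is rack-aware."""
    n = len(datanodes)
    return {
        idx: [datanodes[(idx + r) % n] for r in range(replication)]
        for idx in range(len(blocks))
    }

blocks = split_into_blocks(b"0123456789")   # -> [b'0123', b'4567', b'89']
layout = place_replicas(blocks)
# Every block lives on three distinct servers, so the loss of any single
# DataNode never loses data: surviving copies can be re-replicated.
```

Because each block is independent, the failure handling the text describes reduces to re-copying the affected blocks from their surviving replicas to healthy nodes.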
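The two MapReduce functions just described, map (emit intermediate key/value pairs) and reduce (merge all values sharing an intermediate key), can be made concrete with word counting, the canonical MapReduce example. The sketch below is a plain-Python simulation of the map/shuffle/reduce semantics, not Hadoop's Java API:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(document):
    """Map: emit an intermediate (word, 1) pair for every word."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    """Reduce: merge all intermediate values for one intermediate key."""
    return (key, sum(values))

def map_reduce(documents):
    # Map phase: run map_fn over every input record.
    intermediate = [pair for doc in documents for pair in map_fn(doc)]
    # Shuffle phase: group intermediate pairs by key.
    intermediate.sort(key=itemgetter(0))
    grouped = groupby(intermediate, key=itemgetter(0))
    # Reduce phase: one reduce call per distinct intermediate key.
    return dict(reduce_fn(key, (v for _, v in pairs)) for key, pairs in grouped)

counts = map_reduce(["big data is big", "data mining"])
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'mining': 1}
```

In a real Hadoop job the map calls run in parallel on the DataNodes holding the input blocks and the shuffle moves data across the network; here all three phases run sequentially in one process purely to show the data flow.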
On the other hand, Hadoop is also time-consuming and storage-consuming. The storage requirement of Hadoop is extraordinarily high because it can generate a large amount of intermediate data; to reduce the requirement on storage capacity, Hadoop often compresses data before storing it. Hadoop's primary approach is to take a single big workload and map it into smaller workloads; these smaller workloads are then merged to obtain the end result. Hadoop handles this workload by assigning it to a large cluster of inexpensive nodes built with commodity hardware. Hadoop also has a distributed cluster file system that scales to store the massive amounts of data typically required in these workloads. Hadoop has a variety of node types within each Hadoop cluster; these include DataNodes, NameNodes, and EdgeNodes, explained as follows:

a. NameNode: The NameNode is the central location for information about the file system deployed in a Hadoop environment. An environment can have one or two NameNodes, configured to provide minimal redundancy between the NameNodes. The NameNode is contacted by clients of the Hadoop Distributed File System (HDFS) to locate information within the file system and to provide updates for data they have added, moved, manipulated, or deleted.

b. DataNode: DataNodes make up the majority of the servers contained in a Hadoop environment. Common Hadoop environments will have more than one DataNode, and oftentimes they number in the hundreds, based on capacity and performance needs. The DataNode serves two functions: it contains a portion of the data in the HDFS, and it acts as a compute platform for running jobs, some of which will utilize the local data within the HDFS.

c. EdgeNode: The EdgeNode is the access point for the external applications, tools, and users that need to utilize the Hadoop environment. The EdgeNode sits between the Hadoop cluster and the corporate network to provide access control, policy enforcement, logging, and gateway services to the Hadoop environment. A typical Hadoop environment will have a minimum of one EdgeNode, and more based on performance needs.

iii. IBM InfoSphere BigInsights: an Apache Hadoop-based solution to manage and analyze massive volumes of structured and unstructured data. It is built on open source Apache Hadoop with IBM BigSheets and has a variety of performance, reliability, security and administrative features.

V. OPEN SOURCE REVOLUTION
The Big Data phenomenon is intrinsically related to the open source software revolution. Large companies such as Facebook, Yahoo!, Twitter and LinkedIn benefit from and contribute to open source projects. Big Data infrastructure deals with Hadoop and other related software, such as:

Apache Hadoop: software for data-intensive distributed applications, based on the MapReduce programming model and a distributed file system called the Hadoop Distributed File System (HDFS). Hadoop allows writing applications that rapidly process large amounts of data in parallel on large clusters of compute nodes. A MapReduce job divides the input dataset into independent subsets that are processed by map tasks in parallel; this step of mapping is then followed by a step of reduce tasks, which use the output of the maps to obtain the final result of the job. Apache Hadoop-related projects include Apache Pig, Apache Hive, Apache HBase, Apache ZooKeeper, Apache Cassandra, Cascading, Scribe and many others.

Apache S4: platform for processing continuous data streams. S4 is designed specifically for managing data streams; S4 apps are designed by combining streams and processing elements in real time.

Storm: software for streaming data-intensive distributed applications, similar to S4, developed by Nathan Marz at Twitter.

In Big Data mining there are many open source initiatives. The most popular are the following:

Apache Mahout: scalable machine learning and data mining open source software based mainly on Hadoop. It has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining.

R: open source programming language and software environment designed for statistical computing and visualization. R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, beginning in 1993, and is used for statistical analysis of very large data sets.

MOA: stream data mining open source software to perform data mining in real time. It has implementations of classification, regression, clustering, frequent item set mining and frequent graph mining. It started as a project of the Machine Learning group of the University of Waikato, New Zealand, famous for the WEKA software. The streams framework provides an environment for defining and running stream processes using simple XML-based definitions and is able to use MOA, Android and Storm.

SAMOA: a new, upcoming software project for distributed stream mining that will combine S4 and Storm with MOA.

Vowpal Wabbit: open source project started at Yahoo! Research and continuing at Microsoft Research to design a fast, scalable, useful learning algorithm. VW is able to learn from terafeature datasets and can exceed the throughput of any single machine's network interface when doing linear learning, via parallel learning.

More specific to Big Data graph mining, we found the following open source tools:

Pegasus: big graph mining system built on top of MapReduce. It allows finding patterns and anomalies in massive real-world graphs. See the paper by U. Kang and Christos Faloutsos in this issue.

GraphLab: high-level graph-parallel system built without using MapReduce. GraphLab computes over dependent records, which are stored as vertices in a large distributed data-graph. Algorithms in GraphLab are expressed as vertex-programs.
These vertex-programs are executed in parallel on each vertex and can interact with neighbouring vertices [9-26].

VI. CONCLUSION
Big Data is going to continue growing during the next years, and every data scientist will have to manage a much larger amount of data each year. This data is going to be more diverse, larger, and faster. In this paper we discussed some insights into the topic, and what we consider to be the main concerns and main challenges for the future. Big Data is becoming the new final frontier for scientific data research and for business applications. We are at the beginning of a new era in which Big Data mining will help us discover knowledge that no one has discovered before. Everybody is warmly invited to participate in this intrepid journey.

[13] R.D. Schneider, Hadoop for Dummies Special Edition, John Wiley & Sons Canada, 978-1-118-25051-8, 2012
[14] R. Weiss and L.J. Zgorski, "Obama Administration Unveils Big Data Initiative: Announces $200 Million in New R&D Investments", Office of Science and Technology Policy, Executive Office of the President, March 2012
[15] S. Curry, E. Kirda, E. Schwartz, W.H. Stewart and A. Yoran, "Big Data Fuels Intelligence Driven Security", RSA Security Brief, January 2013
[16] S. Madden, "From Databases to Big Data", IEEE Internet Computing, June 2012, v. 16, pp. 4-6
[17] S. Singh and N. Singh, "Big Data Analytics", 2012 International Conference on Communication, Information & Computing Technology, Mumbai, India, IEEE, October 2011
[18] U. Fayyad, G. Piatetsky-Shapiro and P. Smyth, "From Data Mining to Knowledge Discovery in Databases", American Association for Artificial Intelligence, AI Magazine, Fall 1996, pp. 37-54