
Seminar

On
BIG DATA MINING: A CHALLENGE
AND HOW TO MANAGE IT

Submitted To:
Submitted By: Dinesh and Jitender
INTRODUCTION
Big Data is a new term used to describe datasets that, due to
their large size and complexity, cannot be managed with our
current methodologies or data mining software tools.

Big Data mining is the capability of extracting useful
information from these large datasets or streams of data that,
because of their volume, variability, and velocity, could not
be mined before.

The Big Data challenge is becoming one of the most exciting
opportunities for the coming years.

This seminar presents a broad overview of the topic and of the
current status of Big Data mining, and shows the challenges and
tools for managing heterogeneous information in Big Data mining
research.
WHAT IS BIG DATA?
Big Data is similar to small data, but bigger in size.

Handling bigger data requires different approaches:
techniques, tools and architectures,
with the aim of solving new problems, or old problems in a
better way.

Big Data generates value from the storage and processing of
very large quantities of digital information that cannot be
analyzed with traditional computing techniques.
WHAT IS BIG DATA
Walmart handles more than 1 million customer
transactions every hour.
Facebook handles 40 billion photos from its user
base.

THREE CHARACTERISTICS OF BIG DATA: THE 3 VS

Volume (data quantity)
Velocity (data speed)
Variety (data types)
1ST CHARACTERISTIC OF BIG DATA:
VOLUME

A typical PC might have had 10 gigabytes of storage in 2000.

Today, Facebook ingests 500 terabytes of new data every day.

Smartphones, the data they create and consume, and sensors
embedded into everyday objects will soon result in billions of
new, constantly updated data feeds containing environmental,
location, and other information, including video.
DATA VELOCITY (SPEED)

High-frequency stock trading algorithms reflect market changes
within microseconds.

Machine-to-machine processes exchange data between billions of
devices.

Infrastructure and sensors generate massive log data in real
time.

Online gaming systems support millions of concurrent users,
each producing multiple inputs per second.
VARIETY (DATA TYPES:
IMAGES, VIDEO, SOUND)

Big Data isn't just numbers, dates, and strings. Big Data is
also geospatial data, 3D data, audio and video, and
unstructured text, including log files and social media.

Traditional database systems were designed to address smaller
volumes of structured data, fewer updates, and a predictable,
consistent data structure.

Big Data analysis includes all of these different types of
data.
PROCESSING BIG DATA

Integrating disparate data stores
Mapping data to the programming framework
Connecting and extracting data from storage
Transforming data for processing
Subdividing data in preparation for Hadoop MapReduce
Employing Hadoop MapReduce
Creating the components of Hadoop MapReduce jobs (a minimal
driver sketch follows this list)
Distributing data processing across server farms
Executing Hadoop MapReduce jobs
Monitoring the progress of job flows
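
As a hedged sketch of what creating the components of such a
job can look like, the driver below wires Hadoop's stock
TokenCounterMapper and IntSumReducer library classes into a
word-count job; the job name and the command-line input/output
paths are illustrative choices, not part of the seminar.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

// Driver assembling the components of a MapReduce job:
// mapper, combiner, reducer, output types, input/output paths.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenCounterMapper.class); // emits (word, 1)
        job.setCombinerClass(IntSumReducer.class);    // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);     // sums counts per word
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Once submitted, Hadoop subdivides the input, distributes the
map tasks across the server farm, executes the job and reports
its progress, matching the steps listed above.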


THE STRUCTURE OF BIG DATA

Structured: most traditional data sources
Semi-structured: many sources of big data
Unstructured: video data, audio data
WHY BIG DATA
Growth of Big Data is driven by:

Increase of storage capacities
Increase of processing power
Availability of data (different data types)

Every day we create 2.5 quintillion bytes of data; 90% of the
data in the world today has been created in the last two years
alone.
WHY BIG DATA

Facebook generates 10 TB of data daily.

Twitter generates 7 TB of data daily.

IBM claims 90% of today's stored data was generated in just the
last two years.
HOW IS BIG DATA DIFFERENT?

1) Automatically generated by a machine
(e.g. a sensor embedded in an engine)

2) Typically an entirely new source of data
(e.g. use of the internet)

3) Not designed to be friendly
(e.g. text streams)

4) May not have much value
Need to focus on the important part
DATA GENERATION POINTS
EXAMPLES

Mobile Devices

Microphones

Readers/Scanners

Science facilities

Programs/Software

Social Media

Cameras
BIG DATA ANALYTICS

Examining large amounts of data
Extracting appropriate information
Identification of hidden patterns and unknown correlations
Competitive advantage
Better business decisions: strategic and operational
Effective marketing, customer satisfaction, increased revenue
POTENTIAL VALUE OF BIG DATA

$300 billion potential annual value to US health care.

$600 billion potential annual consumer surplus from using
personal location data.

60% potential increase in retailers' operating margins.
BIG DATA IN INDIA
Gaining traction

Huge market opportunities for IT services (82.9% of revenues)
and analytics firms (17.1% of revenues)

Current market size is $200 million; expected to reach $1
billion by 2015

The opportunity for Indian service providers lies in offering
services around Big Data implementation and analytics for
global multinationals
BENEFITS OF BIG DATA
Real-time big data isn't just a process for storing data in a
data warehouse; it's about the ability to make better decisions
and take meaningful actions at the right time.

Fast forward to the present, and technologies like Hadoop give
you the scale and flexibility to store data before you know how
you are going to process it.

Technologies such as MapReduce, Hive and Impala enable you to
run queries without changing the data structures underneath.
BENEFITS OF BIG DATA
Recent research finds that organizations are using big data to
target customer-centric outcomes, tap into internal data and
build a better information ecosystem.

Big Data is already an important part of the $64 billion
database and data analytics market.

It offers commercial opportunities of a comparable scale to
enterprise software in the late 1980s, the Internet boom of the
1990s, and the social media explosion of today.
WHAT IS BIG DATA?
"Big Data are high-volume, high-velocity, and/or high-variety
information assets that require new forms of processing to
enable enhanced decision making, insight discovery and process
optimization." (Gartner, 2012)
Complicated (intelligent) analysis of data may make small data
appear to be big.
Bottom line: any data that exceeds our current capability of
processing can be regarded as big.
WHAT IS DATA MINING?

Discovery of useful, possibly unexpected, patterns in data
Extraction of implicit, previously unknown and potentially
useful information from data
Exploration and analysis, by automatic or semi-automatic means,
of large quantities of data in order to discover meaningful
patterns
DATA MINING TASKS
Classification [Predictive]
Clustering [Descriptive]

Association Rule Discovery [Descriptive]

Sequential Pattern Discovery [Descriptive]

Regression [Predictive]
CLASSIFICATION: DEFINITION
Given a collection of records (the training set),
each record contains a set of attributes, one of which is the
class.
Find a model for the class attribute as a function of the
values of the other attributes.
Goal: previously unseen records should be assigned a class as
accurately as possible.
A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test
sets, with the training set used to build the model and the
test set used to validate it (a minimal sketch of this
procedure follows).
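
To make the train/test procedure concrete, here is a minimal,
self-contained sketch, not from the seminar: synthetic
two-class records are split 70/30, a simple 1-nearest-neighbour
model is built from the training set, and the held-out test set
measures its accuracy.

import java.util.*;

// Train/test split plus a 1-nearest-neighbour classifier,
// evaluated for accuracy on synthetic two-class data.
public class TrainTestDemo {
    record Sample(double x, double y, String label) {}

    // Predict the label of q as that of its nearest training sample.
    static String predict(List<Sample> train, Sample q) {
        Sample best = null;
        double bestDist = Double.MAX_VALUE;
        for (Sample s : train) {
            double d = Math.hypot(s.x() - q.x(), s.y() - q.y());
            if (d < bestDist) { bestDist = d; best = s; }
        }
        return best.label();
    }

    public static void main(String[] args) {
        List<Sample> data = new ArrayList<>();
        Random rnd = new Random(42);
        for (int i = 0; i < 200; i++) {         // synthetic records
            double x = rnd.nextGaussian(), y = rnd.nextGaussian();
            data.add(new Sample(x, y, x + y > 0 ? "A" : "B"));
        }
        Collections.shuffle(data, rnd);
        int cut = (int) (data.size() * 0.7);    // 70% train, 30% test
        List<Sample> train = data.subList(0, cut);
        List<Sample> test  = data.subList(cut, data.size());

        int correct = 0;
        for (Sample s : test)
            if (predict(train, s).equals(s.label())) correct++;
        System.out.printf("Accuracy on held-out test set: %.2f%n",
                          (double) correct / test.size());
    }
}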
CLUSTERING

[Figure: customer records plotted along income, education and
age axes, grouped into clusters]
K-MEANS CLUSTERING

[Figure: k-means clustering illustration]
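
Since the slide shows k-means only as a figure, here is a
minimal sketch of the algorithm itself (Lloyd's iteration) on
synthetic 2D points; the data, k = 3 and the fixed 20
iterations are illustrative choices.

import java.util.*;

// Minimal k-means (Lloyd's algorithm) on 2D points.
public class KMeansDemo {
    public static void main(String[] args) {
        Random rnd = new Random(7);
        double[][] pts = new double[300][2];
        for (int i = 0; i < pts.length; i++) {  // three synthetic blobs
            int blob = i % 3;
            pts[i][0] = blob * 5 + rnd.nextGaussian();
            pts[i][1] = blob * 5 + rnd.nextGaussian();
        }
        int k = 3;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++)             // random initial centers
            centers[c] = pts[rnd.nextInt(pts.length)].clone();

        int[] assign = new int[pts.length];
        for (int iter = 0; iter < 20; iter++) {
            // Assignment step: each point goes to its nearest center.
            for (int i = 0; i < pts.length; i++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dx = pts[i][0] - centers[c][0];
                    double dy = pts[i][1] - centers[c][1];
                    double d = dx * dx + dy * dy;
                    if (d < best) { best = d; assign[i] = c; }
                }
            }
            // Update step: each center moves to the mean of its points.
            double[][] sum = new double[k][2];
            int[] count = new int[k];
            for (int i = 0; i < pts.length; i++) {
                sum[assign[i]][0] += pts[i][0];
                sum[assign[i]][1] += pts[i][1];
                count[assign[i]]++;
            }
            for (int c = 0; c < k; c++)
                if (count[c] > 0) {
                    centers[c][0] = sum[c][0] / count[c];
                    centers[c][1] = sum[c][1] / count[c];
                }
        }
        for (int c = 0; c < k; c++)
            System.out.printf("center %d: (%.2f, %.2f)%n",
                              c, centers[c][0], centers[c][1]);
    }
}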
ASSOCIATION RULE MINING

[Figure: sales records (market-basket data) with transaction
id, customer id and products bought]
Trend: products p5, p8 often bought together
Trend: customer 12 likes product p9
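
A minimal sketch of the counting step behind such trends (not
taken from the seminar): scan the baskets once, count how often
each product pair co-occurs, and keep pairs that meet a minimum
support threshold; the basket data is made up.

import java.util.*;

// Counting co-occurring product pairs in market baskets, the
// first step toward rules like "p5, p8 often bought together".
public class PairCountDemo {
    public static void main(String[] args) {
        List<List<String>> baskets = List.of(
            List.of("p5", "p8", "p1"),
            List.of("p5", "p8"),
            List.of("p5", "p8", "p9"),
            List.of("p2", "p9"));

        Map<String, Integer> pairCounts = new HashMap<>();
        for (List<String> basket : baskets) {
            List<String> items = new ArrayList<>(basket);
            Collections.sort(items);            // canonical pair order
            for (int i = 0; i < items.size(); i++)
                for (int j = i + 1; j < items.size(); j++)
                    pairCounts.merge(items.get(i) + "," + items.get(j),
                                     1, Integer::sum);
        }
        int minSupport = 2;                     // report frequent pairs
        pairCounts.forEach((pair, n) -> {
            if (n >= minSupport)
                System.out.println("{" + pair + "} appears in "
                                   + n + " baskets");
        });
    }
}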
BIG VELOCITY
Sensor tagging of everything of value sends velocity through
the roof
(e.g. car insurance)

Smartphones as a mobile platform send velocity through the roof

The state of multi-player internet games must be recorded,
which sends velocity through the roof
BIG DATA STANDARDIZATION
CHALLENGES (1)
Big Data use cases, definitions, vocabulary and reference
architectures (e.g. system, data, platforms, online/offline)
Specifications and standardization of metadata including data
provenance
Application models (e.g. batch, streaming)
Query languages including non-relational queries to support
diverse data types (XML, RDF, JSON, multimedia) and Big Data
operations (e.g. matrix operations)
Domain-specific languages
Semantics of eventual consistency
Advanced network protocols for efficient data transfer
General and domain-specific ontologies and taxonomies for
describing data semantics including interoperation between
ontologies

Source: ISO
BIG DATA STANDARDIZATION
CHALLENGES (2)
Big Data security and privacy access controls
Remote, distributed, and federated analytics (taking the
analytics to the data) including data and processing resource
discovery and data mining
Data sharing and exchange
Data storage, e.g. memory storage system, distributed file
system, data warehouse, etc.
Human consumption of the results of big data analysis (e.g.
visualization)
Interface between relational (SQL) and non-relational (NoSQL)
data stores
Big Data quality and veracity description and management

Source: ISO
TOOLS FOR MANAGING BIG DATA
Hadoop is an open-source framework from Apache that allows us
to store and process big data in a distributed environment
across clusters of computers using simple programming models.

Hadoop is a large-scale distributed batch processing
infrastructure. While it can be used on a single machine, its
true power lies in its ability to scale to hundreds or
thousands of computers, each with several processor cores.
Hadoop is also designed to efficiently distribute large amounts
of work across a set of machines.

Challenges at Large Scale

Performing large-scale computation is difficult. Working with
this volume of data requires distributing parts of the problem
to multiple machines to handle in parallel. Whenever multiple
machines are used in cooperation with one another, the
probability of failures rises. In a single-machine environment,
failure is not something that program designers explicitly
worry about very often: if the machine has crashed, then there
is no way for the program to recover anyway.
R
The R programming language is the preferred choice amongst data
analysts and data scientists.

There is no doubt that R is the most preferred programming tool
for statisticians, data scientists, data analysts and data
architects, but it falls short when working with large
datasets.

One major drawback of the R programming language is that all
objects are loaded into the main memory of a single machine.
Datasets of petabyte size cannot be loaded into RAM; this is
where Hadoop, integrated with the R language, is an ideal
solution.

To adapt to the in-memory, single-machine limitation of R, data
scientists otherwise have to limit their analysis to a sample
of the large data set.

R and Hadoop were not natural friends, but with the advent of
packages like RHadoop, RHIVE and RHIPE, the two seemingly
different technologies complement each other for big data
analytics and visualization.
STORM
Storm is a distributed real-time computation system for
processing large volumes of high-velocity data.

Storm is extremely fast, with the ability to process over a
million records per second per node on a cluster of modest
size. Enterprises combine it with other data access
applications in Hadoop to prevent undesirable events or to
optimize positive outcomes.

Some specific new business opportunities include: real-time
customer service management, data monetization, operational
dashboards, and cyber security analytics and threat detection.
A minimal topology sketch follows.
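
As a hedged illustration of Storm's programming model, the
sketch below wires a toy spout into a counting bolt with a
fields grouping, so all tuples for one word reach the same
task. The spout, the word list and the timings are made up, and
method signatures vary slightly across Storm versions (this
follows the Storm 2.x Java API).

import java.util.HashMap;
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class WordCountTopology {
    // Toy spout standing in for a real high-velocity feed.
    public static class WordSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final String[] words = {"big", "data", "mining", "storm"};
        private int i = 0;
        public void open(Map<String, Object> conf, TopologyContext ctx,
                         SpoutOutputCollector collector) {
            this.collector = collector;
        }
        public void nextTuple() {
            Utils.sleep(100);                   // throttle the toy feed
            collector.emit(new Values(words[i++ % words.length]));
        }
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("word"));
        }
    }

    // Bolt keeping a running count per word.
    public static class CountBolt extends BaseBasicBolt {
        private final Map<String, Long> counts = new HashMap<>();
        public void execute(Tuple input, BasicOutputCollector collector) {
            String word = input.getStringByField("word");
            counts.merge(word, 1L, Long::sum);
            collector.emit(new Values(word, counts.get(word)));
        }
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("word", "count"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new WordSpout());
        // fieldsGrouping routes all tuples for one word to the same task.
        builder.setBolt("count", new CountBolt(), 4)
               .fieldsGrouping("words", new Fields("word"));
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", new Config(),
                               builder.createTopology());
        Thread.sleep(10_000);                   // run briefly, then stop
        cluster.shutdown();
    }
}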
APACHE MAHOUT
Apache Mahout is a powerful, scalable machine-learning library
that runs on top of Hadoop MapReduce.

We are living in a day and age where information is available
in abundance. The information overload has scaled to such
heights that sometimes it becomes difficult to manage our
little mailboxes! Imagine the volume of data and records some
of the popular websites (the likes of Facebook, Twitter, and
YouTube) have to collect and manage on a daily basis. It is not
uncommon even for lesser-known websites to receive huge amounts
of information in bulk.

Normally we fall back on data mining algorithms to analyze bulk
data to identify trends and draw conclusions. However, no data
mining algorithm can be efficient enough to process very large
datasets and provide outcomes quickly, unless the computational
tasks are run on multiple machines distributed over the cloud.

We now have frameworks that allow us to break a computation
task into multiple segments and run each segment on a different
machine. Mahout is such a data mining framework, normally
coupled with Hadoop infrastructure in the background to manage
huge volumes of data.
Apache Mahout is an open source project that is primarily used
for creating scalable machine learning algorithms. It
implements popular machine learning techniques such as:
Recommendation
Classification
Clustering

Apache Mahout started as a sub-project of Apache's Lucene in
2008. In 2010, Mahout became a top-level project of Apache.
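
For a taste of Mahout in practice, here is a minimal user-based
recommender using Mahout's Taste API; the ratings file name
("ratings.csv", lines of userID,itemID,preference), the
neighbourhood size of 10 and user 1 are illustrative
assumptions, not part of the seminar.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommenderDemo {
    public static void main(String[] args) throws Exception {
        // ratings.csv: userID,itemID,preference (path is illustrative)
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood =
                new NearestNUserNeighborhood(10, similarity, model);
        GenericUserBasedRecommender recommender =
                new GenericUserBasedRecommender(model, neighborhood,
                                                similarity);
        // Top 3 recommendations for user 1.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items)
            System.out.println("item " + item.getItemID()
                               + " score " + item.getValue());
    }
}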
APACHE S4
S4 is a general-purpose, distributed, scalable,
fault-tolerant, pluggable platform that allows
programmers to easily develop applications for
processing continuous unbounded streams of
data.
BIG DATA MINING TOOLS

The Big Data phenomenon is intrinsically related to the open
source software revolution. Large companies such as Facebook,
Yahoo!, Twitter and LinkedIn benefit from and contribute to
open source projects. Big Data infrastructure deals with
Hadoop and other related software, such as:

Apache Hadoop: software for data-intensive distributed
applications, based on the MapReduce programming model and a
distributed file system called the Hadoop Distributed File
System (HDFS). Hadoop allows writing applications that rapidly
process large amounts of data in parallel on large clusters of
compute nodes.
A MapReduce job divides the input dataset into independent
subsets that are processed by map tasks in parallel. This
mapping step is then followed by a step of reduce tasks. These
reduce tasks use the output of the maps to obtain the final
result of the job (a minimal mapper/reducer pair is sketched
below).
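
A minimal word-count mapper/reducer pair in Hadoop's Java API
makes these two steps concrete; a driver like the one sketched
earlier would set these classes on the job. This is the
canonical example, not code from the seminar.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountTasks {
    // Map task: runs in parallel over independent input splits,
    // emitting (word, 1) for each token in its split.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: receives all counts for one word (the map
    // outputs, grouped by key) and sums them into the final result.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }
}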
Apache S4: platform for processing continuous data streams. S4
is designed specifically for managing data streams. S4 apps are
built by combining streams and processing elements in real
time.

Storm: software for streaming data-intensive distributed
applications, similar to S4, developed by Nathan Marz at
Twitter.
In Big Data Mining, there are many open source initiatives. The
most popular are the following:

Apache Mahout: scalable machine learning and data mining open
source software based mainly on Hadoop. It has implementations
of a wide range of machine learning and data mining algorithms:
clustering, classification, collaborative filtering and
frequent pattern mining.

R: open source programming language and software environment
designed for statistical computing and visualization. R was
designed by Ross Ihaka and Robert Gentleman at the University
of Auckland, New Zealand, beginning in 1993, and is used for
statistical analysis of very large data sets.

MOA: Stream data mining open source software to


perform data mining in real time. It has imple-
mentations of classification, regression, clustering
and frequent item set mining and frequent graph
mining. It started as a project of the Machine
Learning group of University of Waikato, New
Zealand, famous for the WEKA software. The
streams framework provides an environment for
defining and running stream processes using
simple XML based definitions and is able to use
MOA, Android and Storm. SAMOA is a new
upcoming software project for distributed stream
mining that will combine S4 and Storm with MOA.
Vowpal Wabbit: open source project started at Yahoo! Research
and continuing at Microsoft Research to design a fast, scalable,
useful learning algorithm. It can exceed the throughput of any
single machine's network interface when doing linear learning,
via parallel learning.
MORE SPECIFIC TO BIG GRAPH MINING, WE FOUND THE
FOLLOWING OPEN SOURCE TOOLS:

Pegasus: big graph mining system built on top of MapReduce. It
allows finding patterns and anomalies in massive real-world
graphs.

GraphLab: high-level graph-parallel system built without using
MapReduce. GraphLab computes over dependent records, which are
stored as vertices in a large distributed data graph.
REFERENCES

[1] Apache Hadoop, http://hadoop.apache.org.
[2] P. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch, and G.
Lapis. IBM Understanding Big Data: Analytics for Enterprise
Class Hadoop and Streaming Data. McGraw-Hill Companies,
Incorporated, 2011.
[3] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4:
Distributed Stream Computing Platform. In ICDM Workshops, pages
170-177, 2010.
[4] Storm, http://storm-project.net.
[5] Apache Mahout, http://mahout.apache.org.
[6] R Core Team. R: A Language and Environment for Statistical
Computing. R Foundation for Statistical Computing, Vienna,
Austria, 2012. ISBN 3-900051-07-0.
[7] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA:
Massive Online Analysis, http://moa.cms.waikato.ac.nz/. Journal
of Machine Learning Research (JMLR), 2010.
[8] D. Laney. 3-D Data Management: Controlling Data Volume,
Velocity and Variety. META Group Research Note, February 6,
2001.
[9] U. Kang, D. H. Chau, and C. Faloutsos. PEGASUS: Mining
Billion-Scale Graphs in the Cloud. 2012.
[10] J. Gantz and D. Reinsel. IDC: The Digital Universe in
2020: Big Data, Bigger Digital Shadows, and Biggest Growth in
the Far East. December 2012.
THANK YOU.
