BIG DATA MINING: A CHALLENGE
AND HOW TO MANAGE IT
Many sources of big data: semi-structured and unstructured (e.g. video data, audio data).
WHY BIG DATA
Big Data keeps growing; Facebook alone generates 10 TB of data daily. Sources include:
Mobile Devices
Microphones
Readers/Scanners
Science facilities
Programs/Software
Social Media
Cameras
BIG DATA ANALYTICS
Appropriate information extracted from Big Data yields competitive advantage.
Example technique: regression [predictive]; a minimal sketch follows.
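As a hedged illustration of predictive regression (not from the slides), here is a minimal Python sketch that fits a linear model to toy data; the dataset, the feature meaning, and the choice of scikit-learn are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: advertising spend (feature) vs. sales (target); values invented.
X = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])  # spend in $k
y = np.array([25.0, 44.0, 66.0, 83.0, 105.0])           # sales in units

model = LinearRegression().fit(X, y)       # learn slope and intercept
print(model.coef_, model.intercept_)       # fitted parameters
print(model.predict(np.array([[60.0]])))   # predict for unseen spend
```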
CLASSIFICATION: DEFINITION
Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class, find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
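To make the definition concrete, here is a minimal Python sketch (not from the slides) that follows the training/test split described above; the toy dataset and the scikit-learn decision tree are illustrative choices, not the slides' method.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Records with attributes (features) and a class label.
X, y = load_iris(return_X_y=True)

# Divide the data into training and test sets, as the slide describes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Find a model for the class attribute as a function of the other attributes.
model = DecisionTreeClassifier().fit(X_train, y_train)

# The test set measures how accurately unseen records are classified.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```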
CLUSTERING
[Figure: clustering data points along income, education, and age attributes]
K-MEANS CLUSTERING
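The slide gives no further detail, so here is a hedged minimal sketch of k-means in Python (scikit-learn and the toy points are assumptions): points are grouped into k clusters by alternately assigning each point to the nearest centroid and recomputing centroids.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points (e.g. income vs. age, scaled); values invented.
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment of each point
print(kmeans.cluster_centers_)  # learned centroids
```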
ASSOCIATION RULE MINING
Given sales transaction records (market-basket data) listing the products bought by customers, find rules that predict the co-occurrence of items in a transaction.
[Figure: example market-basket transactions]
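As an illustration (not from the slides), the sketch below computes support and confidence, the two standard measures behind association rules, for one candidate rule over toy market-basket data; all items are invented.

```python
# Toy market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    """Fraction of transactions containing all items in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# Candidate rule: {diapers} -> {beer}
antecedent, consequent = {"diapers"}, {"beer"}
rule_support = support(antecedent | consequent)
confidence = rule_support / support(antecedent)
print(f"support={rule_support:.2f}, confidence={confidence:.2f}")
```

Here the rule {diapers} -> {beer} holds in 3 of 5 transactions (support 0.60) and in 3 of the 4 transactions that contain diapers (confidence 0.75).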
BIG DATA STANDARDIZATION CHALLENGES (1)
Big Data use cases, definitions, vocabulary, and reference architectures (e.g. system, data, platforms, online/offline)
Specifications and standardization of metadata, including data provenance
Data types (XML, RDF, JSON, multimedia) and Big Data operations (e.g. matrix operations)
Domain-specific languages
General and domain-specific ontologies and taxonomies for describing data
Source: ISO
BIG DATA STANDARDIZATION CHALLENGES (2)
Big Data security and privacy access controls
Remote, distributed, and federated analytics (taking the analytics to the data)
Human consumption of the results of Big Data analysis (e.g. visualization)
Interface between relational (SQL) and non-relational (NoSQL) data stores
Big Data quality and veracity description and management
Source: ISO
TOOLS FOR MANAGING BIG DATA
Hadoop is an open-source framework from Apache that allows users to store and process big data in a distributed environment across clusters of computers using simple programming models.
Hadoop is a large-scale distributed batch-processing infrastructure. While it can be used on a single machine, its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores. Hadoop is also designed to efficiently distribute large amounts of work across a set of machines.
Challenges at Large Scale
Conventional analytics assumes that datasets are loaded into the main memory of a single machine. Petabyte-scale datasets cannot be loaded into RAM; this is where Hadoop, integrated with the R language, is an ideal solution.
R and Hadoop were not natural partners, but with the advent of packages such as RHadoop, RHive, and RHIPE, the two seemingly different technologies complement each other for big data analytics and visualization.
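To make Hadoop's "simple programming models" concrete, here is a hedged sketch of the classic word-count job written for Hadoop Streaming, which lets a mapper and reducer be plain scripts reading stdin; the file names (mapper.py, reducer.py) are illustrative assumptions.

```python
#!/usr/bin/env python3
# mapper.py -- emit (word, 1) for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sum counts; Hadoop delivers keys sorted, so equal words
# arrive on adjacent lines.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")
```

Submitted through the Hadoop Streaming jar, many instances of the mapper and reducer run in parallel across the cluster, one per input split.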
STORM
Storm is a distributed real-time computation system for processing large volumes of high-velocity data. Storm is extremely fast; it can split a task into multiple segments and run each segment on a different machine.
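Storm's own API is Java; purely as an illustration of the kind of per-tuple computation a Storm bolt performs, here is a minimal Python sketch of a rolling word count over a stream (the input stream and window size are invented).

```python
from collections import Counter, deque

WINDOW = 3  # keep counts for the last 3 tuples (illustrative)

window, counts = deque(), Counter()

def process(word):
    """Roughly what a rolling-count bolt does for each incoming tuple."""
    window.append(word)
    counts[word] += 1
    if len(window) > WINDOW:     # evict the oldest tuple from the window
        counts[window.popleft()] -= 1
    print(dict(+counts))         # emit current positive counts downstream

for w in ["storm", "fast", "storm", "data", "storm"]:
    process(w)
```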
MAHOUT
Mahout is a data mining framework that normally runs coupled with the Hadoop infrastructure in the background to manage huge volumes of data.
Apache Mahout is an open source project that is
primarily used for creating scalable machine
learning algorithms. It implements popular
machine learning techniques such as:
Recommendation
Classification
Clustering
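Recommendation is Mahout's classic use case. As a rough illustration of the idea, not Mahout's actual API, here is a user-based collaborative-filtering sketch in Python; the users, items, and ratings are invented.

```python
import math

# Toy user -> {item: rating} data.
ratings = {
    "alice": {"a": 5, "b": 3, "c": 4},
    "bob":   {"a": 4, "b": 3, "c": 5, "d": 4},
    "carol": {"b": 2, "c": 5, "d": 5},
}

def cosine(u, v):
    """Cosine similarity over the items both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    return dot / (math.sqrt(sum(u[i] ** 2 for i in shared)) *
                  math.sqrt(sum(v[i] ** 2 for i in shared)))

def recommend(user):
    """Score unseen items by similarity-weighted ratings of other users."""
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for item, r in their.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(recommend("alice"))  # suggests item "d", which alice has not rated
```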