
Big Data Platforms

Three big data platforms (systems)


General purpose relational database
Analytical database
Hadoop

1. General purpose RDBMS


- Powers first generation DW
Benefits:
- RDBMS already inhouse
- SQL-based
- Trained DBAs

[Diagram: Operational Systems → ETL → Data Warehouse → ETL → Data Marts → BI Server → Reports / Dashboards]

Challenges:
- Cost to deploy and upgrade
- Doesn't support complex analytics
- Scalability and performance

2. Analytical platforms
Aster Data (Teradata)
Calpont
Datallegro (Microsoft)
Exasol
Greenplum (EMC)
IBM SmartAnalytics
Infobright
Kognitio
Netezza (IBM)
Oracle Exadata
Paraccel
Pervasive
Sand Technology
SAP HANA
Sybase IQ (SAP)
Teradata
Vertica (HP)

Purpose-built database management systems, designed explicitly for query processing and analysis, that provide dramatically higher price/performance and availability than general-purpose solutions.

3. Hadoop

Ecosystem of open source projects


Hosted by the Apache Software Foundation
Google developed and shared the underlying concepts
A distributed file system that scales out on commodity servers with direct-attached storage and automatic failover.

Hadoop distilled: What's new?


[Diagram: what's new around BIG DATA — unstructured data, a distributed file system, MapReduce, schema on read, open source ($$), NoSQL, and the data scientist.]

Benefits:
- Comprehensive
- Agile
- Expressive
- Affordable

Drawbacks:
- Immature
- Batch oriented
- Expertise

Hadoop ecosystem

Source: Hortonworks

Hadoop adoption rates


- No plans: 38%
- Considering: 32%
- Experimenting: 20%
- Implementing: 5%
- In production: 4%

Based on 158 respondents, BI Leadership Forum, April 2012

Which platform do you choose?

[Diagram: the three platforms matched to data types — general-purpose RDBMS, analytic database, and Hadoop against structured, semi-structured, and unstructured data.]

The new analytical ecosystem

[Diagram: source systems feed the analytical database (DW) via extract, transform, load ("capture only what's needed"); Hadoop captures data in case it's needed, where it is parsed, aggregated, and explored; analytical tools report on and mine data from both.]

Hadoop
Open source (using the Apache license)
Around 40 core Hadoop committers from ~10 companies
Cloudera, Yahoo!, Facebook, Apple, and more
Hundreds of contributors writing features, fixing bugs
Many related projects, applications, tools, etc.

Hadoop History
Hadoop is based on work done by Google in the early 2000s
Specifically, on the papers describing the Google File System (GFS), published in 2003, and MapReduce, published in 2004
This work took a radically new approach to the problem of distributed computing
It meets all the requirements we have for reliability, scalability, etc.
Core concept: distribute the data as it is initially stored in the system
Individual nodes can work on data local to those nodes
No data transfer over the network is required for initial processing

History Continued..
Doug Cutting and Michael Cafarella created Hadoop in 2005
Yahoo! started work on the Nutch Search Engine Project (2005)
Feb 2006: Hadoop splits out of Nutch and Yahoo! starts using it
2006: Yahoo! donates the Hadoop project to Apache
Dec 2006: Yahoo! builds a 100-node Webmap with Hadoop
Apr 2007: Yahoo! runs Hadoop on a 1,000-node cluster
Dec 2007: Yahoo! builds a 1,000-node Webmap with Hadoop
Jan 2008: Hadoop becomes a top-level Apache project
Sep 2008: Hive added to Hadoop as a contributed project

Google's Solutions / Projects

- Google File System (SOSP 2003)
- MapReduce (OSDI 2004)
- Sawzall (Scientific Programming Journal, 2005)
- BigTable (OSDI 2006)
- Chubby (OSDI 2006)

Open Source World's Solutions

- Google File System → Hadoop Distributed File System (HDFS)
- MapReduce → Hadoop MapReduce
- Sawzall → Pig, Hive, JAQL
- BigTable → Hadoop HBase, Cassandra
- Chubby → ZooKeeper

Current Status of Hadoop


- Largest cluster: 2,000 nodes (8 cores, 4 TB disk per node)
- Used by 40+ companies and universities around the world (Yahoo!, Facebook, etc.)
- Cloud computing donations from Google and IBM
- Startups focused on providing services for Hadoop: Cloudera, Hortonworks

Hadoop Components : In General


MapReduce
HDFS

Projects

Hadoop Projects
Hive
HBase
Mahout
Pig
Oozie
Flume
Sqoop
..

What Does Hadoop Do?

- Hadoop is a set of tools (a framework) that supports running applications on big data
- Write the application for Hadoop, and it takes care of scalability, a major concern with traditional applications
- It breaks the data into pieces, sends them to multiple computers, and later combines the results (a minimal sketch of this idea follows below)
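A minimal, single-machine Python sketch of that split / send / combine idea (illustrative only, not Hadoop code; a process pool stands in for the cluster's worker nodes):

from multiprocessing import Pool

def process_piece(piece):
    # stand-in for the work each node does on its local piece of data
    return sum(len(line.split()) for line in piece)

def split_into_pieces(lines, n_pieces):
    # break the data into roughly equal pieces, one per worker
    size = max(1, len(lines) // n_pieces)
    return [lines[i:i + size] for i in range(0, len(lines), size)]

if __name__ == "__main__":
    data = ["some input line"] * 1000                      # toy "big data"
    pieces = split_into_pieces(data, n_pieces=4)
    with Pool(processes=4) as pool:
        partial_results = pool.map(process_piece, pieces)  # send pieces to the workers
    print(sum(partial_results))                            # combine the results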

Hadoop Architecture
[Diagram: applications interact with the Master, which runs the Job Tracker and the Name Node; each of the Slaves runs a Task Tracker and a Data Node.]

Continued..
- Applications interact with the Master (batch processing)
- Job Tracker (Master): breaks the big task into small pieces and sends them to the Task Trackers on the slaves
- Name Node (Master): maintains the index of which segment of data resides on which Data Node
- Task Tracker: processes the small piece of the task it is given
- Data Node: manages the pieces of data assigned to it

The Five Hadoop Components


Hadoop comprises five separate daemons:
- NameNode: holds the metadata for HDFS
- Secondary NameNode: performs housekeeping functions for the NameNode; it is not a backup or standby for the NameNode
- DataNode: stores actual HDFS data blocks
- JobTracker: manages MapReduce jobs and distributes individual tasks
- TaskTracker: responsible for instantiating and monitoring individual Map and Reduce tasks

Fault Tolerance
- Built-in fault tolerance (by default, three copies of each file's blocks on the slaves)
- The Master can be a single point of failure
- Enterprise Hadoop deployments therefore run two copies of the Master as well (a main Master and a backup Master)
- If a node fails, the Master detects the failure and reassigns its work to a different node in the system

Continued..
Restarting a task does not require communication with nodes
working on other portions of the data
If a failed node restarts, it is automatically added back to the
system and assigned new tasks
If a node appears to be running slowly, the master can redundantly
execute another instance of the same task

For Programmers
Programmers need not think about:
- Where the file is located
- How to manage failures
- How to break computations into pieces
- How to program for scalability

They should focus on writing scale-free programs (thousands of slaves can be added).

Core Hadoop Concepts


Applications are written in high-level code
Developers do not worry about network programming, temporal dependencies, etc.
Nodes talk to each other as little as possible
Developers should not write code which communicates between nodes
Shared nothing architecture
Data is spread among machines in advance
Computation happens where the data is stored, wherever possible
Data is replicated multiple times

Hadoop: Very High-Level Overview


When data is loaded into the system, it is split into blocks
Typically 64MB or 128MB
Map tasks (the first part of the MapReduce system) work on
relatively small portions of data
Typically a single block
A master program allocates work to nodes such that a Map task will work on a block of data stored locally on that node (a toy sketch of this data-local assignment follows below)
Many nodes work in parallel, each on their own part of the overall dataset
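A toy illustration of that data-local assignment, in Python (a hypothetical helper, not Hadoop's actual JobTracker logic):

def assign_map_tasks(block_locations, free_nodes):
    """block_locations: {block_id: [nodes holding a replica]};
       free_nodes: set of nodes with spare map slots."""
    assignments = {}
    for block_id, replica_nodes in block_locations.items():
        local = [n for n in replica_nodes if n in free_nodes]
        # prefer a node that already stores the block; otherwise pick any free node
        assignments[block_id] = local[0] if local else next(iter(free_nodes))
    return assignments

print(assign_map_tasks({"blk_1": ["node2", "node5"], "blk_2": ["node7"]},
                       {"node2", "node3", "node7"}))
# {'blk_1': 'node2', 'blk_2': 'node7'}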

Hadoop Cluster
A set of machines running HDFS and MapReduce is known as a
Hadoop Cluster
Individual machines are known as nodes
A cluster can have as few as one node or as many as several thousand
Single-node vs. multi-node clusters
More nodes = better performance!

Hadoop Components: HDFS


HDFS, the Hadoop Distributed File System, is responsible for storing
data on the cluster
Data files are split into blocks and distributed across multiple nodes
in the cluster
Each block is replicated multiple times
Default is to replicate each block three times
Replicas are stored on different nodes
This ensures both reliability and availability
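A back-of-the-envelope sketch of what those defaults imply for one file (the function name is illustrative; assumes a 128 MB block size and three replicas, as described above):

import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    n_blocks = math.ceil(file_size_mb / block_size_mb)   # blocks the file is split into
    raw_storage_mb = file_size_mb * replication          # data stored across the cluster
    return n_blocks, raw_storage_mb

print(hdfs_footprint(1000))   # a 1 GB file -> 8 blocks, ~3000 MB of raw storage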

Hadoop Components: MapReduce


MapReduce is the system used to process data in the Hadoop
cluster
Consists of mainly two phases: Map, and then Reduce
Each Map task operates on a discrete portion of the overall dataset
Typically one HDFS data block
After all Maps are complete, the MapReduce system distributes the
intermediate data to nodes which perform the Reduce phase

HDFS Basic Concepts


HDFS is a file system written in Java
Based on Google's GFS
Sits on top of a native file system, such as ext3 or XFS
Provides redundant storage for massive amounts of data

Where Does Data Come From?


Science
Medical imaging, sensor data, genome sequencing, weather data, satellite
feeds, etc.
Industry
Financial, pharmaceutical, manufacturing, insurance, online, energy, retail
data
Legacy
Sales data, customer behavior, product databases, accounting data, etc.
System Data
Log files, health & status feeds, activity streams, network messages, Web
Analytics, intrusion detection, spam filters

Analyzing Data: The Challenges

Huge volumes of data


Mixed sources result in many different formats
XML
CSV
EDI
LOG
Objects
SQL
Text
JSON
Binary
Etc.

What is Common Across Hadoop-able Problems?
Nature of the data
Complex data
Multiple data sources
Nature of the analysis
Batch processing
Parallel execution
Spread data over a cluster of servers and take the computation to
the data

Benefits of Analyzing With Hadoop


Previously impossible/impractical to do this analysis
Analysis conducted at lower cost
Analysis conducted in less time
Greater flexibility
Linear scalability

What Analysis is Possible With Hadoop?


Text mining
Index building
Graph creation and analysis
Pattern recognition
Collaborative filtering
Prediction models
Sentiment analysis
Risk assessment

HDFS

Goals of HDFS
Very Large Distributed File System
10K nodes, 100 million files, 10 PB

Convenient Cluster Management


Load balancing
Node failures
Cluster expansion

Optimized for Batch Processing

- Allows computation to move to the data
- Maximizes throughput

HDFS Architecture

HDFS Details
Data Coherency
Write-once-read-many access model
Client can only append to existing files

Files are broken up into blocks


Typically 128 MB block size
Each block replicated on multiple DataNodes

Intelligent Client
Client can find location of blocks
Client accesses data directly from DataNode

Single Node Architecture

[Diagram: a single node with CPU, memory, and disk; machine learning and statistics work with data in memory, while classical data mining works with data on disk.]

Motivation: Google Example


- 20+ billion web pages × 20 KB = 400+ TB
- One computer reads 30-35 MB/sec from disk: ~4 months just to read the web
- ~1,000 hard drives just to store the web
- It takes even more to do something useful with the data!
- Today, a standard architecture for such problems is emerging: a cluster of commodity Linux nodes connected by a commodity (Ethernet) network
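A rough check of the numbers above (the per-disk read rate and the ~400 GB drive size are assumptions):

corpus_bytes = 20e9 * 20e3        # 20+ billion pages x 20 KB ~= 400 TB
read_rate = 35e6                  # assumed: ~35 MB/s sustained from one disk
print(corpus_bytes / read_rate / 86400)   # ~132 days, i.e. roughly 4 months to read it all
print(corpus_bytes / 400e9)               # ~1,000 drives, assuming ~400 GB per drive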


MapReduce
Challenges:
- How to distribute computation?
- Distributed/parallel programming is hard

MapReduce addresses all of the above
- Google's computational/data manipulation model
- An elegant way to work with big data

What is MapReduce?

- A simple data-parallel programming model and framework, designed for scalability and fault tolerance
- Pioneered by Google, which processes around 20 petabytes of data per day with it
- Popularized by the open-source Hadoop project
- Used at Yahoo!, Facebook, Amazon

MapReduce design goals

- Scalability to large data volumes: 1,000s of machines, 10,000s of disks
- Cost efficiency: commodity machines (cheap, but unreliable), commodity network
- Automatic fault tolerance
- Easy to use

Cluster Architecture
- 2-10 Gbps backbone between racks
- 1 Gbps between any pair of nodes in a rack
- Each rack contains 16-64 nodes (commodity CPU, memory, and disk, connected by rack switches)

In 2011 it was guesstimated that Google had 1M machines, http://bit.ly/Shh0RO

Source: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Large-scale Computing

Large-scale computing for data mining problems on commodity hardware

Challenges:
- How do you distribute computation?
- How can we make it easy to write distributed programs?
- Machines fail:
  - One server may stay up 3 years (1,000 days)
  - If you have 1,000 servers, expect to lose one per day
  - People estimated Google had ~1M machines in 2011: 1,000 machines fail every day!

Idea and Solution


Issue: copying data over a network takes time
Idea:
- Bring computation close to the data
- Store files multiple times for reliability

MapReduce addresses these problems
- Storage infrastructure: a distributed file system (Google: GFS, Hadoop: HDFS)
- Programming model: MapReduce

Storage Infrastructure
Problem:
If nodes fail, how to store data persistently?

Answer:
Distributed File System:
Provides global file namespace
Google GFS; Hadoop HDFS;

Typical usage pattern:
- Huge files (100s of GB to TB)
- Data is rarely updated in place
- Reads and appends are common

Distributed File System


Chunk servers
- The file is split into contiguous chunks
- Typically each chunk is 16-64 MB
- Each chunk is replicated (usually 2x or 3x)
- Replicas are kept in different racks when possible

Master node
- a.k.a. the Name Node in Hadoop's HDFS
- Stores metadata about where files are stored
- Might be replicated

Client library for file access
- Talks to the master to find chunk servers
- Connects directly to chunk servers to access data

Distributed File System


A reliable distributed file system
- Data is kept in chunks spread across machines
- Each chunk is replicated on different machines, giving seamless recovery from disk or machine failure

[Diagram: chunks C0-C5 and D0-D1 replicated across chunk servers 1 through N.]

Bring computation directly to the data!
- Chunk servers also serve as compute servers

Map Reduce Programming Model


Input data type: file of key-value records

Example

More Specifically
Input: a set of key-value pairs
The programmer specifies two methods:

Map(k, v) → <k', v'>*
- Takes a key-value pair and outputs a set of key-value pairs
- E.g., key is the filename, value is a single line in the file
- There is one Map call for every (k, v) pair

Reduce(k', <v'>*) → <k', v''>*
- All values v' with the same key k' are reduced together and processed in v' order
- There is one Reduce function call per unique key k'
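The same two methods written as Python type signatures, as a sketch of the model only (Hadoop's real API is Java; the names here are illustrative):

from typing import Iterable, Iterator, Tuple

def map_fn(k: str, v: str) -> Iterator[Tuple[str, int]]:
    """Called once per input (k, v) pair; emits zero or more intermediate pairs."""
    ...

def reduce_fn(k: str, values: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """Called once per unique intermediate key, with all of that key's values."""
    ...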

MapReduce: The Map Step


[Diagram: input key-value pairs are fed, one pair at a time, through the user's map function, producing intermediate key-value pairs.]

MapReduce: The Reduce Step


[Diagram: intermediate key-value pairs are grouped by key into key-value groups, and each group is fed through the user's reduce function to produce the output key-value pairs.]

MapReduce: Word Counting


MAP (provided by the programmer): read the input and produce a set of key-value pairs. For a big document beginning "The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration...", the map emits (The, 1), (crew, 1), (of, 1), (the, 1), (space, 1), (shuttle, 1), (Endeavor, 1), (recently, 1), ...

GROUP BY KEY: collect all pairs with the same key, e.g. (crew, 1), (crew, 1), (space, 1), (the, 1), (the, 1), (the, 1), (shuttle, 1), (recently, 1), ...

REDUCE (provided by the programmer): collect all values belonging to each key and output, e.g. (crew, 2), (space, 1), (the, 3), (shuttle, 1), (recently, 1), ...

The data is read sequentially; there are only sequential reads.

Word Count Using MapReduce


map(key, value):
  // key: document name; value: text of the document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
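The same logic as a runnable, single-machine Python simulation; the small run_mapreduce driver below is only a stand-in for what the Hadoop framework does (map, group by key, reduce):

from collections import defaultdict

def map_fn(doc_name, text):
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    yield (word, sum(counts))

def run_mapreduce(documents, map_fn, reduce_fn):
    groups = defaultdict(list)
    for doc_name, text in documents:              # map phase
        for key, value in map_fn(doc_name, text):
            groups[key].append(value)             # group by key (the "shuffle")
    results = []
    for key, values in groups.items():            # reduce phase
        results.extend(reduce_fn(key, values))
    return results

print(run_mapreduce([("d1", "the crew of the space shuttle")], map_fn, reduce_fn))
# [('the', 2), ('crew', 1), ('of', 1), ('space', 1), ('shuttle', 1)]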

MapReduce: Overview
Sequentially read a lot of data
Map:
Extract something you care about

Group by key: Sort and Shuffle


Reduce:
Aggregate, summarize, filter or transform

Write the result


The outline stays the same; Map and Reduce change to fit the problem

Problems Suited for Map-Reduce

Example: Host size


- Suppose we have a large web corpus
- Look at the metadata file: lines of the form (URL, size, date, ...)
- For each host, find the total number of bytes, that is, the sum of the page sizes for all URLs from that particular host (a sketch follows below)

Other examples:
- Link analysis and graph processing
- Machine learning algorithms
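A sketch of how the host-size job could be written against the run_mapreduce driver shown earlier for word count (the tab-separated field layout is an assumption):

from urllib.parse import urlparse

def map_fn(_, line):
    url, size, *rest = line.split("\t")      # assumed: tab-separated (URL, size, date, ...)
    yield (urlparse(url).netloc, int(size))  # emit (host, page size)

def reduce_fn(host, sizes):
    yield (host, sum(sizes))                 # total bytes per host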

Example: Language Model


Statistical machine translation:
- Need to count the number of times every 5-word sequence occurs in a large corpus of documents

Very easy with MapReduce (a sketch follows below):
- Map: extract (5-word sequence, count) from each document
- Reduce: combine the counts
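A sketch of the corresponding map and reduce functions, again usable with the earlier run_mapreduce driver (emits a 1 per occurrence and sums in the reducer):

def map_fn(_, text):
    words = text.split()
    for i in range(len(words) - 4):
        yield (" ".join(words[i:i + 5]), 1)   # each 5-word sequence, once per occurrence

def reduce_fn(ngram, counts):
    yield (ngram, sum(counts))                # combine the counts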

Map Reduce Framework

Map-Reduce: Environment

The Map-Reduce environment takes care of:
- Partitioning the input data
- Scheduling the program's execution across a set of machines
- Performing the group-by-key step
- Handling machine failures
- Managing required inter-machine communication

Map-Reduce: A diagram
For a big document:
- MAP: read the input and produce a set of key-value pairs
- Group by key: collect all pairs with the same key (hash merge, shuffle, sort, partition)
- Reduce: collect all values belonging to each key and output

Map-Reduce: In Parallel

All phases are distributed, with many tasks doing the work

Combiners
A combiner is a local aggregation function for repeated keys produced by the same map (a sketch follows below).
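A sketch of a word-count combiner in Python (illustrative; in Hadoop the combiner is often the same class as the reducer, run over each mapper's local output):

from collections import Counter

def combine(map_output):
    """map_output: iterable of (word, 1) pairs produced by one map task."""
    local_counts = Counter()
    for word, count in map_output:
        local_counts[word] += count
    return local_counts.items()       # far fewer pairs to shuffle when words repeat

print(list(combine([("the", 1), ("the", 1), ("crew", 1)])))
# [('the', 2), ('crew', 1)]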

Coordination : Master
- Tracks task status (idle, in progress, completed)
- Pings the nodes to learn their state
- When mapping is complete, this information is sent to the master, which then schedules the reduce tasks

Eight Common Hadoop-able Problems


1. Modeling true risk
2. Customer churn analysis
3. Recommendation engine
4. PoS transaction analysis
5. Analyzing network data to predict failure
6. Threat analysis
7. Search quality
8. Data sandbox

1. Modelling True Risk


Challenge:
How much risk exposure does an organization really have with each customer?
Multiple sources of data and across multiple lines of business
Solution with Hadoop:
Source and aggregate disparate data sources to build a complete data picture
e.g. credit card records, call recordings, chat sessions, emails, banking activity
Structure and analyze
Sentiment analysis, graph creation, pattern recognition
Typical Industry:
Financial Services (banks, insurance companies)

2. Customer Churn Analysis


Challenge:
Why is an organization really losing customers?
Data on these factors comes from different sources
Solution with Hadoop:
Rapidly build behavioral model from disparate data sources
Structure and analyze with Hadoop
Traversing
Graph creation
Pattern recognition
Typical Industry:
Telecommunications, Financial Services

3. Recommendation Engine/Ad Targeting


Challenge:
Using user data to predict which products to recommend
Solution with Hadoop:
Batch processing framework
Allows execution in parallel over large datasets
Collaborative filtering
Collecting taste information from many users
Utilizing information to predict what similar users like
Typical Industry
Ecommerce, Manufacturing, Retail
Advertising

4. Point of Sale Transaction Analysis


Challenge:
Analyzing Point of Sale (PoS) data to target promotions and manage operations
Sources are complex and data volumes grow across chains of stores and other sources
Solution with Hadoop:
Batch processing framework
Allows execution in parallel over large datasets
Pattern recognition
Optimizing over multiple data sources
Utilizing information to predict demand
Typical Industry:
Retail

5. Analyzing Network Data to Predict Failure


Challenge:
Analyzing real-time data series from a network of sensors
Calculating average frequency over time is extremely tedious because of the need to
analyze terabytes
Solution with Hadoop:
Take the computation to the data
Expand from simple scans to more complex data mining
Better understand how the network reacts to fluctuations
Discrete anomalies may, in fact, be interconnected
Identify leading indicators of component failure
Typical Industry:
Utilities, Telecommunications, Data Centers

6. Threat Analysis/Trade Surveillance


Challenge:
Detecting threats in the form of fraudulent activity or attacks
Large data volumes involved
Like looking for a needle in a haystack
Solution with Hadoop:
Parallel processing over huge datasets
Pattern recognition to identify anomalies,
i.e., threats
Typical Industry:
Security, Financial Services,
General: spam fighting, click fraud

7. Search Quality
Challenge:
Providing real time meaningful search results
Solution with Hadoop:
Analyzing search attempts in conjunction with structured data
Pattern recognition
Browsing pattern of users performing searches in different categories
Typical Industry:
Web, Ecommerce

8. Data Sandbox
Challenge:
Data Deluge
Don't know what to do with the data or what analysis to run
Solution with Hadoop:
Dump all this data into an HDFS cluster
Use Hadoop to start trying out different analyses on the data
Discover patterns to derive value from the data
Typical Industry:
Common across all industries
