
Big Data Platforms

Three big data platforms (systems)


General purpose relational database
Analytical database
Hadoop

1. General purpose RDBMS


- Powers first generation DW
Benefits:
- RDBMS already inhouse
- SQL-based
- Trained DBAs

[Diagram: Operational Systems → ETL → Data Warehouse → ETL → Data Marts → BI Server → Reports / Dashboards]

Challenges:
- Cost to deploy and upgrade
- Doesn't support complex analytics
- Scalability and performance

2. Analytical platforms
Aster Data (Teradata)
Calpont
Datallegro (Microsoft)
Exasol
Greenplum (EMC)
IBM SmartAnalytics
Infobright
Kognitio
Netezza (IBM)
Oracle Exadata
Paraccel
Pervasive
Sand Technology
SAP HANA
Sybase IQ (SAP)
Teradata
Vertica (HP)

Purpose-built database management systems, designed explicitly for query processing and analysis, that provide dramatically higher price/performance and availability than general-purpose solutions.

3. Hadoop

Ecosystem of open source projects


Hosted by the Apache Software Foundation
Google developed and shared the underlying concepts
A distributed file system that scales out on commodity servers with direct-attached storage and automatic failover.

Hadoop distilled: What's new?


[Diagram: what's new around BIG DATA — unstructured data, a distributed file system, MapReduce, schema on read, open source ($$), NoSQL, and the data scientist.]

Benefits:
- Comprehensive
- Agile
- Expressive
- Affordable

Drawbacks:
- Immature
- Batch oriented
- Expertise

Hadoop ecosystem

Source: Hortonworks

Hadoop adoption rates


- No plans: 38%
- Considering: 32%
- Experimenting: 20%
- Implementing: 5%
- In production: 4%

Based on 158 respondents, BI Leadership Forum, April 2012

Which platform do you choose?

[Diagram: the three platforms matched to data types — general-purpose RDBMS, analytic database, and Hadoop against structured, semi-structured, and unstructured data.]

The new analytical ecosystem

[Diagram: source systems feed the analytical database (DW) via extract, transform, load ("capture only what's needed"); Hadoop captures data in case it's needed, where it is parsed, aggregated, and explored; analytical tools report on and mine data from both.]

Hadoop
Open source (using the Apache license)
Around 40 core Hadoop committers from ~10 companies
Cloudera, Yahoo!, Facebook, Apple, and more
Hundreds of contributors writing features, fixing bugs
Many related projects, applications, tools, etc.

Hadoop History
Hadoop is based on work done by Google in the early 2000s
Specifically, on the papers describing the Google File System (GFS), published in 2003, and MapReduce, published in 2004
This work took a radically new approach to the problem of distributed computing
It meets all the requirements we have for reliability, scalability, etc.
Core concept: distribute the data as it is initially stored in the system
Individual nodes can work on data local to those nodes
No data transfer over the network is required for initial processing

History Continued..
Doug Cutting and Michael Cafarella created Hadoop in 2005
Yahoo! started work on the Nutch Search Engine Project (2005)
Feb 2006: Hadoop splits out of Nutch and Yahoo! starts using it
2006: Yahoo! donates the Hadoop project to Apache
Dec 2006: Yahoo! builds a 100-node Webmap with Hadoop
Apr 2007: Yahoo! runs Hadoop on a 1,000-node cluster
Dec 2007: Yahoo! builds a 1,000-node Webmap with Hadoop
Jan 2008: Hadoop becomes a top-level Apache project
Sep 2008: Hive added to Hadoop as a contributed project

Google's Solutions / Projects

- Google File System (SOSP 2003)
- MapReduce (OSDI 2004)
- Sawzall (Scientific Programming Journal, 2005)
- BigTable (OSDI 2006)
- Chubby (OSDI 2006)

Open Source World's Solutions

- Google File System → Hadoop Distributed File System (HDFS)
- MapReduce → Hadoop MapReduce
- Sawzall → Pig, Hive, JAQL
- BigTable → Hadoop HBase, Cassandra
- Chubby → ZooKeeper

Current Status of Hadoop


- Largest cluster: 2,000 nodes (8 cores, 4 TB disk per node)
- Used by 40+ companies and universities around the world (Yahoo!, Facebook, etc.)
- Cloud computing donations from Google and IBM
- Startups focused on providing services for Hadoop: Cloudera, Hortonworks

Hadoop Components : In General


MapReduce
HDFS

Projects

Hadoop Projects
Hive
HBase
Mahout
Pig
Oozie
Flume
Sqoop
..

What Does Hadoop Do?

- Hadoop is a set of tools (a framework) that supports running applications on big data
- Write the application for Hadoop, and it takes care of scalability, a major concern with traditional applications
- It breaks the data into pieces, sends them to multiple computers, and later combines the results (a minimal sketch of this idea follows below)
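A minimal, single-machine Python sketch of that split / send / combine idea (illustrative only, not Hadoop code; a process pool stands in for the cluster's worker nodes):

from multiprocessing import Pool

def process_piece(piece):
    # stand-in for the work each node does on its local piece of data
    return sum(len(line.split()) for line in piece)

def split_into_pieces(lines, n_pieces):
    # break the data into roughly equal pieces, one per worker
    size = max(1, len(lines) // n_pieces)
    return [lines[i:i + size] for i in range(0, len(lines), size)]

if __name__ == "__main__":
    data = ["some input line"] * 1000                      # toy "big data"
    pieces = split_into_pieces(data, n_pieces=4)
    with Pool(processes=4) as pool:
        partial_results = pool.map(process_piece, pieces)  # send pieces to the workers
    print(sum(partial_results))                            # combine the results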

Hadoop Architecture
[Diagram: applications interact with the Master, which runs the Job Tracker and the Name Node; each of the Slaves runs a Task Tracker and a Data Node.]

Continued..
- Applications interact with the Master (batch processing)
- Job Tracker (Master): breaks the big task into small pieces and sends them to the Task Trackers on the slaves
- Name Node (Master): maintains the index of which segment of data resides on which Data Node
- Task Tracker: processes the small piece of the task it is given
- Data Node: manages the pieces of data assigned to it

The Five Hadoop Components


Hadoop comprises five separate daemons:
- NameNode: holds the metadata for HDFS
- Secondary NameNode: performs housekeeping functions for the NameNode; it is not a backup or standby for the NameNode
- DataNode: stores actual HDFS data blocks
- JobTracker: manages MapReduce jobs and distributes individual tasks
- TaskTracker: responsible for instantiating and monitoring individual Map and Reduce tasks

Fault Tolerance
- Built-in fault tolerance (by default, three copies of each file's blocks on the slaves)
- The Master can be a single point of failure
- Enterprise Hadoop deployments therefore run two copies of the Master as well (a main Master and a backup Master)
- If a node fails, the Master detects the failure and reassigns its work to a different node in the system

Continued..
Restarting a task does not require communication with nodes
working on other portions of the data
If a failed node restarts, it is automatically added back to the
system and assigned new tasks
If a node appears to be running slowly, the master can redundantly
execute another instance of the same task

For Programmers
Programmers need not think about:
- Where the file is located
- How to manage failures
- How to break computations into pieces
- How to program for scalability

They should focus on writing scale-free programs (thousands of slaves can be added).

Core Hadoop Concepts


Applications are written in high-level code
Developers do not worry about network programming, temporal dependencies, etc.
Nodes talk to each other as little as possible
Developers should not write code which communicates between nodes
Shared nothing architecture
Data is spread among machines in advance
Computation happens where the data is stored, wherever possible
Data is replicated multiple times

Hadoop: Very High-Level Overview


When data is loaded into the system, it is split into blocks
Typically 64MB or 128MB
Map tasks (the first part of the MapReduce system) work on
relatively small portions of data
Typically a single block
A master program allocates work to nodes such that a Map task will work on a block of data stored locally on that node (a toy sketch of this data-local assignment follows below)
Many nodes work in parallel, each on their own part of the overall dataset
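A toy illustration of that data-local assignment, in Python (a hypothetical helper, not Hadoop's actual JobTracker logic):

def assign_map_tasks(block_locations, free_nodes):
    """block_locations: {block_id: [nodes holding a replica]};
       free_nodes: set of nodes with spare map slots."""
    assignments = {}
    for block_id, replica_nodes in block_locations.items():
        local = [n for n in replica_nodes if n in free_nodes]
        # prefer a node that already stores the block; otherwise pick any free node
        assignments[block_id] = local[0] if local else next(iter(free_nodes))
    return assignments

print(assign_map_tasks({"blk_1": ["node2", "node5"], "blk_2": ["node7"]},
                       {"node2", "node3", "node7"}))
# {'blk_1': 'node2', 'blk_2': 'node7'}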

Hadoop Cluster
A set of machines running HDFS and MapReduce is known as a
Hadoop Cluster
Individual machines are known as nodes
A cluster can have as few as one node or as many as several thousand
Single-node vs. multi-node clusters
More nodes = better performance!

Hadoop Components: HDFS


HDFS, the Hadoop Distributed File System, is responsible for storing
data on the cluster
Data files are split into blocks and distributed across multiple nodes
in the cluster
Each block is replicated multiple times
Default is to replicate each block three times
Replicas are stored on different nodes
This ensures both reliability and availability
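A back-of-the-envelope sketch of what those defaults imply for one file (the function name is illustrative; assumes a 128 MB block size and three replicas, as described above):

import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    n_blocks = math.ceil(file_size_mb / block_size_mb)   # blocks the file is split into
    raw_storage_mb = file_size_mb * replication          # data stored across the cluster
    return n_blocks, raw_storage_mb

print(hdfs_footprint(1000))   # a 1 GB file -> 8 blocks, ~3000 MB of raw storage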

Hadoop Components: MapReduce


MapReduce is the system used to process data in the Hadoop
cluster
Consists of mainly two phases: Map, and then Reduce
Each Map task operates on a discrete portion of the overall dataset
Typically one HDFS data block
After all Maps are complete, the MapReduce system distributes the
intermediate data to nodes which perform the Reduce phase

HDFS Basic Concepts


HDFS is a file system written in Java
Based on Google's GFS
Sits on top of a native file system, such as ext3 or XFS
Provides redundant storage for massive amounts of data

Where Does Data Come From?


Science
Medical imaging, sensor data, genome sequencing, weather data, satellite
feeds, etc.
Industry
Financial, pharmaceutical, manufacturing, insurance, online, energy, retail
data
Legacy
Sales data, customer behavior, product databases, accounting data, etc.
System Data
Log files, health & status feeds, activity streams, network messages, Web
Analytics, intrusion detection, spam filters

Analyzing Data: The Challenges

Huge volumes of data


Mixed sources result in many different formats
XML
CSV
EDI
LOG
Objects
SQL
Text
JSON
Binary
Etc.

What is Common Across Hadoop-able Problems?
Nature of the data
Complex data
Multiple data sources
Nature of the analysis
Batch processing
Parallel execution
Spread data over a cluster of servers and take the computation to
the data

Benefits of Analyzing With Hadoop


Previously impossible/impractical to do this analysis
Analysis conducted at lower cost
Analysis conducted in less time
Greater flexibility
Linear scalability

What Analysis is Possible With Hadoop?


Text mining
Index building
Graph creation and analysis
Pattern recognition
Collaborative filtering
Prediction models
Sentiment analysis
Risk assessment

HDFS

Goals of HDFS
Very Large Distributed File System
10K nodes, 100 million files, 10 PB

Convenient Cluster Management


Load balancing
Node failures
Cluster expansion

Optimized for Batch Processing

- Allows computation to move to the data
- Maximizes throughput

HDFS Architecture

HDFS Details
Data Coherency
Write-once-read-many access model
Client can only append to existing files

Files are broken up into blocks


Typically 128 MB block size
Each block replicated on multiple DataNodes

Intelligent Client
Client can find location of blocks
Client accesses data directly from DataNode

Single Node Architecture

[Diagram: a single node with CPU, memory, and disk; machine learning and statistics work with data in memory, while classical data mining works with data on disk.]

Motivation: Google Example


- 20+ billion web pages × 20 KB = 400+ TB
- One computer reads 30-35 MB/sec from disk: ~4 months just to read the web
- ~1,000 hard drives just to store the web
- It takes even more to do something useful with the data!
- Today, a standard architecture for such problems is emerging: a cluster of commodity Linux nodes connected by a commodity (Ethernet) network
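A rough check of the numbers above (the per-disk read rate and the ~400 GB drive size are assumptions):

corpus_bytes = 20e9 * 20e3        # 20+ billion pages x 20 KB ~= 400 TB
read_rate = 35e6                  # assumed: ~35 MB/s sustained from one disk
print(corpus_bytes / read_rate / 86400)   # ~132 days, i.e. roughly 4 months to read it all
print(corpus_bytes / 400e9)               # ~1,000 drives, assuming ~400 GB per drive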


MapReduce
Challenges:
- How to distribute computation?
- Distributed/parallel programming is hard

MapReduce addresses all of the above
- Google's computational/data manipulation model
- An elegant way to work with big data

What is MapReduce?

- A simple data-parallel programming model and framework, designed for scalability and fault tolerance
- Pioneered by Google, which processes around 20 petabytes of data per day with it
- Popularized by the open-source Hadoop project
- Used at Yahoo!, Facebook, Amazon

MapReduce design goals

- Scalability to large data volumes: 1,000s of machines, 10,000s of disks
- Cost efficiency: commodity machines (cheap, but unreliable), commodity network
- Automatic fault tolerance
- Easy to use

Cluster Architecture
- 2-10 Gbps backbone between racks
- 1 Gbps between any pair of nodes in a rack
- Each rack contains 16-64 nodes (commodity CPU, memory, and disk, connected by rack switches)

In 2011 it was guesstimated that Google had 1M machines, http://bit.ly/Shh0RO

Source: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Large-scale Computing

Large-scale computing for data mining problems on commodity hardware

Challenges:
- How do you distribute computation?
- How can we make it easy to write distributed programs?
- Machines fail:
  - One server may stay up 3 years (1,000 days)
  - If you have 1,000 servers, expect to lose one per day
  - People estimated Google had ~1M machines in 2011: 1,000 machines fail every day!

Idea and Solution


Issue: copying data over a network takes time
Idea:
- Bring computation close to the data
- Store files multiple times for reliability

MapReduce addresses these problems
- Storage infrastructure: a distributed file system (Google: GFS, Hadoop: HDFS)
- Programming model: MapReduce

Storage Infrastructure
Problem:
If nodes fail, how to store data persistently?

Answer:
Distributed File System:
Provides global file namespace
Google GFS; Hadoop HDFS;

Typical usage pattern:
- Huge files (100s of GB to TB)
- Data is rarely updated in place
- Reads and appends are common

Distributed File System


Chunk servers
- The file is split into contiguous chunks
- Typically each chunk is 16-64 MB
- Each chunk is replicated (usually 2x or 3x)
- Replicas are kept in different racks when possible

Master node
- a.k.a. the Name Node in Hadoop's HDFS
- Stores metadata about where files are stored
- Might be replicated

Client library for file access
- Talks to the master to find chunk servers
- Connects directly to chunk servers to access data

Distributed File System


A reliable distributed file system
- Data is kept in chunks spread across machines
- Each chunk is replicated on different machines, giving seamless recovery from disk or machine failure

[Diagram: chunks C0-C5 and D0-D1 replicated across chunk servers 1 through N.]

Bring computation directly to the data!
- Chunk servers also serve as compute servers

Map Reduce Programming Model


Input data type: file of key-value records

Example

More Specifically
Input: a set of key-value pairs
The programmer specifies two methods:

Map(k, v) → <k', v'>*
- Takes a key-value pair and outputs a set of key-value pairs
- E.g., key is the filename, value is a single line in the file
- There is one Map call for every (k, v) pair

Reduce(k', <v'>*) → <k', v''>*
- All values v' with the same key k' are reduced together and processed in v' order
- There is one Reduce function call per unique key k'
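The same two methods written as Python type signatures, as a sketch of the model only (Hadoop's real API is Java; the names here are illustrative):

from typing import Iterable, Iterator, Tuple

def map_fn(k: str, v: str) -> Iterator[Tuple[str, int]]:
    """Called once per input (k, v) pair; emits zero or more intermediate pairs."""
    ...

def reduce_fn(k: str, values: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """Called once per unique intermediate key, with all of that key's values."""
    ...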

MapReduce: The Map Step


[Diagram: input key-value pairs are fed, one pair at a time, through the user's map function, producing intermediate key-value pairs.]

MapReduce: The Reduce Step


[Diagram: intermediate key-value pairs are grouped by key into key-value groups, and each group is fed through the user's reduce function to produce the output key-value pairs.]

MapReduce: Word Counting


MAP (provided by the programmer): read the input and produce a set of key-value pairs. For a big document beginning "The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration...", the map emits (The, 1), (crew, 1), (of, 1), (the, 1), (space, 1), (shuttle, 1), (Endeavor, 1), (recently, 1), ...

GROUP BY KEY: collect all pairs with the same key, e.g. (crew, 1), (crew, 1), (space, 1), (the, 1), (the, 1), (the, 1), (shuttle, 1), (recently, 1), ...

REDUCE (provided by the programmer): collect all values belonging to each key and output, e.g. (crew, 2), (space, 1), (the, 3), (shuttle, 1), (recently, 1), ...

The data is read sequentially; there are only sequential reads.

Word Count Using MapReduce


map(key, value):
  // key: document name; value: text of the document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
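The same logic as a runnable, single-machine Python simulation; the small run_mapreduce driver below is only a stand-in for what the Hadoop framework does (map, group by key, reduce):

from collections import defaultdict

def map_fn(doc_name, text):
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    yield (word, sum(counts))

def run_mapreduce(documents, map_fn, reduce_fn):
    groups = defaultdict(list)
    for doc_name, text in documents:              # map phase
        for key, value in map_fn(doc_name, text):
            groups[key].append(value)             # group by key (the "shuffle")
    results = []
    for key, values in groups.items():            # reduce phase
        results.extend(reduce_fn(key, values))
    return results

print(run_mapreduce([("d1", "the crew of the space shuttle")], map_fn, reduce_fn))
# [('the', 2), ('crew', 1), ('of', 1), ('space', 1), ('shuttle', 1)]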

MapReduce: Overview
Sequentially read a lot of data
Map:
Extract something you care about

Group by key: Sort and Shuffle


Reduce:
Aggregate, summarize, filter or transform

Write the result


The outline stays the same; Map and Reduce change to fit the problem

Problems Suited for Map-Reduce

Example: Host size


- Suppose we have a large web corpus
- Look at the metadata file: lines of the form (URL, size, date, ...)
- For each host, find the total number of bytes, that is, the sum of the page sizes for all URLs from that particular host (a sketch follows below)

Other examples:
- Link analysis and graph processing
- Machine learning algorithms
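A sketch of how the host-size job could be written against the run_mapreduce driver shown earlier for word count (the tab-separated field layout is an assumption):

from urllib.parse import urlparse

def map_fn(_, line):
    url, size, *rest = line.split("\t")      # assumed: tab-separated (URL, size, date, ...)
    yield (urlparse(url).netloc, int(size))  # emit (host, page size)

def reduce_fn(host, sizes):
    yield (host, sum(sizes))                 # total bytes per host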

Example: Language Model


Statistical machine translation:
- Need to count the number of times every 5-word sequence occurs in a large corpus of documents

Very easy with MapReduce (a sketch follows below):
- Map: extract (5-word sequence, count) from each document
- Reduce: combine the counts
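A sketch of the corresponding map and reduce functions, again usable with the earlier run_mapreduce driver (emits a 1 per occurrence and sums in the reducer):

def map_fn(_, text):
    words = text.split()
    for i in range(len(words) - 4):
        yield (" ".join(words[i:i + 5]), 1)   # each 5-word sequence, once per occurrence

def reduce_fn(ngram, counts):
    yield (ngram, sum(counts))                # combine the counts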

Map Reduce Framework

Map-Reduce: Environment

The Map-Reduce environment takes care of:
- Partitioning the input data
- Scheduling the program's execution across a set of machines
- Performing the group-by-key step
- Handling machine failures
- Managing required inter-machine communication

Map-Reduce: A diagram
For a big document:
- MAP: read the input and produce a set of key-value pairs
- Group by key: collect all pairs with the same key (hash merge, shuffle, sort, partition)
- Reduce: collect all values belonging to each key and output

Map-Reduce: In Parallel

All phases are distributed, with many tasks doing the work

Combiners
A combiner is a local aggregation function for repeated keys produced by the same map (a sketch follows below).
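A sketch of a word-count combiner in Python (illustrative; in Hadoop the combiner is often the same class as the reducer, run over each mapper's local output):

from collections import Counter

def combine(map_output):
    """map_output: iterable of (word, 1) pairs produced by one map task."""
    local_counts = Counter()
    for word, count in map_output:
        local_counts[word] += count
    return local_counts.items()       # far fewer pairs to shuffle when words repeat

print(list(combine([("the", 1), ("the", 1), ("crew", 1)])))
# [('the', 2), ('crew', 1)]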

Coordination : Master
- Tracks task status (idle, in progress, completed)
- Pings the nodes to learn their state
- When mapping is complete, this information is sent to the master, which then schedules the reduce tasks

Eight Common Hadoop-able Problems


1. Modeling true risk
2. Customer churn analysis
3. Recommendation engine
4. PoS transaction analysis
5. Analyzing network data to predict failure
6. Threat analysis
7. Search quality
8. Data sandbox

1. Modelling True Risk


Challenge:
How much risk exposure does an organization really have with each customer?
Multiple sources of data and across multiple lines of business
Solution with Hadoop:
Source and aggregate disparate data sources to build a complete data picture
e.g. credit card records, call recordings, chat sessions, emails, banking activity
Structure and analyze
Sentiment analysis, graph creation, pattern recognition
Typical Industry:
Financial Services (banks, insurance companies)

2. Customer Churn Analysis


Challenge:
Why is an organization really losing customers?
Data on these factors comes from different sources
Solution with Hadoop:
Rapidly build behavioral model from disparate data sources
Structure and analyze with Hadoop
Traversing
Graph creation
Pattern recognition
Typical Industry:
Telecommunications, Financial Services

3. Recommendation Engine/Ad Targeting


Challenge:
Using user data to predict which products to recommend
Solution with Hadoop:
Batch processing framework
Allows execution in parallel over large datasets
Collaborative filtering
Collecting taste information from many users
Utilizing information to predict what similar users like
Typical Industry
Ecommerce, Manufacturing, Retail
Advertising

4. Point of Sale Transaction Analysis


Challenge:
Analyzing Point of Sale (PoS) data to target promotions and manage operations
Sources are complex and data volumes grow across chains of stores and other sources
Solution with Hadoop:
Batch processing framework
Allows execution in parallel over large datasets
Pattern recognition
Optimizing over multiple data sources
Utilizing information to predict demand
Typical Industry:
Retail

5. Analyzing Network Data to Predict Failure


Challenge:
Analyzing real-time data series from a network of sensors
Calculating average frequency over time is extremely tedious because of the need to
analyze terabytes
Solution with Hadoop:
Take the computation to the data
Expand from simple scans to more complex data mining
Better understand how the network reacts to fluctuations
Discrete anomalies may, in fact, be interconnected
Identify leading indicators of component failure
Typical Industry:
Utilities, Telecommunications, Data Centers

6. Threat Analysis/Trade Surveillance


Challenge:
Detecting threats in the form of fraudulent activity or attacks
Large data volumes involved
Like looking for a needle in a haystack
Solution with Hadoop:
Parallel processing over huge datasets
Pattern recognition to identify anomalies,
i.e., threats
Typical Industry:
Security, Financial Services,
General: spam fighting, click fraud

7. Search Quality
Challenge:
Providing real time meaningful search results
Solution with Hadoop:
Analyzing search attempts in conjunction with structured data
Pattern recognition
Browsing pattern of users performing searches in different categories
Typical Industry:
Web, Ecommerce

8. Data Sandbox
Challenge:
Data Deluge
Don't know what to do with the data or what analysis to run
Solution with Hadoop:
Dump all this data into an HDFS cluster
Use Hadoop to start trying out different analyses on the data
Discover patterns to derive value from the data
Typical Industry:
Common across all industries
