[Diagram: traditional data warehouse architecture. Operational systems feed ETL into the data warehouse and data marts; a BI server delivers reports and dashboards.]
Challenges:
- Cost to deploy and upgrade
- Doesn't support complex analytics
- Scalability and performance
2. Analytical platforms
Aster Data (Teradata)
Calpont
Datallegro (Microsoft)
Exasol
Greenplum (EMC)
IBM SmartAnalytics
Infobright
Kognitio
Netezza (IBM)
Oracle Exadata
Paraccel
Pervasive
Sand Technology
SAP HANA
Sybase IQ (SAP)
Teradata
Vertica (HP)
3. Hadoop
Key traits: unstructured data, a distributed file system, schema on read (see the sketch below), open source (low cost), NoSQL, MapReduce, and BIG DATA for the data scientist.
Strengths:
- Comprehensive
- Agile
- Expressive
- Affordable
Drawbacks:
- Immature
- Batch oriented
- Requires expertise
[Figure: the Hadoop ecosystem. Source: Hortonworks]
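Schema on read, mentioned above, means structure is imposed when the data is queried rather than when it is loaded. A minimal sketch of the idea in Python (the log format and field names are made up for illustration):

# Schema on read: store raw lines untouched, apply structure at query time.
# The delimiter and field names here are hypothetical.
raw_lines = [
    "2013-04-01|alice|login",
    "2013-04-01|bob|purchase",
]

def apply_schema(line):
    # Parse one raw record into (date, user, action) only when it is read.
    date, user, action = line.split("|")
    return {"date": date, "user": user, "action": action}

# Because the schema lives in the reader, a new parser can reinterpret
# the same stored bytes later without reloading anything.
records = [apply_schema(line) for line in raw_lines]
print([r["user"] for r in records if r["action"] == "login"])  # ['alice']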
[Chart: Hadoop adoption stages. Considering 38%, Experimenting 32%; Implementing and In production account for the remaining figures (20%, 5%, 4%).]
[Diagram: source systems send structured data to general-purpose RDBMSs and analytic databases ("capture only what's needed"), while Hadoop also ingests semi-structured and unstructured data and workflows ("capture in case it's needed").]
[Diagram: numbered big-data workflow, only partially recoverable: 5. explore data; 6. parse, aggregate; 9. report and mine data with analytical tools.]
Hadoop
Open source (using the Apache license)
Around 40 core Hadoop committers from ~10 companies
Cloudera, Yahoo!, Facebook, Apple, and more
Hundreds of contributors writing features, fixing bugs
Many related projects, applications, tools, etc.
Hadoop History
Hadoop is based on work done by Google in the early 2000s
Specifically, on papers describing the Google File System (GFS)
published in 2003, and MapReduce published in 2004
This work took a radically new approach to the problem of
distributed computing, meeting the requirements for reliability, scalability, etc.
Core concept: distribute the data as it is initially stored in the
system
Individual nodes can work on data local to those nodes
No data transfer over the network is required for initial processing
History Continued..
Doug Cutting and Michael Cafarella created Hadoop in 2005
It grew out of the Nutch open-source search engine project (2005)
2006: Yahoo! donated the Hadoop project to Apache
Feb 2006: Hadoop split out of Nutch, and Yahoo! started using it
Dec 2006: Yahoo! building a 100-node Webmap with Hadoop
Apr 2007: Yahoo! running Hadoop on a 1,000-node cluster
Dec 2007: Yahoo! building a 1,000-node Webmap with Hadoop
Jan 2008: Hadoop made a top-level Apache project
Sep 2008: Hive added to Hadoop as a contributed project
Hadoop Projects
Hive
HBase
Mahout
Pig
Oozie
Flume
Sqoop
...
Hadoop Architecture
[Diagram: Hadoop architecture. Applications submit work to the MASTER, which runs the Job Tracker and Name Node; each of the SLAVES runs a Task Tracker and a Data Node.]
Continued..
Applications interact with the Master (batch process)
Job Tracker (Master): breaks the big task into small pieces and sends them
to the Task Trackers on the slaves
Name Node (Master): maintains the index of which segment of data resides
on which Data Node
Task Tracker (Slave): processes the small piece of the task sent to it
Data Node (Slave): manages the piece of data assigned to it
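A toy sketch of the split-and-assign step just described (the piece size and tracker names are illustrative, not Hadoop's real interface):

# Toy job tracker: split a job over N MB of input into fixed-size pieces
# and deal them out to the task trackers on the slaves.
def split_job(input_mb, piece_mb=64):
    return [(start, min(start + piece_mb, input_mb))
            for start in range(0, input_mb, piece_mb)]

tasks = split_job(200)  # [(0, 64), (64, 128), (128, 192), (192, 200)]
trackers = ["tracker-1", "tracker-2"]
assignment = {task: trackers[i % len(trackers)] for i, task in enumerate(tasks)}
print(assignment)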
Fault Tolerance
Built-in fault tolerance (by default, 3 copies of each file on the slaves)
The Master can become a single point of failure
Enterprise Hadoop deployments therefore run two copies of the Master (a main
Master and a backup Master)
If a node fails, the master will detect that failure and re-assign the
work to a different node on the system
Continued..
Restarting a task does not require communication with nodes
working on other portions of the data
If a failed node restarts, it is automatically added back to the
system and assigned new tasks
If a node appears to be running slowly, the master can redundantly
execute another instance of the same task
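A toy model of the behavior described above, with made-up node and task names (real Hadoop detects failures via heartbeat timeouts):

# Toy failure handling: reassign a dead node's tasks, and speculatively
# duplicate a slow task so the first copy to finish wins.
tasks = {"t1": "nodeA", "t2": "nodeB", "t3": "nodeB"}

def handle_failure(dead_node, live_nodes):
    for task, node in tasks.items():
        if node == dead_node:
            tasks[task] = live_nodes[0]  # re-assign to a healthy node

def speculate(slow_task, spare_node):
    # Launch a redundant instance of the same task on another node.
    tasks[slow_task + "-speculative"] = spare_node

handle_failure("nodeB", ["nodeC"])
speculate("t1", "nodeD")
print(tasks)  # t2, t3 moved to nodeC; a second copy of t1 runs on nodeD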
For Programmers
No need to think about:
Where the file is located
How to manage failures
How to break computations into pieces
How to program for scalability
Hadoop Cluster
A set of machines running HDFS and MapReduce is known as a
Hadoop Cluster
Individual machines are known as nodes
A cluster can have as few as one node or as many as several
thousand
Single-node vs. multi-node clusters
More nodes = better performance!
HDFS
Goals of HDFS
Very Large Distributed File System
10K nodes, 100 million files, 10 PB
HDFS Architecture
HDFS Details
Data Coherency
Write-once-read-many access model
Client can only append to existing files
Intelligent Client
Client can find location of blocks
Client accesses data directly from DataNode
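A sketch of that read path; the NameNode and DataNode classes below are stand-ins for the real protocol, not Hadoop's API:

# Sketch of an HDFS-style read: the client asks the name node only for
# block locations, then streams the bytes directly from the data nodes.
class NameNode:
    def __init__(self):
        # file -> ordered list of (block_id, data_node) pairs
        self.metadata = {"/logs/day1": [("blk_1", "dn3"), ("blk_2", "dn7")]}

    def get_block_locations(self, path):
        return self.metadata[path]

class DataNode:
    def read_block(self, block_id):
        return f"<contents of {block_id}>"

namenode = NameNode()
datanodes = {"dn3": DataNode(), "dn7": DataNode()}

# Metadata comes from the master; data flows directly from the slaves.
content = "".join(datanodes[dn].read_block(blk)
                  for blk, dn in namenode.get_block_locations("/logs/day1"))
print(content)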
MapReduce
Challenges:
How to distribute computation?
Distributed/parallel programming is hard
Pioneered by Google
Processes around 20 petabytes of data per day
Cost efficient
Cluster Architecture
[Diagram: racks of commodity nodes, each with CPU, memory, and disk, connected by an in-rack switch giving 1 Gbps between any pair of nodes; a 2-10 Gbps backbone switch links the racks.]
Large-scale Computing
Large-scale computing for data mining
problems on commodity hardware
Challenges:
How do you distribute computation?
How can we make it easy to write distributed programs?
Machines fail:
One server may stay up 3 years (1,000 days)
If you have 1,000 servers, expect to lose 1/day
People estimated Google had ~1M machines in 2011
1,000 machines fail every day!
Programming model
Map-Reduce
Storage Infrastructure
Problem:
If nodes fail, how to store data persistently?
Answer:
Distributed File System:
Provides global file namespace
Google GFS; Hadoop HDFS
Master node
a.k.a. the Name Node in Hadoop's HDFS
Stores metadata about where files are stored
Might be replicated
[Diagram: chunk servers 1 through N, each holding a mix of chunks (C0, C1, C2, C3, C5, D0, D1); each chunk appears on several servers, illustrating replication.]
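A minimal sketch of the placement idea behind the figure, assuming a 3x replication policy and round-robin server choice (real systems also weigh racks and free space):

# Sketch: place each chunk on 3 distinct chunk servers, as in the figure.
import itertools

def place_chunks(chunks, servers, replicas=3):
    rotation = itertools.cycle(range(len(servers)))
    placement = {}
    for chunk in chunks:
        start = next(rotation)
        placement[chunk] = [servers[(start + r) % len(servers)]
                            for r in range(replicas)]
    return placement

print(place_chunks(["C0", "C1", "C2"], ["cs1", "cs2", "cs3", "cs4"]))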
More Specifically
Input: a set of key-value pairs
Programmer specifies two methods:
Map(k, v) → <k', v'>*
Takes a key-value pair and outputs a set of key-value pairs
E.g., key is the filename, value is a single line in the file
[Diagram: map tasks transform input records into intermediate key-value pairs.]
[Diagram: intermediate pairs are grouped by key into key-value groups, and reduce tasks turn each group into output key-value pairs.]
Word count on a big document ("The crew of the space shuttle Endeavor recently ..."):
Map (provided by the programmer): sequentially read the data and emit (word, 1) for each word:
(The, 1), (crew, 1), (of, 1), (the, 1), (space, 1), (shuttle, 1), (Endeavor, 1), (recently, 1), ...
Group by key (only sequential reads):
(crew, 1), (crew, 1), (space, 1), (the, 1), (the, 1), (the, 1), (shuttle, 1), (recently, 1), ...
Reduce (provided by the programmer):
(crew, 2), (space, 1), (the, 3), (shuttle, 1), (recently, 1), ...
reduce(key, values):
// key: a word; values: an iterator over counts
result = 0
for each count v in values:
result += v
emit(key, result)
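The slides show only the reduce side; here is a runnable single-machine sketch of the whole word-count flow in Python, with a map function written in the same spirit:

# Single-machine word count mirroring the slides' pseudocode.
from collections import defaultdict

def map_fn(key, value):
    # key: document name; value: document text. Emit (word, 1) per word.
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # key: a word; values: the counts collected for it.
    return (key, sum(values))

def word_count(doc):
    groups = defaultdict(list)  # the "group by key" step
    for word, one in map_fn("doc", doc):
        groups[word].append(one)
    return [reduce_fn(word, counts) for word, counts in groups.items()]

print(word_count("The crew of the space shuttle Endeavor recently ..."))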
MapReduce: Overview
Sequentially read a lot of data
Map:
Extract something you care about
Reduce:
Combine the counts
Other examples:
Link analysis and graph processing
Machine Learning algorithms
Map-Reduce: Environment
Map-Reduce environment takes care of:
Partitioning the input data
Scheduling the program's execution across a
set of machines
Performing the group by key step
Handling machine failures
Managing required inter-machine communication
Map-Reduce: A diagram
Input: a big document
MAP: reads input and produces a set of key-value pairs
Group by key: collects all pairs with the same key (hash merge, shuffle, sort, partition)
Reduce: collects all values belonging to the key and outputs the result
Map-Reduce: In Parallel
All phases are distributed with many tasks doing the work
Combiners
A combiner is a local aggregation function for repeated keys produced
by the same map task.
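For word count the combiner can simply re-use the reduce logic, pre-summing counts on the mapper's machine so fewer pairs cross the network. A sketch:

# Combiner: locally sum repeated keys in one map task's output before
# anything is sent over the network to the reducers.
from collections import Counter

map_output = [("the", 1), ("crew", 1), ("the", 1), ("the", 1)]

def combine(pairs):
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return list(totals.items())

print(combine(map_output))  # [('the', 3), ('crew', 1)]: 4 pairs shrink to 2

This works because addition is associative and commutative, so partial sums can be merged in any order.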
Coordination: Master
Task status (idle, in progress, completed)
Pings the nodes to get their state
When a map task completes, this information is sent to the master, which
schedules the reduce tasks
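A toy version of that bookkeeping (the status values come from the slide; the rest is illustrative):

# Toy master state: track each task's status and decide when the
# reduce phase can be scheduled.
task_status = {"map_0": "idle", "map_1": "in progress", "map_2": "completed"}

def on_ping(task, alive):
    # Heartbeat handler: a task on a dead worker goes back to idle
    # so it can be re-assigned.
    if not alive:
        task_status[task] = "idle"

def ready_for_reduce():
    # Reducers are scheduled once every map task has completed.
    return all(status == "completed" for status in task_status.values())

on_ping("map_1", alive=False)
print(task_status, ready_for_reduce())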
7. Search Quality
Challenge:
Providing meaningful real-time search results
Solution with Hadoop:
Analyzing search attempts in conjunction with structured data
Pattern recognition
Browsing patterns of users performing searches in different categories
Typical Industry:
Web, Ecommerce
8. Data Sandbox
Challenge:
Data Deluge
Don't know what to do with the data or which analyses to run
Solution with Hadoop:
Dump all this data into an HDFS cluster
Use Hadoop to start trying out different analyses on the data
See patterns to derive value from data
Typical Industry:
Common across all industries