Académique Documents
Professionnel Documents
Culture Documents
GENERAL INTRODUCTION
Presented By
www.kellytechno.com
Search Engine
10KB
2-20MB/second
www.kellytechno.com
GOOGLES SOLUTION
www.kellytechno.com
www.kellytechno.com
Spider
Batch Processing
System on top of
Hadoop
Internet
Search Log
Storage
Runtime
SE Web Server
www.kellytechno.
com
Business
Intelligenc
e
Batch Processing
System on top fo
Hadoop
Domain Knowledge
View/Click/Events Log
Storage
Database
Web Server
www.kellytechno.com
HADOOP HISTORY
Jan 2006 Doug Cutting joins Yahoo
Feb 2006 Hadoop splits out of Nutch and Yahoo
starts using it.
Dec 2006 Yahoo creating 100-node Webmap
with Hadoop
Apr 2007 Yahoo on 1000-node cluster
Jan 2008 Hadoop made a top-level Apache
project
Dec 2007 Yahoo creating 1000-node Webmap
with Hadoop
Sep 2008 Hive added to Hadoop as a contrib
project
www.kellytechno.com
HADOOP INTRODUCTION
Open
http://hadoop.apache.org/
Book:
http://oreilly.com/catalog/9780596521998/index.html
Written
Does
Runs
in Java
on
Linux,
Used
Cluster
by 40+ companies /
universities over the world
Yahoo,
Facebook, etc
Cloud Computing Donation from Google and
IBM
Startup
focusing on providing
services for hadoop
Cloudera
www.kellytechno.com
HADOOP COMPONENTS
Hadoop
(HDFS)
Hadoop Map-Reduce
Contributes
Hadoop
Streaming
Pig / JAQL / Hive
HBase
Hama / Mahout
www.kellytechno.com
www.kellytechno.com
GOALS OF HDFS
balancing
Node failures
Cluster expansion
www.kellytechno.com
HDFS ARCHITECTURE
www.kellytechno.com
HDFS DETAILS
Data Coherency
Write-once-read-many
access model
Client can only append to existing files
Intelligent Client
Client
www.kellytechno.com
hadoop
Web Interface
http://host:port/dfshealth.jsp
www.kellytechno.com
http://hadoop.apache.org/core/docs/current/hdfs_desig
n.html
File System
Kosmos File System (KFS)
Amazon S3 File System
www.kellytechno.com
www.kellytechno.com
HADOOP MAP-REDUCE
INTRODUCTION
processing
Web index building
www.kellytechno.com
www.kellytechno.com
<nk1, nv1>
<nk2, nv2>
<nk3, nv3>
Local
Map
<nk1, nv1>
<nk3, nv3>
<nk1, nv6>
Global
Shuffle
<nk1, nv1>
<nk1, nv6>
<nk3, nv3>
Local
Sort
<nk1, 2>
<nk3, 1>
Local
Reduce
Machine 2
<k4, v4>
<k5, v5>
<k6, v6>
<nk2, nv4>
<nk2, nv5>
<nk1, nv6>
<nk2, nv4>
<nk2, nv5>
<nk2, nv2>
<nk2, nv4>
<nk2, nv5>
<nk2, nv2>
<nk2, 3>
www.kellytechno.com
PHYSICAL FLOW
www.kellytechno.com
EXAMPLE CODE
www.kellytechno.com
HADOOP STREAMING
streaming
-input /user/zshao/articles
-mapper tr \n
-reducer uniq -c
-output /user/zshao/
-numReduceTasks 32
www.kellytechno.com
www.kellytechno.com
www.kellytechno.com
Sender
Receiver
R1
R2
R3
R1
R2
R3
sort
sort
Merge
Reduce
sort
Sender
Receiver
ReceivermergeNflowsinto1,calluser
Mappercallsuserfunctions:Map
Reducercallsuserfunctions:
Senderdoesflowcontrol
functionComparetosort,dumpbufferto
andPartition
CompareandReduce
disk,anddocheckpointing
Reducer
www.kellytechno.com
www.kellytechno.com
is scalable
Solution:
Reduce
Hive
HIVE
A
Allow
http://svn.apache.org/repos/asf/hadoop/hive/trunk/
www.kellytechno.com
HIVE ARCHITECTURE
Web UI
Mgmt, etc
Map Reduce
Hive CLI
Browsing Queries DDL
MetaStore
Hive QL
Parser
Planner
Execution
SerDe
Thrift API
HDFS
HIVE QL JOIN
page_view
page useri
id
d
1
111
111
222
time
user
pv_users
page age
id
1
2
25
25
32
9:08:1
4
SQL:
HIVE QL GROUP BY
pv_users
page age
id
SQL:
1
2
25
25
1
2
32
25
pageid_age_sum
25
25
1
2
32
www.kellytechno.com
page useri
id
d
time
111
9:08:0
1
111
9:08:1
3
222
9:08:1
4
111
result
9:08:2
0
pagei count_distinct_u
d
serid
1
2
2
1
SQL
SELECT pageid,
COUNT(DISTINCT userid)
FROM page_view GROUP BY
pageid
www.kellytechno.com
111 9:08:
13
111 9:08:
20
222 9:08:
14
Shuffle
randomly.
key
v
<1,11 9:08:
1>
01
<2,11 9:08:
1>
13
111 9:08
:01
111 9:08
:13
Reduce
key
v
<1,22 9:08:
2>
14
<2,11 9:08:
1>
20
www.kellytechno.c
222 9:08
:14
111 9:08
:20
Hive Optimizations
Efficient Execution of SQL on top of
Map-Reduce
www.kellytechno.com
<nk1, nv1>
<nk2, nv2>
<nk3, nv3>
Local
Map
Machine 2
<k4, v4>
<k5, v5>
<k6, v6>
<nk1, nv1>
<nk3, nv3>
<nk1, nv6>
Globa
l
Shuffl
e
<nk2, nv4>
<nk2, nv5>
<nk1, nv6>
<nk1, nv1>
<nk1, nv6>
<nk3, nv3>
Local
Sort
<nk2, nv4>
<nk2, nv5>
<nk2, nv2>
<nk1, 2>
<nk3, 1>
Local
Reduce
<nk2, nv4>
<nk2, nv5>
<nk2, nv2>
www.kellytechno.co
m
<nk2, 3>
key av
1 111
B
ke av
y
Map Reduce
key bv
1 222
SQL:
ABC
bv
111 222
Map Reduce
ke av
y
1
bv
cv
key cv
1 333
FROM
Map Reduce
25
32
page cou
id
nt
1
2
Extended SQL
1
1
page age
id
1
2
25
32
Map Reduce
age cou
nt
25
32
FROM pv_users
INSERT INTO TABLE
pv_pageid_sum
SELECT pageid, count(1)
GROUP BY pageid
INSERT INTO TABLE
pv_age_sum
SELECT age, count(1)
GROUP BY age;
1
1
www.kellytechno.co
m
pageid_age_partial_sum
page ag
id
e
1
1
25
25
1
2
1
25
32
25
pageid_age_su
m
1
2
25
32
2
1
25
www.kellytechno.co
m
MAP-SIDE AGGREGATION /
COMBINER
Machine 1
<k1, v1>
<k2, v2>
<k3, v3>
<male, 343>
<female, 128>
Local
Map
<male, 343>
<male, 123>
Global
Shuffle
<male, 343>
<male, 123>
Local
Sort
<male, 466>
Local
Reduce
Machine 2
<k4, v4>
<k5, v5>
<k6, v6>
<male, 123>
<female, 244>
www.kellytechno.co
m
QUERY REWRITE
Predicate Push-down
select
select
Column Pruning
www.kellytechno.co
m
page quality
IP
url
clicked
viewed
http://a.com/
90
65.1.2.3
http://a.com/
12
145
http://b.com/
20
68.9.0.81
http://b.com/
45
383
http://c.com/
68
11.3.85.1
http://c.com/
23
67
www.kellytechno.co
m
Type system
Primitive types
Recursively build up using Composition/Maps/Lists
Schema Evolution
www.kellytechno.co
m
HIVE CLI
DDL:
create
Browsing:
show
tables
describe table
cat table
Loading Data
Queries
www.kellytechno.co
m
META STORE
Thrift API
Current
MetaStore UI:
Browse
HiPal:
Interactively
www.kellytechno.co
m
Philosophy
SQL
Map-Reduce
Query Operators
Projections
Equi-joins
Group
by
Sampling
Order By
www.kellytechno.co
m
Extended SQL:
FROM (
FROM pv_users
MAP pv_users.userid, pv_users.date
USING 'map_script' AS (dt, uid)
CLUSTER BY dt) map
INSERT INTO TABLE pv_users_reduced
REDUCE map.dt, map.uid
USING 'reduce_script' AS (date, count);
www.kellytechno.co
m