Académique Documents
Professionnel Documents
Culture Documents
Contact:
Dr. Bina Ramamurthy
CSE Department
University at Buffalo (SUNY)
bina@buffalo.edu
http://www.cse.buffalo.edu/faculty/bina
Partially Supported by
NSF DUE Grant: 0737243
Introduction to MapReduce
From CS Foundation to MapReduce
MapReduce programming model
Hadoop Distributed File System
Relevance to Undergraduate Curriculum
Demo (Internet access needed)
Our experience with the framework
Summary
References
green 2
Data Main
sun 1
collection
moon 1
land 1
part 1
WordCounter
parse( )
count( )
DataCollection ResultTable
web 2
weed 1
green 2
Data
Main sun 1
collection
moon 1
Thread
land 1
1..*
WordCounter part 1
parse( )
count( )
weed 1
Data green 2
collection
sun 1
moon 1
Thread
land 1
1..*
1..* part 1
Parser Counter
WordList
Separate counters
DataCollection ResultTable
KEY web weed green sun moon land part web green …….
VALUE
CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai
Peta-scale Data
11
Main web 2
weed 1
green 2
Data sun 1
collection moon 1
Thread
land 1
1..*
1..* part 1
Parser Counter
KEY web weed green sun moon land part web green …….
VALUE
CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai
Addressing the Scale Issue
12
Single machine cannot serve all the data: you need a distributed
special (file) system
Large number of commodity hardware disks: say, 1000 disks 1TB
each
Issue: With Mean time between failures (MTBF) or failure rate of
1/1000, then at least 1 of the above 1000 disks would be down at a
given time.
Thus failure is norm and not an exception.
File system has to be fault-tolerant: replication, checksum
Data transfer bandwidth is critical (location of data)
weed 1
green 2
Data sun 1
collection moon 1
Thread
land 1
1..*
1..* part 1
Parser Counter
KEY web weed green sun moon land part web green …….
VALUE
CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai
Data Peta Scale Data is Commonly Distributed
collection
14
Main web 2
Data
collection weed 1
green 2
Data sun 1
collection moon 1
Thread
land 1
1..*
Data part 1
1..*
collection Parser Counter
VALUE
CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai
Data Write Once Read Many (WORM) data
collection
15
Main web 2
Data
collection weed 1
green 2
Data sun 1
collection moon 1
Thread
land 1
1..*
Data part 1
1..*
collection Parser Counter
KEY web weed green sun moon land part web green …….
VALUE
CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai
Data WORM Data is Amenable to Parallelism
collection
16
Main
Data
collection
1. Data with WORM
characteristics : yields
Data to parallel processing;
collection 2. Data without
Thread dependencies: yields
1..*
to out of order
Data processing
1..*
collection Parser Counter
Data Thread
Our parse is a mapping operation:
collection Parser
1..*
1..*
Counter
MAP: input <key, value> pairs
DataCollection WordList ResultTable
Main
Our count is a reduce operation:
REDUCE: <key, value> pairs reduced
Data Thread
collection
1..*
1..*
Counter
Counter
green
1
sun 1
moon 1
land 1
part 1
Map web 1
Data green 1
processors
Data Map
Collection: split 2
……
Data
…
Collection: split n
Reduce
Map
Data
Collection: split1 Split the data to
Supply multiple
processors
Reduce
Data Map
Collection: split 2
……
Data
…
Reduce
Collection: split n Map
Parse-hash
Count
P-0000
, count1
Parse-hash
Count
P-0001
, count2
Parse-hash
Count
P-0002
Parse-hash ,count3
combine part0
map reduce
Cat split
reduce part1
split map combine
Bat
map part2
split combine reduce
Dog
split map
Other
Words
(size:
TByte)
CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai
MapReduce Programming
Model
23
Highly fault-tolerant
High throughput
Suitable for applications with large data sets
Streaming access to file system data
Can be built out of commodity hardware
HDFS Client
Application
Local file
system
Block size: 2K
Name Nodes
Block size: 128M
More details: We discuss this in great detail in my Operating Replicated
Systems course
CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai
Hadoop Distributed File System
32
blockmap
Application
Local file
system
Block size: 2K
Name Nodes
Block size: 128M
More details: We discuss this in great detail in my Operating Replicated
Systems course
CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai
Relevance and Impact on Undergraduate courses
33