Page 1
An engine that executes Pig Latin locally or on a Hadoop cluster.
Page 2
Pig Latin example
Query: get the list of web pages visited by users whose age is between 20 and 29 years.
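A minimal sketch of how the query above might be written in Pig Latin. The input names 'users' and 'pages', their schemas, and the output name are assumptions made for illustration; the slide does not give them.

```pig
-- Hypothetical inputs: 'users' (name, age) and 'pages' (user, url)
Users = load 'users' as (name:chararray, age:int);
Pages = load 'pages' as (user:chararray, url:chararray);

-- Keep only users aged 20 to 29 (inclusive)
Young = filter Users by age >= 20 and age <= 29;

-- Match each remaining user with the pages they visited
Jnd = join Young by name, Pages by user;

-- Keep only the page URL and drop duplicates
Urls = foreach Jnd generate Pages::url as url;
Result = distinct Urls;

store Result into 'young_user_pages';
```

Filtering before the join keeps the join input small, which is the kind of ordering the optimizer rules later in the deck try to produce automatically.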
Page 3
Why Pig?
Faster development
Fewer lines of code
Don't reinvent the wheel
Flexible
Metadata is optional
Extensible
Procedural programming
Page 4
Pig optimizations
Page 5
Pig optimizations
What Pig does for you
Applies safe transformations to the query to optimize it
Optimized operations (join, sort)
What you do
Organize input in an optimal way
Optimize the Pig Latin query
Tell Pig what join/group algorithm to use
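As an illustration of telling Pig which join algorithm to use, the sketch below uses the `using` clause of `join`. The relation names, schemas, and output names are hypothetical, and it assumes the 'small' input fits in memory.

```pig
-- Hypothetical inputs; 'small' is assumed to fit in memory
Big = load 'big' as (k:chararray, v:int);
Small = load 'small' as (k:chararray, w:int);

-- Fragment-replicate join: relations after the first are loaded into memory
J1 = join Big by k, Small by k using 'replicated';

-- Skewed join: for keys with a very uneven number of records
J2 = join Big by k, Small by k using 'skewed';

store J1 into 'out_replicated';
store J2 into 'out_skewed';
```

Without a `using` clause Pig falls back to its default join, so the hint only matters when you know something about the data that Pig does not.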
Page 6
Column pruner
Push up filter
Push down flatten
Push up limit
Partition pruning
Global optimizer
Page 7
Column Pruner
Pig will do column pruning automatically
A = load 'input' as (a0, a1, a2);
-- a2 is never used, so Pig can prune it
B = foreach A generate a0+a1;
C = order B by $0;
store C into 'output';
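DIY
-- No schema in the load statement, so Pig cannot tell which columns are unused
A = load 'input';
B = order A by $0;
C = foreach B generate $0+$1;
store C into 'output';

-- Do it yourself: project the needed columns right after the load
A = load 'input';
A1 = foreach A generate $0, $1;
B = order A1 by $0;
C = foreach B generate $0+$1;
store C into 'output';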
Page 8
Column Pruner
Another case where Pig does not do column pruning
Pig does not keep track of unused columns after grouping
DIY
A = load 'input' as (a0, a1, a2);
B = group A by a0;
C = foreach B generate SUM(A.a1);
store C into 'output';
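One way the do-it-yourself version of this script might look is sketched below. The alias A1 is introduced here for illustration; the idea is simply to drop the unused column a2 by hand before the group.

```pig
A = load 'input' as (a0, a1, a2);
-- Project away a2 by hand before grouping, since Pig will not prune it
A1 = foreach A generate a0, a1;
B = group A1 by a0;
C = foreach B generate SUM(A1.a1);
store C into 'output';
```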
Page 9
Push up filter
Pig splits the filter condition before pushing it up
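[Figure: three query plans. Original query: a Join followed by a single Filter on a0>0 and b0>10. After splitting: a Join followed by two Filters, a0>0 and b0>10. After pushing up: each Filter is applied before the Join.]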
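The rewrite can be sketched as two versions of the same script: the first is what a user might write, the second is roughly what the filter push-up produces. Input names, schemas, aliases, and output names are assumptions for illustration.

```pig
-- Hypothetical inputs and schemas
A = load 'inputA' as (a0:int, a1:chararray);
B = load 'inputB' as (b0:int, b1:chararray);

-- As written: join first, then one combined filter
C = join A by a1, B by b1;
D = filter C by a0 > 0 and b0 > 10;
store D into 'output_original';

-- Roughly what the optimizer aims for: the condition is split and
-- each half is applied to its own input before the join
FA = filter A by a0 > 0;
FB = filter B by b0 > 10;
E = join FA by a1, FB by b1;
store E into 'output_pushed';
```

Both stores should contain the same rows; the second form simply sends fewer records into the join.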
Page 10
Push up limit
[Figure: two query plans side by side. Left: Load, Flatten, Order. Right: Load, Order, Flatten.]
Page 11
Partition pruning
Prune unnecessary partitions entirely
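[Figure: partitions 2010, 2011, 2012. Before: HCatLoader reads all three partitions and a Filter (year>=2011) follows. After: the condition (year>=2011) is applied in HCatLoader, so only partitions 2011 and 2012 are read.]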
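A script that could benefit from partition pruning might look like the sketch below. The table name 'web_logs', its integer partition column year, and the output name are assumptions; the loader class shown is the one shipped with early HCatalog releases.

```pig
-- 'web_logs' is a hypothetical HCatalog table partitioned by an int column 'year'.
-- Early HCatalog releases ship the loader under org.apache.hcatalog.pig;
-- later releases use org.apache.hive.hcatalog.pig.HCatLoader instead.
A = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();

-- A filter on the partition column directly after the load is the
-- pattern that lets the loader skip the 2010 partition entirely
B = filter A by year >= 2011;

store B into 'recent_logs';
```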
Architecting the Future of Big Data
Hortonworks Inc. 2011
Page 12
[Figure: a Pig script compiled into a chain of MapReduce jobs (map 2 / reduce 2, map 3 / reduce 3) connected by a Pig temp file.]
Page 13
Page 14
Multiquery
[Figure: three branches, Group by $0, Group by $1, and Group by $2, each followed by a Foreach and a Store.]
Happens automatically
Cases we want to control multiquery: combine too many
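A script with the shape shown on this slide might look like the following sketch. The input and output names and the use of COUNT are assumptions for illustration.

```pig
-- One input feeding three independent group/foreach/store branches
A = load 'input';

B0 = group A by $0;
C0 = foreach B0 generate group, COUNT(A);
store C0 into 'output0';

B1 = group A by $1;
C1 = foreach B1 generate group, COUNT(A);
store C1 into 'output1';

B2 = group A by $2;
C2 = foreach B2 generate group, COUNT(A);
store C2 into 'output2';
```

With multiquery enabled, Pig can plan the three branches together so that the shared input is not loaded once per store.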
Page 15
Control multiquery
Disable multiquery
Command line option: -M
Page 16
Algebraic UDF
Initial, Intermediate, Final
Map: Initial
Combiner: Intermediate
Reduce: Final
A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, SUM(A);
store C0 into 'output0';
Page 17
Accumulator UDF
Reduce side UDF
Normally takes a bag
Benefit
Big bags are passed in batches
Avoids using too much memory
Batch size
pig.accumulative.batchsize=20000
A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, my_accum(A);
store C0 into 'output0';
Page 18
Memory optimization
Page 19
Input format
Serialization format
Compression
Page 20
Page 21
Input formats
[Figure: bar chart of run time (sec) by input format: PigStorage, LzoPigStorage, PigStorage with types, AvroStorage (has types).]
Page 22
Columnar format
RCFile
Columnar format for a group of rows
More efficient if you query a subset of columns
Page 23
Test 2
Project all 5 columns
Page 24
[Figure: comparison of Plain Text and RCFile.]
Page 25
Measure
Tune
Page 26
Map
(logic)
Page 27
Page 28
Page 29
Page 30
Page 31
Page 32
Join 2
Join 1
Sort2
Sort1
Page 33
Future Directions
Page 34
Questions?