
Making Pig Fly

Optimizing Data Processing on Hadoop


Daniel Dai (@daijy)
Thejas Nair (@thejasn)

Hortonworks Inc. 2011

Page 1

What is Apache Pig?


Pig Latin, a high-level data processing language.

An engine that executes Pig Latin locally or on a Hadoop cluster.

Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/


Architecting the Future of Big Data

Pig-latin example
Query: Get the list of web pages visited by users whose age is between 20 and 29 years.

USERS = load 'users' as (uid, age);
USERS_20s = filter USERS by age >= 20 and age <= 29;
PVs = load 'pages' as (url, uid, timestamp);
PVs_u20s = join USERS_20s by uid, PVs by uid;


Why Pig?
Faster development
- Fewer lines of code
- Don't re-invent the wheel

Flexible
- Metadata is optional
- Extensible
- Procedural programming

Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/



Pig optimizations

Ideally, the user should not have to bother.

Reality:
- Pig is still young and immature
- Pig does not have the whole picture:
  - Cluster configuration
  - Data histogram

Pig philosophy: Pig is docile


Pig optimizations
What Pig does for you:
- Safe transformations of the query to optimize it
- Optimized operations (join, sort)

What you do:
- Organize the input in an optimal way
- Optimize the Pig Latin query
- Tell Pig which join/group algorithm to use


Rule based optimizer

Column pruner
Push up filter
Push down flatten
Push up limit
Partition pruning
Global optimizer


Column Pruner
Pig will do column pruning automatically:

A = load 'input' as (a0, a1, a2);
B = foreach A generate a0+a1;
C = order B by $0;
Store C into 'output';

Pig will prune a2 automatically.

Cases where Pig will not do column pruning automatically:
- No schema specified in the load statement

DIY:
A = load 'input';
B = order A by $0;
C = foreach B generate $0+$1;
Store C into 'output';

becomes:

A = load 'input';
A1 = foreach A generate $0, $1;
B = order A1 by $0;
C = foreach B generate $0+$1;
Store C into 'output';

Column Pruner
Another case where Pig does not do column pruning automatically:
- Pig does not keep track of unused columns after grouping
DIY:
A = load 'input' as (a0, a1, a2);
B = group A by a0;
C = foreach B generate SUM(A.a1);
Store C into 'output';

becomes:

A = load 'input' as (a0, a1, a2);
A1 = foreach A generate $0, $1;
B = group A1 by a0;
C = foreach B generate SUM(A1.a1);
Store C into 'output';

Push up filter
Pig splits the filter condition before pushing it up:

- Original query: inputs A and B are joined, then filtered by a0>0 && b0>10
- Split filter condition: the single filter becomes two filters, a0>0 and b0>10, still after the join
- Push up filter: a0>0 moves before the join on input A, and b0>10 moves before the join on input B
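A minimal sketch of this rewrite in Pig Latin (relation and field names are illustrative, not from the original query):

```pig
A = load 'A' as (a0, a1);
B = load 'B' as (b0, b1);
J = join A by a1, B by b1;
F = filter J by a0 > 0 and b0 > 10;

-- Pig evaluates the script as if the filters had been written before the join:
A1 = filter A by a0 > 0;
B1 = filter B by b0 > 10;
J1 = join A1 by a1, B1 by b1;
```

Filtering before the join shrinks both join inputs, so less data is shuffled.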

Other push up/down


Push down flatten:
- Load → Flatten → Order becomes Load → Order → Flatten
- Flatten multiplies the number of records, so sorting before flattening sorts fewer records

Push up limit:
- Limit is moved as early as possible, so downstream operators process fewer records

Partition pruning
Prune unnecessary partitions entirely.

- Without pushdown: HCatLoader reads the 2010, 2011, and 2012 partitions, then Filter (year>=2011) discards 2010
- With pushdown: HCatLoader (year>=2011) reads only the 2011 and 2012 partitions
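A sketch of a query that benefits from partition pruning (the table name is illustrative; the loader class name assumes the HCatalog of this era):

```pig
A = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();
-- the filter on the partition column is pushed into the loader,
-- so only the 2011 and 2012 partitions are ever read
B = filter A by year >= 2011;
```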

Intermediate file compression


A Pig script compiles into a chain of MapReduce jobs; each job writes its output to a Pig temp file that the next job reads:

map 1 → reduce 1 → Pig temp file → map 2 → reduce 2 → Pig temp file → map 3 → reduce 3

Enable temp file compression

Pig temp files are not compressed by default:
- Issues with snappy (HADOOP-7990)
- LZO: not under the Apache license

Enable LZO compression:
- Install LZO for Hadoop
- In conf/pig.properties:
  pig.tmpfilecompression = true
  pig.tmpfilecompression.codec = lzo

With LZO: up to >90% disk saving and up to 4x query speed-up.

Multiquery

Combine two or more map/reduce jobs into one:

Load → {Group by $0, Group by $1, Group by $2} → Foreach → Store (one store per group)

- Happens automatically
- Cases where we want to control multiquery: it can combine too many jobs

Control multiquery

Disable multiquery
Command line option: -M

Using exec to mark the boundary:

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, COUNT(A);
Store C0 into 'output0';
B1 = group A by $1;
C1 = foreach B1 generate group, COUNT(A);
Store C1 into 'output1';
exec
B2 = group A by $2;
C2 = foreach B2 generate group, COUNT(A);
Store C2 into 'output2';


Implement the right UDF

Algebraic UDF
- Implements three stages: Initial, Intermediate, and Final
- Map runs Initial, the Combiner runs Intermediate, Reduce runs Final

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, SUM(A);
Store C0 into 'output0';

Implement the right UDF

Accumulator UDF
- Reduce-side UDF; normally takes a bag
- Benefit: big bags are passed in batches, avoiding excessive memory use
- Batch size: pig.accumulative.batchsize=20000

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, my_accum(A);
Store C0 into 'output0';

my_accum extends Accumulator {
    public void accumulate(Tuple batch) {
        // take one batch of the bag
    }
    public Value getValue() {
        // called after all batches are processed
    }
}

Memory optimization

Control bag size on the reduce side:
- MapReduce: reduce(Text key, Iterator<Writable> values, …)
- If the bag size exceeds a threshold, spill to disk
- Control the bag size (one bag per join/cogroup input) to fit in memory if possible:
  pig.cachedbag.memusage=0.2

Optimization starts before pig

Input format
Serialization format
Compression


Input format - Test Query

> searches = load 'aol_search_logs.txt' using PigStorage() as (ID, Query, …);
> search_thejas = filter searches by Query matches '.*thejas.*';
> dump search_thejas;
(1568578, thejasminesupperclub, …)


Input formats
[Chart: runtime (sec) for PigStorage, LzoPigStorage, PigStorage with types, and AvroStorage (has types)]


Columnar format

RCFile
- Columnar format for a group of rows
- More efficient if you query a subset of columns


Tests with RCFile

Tests with load + project + a filter that drops all records.
- Using HCatalog, with compression and types

Test 1: project 1 out of 5 columns
Test 2: project all 5 columns


RCFile test results

[Chart: runtime comparison, plain text vs. RCFile]


Cost based optimizations

Optimization decisions are based on your query/data.
Often an iterative process: run query → measure → tune

Cost based optimization - Aggregation

Hash-based aggregation runs inside the map task, right after the map logic.

Use pig.exec.mapPartAgg=true to enable.

Cost based optimization - Hash Agg.

Auto-off feature:
- Switches off HBA if the output reduction is not good enough

Configuring Hash Agg:
- Configure the auto-off feature: pig.exec.mapPartAgg.minReduction
- Configure the memory used: pig.cachedbag.memusage

Cost based optimization - Join

Use the appropriate join algorithm:
- Skew on the join key → skew join
- One input fits in memory → FR (fragment-replicate) join
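Both algorithms are selected with the `using` clause (relation names are illustrative):

```pig
-- skew join: handles a skewed distribution of join keys
J1 = join big by key, other by key using 'skewed';

-- fragment-replicate (FR) join: the last-listed input ('small')
-- must fit in memory; it is replicated to every map task
J2 = join big by key, small by key using 'replicated';
```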


Cost based optimization - MR tuning

Tune MR parameters to reduce IO:
- Control spills using the map-side sort parameters
- Tune the reduce shuffle/sort-merge parameters
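For example, these Hadoop 1.x parameters (an assumption for illustration; the property names differ in later Hadoop versions) control map-side sort spills and the shuffle merge:

```properties
# map-side sort buffer in MB; a larger buffer means fewer spills
io.sort.mb=200
# fraction of the buffer that triggers a spill to disk
io.sort.spill.percent=0.80
# number of streams merged at once during the sort/shuffle merge
io.sort.factor=100
```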


Parallelism of reduce tasks


Number of reduce slots = 6
Factors affecting runtime:
- Cores simultaneously used / skew
- Cost of having additional reduce tasks

[Chart: runtime vs. number of reduce tasks: 4, 6, 8, 24, 48, 256]
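Reduce parallelism can be set per operator or script-wide in Pig Latin:

```pig
-- per operator
B = group A by $0 parallel 6;

-- script-wide default for all reduce-side operators
set default_parallel 6;
```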


Cost based optimization - keep data sorted

- Frequent join operations on the same keys
- Keep the data sorted on those keys:
  - Use merge join
  - Optimized group on sorted keys
- Works with a few load functions; needs an additional interface implementation
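With both inputs already sorted on the join key, merge join runs map-side and avoids the shuffle:

```pig
-- A and B must both be sorted on 'key'
C = join A by key, B by key using 'merge';
```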


Optimizations for sorted data

[Diagram: data sorted once (Sort1, Sort2) feeds both Join 1 and Join 2]


Future Directions

Optimize using stats:
- Using historical stats with HCatalog
- Sampling


Questions?

