
Making Pig Fly

Optimizing Data Processing on Hadoop


Daniel Dai (@daijy)
Thejas Nair (@thejasn)

Hortonworks Inc. 2011

Page 1

What is Apache Pig?


Pig Latin, a high-level data processing language.

An engine that executes Pig Latin locally or on a Hadoop cluster.

Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/


Architecting the Future of Big Data

Pig-latin example
Query: Get the list of web pages visited by users whose age is between 20 and 29 years.

USERS = load 'users' as (uid, age);
USERS_20s = filter USERS by age >= 20 and age <= 29;
PVs = load 'pages' as (url, uid, timestamp);
PVs_u20s = join USERS_20s by uid, PVs by uid;


Why Pig?
Faster development
- Fewer lines of code
- Don't re-invent the wheel

Flexible
- Metadata is optional
- Extensible
- Procedural programming

Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/



Pig optimizations

Ideally, the user should not have to bother.

Reality:
- Pig is still young and immature
- Pig does not have the whole picture:
  - Cluster configuration
  - Data histogram

Pig philosophy: Pig is docile


Pig optimizations
What Pig does for you:
- Safe transformations of the query to optimize it
- Optimized operations (join, sort)

What you do:
- Organize the input in an optimal way
- Optimize the Pig Latin query
- Tell Pig which join/group algorithm to use


Rule based optimizer

Column pruner
Push up filter
Push down flatten
Push up limit
Partition pruning
Global optimizer


Column Pruner
Pig will do column pruning automatically:

A = load 'input' as (a0, a1, a2);
B = foreach A generate a0+a1;
C = order B by $0;
Store C into 'output';

Pig will prune a2 automatically.

Cases where Pig will not do column pruning automatically:
- No schema specified in the load statement

DIY:
A = load 'input';
B = order A by $0;
C = foreach B generate $0+$1;
Store C into 'output';

becomes:

A = load 'input';
A1 = foreach A generate $0, $1;
B = order A1 by $0;
C = foreach B generate $0+$1;
Store C into 'output';

Column Pruner
Another case where Pig does not do column pruning automatically:
- Pig does not keep track of unused columns after grouping
DIY:
A = load 'input' as (a0, a1, a2);
B = group A by a0;
C = foreach B generate SUM(A.a1);
Store C into 'output';

becomes:

A = load 'input' as (a0, a1, a2);
A1 = foreach A generate $0, $1;
B = group A1 by a0;
C = foreach B generate SUM(A1.a1);
Store C into 'output';

Push up filter
Pig splits the filter condition before pushing it up:

- Original query: inputs A and B are joined, then filtered by a0>0 && b0>10
- Split filter condition: the single filter becomes two filters, a0>0 and b0>10, still after the join
- Push up filter: a0>0 moves before the join on input A, and b0>10 moves before the join on input B
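A minimal sketch of this rewrite in Pig Latin (relation and field names are illustrative, not from the original query):

```pig
A = load 'A' as (a0, a1);
B = load 'B' as (b0, b1);
J = join A by a1, B by b1;
F = filter J by a0 > 0 and b0 > 10;

-- Pig evaluates the script as if the filters had been written before the join:
A1 = filter A by a0 > 0;
B1 = filter B by b0 > 10;
J1 = join A1 by a1, B1 by b1;
```

Filtering before the join shrinks both join inputs, so less data is shuffled.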

Other push up/down


Push down flatten:
- Load → Flatten → Order becomes Load → Order → Flatten
- Flatten multiplies the number of records, so sorting before flattening sorts fewer records

Push up limit:
- Limit is moved as early as possible, so downstream operators process fewer records

Partition pruning
Prune unnecessary partitions entirely.

- Without pushdown: HCatLoader reads the 2010, 2011, and 2012 partitions, then Filter (year>=2011) discards 2010
- With pushdown: HCatLoader (year>=2011) reads only the 2011 and 2012 partitions
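A sketch of a query that benefits from partition pruning (the table name is illustrative; the loader class name assumes the HCatalog of this era):

```pig
A = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();
-- the filter on the partition column is pushed into the loader,
-- so only the 2011 and 2012 partitions are ever read
B = filter A by year >= 2011;
```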

Intermediate file compression


A Pig script compiles into a chain of MapReduce jobs; each job writes its output to a Pig temp file that the next job reads:

map 1 → reduce 1 → Pig temp file → map 2 → reduce 2 → Pig temp file → map 3 → reduce 3

Enable temp file compression

Pig temp files are not compressed by default:
- Issues with snappy (HADOOP-7990)
- LZO: not under the Apache license

Enable LZO compression:
- Install LZO for Hadoop
- In conf/pig.properties:
  pig.tmpfilecompression = true
  pig.tmpfilecompression.codec = lzo

With LZO: up to >90% disk saving and up to 4x query speed-up.

Multiquery

Combine two or more map/reduce jobs into one:

Load → {Group by $0, Group by $1, Group by $2} → Foreach → Store (one store per group)

- Happens automatically
- Cases where we want to control multiquery: it can combine too many jobs

Control multiquery

Disable multiquery
Command line option: -M

Using exec to mark the boundary:

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, COUNT(A);
Store C0 into 'output0';
B1 = group A by $1;
C1 = foreach B1 generate group, COUNT(A);
Store C1 into 'output1';
exec
B2 = group A by $2;
C2 = foreach B2 generate group, COUNT(A);
Store C2 into 'output2';


Implement the right UDF

Algebraic UDF
- Implements three stages: Initial, Intermediate, and Final
- Map runs Initial, the Combiner runs Intermediate, Reduce runs Final

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, SUM(A);
Store C0 into 'output0';

Implement the right UDF

Accumulator UDF
- Reduce-side UDF; normally takes a bag
- Benefit: big bags are passed in batches, avoiding excessive memory use
- Batch size: pig.accumulative.batchsize=20000

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, my_accum(A);
Store C0 into 'output0';

my_accum extends Accumulator {
    public void accumulate(Tuple batch) {
        // take one batch of the bag
    }
    public Value getValue() {
        // called after all batches are processed
    }
}

Memory optimization

Control bag size on the reduce side:
- MapReduce: reduce(Text key, Iterator<Writable> values, …)
- If the bag size exceeds a threshold, spill to disk
- Control the bag size (one bag per join/cogroup input) to fit in memory if possible:
  pig.cachedbag.memusage=0.2

Optimization starts before pig

Input format
Serialization format
Compression


Input format - Test Query

> searches = load 'aol_search_logs.txt' using PigStorage() as (ID, Query, …);
> search_thejas = filter searches by Query matches '.*thejas.*';
> dump search_thejas;
(1568578, thejasminesupperclub, …)


Input formats
[Chart: runtime (sec) for PigStorage, LzoPigStorage, PigStorage with types, and AvroStorage (has types)]


Columnar format

RCFile
- Columnar format for a group of rows
- More efficient if you query a subset of columns


Tests with RCFile

Tests with load + project + a filter that drops all records.
- Using HCatalog, with compression and types

Test 1: project 1 out of 5 columns
Test 2: project all 5 columns


RCFile test results

[Chart: runtime comparison, plain text vs. RCFile]


Cost based optimizations

Optimization decisions are based on your query/data.
Often an iterative process: run query → measure → tune

Cost based optimization - Aggregation

Hash-based aggregation runs inside the map task, right after the map logic.

Use pig.exec.mapPartAgg=true to enable.

Cost based optimization - Hash Agg.

Auto-off feature:
- Switches off HBA if the output reduction is not good enough

Configuring Hash Agg:
- Configure the auto-off feature: pig.exec.mapPartAgg.minReduction
- Configure the memory used: pig.cachedbag.memusage

Cost based optimization - Join

Use the appropriate join algorithm:
- Skew on the join key → skew join
- One input fits in memory → FR (fragment-replicate) join
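Both algorithms are selected with the `using` clause (relation names are illustrative):

```pig
-- skew join: handles a skewed distribution of join keys
J1 = join big by key, other by key using 'skewed';

-- fragment-replicate (FR) join: the last-listed input ('small')
-- must fit in memory; it is replicated to every map task
J2 = join big by key, small by key using 'replicated';
```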


Cost based optimization - MR tuning

Tune MR parameters to reduce IO:
- Control spills using the map-side sort parameters
- Tune the reduce shuffle/sort-merge parameters
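For example, these Hadoop 1.x parameters (an assumption for illustration; the property names differ in later Hadoop versions) control map-side sort spills and the shuffle merge:

```properties
# map-side sort buffer in MB; a larger buffer means fewer spills
io.sort.mb=200
# fraction of the buffer that triggers a spill to disk
io.sort.spill.percent=0.80
# number of streams merged at once during the sort/shuffle merge
io.sort.factor=100
```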


Parallelism of reduce tasks


Number of reduce slots = 6
Factors affecting runtime:
- Cores simultaneously used / skew
- Cost of having additional reduce tasks

[Chart: runtime vs. number of reduce tasks: 4, 6, 8, 24, 48, 256]
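Reduce parallelism can be set per operator or script-wide in Pig Latin:

```pig
-- per operator
B = group A by $0 parallel 6;

-- script-wide default for all reduce-side operators
set default_parallel 6;
```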


Cost based optimization - keep data sorted

- Frequent join operations on the same keys
- Keep the data sorted on those keys:
  - Use merge join
  - Optimized group on sorted keys
- Works with a few load functions; needs an additional interface implementation
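With both inputs already sorted on the join key, merge join runs map-side and avoids the shuffle:

```pig
-- A and B must both be sorted on 'key'
C = join A by key, B by key using 'merge';
```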


Optimizations for sorted data

[Diagram: data sorted once (Sort1, Sort2) feeds both Join 1 and Join 2]


Future Directions

Optimize using stats:
- Using historical stats with HCatalog
- Sampling


Questions?

