TamingHadoop Pig

Taming Hadoop
Pig Hands-on Labs

Running tutorial in local mode
$ cd /home/hadoop/training/pig/wordcount
$ pig -x local

Word count
List the files
grunt> fs -ls;
Load input.txt from the local directory
grunt> A = load './input.txt';
grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
grunt> C = group B by word;
grunt> D = foreach C generate COUNT(B), group;
grunt> store D into './wordcount';
grunt> dump D;

Running tutorial in mapred mode
$ pig -x mapred
grunt> A = load '/data/wordcount/input.txt';
grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
grunt> C = group B by word;
grunt> D = foreach C generate COUNT(B), group;
grunt> store D into '/data/wordcount/output';
grunt> fs -cat /data/wordcount/output/part*;
grunt> dump D;

Compute Average Dividend
Load data from HDFS into the relation dividends
Group the data by symbol
Compute the average dividend for each symbol group
Store the results in /outputavg
Display the result
dividends = load '/data/dividends' as (exchange:chararray, symbol:chararray,
date:chararray, dividend:float);
grouped = group dividends by symbol;
avg = foreach grouped generate group, AVG(dividends.dividend);
store avg into '/outputavg';
fs -cat /outputavg/part*;

NYSE daily total trade estimation
Create relation daily using NYSE_daily
For each record in the daily compute volume * close
Display the results
daily = load '/data/nyse' as
(exchange:chararray, symbol:chararray,
date:chararray, open:float,
high:float,low:float,close:float,volume:int,adj_close:float);
rough = foreach daily generate volume * close;
dump rough;

Group daily stocks by symbol
Instead of loading every field into the relation, we can load only the
leading fields that we need (here the first two).
Group daily by stock symbol
Store the result in /by_group
Display the contents of the part files
daily = load '/data/nyse' as (exchange, stock);
grpd = group daily by stock;
store grpd into '/by_group';
fs -cat /by_group/part*;
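
As a rough sketch of what the two-field load amounts to, the snippet below repeats the lab's load and then shows a positional projection that should yield the same two fields. The aliases raw, slim and grpd2 are made up for illustration and are not part of the lab. It assumes the undeclared trailing fields are simply dropped, which is worth confirming with describe on your own installation.

```pig
-- Same load as in the lab step: only the first two fields are declared.
daily = load '/data/nyse' as (exchange, stock);
describe daily;   -- should list only exchange and stock (bytearray by default)

-- Alternative sketch (hypothetical aliases): load without a schema,
-- then keep the first two fields by position.
raw   = load '/data/nyse';
slim  = foreach raw generate $0 as exchange, $1 as stock;
grpd2 = group slim by stock;
```

Either way, the relation handed to group carries only the fields the grouping needs.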

Order Stocks by date
Create relation daily
Order records in daily by date
Display results
daily = load '/data/nyse' as (exchange:chararray, symbol:chararray,
date:chararray, open:float, high:float, low:float, close:float,
volume:int, adj_close:float);
bydate = order daily by date;
dump bydate;

No schema join example
Pig's knowledge of the schema can change at different points in a script.
Because Pig does not know the schema of daily, it cannot know the schema of
the join of divs and daily.
divs = load '/data/dividends' as (exchange, stock_symbol, date, dividends);
daily = load '/data/nyse';
jnd = join divs by stock_symbol, daily by $1;
dump jnd;
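
One way to watch the schema knowledge change is to run describe at each step. The sketch below reuses the lab's statements. The second half, with the aliases daily2 and jnd2 and a declared schema for the daily data, is an assumption added for contrast and is not part of the lab.

```pig
divs  = load '/data/dividends' as (exchange, stock_symbol, date, dividends);
daily = load '/data/nyse';          -- no schema declared
describe divs;                      -- four named fields, bytearray by default
describe daily;                     -- Pig reports that the schema is unknown
jnd   = join divs by stock_symbol, daily by $1;
describe jnd;                       -- unknown as well, since daily's side is unknown

-- Hypothetical variant: once daily has a schema, the join has one too.
daily2 = load '/data/nyse' as (exchange, symbol, date, open, high, low, close,
                               volume, adj_close);
jnd2   = join divs by stock_symbol, daily2 by symbol;
describe jnd2;                      -- fields prefixed with divs:: and daily2::
```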

Transaction count
Count the transactions for each stock symbol
Load data into daily
Group daily records by stock symbol
Count the records in each stock group
Display the results
daily = load '/data/nyse' as (exchange, stock);
grpd = group daily by stock;
cnt = foreach grpd generate group, COUNT(daily);
dump cnt;

Operations using foreach
daily = load '/data/nyse';
calcs = foreach daily generate $7 / 1000000, $3 * 100.0, SUBSTRING($0, 0, 1),
$6 - $3;
dump calcs;

Getting unique records

We can get unique records from a relation using the distinct operator.

daily = load '/data/nyse' as (exchange:chararray, symbol:chararray);
uniq = distinct daily;
dump uniq;
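
distinct compares whole records, so to get the unique values of a single field the usual approach is to project that field first. A minimal sketch, reusing the lab's load (the aliases symbols and uniqsym are illustrative and not part of the lab):

```pig
daily   = load '/data/nyse' as (exchange:chararray, symbol:chararray);
symbols = foreach daily generate symbol;   -- keep only the field of interest
uniqsym = distinct symbols;                -- one record per distinct symbol
dump uniqsym;
```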

Limit operator
Use the limit operator to limit the number of records returned
divs = load '/data/dividends';
first10 = limit divs 10;
dump first10;
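
limit on its own does not promise which records come back, so a common pattern is to sort first. A minimal sketch, assuming the fourth field of /data/dividends holds a numeric dividend (the typed schema and the aliases typed, srtd and top10 are assumptions for illustration, not part of the lab):

```pig
typed = load '/data/dividends' as (exchange:chararray, symbol:chararray,
                                   date:chararray, dividends:float);
srtd  = order typed by dividends desc;   -- largest dividends first
top10 = limit srtd 10;                   -- first 10 records of the sorted relation
dump top10;
```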

Explore Pig Join operators
Join selects records from one input to put together with records from another
input. This is done by indicating keys for each input. When those keys are
equal, the two rows are joined. Records for which no match is found are
dropped.

Data Overview
Historical NYSE stock data from 1970 to 2010, including daily open, close,
low, high and trading volume figures. Data is organized alphabetically by
ticker symbol.

NYSE dividends Data set
exchange  symbol  date        dividend
NYSE      CPO     2009-12-30  0.14
NYSE      CCS     2009-10-28  0.414
NYSE      CIF     2009-04-13  0.022

NYSE daily Data set
exchange  symbol  date        open   high   low    close  volume  adj_close
NYSE      CPO     2009-12-31  29.90  30.00  29.21  29.23  225300  29.23
NYSE      CSL     2009-05-26  22.69  23.93  22.50  23.51  361100  23.28
NYSE      CMP     2009-10-29  63.56  65.67  62.93  64.89  918000  64.54

Use cases
Join by key
Self Join
Join by two keys
Left Outer Join

1. Join daily stocks and dividends by key
Load NYSE_daily
Load NYSE_dividends
Join daily and divs by symbol.
When keys are equal the two rows are joined.
Display the result
daily = load '/data/nyse' as (exchange, symbol, date, open, high, low, close,
volume, adj_close);
divs = load '/data/dividends' as (exchange, symbol, date, dividends);
jnd = join daily by symbol, divs by symbol;
dump jnd;

2. Self join
Load relation divs1 using NYSE_dividends
Load relation divs2 using NYSE_dividends
Join divs1 and divs2 relations by symbol
divs1 = load '/data/dividends' as (exchange:chararray, symbol:chararray,
date:chararray, dividends);
divs2 = load '/data/dividends' as (exchange:chararray, symbol:chararray,
date:chararray, dividends);
jnd = join divs1 by symbol, divs2 by symbol;

Pig does a cross product between the records from both inputs. To minimize
memory usage, it has MapReduce order the records coming into the reducer
using the input annotation it added in the map phase. Thus all of the records
for the left input arrive first. Pig caches these in memory. All of the
records for the right input arrive second. As each of these records arrives,
it is crossed with each record from the left side to produce an output
record.
Filter records by date and dividend
increased = filter jnd by divs1::date < divs2::date and
divs1::dividends < divs2::dividends;
dump increased;

join preserves the names of the fields of the inputs passed to it. It also
prepends the name of the relation the field came from, followed by a ::.
For example, describing the join of daily and divs from use case 1 gives:
describe jnd;
jnd: {daily::exchange: bytearray,daily::symbol: bytearray,daily::date:
bytearray,daily::open: bytearray,daily::high: bytearray,daily::low:
bytearray,daily::close: bytearray,daily::volume: bytearray,daily::adj_close:
bytearray,divs::exchange: bytearray,divs::symbol: bytearray,divs::date:
bytearray,divs::dividends: bytearray}
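
The reduce-side behaviour described in this section (left input cached in memory, right input streamed) suggests putting the input with more records per key on the right. A minimal sketch, assuming the daily data has many more records per symbol than the dividends data. That seems likely from the data overview, but the lab does not state it.

```pig
divs  = load '/data/dividends' as (exchange, symbol, date, dividends);
daily = load '/data/nyse' as (exchange, symbol, date, open, high, low, close,
                              volume, adj_close);
-- divs (assumed fewer records per symbol) is on the left and gets cached;
-- daily (assumed more records per symbol) is on the right and gets streamed.
jnd = join divs by symbol, daily by symbol;
```

The joined records are the same whichever side comes first. Only the field order and the amount cached per key change.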

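3. Join NYSE daily and NYSE dividends by 2 keys
Load daily and divs using NYSE_daily and NYSE_dividends
Join the two relations using symbol and date
When both keys are equal the two rows are joined.
daily = load '/data/nyse' as (exchange, symbol, date, open, high, low, close,
volume, adj_close);
divs = load '/data/dividends' as (exchange, symbol, date, dividends);
jnd = join daily by (symbol, date), divs by (symbol, date);
dump jnd;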

4. Left outer join
Load daily and divs using NYSE_daily and NYSE_dividends
In a left outer join, records from the left side are included even
when they do not have a match on the right side.
daily = load '/data/nyse' as (exchange, symbol, date, open, high, low, close,
volume, adj_close);
divs = load '/data/dividends' as (exchange, symbol, date, dividends);
jnd = join daily by (symbol, date) left outer, divs by (symbol, date);
dump jnd;
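
Unmatched left-side records can be picked out of a left outer join by testing a right-side field for null. A minimal sketch built on the same loads (the alias nodiv is illustrative and not part of the lab). It assumes symbol is never null in the dividends data itself.

```pig
daily = load '/data/nyse' as (exchange, symbol, date, open, high, low, close,
                              volume, adj_close);
divs  = load '/data/dividends' as (exchange, symbol, date, dividends);
jnd   = join daily by (symbol, date) left outer, divs by (symbol, date);
-- daily records with no dividend on that date carry nulls in every divs:: field.
nodiv = filter jnd by divs::symbol is null;
dump nodiv;
```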
