Instead of loading all of the data into the relation, we can load only the
part of it that we need.
Group daily by stock symbol
Store the result in by_group
Display the contents of the resulting part files
daily = load '/data/nyse' as (exchange, stock);
grpd = group daily by stock;
store grpd into '/by_group';
fs -cat /by_group/part*
Order Stocks by date
Create relation daily
Order records in daily by date
Display results
daily = load '/data/nyse' as (exchange, symbol, date, open, high, low,
close:float,
volume:int, adj_close:float);
bydate = order daily by date;
dump bydate ;
No schema join example
Pig's knowledge of the schema can change at different points in a script.
Because Pig does not know the schema of daily, it cannot know the schema of
the join of divs and daily.
divs = load '/data/dividends' as (exchange, stock_symbol, date, dividends);
daily = load '/data/nyse';
jnd = join divs by stock_symbol, daily by $1;
dump jnd;
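For comparison, a minimal sketch of the same join with a schema declared for daily, so that Pig can derive the schema of jnd. The field names for daily are borrowed from the other examples in this document and are an assumption about the file layout; describe prints whatever schema Pig has for the relation.

```pig
-- Sketch: declare a schema for daily so Pig knows the schema of the join.
-- Field names for daily are assumed from the other examples in this document.
divs  = load '/data/dividends' as (exchange, stock_symbol, date, dividends);
daily = load '/data/nyse' as (exchange, symbol, date, open, high, low, close,
            volume, adj_close);
-- With a schema, the join key can be referenced by name instead of $1.
jnd   = join divs by stock_symbol, daily by symbol;
describe jnd;
```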
Transaction count
Count the transactions for each stock symbol.
Load data into daily
Group daily records by stock symbol
Count the records in each group
Display the results
daily = load '/data/nyse' as (exchange, stock);
grpd = group daily by stock;
cnt = foreach grpd generate group, COUNT(daily);
dump cnt;
Operations using foreach
daily = load '/data/nyse';
calcs = foreach daily generate $7 / 1000000, $3 * 100.0, SUBSTRING($0, 0, 1),
$6 - $3;
dump calcs;
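The positional references above are hard to read. A sketch of the same calculations with named fields, assuming the field order used in the other examples ($0 exchange, $3 open, $6 close, $7 volume). The types and the output names (volume_millions, open_scaled, exchange_initial, day_change) are illustrative assumptions, not part of the original script.

```pig
-- Sketch: same expressions as above, with named and typed fields.
-- Types and output field names are assumptions made for illustration.
daily = load '/data/nyse' as (exchange:chararray, symbol:chararray,
            date:chararray, open:float, high:float, low:float, close:float,
            volume:int, adj_close:float);
calcs = foreach daily generate
            volume / 1000000          as volume_millions,   -- was $7 / 1000000
            open * 100.0              as open_scaled,       -- was $3 * 100.0
            SUBSTRING(exchange, 0, 1) as exchange_initial,  -- was SUBSTRING($0, 0, 1)
            close - open              as day_change;        -- was $6 - $3
describe calcs;
```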
Getting unique records
Pig does a cross product between the records from both inputs. To minimize
memory usage, it has MapReduce order the records coming into the reducer
using the input annotation it added in the map phase. Thus all of the records
for the left input arrive first. Pig caches these in memory. All of the
records for the right input arrive second. As each of these records arrives,
it is crossed with each record from the left side to produce an output
record.
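The filter in the next step refers to fields as divs1:: and divs2::, which suggests that jnd is a self-join of the dividends data. The statements that build it are not shown here, so the following is only a sketch: the path, the schema, and the join key (symbol) are assumptions. Pig cannot join an alias with itself, so the same file is loaded twice under two aliases.

```pig
-- Sketch of a self-join. Path, types, and join key are assumptions.
-- The same input is loaded twice because a relation cannot be joined to itself.
divs1 = load '/data/dividends' as (exchange:chararray, symbol:chararray,
            date:chararray, dividends:float);
divs2 = load '/data/dividends' as (exchange:chararray, symbol:chararray,
            date:chararray, dividends:float);
jnd   = join divs1 by symbol, divs2 by symbol;
```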
Filter records by date and dividend
increased = filter jnd by divs1::date < divs2::date and
divs1::dividends < divs2::dividends;
The join operator preserves the names of the fields of the inputs passed to
it. It also prepends the name of the relation the field came from, followed
by ::, to each field name.
describe jnd;
jnd: {daily::exchange: bytearray,daily::symbol: bytearray,
daily::date: bytearray,daily::open: bytearray,daily::high: bytearray,
daily::low: bytearray,daily::close: bytearray,daily::volume: bytearray,
daily::adj_close: bytearray,divs::exchange: bytearray,
divs::symbol: bytearray,divs::date: bytearray,divs::dividends: bytearray}
dump increased;
3. Join NYSE daily and NYSE dividends by two keys
Load daily and divs using NYSE_daily and NYSE_dividends
Join the two relations on symbol and date
Two rows are joined only when both keys are equal.
daily = load '/data/nyse' as (exchange, symbol, date, open, high, low, close,
volume, adj_close);
divs = load '/data/dividends' as (exchange, symbol, date, dividends);
jnd = join daily by (symbol, date), divs by (symbol, date);
dump jnd;
4. Left outer join
Load daily and divs using NYSE_daily and NYSE_dividends
In a left outer join, records from the left side are included even when they
do not have a match on the right side.
daily = load '/data/nyse' as (exchange, symbol, date, open, high, low, close,
volume, adj_close);
divs = load '/data/dividends' as (exchange, symbol, date, dividends);
jnd = join daily by (symbol, date) left outer, divs by (symbol, date);
dump jnd;
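One way to see the effect of the left outer join is to select the daily records that found no match. This sketch continues from the jnd relation defined above; the alias no_div is a name introduced here for illustration. Unmatched records carry nulls in every divs:: field.

```pig
-- Sketch: continues the script above, where jnd is the left outer join.
-- Daily records with no matching dividend have null in all divs:: fields.
no_div = filter jnd by divs::symbol is null;
dump no_div;
```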