Objective
The objective of this Pig tutorial is to get you up and running Pig scripts on a real-world dataset stored in Hadoop.
Prerequisites
The following are the prerequisites for setting up Pig and running Pig scripts:
You should have the latest stable build of Hadoop (as of writing this article, 1.0.3). To install Hadoop, see my previous blog article on Hadoop Setup.
Your machine should have Java 1.6 installed.
It is assumed you have basic knowledge of Java programming and SQL.
Basic knowledge of Linux will help you understand many of the Linux commands used in the tutorial.
Download the Book-Crossing Dataset. This is the data set we will use. (An alternative link to the dataset is on the GitHub page of the Tutorials.)
Setting up Pig
Platform
This Pig tutorial assumes Linux/Mac OS X. If you are using Windows, please install Cygwin; it is required for shell support in addition to the software listed above.
Procedure
Download a stable tarball (for our tutorial we used pig-0.10.0.tar.gz (~50 MB), which works with Hadoop 0.20.X, 1.0.X and 0.23.X) from one of the Apache download mirrors.
Unpack the tarball in the directory of your choice, using the following command:
$ tar -xzvf pig-x.y.z.tar.gz
Set the environment variable PIG_HOME to point to the installation directory for convenience:
You can either do
$ cd pig-x.y.z
$ export PIG_HOME=`pwd`
or set PIG_HOME in $HOME/.profile so it will be set every time you login.
Add the following lines to it:
$ export PIG_HOME=<path_to_pig_home_directory>
e.g.
$ export PIG_HOME=/Users/Work/pig-0.10.0
$ export PATH=$HADOOP_HOME/bin:$PIG_HOME/bin:$PATH
Set the environment variable JAVA_HOME to point to the Java installation directory, which Pig uses internally.
$ export JAVA_HOME=<Java_installation_directory>
Execution Modes
Pig has two modes of execution: local mode and MapReduce mode.
Local Mode
Local mode is usually used to verify and debug Pig queries and/or scripts on smaller datasets that a single machine can handle. It runs in a single JVM and accesses the local filesystem. To run in local mode, pass local to the -x or -exectype parameter when starting Pig. This starts the interactive shell called Grunt:
http://www.orzota.com/pig-tutorial-for-beginners/
3/20/2015 Pig Tutorial for Beginners | Orzota
$ pig -x local
grunt>
MapReduce Mode
In this mode, Pig translates the queries into MapReduce jobs and runs them on the Hadoop cluster. This can be a pseudo-distributed or a fully distributed cluster.
First check the compatibility of the Pig and Hadoop versions being used. The compatibility details are given on the Pig release page (for our tutorial, the Hadoop version is 1.0.3 and the Pig version is 0.10.0, which is compatible with the Hadoop version we are using).
First, export the variable PIG_CLASSPATH to add Hadoop conf directory.
$ export PIG_CLASSPATH=$HADOOP_HOME/conf/
After exporting the PIG_CLASSPATH, run the pig command, as shown below
$ pig
[main] INFO org.apache.pig.Main - Apache Pig version 0.10.0 (r1328203)
compiled Apr 19 2012, 22:54:12
[main] INFO org.apache.pig.Main - Logging error messages to:
/Users/varadmeru/pig_1351858332488.log
[main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
Connecting to hadoop file system at: hdfs://localhost:9000
[main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
Connecting to mapreduce job tracker at: localhost:9001
grunt>
You can see the log reports from Pig stating the filesystem and the jobtracker it connected to. Grunt is an interactive shell for your Pig queries. You can run Pig programs in three ways: via a script, via the Grunt shell, or by embedding the script into Java code. Running on the interactive shell is shown in the Problem section. To run a batch of Pig statements, it is recommended to place them in a single script file and execute them in batch mode.
Executing Scripts in Batch Mode
Running in Local mode:
$ pig -x local BookXGroupByYear.pig
Note that Pig, in MapReduce mode, reads files from HDFS only, and stores the results back to HDFS.
For running Pig script file in MapReduce mode:
$ pig -x mapreduce BookXGroupByYear.pig
OR
$ pig BookXGroupByYear.pig
It's a good practice to have the ".pig" extension on the file for clarity and maintainability. The BookXGroupByYear.pig file is available on the GitHub page of the Tutorials for your reference.
Now we focus on solving a simple but real-world use case with Pig. This is the same problem that was solved in the previous blog articles (Step-by-step MapReduce Programming using Java, and Hive for Beginners using SQL-like queries).
Problem
The problem we are trying to solve through this tutorial is to find the frequency of books published each year. Our input data set (the file BX-Books.csv) is a CSV file. Some sample rows:
"ISBN";"Book-Title";"Book-Author";"Year-Of-Publication";"Publisher";"Image-URL-S";"Image-URL-M";"Image-URL-L"
"0195153448";"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press";"http://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg";
"http://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg";"http://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg"
…
"0002005018";"Clara Callan";"Richard Bruce Wright";"2001";"HarperFlamingo Canada";"http://images.amazon.com/images/P/0002005018.01.THUMBZZZ.jpg";
"http://images.amazon.com/images/P/0002005018.01.MZZZZZZZ.jpg";"http://images.amazon.com/images/P/0002005018.01.LZZZZZZZ.jpg"
The first row is the header row. The other rows are sample records from the file. Our objective is to find the
frequency of books published each year.
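Before turning to Pig, it helps to see the computation itself in miniature. Here is a minimal Python sketch of the year-frequency count (purely illustrative and not part of the tutorial's toolchain; the titles and years are taken from the sample rows above):

```python
from collections import Counter

# (title, year) pairs taken from the dataset rows shown above
books = [
    ("Classical Mythology", "2002"),
    ("Clara Callan", "2001"),
    ("The Mummies of Urumchi", "1999"),
]

# Count how many books were published in each year,
# printed in the year:count shape the Pig flow produces
counts = Counter(year for _, year in books)
for year, count in sorted(counts.items()):
    print(f"{year}:{count}")
```

Pig will perform exactly this grouping and counting, but distributed across the cluster via MapReduce.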
Since our data is not cleansed and might give us erroneous results (some fields contain embedded semicolons, which clash with the ";" field delimiter), we clean it with the following commands:
$ cd /Users/Work/Data/BX-CSV-Dump
$ sed 's/&;/&/g' BX-Books.csv | sed -e '1d' | sed 's/;/$$$/g' | sed 's/"$$$"/";"/g' > BX-BooksCorrected.txt
The sed commands replace the pattern "&;" with "&" and remove the first line (the header line); if we don't remove the header line, it will be processed as part of the data, which it isn't. They then replace every ";" with "$$$" and restore the actual field delimiters (the "$$$" sequences that appear between quotes) back to ";", so that semicolons embedded inside field content can no longer be mistaken for delimiters.
All the above steps are required to cleanse the data and give accurate results. For example, the record
"0393045218";"The Mummies of Urumchi;";"E. J. W. Barber";"1999";"W. W. Norton &; Company";
"http://images.amazon.com/images/P/0393045218.01.THUMBZZZ.jpg";
"http://images.amazon.com/images/P/0393045218.01.MZZZZZZZ.jpg";
"http://images.amazon.com/images/P/0393045218.01.LZZZZZZZ.jpg"
is changed to
"0393045218";"The Mummies of Urumchi$$$";"E. J. W. Barber";"1999";"W. W. Norton & Company";
"http://images.amazon.com/images/P/0393045218.01.THUMBZZZ.jpg";
"http://images.amazon.com/images/P/0393045218.01.MZZZZZZZ.jpg";
"http://images.amazon.com/images/P/0393045218.01.LZZZZZZZ.jpg"
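The same cleansing logic can be sketched in Python for a single record (an illustration only; the tutorial itself uses sed). The input string is the sample row shown above, with the URLs omitted for brevity:

```python
def clean_line(line):
    """Mirror the sed pipeline on one record:
    1. replace the stray pattern '&;' with '&'
    2. turn every ';' into '$$$'
    3. restore the real field delimiters '"$$$"' back to '";"'
    Semicolons embedded in field content stay encoded as '$$$'.
    """
    line = line.replace('&;', '&')
    line = line.replace(';', '$$$')
    line = line.replace('"$$$"', '";"')
    return line

record = ('"0393045218";"The Mummies of Urumchi;";"E. J. W. Barber";'
          '"1999";"W. W. Norton &; Company"')
print(clean_line(record))
# "0393045218";"The Mummies of Urumchi$$$";"E. J. W. Barber";"1999";"W. W. Norton & Company"
```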
Now, copy the file into Hadoop:
$ hadoop fs -mkdir input
$ hadoop fs -put /Users/Work/Data/BX-CSV-Dump/BX-BooksCorrected.txt input
Running Pig Flow using the command line:
$ pig
grunt> BookXRecords = LOAD '/user/varadmeru/input/BX-BooksCorrected1.txt'
>> USING PigStorage(';') AS (ISBN:chararray, BookTitle:chararray,
>> BookAuthor:chararray, YearOfPublication:chararray,
>> Publisher:chararray, ImageURLS:chararray, ImageURLM:chararray, ImageURLL:chararray);
2012-11-05 01:09:11,554 [main] WARN org.apache.pig.PigServer - Encountered Warning
IMPLICIT_CAST_TO_DOUBLE 1 time(s).
grunt> GroupByYear = GROUP BookXRecords BY YearOfPublication;
2012-11-05 01:09:11,810 [main] WARN org.apache.pig.PigServer - Encountered Warning
IMPLICIT_CAST_TO_DOUBLE 1 time(s).
grunt> CountByYear = FOREACH GroupByYear
>> GENERATE CONCAT((chararray)$0, CONCAT(':', (chararray)COUNT($1)));
2012-11-05 01:09:11,996 [main] WARN org.apache.pig.PigServer - Encountered Warning
IMPLICIT_CAST_TO_DOUBLE 1 time(s).
grunt> STORE CountByYear
>> INTO '/user/work/output/pig_output_bookx' USING PigStorage('\t');
The username ("work" in our example) in the STORE query depends on the Hadoop setup on your machine and the username of the Hadoop setup. The output of the above Pig run is stored in the output/pig_output_bookx folder on HDFS. It can be displayed on the screen by:
$ hadoop fs -cat output/pig_output_bookx/part-r-00000
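Each output line has the form year:count, since the CONCAT in the flow joins the year and the count with a colon. As a side note, a small Python helper could read such output back into a dict; this is an illustrative sketch assuming that format, and the sample lines are hypothetical:

```python
def parse_pig_output(lines):
    """Parse lines of the form 'year:count' into a dict of int counts."""
    result = {}
    for line in lines:
        year, _, count = line.strip().partition(':')
        result[year] = int(count)
    return result

# Hypothetical sample of what the part file might contain
sample = ["1999:1", "2001:2", "2002:1"]
print(parse_pig_output(sample))
# {'1999': 1, '2001': 2, '2002': 1}
```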
Output
The snapshot of the output of the Pig flow is shown below:
Comparison with Java MapReduce and Hive
You can see the above output and compare it with the outputs from the MapReduce code in the step-by-step MapReduce guide and from the Hive for Beginners blog post. Let's take a look at how the Pig code differs from the Java MapReduce code and the Hive query for the same solution:
Java MapReduce: the full Mapper and Reducer classes are in the step-by-step MapReduce guide.

Hive:
hive> CREATE TABLE IF NOT EXISTS BXDataSet
(ISBN STRING, BookTitle STRING, BookAuthor STRING, YearOfPublication STRING,
Publisher STRING, ImageURLS STRING, ImageURLM STRING, ImageURLL STRING)
COMMENT 'BX-Books Table'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
STORED AS TEXTFILE;
hive> LOAD DATA INPATH '/user/work/input/BX-BooksCorrected.csv'
OVERWRITE INTO TABLE BXDataSet;
hive> select yearofpublication, count(booktitle) from bxdataset group by yearofpublication;

Pig:
BookXRecords = LOAD '/user/varadmeru/input/BX-BooksCorrected1.txt'
USING PigStorage(';')
AS (ISBN:chararray, BookTitle:chararray, BookAuthor:chararray,
YearOfPublication:chararray, Publisher:chararray,
ImageURLS:chararray, ImageURLM:chararray, ImageURLL:chararray);
GroupByYear = GROUP BookXRecords BY YearOfPublication;
CountByYear = FOREACH GroupByYear
GENERATE CONCAT((chararray)$0, CONCAT(':', (chararray)COUNT($1)));
STORE CountByYear
INTO '/user/varadmeru/output/pig_output_bookx'
USING PigStorage('\t');
It is clear from the above that high-level abstractions such as Hive and Pig reduce the programming effort required, as well as the complexity of learning and writing MapReduce code. In the small example above, we reduced the lines of code from roughly 25 (for Java) to 3 (for Hive) and 4 (for Pig).
Conclusion
In this tutorial, we learned how to set up Pig and run Pig Latin queries. We solved the same problem that we previously solved with MapReduce code in the step-by-step MapReduce guide and with HiveQL in the Hive for Beginners post, and compared how the programming effort is reduced with the use of Pig Latin. Stay tuned for more exciting tutorials from the small world of Big Data.
References
1. We referred to the Pig Latin Basics and Built-in Functions guides from the Apache Pig page.
2. We referred to the Getting Started guide from the Apache Pig page.