single-node Hadoop Installation Preparation

Last Updated:- 3rd Nov 2012


All software will be in D:\Software.
Verify that VM Player is installed; if not, install it.
Start VM Player and start the Hadoop VM node "master".
Log on with the following credentials:
root / root123
Create the directory /hadoop: mkdir /hadoop
This document covers the required steps for setting up a single-node Hadoop cluster using the Hadoop Distributed File System
(HDFS) on Red Hat Linux.



Red Hat Linux


Hadoop 1.0.3, released May 2012

Prerequisites
Sun Java 6. Verify Java as below:
# java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)


If Java is not installed, install it as follows: change directory to /hadoop and run the following command:
# sh /mnt/hgfs/Hadoopsw/jdk-6u17-linux-i586.bin
Configuring SSH
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use
Hadoop on it.
#su - root
# ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/root/.ssh/id_rsa):
Created directory '/home/root/.ssh'.
Your identification has been saved in /home/root/.ssh/id_rsa.
Your public key has been saved in /home/root/.ssh/id_rsa.pub.
The key fingerprint is:

9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 root@ubuntu
The key's randomart image is:
[...snipp...]
The ssh-keygen command creates an RSA key pair with an empty password.
Second, you have to enable SSH access to your local machine with this newly created key.
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
The final step is to test the SSH setup by connecting to your local machine with the root user. The step is also
needed to save your local machine's host key fingerprint to the root user's known_hosts file.
$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.

Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
[...snipp...]
# ssh localhost
Last login: Mon Dec 3 21:36:02 2012 from localhost.localdomain


Hadoop Installation Single Node

Start the VM and follow these steps:


Hadoop Installation
Extract the contents of the Hadoop package to /hadoop.
$ cd /hadoop
$ tar xzf hadoop-1.0.0.tar.gz
$ mv hadoop-1.0.0 hadoop
$ chown -R hduser:hadoop hadoop (optional)

Update $HOME/.bashrc
Add the following lines to the end of the $HOME/.bashrc file of user root. If you use a shell other than bash, you should of course
update its appropriate configuration files instead of .bashrc.
# Set Hadoop-related environment variables
export HADOOP_HOME=/hadoop/hadoop
(Change the directory as appropriate for your installation.)
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/hadoop/jdk1.6.0_17
# Add Hadoop bin/ directory to PATH


export PATH=$PATH:$HADOOP_HOME/bin
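After editing .bashrc, reload it and confirm that the hadoop command resolves (a quick sanity check; the exact version string depends on your installation):

$ source $HOME/.bashrc
$ hadoop version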

Hadoop Distributed File System (HDFS)


Configuration
Our goal in this tutorial is a single-node setup of Hadoop
hadoop-env.sh
The only required environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME.
Open /hadoop/hadoop/conf/hadoop-env.sh in the editor of your choice and set the JAVA_HOME environment variable to the Sun
JDK/JRE 6 directory.
Change

# The java implementation to use. Required.


# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

to

# The java implementation to use. Required.


export JAVA_HOME=/hadoop/jdk1.6.0_17/

conf/*-site.xml
Now we create the directory and set the required ownerships and permissions:

$ mkdir -p /app/hadoop/tmp

Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.
In file conf/core-site.xml:
<!-- In: conf/core-site.xml -->
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
<description>The name of the default file system </description>
</property>

In file conf/mapred-site.xml:
<!-- In: conf/mapred-site.xml -->
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
<description>The host and port that the MapReduce job tracker runs at.</description>
</property>

In file conf/hdfs-site.xml:

<!-- In: conf/hdfs-site.xml -->


<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. </description>
</property>

Formatting the HDFS filesystem via the NameNode


The first step to starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local
filesystem of your cluster.
Do not format a running Hadoop filesystem as you will lose all the data currently in the cluster (in HDFS).
To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command
hduser@ubuntu:~$ /hadoop/hadoop/bin/hadoop namenode -format

The output will look like this:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop namenode -format


10/05/08 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ubuntu/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo'
on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/05/08 16:59:56 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop
10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/05/08 16:59:57 INFO common.Storage: Storage directory .../hadoop-hduser/dfs/name has been successfully formatted.
10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
hduser@ubuntu:/usr/local/hadoop$

Starting your single-node cluster


Run the command:

hduser@ubuntu:~$ /hadoop/hadoop/bin/start-all.sh

This will start up a NameNode, DataNode, SecondaryNameNode, JobTracker and a TaskTracker on your machine.
The output will look like this:

hduser@ubuntu:/usr/local/hadoop$ bin/start-all.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-ubuntu.out
hduser@ubuntu:/usr/local/hadoop$

A nifty tool for checking whether the expected Hadoop processes are running is jps (part of Sun's Java since v1.5.0).
hduser@ubuntu:$ jps
(if jps is not found, go to /Software/JDK/bin and execute ./jps)
2287 TaskTracker
2149 JobTracker
1938 DataNode
2085 SecondaryNameNode
2349 Jps
1788 NameNode
You can also check with netstat if Hadoop is listening on the configured ports.
hduser@ubuntu:~$ netstat -plten | grep java
tcp        0      0 0.0.0.0:50070     0.0.0.0:*    LISTEN    1001    9236    2471/java
tcp        0      0 0.0.0.0:50010     0.0.0.0:*    LISTEN    1001    9998    2628/java
tcp        0      0 0.0.0.0:48159     0.0.0.0:*    LISTEN    1001    8496    2628/java
tcp        0      0 0.0.0.0:53121     0.0.0.0:*    LISTEN    1001    9228    2857/java
tcp        0      0 127.0.0.1:54310   0.0.0.0:*    LISTEN    1001    8143    2471/java
tcp        0      0 127.0.0.1:54311   0.0.0.0:*    LISTEN    1001    9230    2857/java
tcp        0      0 0.0.0.0:59305     0.0.0.0:*    LISTEN    1001    8141    2471/java
tcp        0      0 0.0.0.0:50060     0.0.0.0:*    LISTEN    1001    9857    3005/java
tcp        0      0 0.0.0.0:49900     0.0.0.0:*    LISTEN    1001    9037    2785/java
tcp        0      0 0.0.0.0:50030     0.0.0.0:*    LISTEN    1001    9773    2857/java

hduser@ubuntu:~$

If there are any errors, examine the log files in the /logs/ directory.
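For example (a sketch; the log file name pattern is an assumption and depends on your user name and hostname):

# ls /hadoop/hadoop/logs
# tail -n 50 /hadoop/hadoop/logs/hadoop-root-namenode-*.log   # adjust the file name to your user/host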

Stopping your single-node cluster


Run the command

hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh

to stop all the daemons running on your machine.


Exemplary output:

hduser@ubuntu:/usr/local/hadoop$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode


Lab 1 - HDFS
Lab objectives
In this lab you will practice the HDFS command-line interface.

Lab instructions
This lab has been developed as a tutorial. Simply execute the commands provided, and analyze the
results.

Basic Hadoop Filesystem commands


1. In order to work with HDFS you need to use the hadoop fs command. For example to list the /
and /app directories you need to input the following commands:

> hadoop fs -ls /
> hadoop fs -ls /app

2. There are many commands you can run within the Hadoop filesystem. For example to make the
directory test you can issue the following command:
> hadoop fs -mkdir test

Now let's see the directory we've created:


> hadoop fs -ls /
> hadoop fs -ls /user/root

You will notice that the test directory got created under the /user/root directory. This is because
as the root user, your default path is /user/root and thus if you don't specify an absolute path all
HDFS commands work out of /user/root (this will be your default working directory).
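For example, assuming the test directory created above, both of the following list the same directory:

> hadoop fs -ls test
> hadoop fs -ls /user/root/test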

3. You should be aware that you can pipe (using the | character) any HDFS command to be used
with the Linux shell. For example, you can easily use grep with HDFS by doing the following:
> hadoop fs -mkdir /user/root/test2
> hadoop fs -ls /user/root | grep test

As you can see, the grep command only returned the lines which had test in them (thus
removing the "Found x items" line and the oozie-root directory from the listing).

4. In order to move files between your regular Linux filesystem and HDFS you will likely use the put
and get commands. First, move a single file to the Hadoop filesystem.
Copy pg20417.txt from the software folder to /hadoop.
> hadoop fs -put /hadoop/pg20417.txt pg20417.txt
> hadoop fs -ls /user/root
You should now see a new file called /user/root/pg20417.txt listed. In order to view the contents of
this file we will use the -cat command as follows:
> hadoop fs -cat pg20417.txt

You should see the contents of the pg20417.txt file (that is stored in HDFS). We can also use the Linux
diff command to see if the file we put on HDFS is actually the same as the original on the local
filesystem. You can do this as follows:
> diff <( hadoop fs -cat pg20417.txt) /hadoop/pg20417.txt

Since the diff command produces no output we know that the files are the same (the diff
command prints all the lines in the files that differ).
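The counterpart of put is get, which copies a file from HDFS back to the local filesystem. A quick sketch (the local target path /tmp/pg20417.copy.txt is just an example):

> hadoop fs -get pg20417.txt /tmp/pg20417.copy.txt   # local path is an arbitrary example
> diff /tmp/pg20417.copy.txt /hadoop/pg20417.txt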

Some more Hadoop Filesystem commands


1. In order to use HDFS commands recursively you generally add an "r" to the HDFS command (in
the Linux shell this is generally done with the "-R" argument). For example, to do a recursive
listing we'll use the -lsr command rather than just -ls. Try this:
> hadoop fs -ls /user
> hadoop fs -lsr /user

2. In order to find the size of files you need to use the -du or -dus commands. Keep in mind that
these commands return the file size in bytes. To find the size of the pg20417.txt file use the
following command:
> hadoop fs -du pg20417.txt

To find the size of all files individually in the /user/root directory use the following command:
> hadoop fs -du /user/root

To find the total size of all files in the /user/root directory use the following command:
> hadoop fs -dus /user/root

3. If you would like to get more information about a given command, invoke -help as follows:
> hadoop fs -help
For example, to get help on the dus command you'd do the following:
> hadoop fs -help dus
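If you want to clean up the directories created during this lab, a hedged sketch (-rmr is the recursive delete in Hadoop 1.x; it removes the directories permanently unless HDFS trash is enabled):

> hadoop fs -rmr /user/root/test /user/root/test2   # deletes the lab directories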

------ This is the end of this lab -----

System -> Network Config -> DNS
Hostname - Activate
/etc/sysconfig/network

HDFS Admin Command

bin/hadoop dfsadmin -report


This is a dfsadmin command for reporting on each DataNode. It displays the status of the Hadoop cluster.

bin/hadoop dfsadmin -metasave hadoop.txt


This will save some of the NameNode's metadata into its log directory under the given filename (here hadoop.txt).
In this metadata, you'll find lists of blocks waiting for replication, blocks being replicated, and blocks awaiting
deletion. For replication, each block will also have a list of the DataNodes it is being replicated to. Finally, the metasave file
will also have summary statistics on each DataNode.

Go to log folder:
# cd /hadoop/hadoop/logs
# ls

# vi hadoop.txt

hadoop dfsadmin -safemode get

hadoop dfsadmin -safemode enter

hadoop dfsadmin -safemode leave

hadoop fsck /

hadoop version

Fair Scheduler

Installation
To run the fair scheduler in your Hadoop installation, you need to put it on the CLASSPATH.
Copy hadoop-fairscheduler-2.0.0-mr1-cdh4.1.0.jar from
/hadoop/hadoop-2.0.0-mr1-cdh4.1.0/contrib/fairscheduler to $HADOOP_HOME/lib using the
following command:
cp hadoop-fairscheduler-2.0.0-mr1-cdh4.1.0.jar $HADOOP_HOME/lib

Edit HADOOP_CONF_DIR/mapred-site.xml to have Hadoop use the fair scheduler:


<property>
<name>mapred.jobtracker.taskScheduler</name>
<value>org.apache.hadoop.mapred.FairScheduler</value>
</property>

Once you restart the cluster, you can check that the fair scheduler is running by going to
http://<jobtracker URL>/scheduler on the JobTracker's web UI.
A "job scheduler administration" page should be visible there.
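One way to restart just the MapReduce daemons so the new scheduler is picked up (a sketch; you could also restart the whole cluster with stop-all.sh/start-all.sh):

$HADOOP_HOME/bin/stop-mapred.sh
$HADOOP_HOME/bin/start-mapred.sh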
http://192.168.1.5:50030/scheduler
Run the MapReduce job from two telnet windows and observe the scheduler page.
Create two different folders for input and output.
Run the jobs with different input and output folders.


HDFS Commands and Web Interface

1) Verifying File System Health


a. bin/hadoop fsck /

2) HDFS Web Interface: features

Hadoop Web Interfaces


Hadoop comes with several web interfaces which are by default (see conf/core-site.xml) available at these
locations:


http://192.168.80.133:50030/ - web UI for MapReduce job tracker(s)

http://192.168.80.133:50060/ - web UI for task tracker(s)

http://192.168.80.133:50070/ - web UI for HDFS name node(s)

These web interfaces provide concise information about what's happening in your Hadoop cluster. You might want to give
them a try.
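A quick way to confirm the UIs are up from the shell (a sketch; substitute your own host/IP):

curl -s -o /dev/null -w "%{http_code}\n" http://192.168.80.133:50030/   # expect 200 when the jobtracker is up
curl -s -o /dev/null -w "%{http_code}\n" http://192.168.80.133:50060/   # expect 200 when the tasktracker is up
curl -s -o /dev/null -w "%{http_code}\n" http://192.168.80.133:50070/   # expect 200 when the namenode is up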

MapReduce Job Tracker Web Interface


The job tracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed
jobs and a job history log file. It also gives access to the local machine's Hadoop log files (the machine on which the web
UI is running).
By default, it's available at http://localhost:50030/.

A screenshot of Hadoop's Job Tracker web interface.


Task Tracker Web Interface


The task tracker web UI shows you running and non-running tasks. It also gives access to the local machine's Hadoop
log files.
By default, it's available at http://localhost:50060/.

A screenshot of Hadoop's Task Tracker web interface.

HDFS Name Node Web Interface


The name node web UI shows you a cluster summary including information about total/remaining capacity, and live and dead
nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It
also gives access to the local machine's Hadoop log files.
By default, it's available at http://localhost:50070/.

A screenshot of Hadoop's Name Node web interface.


Managing a Hadoop Cluster - Pig Tutorial

Start the Hadoop VM Master


Base directory: /hadoop
Java Installation
1. Java 1.6.x (from Sun) is installed in /usr/jre16.
2. Verify the JAVA_HOME environment variable as below; set it if it is not already set.

Pig Installation
# tar xzf pig-0.10.0.tar.gz

#Set the PIG_HOME environment variable (vi $HOME/.bashrc)

#export PIG_HOME=/hadoop/pig-0.10.0
#export PATH=$PATH:$PIG_HOME/bin

Start the Hadoop cluster.

#bash

$ cd /hadoop/pig-0.10.0/
Set exectype=local in conf/pig.properties.
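One way to do this from the shell (a sketch; you can also just edit the file in vi):

$ echo "exectype=local" >> conf/pig.properties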

$ bin/pig -x local
Enter the following commands in the Grunt shell:
log = LOAD '/hadoop/pig-0.10.0/tutorial/data/excite-small.log' AS (user, timestamp, query);
grpd = GROUP log BY user;
cntd = FOREACH grpd GENERATE group, COUNT(log);
STORE cntd INTO 'output';

# quit
The output is written to file:///hadoop/pig-0.10.0/tutorial/data/output

Results:
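To inspect the stored results from the shell (a sketch; the exact part file name may differ):

$ ls /hadoop/pig-0.10.0/tutorial/data/output/
$ cat /hadoop/pig-0.10.0/tutorial/data/output/part-* | head   # part file name is an assumption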


HBase - Tutorial

Install HBase:
1) Get the HBase software from /software.

2) Untar the software in /hadoop:

$ tar xfz hbase-0.92.1.tar.gz -C /hadoop

$ cd /hadoop/hbase-0.92.1

3) Edit conf/hbase-site.xml and set the directory for HBase to write to:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///hadoop/hbase</value>
</property>
</configuration>

By default, hbase.rootdir is set to /tmp/hbase-${user.name}

Edit conf/hbase-env.sh, uncommenting the JAVA_HOME line and pointing it to your Java install.

The installation is complete at this point.

Start Hadoop.
Start HBase:
$ ./bin/start-hbase.sh
starting Master, logging to logs/hbase-user-master-example.org.out

Shell Exercises
Connect to your running HBase via the shell.
$ ./bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version: 0.90.0, r1001068, Fri Sep 24 13:55:42 PDT 2010

hbase(main):001:0>

Create a table named test with a single column family named cf. Verify its creation by listing all tables and then insert
some values.
hbase(main):003:0> create 'test', 'cf'
0 row(s) in 1.2200 seconds
hbase(main):003:0> list 'test'
..
1 row(s) in 0.0550 seconds
hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0560 seconds
hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0370 seconds
hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0450 seconds

Above we inserted 3 values, one at a time. The first insert is at row1, column cf:a, with a value of value1. Columns in
HBase are comprised of a column family prefix -- cf in this example -- followed by a colon and then a column qualifier
suffix (a in this case).

Verify the data insert.


Run a scan of the table by doing the following:
hbase(main):007:0> scan 'test'
ROW                              COLUMN+CELL
 row1                            column=cf:a, timestamp=1288380727188, value=value1
 row2                            column=cf:b, timestamp=1288380738440, value=value2
 row3                            column=cf:c, timestamp=1288380747365, value=value3
3 row(s) in 0.0590 seconds

Get a single row as follows:
hbase(main):008:0> get 'test', 'row1'
COLUMN                           CELL
 cf:a                            timestamp=1288380727188, value=value1
1 row(s) in 0.0400 seconds

Now, disable and drop your table. This will clean up everything done above.
hbase(main):012:0> disable 'test'
0 row(s) in 1.0930 seconds
hbase(main):013:0> drop 'test'
0 row(s) in 0.0770 seconds

Exit the shell by typing exit.


hbase(main):014:0> exit

Stopping HBase
Stop your hbase instance by running the stop script.
$ ./bin/stop-hbase.sh
stopping hbase...............


Hive - Tutorial

Installing Hive
$ tar -xzvf hive-0.9.0.tar.gz -C /hadoop


Set the environment variable HIVE_HOME to point to the installation directory, in $HOME/.bashrc:
$ export HIVE_HOME=/hadoop/hive-0.9.0
Finally, add $HIVE_HOME/bin to your PATH:
$ export PATH=$HIVE_HOME/bin:$PATH


Running Hive
Hive uses Hadoop, which means:

you must have hadoop in your PATH, OR


export HADOOP_HOME=<hadoop-install-dir>

In addition, you must create /tmp and /user/hive/warehouse
(aka hive.metastore.warehouse.dir) and make them group-writable (chmod g+w) in HDFS before a table can be created in Hive.
Start the Hadoop service.
Commands to perform this setup:
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

Errors log:
/tmp/<user.name>/hive.log

Type hive
#bash
#hive

The following creates a table called invites with two columns, the first being an integer and the other a string, partitioned by the column ds:
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
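If you also want the simpler, unpartitioned table from the Hive Getting Started guide (optional), you can create it the same way:

hive> CREATE TABLE pokes (foo INT, bar STRING);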


hive> SHOW TABLES;

hive> DESCRIBE invites;

hive> LOAD DATA LOCAL INPATH '/hadoop/hive-0.9.0/examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';


hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';

(Verify the file in the HDFS filesystem.)


#hadoop fs -cat /tmp/hdfs_out/000000_0

Dropping tables:
hive> DROP TABLE invites;

# quit;

Hive - Bucket

Start the Hadoop cluster: start-all.sh


Start Hive.
Copy extend1502.log to /hadoop.
Create the table from the Hive console:
hive> CREATE TABLE weblog (mdate STRING, mtime STRING, ssitename STRING, scomputername STRING, sip STRING, csmethod STRING,
csuristem STRING, suriquery STRING, sport STRING, csusername STRING, cip STRING, csversion STRING, csUserAgent STRING, csCookie STRING,
csReferer STRING, cshost STRING, scstatus STRING, scsubstatus STRING, scwin32status STRING, scbytes STRING, csbytes STRING, timetaken STRING)
PARTITIONED BY (dt STRING) CLUSTERED BY (scomputername) INTO 96 BUCKETS;

hive> SET hive.enforce.bucketing = true;


hive > LOAD DATA LOCAL INPATH '/hadoop/extend1502.log' OVERWRITE INTO TABLE weblog PARTITION (dt='2008-08-15');
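To spot-check that the data was bucketed, you can sample a single bucket (a sketch using Hive's TABLESAMPLE clause; bucket 1 is an arbitrary choice and the clause syntax should be verified against your Hive version):

hive> SELECT * FROM weblog TABLESAMPLE(BUCKET 1 OUT OF 96 ON scomputername) s LIMIT 10;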


Input Types

Create a Java project named InputTypes and create the following classes.

package com.hp.types;
// == JobBuilder
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
public class JobBuilder {
private final Class<?> driverClass;
private final Job job;
private final int extraArgCount;
private final String extrArgsUsage;
private String[] extraArgs;
public JobBuilder(Class<?> driverClass) throws IOException {
this(driverClass, 0, "");
}
public JobBuilder(Class<?> driverClass, int extraArgCount, String extrArgsUsage) throws IOException {
this.driverClass = driverClass;
this.extraArgCount = extraArgCount;
this.job = new Job();
this.job.setJarByClass(driverClass);
this.extrArgsUsage = extrArgsUsage;
}
// vv JobBuilder
public static Job parseInputAndOutput(Tool tool, Configuration conf,
String[] args) throws IOException {
if (args.length != 2) {
printUsage(tool, "<input> <output>");


return null;
}
Job job = new Job(conf);
job.setJarByClass(tool.getClass());
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job;
}
public static void printUsage(Tool tool, String extraArgsUsage) {
System.err.printf("Usage: %s [genericOptions] %s\n\n",
tool.getClass().getSimpleName(), extraArgsUsage);
GenericOptionsParser.printGenericCommandUsage(System.err);
}
// ^^ JobBuilder
public JobBuilder withCommandLineArgs(String... args) throws IOException {
Configuration conf = job.getConfiguration();
GenericOptionsParser parser = new GenericOptionsParser(conf, args);
String[] otherArgs = parser.getRemainingArgs();
if (otherArgs.length < 2 && otherArgs.length > 3 + extraArgCount) {
System.err.printf("Usage: %s [genericOptions] [-overwrite] <input path> <output path> %s\n\n",
driverClass.getSimpleName(), extrArgsUsage);
GenericOptionsParser.printGenericCommandUsage(System.err);
System.exit(-1);
}
int index = 0;
boolean overwrite = false;
if (otherArgs[index].equals("-overwrite")) {
overwrite = true;
index++;
}
Path input = new Path(otherArgs[index++]);
Path output = new Path(otherArgs[index++]);
if (index < otherArgs.length) {
extraArgs = new String[otherArgs.length - index];
System.arraycopy(otherArgs, index, extraArgs, 0, otherArgs.length - index);
}


if (overwrite) {
output.getFileSystem(conf).delete(output, true);
}
FileInputFormat.addInputPath(job, input);
FileOutputFormat.setOutputPath(job, output);
return this;
}
public Job build() {
return job;
}
public String[] getExtraArgs() {
return extraArgs;
}
}


package com.hp.types;
// cc SmallFilesToSequenceFileConverter A MapReduce program for packaging a collection of small files as a single SequenceFile
import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

//vv SmallFilesToSequenceFileConverter
public class SmallFilesToSequenceFileConverter extends Configured
implements Tool {
static class SequenceFileMapper
extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {
private Text filenameKey;
@Override
protected void setup(Context context) throws IOException,
InterruptedException {
InputSplit split = context.getInputSplit();
Path path = ((FileSplit) split).getPath();
filenameKey = new Text(path.toString());
}
@Override
protected void map(NullWritable key, BytesWritable value, Context context)
throws IOException, InterruptedException {
context.write(filenameKey, value);


}
}
@Override
public int run(String[] args) throws Exception {
Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
if (job == null) {
return -1;
}
job.setInputFormatClass(WholeFileInputFormat.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(BytesWritable.class);
job.setMapperClass(SequenceFileMapper.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
args = new String[2];
args[0]="input";
args[1]="output"+System.currentTimeMillis();
int exitCode = ToolRunner.run(new SmallFilesToSequenceFileConverter(), args);
System.exit(exitCode);
}
}
// ^^ SmallFilesToSequenceFileConverter


package com.hp.types;
// cc WholeFileInputFormat An InputFormat for reading a whole file as a record
import java.io.IOException;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.*;

//vv WholeFileInputFormat
public class WholeFileInputFormat
extends FileInputFormat<NullWritable, BytesWritable> {
@Override
protected boolean isSplitable(JobContext context, Path file) {
return false;
}
@Override
public RecordReader<NullWritable, BytesWritable> createRecordReader(
InputSplit split, TaskAttemptContext context) throws IOException,
InterruptedException {
WholeFileRecordReader reader = new WholeFileRecordReader();
reader.initialize(split, context);
return reader;
}
}
//^^ WholeFileInputFormat


package com.hp.types;
// cc WholeFileRecordReader The RecordReader used by WholeFileInputFormat for reading a whole file as a record
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
//vv WholeFileRecordReader
class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
private FileSplit fileSplit;
private Configuration conf;
private BytesWritable value = new BytesWritable();
private boolean processed = false;
@Override
public void initialize(InputSplit split, TaskAttemptContext context)
throws IOException, InterruptedException {
this.fileSplit = (FileSplit) split;
this.conf = context.getConfiguration();
}
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
if (!processed) {
byte[] contents = new byte[(int) fileSplit.getLength()];
Path file = fileSplit.getPath();
FileSystem fs = file.getFileSystem(conf);
FSDataInputStream in = null;
try {
in = fs.open(file);


IOUtils.readFully(in, contents, 0, contents.length);


value.set(contents, 0, contents.length);
} finally {
IOUtils.closeStream(in);
}
processed = true;
return true;
}
return false;
}
@Override
public NullWritable getCurrentKey() throws IOException, InterruptedException {
return NullWritable.get();
}
@Override
public BytesWritable getCurrentValue() throws IOException,
InterruptedException {
return value;
}
@Override
public float getProgress() throws IOException {
return processed ? 1.0f : 0.0f;
}
@Override
public void close() throws IOException {
// do nothing
}
}
//^^ WholeFileRecordReader


Copy the following files:

Run the application; the output should be as follows:


Submit the job to the cluster


Comment out the path initialization as follows:
/* args = new String[2];
args[0]="input";
args[1]="output"+System.currentTimeMillis();*/

Export the jar and run the following command


#hadoop fs -mkdir smallfiles/
#hadoop fs -copyFromLocal /hadoop/data/a smallfiles/
#hadoop fs -copyFromLocal /hadoop/data/b smallfiles/
#hadoop fs -copyFromLocal /hadoop/data/c smallfiles/
#hadoop fs -copyFromLocal /hadoop/data/d smallfiles/
#hadoop fs -copyFromLocal /hadoop/data/e smallfiles/
#hadoop fs -copyFromLocal /hadoop/data/f smallfiles/

# hadoop jar /hadoop/hadoop/mytypes.jar com.hp.types.SmallFilesToSequenceFileConverter -D mapred.reduce.tasks=2 smallfiles outputit


Verify the result using the HDFS view and the JobTracker web UI.
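Alternatively, from the command line (a sketch; the part file names depend on the number of reduce tasks, and since the values are raw bytes you should expect hex output):

# hadoop fs -ls outputit
# hadoop fs -text outputit/part-r-00000 | head   # part file name is an assumption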


Hadoop in cluster mode

From two single-node clusters to a multi-node cluster. We will build a multi-node cluster using two Red Hat boxes in
this tutorial. The best way to do this for starters is to install, configure and test a local Hadoop setup for each of the two
RH boxes, and in a second step to merge these two single-node clusters into one multi-node cluster in which one RH
box will become the designated master (but also act as a slave with regard to data storage and processing), and the other
box will become only a slave. It's much easier to track down any problems you might encounter due to the reduced
complexity of doing a single-node cluster setup first on each machine.

Tutorial approach and structure.


Prerequisites
Configuring single-node clusters first in both the VM
Use the earlier tutorial.

Now that you have two single-node clusters up and running, we will modify the Hadoop configuration to make one RH
box the master (which will also act as a slave) and the other RH box a slave.
We will call the designated master machine just the master from now on and the slave-only machine the

slave. We will also give the two machines these respective hostnames in their networking setup, most notably
in /etc/hosts. If the hostnames of your machines are different (e.g. node01) then you must adapt the
settings in this tutorial as appropriate.
Shut down each single-node cluster with /bin/stop-all.sh before continuing, if you haven't done so already.


Copy the earlier VM and rename it HadoopSlave.


Generate a new MAC address for the new VM as follows:


Start both VMs.


Change the new VM's host name:
System -> Administration -> Network -> DNS -> hadoopslave

Update /etc/hosts


Verify the IP as below:


Networking
Both machines must be able to reach each other over the network.
Update /etc/hosts on both machines with the following lines:
# vi /etc/hosts (for master AND slave)
10.72.47.42    master
10.72.47.27    slave


SSH access
The root user on the master must be able to connect
a) to its own user account on the master, i.e. ssh master in this context and not necessarily ssh localhost, and
b) to the root user account on the slave via a password-less SSH login.
You have to add root@master's public SSH key (which should be in $HOME/.ssh/id_rsa.pub) to
the authorized_keys file of root@slave (in this user's $HOME/.ssh/authorized_keys).
You can do this manually or use the following SSH command:
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub root@slave

This command will prompt you for the login password for user root on slave, then copy the public SSH key for you,
creating the correct directory and fixing the permissions as necessary.


The final step is to test the SSH setup by connecting with user root from the master to the user account root on
the slave. The step is also needed to save the slave's host key fingerprint to root@master's known_hosts file.
So, connecting from master to master:
$ ssh master


And from master to slave.


root@master:~$ ssh slave


Hadoop Cluster Overview

How the final multi-node cluster will look:


The master node will run the master daemons for each layer: NameNode for the HDFS storage layer, and JobTracker
for the MapReduce processing layer. Both machines will run the slave daemons: DataNode for the HDFS layer, and
TaskTracker for the MapReduce processing layer. Basically, the master daemons are responsible for coordination and
management of the slave daemons while the latter will do the actual data storage and data processing work.

Masters vs. Slaves


Typically one machine in the cluster is designated as the NameNode and another machine as the JobTracker, exclusively.
These are the actual master nodes. The rest of the machines in the cluster act as both DataNode and TaskTracker.
These are the slaves or worker nodes.

Configuration
conf/masters (master only)
On master, update conf/masters so that it looks like this:
master


conf/slaves (master only)


On master, update conf/slaves so that it looks like this:
master
slave


conf/*-site.xml (all machines)


Note: As of Hadoop 0.20.x and 1.x, the configuration settings previously found in hadoop-site.xml were moved
to conf/core-site.xml (fs.default.name), conf/mapred-site.xml (mapred.job.tracker)
and conf/hdfs-site.xml (dfs.replication).


Assuming you configured each machine as described in the single-node cluster tutorial, you will only have to change a
few variables.
Important: You have to change the configuration files conf/core-site.xml, conf/mapred-site.xml
and conf/hdfs-site.xml on ALL machines as follows.


First, we have to change the fs.default.name variable (in conf/core-site.xml) which specifies
the NameNode (the HDFS master) host and port. In our case, this is the master machine.


<!-- In: conf/core-site.xml -->


<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
<description>The name of the default file system </description>
</property>


Second, we have to change the mapred.job.tracker variable (in conf/mapred-site.xml) which specifies
the JobTracker (MapReduce master) host and port. Again, this is the master in our case.
<!-- In: conf/mapred-site.xml -->
<property>
<name>mapred.job.tracker</name>
<value>master:9001</value>
<description>The host and port that the MapReduce job tracker runs at </description>
</property>


<!-- In: conf/hdfs-site.xml -->


<property>
<name>dfs.name.dir</name>
<value>/hadoop/hdfs/name</value>
<final>true</final>
</property>
<property>
<name>dfs.data.dir</name>
<value>/hadoop/hdfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default block replication. </description>
</property>


Create the necessary folder structure on both nodes:


#mkdir -p /hadoop/hdfs/name
#mkdir -p /hadoop/hdfs/data

Formatting the HDFS filesystem via the NameNode


# bin/hadoop namenode -format


Background: The HDFS name table is stored on the NameNode's (here: master) local filesystem in the directory
specified by dfs.name.dir. The name table is used by the NameNode to store tracking and coordination information
for the DataNodes.

Starting the multi-node cluster


HDFS daemons
Run the command /bin/start-dfs.sh on the machine you want the (primary) NameNode to run on. This will
bring up HDFS with the NameNode running on the machine you ran the previous command on, and DataNodes on the
machines listed in the conf/slaves file.
In our case, we will run bin/start-dfs.sh on master:
#bin/start-dfs.sh


On slave, you can examine the success or failure of this command by inspecting the log files in logs/. Exemplary
output:


As you can see in slave's output above, it will automatically format its storage directory (specified
by dfs.data.dir) if it is not formatted already. It will also create the directory if it does not exist yet.
At this point, the following Java processes should run on master
# jps


and the following on slave.


MapReduce daemons
Run the command /bin/start-mapred.sh on the machine you want the JobTracker to run on. This will bring up
the MapReduce cluster with the JobTracker running on the machine you ran the previous command on, and TaskTrackers
on the machines listed in the conf/slaves file.
In our case, we will run bin/start-mapred.sh on master:
$ bin/start-mapred.sh


On slave, you can examine the success or failure of this command by inspecting the log file
logs/hadoop-root-tasktracker-hadoopslave.log. Exemplary output:


At this point, the following Java processes should run on master


$ jps

(the process IDs don't matter of course)


and the following on slave.


$ jps


Stopping the multi-node cluster


Like starting the cluster, stopping it is done in two steps. The workflow is the opposite of starting, however. First, we begin
with stopping the MapReduce daemons: the JobTracker is stopped on master, and TaskTracker daemons are stopped
on all slaves (here: master and slave). Second, the HDFS daemons are stopped: the NameNode daemon is
stopped on master, and DataNode daemons are stopped on all slaves (here: master and slave).

MapReduce daemons
Run the command /bin/stop-mapred.sh on the JobTracker machine. This will shut down the MapReduce
cluster by stopping the JobTracker daemon running on the machine you ran the previous command on, and TaskTrackers
on the machines listed in the conf/slaves file.
In our case, we will run bin/stop-mapred.sh on master:
$ bin/stop-mapred.sh


(Note: The output above might suggest that the JobTracker was running and stopped on slave, but you can be assured
that the JobTracker ran on master.)
At this point, the following Java processes should run on master
$ jps


and the following on slave.


$ jps


HDFS daemons
Run the command /bin/stop-dfs.sh on the NameNode machine. This will shut down HDFS by stopping the
NameNode daemon running on the machine you ran the previous command on, and DataNodes on the machines listed in
the conf/slaves file.
In our case, we will run bin/stop-dfs.sh on master:
$ bin/stop-dfs.sh


At this point, only the following Java processes should run on master
$ jps

and the following on slave.


$ jps


Running a MapReduce job


Just follow the steps described in the section "Running a MapReduce job" of the single-node cluster tutorial.
Here's the exemplary output on master.
Copy the data before running the following.
$ bin/hadoop jar hadoop-examples-1.0.0.jar wordcount /user/root/in /user/root/out


and on slave for its datanode:


# from logs/hadoop-root-datanode-hadoopslave.log on slave


and on slave for its tasktracker:


# from logs/hadoop-root-tasktracker-hadoopslave.log on slave

If you want to inspect the job's output data, just retrieve the job result from HDFS to your local filesystem.
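For example (a sketch using getmerge to pull all part files into one local file; the local path is just an example):

$ bin/hadoop fs -getmerge /user/root/out /tmp/wordcount-out.txt   # local target path is arbitrary
$ head /tmp/wordcount-out.txt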


Joins

Create a Java project named JoinMap and create the following classes:


package com.hp.join;
// == JobBuilder
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
public class JobBuilder {
private final Class<?> driverClass;
private final Job job;
private final int extraArgCount;
private final String extrArgsUsage;
private String[] extraArgs;
public JobBuilder(Class<?> driverClass) throws IOException {
this(driverClass, 0, "");
}
public JobBuilder(Class<?> driverClass, int extraArgCount, String extrArgsUsage) throws IOException {
this.driverClass = driverClass;
this.extraArgCount = extraArgCount;
this.job = new Job();
this.job.setJarByClass(driverClass);
this.extrArgsUsage = extrArgsUsage;
}
// vv JobBuilder
public static Job parseInputAndOutput(Tool tool, Configuration conf,
String[] args) throws IOException {
if (args.length != 2) {
printUsage(tool, "<input> <output>");
return null;


}
Job job = new Job(conf);
job.setJarByClass(tool.getClass());
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job;
}
public static void printUsage(Tool tool, String extraArgsUsage) {
System.err.printf("Usage: %s [genericOptions] %s\n\n",
tool.getClass().getSimpleName(), extraArgsUsage);
GenericOptionsParser.printGenericCommandUsage(System.err);
}
// ^^ JobBuilder
public JobBuilder withCommandLineArgs(String... args) throws IOException {
Configuration conf = job.getConfiguration();
GenericOptionsParser parser = new GenericOptionsParser(conf, args);
String[] otherArgs = parser.getRemainingArgs();
if (otherArgs.length < 2 && otherArgs.length > 3 + extraArgCount) {
System.err.printf("Usage: %s [genericOptions] [-overwrite] <input path> <output path> %s\n\n",
driverClass.getSimpleName(), extrArgsUsage);
GenericOptionsParser.printGenericCommandUsage(System.err);
System.exit(-1);
}
int index = 0;
boolean overwrite = false;
if (otherArgs[index].equals("-overwrite")) {
overwrite = true;
index++;
}
Path input = new Path(otherArgs[index++]);
Path output = new Path(otherArgs[index++]);
if (index < otherArgs.length) {
extraArgs = new String[otherArgs.length - index];
System.arraycopy(otherArgs, index, extraArgs, 0, otherArgs.length - index);
}
if (overwrite) {


output.getFileSystem(conf).delete(output, true);
}
FileInputFormat.addInputPath(job, input);
FileOutputFormat.setOutputPath(job, output);
return this;
}
public Job build() {
return job;
}
public String[] getExtraArgs() {
return extraArgs;
}
}


package com.hp.join;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class JoinRecordMapper extends MapReduceBase
implements Mapper<LongWritable, Text, TextPair, Text> {
private NcdcRecordParser parser = new NcdcRecordParser();
public void map(LongWritable key, Text value,
OutputCollector<TextPair, Text> output, Reporter reporter)
throws IOException {
parser.parse(value);
output.collect(new TextPair(parser.getStationId(), "1"), value);
}
}


package com.hp.join;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleInputs;
import org.apache.hadoop.util.*;
@SuppressWarnings("deprecation")
public class JoinRecordWithStationName extends Configured implements Tool {
public static class KeyPartitioner implements Partitioner<TextPair, Text> {
@Override
public void configure(JobConf job) {}
@Override
public int getPartition(TextPair key, Text value, int numPartitions) {
return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
}
}

@Override
public int run(String[] args) throws Exception {
if (args.length != 3) {
JobBuilder.printUsage(this, "<ncdc input> <station input> <output>");
return -1;
}
JobConf conf = new JobConf(getConf(), getClass());
conf.setJobName("Join record with station name");
Path ncdcInputPath = new Path(args[0]);
Path stationInputPath = new Path(args[1]);
Path outputPath = new Path(args[2]);
MultipleInputs.addInputPath(conf, ncdcInputPath,
TextInputFormat.class, JoinRecordMapper.class);
MultipleInputs.addInputPath(conf, stationInputPath,
TextInputFormat.class, JoinStationMapper.class);


FileOutputFormat.setOutputPath(conf, outputPath);
conf.setPartitionerClass(KeyPartitioner.class);
conf.setOutputValueGroupingComparator(TextPair.FirstComparator.class);
conf.setMapOutputKeyClass(TextPair.class);
conf.setReducerClass(JoinReducer.class);
conf.setOutputKeyClass(Text.class);
JobClient.runJob(conf);
return 0;
}
public static void main(String[] args) throws Exception {
args = new String[3];
args[0] = "inputncdc";
args[1] = "inputstation";
args[2] = "output"+System.currentTimeMillis();
int exitCode = ToolRunner.run(new JoinRecordWithStationName(), args);
System.exit(exitCode);
}
}


package com.hp.join;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
public class JoinReducer extends MapReduceBase implements
Reducer<TextPair, Text, Text, Text> {
public void reduce(TextPair key, Iterator<Text> values,
OutputCollector<Text, Text> output, Reporter reporter)
throws IOException {
Text stationName = new Text(values.next());
while (values.hasNext()) {
Text record = values.next();
Text outValue = new Text(stationName.toString() + "\t" + record.toString());
output.collect(key.getFirst(), outValue);
}
}
}


package com.hp.join;
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class JoinStationMapper extends MapReduceBase
implements Mapper<LongWritable, Text, TextPair, Text> {
private NcdcStationMetadataParser parser = new NcdcStationMetadataParser();
public void map(LongWritable key, Text value,
OutputCollector<TextPair, Text> output, Reporter reporter)
throws IOException {
if (parser.parse(value)) {
output.collect(new TextPair(parser.getStationId(), "0"),
new Text(parser.getStationName()));
}
}
}


package com.hp.join;
import java.math.*;
import org.apache.hadoop.io.Text;
public class MetOfficeRecordParser {
private String year;
private String airTemperatureString;
private int airTemperature;
private boolean airTemperatureValid;
public void parse(String record) {
if (record.length() < 18) {
return;
}
year = record.substring(3, 7);
if (isValidRecord(year)) {
airTemperatureString = record.substring(13, 18);
if (!airTemperatureString.trim().equals("---")) {
BigDecimal temp = new BigDecimal(airTemperatureString.trim());
temp = temp.multiply(new BigDecimal(BigInteger.TEN));
airTemperature = temp.intValueExact();
airTemperatureValid = true;
}
}
}
private boolean isValidRecord(String year) {
try {
Integer.parseInt(year);
return true;
} catch (NumberFormatException e) {
return false;
}
}
public void parse(Text record) {
parse(record.toString());
}


public String getYear() {


return year;
}
public int getAirTemperature() {
return airTemperature;
}
public String getAirTemperatureString() {
return airTemperatureString;
}
public boolean isValidTemperature() {
return airTemperatureValid;
}
}


package com.hp.join;
import java.text.*;
import java.util.Date;
import org.apache.hadoop.io.Text;
public class NcdcRecordParser {
private static final int MISSING_TEMPERATURE = 9999;
private static final DateFormat DATE_FORMAT =
new SimpleDateFormat("yyyyMMddHHmm");
private String stationId;
private String observationDateString;
private String year;
private String airTemperatureString;
private int airTemperature;
private boolean airTemperatureMalformed;
private String quality;
public void parse(String record) {
stationId = record.substring(4, 10) + "-" + record.substring(10, 15);
observationDateString = record.substring(15, 27);
year = record.substring(15, 19);
airTemperatureMalformed = false;
// Remove leading plus sign as parseInt doesn't like them
if (record.charAt(87) == '+') {
airTemperatureString = record.substring(88, 92);
airTemperature = Integer.parseInt(airTemperatureString);
} else if (record.charAt(87) == '-') {
airTemperatureString = record.substring(87, 92);
airTemperature = Integer.parseInt(airTemperatureString);
} else {
airTemperatureMalformed = true;
}
airTemperature = Integer.parseInt(airTemperatureString);
quality = record.substring(92, 93);
}


public void parse(Text record) {


parse(record.toString());
}
public boolean isValidTemperature() {
return !airTemperatureMalformed && airTemperature != MISSING_TEMPERATURE
&& quality.matches("[01459]");
}
public boolean isMalformedTemperature() {
return airTemperatureMalformed;
}
public boolean isMissingTemperature() {
return airTemperature == MISSING_TEMPERATURE;
}
public String getStationId() {
return stationId;
}
public Date getObservationDate() {
try {
System.out.println(observationDateString);
return DATE_FORMAT.parse(observationDateString);
} catch (ParseException e) {
throw new IllegalArgumentException(e);
}
}
public String getYear() {
return year;
}
public int getYearInt() {
return Integer.parseInt(year);
}
public int getAirTemperature() {
return airTemperature;


}
public String getAirTemperatureString() {
return airTemperatureString;
}
public String getQuality() {
return quality;
}
}


package com.hp.join;
import java.io.*;
import java.util.*;
import org.apache.hadoop.io.IOUtils;
public class NcdcStationMetadata {
private Map<String, String> stationIdToName = new HashMap<String, String>();
public void initialize(File file) throws IOException {
BufferedReader in = null;
try {
in = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
NcdcStationMetadataParser parser = new NcdcStationMetadataParser();
String line;
while ((line = in.readLine()) != null) {
if (parser.parse(line)) {
stationIdToName.put(parser.getStationId(), parser.getStationName());
}
}
} finally {
IOUtils.closeStream(in);
}
}
public String getStationName(String stationId) {
String stationName = stationIdToName.get(stationId);
if (stationName == null || stationName.trim().length() == 0) {
return stationId; // no match: fall back to ID
}
return stationName;
}
public Map<String, String> getStationIdToNameMap() {
return Collections.unmodifiableMap(stationIdToName);
}
}


package com.hp.join;
import org.apache.hadoop.io.Text;
public class NcdcStationMetadataParser {
private String stationId;
private String stationName;
public boolean parse(String record) {
if (record.length() < 42) { // header
return false;
}
String usaf = record.substring(0, 6);
String wban = record.substring(7, 12);
stationId = usaf + "-" + wban;
stationName = record.substring(13, 42);
try {
Integer.parseInt(usaf); // USAF identifiers are numeric
return true;
} catch (NumberFormatException e) {
return false;
}
}
public boolean parse(Text record) {
return parse(record.toString());
}
public String getStationId() {
return stationId;
}
public String getStationName() {
return stationName;
}
}


package com.hp.join;
// cc TextPair A Writable implementation that stores a pair of Text objects
// cc TextPairComparator A RawComparator for comparing TextPair byte representations
// cc TextPairFirstComparator A custom RawComparator for comparing the first field of TextPair byte representations
// vv TextPair
import java.io.*;
import org.apache.hadoop.io.*;
public class TextPair implements WritableComparable<TextPair> {
private Text first;
private Text second;
public TextPair() {
set(new Text(), new Text());
}
public TextPair(String first, String second) {
set(new Text(first), new Text(second));
}
public TextPair(Text first, Text second) {
set(first, second);
}
public void set(Text first, Text second) {
this.first = first;
this.second = second;
}
public Text getFirst() {
return first;
}
public Text getSecond() {
return second;
}
@Override


public void write(DataOutput out) throws IOException {


first.write(out);
second.write(out);
}
@Override
public void readFields(DataInput in) throws IOException {
first.readFields(in);
second.readFields(in);
}
@Override
public int hashCode() {
return first.hashCode() * 163 + second.hashCode();
}
@Override
public boolean equals(Object o) {
if (o instanceof TextPair) {
TextPair tp = (TextPair) o;
return first.equals(tp.first) && second.equals(tp.second);
}
return false;
}
@Override
public String toString() {
return first + "\t" + second;
}
@Override
public int compareTo(TextPair tp) {
int cmp = first.compareTo(tp.first);
if (cmp != 0) {
return cmp;
}
return second.compareTo(tp.second);
}
// ^^ TextPair


// vv TextPairComparator
public static class Comparator extends WritableComparator {
private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
public Comparator() {
super(TextPair.class);
}
@Override
public int compare(byte[] b1, int s1, int l1,
byte[] b2, int s2, int l2) {
try {
int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
int cmp = TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
if (cmp != 0) {
return cmp;
}
return TEXT_COMPARATOR.compare(b1, s1 + firstL1, l1 - firstL1,
b2, s2 + firstL2, l2 - firstL2);
} catch (IOException e) {
throw new IllegalArgumentException(e);
}
}
}
static {
WritableComparator.define(TextPair.class, new Comparator());
}
// ^^ TextPairComparator
// vv TextPairFirstComparator
public static class FirstComparator extends WritableComparator {
private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
public FirstComparator() {
super(TextPair.class);


}
@Override
public int compare(byte[] b1, int s1, int l1,
byte[] b2, int s2, int l2) {
try {
int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
return TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
} catch (IOException e) {
throw new IllegalArgumentException(e);
}
}
@Override
public int compare(WritableComparable a, WritableComparable b) {
if (a instanceof TextPair && b instanceof TextPair) {
return ((TextPair) a).first.compareTo(((TextPair) b).first);
}
return super.compare(a, b);
}
}
// ^^ TextPairFirstComparator
// vv TextPair
}
// ^^ TextPair
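The JoinRecordWithStationName driver used in the commands below is not listed here. As a hedged sketch of how TextPair is typically wired into a reduce-side join, records are partitioned on the first field only, and FirstComparator is used as the grouping comparator, so all records for one station reach the same reduce call while the tag in the second field still controls the sort order. The class below is an assumption for illustration, not the lab's actual code.

package com.hp.join;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Partition on the station ID (the first field) only, ignoring the tag in the second field.
public class KeyPartitioner extends Partitioner<TextPair, Text> {
  @Override
  public int getPartition(TextPair key, Text value, int numPartitions) {
    return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

In the driver this would be registered with job.setPartitionerClass(KeyPartitioner.class) and job.setGroupingComparatorClass(TextPair.FirstComparator.class).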


Create the following folder and copy the file:


Run the application:


Submit the jar to the cluster:


Export the jar and submit it as follows.
Create the necessary input folders (see the commands below).
Make sure the hard-coded path initialization is commented out, as shown below, so the paths are taken from the command-line arguments:
/*args = new String[3];
args[0] = "inputncdc";
args[1] = "inputstation";
args[2] = "output"+System.currentTimeMillis();*/

#hadoop fs -mkdir incdc/


#hadoop fs -mkdir instation/
#hadoop fs -copyFromLocal /hadoop/data/sample.txt incdc/
#hadoop fs -copyFromLocal /hadoop/data/stations*.txt instation/
#hadoop jar /hadoop/hadoop/myhadoopjoin.jar com.hp.join.JoinRecordWithStationName incdc instation outputs


You can view the data as follows:


Running a MapReduce job


We will now run our first Hadoop MapReduce job. We will use the WordCount example job, which reads text files and
counts how often words occur.
The input is text files and the output is text files, each line of which contains a word and the count of how often it
occurred, separated by a tab.
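For reference, a minimal sketch of what such a WordCount mapper and reducer look like; the bundled hadoop-examples implementation may differ in detail, so treat this as an illustration rather than the shipped code.

package com.hp.hadoop;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every token in the input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: sums the counts for each word and writes "word<TAB>count".
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}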

copy input data


$ ls -l /mnt/hgfs/Hadoopsw
total 3604
-rw-r--r-- 1 hduser hadoop  674566 Feb  3 10:17 pg20417.txt
-rw-r--r-- 1 hduser hadoop 1573112 Feb  3 10:18 pg4300.txt
-rw-r--r-- 1 hduser hadoop 1423801 Feb  3 10:18 pg5000.txt

Restart the Hadoop cluster


Restart your Hadoop cluster if it's not running already.
# bin/start-all.sh



Copy local example data to HDFS
Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop's HDFS.
#bin/hadoop fs -mkdir /user/root
#bin/hadoop fs -mkdir /user/root/in
#bin/hadoop dfs -copyFromLocal /mnt/hgfs/Hadoopsw/*.txt /user/root/in

Run the MapReduce job


Now, we actually run the WordCount example job.
#cd $HADOOP_HOME
#bin/hadoop jar hadoop-examples-1.0.0.jar wordcount /user/root/in /user/root/out

This command will read all the files in the HDFS directory /user/root/in, process them, and store the result in the
HDFS directory /user/root/out.


Check if the result is successfully stored in HDFS directory /user/root/out/:


#bin/hadoop dfs -ls /user/root



$ bin/hadoop dfs -ls /user/root/out



Retrieve the job result from HDFS
To inspect the file, you can copy it from HDFS to the local file system. Alternatively, you can use the command
# bin/hadoop dfs -cat /user/root/out/part-r-00000



Copy the output to a local file.
$ mkdir /tmp/hadoop-output
# bin/hadoop dfs -getmerge /user/root/out/ /tmp/hadoop-output



Hadoop Web Interfaces
Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at
these locations:


http://localhost:50030/ web UI for MapReduce job tracker(s)

http://localhost:50060/ web UI for task tracker(s)

http://localhost:50070/ web UI for HDFS name node(s)

These web interfaces provide concise information about what's happening in your Hadoop cluster. You might want to give
them a try.

MapReduce Job Tracker Web Interface


The job tracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed
jobs, and a job history log file. It also gives access to the local machine's Hadoop log files (the machine on which the web
UI is running).
By default, it's available at http://localhost:50030/.


A screenshot of Hadoop's Job Tracker web interface.



Task Tracker Web Interface
The task tracker web UI shows you running and non-running tasks. It also gives access to the local machine's Hadoop
log files.
By default, it's available at http://localhost:50060/.

A screenshot of Hadoop's Task Tracker web interface.



HDFS Name Node Web Interface
The name node web UI shows you a cluster summary including information about total/remaining capacity and live and dead
nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It
also gives access to the local machine's Hadoop log files.
By default, it's available at http://localhost:50070/.


A screenshot of Hadoop's Name Node web interface.


MR Unit Testing

Start eclipse
Create New java project : MRUnitTest
Unzip mrunit jar in /hadoop/mrunit
Include mrunit jar in the project


package com.hp.hadoop;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
public class MaxTemperatureMapper
extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
private static final int MISSING = 9999;
@Override
public void map(LongWritable key, Text values,Context context) throws IOException,
InterruptedException{
String line = values.toString();
String year = line.substring(15, 19);
int airTemperature = Integer.parseInt(line.substring(87, 92));
System.out.println("-----"+ year +"="+ airTemperature );
context.write(new Text(year), new IntWritable(airTemperature));
}
}


package com.hp.hadoop;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {


@Override
public void reduce(Text key, Iterable<IntWritable> values,
Context context)
throws IOException, InterruptedException {
int maxValue = Integer.MIN_VALUE;
for (IntWritable value : values) {
maxValue = Math.max(maxValue, value.get());
}
context.write(key, new IntWritable(maxValue));
}

}

package com.hp.test;
// cc MaxTemperatureMapperTestV1 Unit test for MaxTemperatureMapper
// == MaxTemperatureMapperTestV1Missing
// vv MaxTemperatureMapperTestV1
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.*;
import com.hp.hadoop.MaxTemperatureMapper;
public class MaxTemperatureMapperTest {
@Test
public void processesValidRecord() throws IOException, InterruptedException {
Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
// Year ^^^^
"99999V0203201N00261220001CN9999999N9-00111+99999999999");
// Temperature ^^^^^
new MapDriver<LongWritable, Text, Text, IntWritable>()
.withMapper(new MaxTemperatureMapper())
.withInputKey(new LongWritable(123))
.withInputValue(value)
.withOutput(new Text("1950"), new IntWritable(-11))
.runTest();
}
// ^^ MaxTemperatureMapperTestV1
//@Ignore //
// vv MaxTemperatureMapperTestV1Missing
@Test
public void ignoresMissingTemperatureRecord() throws IOException,
InterruptedException {
Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
// Year ^^^^
"99999V0203201N00261220001CN9999999N9+99991+99999999999");
// Temperature ^^^^^
new MapDriver<LongWritable, Text, Text, IntWritable>()
.withMapper(new MaxTemperatureMapper())
.withInputKey(new LongWritable(123))
.withInputValue(value)


.runTest();
}
// ^^ MaxTemperatureMapperTestV1Missing
@Test
public void processesMalformedTemperatureRecord() throws IOException,
InterruptedException {
Text value = new Text("0335999999433181957042302005+37950+139117SAO +0004" +
// Year ^^^^
"RJSN V02011359003150070356999999433201957010100005+353");
// Temperature ^^^^^
new MapDriver<LongWritable, Text, Text, IntWritable>()
.withMapper(new MaxTemperatureMapper())
.withInputValue(value)
.withInputKey(new LongWritable(123))
.withOutput(new Text("1957"), new IntWritable(1957))
.runTest();
}
// vv MaxTemperatureMapperTestV1
}
// ^^ MaxTemperatureMapperTestV1


package com.hp.test;
// == MaxTemperatureReducerTestV1
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.*;
import com.hp.hadoop.MaxTemperatureReducer;
public class MaxTemperatureReducerTest {
//vv MaxTemperatureReducerTestV1
@Test
public void returnsMaximumIntegerInValues() throws IOException,
InterruptedException {
new ReduceDriver<Text, IntWritable, Text, IntWritable>()
.withReducer(new MaxTemperatureReducer())
.withInputKey(new Text("1950"))
.withInputValues(Arrays.asList(new IntWritable(10), new IntWritable(5)))
.withOutput(new Text("1950"), new IntWritable(10))
.runTest();
}
//^^ MaxTemperatureReducerTestV1
}


Final updated project view:


Run as JUnit


Map reduce - Partitioner

Start eclipse.
Create one Java project: Partitioner, and create the following Java classes:


Add the User Libraries:

package com.hp.partitioner;
// cc MaxTemperatureMapper Mapper for maximum temperature example
// vv MaxTemperatureMapper
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MaxTemperatureMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
private static final int MISSING = 9999;
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String year = line.substring(15, 19);
int airTemperature;
if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
airTemperature = Integer.parseInt(line.substring(88, 92));
} else {
airTemperature = Integer.parseInt(line.substring(87, 92));
}


String quality = line.substring(92, 93);


if (airTemperature != MISSING && quality.matches("[01459]")) {
context.write(new Text(year), new IntWritable(airTemperature));
}
}
}
// ^^ MaxTemperatureMapper

package com.hp.partitioner;
// cc MaxTemperatureReducer Reducer for maximum temperature example
// vv MaxTemperatureReducer
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxTemperatureReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values,
Context context)
throws IOException, InterruptedException {
int maxValue = Integer.MIN_VALUE;
for (IntWritable value : values) {
maxValue = Math.max(maxValue, value.get());
}
context.write(key, new IntWritable(maxValue));
}
}
// ^^ MaxTemperatureReducer


package com.hp.partitioner;
// cc MaxTemperatureWithCombiner Application to find the maximum temperature, using a combiner function for efficiency
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// vv MaxTemperatureWithCombiner
public class MaxTemperatureWithCombiner {
public static void main(String[] args) throws Exception {
/*if (args.length != 2) {
System.err.println("Usage: MaxTemperatureWithCombiner <input path> " +
"<output path>");
System.exit(-1);
}*/
Job job = new Job();
job.setJarByClass(MaxTemperatureWithCombiner.class);
job.setJobName("Max temperature");

FileInputFormat.addInputPath(job, new Path("in"));


FileOutputFormat.setOutputPath(job, new Path("out"+System.currentTimeMillis()));
job.setMapperClass(MaxTemperatureMapper.class);
job.setCombinerClass(MaxTemperatureReducer.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
// ^^ MaxTemperatureWithCombiner
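The driver above relies on Hadoop's default HashPartitioner. Since this exercise is about partitioners, a minimal sketch of a custom partitioner is shown below; the class name and the decade-based routing are assumptions for illustration, not part of the lab's listing.

package com.hp.partitioner;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each year's records to a reducer based on its decade,
// so the output files end up split by decade.
public class DecadePartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text year, IntWritable temperature, int numPartitions) {
    int decade = Integer.parseInt(year.toString()) / 10;
    return (decade & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be registered with job.setPartitionerClass(DecadePartitioner.class) and takes effect only when the job is configured with more than one reducer (for example job.setNumReduceTasks(2)).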


Copy the input files into the project's in folder:

Run the application as a Java application.


Output:


Submit the application to the cluster.


1) Comment out the hard-coded paths.
2) Copy the files to HDFS.
3) Execute the class:
#hadoop fs -copyFromLocal /hadoop/data/19*gz in/
# hadoop jar /hadoop/hadoop/mypartitioner.jar com.hp.partitioner.MaxTemperatureWithCombiner in output

View the job tracker:

http://192.168.92.128:50030/jobtracker.jsp
And the output:
http://192.168.92.128:50070/dfshealth.jsp


Pig UDF

Start eclipse
Create a Java project :- PigUDF
Include Hadoop Library in Java Build Path
Create and Include Pig User library (Available in Pig Installation folder)


package com.hp.hadoop.pig;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pig.FilterFunc;
import org.apache.pig.FuncSpec;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class IsGoodQuality extends FilterFunc {


@Override
public Boolean exec(Tuple tuple) throws IOException {
if (tuple == null || tuple.size() == 0) {
return false;
}
try {
Object object = tuple.get(0);
if (object == null) {
return false;
}
int i = (Integer) object;
return i == 0 || i == 1 || i == 4 || i == 5 || i == 9;
} catch (ExecException e) {
throw new IOException(e);
}
}
//^^ IsGoodQuality
//vv IsGoodQualityTyped
@Override
public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
List<FuncSpec> funcSpecs = new ArrayList<FuncSpec>();
funcSpecs.add(new FuncSpec(this.getClass().getName(),
new Schema(new Schema.FieldSchema(null, DataType.INTEGER))));


return funcSpecs;
}
}

- Export the project as a jar:
/hadoop/pig-0.10.0/mypigudf.jar
- Copy pigudf.txt to /hadoop/pig-0.10.0/ (using the cp command from the shared folder).
Type pig and enter the following:
grunt> records = LOAD '/hadoop/pig-0.10.0/pigudf.txt' AS (year:chararray, temperature:int, quality:int);
grunt> REGISTER /hadoop/pig-0.10.0/mypigudf.jar;
grunt> filtered_records = FILTER records BY temperature != 9999 AND com.hp.hadoop.pig.IsGoodQuality(quality);
grunt> grouped_records = GROUP filtered_records BY year;
grunt> max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature);
grunt> DUMP max_temp;


Sqoop

Untar Sqoop in the /hadoop folder.

Declare $SQOOP_HOME in the $HOME/.bashrc file:


$ export SQOOP_HOME=/hadoop/sqoop
$ export HBASE_HOME=/hadoop/hbase-0.92.1
And include in the PATH variable.
$ export PATH=$SQOOP_HOME/bin:$PATH


Run Sqoop:
$sqoop

Install MySQL
Verify whether MySQL is already installed on the system:

rpm -qa | grep -i mysql


#rpm -ivh MySQL-server-5.5.25-1.rhel5.i386.rpm


Wait till it displays the completion message.


#rpm -i MySQL-client-5.5.25-1.rhel5.i386.rpm
Verify the installation: mysql -V

Start MySQL:
Follow the steps below to stop and start MySQL:

[local-host]# service mysql status
MySQL running (12588)    OK

[local-host]# service mysql stop
Shutting down MySQL.     OK

[local-host]# service mysql start
Starting MySQL.


Type:
# mysql

mysql> CREATE DATABASE hadoopguide;


Query OK, 1 row affected (0.02 sec)
mysql> GRANT ALL PRIVILEGES ON hadoopguide.* TO '%'@'localhost';
Query OK, 0 rows affected (0.00 sec)
mysql> GRANT ALL PRIVILEGES ON hadoopguide.* TO ''@'localhost';
Query OK, 0 rows affected (0.00 sec)
mysql> quit;
Bye


Create a table and insert some values:

# mysql hadoopguide

mysql> CREATE TABLE widgets(id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
-> widget_name VARCHAR(64) NOT NULL,
-> price DECIMAL(10,2),
-> design_date DATE,
-> version INT,
-> design_comment VARCHAR(100));
Query OK, 0 rows affected (0.00 sec)
mysql> INSERT INTO widgets VALUES (NULL, 'sprocket', 0.25, '2010-02-10', 1, 'Connects two gizmos');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO widgets VALUES (NULL, 'gizmo', 4.00, '2009-11-30', 4, NULL);
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO widgets VALUES (NULL, 'gadget', 99.99, '1983-08-13',
-> 13, 'Our flagship product');

Query OK, 1 row affected (0.00 sec)


mysql> quit;


#cd /hadoop
Extract the following archive:
#tar -xvf mysql-connector-java-5.1.20.tar.gz -C /hadoop
#cd /hadoop/mysql-connector-java-5.1.20
#cp mysql-connector-java-5.1.20-bin.jar $SQOOP_HOME/lib
This copies the JDBC driver that Sqoop needs to connect to MySQL into Sqoop's lib directory.
Start HDFS.
Connect using Sqoop


Type the following in the console


#sqoop import --connect jdbc:mysql://localhost/hadoopguide --table widgets


Verify the data import


#bin/hadoop fs -cat widgets/part-m-00000


Additional Notes:
PLEASE REMEMBER TO SET A PASSWORD FOR THE MySQL root USER !
To do so, start the server, then issue the following commands:

/usr/bin/mysqladmin -u root password 'new-password'


/usr/bin/mysqladmin -u root -h master password 'new-password'

Alternatively you can run:


/usr/bin/mysql_secure_installation

which will also give you the option of removing the test
databases and anonymous user created by default. This is
strongly recommended for production servers.

See the manual for more instructions.


Please report any problems with the /usr/bin/mysqlbug script!
bin/mysqld_safe --user=mysql &


Map Reducing

Goal: You will be able to write a MapReduce program using the Eclipse IDE.
IDE Set Up:
1) Untar the eclipse-jee-juno-linux-gtk.tar.gz

Copy hadoop-eclipse-plugin-1.0.3.jar to the Eclipse plugins folder.


Start eclipse
# cd /hadoop/eclipse


Configure the Hadoop environment in the IDE.

Window -> Preferences -> Hadoop installation directory -> browse to your Hadoop installation folder -> Apply -> OK


Create new Java project :- MaxTemperature


Create code as below:


package com.hp.hadoop;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MaxTemperature {
/**
* @param args
*/
public static void main(String[] args) throws Exception{
Job job = new Job();
job.setJarByClass(MaxTemperature.class);
job.setJobName("Max temperature");
FileInputFormat.addInputPath(job, new Path("in"));
FileOutputFormat.setOutputPath(job, new Path("out"+System.currentTimeMillis()));
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);

}
}


package com.hp.hadoop;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
public class MaxTemperatureMapper
extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
private static final int MISSING = 9999;
@Override
public void map(LongWritable key, Text values,Context context) throws IOException,
InterruptedException{
String line = values.toString();
String year = line.substring(15, 19);
int airTemperature;
if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
airTemperature = Integer.parseInt(line.substring(88, 92));
} else {
airTemperature = Integer.parseInt(line.substring(87, 92));
}
String quality = line.substring(92, 93);
if (airTemperature != MISSING && quality.matches("[01459]")) {
context.write(new Text(year), new IntWritable(airTemperature));
}
}
}


package com.hp.hadoop;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {


@Override
public void reduce(Text key, Iterable<IntWritable> values,
Context context)
throws IOException, InterruptedException {
int maxValue = Integer.MIN_VALUE;
for (IntWritable value : values) {
maxValue = Math.max(maxValue, value.get());
}
context.write(key, new IntWritable(maxValue));
}

}

Create one folder named in in the project:

Copy the following files into it:


1901 and 1902

Run the application as a Java application :- right-click MaxTemperature.java


Output:


13/01/05 15:14:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where
applicable
13/01/05 15:14:25 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the
same.
13/01/05 15:14:25 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/01/05 15:14:25 INFO input.FileInputFormat: Total input paths to process : 2
13/01/05 15:14:26 INFO mapred.JobClient: Running job: job_local_0001
13/01/05 15:14:26 INFO util.ProcessTree: setsid exited with exit code 0
13/01/05 15:14:26 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@ab444
13/01/05 15:14:26 INFO mapred.MapTask: io.sort.mb = 100
13/01/05 15:14:27 INFO mapred.JobClient: map 0% reduce 0%
13/01/05 15:14:30 INFO mapred.MapTask: data buffer = 79691776/99614720
13/01/05 15:14:30 INFO mapred.MapTask: record buffer = 262144/327680
13/01/05 15:14:31 INFO mapred.MapTask: Starting flush of map output
13/01/05 15:14:31 INFO mapred.MapTask: Finished spill 0
13/01/05 15:14:31 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
13/01/05 15:14:33 INFO mapred.LocalJobRunner:
13/01/05 15:14:33 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
13/01/05 15:14:33 INFO mapred.JobClient: map 100% reduce 0%
13/01/05 15:14:33 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@ff8c74
13/01/05 15:14:33 INFO mapred.MapTask: io.sort.mb = 100
13/01/05 15:14:33 INFO mapred.MapTask: data buffer = 79691776/99614720
13/01/05 15:14:33 INFO mapred.MapTask: record buffer = 262144/327680
13/01/05 15:14:33 INFO mapred.MapTask: Starting flush of map output
13/01/05 15:14:33 INFO mapred.MapTask: Finished spill 0
13/01/05 15:14:33 INFO mapred.Task: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
13/01/05 15:14:37 INFO mapred.LocalJobRunner:
13/01/05 15:14:37 INFO mapred.Task: Task 'attempt_local_0001_m_000001_0' done.
13/01/05 15:14:37 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@13c6641
13/01/05 15:14:37 INFO mapred.LocalJobRunner:
13/01/05 15:14:37 INFO mapred.Merger: Merging 2 sorted segments
13/01/05 15:14:37 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 144423 bytes
13/01/05 15:14:37 INFO mapred.LocalJobRunner:
13/01/05 15:14:37 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
13/01/05 15:14:37 INFO mapred.LocalJobRunner:
13/01/05 15:14:37 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
13/01/05 15:14:37 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to out1357379065119
13/01/05 15:14:40 INFO mapred.LocalJobRunner: reduce > reduce
13/01/05 15:14:40 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
13/01/05 15:14:41 INFO mapred.JobClient: map 100% reduce 100%


13/01/05 15:14:41 INFO mapred.JobClient: Job complete: job_local_0001


13/01/05 15:14:41 INFO mapred.JobClient: Counters: 20
13/01/05 15:14:41 INFO mapred.JobClient: File Output Format Counters
13/01/05 15:14:41 INFO mapred.JobClient: Bytes Written=30
13/01/05 15:14:41 INFO mapred.JobClient: FileSystemCounters
13/01/05 15:14:41 INFO mapred.JobClient: FILE_BYTES_READ=4589102
13/01/05 15:14:41 INFO mapred.JobClient: FILE_BYTES_WRITTEN=456325
13/01/05 15:14:41 INFO mapred.JobClient: File Input Format Counters
13/01/05 15:14:41 INFO mapred.JobClient: Bytes Read=1777168
13/01/05 15:14:41 INFO mapred.JobClient: Map-Reduce Framework
13/01/05 15:14:41 INFO mapred.JobClient: Map output materialized bytes=144431
13/01/05 15:14:41 INFO mapred.JobClient: Map input records=13130
13/01/05 15:14:41 INFO mapred.JobClient: Reduce shuffle bytes=0
13/01/05 15:14:41 INFO mapred.JobClient: Spilled Records=26258
13/01/05 15:14:41 INFO mapred.JobClient: Map output bytes=118161
13/01/05 15:14:41 INFO mapred.JobClient: Total committed heap usage (bytes)=548327424
13/01/05 15:14:41 INFO mapred.JobClient: CPU time spent (ms)=0
13/01/05 15:14:41 INFO mapred.JobClient: SPLIT_RAW_BYTES=220
13/01/05 15:14:41 INFO mapred.JobClient: Combine input records=0
13/01/05 15:14:41 INFO mapred.JobClient: Reduce input records=13129
13/01/05 15:14:41 INFO mapred.JobClient: Reduce input groups=2
13/01/05 15:14:41 INFO mapred.JobClient: Combine output records=0
13/01/05 15:14:41 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
13/01/05 15:14:41 INFO mapred.JobClient: Reduce output records=2
13/01/05 15:14:41 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
13/01/05 15:14:41 INFO mapred.JobClient: Map output records=13129


Create one class as follows:


package com.hp.hadoop;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class MaxTemperatureDriver extends Configured implements Tool{
@Override
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err.printf("Usage: %s [generic options] <input> <output>\n",
getClass().getSimpleName());
ToolRunner.printGenericCommandUsage(System.err);
return -1;
}
Job job = new Job(getConf(), "Max temperature");
job.setJarByClass(getClass());
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MaxTemperatureMapper.class);
job.setCombinerClass(MaxTemperatureReducer.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
return job.waitForCompletion(true) ? 0 : 1;
}


public static void main(String[] args) throws Exception {


int exitCode = ToolRunner.run(new MaxTemperatureDriver(), args);
System.exit(exitCode);
}
}
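Because the driver implements Tool and is launched through ToolRunner, generic Hadoop options can be passed on the command line ahead of the input and output paths; ToolRunner strips them out before run() receives its arguments. For example (the jar path, input/output names, and property value here are illustrative):

#hadoop jar /hadoop/hadoop/mymapreduce.jar com.hp.hadoop.MaxTemperatureDriver -D mapred.reduce.tasks=2 intemp outtemp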


Export the application as jar


Next -> Next -> Finish

You can see the MaxTemperature.jar as above.


Run the program in the Hadoop cluster as follows:


Start the cluster: start-all.sh


#hadoop fs -mkdir intemp


Place the two input files in /hadoop/hadoop:


1901
1902
Copy the above files in HDFS:
#hadoop fs -copyFromLocal /hadoop/hadoop/19* intemp


Execute job:
#bin/hadoop jar /hadoop/hadoop/mymapreduce.jar com.hp.hadoop.MaxTemperatureDriver intemp outtemp


You can view the result from the HDFS browser as below.
