
Getting Started With Hadoop

By Reeshu Patel

Getting Started with Hadoop 1

Hadoop, and large-scale distributed data processing in general, is rapidly becoming
an important skill set for many programmers. An effective programmer, today, must
have knowledge of relational databases, networking, and security, all of which were
considered optional skills a couple of decades ago. Similarly, a basic understanding of
distributed data processing will soon become an essential part of every
programmer's toolbox.
Leading universities, such as Stanford and CMU, have already started introducing
Hadoop into their computer science curriculum. This book will help you, the
practicing programmer, get up to speed on Hadoop quickly and start using it to
process your data sets.
This book introduces Hadoop more formally, positioning it in terms of distributed
systems and data processing systems. It gives an overview of the MapReduce
programming model. A simple word counting example with existing tools highlights
the challenges around processing data at large scale.
You will implement that example using Hadoop to gain a deeper appreciation of
Hadoop's simplicity. We will also discuss the history of Hadoop and some
perspectives on the MapReduce paradigm. But let me first briefly explain why I
wrote this book and why it is useful to you.


Why Hadoop
Speaking from experience, I first found Hadoop to be tantalizing in its possibilities,
yet frustrating to progress beyond coding the basic examples. The documentation at
the official Hadoop site is fairly comprehensive, but it is not always easy to find
straightforward answers to straightforward questions. The purpose of writing this
book is to address that problem. I won't focus on the nitty-gritty details. Instead, I
will provide the information that will allow you to quickly create useful code, along
with the more advanced topics most often encountered in practice.

Comparing SQL databases and Hadoop

Given that Hadoop is a framework for processing data, what makes it better than
standard relational databases, the workhorse of data processing in most of today's
applications? One reason is that SQL (Structured Query Language) is by design
targeted at structured data. Many of Hadoop's initial applications deal with
unstructured data such as text. From this perspective Hadoop provides a more
general paradigm than SQL. For working only with structured data, the comparison
is more nuanced. In principle, SQL and Hadoop can be complementary, as SQL is a
query language which can be implemented on top of Hadoop as the execution
engine. But in practice, SQL databases tend to refer to a whole set of legacy
technologies, with several dominant vendors, optimized for a historical set of
applications. Many of these existing commercial databases are a mismatch to the
requirements that Hadoop targets.
With that in mind, let's make a more detailed comparison of Hadoop with typical
SQL databases on specific dimensions.


Before going through a formal treatment of MapReduce, let's go through an exercise
of scaling a simple program to process a large data set. You'll see the challenges of
scaling a data processing program and will better appreciate the benefits of using a
framework such as MapReduce to handle the tedious chores for you.
Our exercise is to count the number of times each word occurs in a set of
documents. In this example, our set of documents has only one document,
containing only one sentence:
Do as I say, not as I do.
We derive the word counts shown below. We'll call this particular exercise word
counting. When the set of documents is small, a straightforward program will do
the job.

Word  Count
as    2
do    2
i     2
not   1
say   1
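For a data set this small, the "straightforward program" can be a one-line Unix pipeline. The following sketch (an illustration, not code from the text) produces the counts above:

```shell
# Count how often each word occurs: lowercase the text, put one word per
# line, sort the words, then count duplicate adjacent lines.
echo 'Do as I say, not as I do.' |
  tr 'A-Z' 'a-z' |
  tr -cs 'a-z' '\n' |
  sed '/^$/d' |
  sort |
  uniq -c |
  sort -rn
```

The first tr lowercases, the second replaces every run of non-letters with a newline, and sed defensively drops any blank lines before counting.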

What is Hadoop
Formally speaking, Hadoop is an open source framework for writing and running
distributed applications that process large amounts of data. Distributed computing
is a wide and varied field, but the key distinctions of Hadoop are that it is
accessible, robust, scalable, and simple.


Hadoop's accessibility and simplicity give it an edge over writing and running large
distributed programs. Even college students can quickly and cheaply create their
own Hadoop cluster. On the other hand, its robustness and scalability make it
suitable for even the most demanding jobs at Yahoo and Facebook. These features
make Hadoop popular in both academia and industry.

Understanding distributed systems and Hadoop

To understand the popularity of distributed systems (scale-out) vis-à-vis huge
monolithic servers (scale-up), consider the price/performance of current I/O
technology. A high-end machine with four I/O channels, each having a throughput
of 100 MB/sec, will require three hours to read a 4 TB data set! With Hadoop, this
same data set will be divided into smaller (typically 64 MB) blocks that are spread
among many machines in the cluster via the Hadoop Distributed File System
(HDFS). With a modest degree of replication, the cluster machines can read the data
set in parallel and provide a much higher throughput. And such a cluster of
commodity machines turns out to be cheaper than one high-end server.
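The arithmetic behind that read-time claim can be checked quickly, using the numbers from the text (four channels at 100 MB/sec reading 4 TB):

```shell
# 4 TB (decimal) read through 4 channels x 100 MB/sec, expressed in hours.
awk 'BEGIN {
  total_mb   = 4 * 1000 * 1000   # 4 TB in MB
  throughput = 4 * 100           # aggregate MB/sec
  printf "%.1f hours\n", total_mb / throughput / 3600
}'
```

This prints 2.8 hours, i.e. roughly the three hours quoted above.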

Understanding MapReduce
You are probably aware of data processing models such as pipelines and message
queues. These models provide specific capabilities in developing different aspects of
data processing applications. The most familiar pipelines are the Unix pipes.
Pipelines can help the reuse of processing primitives; simple chaining of existing
modules creates new ones. Message queues can help the synchronization of
processing primitives.

The programmer writes her data processing task as processing primitives in the
form of either a producer or a consumer. The timing of their execution is managed
by the system.
Similarly, MapReduce is also a data processing model. Its greatest advantage is the
easy scaling of data processing over multiple computing nodes. Under the
MapReduce model, the data processing primitives are called mappers and
reducers. Decomposing a data processing application into mappers and reducers is
sometimes nontrivial. But, once you write an application in the MapReduce form,
scaling the application to run over hundreds, thousands, or even tens of thousands
of machines in a cluster is merely a configuration change. This simple scalability is
what has attracted many programmers to the MapReduce model.
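To make the mapper/reducer division of labor concrete, here is a hypothetical word-count pair written as plain shell filters (an illustration only, not Hadoop's API; the local `sort` merely stands in for the sorted, grouped input that Hadoop's shuffle phase delivers to reducers):

```shell
# mapper: emit "word<TAB>1" for every word on stdin
mapper() {
  tr -s ' ' '\n' | awk '{ print $1 "\t1" }'
}
# reducer: sum the counts for each word
reducer() {
  awk -F'\t' '{ sum[$1] += $2 } END { for (w in sum) print w, sum[w] }'
}
# run the pair locally; `sort` plays the role of the shuffle
echo 'do as i say not as i do' | mapper | sort | reducer | sort
```

Scaling this to a cluster would not change the mapper or reducer at all; only the machinery that moves data between them changes, which is exactly the point of the model.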

Starting Hadoop
If you work in an environment where someone else sets up the Hadoop cluster for
you, you may want to skim through this chapter. You want to understand enough to
set up your personal development machine, but you can skip through the details of
configuring the communication and coordination of various nodes.
After discussing the physical components of Hadoop, we progress to setting up
your cluster. The next section will focus on the three operational modes of Hadoop
and how to set them up. You'll then read about the web-based tools that assist in
monitoring your cluster.


The building blocks of Hadoop

We discussed the concepts of distributed storage and distributed computation in
the previous chapter. Now let's see how Hadoop implements those ideas. On a fully
configured cluster, running Hadoop means running a set of daemons, or resident
programs, on the different servers in your network. These daemons have specific
roles; some exist only on one server, some exist across multiple servers. The
daemons include
1:-NameNode
2:-DataNode
3:-Secondary NameNode
4:-JobTracker
5:-TaskTracker
We discuss each one and its role within Hadoop.

1:-NameNode
Let's begin with arguably the most vital of the Hadoop daemons: the NameNode.
Hadoop employs a master/slave architecture for both distributed storage and
distributed computation. The distributed storage system is called the Hadoop
Distributed File System, or HDFS. The NameNode is the master of HDFS that
directs the slave DataNode daemons to perform the low-level I/O tasks. The
NameNode is the bookkeeper of HDFS; it keeps track of how your files are broken
down into file
blocks, which nodes store those blocks, and the overall health of the distributed file
system.
The function of the NameNode is memory and I/O intensive. As such, the server
hosting the NameNode typically doesn't store any user data or perform any
computations for a MapReduce program, in order to lower the workload on the
machine. This means that the NameNode server doesn't double as a DataNode or a
TaskTracker.
There is unfortunately a negative aspect to the importance of the NameNode: it is a
single point of failure of your Hadoop cluster. For any of the other daemons, if
their host nodes fail for software or hardware reasons, the Hadoop cluster will likely
continue to function smoothly, or you can quickly restart it. Not so for the
NameNode.

2:-DataNode
Each slave machine in your cluster will host a DataNode daemon to perform the
grunt work of the distributed file system: reading and writing HDFS blocks to actual
files on the local file system. When you want to read or write an HDFS file, the file is
broken into blocks and the NameNode will tell your client which DataNode each
block resides in. Your client communicates directly with the DataNode daemons to
process the local files corresponding to the blocks. Furthermore, a DataNode may
communicate with other DataNodes to replicate its data blocks for redundancy.


Figure 1.1 illustrates the roles of the NameNode and DataNodes. In this figure, we
show two data files, one at /user/chuck/data1 and another at /user/james/data2.
The data1 file takes up three blocks, which we denote 1, 2, and 3, and the data2 file
consists of blocks 4 and 5. The contents of the files are distributed among the
DataNodes. In this illustration, each block has three replicas. For example, block 1
(used for data1) is replicated over the three rightmost DataNodes. This ensures that
if any one DataNode crashes or becomes inaccessible over the network, you will
still be able to read the files.

3:-Secondary NameNode
The Secondary NameNode (SNN) is an assistant daemon for monitoring the state
of the cluster's HDFS. Like the NameNode, each cluster has one SNN, and it
typically resides on its own machine as well. No other DataNode or TaskTracker
daemons run on the same server. The SNN differs from the NameNode in that this
process doesn't receive or record any real-time changes to HDFS. Instead, it
communicates with the NameNode to take snapshots of the HDFS metadata at
intervals defined by the cluster configuration.
As mentioned earlier, the NameNode is a single point of failure for a Hadoop
cluster, and the SNN snapshots help minimize the downtime and loss of data.
Nevertheless, a NameNode failure requires human intervention to reconfigure the
cluster to use the SNN as the primary NameNode. We will discuss the recovery
process in chapter 8 when we cover best practices for managing your cluster.
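The snapshot interval mentioned above is an ordinary configuration property. The snippet below is a sketch of the relevant core-site.xml setting in Hadoop 1.x; 3600 seconds is the stock default and is shown purely as an illustration:

```xml
<!-- core-site.xml: how often (in seconds) the SNN takes a checkpoint -->
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value>
  <description>The number of seconds between two periodic checkpoints.</description>
</property>
```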

4:-JobTracker
The JobTracker daemon is the liaison between your application and Hadoop. Once
you submit your code to your cluster, the JobTracker determines the execution plan
by determining which files to process, assigns nodes to different tasks, and monitors
all tasks as they are running. Should a task fail, the JobTracker will automatically
relaunch the task, possibly on a different node, up to a predefined limit of retries.
There is only one JobTracker daemon per Hadoop cluster. It typically runs on a
server as a master node of the cluster.
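The "predefined limit of retries" is likewise configurable. In Hadoop 1.x it corresponds to the following mapred-site.xml properties; 4 is the shipped default, shown here only as an illustration:

```xml
<!-- mapred-site.xml: how many times a failed task is retried -->
<property>
  <name>mapred.map.max.attempts</name>
  <value>4</value>
</property>
<property>
  <name>mapred.reduce.max.attempts</name>
  <value>4</value>
</property>
```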

5:-TaskTracker
As with the storage daemons, the computing daemons also follow a master/slave
architecture: the JobTracker is the master overseeing the overall execution of a
MapReduce job and the TaskTrackers manage the execution of individual tasks on
each slave node. Figure 2.2 illustrates this interaction.
Each TaskTracker is responsible for executing the individual tasks that the
JobTracker assigns. Although there is a single TaskTracker per slave node, each
TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in
parallel.
One responsibility of the TaskTracker is to constantly communicate with the
JobTracker. If the JobTracker fails to receive a heartbeat from a TaskTracker within
a specified amount of time, it will assume the TaskTracker has crashed and will
resubmit the corresponding tasks to other nodes in the cluster.

Figure 2.2 JobTracker and TaskTracker interaction. After a client calls the
JobTracker to begin a data processing job, the JobTracker partitions the work and
assigns different map and reduce tasks to each TaskTracker in the cluster.

Define a common account

We have been speaking in general terms of one node accessing another; more
precisely, this access is from a user account on one node to another user account on
the target machine. For Hadoop, the accounts should have the same username on
all of the nodes (we use hadoop-user in this book), and for security purposes we
recommend it being a user-level account. This account is only for managing your
Hadoop cluster. Once the cluster daemons are up and running, you'll be able to run
your actual MapReduce jobs from other accounts.

Verify SSH installation

The first step is to check whether SSH is installed on your nodes. We can easily do
this by use of the "which" UNIX command:
[hadoop-user@master]$ which ssh
[hadoop-user@master]$ which sshd
[hadoop-user@master]$ which ssh-keygen
If you instead receive an error message such as this,
/usr/bin/which: no ssh in (/usr/bin:/bin:/usr/sbin...
install OpenSSH (www.openssh.com) via a Linux package manager or by
downloading the source directly. (Better yet, have your system administrator do it
for you.)
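The three checks above can be rolled into a small loop that reports only what is missing (a convenience sketch, not from the text):

```shell
# Report any required SSH program that is not on the PATH.
for prog in ssh sshd ssh-keygen; do
  if ! which "$prog" > /dev/null 2>&1; then
    echo "missing: $prog"
  fi
done
```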

Generate SSH key pair



Having verified that SSH is correctly installed on all nodes of the cluster, we use
ssh-keygen on the master node to generate an RSA key pair. Be certain to avoid
entering a passphrase, or you'll have to manually enter that phrase every time the
master node attempts to access another node.
[hadoop-user@master]$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop-user/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop-user/.ssh/id_rsa.
Your public key has been saved in /home/hadoop-user/.ssh/id_rsa.pub.
After creating your key pair, your public key will be of the form
[hadoop-user@master]$ more /home/hadoop-user/.ssh/id_rsa.pub

IRQ== hadoop-user@master
and we next need to distribute this public key across your cluster.

Distribute public key and validate logins

Albeit a bit tedious, you will next need to copy the public key to every slave node as
well as the master node:
[hadoop-user@master]$ scp ~/.ssh/id_rsa.pub target:~/master_key
Manually log in to the target node and set the master key as an authorized key (or
append it to the list of authorized keys if you have others defined).
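On the target node, that manual step looks like the following (file names follow the text; the append form preserves any authorized keys you already have):

```shell
# Register the copied master key as an authorized key on the target node.
mkdir -p ~/.ssh
chmod 700 ~/.ssh
cat ~/master_key >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```

The restrictive permissions matter: sshd refuses keys in files that are writable by other users.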
After generating the key, you can verify it's correctly defined by attempting to log in
to the target node from the master:
[hduser@master]$ ssh <target>
The authenticity of host 'target (xxx.xxx.xxx.xxx)' can't be established.
RSA key fingerprint is 72:31:d8:1b:11:36:43:52:56:11:77:a4:ec:82:03:1d.
Are you sure you want to continue connecting (yes/no)? yes


Warning: Permanently added 'target' (RSA) to the list of known hosts.

Last login: Sun Jan 4 15:32:22 2009 from master
After confirming the authenticity of a target node to the master node, you won't be
prompted upon subsequent login attempts.
[hduser@master]$ ssh target
Last login: Sun Jan 4 15:32:49 2009 from master
We have now set the groundwork for running Hadoop on your own cluster. Let's
discuss the different Hadoop modes you might want to use for your projects.

Running Hadoop
We need to configure a few things before running Hadoop.
The first thing you need to do is to specify the location of Java on all the nodes,
including the master. In hadoop-env.sh define the JAVA_HOME environment
variable to point to the Java installation directory. On our servers, we have it
defined as
export JAVA_HOME=/usr/share/jdk
(If you followed the examples in chapter 1, you've already completed this step.) The
hadoop-env.sh file contains other variables for defining your Hadoop environment,
but JAVA_HOME is the only one requiring initial modification. The default
settings on the other variables will probably work fine. As you become more
familiar with Hadoop you can later modify this file to suit your individual needs
(logging directory location, Java class path, and so on).
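A quick way to confirm the value you put in hadoop-env.sh points at a real Java installation is to check for bin/java under it (a sanity-check sketch; the path in the text is just our servers' example):

```shell
# Verify that $JAVA_HOME actually contains a java executable.
if [ -x "$JAVA_HOME/bin/java" ]; then
  echo "JAVA_HOME looks good: $JAVA_HOME"
else
  echo "check JAVA_HOME: no executable at $JAVA_HOME/bin/java"
fi
```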

The majority of Hadoop settings are contained in XML configuration files. Before
version 0.20, these XML files were hadoop-default.xml and hadoop-site.xml. As the
names imply, hadoop-default.xml contains the default Hadoop settings to be used
unless they are explicitly overridden in hadoop-site.xml. In practice you only deal
with hadoop-site.xml. In version 0.20 this file has been separated out into three
XML files: core-site.xml, hdfs-site.xml, and mapred-site.xml. This refactoring better
aligns the configuration settings to the subsystem of Hadoop that they control. In
the rest of this chapter we will generally point out which of the three files is used to
adjust a configuration setting. If you use an earlier version of Hadoop, keep in mind
that all such configuration settings are modified in hadoop-site.xml.

Local (standalone) mode

The standalone mode is the default mode for Hadoop. When you first uncompress
the Hadoop source package, it is ignorant of your hardware setup. Hadoop chooses
to be conservative and assumes a minimal configuration. All three XML files (or
hadoop-site.xml before version 0.20) are empty under this default mode:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
</configuration>
With empty configuration files, Hadoop will run completely on the local machine.
Because there is no need to communicate with other nodes, the standalone mode
doesn't use HDFS, nor will it launch any of the Hadoop daemons. Its primary use is
for developing and debugging the application logic of a MapReduce program
without the additional complexity of interacting with the daemons. When you ran
the example MapReduce program in chapter 1, you were running it in standalone
mode.

Pseudo-distributed mode
The pseudo-distributed mode is running Hadoop in a "cluster of one," with all
daemons running on a single machine. This mode complements the standalone
mode for debugging your code, allowing you to examine memory usage, HDFS
input/output issues, and other daemon interactions. Listing 2.1 provides simple
XML files to configure a single server in this mode.
Listing 2.1 Example of the three configuration files for pseudo-distributed mode

core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    <description>The name of the default file system. A URI whose
    scheme and authority determine the FileSystem implementation.</description>
  </property>
</configuration>

mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
    <description>The host and port that the MapReduce job tracker runs
    at.</description>
  </property>
</configuration>

hdfs-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>The actual number of replications can be specified when
    the file is created. The default is used if replication is not specified
    at create time.</description>
  </property>
</configuration>

Setting up Hadoop on your machine:

This recipe describes how to run Hadoop in the local mode.

First we will get ready for the installation.

Installing Java on our PC:


You can use the following commands for the Java installation:

$ sudo apt-get install python-software-properties
$ sudo add-apt-repository ppa:ferramroberto/java
# Update the source list
$ sudo apt-get update
# Install Sun Java 6 JDK
$ sudo apt-get install sun-java6-jdk
# Select Sun's Java as the default on your machine.
# See 'sudo update-alternatives --config java' for more information
$ sudo update-java-alternatives -s java-6-sun

After installation, make a quick check whether the Sun JDK is correctly set up:
1:- user@ubuntu:~# java -version
2:- java version "1.6.0_20"
3:- Java(TM) SE Runtime Environment (build 1.6.0_20-b02)


4:- Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)

Now let us do the Hadoop installation:
1:-Download the most recent Hadoop 1.0 branch distribution from the Apache
download mirrors.
2:-Unzip the Hadoop distribution using the following command. You will have to
change the x.x in the filename to the actual release you have downloaded. If you
are using Windows, you should use your favorite archive program such as WinZip
or WinRAR for extracting the distribution. From this point onward, we shall call the
Hadoop directory HADOOP_HOME.
>tar xzf hadoop-1.0.x.tar.gz

Adding a dedicated Hadoop user and group:

1:- $ sudo addgroup hadoop
2:- $ sudo adduser --ingroup hadoop hduser
This will add the user hduser and the group hadoop to your local machine.

Configuring SSH
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your
local machine if you want to use Hadoop on it (which is what we want to do in this
short tutorial). For our single-node setup of Hadoop, we therefore need to
configure SSH access to localhost for the hduser user we created in the previous
step.


1:- user@ubuntu:~$ su - hduser

2:- hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
3:- Generating public/private rsa key pair.
4:- Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
5:- Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:
The second line will create an RSA key pair with an empty password. Generally,
using an empty password is not recommended, but in this case it is needed to
unlock the key without your interaction (you don't want to enter the passphrase
every time Hadoop interacts with its nodes).
Next, enable SSH access to your local machine with this newly created key:
hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys


The final step is to test the SSH setup by connecting to your local machine with the
hduser user.
1:- hduser@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes


Warning: Permanently added 'localhost' (RSA) to the list of known hosts.

Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010
i686 GNU/Linux
Ubuntu 10.04 LTS

Disabling IPv6
One problem with IPv6 on Ubuntu is that the various networking-related Hadoop
configuration options can result in Hadoop binding to the IPv6 addresses of your
Ubuntu box. In my case, I realized that there's no practical point in enabling IPv6
on a box when you are not connected to any IPv6 network. Hence, I simply
disabled IPv6 on my Ubuntu machine. Your mileage may vary.
To disable IPv6 on Ubuntu 10.04 LTS, open /etc/sysctl.conf in the editor of your
choice and add the following lines to the end of the file:
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
You have to reboot your machine in order to make the changes take effect. You can
check whether IPv6 is enabled on your machine with the following command:
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
A return value of 0 means IPv6 is enabled; a value of 1 means it is disabled (that's
what we want).
You can also disable IPv6 only for Hadoop, as documented in HADOOP-3437. You
can do so by adding the following line to conf/hadoop-env.sh:

export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

Hadoop Installation
Download Hadoop from the Apache download mirrors and extract the contents of
the Hadoop package to a location of your choice. I picked /usr/local/hadoop.
Make sure to change the owner of all the files to the hduser user and hadoop group,
for example:
$ cd /usr/local
$ sudo tar xzf hadoop-1.0.3.tar.gz
$ sudo mv hadoop-1.0.3 hadoop
$ sudo chown -R hduser:hadoop hadoop
(Just to give you the idea, YMMV: personally, I create a symlink from hadoop-1.0.3
to hadoop.)

Now we update $HOME/.bashrc

Add the following lines to the end of the $HOME/.bashrc file of user hduser. If
you use a shell other than bash, you should of course update its appropriate
configuration files instead of .bashrc.
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-6-sun
unalias fs &> /dev/null
alias fs="hadoop fs"


unalias hls &> /dev/null
alias hls="fs -ls"

How it works
Hadoop local mode does not start any servers but does all the work within the same
JVM. When you submit a job to Hadoop in the local mode, that job starts a JVM to
run the job, and that JVM carries out the job. The output and the behavior of the
job are the same as for a distributed Hadoop job, except for the fact that the job can
only use the current node for running tasks. In the next recipe, we will discover how
to run a MapReduce program using the unzipped Hadoop distribution.

HDFS Concepts
HDFS is the distributed filesystem that is available with Hadoop. MapReduce tasks
use HDFS to read and write data. An HDFS deployment includes a single
NameNode and multiple DataNodes. For the HDFS setup, we need to configure
NameNodes and DataNodes, and then specify the DataNodes in the slaves file.
When we start the NameNode, the startup script will start the DataNodes.
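The slaves file mentioned above is simply a list of hostnames, one per line; the startup scripts SSH into each listed host to launch the slave daemons. For the single-node setup in this tutorial it contains only localhost:

```
# conf/slaves - hosts that run the DataNode/TaskTracker daemons
localhost
```

On a real cluster you would instead list each slave machine's hostname, one per line.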

The Hadoop Distributed File System

Architecture and Design
The following picture gives an overview of the most important HDFS components.


Our goal in this tutorial is a single-node setup of Hadoop. More information on
what we do in this section is available on the Hadoop website.

The only required environment variable we have to configure for Hadoop in this
tutorial is JAVA_HOME. Open conf/hadoop-env.sh in the editor of your choice (if
you used the installation path in this tutorial, the full path is
/usr/local/hadoop/conf/hadoop-env.sh) and set the JAVA_HOME environment
variable to the Sun JDK/JRE 6 directory.
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun


# The java implementation to use. Required.

export JAVA_HOME=/usr/lib/jvm/java-6-sun
Note: If you are on a Mac with OS X 10.7 you can use the following line to set up
JAVA_HOME in conf/hadoop-env.sh:
export JAVA_HOME=`/usr/libexec/java_home`

Now we create the directory and set the required ownerships and permissions.
You can leave the settings below as is, with the exception of the hadoop.tmp.dir
parameter, which you must change to a directory of your choice. We will use the
directory /app/hadoop/tmp in this tutorial. Hadoop's default configurations use
hadoop.tmp.dir as the base temporary directory both for the local file system and
HDFS, so don't be surprised if you see Hadoop creating the specified directory
automatically on HDFS at some later point.
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /app/hadoop/tmp
Note: we create the directory as root.

In this section, we will configure the directory where Hadoop will store its data files,
the network ports it listens to, etc. Our setup will use the Hadoop Distributed File
System, HDFS, even though our little cluster only contains our single local machine.
Add the following snippets between the <configuration> ... </configuration> tags
in the respective configuration XML file.
In file conf/core-site.xml:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming the
  FileSystem implementation class. The uri's authority is used to determine
  the host, port, etc. for a filesystem.</description>
</property>
In file conf/mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If
  "local", then jobs are run in-process as a single map and reduce
  task.</description>
</property>
In file conf/hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified at create
  time.</description>
</property>



Formatting the HDFS filesystem via the NameNode

The first step to starting up your Hadoop installation is formatting the Hadoop
filesystem, which is implemented on top of the local filesystem of your
cluster (which includes only your local machine if you followed this tutorial). You
need to do this the first time you set up a Hadoop cluster.
To format the filesystem (which simply initializes the directory specified by the
dfs.name.dir variable), run the command
hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format

The output will look like this

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop namenode -format
10/05/08 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ubuntu/
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.2
STARTUP_MSG: build =
https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r
911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
10/05/08 16:59:56 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop
10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0
10/05/08 16:59:57 INFO common.Storage: Storage directory
.../hadoop-hduser/dfs/name has been successfully formatted.

10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:

SHUTDOWN_MSG: Shutting down NameNode at ubuntu/

Starting your single-node cluster

Run the command:
hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh
This will start up a Namenode, Datanode, Jobtracker and a Tasktracker on your
machine.
The output will look like this:
hduser@ubuntu:/usr/local/hadoop$ bin/start-all.sh
starting namenode, logging to
/usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-ubuntu.out
localhost: starting datanode, logging to
/usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to
/usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
starting jobtracker, logging to
/usr/local/hadoop/bin/../logs/hadoop-hduser-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to
/usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-ubuntu.out
A nifty tool for checking whether the expected Hadoop processes are running is jps
(part of Sun's Java since v1.5.0). See also how to debug a MapReduce program.
hduser@ubuntu:/usr/local/hadoop$ jps


2287 TaskTracker
2149 JobTracker
1938 DataNode
2085 SecondaryNameNode
2349 Jps
1788 NameNode

Stopping your single-node cluster

Run the command
hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh
to stop all the daemons running on your machine.
Example output:
hduser@ubuntu:/usr/local/hadoop$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode

Web-based cluster UI
Having covered the operational modes of Hadoop, we can now introduce the web
interfaces that Hadoop provides to monitor the health of your cluster. The browser
interface allows you to access information you desire much faster than digging
through logs and directories.
The NameNode hosts a general report on port 50070. It gives you an overview of
the state of your cluster's HDFS. Figure 2.4 displays this report for a two-node
cluster example. From this interface, you can browse through the filesystem, check the
status of each DataNode in your cluster, and peruse the Hadoop daemon logs to
verify your cluster is functioning correctly.
Again, a wealth of information is available through this reporting interface. You can
access the status of ongoing MapReduce tasks as well as detailed reports about
completed jobs. The latter is of particular importance: these logs describe which
nodes performed which tasks and the time and resources required to complete each
task. Finally, the Hadoop configuration for each job is also available, as shown in
figure 2.6. With all of this information you can streamline your MapReduce
programs to better utilize the resources of your cluster.
http://localhost:50070 - the web UI of the NameNode daemon

Figure 2.4 A snapshot of the HDFS web interface. From this interface you can
browse through the HDFS filesystem, determine the storage available on each
individual node, and monitor the overall health of your cluster.


Figure 2.5 A snapshot of the MapReduce web interface. This tool allows you to
monitor active MapReduce jobs and access the logs of each map and reduce task.
The logs of previously submitted jobs are also available and are useful for
debugging your programs.

Hadoop modules
Apache Hadoop includes these modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS): A distributed file system that provides
high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Other Hadoop-related projects at Apache

1:-Avro: A data serialization system.
2:-Cassandra: A scalable multi-master database with no single points of failure.

3:-Chukwa: A data collection system for managing large distributed systems.
4:-HBase: A scalable, distributed database that supports structured data storage
for large tables.
5:-Hive: A data warehouse infrastructure that provides data summarization and
ad hoc querying.
6:-Mahout: A scalable machine learning and data mining library.
7:-Pig: A high-level data-flow language and execution framework for parallel computation.
8:-ZooKeeper: A high-performance coordination service for distributed applications.

Getting Started with MapReduce Programming on a Single Cluster
MapReduce is a programming model for data processing. The model is simple, yet
not too simple to express useful programs in. Hadoop can run MapReduce
programs written in various languages; we shall look at the same program
expressed in Java, Ruby, Python, and C++. Most important, MapReduce programs
are inherently parallel, thus putting very large-scale data analysis into the hands of
anyone with enough machines at their disposal. MapReduce comes into its own for
large datasets, so let's start by looking at one.

Writing a WordCount MapReduce sample

This section shows how to write a simple MapReduce program and how to execute it.


To run a MapReduce job, users should furnish a map function, a reduce function,
input data, and an output data location. When executed, Hadoop carries out the
following steps:
1:- Hadoop breaks the input data into multiple data items by newlines and runs
the map function once for each data item, giving the item as the input for the
function. The map function outputs one or more key-value pairs.
2:- Hadoop collects all the key-value pairs generated from the map function, sorts
them by the key, and groups together the values with the same key.
3:- For each distinct key, Hadoop runs the reduce function once while passing the
key and list of values for that key as input.
4:- The reduce function may output one or more key-value pairs, and Hadoop
writes them to a file as the final result.
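The four steps above can be mimicked in a single-process Python sketch. This is our own illustration (the function names and the in-memory grouping are not Hadoop API); a real job distributes this work across a cluster and writes files instead of returning a dictionary.

```python
# A minimal, local simulation of the four steps Hadoop performs for a
# MapReduce job, using word counting as the example.
from collections import defaultdict

def map_fn(line):
    # Step 1: the map function emits one (word, 1) pair per word in the line.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Steps 3 and 4: the reduce function sums the counts for one distinct key.
    return (key, sum(values))

def run_job(input_data):
    # Step 1: break the input into data items by newlines, run map per item.
    pairs = []
    for line in input_data.splitlines():
        pairs.extend(map_fn(line))
    # Step 2: collect all pairs, sort by key, group values with the same key.
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    # Steps 3 and 4: run reduce once per distinct key; collect final output.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(run_job("to be or\nnot to be"))  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

In a real job, the grouped pairs for different keys may go to different reduce tasks running on different nodes.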

Adding a combiner step to the WordCount MapReduce program
After running the map function, if there are many key-value pairs with the same key,
Hadoop has to move all those values to the reduce function. This can incur a
significant overhead. To optimize such scenarios, Hadoop supports a special
function called a combiner. If provided, Hadoop will call the combiner on the same
node as the mapper, after running the mapper and before invoking the reducer.
This can significantly reduce the amount of data transferred to the reduce step.
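A rough local sketch of the saving (the numbers are made up, and we count key-value pairs as a stand-in for bytes moved over the network):

```python
# How a combiner shrinks the data a map node ships to the reducer. The
# combiner here is the same operation as the reducer: summing counts.
from collections import Counter

# One map node's output: 1,000 occurrences of "the", 10 of "quick".
map_output = [("the", 1)] * 1000 + [("quick", 1)] * 10

# Without a combiner, every pair travels to the reduce step.
shipped_without = len(map_output)

# With a combiner, counts are pre-summed on the map node, so only one
# pair per distinct key travels.
combined = Counter()
for word, count in map_output:
    combined[word] += count
shipped_with = len(combined)

print(shipped_without, shipped_with)  # 1010 2
```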

Developing a MapReduce Application

We have seen the domain of business problems that Hadoop was designed to solve, and the
internal architecture of Hadoop that allows it to solve these problems. Applications
that run in Hadoop are called MapReduce applications, so this section
demonstrates how to build a simple MapReduce application.

Setting Up a Development Environment

We will use three ebooks from Project Gutenberg for this example:
The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
The Notebooks of Leonardo Da Vinci
Ulysses by James Joyce
Download each ebook as text files in Plain Text UTF-8 encoding and store the files
in a local temporary directory of choice, for example /tmp/gutenberg.
In this case we create the directory:
$ sudo mkdir -p /app/hadoop/tmp/gutenberg
$ sudo chown hduser:hadoop /app/hadoop/tmp/gutenberg
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /app/hadoop/tmp/gutenberg
After that, copy all the files into the gutenberg directory with the cp command, then check:
hduser@ubuntu:~$ ls -l /tmp/gutenberg/

total 3604
-rw-r--r-- 1 hduser hadoop 674566 Feb 3 10:17 pg20417.txt
-rw-r--r-- 1 hduser hadoop 1573112 Feb 3 10:18 pg4300.txt
-rw-r--r-- 1 hduser hadoop 1423801 Feb 3 10:18 pg5000.txt
Download and decompress this file on your local machine. If you are planning on
doing quite a bit of Hadoop development, it might be in your best interest to add
the decompressed bin folder to your environment PATH. You can test your
installation by executing the hadoop command from the bin folder:
There are numerous commands that can be passed to Hadoop, but in this section we
will be focusing on executing Hadoop applications in a development environment, so
the only one we will be interested in is the following:
hadoop jar <jar-file-name>

Phases in the MapReduce Framework

MapReduce works by breaking the processing into two phases :
1:-Map phase
2:-Reduce phase
Each phase has key-value pairs as input and output, the types of which may be
chosen by the programmer. The programmer also specifies two functions: the map
function and the reduce function.
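In type terms, the two phases can be sketched as follows. The aliases and the toy "year,temperature" record format below are our own illustration, not part of Hadoop, though Hadoop's Java API expresses the same shapes generically as Mapper and Reducer classes parameterized by input and output key/value types.

```python
# map:    (K1, V1)         -> list of (K2, V2)
# reduce: (K2, list of V2) -> list of (K3, V3)
from typing import Callable, Iterable, List, Tuple

MapFn = Callable[[int, str], List[Tuple[str, int]]]
ReduceFn = Callable[[str, Iterable[int]], List[Tuple[str, int]]]

def max_temp_map(offset: int, line: str) -> List[Tuple[str, int]]:
    # K1 is the line's byte offset; the map function ignores it.
    year, temp = line.split(",")        # toy record format: "year,temp"
    return [(year, int(temp))]

def max_temp_reduce(year: str, temps: Iterable[int]) -> List[Tuple[str, int]]:
    # Reduce receives one key and all of its values at once.
    return [(year, max(temps))]

print(max_temp_map(106, "1950,22"))           # [('1950', 22)]
print(max_temp_reduce("1950", [0, 22, -11]))  # [('1950', 22)]
```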
The input to our map phase is the raw NCDC data. We choose a text
input format that gives us each line in the dataset as a text value. The key is
the offset of the beginning of the line from the beginning of the file, but as
we have no need for this, we ignore it.
Our map function is simple. We pull out the year and the air temperature, since
these are the only fields we are interested in. In this case, the map function is just a
data preparation phase, setting up the data in such a way that the reducer function
can do its work on it: finding the maximum temperature for each year. The map
function is also a good place to drop bad records: here we filter out temperatures
that are missing, suspect, or erroneous.
To visualize the way the map works, consider the following sample lines of input
data (some unused columns have been dropped to fit the page, indicated by
ellipses). These lines are presented to the map function as the key-value pairs:
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)
The keys are the line offsets within the file, which we ignore in our map function.
The map function merely extracts the year and the air temperature (indicated in
bold text), and emits them as its output (the temperature values have been
interpreted as integers):
(1950, 0)
(1950, 22)
(1950, -11)
(1949, 111)
(1949, 78)
The output from the map function is processed by the MapReduce framework
before being sent to the reduce function. This processing sorts and groups the
key-value pairs by key. So, continuing the example, our reduce function sees the
following input:
(1949, [111, 78])
(1950, [0, 22, -11])
Each year appears with a list of all its air temperature readings. All the reduce
function has to do now is iterate through the list and pick up the maximum reading:
(1949, 111)
(1950, 22)
This is the final output: the maximum global temperature recorded in each year. The
whole data flow is illustrated in Figure 2-1. At the bottom of the diagram is a Unix
pipeline, which mimics the whole MapReduce flow, and which we will see again
later in the chapter when we look at Hadoop Streaming.
Figure 2-1. MapReduce logical data flow
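The whole flow above can be mimicked locally in Python. This sketch is our own illustration and starts from the already-extracted (year, temperature) pairs rather than parsing the raw NCDC records:

```python
# Local simulation of the max-temperature data flow: the framework sorts
# and groups the map output by key, then reduce picks each year's maximum.
from collections import defaultdict

map_output = [(1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)]

groups = defaultdict(list)
for year, temp in sorted(map_output, key=lambda pair: pair[0]):
    groups[year].append(temp)
# groups now holds {1949: [111, 78], 1950: [0, 22, -11]}, as in the text.

result = {year: max(temps) for year, temps in groups.items()}
print(result)  # {1949: 111, 1950: 22}
```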

MapReduce Input and Output Formats

The first program that you write in any programming language is typically a
"Hello, World" application. In terms of Hadoop and MapReduce, the standard application
that everyone writes is the Word Count application. The Word Count application
counts the number of times each word in a large amount of text occurs. It is a
perfect example to learn about MapReduce because the mapping step and reducing
step are trivial, but introduce you to thinking in MapReduce. The following is a
summary of the components in the Word Count application and their function.
FileInputFormat: We define a FileInputFormat to read all of the files in a specified
directory (passed as the first argument to the MapReduce application) and pass
those to a TextInputFormat (see Listing 1) for distribution to our mappers.
TextInputFormat: The default InputFormat for Hadoop is the TextInputFormat,
which reads one line at a time and returns the byte offset as the key
(LongWritable) and the line of text as the value (Text).
Word Count Mapper: This is a class that we write which tokenizes the single line of
text passed to it by the InputFormat into words and then emits each word with
a count of 1 to note that we saw the word.

Word Count Combiner: While we don't need a combiner in a development environment, the combiner is an
implementation of the reducer (described later in this section) that runs on the local
node before passing the key/value pair to the reducer. Using combiners can
dramatically improve performance, but you need to make sure that combining your
results does not break your reducer: in order for the reducer to be used as a
combiner, its operation must be associative; otherwise, the maps sent to the reducer
will not produce the correct result.
Word Count Reducer: The word count reducer receives a map of every word and a
list of all the counts for the number of times that the word was observed by the
mappers. Without a combiner, the reducer would receive a word and a collection
of 1s, but because we are going to use the reducer as a combiner, we will have a
collection of numbers that will need to be added together.
TextOutputFormat: In this example, we use the TextOutputFormat class and tell it
that the keys will be Text and the values will be IntWritable.
FileOutputFormat: The TextOutputFormat sends its formatted output to a
FileOutputFormat, which writes the results to an output directory that it creates.
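To see why the combiner's operation must be associative, here is a small contrast between a sum, which is safe to pre-combine on map nodes, and a mean, which is not. The two-node split is hypothetical, purely for illustration:

```python
# A reducer can double as a combiner only if pre-combining partial groups
# of values gives the same final answer as reducing all values at once.
def mean(xs):
    return sum(xs) / len(xs)

# Sum: combining each node's partial sum, then reducing, matches reducing
# everything in one place.
halves = ([1, 1], [1, 1])
assert sum(sum(h) for h in halves) == sum([1, 1, 1, 1])

# Mean: reducing per-node means gives the wrong answer when the nodes
# hold different numbers of values.
uneven = ([10], [20, 30, 40])
per_node = mean([mean(h) for h in uneven])  # mean(10.0, 30.0) = 20.0
overall = mean([10, 20, 30, 40])            # 25.0
print(per_node, overall)  # 20.0 25.0
```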

Sample Applications

Download each ebook as text files in Plain Text UTF-8 encoding and store the files
in a local temporary directory of choice, for example /tmp/gutenberg. We have
already done this above.

Restart the Hadoop cluster

Start your Hadoop cluster if it is not running already.
1:-hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh
Before we run the actual MapReduce job, we must first copy the files from our local
file system to Hadoop HDFS.
1:-hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal
/tmp/gutenberg /user/hduser/gutenberg
2:-hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls
Found 1 items
drwxr-xr-x - hduser supergroup
0 2010-05-08 17:40 /user/hduser/gutenberg
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg
Found 3 items
-rw-r--r-- 3 hduser supergroup 674566 2011-03-10 11:38
-rw-r--r-- 3 hduser supergroup 1573112 2011-03-10 11:38
-rw-r--r-- 3 hduser supergroup 1423801 2011-03-10 11:38
Now that everything is prepared, we can finally run our MapReduce job on the
Hadoop cluster. As I said above, we leverage the Hadoop Streaming API to
help us pass data between our Map and Reduce code via STDIN and STDOUT.

Now, we actually run the WordCount example job:
1:-hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar
wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

Note: Some people run the command above and get the following error message:
Exception in thread "main" java.io.IOException: Error opening job jar:
at org.apache.hadoop.util.RunJar.main(RunJar.java:90)
Caused by: java.util.zip.ZipException: error in opening zip file
In this case, re-run the command with the full name of the Hadoop Examples JAR
file, for example:
1:-hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-examples-1.0.3.jar
wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

Example output of the previous command in the console:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar
wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
10/05/08 17:43:00 INFO input.FileInputFormat: Total input paths to process : 3
10/05/08 17:43:01 INFO mapred.JobClient: Running job: job_201005081732_0001
10/05/08 17:43:02 INFO mapred.JobClient: map 0% reduce 0%
10/05/08 17:43:14 INFO mapred.JobClient: map 66% reduce 0%
10/05/08 17:43:17 INFO mapred.JobClient: map 100% reduce 0%
10/05/08 17:43:26 INFO mapred.JobClient: map 100% reduce 100%

10/05/08 17:43:28 INFO mapred.JobClient: Job complete:

10/05/08 17:43:28 INFO mapred.JobClient: Counters: 17
10/05/08 17:43:28 INFO mapred.JobClient: Job Counters
10/05/08 17:43:28 INFO mapred.JobClient: Launched reduce tasks=1
10/05/08 17:43:28 INFO mapred.JobClient: Launched map tasks=3
10/05/08 17:43:28 INFO mapred.JobClient: Data-local map tasks=3
10/05/08 17:43:28 INFO mapred.JobClient: FileSystemCounters
10/05/08 17:43:28 INFO mapred.JobClient: FILE_BYTES_READ=2214026
10/05/08 17:43:28 INFO mapred.JobClient: HDFS_BYTES_READ=3639512
10/05/08 17:43:28 INFO mapred.JobClient:
10/05/08 17:43:28 INFO mapred.JobClient:
10/05/08 17:43:28 INFO mapred.JobClient: Map-Reduce Framework
10/05/08 17:43:28 INFO mapred.JobClient: Reduce input groups=82290
10/05/08 17:43:28 INFO mapred.JobClient: Combine output records=102286
10/05/08 17:43:28 INFO mapred.JobClient: Map input records=77934
10/05/08 17:43:28 INFO mapred.JobClient: Reduce shuffle bytes=1473796
10/05/08 17:43:28 INFO mapred.JobClient: Reduce output records=82290
10/05/08 17:43:28 INFO mapred.JobClient: Spilled Records=255874
10/05/08 17:43:28 INFO mapred.JobClient: Map output bytes=6076267
10/05/08 17:43:28 INFO mapred.JobClient: Combine input records=629187
10/05/08 17:43:28 INFO mapred.JobClient: Map output records=629187
10/05/08 17:43:28 INFO mapred.JobClient: Reduce input records=102286
Check if the result is successfully stored in the HDFS directory:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser
Found 2 items
drwxr-xr-x - hduser supergroup
0 2010-05-08 17:40 /user/hduser/gutenberg
drwxr-xr-x - hduser supergroup
0 2010-05-08 17:43
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg-output
Found 2 items
drwxr-xr-x - hduser supergroup
0 2010-05-08 17:43
-rw-r--r-- 1 hduser supergroup 880802 2010-05-08 17:43
Retrieve the job result from HDFS: to inspect the file, you can copy it from HDFS
to the local file system. Alternatively, you can use the command:
1:-hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat
hduser@ubuntu:/usr/local/hadoop$ mkdir /tmp/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -getmerge
/user/hduser/gutenberg-output /tmp/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$ head /tmp/gutenberg-output/gutenberg-output
"1490 1
"1498," 1
"35" 1
"40," 1
"A_ 1
"Alack! 1
You can see a screenshot here:


JobTracker Web Interface (MapReduce layer)

The JobTracker web UI provides information about general job statistics of the
Hadoop cluster, running/completed/failed jobs, and a job history log file. It also
gives access to the local machine's Hadoop log files (the machine on which the
web UI is running).
By default, it is available at http://localhost:50030/.
1:- Scheduling

2:- Running job

Killing Jobs
Unfortunately, sometimes a job goes awry after you have started it, but it doesn't
actually fail. It may take a long time to run or may even be stuck in an infinite loop.
In (pseudo-) distributed mode you can manually kill a job using the command
bin/hadoop job -kill job_id
where job_id is the job ID as shown in the JobTracker web UI.

We have discussed the key nodes and the roles they play within the Hadoop
architecture. You have learned how to configure your cluster, as well as use some
basic tools to monitor your cluster's overall health.


Overall, this chapter focuses on one-time tasks. Once you have formatted the
NameNode for your cluster, you will (hopefully) never need to do so again.
Likewise, you shouldn't keep altering the hadoop-site.xml configuration file for your
cluster or reassigning daemons to nodes. In the next chapter, you will learn about the
aspects of Hadoop you will be interacting with on a daily basis, such as managing
files in HDFS. With this knowledge you will be able to begin writing your own
MapReduce applications and realize the true potential that Hadoop has to offer.
