
Hadoop Configuration

There are a handful of files for controlling the configuration of a Hadoop installation; the most important ones are
listed in Table 9-1. This section covers MapReduce 1, which employs the jobtracker and tasktracker daemons.
Running MapReduce 2 is substantially different, and is covered in YARN Configuration on page 318.
Table 9-1. Hadoop configuration files

Filename                    Format                     Description
hadoop-env.sh               Bash script                Environment variables that are used in the scripts to run Hadoop.
core-site.xml               Hadoop configuration XML   Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
hdfs-site.xml               Hadoop configuration XML   Configuration settings for HDFS daemons: the namenode, the secondary namenode, and the datanodes.
mapred-site.xml             Hadoop configuration XML   Configuration settings for MapReduce daemons: the jobtracker, and the tasktrackers.
masters                     Plain text                 A list of machines (one per line) that each run a secondary namenode.
slaves                      Plain text                 A list of machines (one per line) that each run a datanode and a tasktracker.
hadoop-metrics.properties   Java Properties            Properties for controlling how metrics are published in Hadoop (see Metrics on page 350).
log4j.properties            Java Properties            Properties for system logfiles, the namenode audit log, and the task log for the tasktracker child process (Hadoop Logs on page 173).
These files are all found in the conf directory of the Hadoop distribution. The configuration directory can be relocated to another part of the filesystem (outside the Hadoop installation), as long as daemons are started with the --config option (or, equivalently, with the HADOOP_CONF_DIR environment variable set) specifying the location of this directory on the local filesystem.

YARN Configuration
YARN is the next-generation architecture for running MapReduce (and is described in YARN (MapReduce 2) on
page 194). It has a different set of daemons and configuration options to classic MapReduce (also called
MapReduce 1), and in this section we shall look at these differences and how to run MapReduce on YARN.
Under YARN you no longer run a jobtracker or tasktrackers. Instead, there is a single resource manager running on
the same machine as the HDFS namenode (for small clusters) or on a dedicated machine, and node managers
running on each worker node in the cluster.
The YARN start-all.sh script (in the bin directory) starts the YARN daemons in the cluster. This script will start a
resource manager (on the machine the script is run on), and a node manager on each machine listed in the slaves
file.
YARN also has a job history server daemon that provides users with details of past job runs, and a web app proxy
server for providing a secure way for users to access the UI provided by YARN applications. In the case of
MapReduce, the web UI served by the proxy provides information about the current job you are running, similar to
the one described in The MapReduce Web UI on page 164. By default the web app proxy server runs in the same
process as the resource manager, but it may be configured to run as a standalone daemon.
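To run the proxy as a standalone daemon, an address for it can be set in yarn-site.xml. The property name yarn.web-proxy.address and the hostname below are assumptions for illustration; when no such address is configured, the proxy runs inside the resource manager process as described above.

```xml
<?xml version="1.0"?>
<!-- yarn-site.xml -->
<configuration>
  <property>
    <!-- When set, the web app proxy runs as its own daemon at this address
         (hostname and port are illustrative) -->
    <name>yarn.web-proxy.address</name>
    <value>proxyhost:8089</value>
  </property>
</configuration>
```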

YARN has its own set of configuration files, listed in Table 9-8; these are used in addition to those in Table 9-1.
Table 9-8. YARN configuration files

Filename        Format                     Description
yarn-env.sh     Bash script                Environment variables that are used in the scripts to run YARN.
yarn-site.xml   Hadoop configuration XML   Configuration settings for YARN daemons: the resource manager, the job history server, the web app proxy server, and the node managers.

Important YARN Daemon Properties


When running MapReduce on YARN, the mapred-site.xml file is still used for general MapReduce properties, although the jobtracker- and tasktracker-related properties are not used. None of the properties in Table 9-4 are applicable to YARN, except for mapred.child.java.opts (and the related properties mapreduce.map.java.opts and mapreduce.reduce.java.opts, which apply only to map or reduce tasks, respectively). The JVM options specified in this way are used to launch the YARN child process that runs map or reduce tasks.
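For example, to give map and reduce tasks different heap sizes, the per-task properties just mentioned can be set in mapred-site.xml; the heap values below are illustrative only:

```xml
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx400m</value> <!-- JVM options for map tasks only -->
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx800m</value> <!-- JVM options for reduce tasks only -->
  </property>
</configuration>
```

When set, these take effect for the corresponding task type in place of the generic mapred.child.java.opts value.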
The configuration files in Example 9-4 show some of the important configuration properties for running
MapReduce on YARN.
Example 9-4. An example set of site configuration files for running MapReduce on YARN
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx400m</value>
    <!-- Not marked as final so jobs can include JVM debugging options -->
  </property>
</configuration>

<?xml version="1.0"?>
<!-- yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>resourcemanager:8040</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/disk1/nm-local-dir,/disk2/nm-local-dir</value>
    <final>true</final>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
  </property>
</configuration>

The YARN resource manager address is controlled via yarn.resourcemanager.address, which takes the form of a host-port pair. In a client configuration this property is used to connect to the resource manager (using RPC); in addition, the mapreduce.framework.name property must be set to yarn for the client to use YARN rather than the local job runner.
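A minimal client-side mapred-site.xml reflecting this might look as follows (the resource manager address itself goes in yarn-site.xml, as in Example 9-4):

```xml
<?xml version="1.0"?>
<!-- mapred-site.xml (client configuration) -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value> <!-- submit jobs to YARN rather than the local job runner -->
  </property>
</configuration>
```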
Although YARN does not honor mapred.local.dir, it has an equivalent property called yarn.nodemanager.local-dirs, which allows you to specify which local disks to store intermediate data on. It is specified by a comma-separated list of local directory paths, which are used in a round-robin fashion.
YARN doesn't have tasktrackers to serve map outputs to reduce tasks, so for this function it relies on shuffle handlers, which are long-running auxiliary services running in node managers. Since YARN is a general-purpose service, the shuffle handlers need to be explicitly enabled in yarn-site.xml by setting the yarn.nodemanager.aux-services property to mapreduce.shuffle.
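Following the yarn.nodemanager.aux-services.service-name.class pattern, the shuffle handler can be wired up in yarn-site.xml as sketched below. The handler class name shown is an assumption based on Hadoop distributions of this vintage, where a built-in ShuffleHandler implements this service:

```xml
<?xml version="1.0"?>
<!-- yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value> <!-- enable the MapReduce shuffle handler -->
  </property>
  <property>
    <!-- Class implementing the mapreduce.shuffle service (assumed name) -->
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
```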
Table 9-9 summarizes the important configuration properties for YARN.
Table 9-9. Important YARN daemon properties

Property name                         Type                              Default value       Description
yarn.resourcemanager.address          hostname and port                 0.0.0.0:8040        The hostname and port that the resource manager's RPC server runs on.
yarn.nodemanager.local-dirs           comma-separated directory names   /tmp/nm-local-dir   A list of directories where node managers allow containers to store intermediate data. The data is cleared out when the application ends.
yarn.nodemanager.aux-services         comma-separated service names                         A list of auxiliary services run by the node manager. A service is implemented by the class defined by the property yarn.nodemanager.aux-services.service-name.class. By default no auxiliary services are specified.
yarn.nodemanager.resource.memory-mb   int                               8192                The amount of physical memory (in MB) which may be allocated to containers being run by the node manager.
yarn.nodemanager.vmem-pmem-ratio      float                             2.1                 The ratio of virtual to physical memory for containers. Virtual memory usage may exceed the allocation by this amount.