
Hadoop Setup

Prerequisites:
• System: Mac OS / Linux / Cygwin on Windows
Notice:
1. Only Ubuntu will be supported by the TA. You may
try other environments as a challenge.
2. Cygwin on Windows is not recommended because of its
instability and unforeseen bugs.

• Java Runtime Environment, Java™ 1.6.x recommended

• ssh must be installed and sshd must be running to use the
Hadoop scripts that manage remote Hadoop daemons.
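
A quick way to check these prerequisites on Ubuntu is sketched below; the package names (openjdk-6-jdk, openssh-server) are assumptions for a typical Ubuntu system and may differ on other distributions.
$ java -version           # should report a 1.6.x JVM
$ service ssh status      # sshd should be installed and running
$ sudo apt-get install openjdk-6-jdk openssh-server   # install if missing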

Hadoop Setup
Single Node Setup (usually for debugging)
• Untar hadoop-*.**.*.tar.gz to your user path
About Version:
The latest stable version 1.0.1 is recommended.

• edit the file conf/hadoop-env.sh to define at least
JAVA_HOME to be the root of your Java installation
(a sketch follows the config files below)
• edit the files to configure properties:
conf/core-site.xml:
<configuration>
  <property><name>fs.default.name</name><value>hdfs://localhost:9000</value></property>
</configuration>

conf/hdfs-site.xml:
<configuration>
  <property><name>dfs.replication</name><value>1</value></property>
</configuration>

conf/mapred-site.xml:
<configuration>
  <property><name>mapred.job.tracker</name><value>localhost:9001</value></property>
</configuration>
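
For the JAVA_HOME step above, a minimal sketch of the edit in conf/hadoop-env.sh; the JDK path is an assumption and should match your own installation.
# in conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk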
Hadoop Setup
Cluster Setup (the only acceptable setup for HW)
• Same steps as single node setup
• Set the dfs.name.dir and dfs.data.dir properties in hdfs-site.xml
(a sketch follows this list)
• Add the master's node name to conf/masters
Add all the slaves' node names to conf/slaves
• Edit /etc/hosts on each node: add an IP address and node name
entry for every node
Suppose your master's node name is ubuntu1 and its IP is
192.168.0.2; then add the line "192.168.0.2 ubuntu1" to the file
• Copy the Hadoop folder to the same path on all nodes
Notice: JAVA_HOME may not be the same on every node
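
A minimal sketch of the dfs.name.dir / dfs.data.dir entries for conf/hdfs-site.xml is shown below; the directory paths are assumptions and should point at locations that exist on your nodes.
<!-- add inside the existing <configuration> element of conf/hdfs-site.xml -->
<property><name>dfs.name.dir</name><value>/home/hadoop/dfs/name</value></property>
<property><name>dfs.data.dir</name><value>/home/hadoop/dfs/data</value></property>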

Hadoop Setup
Execution
• Generate an ssh key pair. With an empty passphrase, no password
prompt will appear when the daemons are started:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost
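
For the cluster setup, the master also needs passwordless ssh to every slave. One way to distribute the key is sketched below; the slave name ubuntu2 is an assumption standing in for each node listed in conf/slaves.
$ ssh-copy-id -i ~/.ssh/id_dsa.pub ubuntu2   # repeat for every slave
$ ssh ubuntu2                                # should log in without a password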

• Format a new distributed filesystem:
$ bin/hadoop namenode -format
• Start the hadoop daemons:
$ bin/start-all.sh
• The hadoop daemon log output is written to the
${HADOOP_LOG_DIR} directory (defaults to
${HADOOP_HOME}/logs).
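
As a quick sanity check (not on the original slide), the jps tool shipped with the JDK lists the running Java processes; after start-all.sh on a single node you should see the NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker daemons.
$ jps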

Hadoop Setup
Execution (continued)

• Copy the input files into the distributed filesystem:
$ bin/hadoop fs -put conf input
• Run some of the examples provided:
$ bin/hadoop jar hadoop-*-examples.jar grep input
output 'dfs[a-z.]+'
• Examine the output files:
• View the output files on the distributed filesystem:
$ bin/hadoop fs -cat output/*
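
Alternatively (not shown on the slide), copy the output from the distributed filesystem to the local filesystem and examine it there:
$ bin/hadoop fs -get output output
$ cat output/*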

• When you're done, stop the daemons with:
$ bin/stop-all.sh

Hadoop Setup
Details About Configuration Files
Hadoop configuration is driven by two types of important
configuration files:
1. Read-only default configuration:
src/core/core-default.xml
src/hdfs/hdfs-default.xml
src/mapred/mapred-default.xml
conf/mapred-queues.xml.template.
2. Site-specific configuration:
conf/core-site.xml
conf/hdfs-site.xml
conf/mapred-site.xml
conf/mapred-queues.xml
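
Values in the site-specific files override the read-only defaults. As an illustration (the property name and its default are real; the replacement path is an assumption), hadoop.tmp.dir can be overridden in conf/core-site.xml:
<!-- inside <configuration> in conf/core-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <!-- default is /tmp/hadoop-${user.name} -->
  <value>/home/hadoop/tmp</value>
</property>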

Hadoop Setup
Details About Configuration Files (continued)
conf/core-site.xml:
• fs.default.name
  Value: URI of NameNode.
  Notes: hdfs://hostname/

conf/hdfs-site.xml:
• dfs.name.dir
  Value: Path on the local filesystem where the NameNode stores the
  namespace and transaction logs persistently.
  Notes: If this is a comma-delimited list of directories, then the name
  table is replicated in all of the directories, for redundancy.
• dfs.data.dir
  Value: Comma-separated list of paths on the local filesystem of a
  DataNode where it should store its blocks.
  Notes: If this is a comma-delimited list of directories, then data will
  be stored in all named directories, typically on different devices.
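
Tying this back to the cluster example, fs.default.name on every node would point at the master; a sketch is below, where the host name ubuntu1 comes from the earlier /etc/hosts example and port 9000 is carried over from the single-node configuration.
<!-- in conf/core-site.xml on every node -->
<property><name>fs.default.name</name><value>hdfs://ubuntu1:9000</value></property>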

Hadoop Setup
Details About Configuration Files (continued)
conf/mapred-site.xml:
• mapred.job.tracker
  Value: Host or IP and port of JobTracker.
  Notes: host:port pair.
• mapred.system.dir
  Value: Path on the HDFS where the Map/Reduce framework stores system
  files, e.g. /hadoop/mapred/system/.
  Notes: This is in the default filesystem (HDFS) and must be accessible
  from both the server and client machines.
• mapred.local.dir
  Value: Comma-separated list of paths on the local filesystem where
  temporary Map/Reduce data is written.
  Notes: Multiple paths help spread disk i/o.
• mapred.tasktracker.{map|reduce}.tasks.maximum
  Value: The maximum number of Map/Reduce tasks, which are run
  simultaneously on a given TaskTracker, individually.
  Notes: Defaults to 2 (2 maps and 2 reduces), but vary it depending on
  your hardware.
• dfs.hosts/dfs.hosts.exclude
  Value: List of permitted/excluded DataNodes.
  Notes: If necessary, use these files to control the list of allowable
  DataNodes.
• mapred.hosts/mapred.hosts.exclude
  Value: List of permitted/excluded TaskTrackers.
  Notes: If necessary, use these files to control the list of allowable
  TaskTrackers.
• mapred.queue.names
  Value: Comma-separated list of queues to which jobs can be submitted.
  Notes: The Map/Reduce system always supports at least one queue with
  the name "default"; hence, this parameter's value should always contain
  the string "default". Some job schedulers supported in Hadoop, like the
  Capacity Scheduler, support multiple queues. If such a scheduler is
  being used, the list of configured queue names must be specified here.
  Once queues are defined, users can submit jobs to a queue using the
  property name mapred.job.queue.name in the job configuration. There
  could be a separate configuration file for configuring the properties
  of these queues, managed by the scheduler. Refer to the documentation
  of the scheduler for more information.
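
A brief sketch of how a few of these could look in conf/mapred-site.xml for the example cluster; ubuntu1 and port 9001 are carried over from the earlier examples, and the task maximums of 4 are an arbitrary illustration for nodes with more cores.
<!-- inside <configuration> in conf/mapred-site.xml -->
<property><name>mapred.job.tracker</name><value>ubuntu1:9001</value></property>
<property><name>mapred.tasktracker.map.tasks.maximum</name><value>4</value></property>
<property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>4</value></property>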

Hadoop Setup
You can get detailed information from
The official site:
http://hadoop.apache.org

Course slides & Textbooks:
http://www.cs.sjtu.edu.cn/~liwujun/course/mmds.html

Michael G. Noll's Blog (a good guide):
http://www.michael-noll.com/

If you have good materials to share, please send them to the TA.

Hadoop Setup
