
Installing Hadoop and Running a WordCount Application
Cahya Perdana
201583373
1. Create instances on Amazon Web Services

Choose an Ubuntu image.

Add 4 instances, so we will have 4 nodes (1 master, 1 secondary master, and 2 slaves).

Set up storage for each node; we recommend 29 GB per node.

On the Configure Security Group step, just click Review and Launch.

Finally, click Launch to launch all the instances.

When the dialog box to download the key pair appears, click Download Key Pair. This key pair will be needed to connect to all the instances we have created.

All the instances we have created should now be running.

To make it easy to tell which instance is the master node and which are the slave nodes, we rename the instances to something easy to remember.

Now we add the protocols we want to allow in the security group: SSH on port 22, All TCP with source Anywhere, and All ICMP with source Anywhere. With these settings we can connect to our instances remotely and even check them with ping.

2. Remote instances
To connect to our instances remotely, we need some tools:
PuTTY Key Generator: a tool that generates the private key used to connect to our instances on Amazon Web Services.
PuTTY: a tool that connects to our instances over SSH.

First we generate the key. Import the key pair that we downloaded before.

Save the key from PuTTY Key Generator.

Now open the PuTTY client and set the SSH key in the SSH menu.

Choose Session and fill in the DNS of our instance.

Log in as user ubuntu.

Now open all the instances in PuTTY.

Now set up the hostnames of our nodes according to this table:

Name        Public DNS                                            Private IP
Master      ec2-54-191-198-212.us-west-2.compute.amazonaws.com    172.31.45.31
slave2      ec2-52-11-196-185.us-west-2.compute.amazonaws.com     172.31.45.30
slave1      ec2-54-191-199-0.us-west-2.compute.amazonaws.com      172.31.45.29
secondary   ec2-54-191-198-251.us-west-2.compute.amazonaws.com    172.31.45.28

Edit the host mappings on each node using the command $sudo vi /etc/hosts
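The original shows the resulting file only in a screenshot. A sketch of the entries you might add, using the private IPs and DNS names from the table above (the short aliases at the end are our own choice, not from the original):

172.31.45.31 ec2-54-191-198-212.us-west-2.compute.amazonaws.com master
172.31.45.30 ec2-52-11-196-185.us-west-2.compute.amazonaws.com slave2
172.31.45.29 ec2-54-191-199-0.us-west-2.compute.amazonaws.com slave1
172.31.45.28 ec2-54-191-198-251.us-west-2.compute.amazonaws.com secondary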

Then we will upload our key to the master node, so the master node can connect to the other nodes over SSH. To upload a file to the master node, we use WinSCP. Set up a WinSCP connection to the master node.

After we have set up the connection to the master node, we upload the key (hadoop.pem).

3. Download Hadoop
These instructions apply to all nodes. First we have to update our repositories so we can get the latest Java SDK.
$sudo apt-get update
$sudo add-apt-repository ppa:webupd8team/java
$sudo apt-get update

Run the command $sudo apt-get install oracle-java7-installer to install the Java Development Kit version 7.
Check Java 7 with the command $java -version

Download Hadoop with this command:
$wget http://apache.mirror.gtcomm.net/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz

Extract the file using:
$tar xzvf hadoop-1.2.1.tar.gz
Rename the folder with:
$mv hadoop-1.2.1 hadoop
Do all of these steps on the secondary master and all slave nodes as well.
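Later steps refer to $HADOOP_CONF, which the original never defines. A minimal sketch of the environment variables you could add to ~/.bashrc on every node, assuming Hadoop was extracted to /home/ubuntu/hadoop as above and Java 7 was installed via the Oracle installer (the JAVA_HOME path is an assumption):

export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export HADOOP_HOME=/home/ubuntu/hadoop
export HADOOP_CONF=$HADOOP_HOME/conf
export PATH=$PATH:$HADOOP_HOME/bin

Reload the file with $source ~/.bashrc so the variables are available in the current shell.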
4. Set Up Hadoop
Before we set up Hadoop, we need to check whether our master node can access all the nodes we created before. To check, we will connect to the other nodes using SSH. First we have to add the key that we uploaded to the master node from our computer by running this command:
$ssh-add hadoop.pem
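If ssh-add complains, the usual causes are that no agent is running or the key permissions are too open. A possible fix using standard OpenSSH commands (not part of the original guide):

$eval $(ssh-agent)
$chmod 400 hadoop.pem
$ssh-add hadoop.pem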

Now we check whether our SSH setup works by running:
$ssh ubuntu@ec2-54-191-198-251.us-west-2.compute.amazonaws.com

Now check the other nodes in the same way.


Next we will set up the Hadoop configuration. We do these steps on the master node and then copy all the configuration files to the other nodes.

hadoop-env.sh

The first step is to edit hadoop-env.sh with this command:
$vi $HADOOP_CONF/hadoop-env.sh
Add the Java parameter as shown in the picture.
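The screenshot is not reproduced here. A typical line to set, assuming the Oracle JDK 7 installed above (the path is an assumption):

export JAVA_HOME=/usr/lib/jvm/java-7-oracle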

core-site.xml

$vi $HADOOP_CONF/core-site.xml
Now add these properties (fs.default.name points at the master node, where the NameNode runs):
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ec2-54-191-198-212.us-west-2.compute.amazonaws.com:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/ubuntu/hdfstmp</value>
  </property>
</configuration>
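It helps to create the hadoop.tmp.dir directory ahead of time on every node; assuming the path above, that is simply:

$mkdir -p /home/ubuntu/hdfstmp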
hdfs-site.xml
This file contains the configuration for the HDFS daemons: the master node, the secondary master, and the slave nodes.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
mapred-site.xml
This file contains the configuration settings for the MapReduce daemons:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>ec2-54-191-198-212.us-west-2.compute.amazonaws.com:8021</value>
  </property>
</configuration>
Copy all of these configuration files from the master node to the other nodes:
$scp hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml ubuntu@ec2-54-191-198-251.us-west-2.compute.amazonaws.com:/home/ubuntu/hadoop/conf

Repeat this step for each of the other nodes (see the loop sketch below).
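Instead of repeating the scp by hand, a small loop can copy the files to every node. A sketch, assuming the public DNS names from the table in section 2:

for host in ec2-54-191-198-251.us-west-2.compute.amazonaws.com ec2-54-191-199-0.us-west-2.compute.amazonaws.com ec2-52-11-196-185.us-west-2.compute.amazonaws.com; do
  scp hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml ubuntu@$host:/home/ubuntu/hadoop/conf
done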


5. Setup Master & Slave
Modify the masters and slaves files on the master node.
Modify the masters file with the command $vi $HADOOP_CONF/masters
Then add the master node to it.

Modify the slaves file with the command $vi $HADOOP_CONF/slaves
Then add all the slave nodes.
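The original shows the file contents only in screenshots. Following the text above, a sketch of what the two files could contain on the master node, using the hostnames from the table in section 2:

masters:
ec2-54-191-198-212.us-west-2.compute.amazonaws.com

slaves:
ec2-54-191-199-0.us-west-2.compute.amazonaws.com
ec2-52-11-196-185.us-west-2.compute.amazonaws.com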

Next copy the masters and slaves files to the secondary master:
$scp masters slaves ubuntu@ec2-54-191-198-251.us-west-2.compute.amazonaws.com:/home/ubuntu/hadoop/conf

Modify the masters and slaves files on the slave nodes.

This step is almost the same as the previous one, but the masters file on a slave node does not list a master node, so we empty that file (delete all its values).

For the slaves file on each slave node, we add the slave node.

6. Startup Daemon
The first step to start our Hadoop cluster is formatting the Hadoop filesystem:
$hadoop namenode -format

Start all the Hadoop daemons from the master node with:
$cd $HADOOP_CONF
$start-all.sh
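To confirm the daemons actually started, you can run jps on each node (a standard JDK tool; this check is not in the original guide). Which daemon runs where depends on the masters and slaves files, but typically you would expect NameNode and JobTracker on the master, SecondaryNameNode on the host listed in masters, and DataNode and TaskTracker on each slave:

$jps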

Now let's check our Hadoop from the browser.

Link to check the HDFS status:
ec2-54-191-198-212.us-west-2.compute.amazonaws.com:50070/dfshealth.jsp

Link to check a task tracker:
ec2-54-191-199-0.us-west-2.compute.amazonaws.com:50060/tasktracker.jsp

To quickly verify whether our Hadoop cluster is working, submit this command:
$hadoop jar hadoop-examples-1.2.1.jar pi 10 1000000

Running the WordCount Example

1. Create WordCount.java
$vi WordCount.java
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

  // Mapper: emits (word, 1) for every whitespace-separated token in the input line.
  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

2. Compile WordCount.java with these commands:
$mkdir wordcount_classes
$javac -classpath /home/ubuntu/hadoop/hadoop-core-1.2.1.jar -d wordcount_classes WordCount.java
$jar -cvf /home/ubuntu/wordcount.jar -C wordcount_classes/ .
3. Prepare the input file.
Create an input file in .txt format:
$vi /home/ubuntu/input.txt
and add these sentences:
my name is cahya perdana. My name means in Indonesia is I am the first light of hope
4. Create a DFS folder to put the input file in.
Move to /home/ubuntu/hadoop/bin
Run $./hadoop dfs -mkdir /home/ubuntu/wordcount/input
To move input.txt to the DFS folder, run this command:
$./hadoop dfs -put /home/ubuntu/input.txt /home/ubuntu/wordcount/input
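To confirm the file landed in HDFS, you can list the directory (a quick check, not part of the original steps):
$./hadoop dfs -ls /home/ubuntu/wordcount/input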
5. Run the Java application with this command:
$./hadoop jar /home/ubuntu/wordcount.jar org.myorg.WordCount /home/ubuntu/wordcount/input /home/ubuntu/wordcount/output
If the program runs well, the output must look like this:
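The original shows the job output only in a screenshot. To inspect the result yourself you could print the output file (the part-00000 name assumes a single reducer); for the sample sentence above the counts would include pairs such as "name 2" and "is 2":
$./hadoop dfs -cat /home/ubuntu/wordcount/output/part-00000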
