
G. V. P.

COLLEGE OF ENGINEERING (A)


VISAKHAPATNAM
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

SKILL BASED LAB ELECTIVE

III YEAR-V SEMESTER

2018-2019

HADOOP LAB

Name:

Roll No:

Section:

Signature of External Examiner Signature of Internal Examiner


INDEX

S.NO  DESCRIPTION                          PAGE NO.  REMARKS

1     Introduction
2     Hadoop installation using Ubuntu
3     Hadoop installation using CloudEra
4     HDFS Commands
5     Word Count
INTRODUCTION

Hadoop is an open-source framework that allows you to store and process big data in a distributed
environment across clusters of computers using simple programming models. It is designed to scale up
from single servers to thousands of machines, each offering local computation and storage.
Due to the advent of new technologies, devices, and communication means like social
networking sites, the amount of data produced by mankind is growing rapidly every year. The amount of
data produced by us from the beginning of time till 2003 was 5 billion gigabytes. If you pile up the data
in the form of disks, it may fill an entire football field. The same amount was created every two days
in 2011, and every ten minutes in 2013. This rate is still growing enormously. Though all this
information produced is meaningful and can be useful when processed, it is being neglected.

What is Big Data?


Big data is a collection of large datasets that cannot be processed using traditional computing
techniques. Big data is not merely data; rather, it has become a complete subject, which involves
various tools, techniques and frameworks.

What Comes Under Big Data?


Big data involves the data produced by different devices and applications. Given below are some of the
fields that come under the umbrella of Big Data.
 Black Box Data : It is a component of helicopters, airplanes, jets, etc. It captures voices of
the flight crew, recordings of microphones and earphones, and the performance information of
the aircraft.
 Social Media Data : Social media such as Facebook and Twitter hold information and the views
posted by millions of people across the globe.
 Stock Exchange Data : The stock exchange data holds information about the ‘buy’ and ‘sell’
decisions made by customers on shares of different companies.
 Power Grid Data : The power grid data holds information consumed by a particular node with
respect to a base station.
 Transport Data : Transport data includes model, capacity, distance and availability of a vehicle.
 Search Engine Data : Search engines retrieve lots of data from different databases.
Thus Big Data includes huge volume, high velocity, and extensible variety of data. The data in it will
be of three types.

 Structured data : Relational data.


 Semi Structured data : XML data.
 Unstructured data : Word, PDF, Text, Media Logs.
Benefits of Big Data
Big data is really critical to our lives, and it is emerging as one of the most important technologies in the
modern world. Following are just a few benefits which are well known to all of us:
 Using the information kept in social networks like Facebook, marketing agencies are
learning about the response to their campaigns, promotions, and other advertising media.
 Using information in social media such as the preferences and product perception of their
consumers, product companies and retail organizations are planning their production.
 Using the data regarding the previous medical history of patients, hospitals are providing better
and quicker service.
Big Data Challenges
The major challenges associated with big data are as follows:
 Capturing data

 Curation

 Storage

 Searching

 Sharing

 Transfer

 Analysis

 Presentation
To meet the above challenges, organizations normally take the help of enterprise servers.
Traditional Approach
In this approach, an enterprise will have a computer to store and process big data. Here data will be
stored in an RDBMS like Oracle Database, MS SQL Server or DB2, and sophisticated software can be
written to interact with the database, process the required data and present it to the users for analysis
purpose.

Limitation
This approach works well when there is a small volume of data that can be accommodated by standard
database servers, or up to the limit of the processor which is processing the data. But when it comes to
dealing with huge amounts of data, it is really a tedious task to process such data through a traditional
database server.

Google’s Solution
Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into
small parts and assigns those parts to many computers connected over the network, and collects the
results to form the final result dataset.

Hadoop
Using the solution provided by Google, Doug Cutting and his team started an Open Source Project called
HADOOP in 2005. Hadoop runs applications using the MapReduce algorithm, where the data is processed in
parallel on different CPU nodes. In short, the Hadoop framework makes it possible to develop applications
that run on clusters of computers and perform complete statistical analysis of huge amounts of data.
A Hadoop frame-worked application works in an environment that provides distributed storage and
computation across clusters of computers. Hadoop is designed to scale up from single server to
thousands of machines, each offering local computation and storage.

Basic Hadoop framework includes following modules:


 Hadoop Distributed File System (HDFS): A distributed file system that provides high-
throughput access to application data.

 Hadoop MapReduce: This is a system for parallel processing of large data sets.

MapReduce
Hadoop MapReduce is a software framework for easily writing applications which process vast
amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable,
fault-tolerant manner.

The term MapReduce actually refers to the following two different tasks that Hadoop programs
perform:
 The Map Task: This is the first task, which takes input data and converts it into a set of data,
where individual elements are broken down into tuples (key/value pairs).
 The Reduce Task: This task takes the output from a map task as input and combines those data
tuples into a smaller set of tuples. The reduce task is always performed after the map task.
Typically both the input and the output are stored in a file-system. The framework takes care of
scheduling tasks, monitoring them and re-executing the failed tasks.
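As a small illustration (made-up input lines, similar to the Word Count exercise at the end of this manual): given the two input lines "Hadoop is an elephant" and "Hadoop is yellow", the map task emits the pairs (Hadoop,1), (is,1), (an,1), (elephant,1), (Hadoop,1), (is,1), (yellow,1), and the reduce task combines them into (Hadoop,2), (is,2), (an,1), (elephant,1), (yellow,1).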
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per
cluster-node. The master is responsible for resource management, tracking resource
consumption/availability and scheduling the jobs' component tasks on the slaves, monitoring them and
re-executing the failed tasks. The slave TaskTrackers execute the tasks as directed by the master and
provide task-status information to the master periodically.

The JobTracker is a single point of failure for the Hadoop MapReduce service which means if
JobTracker goes down, all running jobs are halted.

Hadoop Distributed File System


Hadoop can work directly with any mountable distributed file system such as Local FS, HFTP FS, S3
FS, and others, but the most common file system used by Hadoop is the Hadoop Distributed File
System (HDFS).

The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a
distributed file system that is designed to run on large clusters (thousands of computers) of small
computer machines in a reliable, fault-tolerant manner.

HDFS uses a master/slave architecture where the master consists of a single NameNode that manages the
file system metadata and one or more slave DataNodes that store the actual data.

A file in an HDFS namespace is split into several blocks and those blocks are stored in a set of
DataNodes. The NameNode determines the mapping of blocks to the DataNodes. The DataNodes take
care of read and write operations with the file system. They also take care of block creation, deletion
and replication based on the instructions given by the NameNode.
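For example, with the default HDFS block size of 128 MB and a replication factor of 3, a 300 MB file would
be split into three blocks (128 MB + 128 MB + 44 MB), and each block would be stored on three different
DataNodes.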

HDFS provides a shell like any other file system and a list of commands are available to interact with
the file system. These shell commands will be covered in a separate chapter along with appropriate
examples.

How Does Hadoop Work?


Stage 1
A user/application can submit a job to Hadoop (via a Hadoop job client) for the required processing by
specifying the following items:
1. The location of the input and output files in the distributed file system.
2. The Java classes in the form of a JAR file containing the implementation of the map and reduce
functions.
3. The job configuration by setting different parameters specific to the job.
Stage 2

The Hadoop job client then submits the job (jar/executable etc) and configuration to the JobTracker
which then assumes the responsibility of distributing the software/configuration to the slaves,
scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.
Stage 3
The TaskTrackers on different nodes execute the tasks as per the MapReduce implementation, and the output
of the reduce function is stored in output files on the file system.

Advantages of Hadoop
 Hadoop framework allows the user to quickly write and test distributed systems. It is efficient,
and it automatically distributes the data and work across the machines and, in turn, utilizes the
underlying parallelism of the CPU cores.
 Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA),
rather Hadoop library itself has been designed to detect and handle failures at the application
layer.
 Servers can be added or removed from the cluster dynamically and Hadoop continues to operate
without interruption.
 Another big advantage of Hadoop is that apart from being open source, it is compatible with all
platforms since it is Java based.
HADOOP INSTALLATION - UBUNTU
Download Hadoop

$ wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-3.0.0/hadoop-3.0.0.tar.gz

Unzip it

$ tar xvzf hadoop-3.0.0.tar.gz

Hadoop Configuration

Make a directory called hadoop and move the folder ‘hadoop-3.0.0’ to this directory

$ sudo mkdir -p /usr/local/hadoop

$ cd hadoop-3.0.0/

$ sudo mv * /usr/local/hadoop
$ sudo chown -R hduser:hadoop /usr/local/hadoop

Setting up Configuration files


We will change the content of the following files in order to complete the Hadoop installation.

1. ~/.bashrc

2. hadoop-env.sh

3. core-site.xml

4. hdfs-site.xml

5. yarn-site.xml

~/.bashrc

If you don’t know the path where java is installed, first run the following command to locate it

$ update-alternatives --config java

Now open the ~/.bashrc file

$ sudo nano ~/.bashrc

Note: I have used the ‘nano’ editor; you can use a different one. No issues.

Now once the file is opened, append the following code at the end of file,

#HADOOP VARIABLES START


export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
#HADOOP VARIABLES END

Press CTRL+O to save and CTRL+X to exit from that window.

Update .bashrc file to apply changes

$ source ~/.bashrc
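As an optional sanity check, you can confirm that the variables are now set:

$ echo $HADOOP_HOME
/usr/local/hadoop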

hadoop-env.sh

We need to tell Hadoop the path where Java is installed. That's what we will do in this file: specify the path for
the JAVA_HOME variable.

Open the file,

$ sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh

Now, the first variable in the file will be the JAVA_HOME variable; change its value to

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
core-site.xml

Create temporary directory

$ sudo mkdir -p /app/hadoop/tmp

$ sudo chown hduser:hadoop /app/hadoop/tmp

Open the file,

$ sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml

Append the following between the configuration tags, as shown below.

<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose scheme and authority determine the
FileSystem implementation. The uri’s scheme determines the config property (fs.SCHEME.impl) naming the
FileSystem implementation class. The uri’s authority is used to determine the host, port, etc. for a
filesystem.</description>
</property>
</configuration>
hdfs-site.xml

Mainly there are two directories,

1. Name Node

2. Data Node

Make directories

$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode

$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode

$ sudo chown -R hduser:hadoop /usr/local/hadoop_store

Open the file,

$ sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Change the content between the configuration tags as shown below.

<configuration>

<property>

<name>dfs.replication</name>

<value>1</value>
<description>Default block replication. The actual number of replications can be specified when the file is
created. The default is used if replication is not specified in create time.

</description>

</property>

<property>

<name>dfs.namenode.name.dir</name>

<value>file:/usr/local/hadoop_store/hdfs/namenode</value>

</property>

<property>

<name>dfs.datanode.data.dir</name>

<value>file:/usr/local/hadoop_store/hdfs/datanode</value>

</property>

</configuration>

yarn-site.xml

Open the file,

$ sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml

Just like the other two, add the content between the configuration tags.

<configuration>

<property>

<name>yarn.nodemanager.aux-services</name>

<value>mapreduce_shuffle</value>

</property>

</configuration>

Format Hadoop file system


Hadoop installation is now done. All we have to do is format the NameNode before using it.

$ hadoop namenode -format
Start Hadoop daemons

Now that the Hadoop installation is complete and the NameNode is formatted, we can start Hadoop by going to
the following directory.

$ cd /usr/local/hadoop/sbin

$ start-all.sh

Just check if all daemons are properly started using the following command:

$ jps
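If all daemons started correctly, jps should show a listing similar to the following (the process IDs will differ):

12345 NameNode
12456 DataNode
12567 SecondaryNameNode
12678 ResourceManager
12789 NodeManager
12890 Jps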
Stop Hadoop daemons

$ stop-all.sh

Appreciate yourself because you’ve done it. You have completed all the Hadoop installation steps and Hadoop is
now ready to run the first program.

Let’s run a MapReduce job on our entirely fresh Hadoop cluster setup.

Go to the following directory

$ cd /usr/local/hadoop

Run the following command

$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0.jar pi 2 5
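Here pi is the name of the example program, the first argument (2) is the number of map tasks to launch, and
the second argument (5) is the number of samples each map computes; the job then prints an estimate of the
value of pi.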

Hooray! It’s done.

HADOOP INSTALLATION – CLOUDERA


Installation for Cloudera VM CDH 5.10 for Oracle VirtualBox

Install Oracle VirtualBox

Go to Google and search for “download oracle virtualbox”.

For a Windows system, select VirtualBox 5.1.26 for Windows hosts.

Select an appropriate download folder and download the exe file.


Follow the installation wizard and install Oracle VirtualBox.
Installing Cloudera VM for VirtualBox.

First, go to cloudera.com in your web browser.

On the top bar, navigate to the Downloads link.


Go to the QuickStart VM links and download a virtual machine image for VirtualBox.

Select the QuickStart VM with CDH 5.10 and download the VirtualBox install.

Please sign in with your Cloudera credentials and then download the VM image. The download is around 5 GB.
Unzip the folder to a suitable path and have the .ovf file available.

Open VirtualBox. Ignore any pre-installed VMs shown in the screenshot of my VirtualBox; your installation
will be blank.

Hit File -> Import Appliance

Navigate to the unzipped folder of your CDH 5.10 .ovf image


Select the image and hit Import. It will open the Cloudera Quickstart VM. The loading may take some
time.

Hit Start to boot your Hadoop virtual machine. It will take a few minutes to load
CentOS 6.7.
HDFS COMMANDS

 ls
Lists the contents of a directory.
Example:
[cloudera@quickstart ~]$ hadoop fs -ls
Found 5 items
-r---wx--x 1 cloudera cloudera 66 2018-08-23 06:54 abc.txt
drwxr-xr-x - cloudera cloudera 0 2018-08-28 02:57 cc
drwxr-xr-x - cloudera cloudera 0 2018-08-23 08:49 dh
drwxr-xr-x - cloudera cloudera 0 2018-08-27 03:19 out
-rw-r--r-- 1 cloudera cloudera 268141 2018-08-28 06:45 poems

 appendToFile
Appends the contents of a local source (or of stdin, when the source is given as -) to the given file. The file will be created if it does not exist.

Example:

[cloudera@quickstart ~]$ hadoop fs -cat example.txt


This is hdfs file system

[cloudera@quickstart ~]$ hadoop fs -appendToFile - example.txt


Welcome...!

[cloudera@quickstart ~]$ hadoop fs -cat example.txt


This is hdfs file system
Welcome...!

 cat

Displays the contents of the file.


Example:

[cloudera@quickstart ~]$ hadoop fs -cat example.txt


This is hdfs file system
 touchz command

Creates a new, zero-length file at the given path.

Example:

[cloudera@quickstart ~]$ hadoop fs -touchz samplefile.txt

[cloudera@quickstart ~]$ hadoop fs -ls

Found 5 items
-r---wx--x 1 cloudera cloudera 66 2018-08-23 06:54 abc.txt
drwxr-xr-x - cloudera cloudera 0 2018-08-28 02:57 cc
-rw-r--r-- 1 cloudera cloudera 37 2018-08-30 04:13 example.txt
-rw-r--r-- 1 cloudera cloudera 25 2018-08-30 03:51 examplehdfs
-rw-r--r-- 1 cloudera cloudera 0 2018-08-30 04:25 samplefile.txt

 mkdir
It creates a new directory.

Example:

[cloudera@quickstart ~]$ hadoop fs -mkdir newdir

[cloudera@quickstart ~]$ hadoop fs -ls

Found 6 items
-r---wx--x 1 cloudera cloudera 66 2018-08-23 06:54 abc.txt
drwxr-xr-x - cloudera cloudera 0 2018-08-28 02:57 cc
-rw-r--r-- 1 cloudera cloudera 37 2018-08-30 04:13 example.txt
-rw-r--r-- 1 cloudera cloudera 25 2018-08-30 03:51 examplehdfs
drwxr-xr-x - cloudera cloudera 0 2018-08-30 04:45 newdir
-rw-r--r-- 1 cloudera cloudera 0 2018-08-30 04:25 newfile.txt
 cp command

Used for copying files from one directory to another within HDFS.
Example:

[cloudera@quickstart ~]$ hadoop fs -cp olddir/file1 newdir

[cloudera@quickstart ~]$ hadoop fs -ls newdir

Found 1 items
-rw-r--r-- 1 cloudera cloudera 9 2018-08-03 02:48 newdir/file1

 du
Displays sizes of files and directories contained in the given directory, or the size of a file if
it's just a file.

Example:
[cloudera@quickstart ~]$ hadoop fs -du exmpldir

0 0 exmpldir/file1
38 38 exmpldir/file2

 put

Copies a single src file, or multiple src files, from the local file system to the Hadoop file system.
Example:
[cloudera@quickstart ~]$ cat>exmple

[cloudera@quickstart ~]$ hadoop fs -put exmple outputfile

[cloudera@quickstart ~]$ hadoop fs -ls

Found 5 items
-rw-r--r-- 1 cloudera cloudera 37 2018-08-30 04:13 example.txt
-rw-r--r-- 1 cloudera cloudera 30 2018-08-30 05:01 outputfile
drwxr-xr-x - cloudera cloudera 0 2018-08-30 04:45 newdir
-rw-r--r-- 1 cloudera cloudera 0 2018-08-30 04:25 newfile.txt
 copyFromLocal
copyFromLocal is the same as the put command, except that the source must be a local file.

Example:
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal poems copyfile
[cloudera@quickstart ~]$ hadoop fs -ls
Found 5 items
-rw-r--r-- 1 cloudera cloudera 52338 2018-08-31 02:08 GSOD
-r---wx--x 1 cloudera cloudera 66 2018-08-23 06:54 abc.txt
drwxr-xr-x - cloudera cloudera 0 2018-08-28
-rw-r--r-- 1 cloudera cloudera 1092 2018-08-23 02:45 cm_api.sh
-rw-r--r-- 1 cloudera cloudera 268141 2018-08-31 08:21 copyfile

 get command

Copies a single src file, or multiple src files, from the Hadoop file system to the local file
system.
Example:
[cloudera@quickstart ~]$ hadoop fs -ls
Found 6 items
-rw-r--r-- 1 cloudera cloudera 52338 2018-08-31 02:08 GSOD
-rw-r--r-- 1 cloudera cloudera 268141 2018-08-31 08:21 copyfile
-rwx--x-wx 1 cloudera cloudera 37 2018-08-30 04:13 example.txt
-rw-r--r-- 1 cloudera cloudera 30 2018-08-30 05:01 file1
drwxr-xr-x - cloudera cloudera 0 2018-08-30 04:45 newdir

[cloudera@quickstart ~]$ hadoop fs -get example.txt localcopy


[cloudera@quickstart ~]$ ls
count.jar example.txt hello poems~ Videos fileread
a datasets GSOD lib prgrm wcount.jar ab.txt
Documents gt Music Public workspace cloudera-manager
Downloads count Pictures Templates wpart.jar cm_api.sh eclipse
hadop poems tempr.jar
 copyToLocal
copyToLocal is the same as the get command.

Example:
[cloudera@quickstart ~]$ hadoop fs -ls
Found 6 items
-rw-r--r-- 1 cloudera cloudera 52338 2018-08-31 02:08 GSOD
-rw-r--r-- 1 cloudera cloudera 268141 2018-08-31 08:21 copyfile
-rwx--x-wx 1 cloudera cloudera 37 2018-08-30 04:13 example.txt
-rw-r--r-- 1 cloudera cloudera 30 2018-08-30 05:01 inp
drwxr-xr-x - cloudera cloudera 0 2018-08-30 04:45 newdir

[cloudera@quickstart ~]$ hadoop fs -copyToLocal inp outfile


[cloudera@quickstart ~]$ ls
count.jar example.txt poems~ tolocalfile
a datasets GSOD lib prgrm Videos
ab.txt Documents gt Music Public wordcount.jar
cloudera-manager Downloads outfile Pictures Templates workspace

 getfacl
Displays the Access Control Lists (ACLs) of files and directories. If a directory has a
default ACL, then getfacl also displays the default ACL.

Example:
[cloudera@quickstart ~]$ hadoop fs -getfacl exampl.txt
# file: exampl.txt
# owner: cloudera
# group: cloudera
user::rwx
group::r-x
other::r-x

 moveFromLocal
Same as -put, except that the source is deleted after it's copied.
Example:

[cloudera@quickstart ~]$ hadoop fs -moveFromLocal file movfile


[cloudera@quickstart ~]$ hadoop fs -ls
Found 5 items
-r---wx--x 1 cloudera cloudera 66 2018-08-23 06:54 abc.txt
-rw-r--r-- 1 cloudera cloudera 30 2018-08-30 05:01 file1
drwxr-xr-x - cloudera cloudera 0 2018-08-31 08:47 mergefile
-rw-r--r-- 1 cloudera cloudera 14 2018-08-31 08:51 movfile
drwxr-xr-x - cloudera cloudera 0 2018-08-30 04:45 newdir

 mv command
Move files that match the specified file pattern to a destination. When moving
multiple files, the destination must be a directory.

Example:
[cloudera@quickstart ~]$ hadoop fs -mv abc.txt newdir
[cloudera@quickstart ~]$ hadoop fs -ls newdir
Found 1 items
-r---wx--x 1 cloudera cloudera 66 2018-08-23 06:54 newdir/abc.txt

 rm command
Used for removing a file from HDFS. The command -rm -r can be used for a
recursive delete.
The rmdir command can be used to delete empty directories.
Options:
-skipTrash option bypasses trash, if enabled, and immediately deletes <src>

-f If the file does not exist, do not display a diagnostic message or modify the exit
status to reflect an error.
-[rR] Recursively deletes directories

Example:
[cloudera@quickstart ~]$ hadoop fs -rm sample.txt

 test command
This command can be used to test whether an HDFS path exists, is of zero length, or is a
directory.
Options:
-d return 0 if <path> is a directory.
-e return 0 if <path> exists.
-f return 0 if <path> is a file.
-s return 0 if file <path> is greater than zero bytes in size.
-z return 0 if file <path> is zero bytes in size, else return 1.

[cloudera@quickstart ~]$ hadoop fs -test -d newdir
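The test command reports its result only through the exit status; a quick way to see it (0 means the test
succeeded) is:

[cloudera@quickstart ~]$ echo $?
0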

 expunge
This command is used to empty the trash in the Hadoop file system.

[cloudera@quickstart ~]$ hadoop fs -expunge

 count
Count the number of directories, files and bytes under the paths that match the
specified file pattern.
The output columns are,
DIR_COUNT FILE_COUNT CONTENT_SIZE FILE_NAME
Example:
[cloudera@quickstart ~]$ hadoop fs -count newdir
1 2 75 newdir
 chmod
Changes permissions of a file.
Example:
[cloudera@quickstart ~]$ hadoop fs -getfacl example.txt
# file: example.txt
# owner: cloudera
# group: cloudera

user::r--
group::--x
other::-w-

[cloudera@quickstart ~]$ hadoop fs -chmod 777 example.txt

[cloudera@quickstart ~]$ hadoop fs -getfacl example.txt


# file: example.txt
# owner: cloudera
# group: cloudera
user::rwx
group::rwx
other::rwx

 chown
Changes owner and group of a file.
Syntax:
$ hadoop fs -chown [-R] [OWNER][:[GROUP]] PATH
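Example (changing ownership in HDFS normally requires superuser privileges, so switch to the hdfs user
first, just as the Word Count exercise below does):
$ sudo su hdfs
$ hadoop fs -chown -R cloudera:cloudera /user/cloudera/newdir
$ exit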
WordCount

Write a Hadoop MapReduce program to calculate the individual word count of a file.
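The WordCount.java source that step 3 compiles is not reproduced in this handout; the following is a minimal
sketch, adapted from the classic Hadoop WordCount example and written against the org.apache.hadoop.mapreduce
API. The package and class name are chosen to match the org.myorg.WordCount referenced in step 5; the inner
class names (TokenizerMapper, IntSumReducer) are illustrative. Save the code as WordCount.java in your working
directory before step 3.

// WordCount.java - counts the occurrences of each word in the input files
package org.myorg;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: for every word in a line, emit the pair (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce task: sum all the counts emitted for the same word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: args[0] is the HDFS input directory, args[1] the output directory.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}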

1. Before you run the sample, you must create input and output locations in HDFS. Use the
following commands to create the input directory /user/cloudera/wordcount/input in HDFS:
$ sudo su hdfs
$ hadoop fs -mkdir /user/cloudera
$ hadoop fs -chown cloudera /user/cloudera
$ exit
$ sudo su cloudera
$ hadoop fs -mkdir /user/cloudera/wordcount /user/cloudera/wordcount/input

2. Create sample text files to use as input, and move them to


the /user/cloudera/wordcount/input directory in HDFS. You can use any files you choose; for
convenience, the following shell commands create a few small input files for illustrative
purposes. The Makefile also contains most of the commands that follow.
$ echo "Hadoop is an elephant" > file0
$ echo "Hadoop is as yellow as can be" > file1
$ echo "Oh what a yellow fellow is Hadoop" > file2
$ hadoop fs -put file* /user/cloudera/wordcount/input

3. Compile the WordCount class.


To compile in a package installation of CDH:
$ mkdir -p build
$ javac -cp /usr/lib/hadoop/*:/usr/lib/hadoop-mapreduce/* WordCount.java -d build -Xlint
To compile in a parcel installation of CDH:
$ mkdir -p build
$ javac -cp /opt/cloudera/parcels/CDH/lib/hadoop/*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/* \
WordCount.java -d build -Xlint

4. Create a JAR file for the WordCount application.


$ jar -cvf wordcount.jar -C build/ .

5. Run the WordCount application from the JAR file, passing the paths to the input and
output directories in HDFS.
$ hadoop jar wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input
/user/cloudera/wordcount/output
When you look at the output, all of the words are listed in UTF-8 alphabetical order
(capitalized words first). The number of occurrences from all input files has been reduced to a
single sum for each word.

$ hadoop fs -cat /user/cloudera/wordcount/output/*


Hadoop 3
Oh 1
a 1
an 1
as 2
be 1
can 1
elephant 1
fellow 1
is 3
what 1
yellow 2

6. If you want to run the sample again, you first need to remove the output directory. Use
the following command.
$ hadoop fs -rm -r /user/cloudera/wordcount/output
