GAGANPAL SINGH
D3IT-A1
1311374
BIG DATA
Big data is a broad term for work with data sets so large or
complex that traditional data processing applications are
inadequate and distributed databases are needed. Challenges
include sensor design, capture, data curation, search, sharing,
storage, transfer, analysis, fusion, visualization, and information
privacy.
The term often refers simply to the use of predictive analytics or
other advanced methods to extract value from data.
Characteristics
Variety: big data draws from text, images, audio, and video; it
completes missing pieces through data fusion.
Machine learning: big data analysis often doesn't ask why, but simply
detects patterns.
Applications
Big data has increased the demand for information management specialists.
Government
India
Big data analysis was, in part, responsible for the BJP and
its allies winning the Indian General Election 2014.
United Kingdom
Data on prescription drugs: by connecting the origin, location,
and time of each prescription, a research unit was able
to demonstrate the considerable delay between the release of
any given drug and a UK-wide adaptation of the National
Institute for Health and Care Excellence guidelines. This
suggests that new or the most up-to-date drugs take some time to
filter through to the general patient.
HADOOP
Apache Hadoop is an open-source software framework written
in Java for distributed storage and distributed processing of
very large data sets on computer clusters built from commodity
hardware. All the modules in Hadoop are designed with a
fundamental assumption that hardware failures (of individual
machines, or racks of machines) are commonplace and thus should
be automatically handled in software by the framework.
The core of Apache Hadoop consists of a storage part (Hadoop
Distributed File System (HDFS)) and a processing part
(MapReduce). Hadoop splits files into large blocks and distributes
them amongst the nodes in the cluster. To process the data,
Hadoop MapReduce transfers packaged code to nodes, which then process
it in parallel, based on the data each node needs to process. This
approach takes advantage of data locality: nodes manipulate the data
they have on hand, which allows the data to be processed faster and
more efficiently than in a more conventional supercomputer
architecture that relies on a parallel file system where computation
and data are connected via high-speed networking.
Hadoop Distributed File System (HDFS): a distributed file system that
stores data on commodity machines, providing very high aggregate
bandwidth across the cluster.
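As a quick illustration of how clients interact with HDFS, the sketch below writes a small file through Hadoop's Java FileSystem API. It is a minimal example, assuming the Hadoop client libraries are on the classpath and HDFS is running with the fs.default.name (hdfs://localhost:54310) configured later in this document; the class name and file path are illustrative only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

//Minimal sketch: write a file to HDFS through the FileSystem API.
public class HdfsWriteExample
{
  public static void main(String[] args) throws Exception
  {
    Configuration conf = new Configuration();
    //point the client at the single-node HDFS configured in core-site.xml
    conf.set("fs.default.name", "hdfs://localhost:54310");
    FileSystem fs = FileSystem.get(conf);
    //HDFS transparently splits the file into blocks and replicates
    //each block according to dfs.replication
    FSDataOutputStream out = fs.create(new Path("/user/hduser/hello.txt"));
    out.writeUTF("Hello HDFS");
    out.close();
    fs.close();
  }
}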
The term "Hadoop" has come to refer not just to the base
modules above, but also to the "ecosystem", or collection of
additional software packages that can be installed on top of or
alongside Hadoop, such as Apache Pig, Apache Hive, Apache
HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper,
Impala, Apache Flume, Apache Sqoop, Apache Oozie, Apache
Storm and others.
Apache Hadoop's MapReduce and HDFS components were
inspired by Google papers on their MapReduce and Google File
System.
The Hadoop framework itself is mostly written in the Java
programming language, with some native code in C and command
line utilities written as shell scripts. For end users, though
MapReduce Java code is common, any programming language can
be used with "Hadoop Streaming" to implement the "map" and
"reduce" parts of the user's program. Other related projects in the
ecosystem expose richer user interfaces.
PROJECT DETAILS
REQUIREMENTS:
VMware WORKSTATION
OPERATING SYSTEM (UBUNTU)
FRAMEWORK (HADOOP)
PROGRAMMING LANGUAGE (JAVA)
STEP 1:-
Installing Java
k@laptop:~$ cd ~
# Update the source list
k@laptop:~$ sudo apt-get update
# The OpenJDK project is the default version of Java
# that is provided from a supported Ubuntu repository.
k@laptop:~$ sudo apt-get install default-jdk
k@laptop:~$ java -version
java version "1.8.0_52"
STEP 2:-
Installing SSH
k@laptop:~$ sudo apt-get install ssh
This will install ssh on our machine. If we get something similar to the
following, we can think it is set up properly:
k@laptop:~$ which ssh
/usr/bin/ssh
k@laptop:~$ which sshd
/usr/sbin/sshd
Hadoop uses SSH (to access its nodes) which would normally require the
user to enter a password. However, this requirement can be eliminated by
creating and setting up SSH certificates using the following commands. If
asked for a filename just leave it blank and press the enter key to continue.
k@laptop:~$ su hduser
Password:
k@laptop:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint and randomart image are then printed.
hduser@laptop:/home/k$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
The second command adds the newly created key to the list of authorized
keys so that Hadoop can use ssh without prompting for a password.
We can check if ssh works:
hduser@laptop:/home/k$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is e1:8b:a0:a5:75:ef:f4:b4:5e:a9:ed:be:64:be:5c:2f.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-40-generic ...
...
Before installing Hadoop, we also make sure that IPv6 is disabled,
since Hadoop does not work well on IPv6 networks:
k@laptop:~$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
It should return the value 1, meaning IPv6 is disabled.
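For reference, IPv6 can be disabled by adding the following lines to the end of /etc/sysctl.conf and rebooting; these are the standard Ubuntu sysctl settings that the check above verifies:

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1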
1. ~/.bashrc:
Append the following Hadoop variables to the end of ~/.bashrc
(this assumes Hadoop was extracted to /usr/local/hadoop):
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
hduser@laptop:~$ source ~/.bashrc
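If the variables were picked up correctly, the hadoop command is now on the PATH; the version should match the 2.6.0 release used throughout this document (output abridged):

hduser@laptop:~$ hadoop version
Hadoop 2.6.0
...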
Note that JAVA_HOME should be set to the path just before
the '.../bin/':
hduser@ubuntu-VirtualBox:~$ javac -version
javac 1.7.0_75
hduser@ubuntu-VirtualBox:~$ which javac
/usr/bin/javac
hduser@ubuntu-VirtualBox:~$ readlink -f /usr/bin/javac
/usr/lib/jvm/java-7-openjdk-amd64/bin/javac
2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh
We need to set JAVA_HOME by modifying hadoop-env.sh file.
hduser@laptop:~$ vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
3. /usr/local/hadoop/etc/hadoop/core-site.xml:
The /usr/local/hadoop/etc/hadoop/core-site.xml file contains
configuration properties that Hadoop uses when starting up.
This file can be used to override the default settings that
Hadoop starts with.
hduser@laptop:~$ sudo mkdir -p /app/hadoop/tmp
hduser@laptop:~$ sudo chown hduser:hadoop /app/hadoop/tmp
Open the file and enter the following in between the
<configuration></configuration> tags:
hduser@laptop:~$ vi /usr/local/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's authority is used to determine the host, port, etc. for a
filesystem.</description>
</property>
</configuration>
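With fs.default.name set this way, scheme-less paths given to the hadoop fs commands resolve against hdfs://localhost:54310. For example, once HDFS has been formatted and started (see the later steps), a home directory for hduser can be created with:

hduser@laptop:~$ hadoop fs -mkdir -p /user/hduser
hduser@laptop:~$ hadoop fs -ls /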
4. /usr/local/hadoop/etc/hadoop/mapred-site.xml
By default, the /usr/local/hadoop/etc/hadoop/ folder contains the
/usr/local/hadoop/etc/hadoop/mapred-site.xml.template
file, which has to be renamed/copied to mapred-site.xml:
hduser@laptop:~$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
Enter the following content in between the
<configuration></configuration> tags:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml
The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to
be configured for each host in the cluster that is being used.
It is used to specify the directories which will be used as
the namenode and the datanode on that host.
Before editing this file, we need to create two directories which
will contain the namenode and the datanode for this Hadoop
installation.
This can be done using the following commands:
hduser@laptop:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
hduser@laptop:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
hduser@laptop:~$ sudo chown -R hduser:hadoop /usr/local/hadoop_store
Open the file and enter the following content in between the
<configuration></configuration> tag:
hduser@laptop:~$ vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the
file is created.
The default is used if replication is not specified in
create time.
</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
</configuration>
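As a quick sanity check, the effective replication factor can be queried once the configuration files are in place (hdfs getconf ships with the stock distribution):

hduser@laptop:~$ hdfs getconf -confKey dfs.replication
1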
Formatting the Hadoop Filesystem
Before we start using Hadoop for the first time, the new filesystem
has to be formatted:
hduser@laptop:~$ hadoop namenode -format
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = laptop/192.168.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.6.0
STARTUP_MSG:   classpath = /usr/local/hadoop/etc/hadoop
...
STARTUP_MSG:   java = 1.7.0_65
************************************************************/
15/04/18 14:43:03 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
15/04/18 14:43:03 INFO namenode.NameNode: createNameNode [-format]
...
Formatting using clusterid: CID-e2f515ac-33da-45bc-8466-...
15/04/18 14:43:10 INFO namenode.FSNamesystem: fsLock is fair:true
15/04/18 14:43:10 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
15/04/18 14:43:10 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
15/04/18 14:43:10 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
15/04/18 14:43:10 INFO blockmanagement.BlockManager: The block deletion will start around 2015 Apr 18 14:43:10
15/04/18 14:43:10 INFO util.GSet: Computing capacity for map BlocksMap
15/04/18 14:43:10 INFO util.GSet: VM type = 64-bit
15/04/18 14:43:10 INFO util.GSet: capacity = 2^21 = 2097152 entries
15/04/18 14:43:10 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
15/04/18 14:43:10 INFO blockmanagement.BlockManager: defaultReplication = 1
15/04/18 14:43:10 INFO blockmanagement.BlockManager: maxReplication = 512
15/04/18 14:43:10 INFO blockmanagement.BlockManager: minReplication = 1
15/04/18 14:43:10 INFO blockmanagement.BlockManager: maxReplicationStreams = 2
15/04/18 14:43:10 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks = false
15/04/18 14:43:10 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
15/04/18 14:43:10 INFO blockmanagement.BlockManager: encryptDataTransfer = false
15/04/18 14:43:10 INFO blockmanagement.BlockManager: maxNumBlocksToLog = 1000
15/04/18 14:43:10 INFO namenode.FSNamesystem: fsOwner = hduser (auth:SIMPLE)
15/04/18 14:43:10 INFO namenode.FSNamesystem: supergroup = supergroup
15/04/18 14:43:10 INFO namenode.FSNamesystem: isPermissionEnabled = true
15/04/18 14:43:10 INFO namenode.FSNamesystem: HA Enabled: false
...
15/04/18 14:43:11 INFO util.GSet: Computing capacity for map INodeMap
15/04/18 14:43:11 INFO util.GSet: VM type = 64-bit
15/04/18 14:43:11 INFO util.GSet: capacity = 2^20 = 1048576 entries
...
15/04/18 14:43:11 INFO util.GSet: Computing capacity for map cachedBlocks
15/04/18 14:43:11 INFO util.GSet: VM type = 64-bit
15/04/18 14:43:11 INFO util.GSet: capacity = 2^18 = 262144 entries
15/04/18 14:43:11 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
15/04/18 14:43:11 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
15/04/18 14:43:11 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension = 30000
...
15/04/18 14:43:11 INFO util.GSet: Computing capacity for map NameNodeRetryCache
15/04/18 14:43:11 INFO util.GSet: VM type = 64-bit
15/04/18 14:43:11 INFO util.GSet: 0.029999999329447746% max memory ...
15/04/18 14:43:11 INFO util.GSet: capacity = 2^15 = 32768 entries
...
15/04/18 14:43:12 INFO namenode.FSImage: Allocated new BlockPoolId: BP-...
15/04/18 14:43:12 INFO common.Storage: Storage directory /usr/local/hadoop_store/hdfs/namenode has been successfully formatted.
15/04/18 14:43:12 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
15/04/18 14:43:12 INFO util.ExitUtil: Exiting with status 0
15/04/18 14:43:12 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at laptop/192.168.1.1
************************************************************/
Starting Hadoop
Now it's time to start the newly installed single node cluster.
We can use start-all.sh or (start-dfs.sh and start-yarn.sh)
k@laptop:~$ cd /usr/local/hadoop/sbin
k@laptop:/usr/local/hadoop/sbin$ ls
distribute-exclude.sh    mr-jobhistory-daemon.sh  start-dfs.sh         stop-dfs.cmd
hadoop-daemon.sh         refresh-namenodes.sh     start-secure-dns.sh  stop-dfs.sh
hadoop-daemons.sh        slaves.sh                start-yarn.cmd       stop-secure-dns.sh
hdfs-config.cmd          start-all.cmd            start-yarn.sh        stop-yarn.cmd
hdfs-config.sh           start-all.sh             stop-all.cmd         stop-yarn.sh
httpfs.sh                start-balancer.sh        stop-all.sh          yarn-daemon.sh
kms.sh                   start-dfs.cmd            stop-balancer.sh     yarn-daemons.sh
hduser@laptop:/usr/local/hadoop/sbin$ start-all.sh
...
localhost: starting namenode, logging to /usr/local/hadoop/logs/hadoop-hduser-namenode-laptop.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hduser-datanode-laptop.out
...
We can check whether it is really up and running with jps:
hduser@laptop:/usr/local/hadoop/sbin$ jps
9026 NodeManager
7348 NameNode
9766 Jps
8887 ResourceManager
7507 DataNode
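Another way to check that the daemons are listening is netstat; assuming the default ports, the NameNode status can also be inspected in a browser at http://localhost:50070:

hduser@laptop:/usr/local/hadoop/sbin$ netstat -plten | grep java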
Stopping Hadoop
$ pwd
/usr/local/hadoop/sbin
$ ls
distribute-exclude.sh    mr-jobhistory-daemon.sh  start-dfs.sh         stop-dfs.cmd
hadoop-daemon.sh         refresh-namenodes.sh     start-secure-dns.sh  stop-dfs.sh
hadoop-daemons.sh        slaves.sh                start-yarn.cmd       stop-secure-dns.sh
hdfs-config.cmd          start-all.cmd            start-yarn.sh        stop-yarn.cmd
hdfs-config.sh           start-all.sh             stop-all.cmd         stop-yarn.sh
httpfs.sh                start-balancer.sh        stop-all.sh          yarn-daemon.sh
kms.sh                   start-dfs.cmd            stop-balancer.sh     yarn-daemons.sh
hduser@laptop:/usr/local/hadoop/sbin$ stop-all.sh
This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh
15/04/18 15:46:31 WARN util.NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java
classes where applicable
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: no secondarynamenode to stop
15/04/18 15:46:59 WARN util.NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java
classes where applicable
stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
no proxyserver to stop
WORDCOUNT PROGRAM
Step 1:
Open Eclipse IDE and create a new project with 3 class files
WordCount.java, WordCountMapper.java and WordCountReducer.java
Step 2:
Open WordCount.java and paste the following code.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount extends Configured implements Tool
{
  public int run(String[] args) throws Exception
  {
    //creating a JobConf object and assigning a job name for identification purposes
    JobConf conf = new JobConf(getConf(), WordCount.class);
    conf.setJobName("WordCount");

    //Setting configuration object with the Data Type of output Key and Value
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
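A minimal completion of the driver in the same style, wiring in the mapper and reducer classes from Steps 3 and 4 and taking the HDFS input and output directories from the command line:

    //Providing the mapper and reducer class names
    conf.setMapperClass(WordCountMapper.class);
    conf.setReducerClass(WordCountReducer.class);

    //The HDFS input and output directories are fetched from the command line
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception
  {
    int res = ToolRunner.run(new Configuration(), new WordCount(), args);
    System.exit(res);
  }
}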
Step 3:
Open WordCountMapper.java and paste the following code.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
  //hadoop supported data types
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  //map method that performs the tokenizer job and frames the initial key value pairs
  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
  {
    //taking one line at a time and tokenizing the same
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);

    //iterating through all the words available in that line and forming the key value pair
    while (tokenizer.hasMoreTokens())
    {
      word.set(tokenizer.nextToken());
      //sending to output collector which in turn passes the same to reducer
      output.collect(word, one);
    }
  }
}
Step 4:
Open WordCountReducer.java and paste the following code.
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
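A minimal sketch of WordCountReducer, consistent with the driver and mapper above; the per-key summation is the standard WordCount pattern:

public class WordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
{
  //reduce method accepts the key/value pairs from the mappers, adds up
  //all the counts for a word and emits the word with its total count
  public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
  {
    int sum = 0;
    //iterating through all the values available with a key
    while (values.hasNext())
    {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Once all three files compile, the job can be packaged into a jar and run against input already copied into HDFS (the jar name and paths here are illustrative):

hduser@laptop:~$ hadoop jar wordcount.jar WordCount /user/hduser/input /user/hduser/output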