
DISCLAIMER

This book is designed to provide information on Big


Data and HADOOP only. This book does not contain
all information available on the subject. This book has
not been created to be specific to any individual’s or
organization’s situation or needs. Every effort has
been made to make this book as accurate as
possible. However, there may be typographical and/or
content errors. Therefore, this book should serve
only as a general guide and not as the ultimate
source of subject information. This book contains
information that might be dated and is intended only
to educate and entertain. The management shall
have no liability or responsibility to any person or
entity regarding any loss or damage incurred, or
alleged to have been incurred, directly or indirectly, by the
information contained in this book. You hereby agree
to be bound by this disclaimer or you may return this
book within a week of receipt of this book.

Copyright© 2017-2018, DexLab Solutions Corp

All Rights Reserved.

No part of this book may be reproduced or distributed in any form or by any electronic
or mechanical means including information storage and retrieval systems, without
permission in writing from the management.
Contents
Chapters Topic Page

1. Big Data 10 - 11
1.1 Big Data 10
1.2 How Does Data Convert into Big Data? 10
1.3 Problem with Big Data 11

2. Hadoop Introduction 12 - 13
2.1 What, When, Why Hadoop? 12
2.2 Modules of Hadoop 12
2.3 Advantage of Hadoop 13

3. HDFS Services 14 - 17
3.1 Namenode 14
3.2 Secondary Namenode 15
3.3 Data Node 15
3.4 Job Tracker 16
3.5 Task Tracker 17

4. Hadoop Admin 18 - 36
4.1 Linux Basic Commands 18
4.2 Some Hadoop Basic Shell Command 24
4.3 Hadoop Installation 27
4.4 Hadoop Installation & File Configuration 28
4.5 Hadoop Modes 30
4.6 Hadoop Architecture 31
4.7 YARN – Application Startup 32
4.8 Input Splits 33
4.9 Rack Awareness 34
4.10 Hadoop Rack Awareness Vs. Hadoop Namenode 34
4.11 Why we are using Hadoop Rack Awareness 35
4.12 What is a Rack in Hadoop 36
4.13 What is Rack Awareness in Hadoop 36

5. Hadoop Namespace 37 - 39
5.1 Block 37
5.2 Metadata 38
5.3 Namespace 38
5.4 Namespace Issue 39

6. Data Replication 40 - 41
6.1 File Placement 40
6.2 Data Replication 40
6.3 Block Replication 41
6.4 Replication Factor 41

7. Communication 42 - 44
7.1 Name node – Data node 42
7.2 Data Communication 42
7.3 Heart Beat 42
7.4 Block Report 43

8. Failure Management 45 - 48
8.1 Check Point 45
8.2 FSImage 45
8.3 EditLog 46
8.4 Backup Node 46
8.5 Block Scanner 47
8.6 Failure Type 48

9. Map Reduce 49 - 58
9.1 What is Mapreduce? 49
9.2 The Algorithm 49
9.3 Inputs & Outputs (Java Perspective) 51
9.4 Terminology 51
9.5 Important Commands 52
9.6 How to interact with Mapreduce? 52
9.7 Mapreduce Program for Word Count 54
9.8 The Mapper 54
9.9 The Shuffle 55
9.10 The Reducer 55
9.11 Running the Hadoop Job 56

10. Hive 61 - 83
10.1 Hive Overview 61
10.2 Hive is Not 61
10.3 Merits on Hive 61
10.4 Architecture of Hive 61
10.5 Hive Installation 62
10.6 Hive Data Types 70
10.7 Create Database 72
10.8 Comparison of Hive with Other Databases for Retrieving Information 72
10.9 Metadata 73
10.10 Current SQL Compatibility 74
10.11 Hive DDL Commands 75
10.12 Hive DML Commands 75
10.13 Joins 76
10.14 Hive Bucket 80
10.15 Advantage with Hive Bucket 80
10.16 Creating a View 81
10.17 Dropping a View 82
10.18 Creating an Index 82


10.19 Dropping an Index 83

11. Apache HBase 84 - 116


11.1 HBase Overview 84
11.2 Regions 84
11.3 HBase Master 85
11.4 Zookeeper: The Coordinator 85
11.5 How the component work together 86
11.6 HBase first Read & Write 87
11.7 HBase Meta table 87
11.8 Region Server Components 88
11.9 HBase Memstore 89
11.10 HBase Region Flush 89
11.11 HBase HFile 90
11.12 HBase HFile Structure 90
11.13 HFile Index 91
11.14 HBase Read Merge 91
11.15 HBase Read Merge 92
11.16 HBase Minor Compaction 92
11.17 HBase Major Compaction 93
11.18 Regions = Contiguous Keys 93
11.19 Region Split 94
11.20 Read Load Balancing 94
11.21 Starting HBase Shell 99
11.22 HBase Basics 103
11.23 Stopping HBase 110
11.24 Inserting data using HBase Shell 110
11.25 Updating data using HBase Shell 111
11.26 Reading data using HBase Shell 112
11.27 Deleting all cells in a Table 113
11.28 Scanning using HBase Shell 114
11.29 HBase Security 115

12. Sqoop 117 - 132


12.1 Introduction 117
12.2 What is Sqoop 117
12.3 Why we used Sqoop 117
12.4 Where is Sqoop Used 118
12.5 Sqoop Architecture 118
12.6 Sqoop – Import 119
12.7 Sqoop – Installation 119
12.8 Sqoop – Import 124
12.9 Import-all-tables 128
12.10 Sqoop – Export 129
12.11 List – Databases 132
12.12 List – tables 132


13. Apache Pig 133 - 154


13.1 What is Apache Pig 133
13.2 Why do we need Apache Pig 133
13.3 Apache Pig vs. Mapreduce 134
13.4 Pig vs. SQL 134
13.5 Pig vs. Hive 134
13.6 Pig Architecture 135
13.7 Apache Pig Component 135
13.8 Pig Installation 137
13.9 Pig Executions 138
13.10 Pig Shell Command 140
13.11 Pig Basics 145
13.12 Pig Latin Data Types 145
13.13 Pig Reading Data 149
13.14 The Load Operator 151
13.15 The Pig Grunt Shell 151
13.16 Reading Data 152
13.17 Pig Diagnostic Operator 154

14. Apache Flume 155 - 179


14.1 What is Flume 155
14.2 Applications of Flume 155
14.3 Advantage of Flume 155
14.4 Features of Flume 156
14.5 Apache Flume Data Transfer in Hadoop 156
14.6 Apache Flume Architecture 158
14.7 Flume Event 159
14.8 Flume Agent 159
14.9 Additional Component of Flume Agent 160
14.10 Apache Flume Data Flow 160
14.11 Apache Flume – Environment 162
14.12 Apache Flume – Configuration 163
14.13 Apache Flume – Fetching Twitter Data 168
14.14 Creating Twitter Application 169

15. Apache Spark 180 - 193


15.1 Apache Spark 180
15.2 Evolution of Apache Spark 180
15.3 Features of Apache Spark 180
15.4 Spark built on Hadoop 181
15.5 Component of Spark 182
15.6 Apache Spark Core 182
15.7 Resilient Distributed Data Sets 183
15.8 Spark Installation 185

HADOOP

CHAPTER 1: Big Data
1.1 Big Data
Data that is very large in size is called Big Data. Normally we work on data of size MB (Word documents, Excel sheets) or at most GB (movies, code), but data of petabyte size, i.e. 10^15 bytes, is called Big Data. It is stated that almost 90% of today's data has been generated in the past 3 years, because we are continuously adding new data-generating sources.

OR

Data that is beyond your storage capacity and your processing power is called big data.

1.2 How Does Data Convert into Big Data?


There are many data-generating sources.

Social Networking Sites: Facebook, Google and LinkedIn all generate huge amounts of data on a day-to-day basis, as they have billions of users worldwide.

E-Commerce Sites: Sites like Amazon, Flipkart and Alibaba generate huge amounts of logs from which users' buying trends can be traced.

Weather Stations: All weather stations and satellites give very large volumes of data, which are stored and processed to forecast the weather.

Telecom Companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly, and for this they store the data of millions of users.

Share Market: Stock exchanges across the world generate huge amounts of data through their daily transactions.


1.3 Problems with Big Data


Basically, there are three types of problems that we face, known as the 3 V's of Big Data:

Velocity: The data is increasing at a very fast rate. It is estimated that the volume of data will double every 2 years.

Variety: Nowadays data is not stored only in rows and columns. Data is structured as well as unstructured. Log files and CCTV footage are unstructured data; data which can be saved in tables is structured data, like the transaction data of a bank.
Figure 1: 3 Vs of Big Data

Volume: The amount of data that we deal with is of very large size, on the order of petabytes.

Use case: An e-commerce site XYZ (having 100 million users) wants to offer a gift voucher of $100 to its top 10 customers who have spent the most in the previous year. Moreover, it wants to find the buying trends of these customers so that the company can suggest more items related to them.

Issues: A huge amount of unstructured data needs to be stored, processed and analyzed.

Solution
Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System), which uses commodity hardware to form clusters and store data in a distributed fashion. It works on the write once, read many times principle.

Processing: The MapReduce paradigm is applied to the data distributed over the network to find the required output.

Analysis: Pig and Hive can be used to analyze the data.

Cost: Hadoop is open source, so cost is no longer an issue.


CHAPTER 2: Hadoop Introduction
Apache Hadoop is an open source framework that allows you to store and process big data.

Hadoop has its own cluster (a set of machines) built from commodity hardware, where a number of machines work together in a distributed way.

2.1 What, When, Why Hadoop?


Hadoop is an open source framework from Apache and is used to store, process and analyze data that is very huge in volume. Hadoop is written in Java and is not OLAP (online analytical processing). It is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover, it can be scaled up just by adding nodes to the cluster.

2.2 Modules of Hadoop


HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on the basis of it. It states that files will be broken into blocks and stored on nodes in the distributed architecture.

YARN: Yet Another Resource Negotiator is used for job scheduling and for managing the cluster.

Map Reduce: This is a framework which helps Java programs do parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set which can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result.


2.3 Advantages of Hadoop


Fast: In HDFS the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools to process the data are often on the same servers, thus reducing the processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.

Scalable: Hadoop cluster can be extended by just adding nodes in the cluster.

Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost effective compared to a traditional relational database management system.

Resilient to failure: HDFS has the property of replicating data over the network, so if one node is down or some other network failure happens, Hadoop takes the other copy of the data and uses it. Normally, data is replicated three times, but the replication factor is configurable.


CHAPTER 3: HDFS Services
The Hadoop 1.x and Hadoop 2.x core daemons are as follows:

3.1 Name Node


The Name Node is a master service. It runs on a single node and acts as the cluster manager. The primary job of the namenode is to manage the file system namespace. The file system tree and the metadata for all the files and directories are maintained in the namenode. It is the arbitrator and repository for all HDFS metadata. It maintains the namespace tree and the mapping of file blocks to Data Nodes, and persists the namespace on the local disk in the form of two files:
 the namespace image
 the edit log
All the file system metadata is stored on a metadata server. All metadata operations may be handled by a single metadata server, but a cluster may configure multiple metadata servers as primary-backup failover pairs. This metadata includes the namespace, data locations and access permissions.
Prior to Hadoop 2.x, the NameNode was a single point of failure (SPOF) in an HDFS cluster. With Zookeeper, the HDFS High Availability feature introduced in Hadoop 2.x addresses this problem by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby.

Operations
Clients contact the Name Node in order to perform common file system operations, such as open, close, rename, and delete. The Name Node does not store HDFS data itself, but rather maintains a mapping between an HDFS file name, the list of blocks in the file, and the Data Node(s) on which those blocks are stored. The system is designed in such a way that user data never flows through the Name Node.
The Name Node periodically receives a Heartbeat and a Block report from each of the Data Nodes present in the cluster. When the namenode periodically receives a Heartbeat from a Data Node, it means the datanode is functioning properly. A Block report contains a list of all blocks on a Data Node.
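To observe the result of these heartbeats and block reports from the command line, the dfsadmin report lists every Data Node known to the Name Node along with its capacity and last-contact time. This is a quick illustration, assuming the hdfs client is on your PATH as set up in the installation chapter:

$ hdfs dfsadmin -report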


Namenode Format
When the NameNode is formatted, a namespace ID is generated, which
essentially identifies that specific instance of the distributed filesystem. When
DataNodes first connect to the NameNode they store that namespace ID along
with the data blocks, because the blocks have to belong to a specific filesystem.
If a DataNode later connects to a NameNode, and the namespace ID
which the NameNode declares does not match the namespace ID stored on the
DataNode, it will refuse to operate with the "incompatible namespace ID" error.
It means that the DataNode has connected to a different NameNode, and the
blocks which it is storing don't belong to that distributed file system.
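You can inspect the namespace ID yourself: it is recorded in a VERSION file inside the NameNode and DataNode storage directories. The paths below assume the dfs.name.dir and dfs.data.dir locations used in the installation chapter of this book; adjust them to your own configuration:

$ cat /home/dexlab/hadoop/hdfs/name/current/VERSION
$ cat /home/dexlab/hadoop/hdfs/data/current/VERSION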

3.2 Secondary NameNode


Secondary NameNode: This is not a backup NameNode. In fact, it is a poorly
named component of the Hadoop platform. It performs some housekeeping
functions for the NameNode.
The goal of the edits file is to accumulate the changes during the system
operation. If the system is restarted, the contents of the edits file can be rolled
into fsimage during the restart.
The role of the Secondary NameNode is to periodically merge the contents
of the edits file in the fsimage file. To this end, the Secondary NameNode
periodically executes the following sequence of steps:
1. It asks the Primary to roll over the edits file, which ensures that
new edits go to a new file. This new file is called edits.new.
2. The Secondary NameNode requests the fsimage file and the edits
file from the Primary.
3. The Secondary NameNode merges the fsimage file and the edits file
into a new fsimage file.
4. The NameNode now receives the new fsimage file from the
Secondary NameNode, with which it replaces the old file. The edits
file is now replaced with the contents of the edits.new file created
in the first step.
5. The fstime file is updated to record when the checkpoint operation
took place.
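How often this checkpoint sequence runs is driven by configuration. The sketch below uses the Hadoop 2.x property names (older releases use fs.checkpoint.period and fs.checkpoint.size instead); the values shown are only illustrative and go in hdfs-site.xml:

<property>
<name>dfs.namenode.checkpoint.period</name>
<value>3600</value> <!-- seconds between two checkpoints -->
</property>
<property>
<name>dfs.namenode.checkpoint.txns</name>
<value>1000000</value> <!-- force a checkpoint after this many edit log transactions -->
</property>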

3.3 Data Node


In HDFS, the daemon responsible for storing and retrieving block data is called
the datanode (DN). The data nodes are responsible for serving read and write
requests from clients and perform block operations upon instructions from name
node. Each Data Node stores HDFS blocks on behalf of local or remote clients.


Each block is saved as a separate file in the node’s local file system. Because the
Data Node abstracts away details of the local storage arrangement, all nodes do
not have to use the same local file system. Blocks are created or destroyed on
Data Nodes at the request of the Name Node, which validates and processes
requests from clients. Although the Name Node manages the namespace, clients
communicate directly with Data Nodes in order to read or write data at the
HDFS block level. A Data node normally has no knowledge about HDFS files.
While starting up, it scans through the local file system and creates a list of
HDFS data blocks corresponding to each of these local files and sends this report
to the Name node.
Individual files are broken into blocks of a fixed size and distributed
across multiple DataNodes in the cluster. The Name Node maintains metadata
about the size and location of blocks and their replicas.
Hadoop was designed with an idea that DataNodes are "disposable
workers", servers that are fast enough to do useful work as a part of the cluster,
but cheap enough to be easily replaced if they fail.
The data block is stored on multiple computers, improving both resilience
to failure and data locality, taking into account that network bandwidth is a
scarce resource in a large cluster.

3.4 JobTracker
One of the master components, it is responsible for managing the overall
execution of a job. It performs functions such as scheduling child tasks
(individual Mapper and Reducer) to individual nodes, keeping track of the health
of each task and node, and even rescheduling failed tasks. As we will soon
demonstrate, like the NameNode, the Job Tracker becomes a bottleneck when it
comes to scaling Hadoop to very large clusters. The JobTracker daemon is
responsible for launching and monitoring MapReduce jobs.
 JobTracker process runs on a separate node and not usually on a
DataNode.
 JobTracker is an essential Daemon for MapReduce execution in
MRv1. It is replaced by ResourceManager/ApplicationMaster in
MRv2.
 JobTracker receives the requests for MapReduce execution from the
client.
 JobTracker talks to the NameNode to determine the location of the
data.
 JobTracker finds the best TaskTracker nodes to execute tasks based
on the data locality (proximity of the data) and the available slots to
execute a task on a given node.


 JobTracker monitors the individual TaskTrackers and submits the
overall status of the job back to the client.
 The JobTracker process is critical to the Hadoop cluster in terms of
MapReduce execution.
 When the JobTracker is down, HDFS will still be functional, but
MapReduce execution cannot be started and the existing
MapReduce jobs will be halted.

3.5 Task Tracker


The Task Tracker is a service daemon that runs on individual DataNodes. It is responsible for starting and managing individual Map/Reduce tasks, and it communicates with the JobTracker. It runs on each compute node of the Hadoop cluster and accepts requests for individual tasks such as Map, Reduce, and Shuffle operations. The actual execution of the tasks is controlled by the TaskTrackers, which are responsible for starting the map tasks.
Each TaskTracker is configured with a set of slots, usually set to the total number of cores available on the machine. When a request is received from the JobTracker to launch a task, the TaskTracker initiates a new JVM for the task. The TaskTracker is assigned a task depending on how many free slots it has (free slots = total slots - tasks currently running).
The TaskTracker is responsible for sending heartbeat messages to the JobTracker. Apart from telling the JobTracker that it is healthy, these messages also tell the JobTracker about the number of available free slots.
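In MRv1 the slot counts mentioned above are set per TaskTracker in mapred-site.xml. A minimal sketch with the classic MRv1 property names (the values here are only illustrative, e.g. for a 4-core machine):

<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>4</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>2</value>
</property>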

 TaskTracker runs on DataNodes, usually on all of them.


 TaskTracker is replaced by Node Manager in MRv2.
 Mapper and Reducer tasks are executed on DataNodes
administered by TaskTrackers.
 TaskTrackers will be assigned Mapper and Reducer tasks to
execute by JobTracker.
 TaskTracker will be in constant communication with the
JobTracker signalling the progress of the task in execution.
 TaskTracker failure is not considered fatal. When a TaskTracker
becomes unresponsive, JobTracker will assign the task executed by
the TaskTracker to another node.


CHAPTER 4: Hadoop Admin
4.1 Linux Basic Commands
tar command examples
Extract from an existing tar archive.
$ tar xvf archive_name.tar

grep command examples


Search for a given string in a file (case in-sensitive search).
$ grep -i "the" demo_file

Print the matched line, along with the 3 lines after it.
$ grep -A 3 -i "example" demo_text

Search for a given string in all files recursively


$ grep -r "dexlab" *

find command examples


Find files using file-name ( case in-sensitve find)
$ find -iname "MyCProgram.c"

Find all empty files in home directory


$ find ~ -empty

ssh command examples


Login to remote host
ssh -l jsmith remotehost.example.com

Debug ssh client


ssh -v -l jsmith remotehost.example.com

Display ssh client version


$ ssh -V


vim command examples


Go to the 143rd line of file
$ vim +143 filename.txt

Go to the first match of the specified search term


$ vim +/search-term filename.txt

Open the file in read-only mode
$ vim -R filename.txt

sort command examples


Sort a file in ascending order
$ sort names.txt

Sort a file in descending order


$ sort -r names.txt

Sort passwd file by 3rd field.


$ sort -t: -k 3n /etc/passwd | more

ls command examples
Display filesize in human readable format (e.g. KB, MB etc.,)
$ ls -lh

Order Files Based on Last Modified Time (In Reverse Order) Using ls -ltr
$ ls -ltr

Visual Classification of Files With Special Characters Using ls -F


$ ls -F

cd command examples
Use cd - to toggle between the last two directories
$ cd -

gzip command examples


To uncompress a *.gz file
$ gzip -d test.txt.gz

shutdown command examples


Shutdown the system and turn the power off immediately


$ shutdown -h now

Shutdown the system after 10 minutes


$ shutdown -h +10

Reboot the system using shutdown command


$ shutdown -r now

Force the filesystem check during reboot


$ shutdown -Fr now

free command examples


This command is used to display the free, used, and swap memory available in the
system.
Typical free command output. The output is displayed in kilobytes by default.
$ free

If you want to quickly check how many GB of RAM your system has, use the -g
option. The -b option displays the output in bytes, -k in kilobytes, -m in megabytes.
$ free -g

If you want to see the total memory (including swap), use the -t switch, which
will display a total line as shown below.
dexlab@dexlab-laptop:~$ free -t

kill command examples


Use kill command to terminate a process. First get the process id using ps -ef
command, then use kill -9 to kill the running Linux process as shown below. You
can also use killall, pkill, xkill to terminate a unix process.
$ ps -ef | grep vim
dexlab 7243 7222 9 22:43 pts/2 00:00:00 vim
$ kill -9 7243

rm command examples
Get confirmation before removing the file
$ rm -i filename.txt

It is very useful while giving shell metacharacters in the file name argument.
Print the filename and get confirmation before removing the file.
$ rm -i file*


Following example recursively removes all files and directories under the
example directory. This also removes the example directory itself.
$ rm -r example

cp command examples
Copy file1 to file2 preserving the mode, ownership and timestamp
$ cp -p file1 file2

Copy file1 to file2. If file2 exists, prompt for confirmation before overwriting it.
$ cp -i file1 file2

mv command examples
Rename file1 to file2. If file2 exists, prompt for confirmation before overwriting
it.
$ mv -i file1 file2

Note:
mv -f is just the opposite, which will overwrite file2 without prompting.
mv -v will print what is happening during file rename, which is useful while
specifying shell metacharacters in the file name argument.
$ mv -v file1 file2

cat command examples


You can view multiple files at the same time. Following example prints the
content of file1 followed by file2 to stdout.
$ cat file1 file2

While displaying the file, following cat -n command will prepend the line number
to each line of the output.
$ cat -n /etc/logrotate.conf

Some Important Commands


chmod command examples
chmod command is used to change the permissions for a file or directory.
Give full access to user and group (i.e read, write and execute) on a specific file.
$ chmod ug+rwx file.txt

Revoke all access for the group (i.e read, write and execute ) on a specific file.


$ chmod g-rwx file.txt

Apply the file permissions recursively to all the files in a directory and its sub-directories.
$ chmod -R ug+rwx dir-name

chown command examples


chown command is used to change the owner and group of a file.
To change the owner to oracle and the group to dba on a file, i.e. change both owner and
group at the same time.
$ chown oracle:dba dbora.sh

Use -R to change the ownership recursively


$ chown -R oracle:dba /home/oracle

passwd command examples


Change your password from command line using passwd. This will prompt for
the old password followed by the new password.
$ passwd

Super user can use passwd command to reset others password. This will not
prompt for current password of the user.
$ passwd dexlab

Remove password for a specific user. Root user can disable password for a
specific user. Once the password is disabled, the user can login without entering
the password.
$ passwd -d dexlab

mkdir command examples


Following example creates a directory called temp under your home directory.
$ mkdir ~/temp

Create nested directories using one mkdir command. If any of these directories
exist already, it will not display any error.
$ mkdir -p dir1/dir2/dir3/dir4/

uname command examples


Uname command displays important information about the system such as
Kernel name, Host name, Kernel release number,Processor type, etc.,
Sample uname output from a Ubuntu laptop is shown below


$ uname -a

whereis command examples


When you want to find out where a specific Unix command exists (for example,
where does ls command exists?), you can execute the following command.
$ whereis ls

When you want to search an executable from a path other than the whereis
default path, you can use -B option and give path as argument to it. This
searches for the executable lsmk in the /tmp directory, and displays it, if it is
available.
$ whereis -u -B /tmp -f lsmk

whatis command examples


Whatis command displays a single line description about a command
$ whatis ls

tail command examples


Print the last 10 lines of a file by default.
$ tail filename.txt

Print the last N lines of the file named filename.txt


$ tail -n N filename.txt

View the content of the file in real time using tail -f. This is useful to view the log
files, that keeps growing. The command can be terminated using CTRL-C.
$ tail -f log-file

less command examples


less is very efficient while viewing huge log files, as it doesn’t need to load the
full file while opening.
$ less huge-log-file.log

Once you open a file using the less command, the following two keys are very helpful.
CTRL+F forward one window
CTRL+B backward one window

su command examples
Switch to a different user account using su command. Super user can switch to
any other user without entering their password.


$ su - dexlab

Execute a single command as a different user. In the following example, the dexlab
user executes the ls command as the raj user. Once the command is executed,
control returns to dexlab's account.
[dexlab@dexlab]$ su - raj -c 'ls'

mysql command examples


mysql is probably the most widely used open source database on Linux. Even if
you run a mysql database on your server, you might end-up using the mysql
command (client) to connect to a mysql database running on the remote server.

To connect to a remote mysql database this will prompt for a password


$ mysql -u root -p -h 192.168.1.2

To connect to a local mysql database


$ mysql -u root -p

If you want to specify the mysql root password in the command line itself, enter
it immediately after -p (without any space).

To install a package in Linux


$ sudo apt-get install package-name

4.2 Some Hadoop Basic Shell Commands
Print the Hadoop version
$ hadoop version

List the contents of the root directory in HDFS


$ hadoop fs -ls /

Report the amount of space used and available on currently mounted filesystem
$ hadoop fs -df hdfs:/

Count the number of directories,files and bytes under the paths that match the
specified file pattern


$ hadoop fs -count hdfs:/

Run a cluster balancing utility


$ hadoop balancer

Create New Hdfs Directory


$ hadoop fs -mkdir /home/dexlab/hadoop

Add a sample text file from the local directory to the new
directory you created in HDFS during the previous step.
Create a new sample file (open it in vim, press i to insert some text, then save and quit with :wq)
$ vim sample.txt
$ hadoop fs -put sample.txt /home/dexlab/hadoop
$ hadoop fs -cat /home/dexlab/hadoop/sample.txt

List the contents of this new directory in HDFS.


$ hadoop fs -ls /home/dexlab/hadoop

Add the entire local directory in to /home/dexlab/training directory in HDFS


$ hadoop fs -put data/retail /home/dexlab/hadoop

Since your HDFS home directory (for example, /user/dexlab) is the default working directory, any command that does not use an absolute path is interpreted as relative to that directory. The next command will therefore list your home directory, and should show the items you have just added there.
$ hadoop fs -ls

See how much space this directory occupies in HDFS


$ hadoop fs -du -s -h hadoop/retail

Ensure this file is no longer in HDFS


$ hadoop fs -ls hadoop/retail/customers

Delete all files from the retail directory using a wildcard.


$ hadoop fs -rm hadoop/retail/*


To empty the trash


$ hadoop fs -expunge

Finally, remove the entire retail directory and all of its contents in HDFS.
$ hadoop fs -rm -r hadoop/retail

List the hadoop directory again


$ hadoop fs -ls hadoop

Add the purchases.txt file from the local directory /home/dexlab/training/ to the
hadoop directory you created in HDFS
$ hadoop fs -copyFromLocal /home/dexlab/training/purchases.txt hadoop/

To view the contents of your text file purchases.txt which is present in your
hadoop directory
$ hadoop fs -cat hadoop/purchases.txt

cp is used to copy files between directories present in HDFS


$ hadoop fs -cp /user/training/*.txt /home/dexlab/hadoop

The get command can be used as an alternative to the -copyToLocal command


$ hadoop fs -get hadoop/sample.txt /home/dexlab/training/

Display last kilobyte of the file purchases.txt


$ hadoop fs -tail hadoop/purchases.txt

Default file permissions are 666 in HDFS


Use chmod command to change permissions of a file
$ hadoop fs -ls hadoop/purchases.txt
$ sudo -u hdfs hadoop fs -chmod 600 hadoop/purchases.txt

Default names of owner and group are training,training


Use chown to change owner name and group name simultaneously
$ hadoop fs -ls hadoop/purchases.txt
$ sudo -u hdfs hadoop fs -chown root:root hadoop/purchases.txt


Default name of group is training


Use chgrp command to change group name
$ hadoop fs -ls hadoop/purchases.txt
$ sudo -u hdfs hadoop fs -chgrp training hadoop/purchases.txt

Move a directory from one location to other


$ hadoop fs -mv hadoop apache_hadoop

The default replication factor for a file is 3.


Use setrep command to change replication factor of a file
$ hadoop fs -setrep -w 2 apache_hadoop/sample.txt

Copy a directory from one cluster (or one location) to another


Use the distcp command to copy,
the -overwrite option to overwrite existing files,
and the -update option to synchronize both directories
$ hadoop distcp hdfs://namenodeA/apache_hadoop hdfs://namenodeB/hadoop

Command to make the name node leave safe mode


$ sudo -u hdfs hdfs dfsadmin -safemode leave

List all the hadoop file system shell commands


$ hadoop fs

Last but not least, always ask for help!


$ hadoop fs -help

4.3 Hadoop Installation


Environment required for Hadoop
The production environment of Hadoop is UNIX, but it can also be used in
Windows using Cygwin. Java 1.6 or above is needed to run Map Reduce
Programs. For Hadoop installation from a tarball in a UNIX environment, you
need:
 Java Installation
 SSH installation


4.4 Hadoop Installation and File Configuration

Java Installation
$ sudo apt-get install default-jdk

SSH Installation
SSH is used to interact with the master and slave computers without any password
prompt. First of all, create a dexlab user on the master and slave systems
$ useradd dexlab
$ passwd dexlab

To map the nodes open the hosts file present in /etc/ folder on all the machines
and put the ip address along with their host name.
vi /etc/hosts

Enter the lines below


190.12.1.114 hadoop-master
190.12.1.121 hadoop-slave-one
190.12.1.143 hadoop-slave-two

Set up SSH key in every node so that they can communicate among themselves
without password. Commands for the same are:
$ su dexlab
$ ssh-keygen -t rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub dexlab@hadoop-master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub dexlab@hadoop-slave-one
$ ssh-copy-id -i ~/.ssh/id_rsa.pub dexlab@hadoop-slave-two
$ chmod 0600 ~/.ssh/authorized_keys
$ exit

Hadoop Installation
Hadoop can be downloaded from
$ wget http://apache.claz.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz

Now extract the Hadoop and copy it to a location.


$ mkdir /home/dexlab/hadoop
$ sudo tar xvzf hadoop-2.6.0.tar.gz -C /home/dexlab/hadoop


Change the ownership of Hadoop folder


$ sudo chown -R dexlab /home/dexlab/hadoop

Change the Hadoop configuration files:


All the files are present in /home/dexlab/hadoop/etc/hadoop

1. In hadoop-env.sh file add


export JAVA_HOME=/usr/

2. In core-site.xml add the following between the configuration tags


<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

3. In hdfs-site.xml add the following between the configuration tags


<configuration>
<property>
<name>dfs.name.dir</name>
<value>/home/dexlab/hadoop/hdfs/name</value>
<final>true</final>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/dexlab/hadoop/hdfs/data</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

4. Open the Mapred-site.xml and make the change as shown below


<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>


5. Finally, update your $HOME/.bashrc


$ cd $HOME
$ vim .bashrc

Append the following Hadoop environment variables at the end, then save and exit.
export JAVA_HOME=/usr/
export HADOOP_INSTALL=/home/dexlab/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL

After this format the name node and start all the daemons
$ su dexlab
$ cd /home/dexlab/hadoop
$ bin/hadoop namenode -format
$ start-all.sh

After this use jps command to check daemons status


$ jps
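If everything started correctly in pseudo-distributed mode, jps should list the HDFS and YARN daemons roughly as shown below (the process IDs are only an illustration and will differ on your machine):

2865 NameNode
2984 DataNode
3092 SecondaryNameNode
3262 ResourceManager
3380 NodeManager
3687 Jps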

4.5 Hadoop Modes


Standalone Mode
 It's the default mode of Hadoop.
 HDFS is not utilized in this mode.
 The local file system (local hard disk) is used for input and output.
 Used for debugging purposes (see the example below).
 No custom configuration is required in the three Hadoop configuration
files (mapred-site.xml, core-site.xml, hdfs-site.xml).
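Because standalone mode runs entirely against the local file system, you can verify a fresh download without any of the configuration above. The sketch below uses the examples jar shipped with the Hadoop 2.6.0 tarball (paths assume you are inside the extracted Hadoop directory):

$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output 'dfs[a-z.]+'
$ cat output/*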

Pseudo Distributed Mode (Single Node Cluster/Testing mode)


 Configuration is required in given 3 files for this mode
 Replication factor is one for HDFS.
 Here one node will be used as Master Node / Data Node / Job
Tracker / Task Tracker
 Pseudo distributed cluster is a single node cluster where all
daemons are running on one node itself.


Fully Distributed Mode (or Multi Node Cluster)


 This is used for Production Purpose.
 Data are used and distributed across many nodes.
 Different Nodes will be used as Master Node / Data Node / Job
Tracker / Task Tracker.

4.6 Hadoop Architecture Overview


Apache Hadoop is an open-source software framework for storage and large-scale
processing of data-sets on clusters of commodity hardware. There are mainly five
building blocks inside this runtime environment (from bottom to top):
The cluster is the set of machines (nodes). Nodes are placed in racks. This is the hardware part of the infrastructure.

Figure 2: Hadoop Architecture

The YARN Infrastructure (Yet Another Resource Negotiator) is the framework responsible for providing the computational resources (e.g., CPUs, memory, etc.) needed for application execution. Its two main elements are the Resource Manager and the Node Manager. The Resource Manager (one per cluster) is the master. It knows where the slaves are located (Rack Awareness) and how many resources they have. It runs several services; the most important is the Resource Scheduler, which decides how to assign the resources.

Figure 3: Resource Manager

The Node Manager (many per cluster) is the slave of the infrastructure. When it starts, it announces itself to the Resource Manager. Periodically, it sends a heartbeat to the Resource Manager. Each Node Manager offers some resources to the cluster. At run time, the Resource Scheduler decides how to use this capacity. A Container is a fraction of the Node Manager's capacity and it is used by the client for running a program.

Figure 4: Node Manager and Containers

The HDFS Federation is the framework responsible for providing permanent, reliable and distributed


storage. This is typically used for storing inputs and outputs (but not intermediate ones).
Other alternative storage solutions exist; for instance, Amazon uses the Simple Storage Service (S3).
The MapReduce Framework is the software layer implementing the MapReduce paradigm.
The YARN infrastructure and the HDFS federation are completely decoupled and independent: the first one provides resources for running an application, while the second one provides storage. The MapReduce framework is only one of many possible frameworks that run on top of YARN (although currently it is the only one implemented).

4.7 YARN: Application Startup

Figure 5: YARN

In YARN, there are at least three actors:


 the Job Submitter (the client)
 the Resource Manager (the master)
 the Node Manager (the slave)
The application startup process is the following:
 a client submits an application to the Resource Manager
 the Resource Manager allocates a container
 the Resource Manager contacts the related Node Manager
 the Node Manager launches the container
 the Container executes the Application Master
The Application Master is responsible for the execution of a single application. It
asks the Resource Scheduler (Resource Manager) for containers and executes
specific programs (e.g., the main of a Java class) on the obtained containers. The
Application Master knows the application logic and thus is framework-specific.
The MapReduce framework provides its own implementation of an Application Master.
Figure 6: Application Master


The Resource Manager is a single point of failure in YARN. By using Application Masters, YARN spreads the metadata related to running applications over the cluster. This reduces the load on the Resource Manager and makes it fast to recover.
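To watch this startup flow end to end, you can submit one of the example MapReduce jobs that ship with Hadoop and then list the running applications; YARN allocates a container for the job's Application Master before the map tasks start. The paths below assume the HADOOP_INSTALL variable and PATH entries from the installation section:

$ yarn jar $HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar pi 2 10
$ yarn application -list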

4.8 Input Splits


The way HDFS has been set up, it breaks down very large files into large blocks
(for example, measuring 128MB), and stores three copies of these blocks on
different nodes in the cluster. HDFS has no awareness of the content of these
files.
In YARN, when a MapReduce job is started, the Resource Manager (the
cluster resource management and job scheduling facility) creates an Application
Master daemon to look after the lifecycle of the job. (In Hadoop 1, the JobTracker
monitored individual jobs as well as handling job scheduling and cluster resource
management.)
One of the first things the Application Master does is determine which file
blocks are needed for processing. The Application Master requests details from
the NameNode on where the replicas of the needed data blocks are stored. Using
the location data for the file blocks, the Application Master makes requests to
the Resource Manager to have map tasks process specific blocks on the slave
nodes where they’re stored.
Before looking at how the data blocks are processed, you need to look more
closely at how Hadoop stores data. In Hadoop, files are composed of individual
records, which are ultimately processed one-by-one by mapper tasks.
For example, the sample data set contains information about completed flights
within the United States between 1987 and 2008.
You have one large file for each year, and within every file, each
individual line represents a single flight. In other words, one line represents one
record. Now, remember that the block size for the Hadoop cluster is 64MB, which
means that the flight data files are broken into chunks of exactly 64MB.
Do you see the problem? If each map task processes all records in a
specific data block, what happens to those records that span block boundaries?
File blocks are exactly 64MB (or whatever you set the block size to be), and
because HDFS has no conception of what’s inside the file blocks, it can’t gauge
when a record might spill over into another block.
To solve this problem, Hadoop uses a logical representation of the data
stored in file blocks, known as input splits. When a MapReduce job client
calculates the input splits, it figures out where the first whole record in a block
begins and where the last record in the block ends.


In cases where the last record in a block is incomplete, the input split
includes location information for the next block and the byte offset of the data
needed to complete the record.
The figure shows this relationship between data blocks and input splits.

Figure 7: Application Master

MapReduce data processing is driven by this concept of input splits. The


number of input splits that are calculated for a specific application determines
the number of mapper tasks. Each of these mapper tasks is assigned, where
possible, to a slave node where the input split is stored. The Resource Manager
(or JobTracker, if you’re in Hadoop 1) does its best to ensure that input splits are
processed locally.
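You can see how a particular file is cut into blocks, and on which DataNodes each replica lives, with fsck. The path below is only an illustration; point it at any file you have loaded into HDFS:

$ hdfs fsck /user/dexlab/hadoop/purchases.txt -files -blocks -locations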

4.9 Rack Awareness


In Hadoop, two concepts are important: the Namenode and Hadoop Rack awareness. The Namenode is the centerpiece of an HDFS system; it keeps track of all the files in the file system as a directory tree.
Hadoop rack awareness is the strategy of choosing the nearest datanode. This is done based on the Hadoop rack information.

4.10 Hadoop Rack Awareness vs. Hadoop Namenode


Before comparing these two elements you must understand some basic infrastructure. Please also refer to the given image for a clear understanding of rack awareness in Hadoop.


What does the Hadoop Namenode do?


The Namenode keeps track of all files in a directory tree. But remember, it does not store
any data of these files itself.

Figure 8: Name Node

 Generally, client applications contact the Namenode to locate a
file or to perform add / remove / move operations.
 The process begins with the Data nodes sending heartbeats.
 Every 12th heartbeat is a block report.
 The Name node builds its metadata from the block reports.
 Heartbeats are sent over Transmission Control Protocol (TCP), roughly every 3 seconds.
 If the Namenode is down, the HDFS system is down as well.

4.11 Why Are We Using Hadoop Rack Awareness?

Figure 9: Rack Awareness

Hadoop rack awareness lets you manually define the rack number of each slave datanode in the cluster. This is the concept functioning behind Rack Awareness in Hadoop. We manually define the rack numbers because it lets us prevent data loss and enhance network performance.
Each block of data is transmitted to multiple machines, so if some machine fails, we do not lose all copies of the data.


 Data is never lost if an entire rack fails.
 Where possible, the system keeps bulky data flows within a rack.
 Traffic within a rack generally has higher bandwidth and lower latency.

4.12 What is a Rack in Hadoop?


Before switching over to rack awareness, you must understand Hadoop racks. Hadoop is built with two major components, namely:

Hadoop Distributed File System (HDFS) – A distributed file system for Hadoop (Hadoop also supports IBM GPFS-FPO).

The MapReduce component – A popular framework used for performing calculations on the data inside the distributed file system.

A Hadoop rack is a set of 30 or 40 nodes that are physically located close together and connected to the same network switch. Similarly, a Hadoop cluster is a collection of racks.

4.13 What is Rack Awareness in Hadoop?


Hadoop Rack awareness is a setup that improves network traffic when reading or writing HDFS files. If your Hadoop cluster has more than 30 to 40 nodes, a rack-aware configuration is helpful, because transmitting data between two nodes on the same rack is more efficient than transferring data between different racks. The diagrams above are good examples of Hadoop rack awareness.
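In practice, rack awareness is switched on by pointing the NameNode at a topology script that maps each DataNode address to a rack name. The script below is a hypothetical sketch using the slave IPs from the installation chapter; the property name is the Hadoop 2.x one (net.topology.script.file.name), while older releases call it topology.script.file.name, and the script path is only an assumption.

#!/bin/bash
# rack-topology.sh (hypothetical): print a rack name for every address Hadoop passes in
while [ $# -gt 0 ]; do
  case "$1" in
    190.12.1.121) echo "/rack-1" ;;
    190.12.1.143) echo "/rack-2" ;;
    *)            echo "/default-rack" ;;
  esac
  shift
done

Then reference it from core-site.xml:

<property>
<name>net.topology.script.file.name</name>
<value>/home/dexlab/hadoop/etc/hadoop/rack-topology.sh</value>
</property>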


CHAPTER 5: Hadoop Namespace
5.1 Block
A disk has a block size, which is the minimum amount of data that it can read or
write. Filesystems for a single disk build on this by dealing with data in blocks,
which are an integral multiple of the disk block size. Filesystem blocks are
typically a few kilobytes in size, whereas disk blocks are normally 512 bytes.
This is generally transparent to the filesystem user who is simply reading or
writing a file of whatever length.
Having a block abstraction for a distributed filesystem brings several
benefits. The first benefit is the most obvious: a file can be larger than any single
disk in the network. There’s nothing that requires the blocks from a file to be
stored on the same disk, so they can take advantage of any of the disks in the
cluster. Like in a filesystem for a single disk, files in HDFS are broken into
block-sized chunks, which are stored as independent units. HDFS, too, has the
concept of a block, but it is a much larger unit—64 MB by default. HDFS blocks
are large compared to disk blocks, and the reason is to minimize the cost of
seeks. By making a block large enough, the time to transfer the data from the
disk can be significantly longer than the time to seek to the start of the block.
Thus the time to transfer a large file made of multiple blocks operates at the disk
transfer rate.
A file can be made up of several blocks, which are stored on different
DataNodes chosen randomly on a block-by-block basis. As a result, access to a
file usually requires access to multiple DataNodes, which means that HDFS
supports file sizes far larger than a single-machine disk capacity. The DataNode
stores each HDFS data block in a separate file on its local filesystem with no
knowledge about the HDFS files themselves.
In fact, it would be possible, if unusual, to store a single file on an HDFS
cluster whose blocks filled all the disks in the cluster.
The default block size and replication factor are specified by Hadoop
configuration, but can be overwritten on a per-file basis. An application can


specify the block size and the number of replicas (the replication factor) for a specific
file at its creation time.
There are tools to perform filesystem maintenance, such as df and fsck,
that operate on the filesystem block level.
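As mentioned above, both values can be overridden per file when it is written, and checked afterwards with the stat command. A small sketch, assuming a Hadoop 2.x client (on older releases the property is dfs.block.size rather than dfs.blocksize); the file name is only an example:

$ hadoop fs -D dfs.blocksize=134217728 -D dfs.replication=2 -put sample.txt /home/dexlab/hadoop/
$ hadoop fs -stat "block size %o, replication %r" /home/dexlab/hadoop/sample.txt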

5.2 MetaData
Because of the relatively low amount of metadata per file (it only tracks
filenames, permissions, and the locations of each block), the NameNode stores all
of the metadata in the main memory, thus allowing for a fast random access. The
metadata storage is designed to be compact. As a result, a NameNode with 4 GB
of RAM is capable of supporting a huge number of files and directories.
Modern distributed and parallel file systems such as pNFS , PVFS, HDFS,
and GoogleFS treat metadata services as an independent system component,
separately from data servers. A reason behind this separation is to ensure that
metadata access does not obstruct the data access path. Another reason is design
simplicity and the ability to scale the two parts of the system independently.
Files and directories are represented on the NameNode by inodes, which
record attributes like permissions, modification and access times, namespace and
disk space quotas. The NameNode maintains the file system namespace. Any
change to the file system namespace or its properties is recorded by the
NameNode. HDFS keeps the entire namespace in RAM.
Metadata are the most important management information replicated for
namenode failover. In our solution, the metadata include initial metadata which
are replicated in initialization phase and two types of runtime metadata which
are replicated in replication phase. The initial metadata include two types of
files: version file which contains the version information of running HDFS and
file system image (fsimage) file which is a persistent checkpoint of the file
system. Both files are replicated only once in the initialization phase, because their
replication is a time-intensive process. The slave node updates the fsimage file based
on runtime metadata to make the file catch up with that of the primary node.
The name node has an in-memory data structure called FsImage that
contains the entire file system namespace and maps the files on to blocks. The
NameNode stores all Metadata in a file called FsImage.
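You can look inside a checkpointed fsimage with the offline image viewer that ships with Hadoop 2.x. The storage path below assumes the dfs.name.dir used in the installation chapter, and <txid> stands for the transaction ID suffix that the real file name carries:

$ ls /home/dexlab/hadoop/hdfs/name/current/
$ hdfs oiv -p XML -i /home/dexlab/hadoop/hdfs/name/current/fsimage_<txid> -o /tmp/fsimage.xml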

5.3 NameSpace
Traditional local file systems support a persistent name space. Local file system
views devices as being locally attached, the devices are not shared, and hence
there is no need in the file system design to enforce device sharing semantics.
HDFS supports a traditional hierarchical file organization. The file system
namespace hierarchy is similar to most other existing file systems; one can


create and remove files, move a file from one directory to another, or rename a
file. The NameNode exposes a file system namespace and allows data to be
stored on a cluster of nodes while allowing the user a single system view of the
file system. HDFS exposes a hierarchical view of the file system with files stored
in directories, and directories can be nested. The NameNode is responsible for
managing the metadata for the files and directories. The current HDFS
architecture allows only a single namespace for the entire cluster. This
namespace is managed by a single namenode. This architectural decision made
HDFS simpler to implement.
Files can be organized under Directory, which together form the
namespace of a file system. A file system typically organizes its namespace with
a tree-structured hierarchical organization. A distributed file system is a file
system that allows access to files from multiple hosts across a network.
A user or an application can create directories and store files inside these
directories.
Namespace partitioning has been a research topic for a long time, and
several methods have been proposed to solve this problem in academia. These
can be generally categorized into four types:
1. Static Subtree Partitioning
2. Hashing
3. Lazy Hybrid
4. Dynamic Subtree Partitioning

5.4 Namespace Issue


The large size of a namespace catering to millions of clients and billions of files and directories poses a big challenge in providing highly scalable and performant metadata services. In such systems a structured, decentralized, self-organizing and self-healing approach is required.


CHAPTER 6: Data Replication
6.1 File Placement
HDFS uses replication to maintain at least three copies (one primary and two
replicas) of every chunk. Applications that require more copies can specify a
higher replication factor typically at file create time. All copies of a chunk are
stored on different data nodes using a rack-aware replica placement policy. The
first copy is always written to the local storage of a data node to lighten the load
on the network. To handle machine failures, the second copy is distributed at
random on different data nodes on the same rack as the data node that stored
the first copy. This improves network bandwidth utilization because intra-rack
communication is faster than inter-rack communication, which often goes
through intermediate network switches.
a rack failure, HDFS stores a third copy distributed at random on data nodes in
a different rack.
HDFS uses a random chunk layout policy to map chunks of a file on to
different data nodes. At file create time, the name node randomly selects a data
node to store a chunk. This random chunk selection may often lead to a
sub-optimal file layout that is not uniformly load balanced. The name node is
responsible for maintaining the chunk-to-data-node mapping, which is used by clients
to access the desired chunk.

6.2 Data Replication


HDFS is designed to reliably store very large files across machines in a large
cluster. It stores each file as a sequence of blocks; all blocks in a file except the
last block are the same size. The blocks of a file are replicated for fault tolerance.
The block size and replication factor are configurable per file. An application can
specify the number of replicas of a file. The replication factor can be specified at
file creation time and can be changed later. Files in HDFS are write-once and
have strictly one writer at any time.


6.3 Block Replication


The NameNode is responsible for block replication. The Name Node makes all
decisions regarding the replication of blocks. Replica placement determines HDFS
reliability, availability and performance. Placing each replica on a unique rack helps in
preventing data loss on an entire rack failure and allows use of bandwidth from
multiple racks when reading data. This policy evenly distributes replicas in the
cluster which makes it easy to balance load on component failure. However, this
policy increases the cost of writes because a write needs to transfer blocks to
multiple racks. The NameNode keeps checking the number of replicas.
If a block is under-replicated, then it is put in the replication priority
queue. The highest priority is given to blocks with the lowest replica count. Placement of a new
replica is also based on the replication priority. If the number of existing replicas
is one, then a different rack is chosen to place the next replica. In case of two
replicas of the block on the same rack, the third replica is placed on a different
rack. Otherwise, the third replica is placed on a different node in the same rack
as an existing replica. The NameNode also checks that all replica of a block
should not be at one rack. If so, NameNode treats the block as under-replicated
and replicates the block to a different rack and deletes the old replica.

6.4 Replication Factor


The Name Node maintains the file system namespace. Any change to the file
system namespace or its properties is recorded by the Name Node. An
application can specify the number of replicas of a file that should be maintained
by HDFS. The number of copies of a file is called the replication factor of that
file. This information is stored by the Name Node.


CHAPTER 7: Communication
7.1 NameNode-DataNode
DataNode and NameNode connections are established by a handshake in which
the namespace ID and the software version of the DataNode are verified. The
namespace ID is assigned to the file system instance when it is formatted and is
stored persistently on all nodes of the cluster. A node with a different namespace
ID cannot join the cluster.

7.2 Data Communication


All HDFS communication protocols are layered on top of the TCP/IP protocol. A
client establishes a connection to a configurable TCP port on the NameNode
machine and speaks the ClientProtocol with the NameNode. The DataNodes talk
to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC)
abstraction wraps both the ClientProtocol and the DataNode Protocol. By design,
the NameNode never initiates any RPCs; it only responds to RPC requests issued
by DataNodes or clients.

7.3 Heart Beat


Heartbeats carry information about the total storage capacity of the DataNode,
the fraction of storage in use, and the number of data transfers currently in
progress. These statistics are used for the NameNode’s space allocation and load
balancing decisions. The NameNode can process thousands of heartbeats per
second without affecting its other operations.
The NameNode considers a DataNode alive as long as it receives heartbeat
messages from it (the default heartbeat interval is three seconds). If the
NameNode does not receive a heartbeat from a DataNode within the configured
timeout, it considers the DataNode dead and stops forwarding IO requests to it.
The NameNode then schedules the creation of new replicas of that DataNode’s
blocks on other DataNodes.


DataNodes send regular heartbeats to the NameNode so that it can detect
DataNode failure. During normal operation, a heartbeat confirms that the
DataNode is operating and that the block replicas it hosts are available. If the
NameNode does not receive heartbeats from a DataNode for a predetermined
period, it marks the node as dead and does not forward any new read, write or
replication requests to it. The NameNode then schedules the creation of new
replicas of that node’s blocks on other DataNodes.
The heartbeat channel also carries the block report from the DataNode. By
design, the NameNode never initiates any remote procedure calls (RPCs); it only
responds to RPC requests issued by DataNodes or clients, and it replies to
heartbeats with replication requests for the specific DataNode.
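For reference, the heartbeat interval can be tuned in hdfs-site.xml; the snippet
below is only a sketch showing the usual default of three seconds (property names
can differ slightly between Hadoop versions):

<property>
   <name>dfs.heartbeat.interval</name>
   <value>3</value>
</property>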

Heart Beat Contents


The contents of the heartbeat message are:
 Progress report of tasks currently running on sender TaskTracker.
 Lists of completed or failed tasks.
 State of resources – virtual memory, disk space, etc.
 A Boolean flag (acceptNewTasks) indicating whether the sender
TaskTracker is ready to accept new tasks.

The NameNode does not directly call DataNodes. It uses replies to heartbeats to
send instructions to the DataNodes. The instructions include commands to:
 replicate blocks to other nodes;
 remove local block replicas;
 re-register or to shut down the node;
 send an immediate block report.

7.4 Block Report


The DataNode stores HDFS data in files in its local file system. The DataNode
has no knowledge about HDFS files. It stores each block of HDFS data in a
separate file in its local file system. The DataNode does not create all files in the
same directory. Instead, it uses a heuristic to determine the optimal number of
files per directory and creates subdirectories appropriately. It is not optimal to
create all local files in the same directory because the local file system might not
be able to efficiently support a huge number of files in a single directory. When a
DataNode starts up, it scans through its local file system, generates a list of all
HDFS data blocks that correspond to each of these local files and sends this
report to the NameNode: this is the Blockreport.
A DataNode identifies block replicas in its possession to the NameNode by
sending a block report. A block report contains the block id, the generation stamp
and the length of each block replica the server hosts. The first block report is
sent immediately after the DataNode registers. Subsequent block reports are
sent every hour and provide the NameNode with an up-to-date view of where
block replicas are located on the cluster.
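As a configuration sketch (illustrative only, not part of the text above), the block
report interval is controlled by a property in hdfs-site.xml; the value below
corresponds to the hourly reports described here:

<property>
   <name>dfs.blockreport.intervalMsec</name>
   <value>3600000</value>
</property>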


CHAPTER 8: Failure Management
8.1 Checkpoint
Checkpoint is an image record written persistently to disk. Name Node uses two
types of files to persist its namespace:

Fsimage: the latest checkpoint of the namespace

Edits: logs containing changes to the namespace; these logs are also called
journals.

The NameNode keeps an image of the entire file system namespace and
file Blockmap in memory. This key metadata item is designed to be compact,
such that a NameNode with 4 GB of RAM is plenty to support a huge number of
files and directories. When the NameNode starts up, it reads the FsImage and
EditLog from disk, applies all the transactions from the EditLog to the in-
memory representation of the FsImage, and flushes out this new version into a
new FsImage on disk. It can then truncate the old EditLog because its
transactions have been applied to the persistent FsImage. This process is called
a checkpoint.
The Checkpoint node uses the parameter fs.checkpoint.period to set the interval
between two consecutive checkpoints; the interval is given in seconds (default
3600 seconds, i.e. one hour). The maximum edit log size is specified by the
parameter fs.checkpoint.size (default 64 MB); a checkpoint is triggered earlier if
the edit log exceeds this size. Multiple checkpoint nodes may be specified in the
cluster configuration file.
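A minimal configuration sketch for these two parameters, using the default
values mentioned above (newer Hadoop releases move these settings under the
dfs.namenode.checkpoint.* prefix):

<property>
   <name>fs.checkpoint.period</name>
   <value>3600</value>
</property>
<property>
   <name>fs.checkpoint.size</name>
   <value>67108864</value>
</property>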

8.2 FSImage
The entire filesystem namespace is contained in a file called the FsImage stored
as a file in the NameNode’s local filesystem. The image file represents an HDFS
metadata state at a point in time.


The FsImage contains the entire file system namespace, including the mapping
of blocks to files and the file system properties. On restart, the NameNode
creates up-to-date file system metadata by merging the two files, i.e. fsimage and
edits; it then overwrites fsimage with the new HDFS state and begins a new
edits journal. A Checkpoint node periodically downloads the latest fsimage and
edits from the active NameNode, merges them locally to create a new checkpoint
and uploads it back to the active NameNode. This requires the same amount of
memory as the NameNode itself, so the Checkpoint node needs to run on a
separate machine. Namespace information is lost if either the checkpoint or the
journal is missing, so it is highly recommended to configure HDFS to store the
checkpoint and the journal in multiple storage directories.
The fsimage file is a persistent checkpoint of the filesystem metadata.
However, it is not updated for every filesystem write operation, because writing
out the fsimage file, which can grow to be gigabytes in size, would be very slow.
This does not compromise resilience, however, because if the namenode fails,
then the latest state of its metadata can be reconstructed by loading the fsimage
from disk into memory, and then applying each of the operations in the edit log.
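To see what an fsimage actually contains, Hadoop ships an Offline Image Viewer
(the oiv option listed later in Section 9.5). The invocation below is only an
illustrative sketch; replace the placeholder with an fsimage file from your
NameNode storage directory, and on older releases invoke it as hadoop oiv
instead of hdfs oiv:

$ hdfs oiv -i <path-to-fsimage-file> -o /tmp/fsimage.xml -p XML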

8.3 EditLog
The NameNode also uses a transaction log to persistently record every change
that occurs in filesystem metadata (metadata store). This log is stored in the
EditLog file on the NameNode’s local filesystem. Edit log is a transactional log of
every filesystem metadata change since the image file was created.
The Name Node uses a transaction log called the Edit Log to persistently
record every change that occurs to file system metadata. For example, creating a
new file in HDFS causes the Name Node to insert a record into the Edit Log
indicating this. Similarly, changing the replication factor of a file causes a new
record to be inserted into the Edit Log. The Name Node uses a file in its local
host OS file system to store the Edit Log.

8.4 Backup Node


The Backup node provides the same check pointing functionality as the
Checkpoint node, as well as maintaining an in-memory, up-to-date copy of the
file system namespace that is always synchronized with the active Name Node
state. Along with accepting a journal stream of file system edits from the Name
Node and persisting this to disk, the Backup node also applies those edits into its
own copy of the namespace in memory, thus creating a backup of the namespace.


The Backup node does not need to download fsimage and edits files from the
active NameNode in order to create a checkpoint, as would be required with a
Checkpoint node or Secondary NameNode, since it already has an up-to-date
state of the namespace in memory. The Backup node checkpoint process is
more efficient as it only needs to save the namespace into the local fsimage file
and reset edits. As the Backup node maintains a copy of the namespace in
memory, its RAM requirements are the same as the NameNode. The NameNode
supports one Backup node at a time. No Checkpoint nodes may be registered if a
Backup node is in use. Using multiple Backup nodes concurrently will be
supported in the future.
The Backup node is configured in the same manner as the Checkpoint
node. It is started with bin/hdfs namenode -checkpoint. The location of the
Backup (or Checkpoint) node and its accompanying web interface are configured
via the dfs.backup.address and dfs.backup.http.address configuration variables.
Use of a Backup node provides the option of running the NameNode with no
persistent storage, delegating all responsibility for persisting the state of the
namespace to the Backup node. To do this, start the NameNode with the –i

8.5 Block Scanner


Each DataNode runs a block scanner that periodically scans its block replicas
and verifies that stored checksums match the block data. In each scan period,
the block scanner adjusts the read bandwidth in order to complete the
verification in a configurable period. If a client reads a complete block and
checksum verification succeeds, it informs the DataNode. The DataNode treats it
as a verification of the replica.
The verification time of each block is stored in a human-readable log file.
At any time there are up to two such files in the top-level DataNode directory,
the current and the previous log; new verification times are appended to the
current file. Correspondingly, each DataNode keeps an in-memory scanning list
ordered by the replicas’ verification times.
Whenever a read client or a block scanner detects a corrupt block, it
notifies the NameNode. The NameNode marks the replica as corrupt but does not
schedule its deletion immediately; instead, it starts to replicate a good copy of the
block. Only when the number of good replicas reaches the replication factor of
the block is the corrupt replica scheduled for removal. This policy aims to
preserve data as long as possible, so even if all replicas of a block are corrupt, it
allows the user to retrieve data from the corrupt replicas.


8.6 Failure Types


The primary objective of the HDFS is to store data reliably even in the presence
of failures. The three common types of failures are

 NameNode Failures
 DataNodes Failures
 Network Partitions

Several things can cause loss of connectivity between the name node and the
data nodes. Therefore, each data node sends periodic heartbeat messages to its
name node, and the name node detects loss of connectivity when it stops
receiving them. The name node marks data nodes that do not respond to
heartbeats as dead and refrains from sending further requests to them. Data
stored on a dead node is no longer available to an HDFS client from that node,
and the node is effectively removed from the system.


CHAPTER 9: MapReduce
9.1 What is MapReduce?
MapReduce is a processing technique and a programming model for distributed
computing based on java. The MapReduce algorithm contains two important
tasks, namely Map and Reduce. Map takes a set of data and converts it into
another set of data, where individual elements are broken down into tuples
(key/value pairs). Secondly, reduce task, which takes the output from a map as
an input and combines those data tuples into a smaller set of tuples. As the
sequence of the name MapReduce implies, the reduce task is always performed
after the map job.
The major advantage of MapReduce is that it is easy to scale data
processing over multiple computing nodes. Under the MapReduce model, the
data processing primitives are called mappers and reducers. Decomposing a
data processing application into mappers and reducers is sometimes nontrivial.
But, once we write an application in the MapReduce form, scaling the
application to run over hundreds, thousands, or even tens of thousands of
machines in a cluster is merely a configuration change. This simple scalability
is what has attracted many programmers to use the MapReduce model.

9.2 The Algorithm


 Generally, the MapReduce paradigm is based on sending the computation to
where the data resides.
 MapReduce program executes in three stages, namely map stage, shuffle
stage, and reduce stage.
Map Stage: The map or mapper’s job is to process the input data.
Generally the input data is in the form of file or directory and is stored in
the Hadoop file system (HDFS). The input file is passed to the mapper
function line by line. The mapper processes the data and creates several
small chunks of data.
Reduce Stage: This stage is the combination of the Shuffle stage and
the Reduce stage. The Reducer’s job is to process the data that comes
from the mapper. After processing, it produces a new set of output, which
will be stored in the HDFS.


 During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.
 The framework manages all the details of data-passing such as issuing
tasks, verifying task completion, and copying data around the cluster
between the nodes.
 Most of the computing takes place on nodes with data on local disks that
reduces the network traffic.
 After completion of the given tasks, the cluster collects and reduces the
data to form an appropriate result, and sends it back to the Hadoop
server.

Figure 10: Mapreduce Flow

 Hadoop limits the amount of communication which can be performed by the
processes, as each individual record is processed by a task in isolation from
the others.

Figure 11: The overall mapreduce word count process

 By restricting the communication between nodes, Hadoop makes the
distributed system much more reliable. Individual node failures can be
worked around by restarting tasks on other machines.
 The other workers continue to operate as though nothing went wrong,
leaving the challenging aspects of partially restarting the program to the
underlying Hadoop layer.
Map: (in_key, in_value) → list(out_key, intermediate_value)
Reduce: (out_key, list(intermediate_value)) → list(out_value)


9.3 Inputs and Outputs (Java Perspective)


The MapReduce framework operates on <key, value> pairs, that is, the
framework views the input to the job as a set of <key, value> pairs and produces
a set of <key, value> pairs as the output of the job, conceivably of different
types.
The key and value classes have to be serializable by the framework and hence
need to implement the Writable interface. Additionally, the key classes have to
implement the WritableComparable interface to facilitate sorting by the
framework. The input and output types of a MapReduce job are:
(Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output).

           Input             Output
Map        <k1, v1>          list(<k2, v2>)
Reduce     <k2, list(v2)>    list(<k3, v3>)

9.4 Terminology
PayLoad: Applications implement the Map and the Reduce functions and form
the core of the job.
Mapper: Maps the input key/value pairs to a set of intermediate key/value pairs.
NameNode: Node that manages the Hadoop Distributed File System (HDFS).
DataNode: Node where data is present in advance before any processing takes place.
MasterNode: Node where the JobTracker runs and which accepts job requests
from clients.
SlaveNode: Node where the Map and Reduce programs run.
JobTracker: Schedules jobs and tracks the jobs assigned to the TaskTracker.
TaskTracker: Tracks the tasks and reports status to the JobTracker.
Job: An execution of a Mapper and a Reducer across a dataset.
Task: An execution of a Mapper or a Reducer on a slice of data.
Task Attempt: A particular instance of an attempt to execute a task on a
SlaveNode.


9.5 Important Commands


All Hadoop commands are invoked by the $HADOOP_HOME/bin/hadoop
command. Running the Hadoop script without any arguments prints the
description for all commands.
Usage: hadoop [--config confdir] COMMAND

The following table lists the options available and their description.

Option                           Description
namenode -format                 Formats the DFS filesystem.
secondarynamenode                Runs the DFS secondary namenode.
namenode                         Runs the DFS namenode.
datanode                         Runs a DFS datanode.
dfsadmin                         Runs a DFS admin client.
mradmin                          Runs a Map-Reduce admin client.
fsck                             Runs a DFS filesystem checking utility.
fs                               Runs a generic filesystem user client.
balancer                         Runs a cluster balancing utility.
oiv                              Applies the offline fsimage viewer to an fsimage.
fetchdt                          Fetches a delegation token from the NameNode.
jobtracker                       Runs the MapReduce JobTracker node.
pipes                            Runs a Pipes job.
tasktracker                      Runs a MapReduce TaskTracker node.
historyserver                    Runs the job history server as a standalone daemon.
job                              Manipulates MapReduce jobs.
queue                            Gets information regarding JobQueues.
version                          Prints the version.
jar <jar>                        Runs a jar file.
distcp <srcurl> <desturl>        Copies files or directories recursively.
distcp2 <srcurl> <desturl>       DistCp version 2.
archive -archiveName NAME -p <parent path> <src>* <dest>   Creates a Hadoop archive.
classpath                        Prints the class path needed to get the Hadoop jar
                                 and the required libraries.
daemonlog                        Gets/sets the log level for each daemon.

9.6 How to Interact with MapReduce Jobs


Usage: hadoop job [GENERIC_OPTIONS]

The following are the Generic Options available in a Hadoop job.


GENERIC_OPTIONS                                  Description
-submit <job-file>                               Submits the job.
-status <job-id>                                 Prints the map and reduce completion percentage
                                                 and all job counters.
-counter <job-id> <group-name> <countername>     Prints the counter value.
-kill <job-id>                                   Kills the job.
-events <job-id> <fromevent-#> <#-of-events>     Prints the events' details received by the
                                                 jobtracker for the given range.
-history [all] <jobOutputDir>                    Prints job details, failed and killed tip details.
                                                 More details about the job, such as successful
                                                 tasks and task attempts made for each task, can
                                                 be viewed by specifying the [all] option.
-list [all]                                      Displays all jobs. -list displays only jobs which
                                                 are yet to complete.
-kill-task <task-id>                             Kills the task. Killed tasks are NOT counted
                                                 against failed attempts.
-fail-task <task-id>                             Fails the task. Failed tasks are counted against
                                                 failed attempts.
-set-priority <job-id> <priority>                Changes the priority of the job. Allowed priority
                                                 values are VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW.

To see the status of job


$ $HADOOP_HOME/bin/hadoop job -status <JOB-ID>
e.g.
$ $HADOOP_HOME/bin/hadoop job -status job_201310191043_0004

To see the history of job output-dir


$ $HADOOP_HOME/bin/hadoop job -history <DIR-NAME>
e.g.
$ $HADOOP_HOME/bin/hadoop job -history /user/expert/output

To kill the job


$ $HADOOP_HOME/bin/hadoop job -kill <JOB-ID>
e.g.
$ $HADOOP_HOME/bin/hadoop job -kill job_201310191043_0004


9.7 MapReduce Program for Word Count

The Canonical Wordcount Example


Counting the number of words in a large document is the “hello world” of
map/reduce, and it is among the simplest of full map+reduce Hadoop jobs you
can run. Recalling that the map step transforms the raw input data into
key/value pairs and the reduce step transforms the key/value pairs into your
desired output, we can conceptually describe the act of counting words as follows:
 The map step will take the raw text of our document (I will use Herman
Melville’s classic, Moby Dick) and convert it to key/value pairs. Each key is
a word, and all keys (words) will have a value of 1.
 The reduce step will combine all duplicate keys by adding up their
values. Since every key (word) has a value of 1, this will reduce our output
to a list of unique keys, each with a value corresponding to that key’s
(word’s) count.
With Hadoop Streaming, we need to write a program that acts as the mapper and
a program that acts as the reducer. These applications must interface with
input/output streams in a way equivalent to the following series of pipes:
Figure 12: Mapreduce Working

$ cat input.txt | ./mapper.py | sort | ./reducer.py > output.txt

That is, Hadoop Streaming sends raw data to the mapper via stdin, then
sends the mapped key/value pairs to the reducer via stdin.

9.8 The Mapper


The mapper, as described above, is quite simple to implement in Python. It will
look something like this:
#!/usr/bin/env python

import sys

for line in sys.stdin:
    line = line.strip()
    keys = line.split()
    for key in keys:
        value = 1
        print( "%s\t%d" % (key, value) )

where
1. Hadoop sends a line of text from the input file (“line” being defined by a
string of text terminated by a linefeed character, \n)
2. Python strips all leading/trailing whitespace (line.strip())
3. Python splits that line into a list of individual words along whitespace
(line.split())
4. For each word (which will become a key), we assign a value of 1 and then
print the key-value pair on a single line, separated by a tab (\t)
A more detailed explanation of this process can be found in Yahoo’s excellent
Hadoop Tutorial.

9.9 The Shuffle


A lot happens between the map and reduce steps that is largely transparent to
the developer. In brief, the output of the mappers is transformed and distributed
to the reducers (termed the shuffle step) in such a way that
1. All key/value pairs are sorted before being presented to the reducer
function
2. All key/value pairs sharing the same key are sent to the same reducer
These two points are important because
1. As you read in key/value pairs, if you encounter a key that is different
from the last key you processed, you know that that previous key will
never appear again
2. If your keys are all the same, you will only use one reducer and gain no
parallelization. You should come up with a more unique key if this
happens!

9.10 The Reducer


The output from our mapper step will go to the reducer step sorted. Thus, we can
loop over all input key/pairs and apply the following logic:
If this key is the same as the previous key,
add this key's value to our running total.
Otherwise,
print out the previous key's name and the running total,
reset our running total to 0,
add this key's value to the running total, and
"this key" is now considered the "previous key"


Translating this into Python and adding a little extra code to tighten up
the logic, we get
#!/usr/bin/env python

import sys

last_key = None
running_total = 0

for input_line in sys.stdin:
    input_line = input_line.strip()
    this_key, value = input_line.split("\t", 1)
    value = int(value)

    if last_key == this_key:
        running_total += value
    else:
        if last_key:
            print( "%s\t%d" % (last_key, running_total) )
        running_total = value
        last_key = this_key

if last_key == this_key:
    print( "%s\t%d" % (last_key, running_total) )

9.11 Running the Hadoop Job


If we name the mapper script mapper.py and the reducing script reducer.py, we
would first want to download the input data (Moby Dick) and load it into HDFS.
I purposely am renaming the copy stored in HDFS to mobydick.txt instead of
the original pg2701.txt to highlight the location of the file:
$ wget http://www.gutenberg.org/cache/epub/2701/pg2701.txt
$ hadoop dfs -mkdir wordcount
$ hadoop dfs -copyFromLocal ./pg2701.txt wordcount/mobydick.txt

You can verify that the file was loaded properly:


$ hadoop dfs -ls wordcount/mobydick.txt
Found 1 items
-rw-r--r-- 2 glock supergroup 1257260 2013-07-17 13:24 /user/glock/wordcount/mobydick.txt

Before submitting the Hadoop job, you should make sure your mapper
and reducer scripts actually work.
This is just a matter of running them through pipes on a little bit of sample data
(e.g., the first 1000 lines of Moby Dick):


$ head -n1000 pg2701.txt | ./mapper.py | sort | ./reducer.py


...
young 4
your 16
yourself 3
zephyr 1

Once you know the mapper/reducer scripts work without errors, we can
plug them into Hadoop. We accomplish this by running the Hadoop Streaming
jar file as our Hadoop job. This hadoop-streaming-X.Y.Z.jar file comes with the
standard Apache Hadoop distribution and should be in $HADOOP_HOME/
contrib/streaming where $HADOOP_HOME is the base directory of your Hadoop
installation and X.Y.Z is the version of Hadoop you are running. On Gordon the
location is /opt/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar, so our
actual job launch command would look like
$ hadoop \
jar /opt/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar \
-mapper "python $PWD/mapper.py" \
-reducer "python $PWD/reducer.py" \
-input "wordcount/mobydick.txt" \
-output "wordcount/output"

packageJobJar: [/scratch/glock/819550.gordon-fe2.local/hadoop-glock/data/hadoop-
unjar4721749961014550860/] [] /tmp/streamjob7385577774459124859.jar tmpDir=null
13/07/17 19:26:16 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/07/17 19:26:16 WARN snappy.LoadSnappy: Snappy native library not loaded
13/07/17 19:26:16 INFO mapred.FileInputFormat: Total input paths to process : 1
13/07/17 19:26:16 INFO streaming.StreamJob: getLocalDirs(): [/scratch/glock/819550.gordon-
fe2.local/hadoop-glock/data/mapred/local]
13/07/17 19:26:16 INFO streaming.StreamJob: Running job: job_201307171926_0001
13/07/17 19:26:16 INFO streaming.StreamJob: To kill this job, run:
13/07/17 19:26:16 INFO streaming.StreamJob: /opt/hadoop/libexec/../bin/hadoop job -
Dmapred.job.tracker=gcn-13-34.ibnet0:54311 -kill job_201307171926_0001
13/07/17 19:26:16 INFO streaming.StreamJob: Tracking URL: http://gcn-13-
34.ibnet0:50030/jobdetails.jsp?jobid=job_201307171926_0001

And at this point, the job is running. That “tracking URL” is a bit deceptive in
that you probably won’t be able to access it. Fortunately, there is a command-line
interface for monitoring Hadoop jobs that is somewhat similar to qstat. Noting
the Hadoop job id shown in the output above, you can do:
$ hadoop job -status job_201307171926_0001
Job: job_201307171926_0001
file: hdfs://gcn-13-34.ibnet0:54310/scratch/glock/819550.gordon-fe2.local/hadoop-
glock/data/mapred/staging/glock/.staging/job_201307171926_0001/job.xml
tracking URL: http://gcn-13-34.ibnet0:50030/jobdetails.jsp?jobid=job_201307171926_0001
map() completion: 1.0
reduce() completion: 1.0


Counters: 30
Job Counters
Launched reduce tasks=1
SLOTS_MILLIS_MAPS=16037
Total time spent by all reduces waiting after reserving slots (ms)=0
Total time spent by all maps waiting after reserving slots (ms)=0
Launched map tasks=2
Data-local map tasks=2
...

Since the hadoop streaming job runs in the foreground, you will have to
use another terminal (with HADOOP_CONF_DIR properly exported) to check on
the job while it runs. However, you can also review the job metrics after the job
has finished. In the example highlighted above, we can see that the job only used
one reduce task and two map tasks despite the cluster having more than two
nodes.

Adjusting Parallelism
Unlike a traditional HPC job, the level of parallelism of a Hadoop job is not
necessarily the full size of your compute resource. The number of map tasks is
ultimately determined by the nature of your input data, due to how HDFS
distributes chunks of data to your mappers. You can “suggest” a number of
mappers when you submit the job, though. Doing so is a matter of adding the
-D mapred.map.tasks option shown below:
$ hadoop jar /opt/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar \
-D mapred.map.tasks=4 \
-mapper "python $PWD/mapper.py" \
-reducer "python $PWD/reducer.py" \
-input wordcount/mobydick.txt \
-output wordcount/output
...
$ hadoop job -status job_201307172000_0001
...
Job Counters
Launched reduce tasks=1
SLOTS_MILLIS_MAPS=24049
Total time spent by all reduces waiting after reserving slots (ms)=0
Total time spent by all maps waiting after reserving slots (ms)=0
Rack-local map tasks=1
Launched map tasks=4
Data-local map tasks=3

Similarly, you can add -D mapred.reduce.tasks=4 to suggest the number of
reducers. The reducer count is a little more flexible, and you can set it to zero if
you just want to apply a mapper to a large file.


With all this being said, the fact that Hadoop defaults to only two mappers says
something about the problem we’re trying to solve: this entire example is very
contrived. While it illustrates the concepts quite neatly,
counting words in a 1.2 MB file is a waste of time if done through Hadoop
because, by default, Hadoop assigns chunks to mappers in increments of 64 MB.
Hadoop is meant to handle multi-gigabyte files, and actually getting Hadoop
streaming to do something useful for your research often requires a bit more
knowledge than what I’ve presented above.
To fill in these gaps, the next part of this tutorial, Parsing VCF Files with
Hadoop Streaming, shows how I applied Hadoop to solve a real-world problem
involving Python, some exotic Python libraries, and some not-completely-uniform
files.


CHAPTER 10: Hive
10.1 Hive Overview
What is Hive?
Hive is a data warehouse package used for processing, managing and querying
structured data in Hadoop. It eases the analysis process and summarises big
data, and it acts as a platform for developing SQL-type scripts to perform
MapReduce operations.
Hive was initially developed by Facebook and was later taken up by the
Apache Software Foundation, which develops it further as the open-source
project Apache Hive. Several companies now use Hive; for example, Amazon uses
it in Amazon Elastic MapReduce.

10.2 Hive is not


● A relational database
● A design for OnLine Transaction Processing (OLTP)
● A language for real-time queries and row-level updates

10.3 Merits of Hive


● It stores the schema in a database and the processed data in HDFS.
● It is designed for OLAP-style workloads, where data is stored first and
  analysed later.
● It provides an SQL-like language for querying called HiveQL or HQL; it
  resembles SQL but is not fully standard SQL.
● It is familiar, fast, scalable, and extensible.

10.4 Architecture of Hive


The following component diagram depicts the architecture of Hive:


Figure 13: Hive Architecture

This component diagram contains different units. The following list describes
each unit:

User Interface: Hive is data warehouse infrastructure software that creates
interaction between the user and HDFS. The user interfaces that Hive supports
are the Hive Web UI, the Hive command line, and Hive HD Insight (on Windows
Server).

Meta Store: Hive chooses a database server to store the schema or metadata of
tables, databases, columns in a table, their data types, and the HDFS mapping.

HiveQL Process Engine: HiveQL is similar to SQL for querying the schema
information in the Metastore. It is one of the replacements of the traditional
approach of writing a MapReduce program; instead of writing a MapReduce
program in Java, we can write a query for the MapReduce job and process it.

Execution Engine: The conjunction of the HiveQL process engine and MapReduce
is the Hive execution engine. The execution engine processes the query and
generates the same results as MapReduce; it uses the flavour of MapReduce.

HDFS or HBASE: The Hadoop Distributed File System or HBASE are the data
storage techniques used to store data in the file system.

10.5 Hive - Installation


All Hadoop sub-projects such as Hive, Pig, and HBase support Linux operating
system. Therefore, you need to install any Linux flavored OS. The following
simple steps are executed for Hive installation:


Step 1: Verifying JAVA Installation


Java must be installed on your system before installing Hive. Let us verify java
installation using the following command:
$ java -version

If Java is already installed on your system, you get to see the following
response:
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If java is not installed in your system, then follow the steps given below for
installing java.

Installing Java
Step I: The following command is used to install Java:
sudo apt-get install default-jdk

Step II: For setting up PATH and JAVA_HOME variables, add the
following commands to ~/.bashrc file.
export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin

Now apply all the changes into the current running system.
$ source ~/.bashrc

Step III: Now verify the installation using the command java -version from
the terminal as explained above.

Step 2: Verifying Hadoop Installation


Hadoop must be installed on your system before installing Hive. Let us verify
the Hadoop installation using the following command:
$ hadoop version

If Hadoop is already installed on your system, then you will get the following
response:
Hadoop 2.6.0
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4

If Hadoop is not installed on your system, then proceed with the following steps:


Downloading Hadoop
Download and extract Hadoop 2.6.0 from Apache Software Foundation using the
following commands.
$ cd /usr/local
$ wget http://apache.claz.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
$ tar xzf hadoop-2.6.0.tar.gz
$ mv hadoop-2.6.0/* hadoop/

Installing Hadoop in Pseudo Distributed Mode


The following steps are used to install Hadoop 2.6.0 in pseudo distributed mode.

Step I: Setting up Hadoop


You can set Hadoop environment variables by appending the following
commands to ~/.bashrc file.
export HADOOP_HOME=/home/dexlab/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Now apply all the changes into the current running system.
$ source ~/.bashrc

Step II: Hadoop Configuration


You can find all the Hadoop configuration files in the location
“/home/dexlab/hadoop/etc/hadoop”. You need to make suitable changes in those
configuration files according to your Hadoop infrastructure.
$ cd $HADOOP_HOME/etc/hadoop

In order to develop Hadoop programs using java, you have to reset the java
environment variables in hadoop-env.sh file by replacing JAVA_HOME
value with the location of java in your system.
export JAVA_HOME=/usr/local/jdk1.7.0_71

Given below are the list of files that you have to edit to configure Hadoop.


core-site.xml
The core-site.xml file contains information such as the port number used for
Hadoop instance, memory allocated for the file system, memory limit for storing
the data, and the size of Read/Write buffers.
Open the core-site.xml and add the following properties in between the
<configuration> and </configuration> tags.
<configuration>

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>

</configuration>

hdfs-site.xml
The hdfs-site.xml file contains information such as the value of replication
data, the namenode path, and the datanode path of your local file systems. It
means the place where you want to store the Hadoop infra.
Let us assume the following data.
dfs.replication (data replication value) = 1

(In the following paths, dexlab is the user name;
/home/dexlab/hadoop/hdfs/namenode and /home/dexlab/hadoop/hdfs/datanode
are the directories created for the HDFS file system.)

namenode path = /home/dexlab/hadoop/hdfs/namenode

datanode path = /home/dexlab/hadoop/hdfs/datanode

Open this file and add the following properties in between the <configuration>,
</configuration> tags in this file.
<configuration>

<property>
   <name>dfs.replication</name>
   <value>1</value>
</property>
<property>
   <name>dfs.name.dir</name>
   <value>/home/dexlab/hadoop/hdfs/namenode</value>
</property>
<property>
   <name>dfs.data.dir</name>
   <value>/home/dexlab/hadoop/hdfs/datanode</value>
</property>

</configuration>

Note: In the above file, all the property values are user-defined and you can
make changes according to your Hadoop user.

mapred-site.xml
This file is used to specify which MapReduce framework we are using. By
default, Hadoop ships a template file named mapred-site.xml.template. First of
all, you need to copy the file mapred-site.xml.template to mapred-site.xml using
the following command.
$ cp mapred-site.xml.template mapred-site.xml

Open mapred-site.xml file and add the following properties in between the
<configuration>, </configuration> tags in this file.
<configuration>

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

</configuration>

Verifying Hadoop Installation


The following steps are used to verify the Hadoop installation.
Step I: Name Node Setup
Set up the namenode using the command “hdfs namenode -format” as follows.
$ cd ~
$ hadoop namenode -format

Step II: Verifying Hadoop dfs


The following command is used to start dfs. Executing this command will start
your Hadoop file system.
$ start-dfs.sh

Step III: Verifying Yarn Script


The following command is used to start the yarn script. Executing this command
will start your yarn daemons.
$ start-yarn.sh
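As an optional sanity check (not part of the original steps), the jps command
lists the running Java daemons; after a successful start you should see something
like the following, with different process IDs on your machine:

$ jps
2305 NameNode
2451 DataNode
2613 SecondaryNameNode
2789 ResourceManager
2934 NodeManager
3021 Jps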


Step IV: Accessing Hadoop on Browser


The default port number to access Hadoop is 50070. Use the following url to get
Hadoop services on your browser.
http://localhost:50070/

Figure 14: Namemode WebUI Overview

Step V: Verify all applications for cluster


The default port number to access all applications of cluster is 8088. Use the
following url to visit this service.
http://localhost:8088/

Figure 15: Resource Manager

Step 3: Downloading Hive


We use hive-0.14.0 in this tutorial. You can download it by visiting the following
link http://apache.petsads.us/hive/hive-0.14.0/. Let us assume it gets downloaded
onto the /Downloads directory. Here, we download Hive archive named “apache-
hive-0.14.0-bin.tar.gz” for this tutorial. The following command is used to verify
the download:
$ cd Downloads
$ ls


On successful download, you get to see the following response:


apache-hive-0.14.0-bin.tar.gz

Step 4: Installing Hive


The following steps are required for installing Hive on your system. Let us
assume the Hive archive is downloaded onto the /Downloads directory.

Extracting and verifying Hive Archive


The following command is used to verify the download and extract the hive
archive:
$ tar zxvf apache-hive-0.14.0-bin.tar.gz
$ ls

On successful download, you get to see the following response:


apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz

Copying files to /home/dexlab/hive directory


We need to copy the files as the super user (“su -”). The following commands are
used to copy the files from the extracted directory to the /home/dexlab/hive
directory.
$ cd /home/dexlab/Downloads
$ mv apache-hive-0.14.0-bin /home/dexlab/hive

Setting up environment for Hive


You can set up the Hive environment by appending the following lines
to ~/.bashrc file:
export HIVE_HOME=/home/dexlab/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/home/dexlab/hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/home/dexlab/hive/lib/*:.

The following command is used to execute ~/.bashrc file.


$ source ~/.bashrc

Step 5: Configuring Hive


To configure Hive with Hadoop, you need to edit the hive-env.sh file, which is
placed in the $HIVE_HOME/conf directory. The following commands redirect
to Hive config folder and copy the template file:
$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh


Edit the hive-env.sh file by appending the following line:


export HADOOP_HOME=/home/dexlab/hadoop

Hive installation is completed successfully. Now you require an external


database server to configure Metastore. We use Apache Derby database.

Step 6: Configuring Metastore of Hive


Configuring Metastore means specifying to Hive where the database is stored.
You can do this by editing the hive-site.xml file, which is in the
$HIVE_HOME/conf directory. First of all, copy the template file using the
following command:
$ cd $HIVE_HOME/conf
$ cp hive-default.xml.template hive-site.xml

Edit hive-site.xml and append the following lines between the <configuration>
and </configuration> tags:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true </value>
<description>JDBC connect string for a JDBC metastore </description>
</property>

Step 7: Verifying Hive Installation


Before running Hive, you need to create the /tmp folder and a separate Hive
warehouse folder in HDFS. Here, we use the /user/hive/warehouse folder. You
also need to set write permission for these newly created folders, as shown below:
$ /home/dexlab/hadoop/bin/hadoop fs -mkdir /tmp
$ /home/dexlab/hadoop/bin/hadoop fs -mkdir -p /user/hive/warehouse
$ /home/dexlab/hadoop/bin/hadoop fs -chmod g+w /tmp
$ /home/dexlab/hadoop/bin/hadoop fs -chmod g+w /user/hive/warehouse

The following commands are used to verify Hive installation:


$ cd $HIVE_HOME
$ bin/hive

On successful installation of Hive, you get to see the following response:


Logging initialized using configuration in jar:file:/home/dexlab/hive-0.9.0/lib/hive-common-
0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201312121621_1494929084.txt
………………….
hive>


The following sample command is executed to display all the tables:


hive> show tables;
OK
Time taken: 2.798 seconds
hive>

10.6 Hive - Data Types


All data types in Hive are classified into four categories:
1. Column Types
2. Literals
3. Null Values
4. Complex Types

Column Types
The column data types of Hive are used as follows:

Integral Types (INT)
Integer-type data is specified using the integral data types; INT is the default. If
the data range exceeds the range of INT, BIGINT is used, and if the data range is
smaller than INT, SMALLINT is used; TINYINT is smaller than SMALLINT.
The following table depicts the various INT data types:

Type        Postfix   Example
TINYINT     Y         10Y
SMALLINT    S         10S
INT         -         10
BIGINT      L         10L

String Types
String-type data can be specified using single quotes (' ') or double quotes (" ").
Hive has two string data types, VARCHAR and CHAR.
The following table depicts them:

Data Type   Length
VARCHAR     1 to 65535
CHAR        255


Timestamp
 Timestamp supports the traditional UNIX timestamp (compatible with
java.sql.Timestamp) with optional nanosecond precision. Supported formats are
“YYYY-MM-DD HH:MM:SS.fffffffff” and “yyyy-mm-dd hh:mm:ss.ffffffffff”.
 It is available from Hive 0.8.0.

Dates
 It is available from Hive 0.12.0.
 DATE values describe a year/month/day in the form {{YYYY-MM-DD}}.

Decimals
 The DECIMAL type in Hive corresponds to the BigDecimal format of Java,
which is used for representing immutable arbitrary-precision values.
 It is supported from Hive 0.11.0 and was revised in Hive 0.13.0.
The syntax and an example are as follows:
DECIMAL(precision, scale)
decimal(10,0)

Union Types
A union is a collection of heterogeneous data types. You can create an instance
using create union. The syntax and an example are as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>

{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}

Literals
The following types of literals are used in Hive:
Floating Point Types
Floating point types are numbers with decimal points. Generally, this type of
data is composed of the DOUBLE data type.
Decimal Type
Decimal-type data is a floating point value with a higher range than the DOUBLE
data type. The range of the decimal type is approximately -10^-308 to 10^308.


Null Value
Missing values are represented by the special value NULL.

Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax:
ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
Syntax:
MAP<primitive_type, data_type>
Structs
Structs in Hive are similar to C structs: they group named fields, optionally with
a comment on each field.
Syntax:
STRUCT<col_name : data_type [COMMENT col_comment], ...>
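Putting the complex types together, a hypothetical table definition might look
like the sketch below (the table and column names are invented for illustration):

hive> CREATE TABLE employee_profile (
         name STRING,
         skills ARRAY<STRING>,
         phone MAP<STRING, STRING>,
         address STRUCT<street:STRING, city:STRING, pin:INT>
      )
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      COLLECTION ITEMS TERMINATED BY '|'
      MAP KEYS TERMINATED BY ':';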

10.7 Create Database


Create Database is a statement used to create a database in Hive. A database in
Hive is a namespace or a collection of tables. The syntax for this statement is
as follows:
hive> CREATE SCHEMA dexlabdb;

The following query is used to verify a databases list:


hive> SHOW DATABASES;

10.8 Comparison of Hive with Other Databases: Retrieving Information

For most basic retrieval operations, the HiveQL statement is identical to its
MySQL counterpart:

Retrieving information (general):   SELECT from_columns FROM table WHERE conditions;
Retrieving all values:              SELECT * FROM table;
Retrieving some values:             SELECT * FROM table WHERE rec_name = "value";
Retrieving with multiple criteria:  SELECT * FROM table WHERE rec1 = "value1" AND rec2 = "value2";
Retrieving specific columns:        SELECT column_name FROM table;
Retrieving unique output:           SELECT DISTINCT column_name FROM table;
Sorting:                            SELECT col1, col2 FROM table ORDER BY col2;
Sorting reverse:                    SELECT col1, col2 FROM table ORDER BY col2 DESC;
Counting rows:                      SELECT COUNT(*) FROM table;
Grouping with counting:             SELECT owner, COUNT(*) FROM table GROUP BY owner;
Maximum value:                      SELECT MAX(col_name) AS label FROM table;

Selecting from multiple tables (a join, written here with table aliases) differs slightly:
Hive:   SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name);
MySQL:  SELECT pet.name, comment FROM pet, event WHERE pet.name = event.name;

10.9 Metadata

Function                           Hive                                   MySQL
Selecting a database               USE database;                          USE database;
Listing databases                  SHOW DATABASES;                        SHOW DATABASES;
Listing tables in a database       SHOW TABLES;                           SHOW TABLES;
Describing the format of a table   DESCRIBE (FORMATTED|EXTENDED) table;   DESCRIBE table;
Creating a database                CREATE DATABASE db_name;               CREATE DATABASE db_name;
Dropping a database                DROP DATABASE db_name (CASCADE);       DROP DATABASE db_name;

10.10 Current SQL Compatibility

Figure 16: Hive & SQL Compatibility

Command Line

Hive                                                                          Function
hive -e 'select a.col from tab1 a'                                            Run query
hive -S -e 'select a.col from tab1 a'                                         Run query in silent mode
hive -e 'select a.col from tab1 a' -hiveconf hive.root.logger=DEBUG,console   Set Hive config variables
hive -i initialize.sql                                                        Use an initialization script
hive -f script.sql                                                            Run a non-interactive script


10.11 Hive DDL Commands


Create Database Statement
A database in Hive is a namespace or a collection of tables.
1. hive> CREATE SCHEMA dexlabdb;
2. hive> SHOW DATABASES;

Drop database
1. hive> DROP DATABASE IF EXISTS dexlabdb;

Creating Hive Tables


Create a table called Dexl_table with two columns, the first being an
integer and the other a string.
1. hive> CREATE TABLE Dexl_table(foo INT, bar STRING);
Create a table called HIVE_TABLE with two columns and a partition column
called ds. The partition column is a virtual column; it is not part of the data
itself but is derived from the partition into which a particular dataset is loaded.
By default, tables are assumed to be of text input format and the delimiters are
assumed to be ^A (ctrl-a).
1. hive> CREATE TABLE HIVE_TABLE (foo INT, bar STRING)
PARTITIONED BY (ds STRING);
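For illustration, data could then be loaded into one partition of this table as shown
below (the local file path and the partition value are made-up examples):

hive> LOAD DATA LOCAL INPATH '/home/dexlab/sample_data.txt'
      INTO TABLE HIVE_TABLE PARTITION (ds='2017-01-01');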

Browse the table


1. hive> Show tables;

Altering and Dropping Tables


1. hive> ALTER TABLE Dexl_table RENAME TO Kafka;
2. hive> ALTER TABLE Kafka ADD COLUMNS (col INT);
3. hive> ALTER TABLE HIVE_TABLE ADD COLUMNS (col1 INT
COMMENT 'a comment');
4. hive> ALTER TABLE HIVE_TABLE REPLACE COLUMNS (col2 INT,
weight STRING, baz INT COMMENT 'baz replaces new_col1');
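The subsection heading above also mentions dropping tables; a typical statement
(dropping the Kafka table renamed in step 1) would be:
5. hive> DROP TABLE IF EXISTS Kafka;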

10.12 Hive DML Commands


To understand the Hive DML commands, let's first look at the employee and
employee_department tables.


LOAD DATA
hive> LOAD DATA LOCAL INPATH './usr/Desktop/kv1.txt' OVERWRITE INTO
TABLE Employee;

SELECTS and FILTERS


hive> SELECT E.EMP_ID FROM Employee E WHERE E.Address='US';

GROUP BY
hive> SELECT E.Address, COUNT(*) FROM Employee E GROUP BY E.Address;

10.13 Joins
Hive converts joins over multiple tables into a single map/reduce job when every
table uses the same column in its join clause.

Working of Joins
 The join operation is compiled into a map/reduce task.
 A mapper traverses the join tables.
 It emits the join key and join value pairs into an intermediate file.
 SHUFFLE STAGE – Hadoop sorts and merges these pairs; this is called the
shuffle stage.
 The sorting and merging steps make the shuffle stage expensive.
 REDUCER STAGE – The actual join work is done by the reducer, which
takes the sorted result as its input.
The sections below cover the basics of joins in Hive.

Figure 17: Types of Joins

We will be working with two tables, customers and orders, that we imported
with Sqoop, and we are going to perform the following joins.


INNER JOIN – Selects records that have matching values in both tables.

LEFT JOIN (LEFT OUTER JOIN) – Returns all the values from the left table, plus
the matched values from the right table, or NULL in case of no matching join
predicate.

RIGHT JOIN (RIGHT OUTER JOIN) – Returns all the values from the right table,
plus the matched values from the left table, or NULL in case of no matching join
predicate.

FULL JOIN (FULL OUTER JOIN) – Selects all records that match either left or
right table records.

LEFT SEMI JOIN – Only returns the records from the left-hand table. Hive
doesn't support IN subqueries, so you can't do
SELECT * FROM TABLE_A WHERE TABLE_A.ID IN (SELECT ID FROM TABLE_B);
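The same result is obtained in Hive with a LEFT SEMI JOIN:

SELECT * FROM TABLE_A LEFT SEMI JOIN TABLE_B ON (TABLE_A.ID = TABLE_B.ID);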

Customer Table
Hive Tip: to print column headers in command line
hive> set hive.cli.print.header=true;
hive> select * from customers;
OK
customers.id customers.name
1 John
2 Kevin
19 Alex
3 Mark
4 Jenna
5 Robert
6 Zoya
7 Sam
8 George
9 Peter

Orders Table:
hive> select * from orders;
OK
order_id orders.order_date orders.customer_id orders.amount
101 2016-01-01 7 3540
102 2016-03-01 1 240
103 2016-03-02 6 2340
104 2016-02-12 3 5000
105 2016-02-12 3 5500


106 2016-02-14 9 3005
107 2016-02-14 1 20
108 2016-02-29 2 2000
109 2016-02-29 3 2500
110 2016-02-27 1 200

Inner Join
Select records that have matching values in both tables.
hive> select c.id, c.name, o.order_date, o.amount from customers c inner join orders o ON (c.id
= o.customer_id);
Output
c.id c.name o.order_date o.amount
7 Sam 2016-01-01 3540
1 John 2016-03-01 240
6 Zoya 2016-03-02 2340
3 Mark 2016-02-12 5000
3 Mark 2016-02-12 5500
9 Peter 2016-02-14 3005
1 John 2016-02-14 20
2 Kevin 2016-02-29 2000
3 Mark 2016-02-29 2500
1 John 2016-02-27 200

Left Join (Left Outer Join)


Returns all the values from the left table, plus the matched values from the right
table, or NULL in case of no matching join predicate
hive> select c.id, c.name, o.order_date, o.amount from customers c left outer join orders o ON
(c.id = o.customer_id);
Output
c.id c.name o.order_date o.amount
1 John 2016-03-01 240
1 John 2016-02-14 20
1 John 2016-02-27 200
2 Kevin 2016-02-29 2000
19 Alex NULL NULL
3 Mark 2016-02-12 5000
3 Mark 2016-02-12 5500
3 Mark 2016-02-29 2500
4 Jenna NULL NULL
5 Robert NULL NULL
6 Zoya 2016-03-02 2340
7 Sam 2016-01-01 3540
8 George NULL NULL
9 Peter 2016-02-14 3005
Time taken: 40.462 seconds, Fetched: 14 row(s)


Right Join (Right Outer Join)


Returns all the values from the right table, plus the matched values from the left
table, or NULL in case of no matching join predicate
hive> select c.id, c.name, o.order_date, o.amount from customers c right outer join orders o ON
(c.id = o.customer_id);
Output
c.id c.name o.order_date o.amount
7 Sam 2016-01-01 3540
1 John 2016-03-01 240
6 Zoya 2016-03-02 2340
3 Mark 2016-02-12 5000
3 Mark 2016-02-12 5500
9 Peter 2016-02-14 3005
1 John 2016-02-14 20
2 Kevin 2016-02-29 2000
3 Mark 2016-02-29 2500
1 John 2016-02-27 200

Full Join (Full Outer Join)


Selects all records that match either left or right table records.
hive> select c.id, c.name, o.order_date, o.amount from customers c full outer join orders o ON
(c.id = o.customer_id);

Output
c.id c.name o.order_date o.amount
1 John 2016-02-27 200
1 John 2016-02-14 20
1 John 2016-03-01 240
19 Alex NULL NULL
2 Kevin 2016-02-29 2000
3 Mark 2016-02-29 2500
3 Mark 2016-02-12 5500
3 Mark 2016-02-12 5000
4 Jenna NULL NULL
5 Robert NULL NULL
6 Zoya 2016-03-02 2340
7 Sam 2016-01-01 3540
8 George NULL NULL
9 Peter 2016-02-14 3005

Left Semi Join


Finds all the customers for whom at least one order exists, that is, all
customers who have placed an order.

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 10: HIVE | 80

hive> select * from customers left semi join orders ON
(customers.id = orders.customer_id);

OUTPUT
customers.id customers.name
1 John
2 Kevin
3 Mark
6 Zoya
7 Sam
9 Peter
Time taken: 56.362 seconds, Fetched: 6 row(s)

10.14 Hive Bucket


The splitting of a table into a set of partitions is called Hive partitioning. A Hive
partition can be further split into clusters, or buckets.
Hive bucketing is a technique of decomposing data into equal, easily
manageable parts. The CLUSTERED BY clause is used to create Hive buckets.
For example, suppose we have a table named student with columns such as
date, student_name, student_id, attendance, leaves etc. Using the date column as
the top-level partition and student_id as the second-level partition would lead to
too many small partitions.
So here the student table is partitioned by date and bucketed by student_id.
The value of this column is hashed by a user-defined number into buckets.
Records with the same student_id will always be stored in the same bucket.
Instead of creating a large number of partitions, we create a fixed number
of buckets, because declaring a very large number of partitions for a table at
creation time becomes complex.
A bucket corresponds to a file within the table's storage, whereas a partition
corresponds to a directory.

10.15 Advantages with Hive Bucket


● Facilitates efficient sampling and queries.
● No variation in data distribution, because there is a fixed number of
buckets.
● Optimised query techniques can make use of buckets.
● Column values are distributed evenly across the fixed number of buckets,
as shown in the example below.
CREATE TABLE order (
username STRING,
orderdate STRING,
amount DOUBLE,
tax DOUBLE
) PARTITIONED BY (company STRING)
CLUSTERED BY (username) INTO 25 BUCKETS;

Here we divided the table into 25 buckets. Set the maximum number of
reducers to the same number of buckets specified in the table metadata (i.e. 25):
set mapred.reduce.tasks = 25;
Use the following command to enforce bucketing:
set hive.enforce.bucketing = true;

In this example the bucket count is simply set to 25; choose the number of
buckets according to your data volume.

Figure 18: Create Hive Bucket table


Load Data Into Table

Figure 19: Load data into Hive Bucket table

Hive Buckets table data load


Check the screen below and you will see that three files named 000000_0,
000001_0 and 000002_0 have been created; these are our bucket data files.
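Bucketed tables are normally populated with an INSERT ... SELECT so that Hive
can hash each row into its bucket. The following is only a minimal sketch; it
assumes a bucketed table named orders_bucketed defined like the example above
and an unbucketed staging table named order_staging holding the raw rows:

-- orders_bucketed and order_staging are hypothetical tables used for illustration
set hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE orders_bucketed PARTITION (company = 'dexlab')
SELECT username, orderdate, amount, tax
FROM order_staging;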

10.16 Creating a View


You can create a view at the time of executing a SELECT statement. The syntax
is as follows:

CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...)]
[COMMENT table_comment]
AS SELECT ...

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 10: HIVE | 82

Example
Let us take an example for view. Assume employee table as given below, with
the fields Id, Name, Salary, Designation, and Dept. Generate a query to retrieve
the employee details who earn a salary of more than Rs 30000. We store the
result in a view named emp_30000.
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+------+--------------+-------------+-------------------+--------+

The following query retrieves the employee details using the above scenario:
hive> CREATE VIEW emp_30000 AS
SELECT * FROM employee
WHERE salary>30000;
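Once created, the view can be queried like an ordinary table; for example, the
following quick check lists the rows stored in the view:

hive> SELECT * FROM emp_30000;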

10.17 Dropping a View


Use the following syntax to drop a view:
DROP VIEW view_name

The following query drops a view named as emp_30000:


hive> DROP VIEW emp_30000;

10.18 Creating an Index


An Index is nothing but a pointer on a particular column of a table. Creating an
index means creating a pointer on a particular column of a table. Its syntax is
as follows:
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
[ ROW FORMAT ...] STORED AS ...
| STORED BY ...
]

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 10: HIVE | 83

[LOCATION hdfs_path]
[TBLPROPERTIES (...)]

Example:
Let us take an example for index. Use the same employee table that we have
used earlier with the fields Id, Name, Salary, Designation, and Dept. Create an
index named index_salary on the salary column of the employee table.
The following query creates an index:
hive> CREATE INDEX index_salary ON TABLE employee(salary)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';

It is a pointer to the salary column. If the column is modified, the changes are
stored using an index value.
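If an index is created WITH DEFERRED REBUILD, it must be built explicitly
before it can be used. A sketch of that step, using the index created above:

hive> ALTER INDEX index_salary ON employee REBUILD;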

10.19 Dropping an Index


The following syntax is used to drop an index:
DROP INDEX <index_name> ON <table_name>

The following query drops an index named index_salary:


hive> DROP INDEX index_salary ON employee;

Figure 20: Hive-Buckets-table-data-load-output

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 84

CHAPTER
Apache HBase
11
11.1 HBase Overview
HBase Architectural Components
Physically, HBase is composed of three types of servers in a master slave type of
architecture. Region servers serve data for reads and writes. When accessing data,
clients communicate with HBase RegionServers directly. Region assignment, DDL
(create, delete tables) operations are handled by the HBase Master process. Zookeeper,
which is part of the Hadoop ecosystem, maintains a live cluster state.
The Hadoop DataNode stores the data that the Region Server is managing. All
HBase data is stored in HDFS files. Region Servers are collocated with the HDFS
DataNodes, which enable data locality (putting the data close to where it is needed) for
the data served by the RegionServers. HBase data is local when it is written, but when a
region is moved, it is not local until compaction.
The NameNode maintains metadata information for all the physical data blocks
that comprise the files.

Figure 21: Hbase Architecture

11.2 Regions
HBase Tables are divided horizontally by row key range into “Regions.” A region
contains all rows in the table between the region’s start key and end key.
Regions are assigned to the nodes in the cluster, called “Region Servers,” and
these serve data for reads and writes. A region server can serve about 1,000
regions.

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 85

Figure 22: Hbase Region

11.3 HBase Master


Region assignment DDL (create, delete tables) operations are handled by the
HBase Master.
A master is responsible for:
● Coordinating the region servers
● Assigning regions on startup , re-assigning regions for recovery or
load balancing
● Monitoring all RegionServer instances in the cluster (listens for
notifications from Zookeeper)
● Admin functions
● Interface for creating, deleting, updating tables

Figure 23: Hbase Hmaster

11.4 Zookeeper: The Coordinator


HBase uses Zookeeper as a distributed coordination service to maintain server
state in the cluster. Zookeeper maintains which servers are alive and available,
and provides server failure notification. Zookeeper uses consensus to guarantee

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 86

common shared state. Note that there should be three or five machines for
consensus.

Figure 24: Zookeeper

11.5 How the Components Work Together

Zookeeper is used to coordinate shared state information for members of
distributed systems. Region servers and the active HMaster connect with a
session to Zookeeper. The Zookeeper maintains ephemeral nodes for active
sessions via heartbeats.

Figure 25: Hbase Component Working

Each Region Server creates an ephemeral node. The HMaster monitors
these nodes to discover available region servers, and it also monitors these nodes
for server failures. HMasters vie to create an ephemeral node. Zookeeper
determines the first one and uses it to make sure that only one master is active.
The active HMaster sends heartbeats to Zookeeper, and the inactive HMaster
listens for notifications of the active HMaster failure.
If a region server or the active HMaster fails to send a heartbeat, the
session is expired and the corresponding ephemeral node is deleted. Listeners for
updates will be notified of the deleted nodes. The active HMaster listens for

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 87

region servers, and will recover region servers on failure. The Inactive HMaster
listens for active HMaster failure, and if an active HMaster fails, the inactive
HMaster becomes active.

11.6 HBase First Read or Write


There is a special HBase Catalog table called the META table, which holds the
location of the regions in the cluster. Zookeeper stores the location of the META
table.
This is what happens the first time a client reads or writes to HBase:
1. The client gets the Region server that hosts the META table from
Zookeeper.
2. The client will query the .META. server to get the region server
corresponding to the row key it wants to access. The client caches
this information along with the META table location.
3. It will get the Row from the corresponding Region Server.
For future reads, the client uses the cache to retrieve the META location
and previously read row keys. Over time, it does not need to query the META
table, unless there is a miss because a region has moved; then it will re-query
and update the cache.

Figure 26: Hbase Read & Write Operation

11.7 HBase Meta Table


● This META table is an HBase table that keeps a list of all regions in
the system.
● The .META. Table is like a b tree.
● The .META. table structure is as follows:
● Key: region start key, region id
● Values: Region Server

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 88

Figure 27: Hbase Meta Table
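In recent HBase versions the catalog table is named hbase:meta (it was called
.META. in older releases), and it can be inspected from the HBase shell like any
other table. A quick, illustrative sketch:

# hbase:meta is the catalog table; LIMIT keeps the output short
hbase> scan 'hbase:meta', {LIMIT => 3}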

11.8 Region Server Components


A Region Server runs on an HDFS data node and has the following components:

WAL: Write Ahead Log is a file on the distributed file system. The WAL is used
to store new data that hasn't yet been persisted to permanent storage; it is used
for recovery in the case of failure.

BlockCache: is the read cache. It stores frequently read data in memory. Least
Recently Used data is evicted when full.

MemStore: is the write cache. It stores new data which has not yet been written
to disk. It is sorted before writing to disk. There is one MemStore per column
family per region.

HFiles store the rows as sorted KeyValues on disk.

Figure 28: Region Server Component


HBase Write Steps (1)
When the client issues a Put request, the first step is to write the data to the
write-ahead log, the WAL:

● Edits are appended to the end of the WAL file that is stored on disk.
● The WAL is used to recover not-yet-persisted data in case a server
crashes.

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 89

Figure 29: Hbase Write Step (1)

HBase Write Steps (2)


Once the data is written to the WAL, it is placed in the MemStore. Then, the put
request acknowledgement returns to the client.

Figure 30: Hbase Write Step (2)

11.9 HBase MemStore


The MemStore stores updates in memory as sorted KeyValues, the same as they
would be stored in an HFile. There is one MemStore per column family. The
updates are sorted per column family.

Figure 31: Hbase Memstore

11.10 HBase Region Flush


When the MemStore accumulates enough data, the entire sorted set is written to
a new HFile in HDFS. HBase uses multiple HFiles per column family, which
contain the actual cells, or KeyValue instances. These files are created over time
as KeyValue edits sorted in the MemStores are flushed as files to disk.

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 90

Note that this is one reason why there is a limit to the number of column
families in HBase. There is one MemStore per CF; when one is full, they all
flush. It also saves the last written sequence number so the system knows what
was persisted so far.
The highest sequence number is stored as a meta field in each HFile, to
reflect where persisting has ended and where to continue. On region startup, the
sequence number is read, and the highest is used as the sequence number for
new edits.

Figure 32: Hbase Region Flush
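Flushes normally happen automatically, but a flush can also be requested by
hand from the HBase shell. A small sketch, assuming the emp table that is
created later in this chapter:

# ask HBase to flush all MemStores of the 'emp' table to HFiles
hbase> flush 'emp'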

11.11 HBase HFile


Data is stored in an HFile which contains sorted key/values. When the
MemStore accumulates enough data, the entire sorted KeyValue set is written to
a new HFile in HDFS. This is a sequential write. It is very fast, as it avoids
moving the disk drive head.

Figure 33: Hbase Hfile

11.12 HBase HFile Structure


An HFile contains a multi-layered index which allows HBase to seek to the data
without having to read the whole file. The multi-level index is like a b+tree:
● Key value pairs are stored in increasing order
● Indexes point by row key to the key value data in 64KB “blocks”
● Each block has its own leaf-index

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 91

● The last key of each block is put in the intermediate index


● The root index points to the intermediate index
The trailer points to the meta blocks, and is written at the end of
persisting the data to the file. The trailer also has information like bloom filters
and time range info. Bloom filters help to skip files that do not contain a certain
row key. The time range info is useful for skipping the file if it is not in the time
range the read is looking for.

Figure 34: Hfile Structure

11.13 HFile Index


The index, which
we just discussed, is
loaded when the HFile
is opened and kept in
memory. This allows
lookups to be performed
with a single disk seek.

Figure 35: Hfile Index

11.14 HBase Read Merge


We have seen that the KeyValue cells corresponding to one row can be in
multiple places, row cells already persisted are in Hfiles, recently updated cells
are in the MemStore, and recently read cells are in the Block cache. So when you
read a row, how does the system get the corresponding cells to return? A Read
merges Key Values from the block cache, MemStore, and HFiles in the following
steps:
1. First, the scanner looks for the Row cells in the Block cache - the read
cache. Recently Read Key Values are cached here, and Least Recently
Used are evicted when memory is needed.
2. Next, the scanner looks in the MemStore, the write cache in memory
containing the most recent writes.

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 92

3. If the scanner does not find all of the row cells in the MemStore and Block
Cache, then HBase will use the Block Cache indexes and bloom filters to
load HFiles into memory, which may contain the target row cells.

Figure 36: Hbase Read Merge (1)

11.15 HBase Read Merge


As discussed earlier, there may be many HFiles per MemStore, which means for
a read, multiple files may have to be examined, which can affect the
performance. This is called read amplification.

Figure 37: Hbase Read Merge (2)

11.16 HBase Minor Compaction


HBase will automatically pick some smaller HFiles and rewrite them into fewer
bigger Hfiles. This process is called minor compaction. Minor compaction reduces
the number of storage files by rewriting smaller files into fewer but larger ones,
performing a merge sort.

Figure 38: Hbase Minor Compaction

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 93

11.17 HBase Major Compaction


Major compaction merges and rewrites all the HFiles in a region to one HFile per
column family, and in the process, drops deleted or expired cells. This improves
read performance; however, since major compaction rewrites all of the files, lots
of disk I/O and network traffic might occur during the process. This is called
write amplification.
Major compactions can be scheduled to run automatically. Due to write
amplification, major compactions are usually scheduled for weekends or
evenings. Note that MapR-DB has made improvements and does not need to do
compactions. A major compaction also makes any data files that were remote,
due to server failure or load balancing, local to the region server.

Figure 39: Hbase Major Compaction
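Both kinds of compaction can also be requested manually from the HBase shell;
the table name used below (emp) is only an example:

# request a minor compaction of the 'emp' table
hbase> compact 'emp'
# request a major compaction of the 'emp' table
hbase> major_compact 'emp'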

11.18 Region = Contiguous Keys


Let’s do a quick review of regions:
● A table can be divided horizontally into one or more regions. A
region contains a contiguous, sorted range of rows between a start
key and an end key
● Each region is 1GB in size (default)
● A region of a table is served to the client by a RegionServer
● A region server can serve about 1,000 regions (which may belong to
the same table or different tables)

Figure 40: Region’s Contiguous Key’s

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 94

11.19 Region Split


Initially there is one region per table. When a region grows too large, it splits
into two child regions. Both child regions, representing one-half of the original
region, are opened in parallel on the same Region server, and then the split is
reported to the HMaster. For load balancing reasons, the HMaster may schedule
for new regions to be moved off to other servers.

Figure 41: Region Split
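Splits normally happen automatically when a region grows too large, but a split
can also be requested explicitly from the HBase shell. A minimal sketch, where
the table name and the optional split key ('2') are only example values:

# ask HBase to split the regions of the 'emp' table
hbase> split 'emp'
# split the region containing row key '2' at that key
hbase> split 'emp', '2'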

11.20 Read Load Balancing


Splitting happens initially on the same region server, but for load balancing
reasons, the HMaster may schedule the new regions to be moved off to other
servers. This results in the new region server serving data from a remote HDFS
node until a major compaction moves the data files to the region server's local
node. HBase data is local when it is written, but when a region is moved (for load
balancing or recovery), it is not local until major compaction.

Figure 42: Load Balancing

HDFS Data Replication


All writes and reads go to/from the primary node. HDFS replicates the WAL and
HFile blocks; HFile block replication happens automatically.

Figure 43: HDFS Data Replication (1)

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 95

HBase relies on HDFS to provide data safety as it stores its files. When data is
written in HDFS, one copy is written locally, then it is replicated to a secondary
node, and a third copy is written to a tertiary node.

HDFS Data Replication (2)


The WAL file and the HFiles are persisted on disk and replicated, so how does
HBase recover the MemStore updates that were not persisted to HFiles? See the
next section for the answer.

Figure 44: HDFS Data Replication(2)

HBase Crash Recovery


When a RegionServer fails, Crashed Regions are unavailable until detection and
recovery steps have happened. Zookeeper will determine Node failure when it
loses region server heart beats. The HMaster will then be notified that the
Region Server has failed.
When the HMaster detects that a region server has crashed, it reassigns the
regions from the crashed server to active region servers. In order to recover the
crashed region server's MemStore edits that were not flushed to disk, the
HMaster splits the WAL belonging to the crashed region server into separate
files and stores these files on the new region servers' data nodes. Each region
server then replays its respective split WAL to rebuild the MemStore for that
region.

Figure 45: Hbase Crash Recovery

Data Recovery
WAL files contain a list of edits, with one edit representing a single put or delete.
Edits are written chronologically, so, for persistence, additions are appended to
the end of the WAL file that is stored on disk.
What happens if there is a failure when the data is still in memory and
not persisted to an HFile? The WAL is replayed. Replaying a WAL is done by
reading the WAL, adding and sorting the contained edits to the current
MemStore. At the end, the MemStore is flush to write changes to an HFile.

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 96

Figure 46: Hbase Data Recovery

Apache HBase Architecture Benefits


HBase provides the following benefits:
Strong consistency model
● When a write returns, all readers will see same value
Scales automatically
● Regions split when data grows too large
● Uses HDFS to spread and replicate data
Built-in recovery
● Using Write Ahead Log (similar to journaling on file system)
Integrated with Hadoop
● MapReduce on HBase is straightforward

HBase Installation
We can install HBase in any of the three modes: Standalone mode, Pseudo
Distributed mode, and Fully Distributed mode.
Installing HBase in Standalone Mode
Download the latest stable version of HBase from http://www.interior-
dsgn.com/apache/hbase/stable/ using “wget” command, and extract it using the
tar “zxvf” command. See the following command.
$cd usr/local/
$wget http://www.interior-dsgn.com/apache/hbase/stable/hbase-0.98.8-
hadoop2-bin.tar.gz
$tar -zxvf hbase-0.98.8-hadoop2-bin.tar.gz

Shift to super user mode and move the HBase folder to /home/dexlab as shown
below.
$su
$password: enter your password here
mv hbase-0.98.8-hadoop2/* /home/dexlab/Hbase/

Configuring HBase in Standalone Mode


Before proceeding with HBase, you have to edit the following files and configure

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 97

HBase.

hbase-env.sh
Set the java Home for HBase and open hbase-env.sh file from the conf folder.
Edit JAVA_HOME environment variable and change the existing path to your
current JAVA_HOME variable as shown below.
cd /home/dexlab/Hbase/conf
gedit hbase-env.sh

This will open the env.sh file of HBase. Now replace the existing JAVA_HOME
value with your current value as shown below.
export JAVA_HOME=/usr/

hbase-site.xml
This is the main configuration file of HBase. Set the data directory to an
appropriate location by opening the HBase home folder in /usr/local/HBase.
Inside the conf folder, you will find several files, open the hbase-site.xml file as
shown below.
$ cd /home/dexlab/HBase/
$ cd conf
$ gedit hbase-site.xml

Inside the hbase-site.xml file, you will find the <configuration> and
</configuration> tags. Within them, set the HBase directory under the property
key with the name “hbase.rootdir” as shown below.
<configuration>
//Here you have to set the path where you want HBase to store its files.
<property>
<name>hbase.rootdir</name>
<value>/home/dexlab/HBase/HFiles</value>
</property>

//Here you have to set the path where you want HBase to store its built in Zookeeper files.
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/dexlab/Zookeeper</value>
</property>
</configuration>

With this, the HBase installation and configuration part is successfully complete.
We can start HBase by using the start-hbase.sh script provided in the bin folder
of HBase. For that, open the HBase home folder and run the HBase start script
as shown below.
$cd /home/dexlab/HBase/bin

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 98

$./start-hbase.sh

If everything goes well, when you try to run HBase start script, it will prompt
you a message saying that HBase has started.
starting master, logging to /home/dexlab/HBase/bin/../logs/hbase-tpmaster-
localhost.localdomain.out

Installing HBase in Pseudo-Distributed Mode


Let us now check how HBase is installed in pseudo-distributed mode.

Configuring HBase
Before proceeding with HBase, configure Hadoop and HDFS on your local
system or on a remote system and make sure they are running. Stop HBase if it
is running.

hbase-site.xml
Edit hbase-site.xml file to add the following properties.
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>

This property specifies the mode in which HBase should run. In the same file,
change hbase.rootdir from the local file system path to your HDFS instance
address, using the hdfs:// URI syntax. Here we are running HDFS on the
localhost at port 8030.
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:8030/hbase</value>
</property>

Starting HBase
After configuration is over, browse to HBase home folder and start HBase using
the following command.
$cd /usr/local/HBase
$bin/start-hbase.sh

Note: Before starting HBase, make sure Hadoop is running.

Checking the HBase Directory in HDFS


HBase creates its directory in HDFS. To see the created directory, browse to
Hadoop bin and type the following command.
$ ./bin/hadoop fs -ls /hbase

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 99

If everything goes well, it will give you the following output.


Found 7 items
drwxr-xr-x - hbase users 0 2014-06-25 18:58 /hbase/.tmp
drwxr-xr-x - hbase users 0 2014-06-25 21:49 /hbase/WALs
drwxr-xr-x - hbase users 0 2014-06-25 18:48 /hbase/corrupt
drwxr-xr-x - hbase users 0 2014-06-25 18:58 /hbase/data
-rw-r--r-- 3 hbase users 42 2014-06-25 18:41 /hbase/hbase.id
-rw-r--r-- 3 hbase users 7 2014-06-25 18:41 /hbase/hbase.version
drwxr-xr-x - hbase users 0 2014-06-25 21:49 /hbase/oldWALs

Starting and Stopping RegionServers


You can run multiple region servers from a single system using the following
command.
$ ./bin/local-regionservers.sh start 2 3

To stop a region server, use the following command.


$ ./bin/local-regionservers.sh stop 3

11.21 Starting HBaseShell


After installing HBase successfully, you can start the HBase shell. Given below is
the sequence of steps to be followed to start the HBase shell. Open the terminal
and log in as the super user.

Start Hadoop File System


Browse through Hadoop home sbin folder and start Hadoop file system as shown
below.
$cd $HADOOP_HOME/sbin
$start-all.sh

Start HBase
Browse through the HBase root directory bin folder and start HBase.
$cd /usr/local/HBase
$./bin/start-hbase.sh

Start HBase Master Server


This will be the same directory. Start it as shown below.
$./bin/local-master-backup.sh start 2 (number signifies specific server.)

Start Region
Start the region server as shown below.

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 100

$./bin/local-regionservers.sh start 3

Start HBase Shell


You can start HBase shell using the following command.
$cd bin
$./hbase shell

This will give you the HBase Shell Prompt as shown below.
2014-12-09 14:24:27,526 INFO [main] Configuration.deprecation:
hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.98.8-hadoop2, r6cfc8d064754251365e070a10a82eb169956d5fe, Fri
Nov 14 18:26:29 PST 2014

hbase(main):001:0>

HBase Web Interface


To access the web interface of HBase, type the following url in the browser.
http://localhost:60010

This interface lists your currently running Region servers, backup masters and
HBase tables.

HBase Region servers and Backup Masters

Figure 47: Hbase WebUI

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 101

HBase Tables

Figure 48: Hbase Tables WebUI

Set the classpath for the HBase libraries (the lib folder in HBase) as shown below.
export CLASSPATH=$CLASSPATH:/home/dexlab/hbase/lib/*

This is to prevent the “class not found” exception while accessing the
HBase using java API.

HBase Shell
HBase contains a shell using which you can communicate with HBase. HBase
uses the Hadoop File System to store its data. It will have a master server and
region servers. The data storage will be in the form of regions (tables). These
regions will be split up and stored in region servers.
The master server manages these region servers and all these tasks take
place on HDFS. Given below are some of the commands supported by HBase
Shell.

General Commands
status Provides the status of HBase, for example, the number of
servers.
version Provides the version of HBase being used.
table_help Provides help for table-reference commands.
whoami Provides information about the user.

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 102

Data Definition Language


These are the commands that operate on the tables in HBase.
create Creates a table.
list Lists all the tables in HBase.
disable Disables a table.
is_disabled Verifies whether a table is disabled.
enable Enables a table.
is_enabled Verifies whether a table is enabled.
describe Provides the description of a table.
alter Alters a table.
exists Verifies whether a table exists.
drop Drops a table from HBase.
drop_all Drops the tables matching the ‘regex’ given in the command.

Data Manipulation Language


put Puts a cell value at a specified column in a specified row in a
particular table.
get Fetches the contents of row or a cell.
delete Deletes a cell value in a table.
deleteall Deletes all the cells in a given row.
scan Scans and returns the table data.
count Counts and returns the number of rows in a table.
truncate Disables, drops, and recreates a specified table.

Starting HBase Shell


To access the HBase shell, you have to navigate to the HBase home folder.
cd /usr/local/
cd Hbase

You can start the HBase interactive shell using “hbase shell” command as shown
below.
./bin/hbase shell

If you have successfully installed HBase in your system, then it gives you the
HBase shell prompt as shown below.
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.23, rf42302b28aceaab773b15f234aa8718fff7eea3c, Wed Aug 27
00:54:09 UTC 2014

hbase(main):001:0>

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 103

To exit the interactive shell command at any moment, type exit or use <ctrl+c>.
Check the shell functioning before proceeding further. Use the list command for
this purpose. List is a command used to get the list of all the tables in HBase.
First of all, verify the installation and the configuration of HBase in your system
using this command as shown below.
hbase(main):001:0> list

When you type this command, it gives you the following output.
hbase(main):001:0> list
TABLE

11.22 HBase Basic


Creating a Table using HBase Shell
You can create a table using the create command, here you must specify the
table name and the Column Family name. The syntax to create a table in HBase
shell is shown below.
create ‘<table name>’,’<column family>’

Example: Given below is a sample schema of a table named emp. It has two
column families: “dexlab data” and “emp data”.
Row key dexlab data emp data

You can create this table in HBase shell as shown below.


hbase(main):002:0> create 'emp', 'dexlab data', 'emp data'

And it will give you the following output.


0 row(s) in 1.1300 seconds
=> Hbase::Table – emp

Verification
You can verify whether the table is created using the list command as shown
below. Here you can observe the created emp table.
hbase(main):002:0> list
TABLE
emp
2 row(s) in 0.0340 seconds

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 104

Listing a Table using HBase Shell


list is the command that is used to list all the tables in HBase. Given below is
the syntax of the list command.
hbase(main):001:0 > list

When you type this command and execute in HBase prompt, it will display the
list of all the tables in HBase as shown below.
hbase(main):001:0> list
TABLE
emp

Here you can observe a table named emp.

Disabling a Table using HBase Shell


To delete a table or change its settings, you need to first disable the table using
the disable command. You can re-enable it using the enable command.
Given below is the syntax to disable a table:
disable ‘emp’

Example
Given below is an example that shows how to disable a table.
hbase(main):025:0> disable 'emp'
0 row(s) in 1.2760 seconds

Verification
After disabling the table, you can still see its existence through the list and exists
commands, but you cannot scan it. If you try, it will give you the following error.
hbase(main):028:0> scan 'emp'
ROW COLUMN + CELL
ERROR: emp is disabled.

is_disabled
This command is used to find whether a table is disabled. Its syntax is as
follows.
hbase> is_disabled 'table name'

The following example verifies whether the table named emp is disabled. If it is
disabled, it will return true and if not, it will return false.
hbase(main):031:0> is_disabled 'emp'
true
0 row(s) in 0.0440 seconds

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 105

disable_all
This command is used to disable all the tables matching the given regex. The
syntax for disable_all command is given below.
hbase> disable_all 'r.*'

Suppose there are 5 tables in HBase, namely raja, rajani, rajendra, rajesh, and
raju. The following code will disable all the tables starting with raj.
hbase(main):002:0> disable_all 'raj.*'
raja
rajani
rajendra
rajesh
raju
Disable the above 5 tables (y/n)?
y
5 tables successfully disabled

Enabling a Table using HBase Shell


Syntax to enable a table:
enable ‘emp’

Example
Given below is an example to enable a table.
hbase(main):005:0> enable 'emp'
0 row(s) in 0.4580 seconds

Verification
After enabling the table, scan it. If you can see the schema, your table is
successfully enabled.
hbase(main):006:0> scan 'emp'

ROW COLUMN + CELL

1 column = dexlab data:city, timestamp = 1417516501, value = hyderabad

1 column = dexlab data:name, timestamp = 1417525058, value = ramu

1 column = emp data:designation, timestamp = 1417532601, value = manager

1 column = emp data:salary, timestamp = 1417524244109, value = 50000

2 column = dexlab data:city, timestamp = 1417524574905, value = chennai

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 106

2 column = dexlab data:name, timestamp = 1417524556125, value = ravi

2 column = emp data:designation, timestamp = 14175292204, value = sr:engg

2 column = emp data:salary, timestamp = 1417524604221, value = 30000

3 column = dexlab data:city, timestamp = 1417524681780, value = delhi

3 column = dexlab data:name, timestamp = 1417524672067, value = rajesh

3 column = emp data:designation, timestamp = 14175246987, value = jr:engg

3 column = emp data:salary, timestamp = 1417524702514, value = 25000

3 row(s) in 0.0400 seconds

is_enabled
This command is used to find whether a table is enabled. Its syntax is as follows:
hbase> is_enabled 'table name'

The following code verifies whether the table named emp is enabled. If it is
enabled, it will return true and if not, it will return false.
hbase(main):031:0> is_enabled 'emp'
true
0 row(s) in 0.0440 seconds

describe
This command returns the description of the table. Its syntax is as follows:
hbase> describe 'table name'

Given below is the output of the describe command on the emp table.
hbase(main):006:0> describe 'emp'
DESCRIPTION
ENABLED

alter
Alter is the command used to make changes to an existing table. Using this
command, you can change the maximum number of cells of a column family, set
and delete table scope operators, and delete a column family from a table.

Changing the Maximum Number of Cells of a Column Family


Given below is the syntax to change the maximum number of cells of a column
family.

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 107

hbase> alter 't1', NAME => 'f1', VERSIONS => 5

In the following example, the maximum number of cells is set to 5.


hbase(main):003:0> alter 'emp', NAME => 'dexlab data', VERSIONS => 5
Updating all regions with the new schema...
0/1 regions updated.
1/1 regions updated.
Done.
0 row(s) in 2.3050 seconds

Table Scope Operators


Using alter, you can set and remove table scope operators such as
MAX_FILESIZE, READONLY, MEMSTORE_FLUSHSIZE, DEFERRED_LOG_FLUSH, etc.

Setting Read Only


Below given is the syntax to make a table read only.
hbase>alter 't1', READONLY(option)

In the following example, we have made the emp table read only.
hbase(main):006:0> alter 'emp', READONLY
Updating all regions with the new schema...
0/1 regions updated.
1/1 regions updated.
Done.
0 row(s) in 2.2140 seconds

Removing Table Scope Operators


We can also remove the table scope operators. Given below is the syntax to
remove ‘MAX_FILESIZE’ from emp table.
hbase> alter 't1', METHOD => 'table_att_unset', NAME => 'MAX_FILESIZE'

Deleting a Column Family


Using alter, you can also delete a column family. Given below is the syntax to
delete a column family using alter.
hbase> alter ‘ table name ’, ‘delete’ => ‘ column family ’

Given below is an example to delete a column family from the ‘emp’ table.
Assume there is a table named employee in HBase. It contains the
following data:
hbase(main):006:0> scan 'employee'

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 108

ROW COLUMN+CELL

row1 column = dexlab:city, timestamp = 1418193767, value = hyderabad

row1 column = dexlab:name, timestamp = 1418193806767, value = raju

row1 column = emp:designation, timestamp = 1418193767, value = manager

row1 column = emp:salary, timestamp = 1418193806767, value = 50000

1 row(s) in 0.0160 seconds

Now let us delete the column family named emp using the alter command.
hbase(main):007:0> alter 'employee', 'delete' => 'emp'
Updating all regions with the new schema...
0/1 regions updated.
1/1 regions updated.
Done.
0 row(s) in 2.2380 seconds

Now verify the data in the table after alteration. Observe the column family
‘emp’ is no more, since we have deleted it.
hbase(main):003:0> scan 'employee'
ROW COLUMN + CELL
row1 column = dexlab:city, timestamp = 14181936767, value = hyderabad

row1 column = dexlab:name, timestamp = 1418193806767, value = raju

1 row(s) in 0.0830 seconds

Existence of Table using HBase Shell


You can verify the existence of a table using the exists command. The following
example shows how to use this command.
hbase(main):024:0> exists 'emp'
Table emp does exist

0 row(s) in 0.0750 seconds

==============================================================

hbase(main):015:0> exists 'student'


Table student does not exist

0 row(s) in 0.0480 seconds

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 109

Dropping a Table using HBase Shell


Using the drop command, you can delete a table. Before dropping a table, you
have to disable it.
hbase(main):018:0> disable 'emp'
0 row(s) in 1.4580 seconds

hbase(main):019:0> drop 'emp'


0 row(s) in 0.3060 seconds

Verify whether the table is deleted using the exists command.


hbase(main):020:0> exists 'emp'
Table emp does not exist
0 row(s) in 0.0730 seconds

drop_all
This command is used to drop the tables matching the “regex” given in the
command. Its syntax is as follows:
hbase> drop_all ‘t.*’

Note: Before dropping a table, you must disable it.

Example
Assume there are tables named raja, rajani, rajendra, rajesh, and raju.
hbase(main):017:0> list
TABLE
raja
rajani
rajendra
rajesh
raju
9 row(s) in 0.0270 seconds

All these tables start with the letters raj. First of all, let us disable all these
tables using the disable_all command as shown below.
hbase(main):002:0> disable_all 'raj.*'
raja
rajani
rajendra
rajesh
raju
Disable the above 5 tables (y/n)?
y
5 tables successfully disabled

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 110

Now you can delete all of them using the drop_all command as given below.
hbase(main):018:0> drop_all 'raj.*'
raja
rajani
rajendra
rajesh
raju
Drop the above 5 tables (y/n)?
y
5 tables successfully dropped

exit
You exit the shell by typing the exit command.
hbase(main):021:0> exit

11.23 Stopping HBase


To stop HBase, browse to the HBase home folder and type the following
command.
./bin/stop-hbase.sh

11.24 Inserting Data using HBase Shell


This chapter demonstrates how to create data in an HBase table. To create data
in an HBase table, the following commands and methods are used:
● put command,
● add() method of Put class, and
● put() method of HTable class.
As an example, we are going to insert data into the emp table created earlier.
Using put command, you can insert rows into a table. Its syntax is as follows:
put ’<table name>’,’row1’,’<colfamily:colname>’,’<value>’

Inserting the First Row


Let us insert the first row values into the emp table as shown below.
hbase(main):005:0> put 'emp','1','dexlab data:name','raju'
0 row(s) in 0.6600 seconds
hbase(main):006:0> put 'emp','1','dexlab data:city','hyderabad'
0 row(s) in 0.0410 seconds
hbase(main):007:0> put 'emp','1','emp data:designation','manager'
0 row(s) in 0.0240 seconds
hbase(main):007:0> put 'emp','1','emp data:salary','50000'
0 row(s) in 0.0240 seconds

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 111

Insert the remaining rows using the put command in the same way. If you insert
the whole table, you will get the following output.
hbase(main):022:0> scan 'emp'

ROW COLUMN+CELL
1 column=dexlab data:city, timestamp=1417524216501, value=hyderabad

1 column=dexlab data:name, timestamp=1417524185058, value=ramu

1 column=emp data:designation, timestamp=1417524232601,

value=manager

1 column=emp data:salary, timestamp=1417524244109, value=50000

2 column=dexlab data:city, timestamp=1417524574905, value=chennai

2 column=dexlab data:name, timestamp=1417524556125, value=ravi

2 column=emp data:designation, timestamp=1417524592204,

value=sr:engg

2 column=emp data:salary, timestamp=1417524604221, value=30000

3 column=dexlab data:city, timestamp=1417524681780, value=delhi

3 column=dexlab data:name, timestamp=1417524672067, value=rajesh

3 column=emp data:designation, timestamp=1417524693187,

value=jr:engg
3 column=emp data:salary, timestamp=1417524702514,

value=25000

11.25 Updating Data using HBase Shell


You can update an existing cell value using the put command. To do so, just
follow the same syntax and mention your new value as shown below.
put ‘table name’,’row ’,'Column family:column name',’new value’

The newly given value replaces the existing value, updating the row.

Example
Suppose there is a table in HBase called emp with the following data.

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 112

hbase(main):003:0> scan 'emp'


ROW COLUMN + CELL
row1 column = dexlab:name, timestamp = 1418051555, value = raju
row1 column = dexlab:city, timestamp = 1418275907, value = Hyderabad
row1 column = emp:designation, timestamp = 14180555,value = manager
row1 column = emp:salary, timestamp = 1418035791555,value = 50000
1 row(s) in 0.0100 seconds

The following command will update the city value of the employee named ‘Raju’
to Delhi.
hbase(main):002:0> put 'emp','row1','dexlab:city','Delhi'
0 row(s) in 0.0400 seconds

The updated table looks as follows where you can observe the city of Raju has
been changed to ‘Delhi’.
hbase(main):003:0> scan 'emp'
ROW COLUMN + CELL
row1 column = dexlab:name, timestamp = 1418035791555, value = raju
row1 column = dexlab:city, timestamp = 1418274645907, value = Delhi
row1 column = emp:designation, timestamp = 141857555,value = manager
row1 column = emp:salary, timestamp = 1418039555, value = 50000
1 row(s) in 0.0100 seconds

11.26 Reading Data using HBase Shell


The get command and the get() method of HTable class are used to read data
from a table in HBase. Using get command, you can get a single row of data at a
time. Its syntax is as follows:
get ’<table name>’,’row1’

Example
The following example shows how to use the get command. Let us scan the first
row of the emp table.
hbase(main):012:0> get 'emp', '1'

COLUMN CELL

dexlab : city timestamp = 1417521848375, value = hyderabad

dexlab : name timestamp = 1417521785385, value = ramu

emp: designation timestamp = 1417521885277, value = manager

emp: salary timestamp = 1417521903862, value = 50000

4 row(s) in 0.0270 seconds

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 113

Reading a Specific Column


Given below is the syntax to read a specific column using the get method.
hbase> get 'table name', ‘rowid’, {COLUMN => ‘column family:column name’}

Example
Given below is the example to read a specific column in HBase table.
hbase(main):015:0> get 'emp', 'row1', {COLUMN => 'dexlab:name'}
COLUMN CELL
dexlab:name timestamp = 1418035791555, value = raju
1 row(s) in 0.0080 seconds

Deleting a Specific Cell in a Table


Using the delete command, you can delete a specific cell in a table. The syntax of
delete command is as follows:
delete ‘<table name>’, ‘<row>’, ‘<column name >’, ‘<time stamp>’

Example
Here is an example to delete a specific cell. Here we are deleting the city cell of
the first row.
hbase(main):006:0> delete 'emp', '1', 'dexlab data:city',
1417521848375
0 row(s) in 0.0060 seconds

11.27 Deleting All Cells in a Table


Using the “deleteall” command, you can delete all the cells in a row. Given below
is the syntax of deleteall command.
deleteall ‘<table name>’, ‘<row>’,

Example
Here is an example of “deleteall” command, where we are deleting all the cells of
row1 of emp table.
hbase(main):007:0> deleteall 'emp','1'
0 row(s) in 0.0240 seconds

Verify the table using the scan command. A snapshot of the table after deleting
row1 is given below.
hbase(main):022:0> scan 'emp'

ROW COLUMN + CELL

2 column = dexlab data:city, timestamp = 1417524574905, value = chennai

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 114

2 column = dexlab data:name, timestamp = 1417524556125, value = ravi

2 column = emp data:designation, timestamp = 1417524204, value = sr:engg

2 column = emp data:salary, timestamp = 1417524604221, value = 30000

3 column = dexlab data:city, timestamp = 1417524681780, value = delhi

3 column = dexlab data:name, timestamp = 1417524672067, value = rajesh

3 column = emp data:designation, timestamp = 1417523187, value = jr:engg

3 column = emp data:salary, timestamp = 1417524702514, value = 25000

11.28 Scanning using HBase Shell


The scan command is used to view the data in HTable. Using the scan command,
you can get the table data. Its syntax is as follows:
scan ‘<table name>’

Example
The following example shows how to read data from a table using the scan
command. Here we are reading the emp table.
hbase(main):010:0> scan 'emp'

ROW COLUMN + CELL

1 column = dexlab data:city, timestamp = 1417521848375, value = hyderabad

1 column = dexlab data:name, timestamp = 1417521785385, value = ramu

1 column = emp data:designation, timestamp = 1417585277,value = manager

1 column = emp data:salary, timestamp = 1417521903862, value = 50000

1 row(s) in 0.0370 seconds
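The scan command also accepts optional arguments. For example, the following
sketch restricts the scan to a single column and limits the number of rows
returned; the column and the limit are only illustrative values for the emp table
shown above:

# scan only the name column, returning at most 2 rows
hbase> scan 'emp', {COLUMNS => ['dexlab data:name'], LIMIT => 2}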

count
You can count the number of rows of a table using the count command. Its
syntax is as follows:
count ‘<table name>’
After deleting the first row, emp table will have two rows. Verify it as shown
below.

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 115

hbase(main):023:0> count 'emp'


2 row(s) in 0.090 seconds
⇒2

truncate
This command disables, drops, and recreates a table. The syntax of truncate is as
follows:
hbase> truncate 'table name'

Example
Given below is the example of truncate command. Here we have truncated the
emp table.
hbase(main):011:0> truncate 'emp'
Truncating 'emp' table (it may take a while):
- Disabling table...
- Truncating table...
0 row(s) in 1.5950 seconds

After truncating the table, use the scan command to verify. You will get a table
with zero rows.
hbase(main):017:0> scan ‘emp’
ROW COLUMN + CELL
0 row(s) in 0.3110 seconds

11.29 HBase Security


grant
The grant command grants specific rights such as read, write, execute,
and admin on a table to a certain user. The syntax of grant command is as
follows:
hbase> grant <user> <permissions> [<table> [<column family> [<column qualifier>]]]

We can grant zero or more privileges to a user from the set of RWXCA, where
R - represents read privilege.
W - represents write privilege.
X - represents execute privilege.
C - represents create privilege.
A - represents admin privilege.

Given below is an example that grants all privileges to a user named
‘Dexlabanalytics’.

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 11: APACHE HBASE | 116

hbase(main):018:0> grant 'Dexlabanalytics', 'RWXCA'
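Permissions can also be limited to a single table. As a sketch, the following
grants only read and write access on the emp table to the same (hypothetical)
user:

# 'RW' = read and write only, restricted to the 'emp' table
hbase> grant 'Dexlabanalytics', 'RW', 'emp'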

revoke
The revoke command is used to revoke a user's access rights of a table. Its
syntax is as follows:
hbase> revoke <user>

The following code revokes all the permissions from the user named
‘Dexlabanalytics’.
hbase(main):006:0> revoke 'Dexlabanalytics'

user_permission
This command is used to list all the permissions for a particular table.
The syntax of user_permission is as follows:
hbase>user_permission ‘tablename’

The following code lists all the user permissions of ‘emp’ table.
hbase(main):013:0> user_permission 'emp'

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 12: SQOOP | 117

CHAPTER
Sqoop
12
12.1 Introduction
Sqoop is a tool designed to transfer data between Hadoop and relational
database servers. It is used to import data from relational databases such as
MySQL, Oracle to Hadoop HDFS, and export from Hadoop file system to
relational databases.

12.2 What is Sqoop?


Apache Sqoop is a tool designed for easily transferring bulk data between the
Hadoop ecosystem and structured data stores such as relational databases
(MySQL, Oracle). Sqoop is used to import data from data stores (databases) into
the Hadoop Distributed File System (HDFS) or related Hadoop eco-systems like
Hive and HBase. Similarly, Sqoop can also be used to extract data from Hadoop
or its eco-systems and export it to external data stores such as an RDBMS.
Sqoop works with relational databases such as Oracle, MySQL, Postgres etc.

12.3 Why is Sqoop used?


For Hadoop developers, the interesting work starts after data is loaded into
HDFS. For this, the data residing in the RDBMS needs to be transferred to
HDFS, worked on, and possibly transferred back to the relational database
management system (in other words, we need to import or export data). Moving
big data between HDFS and an RDBMS in either direction is not easy, so
developers used to write custom scripts to transfer data in and out of Hadoop.
That is not an ideal approach, which is why Apache Sqoop provides an
alternative.
Sqoop automates most of the process and depends on the database to
describe the schema of the data to be imported. Sqoop always uses MapReduce
to import and export the data, which provides a parallel mechanism as well as
fault tolerance. Sqoop makes the developer's task easy by providing a command
line interface, so we just need basic information like the source, the destination
and the database authentication details in the sqoop command. Sqoop takes care
of the remaining part.

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 12: SQOOP | 118

12.4 Where is Sqoop used?


Relational database systems are widely used to interact with traditional
business applications, so relational database systems have become one of the
sources that generate Big Data.
As we are dealing with Big Data, Hadoop stores and processes it using
different processing frameworks like MapReduce, Hive, HBase, Cassandra, Pig
etc. and storage frameworks like HDFS to achieve the benefits of distributed
computing and distributed storage. In order to store and analyze Big Data from
relational databases, data needs to be transferred between the database systems
and the Hadoop Distributed File System (HDFS). Here, Sqoop comes into the
picture. Sqoop acts as an intermediate layer between Hadoop and relational
database systems. You can import and export data between relational database
systems and Hadoop and its eco-systems directly using Sqoop.

12.5 Sqoop Architecture


Sqoop provides command line interface to the end users. Sqoop can also be
accessed using Java APIs.
The Sqoop command submitted by the end user is parsed by Sqoop, which
launches a Hadoop map-only job to import or export the data, because the
Reduce phase is required only when aggregations are needed. Sqoop just imports
and exports the data; it does not do any aggregations.
Sqoop parses the arguments provided in the command line and prepares
the Map job. The Map job launches multiple mappers depending on the number
defined by the user in the command line. For a Sqoop import, each mapper task
is assigned a part of the data to be imported, based on the key defined in the
command line. Sqoop distributes the input data among the mappers equally to get high
performance. Then each mapper creates a connection with the database using
JDBC, fetches the part of data assigned by Sqoop, and writes it into HDFS, Hive
or HBase based on the option provided in the command line.

Figure 49: Sqoop Architecture

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 12: SQOOP | 119

12.6 Sqoop-Import
Sqoop import command imports a table from an RDBMS to HDFS. Each record
from a table is considered as a separate record in HDFS. Records can be stored
as text files, or in binary representation as Avro or SequenceFiles.

Generic Syntax:
$ sqoop import (generic args) (import args)
$ sqoop-import (generic args) (import args)
The Hadoop specific generic arguments must precede any import
arguments, and the import arguments can be of any order.

Importing a Table into HDFS


Syntax:
$ sqoop import --connect --table --username --password --target-dir
--connect Takes JDBC url and connects to database
--table Source table name to be imported
--username Username to connect to database
--password Password of the connecting user
--target-dir Imports data to the specified directory
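A concrete import might look like the following sketch. The JDBC URL, database
name, credentials and target directory are placeholder values that you would
replace with your own; the optional -m flag controls how many parallel mappers
Sqoop launches:

# dexlabdb, the credentials and the target directory are hypothetical values
$ sqoop import \
--connect jdbc:mysql://localhost/dexlabdb \
--username dexlab \
--password dexlab123 \
--table customers \
--target-dir /user/dexlab/customers \
-m 4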

12.7 Sqoop - Installation


Step 1: JAVA Installation
You need to have Java installed on your system before installing Sqoop. Let us
verify Java installation using the following command:
$ java -version

If Java is already installed on your system, you get to see the following response:
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If Java is not installed on your system, then follow the steps given below.

Installing Java
Follow the simple steps given below to install Java on your system.

Step 1:
sudo apt-get install default-jdk

Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved


Chapter 12: SQOOP | 120

Step 2: Set java path in vim ~/.bashrc


export JAVA_HOME=/usr/local/java
export PATH=$PATH:$JAVA_HOME/bin

Now apply all the changes into the current running system.
$ source ~/.bashrc

Step 3: Verifying Hadoop Installation


Hadoop must be installed on your system before installing Sqoop. Let us verify
the Hadoop installation using the following command:
$ hadoop version

If Hadoop is already installed on your system, then you will get the following
response:
Hadoop 2.6.0

If Hadoop is not installed on your system, then proceed with the following steps:

Downloading Hadoop
Download and extract Hadoop 2.6.0 from Apache Software Foundation using the
following commands.
$ wget http://apache.claz.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
$ tar xzf hadoop-2.6.0.tar.gz
$ mv hadoop-2.6.0/* hadoop/

Installing Hadoop in Pseudo Distributed Mode


Follow the steps given below to install Hadoop 2.6.0 in pseudo-distributed mode.

Step 1: Setting up Hadoop


Set Hadoop path in ~/.bashrc file.
export HADOOP_HOME=/home/dexlab/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Now, apply all the changes into the current running system.
$ source ~/.bashrc

Step 2: Hadoop Configuration


core-site.xml
The core-site.xml file contains information such as the port number used for
Hadoop instance, memory allocated for the file system, memory limit for storing
the data, and the size of Read/Write buffers.
Open the core-site.xml and add the following properties in between the
<configuration> and </configuration> tags.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000 </value>
</property>
</configuration>

hdfs-site.xml
The hdfs-site.xml file contains information such as the value of replication data,
the namenode path, and the datanode path of your local file systems, that is, the
locations where you want to store the Hadoop infrastructure.
Let us assume the following data.
dfs.replication (data replication value) = 1

(In the following paths, dexlab is the user name; hdfs/namenode and
hdfs/datanode are the directories created for the HDFS file system.)

namenode path = /home/dexlab/hadoop/hdfs/namenode
datanode path = /home/dexlab/hadoop/hdfs/datanode

Open this file and add the following properties in between the <configuration>,
</configuration> tags in this file.
<configuration>

<property>
<name>dfs.replication</name>
<value>1</value>
</property>

<property>
<name>dfs.name.dir</name>
<value>/home/dexlab/hadoop/hdfs/namenode </value>
</property>

<property>
<name>dfs.data.dir</name>
<value>/home/dexlab/hadoop/hdfs/datanode </value>
</property>


</configuration>

Note: In the above file, all the property values are user-defined and you can
make changes according to your Hadoop infrastructure.

mapred-site.xml
This file is used to specify which MapReduce framework we are using. By
default, Hadoop contains a template of this file named mapred-site.xml.template.
First of all, you need to copy the file from mapred-site.xml.template to
mapred-site.xml using the following command.
$ cp mapred-site.xml.template mapred-site.xml

Open mapred-site.xml file and add the following properties in between the
<configuration>, </configuration> tags in this file.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

Verifying Hadoop Installation


The following steps are used to verify the Hadoop installation.

Step 1: Name Node Setup


Set up the namenode using the command “hdfs namenode -format” as follows.
$ cd ~
$ hdfs namenode -format

Step 2: Verifying Hadoop dfs


The following command is used to start all the daemons of Hadoop. Executing this
command will start your Hadoop file system.
$ start-all.sh

Step 3: Accessing Hadoop on Browser


The default port number to access Hadoop is 50070. Use the following URL to get
Hadoop services on your browser.
http://localhost:50070/

The following image depicts a Hadoop browser.


Figure 50: Namenode WebUI

Step 4: Verify All Applications for Cluster


The default port number to access all applications of cluster is 8088. Use the
following url to visit this service.
http://localhost:8088/

The following image depicts the Hadoop cluster browser.

Figure 51: Resource Manager

Step 5: Downloading Sqoop


We can download the latest version of Sqoop from the following link. For this
tutorial, we are using version 1.99.7, that is, sqoop-1.99.7.

Step 6: Installing Sqoop


The following commands are used to extract the Sqoop tar ball and move it to
the /home/dexlab/sqoop directory.
$ tar -xvf sqoop-1.99.7.bin__hadoop-2.6.0-alpha.tar.gz
$ mv sqoop-1.99.7.bin__hadoop-2.6.0-alpha /home/dexlab/sqoop

Step 7: Configuring bashrc


You have to set up the Sqoop environment by appending the following lines to
~/.bashrc file:
#Sqoop
export SQOOP_HOME=/home/dexlab/sqoop
export PATH=$PATH:$SQOOP_HOME/bin

The following command is used to execute ~/.bashrc file.


$ source ~/.bashrc

Step 8: Configuring Sqoop


To configure Sqoop with Hadoop, you need to edit the sqoop-env.sh file, which is
placed in the $SQOOP_HOME/conf directory. First of all, Redirect to Sqoop
config directory and copy the template file using the following command:
$ cd $SQOOP_HOME/conf
$ mv sqoop-env-template.sh sqoop-env.sh

Open sqoop-env.sh and edit the following lines:


export HADOOP_COMMON_HOME=/home/dexlab/hadoop
export HADOOP_MAPRED_HOME=/home/dexlab/hadoop

Step 9: Download and Configure mysql-connector-java


We can download the mysql-connector-java-5.1.30.tar.gz file from the following link.
The following commands are used to extract the mysql-connector-java tarball
and move mysql-connector-java-5.1.30-bin.jar to the /home/dexlab/sqoop/lib directory.
$ tar -zxf mysql-connector-java-5.1.30.tar.gz
$ cd mysql-connector-java-5.1.30
$ mv mysql-connector-java-5.1.30-bin.jar /home/dexlab/sqoop/lib

Step 10: Verifying Sqoop


The following command is used to verify the Sqoop version.
$ cd $SQOOP_HOME/bin
$ sqoop-version

Sqoop installation is complete.

12.8 Sqoop Import


Syntax
The following syntax is used to import data into HDFS.
$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)

Example
Let us take an example of three tables named as emp, emp_add, and
emp_contact, which are in a database called dexlabdb in a MySQL database
server. The three tables and their data are as follows.


emp:
Id Name Deg salary dept

1201 Denial Manager 50,000 TP

1202 Manish Proof reader 50,000 TP

1203 Khalil php dev 30,000 AC

1204 Prashant php dev 30,000 AC

1205 Kirti Admin 20,000 TP

emp_add:
Id Hno Street city

1201 288A Vgiri jublee

1202 108I Aoc sec-bad

1203 144Z Pgutta hyd

1204 78B old city sec-bad

1205 720X Hitec sec-bad

emp_contact:
Id Phno Email

1201 2356742 denial@tp.com

1202 1661663 manish@tp.com

1203 8887776 khalil@ac.com

1204 9988774 prashant@ac.com

1205 1231231 kirti@tp.com

Importing a Table
Sqoop tool ‘import’ is used to import table data from the table to the Hadoop file
system as a text file or a binary file.
The following command is used to import the emp table from MySQL
database server to HDFS.


$ sqoop import \
--connect jdbc:mysql://localhost/dexlabdb \
--username root \
--table emp --m 1

To verify the imported data in HDFS, use the following command.


$ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*

It shows you the emp table data and fields are separated with comma (,).
1201, denial, manager, 50000, TP
1202, manish, preader, 50000, TP
1203, kapil, php dev, 30000, AC
1204, prashant, php dev, 30000, AC
1205, kirti, admin, 20000, TP

Importing into Target Directory


We can specify the target directory while importing table data into HDFS using
the Sqoop import tool.
Following is the syntax to specify the target directory as option to the
Sqoop import command.
--target-dir <new or exist directory in HDFS>

The following command is used to import emp_add table data into ‘/queryresult’
directory.
$ sqoop import \
--connect jdbc:mysql://localhost/dexlabdb \
--username root \
--table emp_add \
--m 1 \
--target-dir /queryresult

The following command is used to verify the imported data in /queryresult


directory form emp_add table.
$ $HADOOP_HOME/bin/hadoop fs -cat /queryresult/part-m-*

It will show you the emp_add table data with comma (,) separated fields.
1201, 288A, vgiri, jublee
1202, 108I, aoc, sec-bad
1203, 144Z, pgutta, hyd
1204, 78B, oldcity, sec-bad
1205, 720C, hitech, sec-bad


Import Subset of Table Data


We can import a subset of a table using the ‘where’ clause in Sqoop import tool.
It executes the corresponding SQL query in the respective database server and
stores the result in a target directory in HDFS.
The syntax for where clause is as follows.
--where <condition>
The following command is used to import a subset of the emp_add table data. The
subset query retrieves the employee id and address of the employees who live in
Secunderabad city.
$ sqoop import \
--connect jdbc:mysql://localhost/dexlabdb \
--username root \
--table emp_add \
--m 1 \
--where "city ='sec-bad'" \
--target-dir /wherequery

The following command is used to verify the imported data in /wherequery


directory from the emp_add table.
$ $HADOOP_HOME/bin/hadoop fs -cat /wherequery/part-m-*

It will show you the emp_add table data with comma (,) separated fields.
1202, 108I, aoc, sec-bad
1204, 78B, old city, sec-bad
1205, 720C, hitech, sec-bad

Incremental Import
Incremental import is a technique that imports only the newly added rows in a
table. It is required to add ‘incremental’, ‘check-column’, and ‘last-value’ options
to perform the incremental import.
The following syntax is used for the incremental option in Sqoop import
command.
--incremental <mode>
--check-column <column name>
--last-value <last check column value>

Let us assume the newly added data into emp table is as follows:
1206, bunny p, grp des, 20000, GR
The following command is used to perform the incremental import in the
emp table.
$ sqoop import \
--connect jdbc:mysql://localhost/dexlabdb \
--username root \
--table emp \
--m 1 \
--incremental append \
--check-column id \
--last-value 1205

The following command is used to verify the imported data from emp table to
HDFS emp/ directory.
$ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*

It shows you the emp table data with comma (,) separated fields.
1201, denial, manager, 50000, TP
1202, manish, preader, 50000, TP
1203, kapil, php dev, 30000, AC
1204, prashant, php dev, 30000, AC
1205, kirti, admin, 20000, TP
1206, bunny p, grp des, 20000, GR

The following command is used to see the modified or newly added rows from the
emp table.
$ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*1

It shows you the newly added rows to the emp table with comma (,) separated
fields.
1206, bunny p, grp des, 20000, GR

12.9 Import-all-tables
Syntax
The following syntax is used to import all tables.
$ sqoop import-all-tables (generic-args) (import-args)
$ sqoop-import-all-tables (generic-args) (import-args)

Example
Let us take an example of importing all tables from the dexlabdb database. The
list of tables that the database dexlabdb contains is as follows.
+------------------------+
| Tables |
+-----------------------+
| emp |
| emp_add |
| emp_contact |
+-----------------------+


The following command is used to import all the tables from the dexlabdb
database.
$ sqoop import-all-tables \
--connect jdbc:mysql://localhost/dexlabdb \
--username root

Note: If you are using the import-all-tables, it is mandatory that every table in
that database must have a primary key field.
The following command is used to verify that all the table data from the dexlabdb
database has been imported into HDFS.
$ $HADOOP_HOME/bin/hadoop fs -ls

It will show you the list of table names in dexlabdb database as


directories.

Output
drwxr-xr-x - dexlab supergroup 0 2014-12-22 22:50 _sqoop
drwxr-xr-x - dexlab supergroup 0 2014-12-23 01:46 emp
drwxr-xr-x - dexlab supergroup 0 2014-12-23 01:50 emp_add
drwxr-xr-x - dexlab supergroup 0 2014-12-23 01:52 emp_contact

12.10 Sqoop Export


Syntax
The following is the syntax for the export command.
$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)

Example
Let us take an example of the employee data in file, in HDFS. The employee data
is available in emp_data file in ‘emp/’ directory in HDFS. The emp_data is as
follows.
1201, denial, manager, 50000, TP
1202, manish, preader, 50000, TP
1203, kapil, php dev, 30000, AC
1204, prashant, php dev, 30000, AC
1205, kirti, admin, 20000, TP
1206, bunny p, grp des, 20000, GR

It is mandatory that the table to be exported is created manually and is
present in the database into which the data is to be exported.
The following query is used to create the table ‘employee’ in mysql command line.
$ mysql


mysql> USE db;


mysql> CREATE TABLE employee (
id INT NOT NULL PRIMARY KEY,
name VARCHAR(20),
deg VARCHAR(20),
salary INT,
dept VARCHAR(10));

The following command is used to export the table data (which is in emp_data
file on HDFS) to the employee table in db database of Mysql database server.
$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \
--export-dir /emp/emp_data

The following command is used to verify the table in mysql command line.
mysql>select * from employee;

If the given data is stored successfully, then you can find the following table of
given employee data.
+-------+--------------+-----------------+-------------------+--------+
| Id | Name | Designation | Salary | Dept |
+-------+--------------+-----------------+-------------------+--------+
| 1201 | denial | manager | 50000 | TP |
| 1202 | manish | preader | 50000 | TP |
| 1203 | kapil | php dev | 30000 | AC |
| 1204 | prashant | php dev | 30000 | AC |
| 1205 | kirti | admin | 20000 | TP |
| 1206 | bunny p | grp des | 20000 | GR |
+------+--------------+------------------+-------------------+--------+

Sqoop Job
Syntax
The following is the syntax for creating a Sqoop job.
$ sqoop job (generic-args) (job-args)
[-- [subtool-name] (subtool-args)]
$ sqoop-job (generic-args) (job-args)
[-- [subtool-name] (subtool-args)]

Create Job (--create)


Here we are creating a job with the name dexjob, which can import the table
data from RDBMS table to HDFS. The following command is used to create a job


that is importing data from the employee table in the db database to the HDFS
file.
$ sqoop job --create dexjob \
--import \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee --m 1

Verify Job (--list)


‘--list’ argument is used to verify the saved jobs. The following command is used
to verify the list of saved Sqoop jobs.
$ sqoop job --list

It shows the list of saved jobs.


Available jobs:
dexjob

Inspect Job (--show)


‘--show’ argument is used to inspect or verify particular jobs and their details.
The following command and sample output is used to verify a job called dexjob.
$ sqoop job --show dexjob

It shows the tools and their options, which are used in dexjob.
Job: dexjob
Tool: import
Options:
----------------------------
direct.import = true
codegen.input.delimiters.record = 0
hdfs.append.dir = false
db.table = employee
...
incremental.last.value = 1206
...

Execute Job (--exec)


‘--exec’ option is used to execute a saved job. The following command is used to
execute a saved job called dexjob.
$ sqoop job --exec dexjob


12.11 List-Databases
Syntax
The following syntax is used for Sqoop list-databases command.
$ sqoop list-databases (generic-args) (list-databases-args)
$ sqoop-list-databases (generic-args) (list-databases-args)

Sample Query
The following command is used to list all the databases in the MySQL database
server.
$ sqoop list-databases \
--connect jdbc:mysql://localhost/ \
--username root
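
The output is simply a list of database names, one per line. Assuming a default MySQL installation that also contains the dexlabdb database used in this chapter, it would look roughly like this (illustrative output only):
information_schema
dexlabdb
mysql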

12.12 List-Tables
Syntax
The following syntax is used for Sqoop list-tables command.
$ sqoop list-tables (generic-args) (list-tables-args)
$ sqoop-list-tables (generic-args) (list-tables-args)
Sample Query
The following command is used to list all the tables in the dexlabdb database of
MySQL database server.

$ sqoop list-tables \
--connect jdbc:mysql://localhost/dexlabdb \
--username root
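
Assuming the dexlabdb database described earlier in this chapter, the command lists its three tables, one per line:
emp
emp_add
emp_contact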


CHAPTER 13: Apache Pig
13.1 What is Apache Pig?
Apache Pig is an abstraction over MapReduce. It is a tool/platform used to
analyze large data sets by representing them as data flows. Pig is generally
used with Hadoop; we can perform all the data manipulation operations in
Hadoop using Apache Pig.
To write data analysis programs, Pig provides a high-level language
known as Pig Latin. This language provides various operators using which
programmers can develop their own functions for reading, writing, and
processing data.
To analyze data using Apache Pig, programmers need to write scripts
using Pig Latin language. All these scripts are internally converted to Map and
Reduce tasks. Apache Pig has a component known as Pig Engine that accepts
the Pig Latin scripts as input and converts those scripts into MapReduce jobs.

13.2 Why Do We Need Apache Pig?


Programmers who are not comfortable with Java often struggle to work
with Hadoop, especially when writing MapReduce tasks. Apache Pig is
a boon for all such programmers.
● Using Pig Latin, programmers can perform MapReduce tasks easily
without having to type complex codes in Java.
● Apache Pig uses a multi-query approach, thereby reducing the length of
code. For example, an operation that would require you to type 200 lines
of code (LoC) in Java can be done by typing as few as 10 LoC in Apache Pig
(see the word-count sketch after this list). Ultimately, Apache Pig reduces
the development time by almost 16 times.
● Pig Latin is SQL-like language and it is easy to learn Apache Pig when
you are familiar with SQL.
● Apache Pig provides many built-in operators to support data operations
like joins, filters, ordering, etc. In addition, it also provides nested data
types like tuples, bags, and maps that are missing from MapReduce.
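
As an illustration of how compact Pig Latin is, the following minimal word-count sketch assumes a plain-text file at the placeholder path /pig_data/input.txt in HDFS (entered at the Grunt shell):
grunt> lines = LOAD 'hdfs://localhost:9000/pig_data/input.txt' AS (line:chararray);
grunt> words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grunt> grpd = GROUP words BY word;
grunt> counts = FOREACH grpd GENERATE group, COUNT(words);
grunt> DUMP counts;

The equivalent MapReduce program in Java would typically need separate mapper, reducer, and driver classes.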


13.3 Apache Pig Vs MapReduce


Listed below are the major differences between Apache Pig and MapReduce.
Apache Pig | MapReduce

Apache Pig is a data flow language. | MapReduce is a data processing paradigm.

It is a high-level language. | MapReduce is low level and rigid.

Performing a Join operation in Apache Pig is pretty simple. | It is quite difficult in MapReduce to perform a Join operation between datasets.

Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig. | Exposure to Java is a must to work with MapReduce.

Apache Pig uses a multi-query approach, thereby reducing the length of the code to a great extent. | MapReduce will require almost 20 times more lines of code to perform the same task.

There is no need for compilation. On execution, every Apache Pig operator is converted internally into a MapReduce job. | MapReduce jobs have a long compilation process.

13.4 Pig Vs SQL


Listed below are the major differences between Apache Pig and SQL.
Pig | SQL

Pig Latin is a procedural language. | SQL is a declarative language.

In Apache Pig, schema is optional. We can store data without designing a schema (values are stored as $0, $1, $2, and so on). | Schema is mandatory in SQL.

The data model in Apache Pig is nested relational. | The data model used in SQL is flat relational.

Apache Pig provides limited opportunity for query optimization. | There is more opportunity for query optimization in SQL.

13.5 Pig Vs Hive


Both Apache Pig and Hive are used to create MapReduce jobs. And in some
cases, Hive operates on HDFS in a similar way Apache Pig does. In the


following table, we have listed a few significant points that set Apache Pig apart
from Hive.
Apache Pig | Hive

Apache Pig uses a language called Pig Latin. It was originally created at Yahoo. | Hive uses a language called HiveQL. It was originally created at Facebook.

Pig Latin is a data flow language. | HiveQL is a query processing language.

Pig Latin is a procedural language and it fits in the pipeline paradigm. | HiveQL is a declarative language.

Apache Pig can handle structured, unstructured, and semi-structured data. | Hive is mostly for structured data.

13.6 Pig Architecture


The language used to analyze data in Hadoop using Pig is known as Pig Latin.
It is a high-level data processing language which provides a rich set of data types
and operators to perform various operations on the data.
To perform a particular task using Pig, programmers need to write a Pig
script in the Pig Latin language and execute it using any
of the execution mechanisms (Grunt Shell, UDFs, Embedded). After execution,
these scripts will go through a series of transformations applied by the Pig
Framework, to produce the desired output.
Internally, Apache Pig converts these scripts into a series of MapReduce
jobs, and thus, it makes the programmer’s job easy. The architecture of Apache
Pig is shown below.

13.7 Apache Pig Components


As shown in the figure, there are various
components in the Apache Pig framework. Let us
take a look at the major components.
Parser
Initially the Pig Scripts are handled by the
Parser. It checks the syntax of the script, does type
checking, and other miscellaneous checks. The
output of the parser will be a DAG (directed acyclic
graph), which represents the Pig Latin statements
and logical operators.
In the DAG, the logical operators of the
script are represented as the nodes and the data
flows are represented as edges.
Figure 52: Apache Pig Components


Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out the
logical optimizations such as projection and pushdown.

Compiler
The compiler compiles the optimized logical plan into a series of MapReduce
jobs.

Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in a sorted order, and
these MapReduce jobs are executed on Hadoop to produce the desired results.

Pig Latin Data Model


The data model of Pig Latin is fully nested and it allows complex non-atomic
datatypes such as map and tuple. Given below is the diagrammatical
representation of Pig Latin’s data model.

Atom
Any single value in Pig Latin, irrespective of its data type, is known as an
Atom. It is stored as a string and can be used as a string or a number. int, long,
float, double, chararray, and bytearray are the atomic values of Pig. A piece of
data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’

Tuple
A record that is formed by an ordered set of fields is known as a tuple, the fields
can be of any type. A tuple is similar to a row in a table of RDBMS.
Example − (Raja, 30)

Bag
A bag is an unordered set of tuples. In other words, a collection of (non-unique)
tuples is known as a bag. Each tuple can have any number of fields (flexible
schema). A bag is represented by ‘{}’. It is similar to a table in RDBMS, but
unlike a table in RDBMS, it is not necessary that every tuple contain the same
number of fields or that the fields in the same position (column) have the same
type.

Figure 53: Tuple and Bag


Example − {(Raja, 30), (Mohammad, 45)}


A bag can be a field in a relation; in that context, it is known as inner bag.
Example − {Raja, 30, {9848022338, raja@gmail.com,}}

Map
A map (or data map) is a set of key-value pairs. The key needs to be of type
chararray and should be unique. The value might be of any type. It is
represented by ‘[]’
Example − [name#Raja, age#30]

13.8 Pig Installation


Download Apache Pig
First of all, download the latest version of Apache Pig from the following website
− https://pig.apache.org/

Install Apache Pig


After downloading the Apache Pig software, install it in your Linux environment
by following the steps given below.

Step 1 - Create a directory with the name Pig in the same directory where the
installation directories of Hadoop, Java, and other software were installed. (In
our tutorial, we have created the Pig directory in the home directory of the user dexlab.)
$ mkdir Pig

Step 2 - Extract the downloaded tar files as shown below.


$ cd Downloads/
$ tar zxvf pig-0.15.0-src.tar.gz
$ tar zxvf pig-0.15.0.tar.gz

Step 3 - Move the content of the extracted pig-0.15.0 directory to the Pig directory
created earlier as shown below.
$ mv pig-0.15.0/* /home/dexlab/Pig/

Configure Apache Pig


After installing Apache Pig, we have to configure it. To configure, we need to edit
two files − bashrc and pig.properties.

.bashrc file
In the .bashrc file, set the following variables −


 PIG_HOME folder to the Apache Pig’s installation folder,


 PATH environment variable to the bin folder, and
 PIG_CLASSPATH environment variable to the etc (configuration) folder
of your Hadoop installations (the directory that contains the core-site.xml,
hdfs-site.xml and mapred-site.xml files).
export PIG_HOME=/home/dexlab/Pig
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/etc/hadoop

Verifying the Installation


Verify the installation of Apache Pig by typing the version command. If the
installation is successful, you will get the version of Apache Pig as shown below.
$ pig -version

Apache Pig version 0.15.0 (r1682971)


compiled Jun 01 2015, 11:44:35

13.9 Pig Execution


Apache Pig Execution Modes
You can run Apache Pig in two modes, namely, Local Mode and HDFS mode.

Local Mode
In this mode, all the files are installed and run from your local host and local file
system. There is no need of Hadoop or HDFS. This mode is generally used for
testing purpose.

MapReduce Mode
MapReduce mode is where we load or process the data that exists in the Hadoop
File System (HDFS) using Apache Pig. In this mode, whenever we execute the
Pig Latin statements to process the data, a MapReduce job is invoked in the
back-end to perform a particular operation on the data that exists in the HDFS.

Apache Pig Execution Mechanisms


Apache Pig scripts can be executed in three ways, namely, interactive mode,
batch mode, and embedded mode.
 Interactive Mode (Grunt shell) − You can run Apache Pig in interactive
mode using the Grunt shell. In this shell, you can enter the Pig Latin
statements and get the output (using Dump operator).
 Batch Mode (Script) − You can run Apache Pig in Batch mode by writing
the Pig Latin script in a single file with .pig extension.


 Embedded Mode (UDF) − Apache Pig provides the provision of defining


our own functions (User Defined Functions) in programming languages
such as Java, and using them in our script.

Invoking the Grunt Shell


You can invoke the Grunt shell in a desired mode (local/MapReduce) using the
-x option as shown below.

Local mode:
$ ./pig -x local

MapReduce mode:
$ ./pig -x mapreduce

Either of these commands gives you the Grunt shell prompt as shown below.
grunt>
You can exit the Grunt shell using ‘ctrl + d’.
After invoking the Grunt shell, you can execute a Pig script by directly
entering the Pig Latin statements in it.
grunt> customers = LOAD 'customers.txt' USING PigStorage(',');

Executing Apache Pig in Batch Mode


You can write an entire Pig Latin script in a file and execute it using the –x
command. Let us suppose we have a Pig script in a file named sample_script.pig
as shown below.
Sample_script.pig
dexlab = LOAD 'hdfs://localhost:9000/pig_data/dexlab.txt' USING
PigStorage(',') as (id:int,name:chararray,city:chararray);

Dump dexlab;

Now, you can execute the script in the above file as shown below.
Local mode:
$ pig -x local Sample_script.pig

MapReduce mode:
$ pig -x mapreduce Sample_script.pig


Note − We will discuss in detail how to run a Pig script in Batch mode and in
embedded mode in subsequent chapters.

13.10 Pig Shell Commands


Shell Commands
The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts. Prior to
that, we can invoke any shell commands using sh and fs.

Sh Command
Using sh command, we can invoke any shell commands from the Grunt shell.
Using sh command from the Grunt shell, we cannot execute the commands that
are a part of the shell environment (ex – cd).
Syntax
Given below is the syntax of sh command.
Grunt> sh shell command parameters

Example
We can invoke the ls command of Linux shell from the Grunt shell using the sh
option as shown below. In this example, it lists out the files in the /pig/bin/
directory.
Grunt> sh ls

pig
pig_1444799121955.log
pig.cmd
pig.py

fs Command
Using the fs command, we can invoke any Fs Shell commands from the Grunt
shell.
Syntax
Given below is the syntax of fs command.
grunt> fs File System command parameters

Example
We can invoke the ls command of HDFS from the Grunt shell using fs command.
In the following example, it lists the files in the HDFS root directory.
grunt> fs –ls

Found 3 items
drwxrwxrwx - Dexlab supergroup 0 2015-09-08 14:13 Hbase


drwxr-xr-x - Dexlab supergroup 0 2015-09-09 14:52 seqgen_data


drwxr-xr-x - Dexlab supergroup 0 2015-09-08 11:30 twitter_data

In the same way, we can invoke all the other file system shell commands
from the Grunt shell using the fs command.

Utility Commands
The Grunt shell provides a set of utility commands. These include utility
commands such as clear, help, history, quit, and set; and commands such as
exec, kill, and run to control Pig from the Grunt shell. Given below is the
description of the utility commands provided by the Grunt shell.

clear Command
The clear command is used to clear the screen of the Grunt shell.
Syntax
You can clear the screen of the grunt shell using the clear command as shown
below.
grunt> clear

help Command
The help command gives you a list of Pig commands or Pig properties.
Usage
You can get a list of Pig commands using the help command as shown below.
grunt> help

history Command
This command displays a list of statements executed / used so far since the
Grunt shell was invoked.
Usage
Assume we have executed three statements since opening the Grunt shell.
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING
PigStorage(',');

grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',');

grunt> dexlab = LOAD 'hdfs://localhost:9000/pig_data/dexlab.txt' USING PigStorage(',');

Then, using the history command will produce the following output.
grunt> history

customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',');

orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',');


dexlab = LOAD 'hdfs://localhost:9000/pig_data/dexlab.txt' USING PigStorage(',');

Set Command
The set command is used to show/assign values to keys used in Pig.
Usage
Using this command, you can set values to the following keys.

Key Description and values

default_parallel You can set the number of reducers for a job by passing any whole
number as a value to this key.

debug You can turn the debugging feature in Pig off or on by passing
on/off to this key.

job.name You can set the Job name to the required job by passing a string value
to this key.

job.priority You can set the job priority to a job by passing one of the following
values to this key −
● very_low
● low
● normal
● high
● very_high

stream.skippath For streaming, you can set the path from where the data is not to be
transferred, by passing the desired path in the form of a string to this
key.
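
For example, the following minimal sketch assigns an arbitrary job name, reducer count, and debug setting from the Grunt shell (all three values are illustrative):
grunt> set job.name 'dexlab-pig-demo'
grunt> set default_parallel 2
grunt> set debug on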

Quit Command
You can quit from the Grunt shell using this command.
Usage
Quit from the Grunt shell as shown below.
grunt> quit

Let us now take a look at the commands using which you can control
Apache Pig from the Grunt shell.

Exec Command
Using the exec command, we can execute Pig scripts from the Grunt shell.
Syntax
Given below is the syntax of the utility command exec.


grunt> exec [-param param_name = param_value] [-param_file file_name] [script]
Example
Let us assume there is a file named dexlab.txt in the /pig_data/ directory of
HDFS with the following content.

Dexlab.txt
001,Rajiv,Hyderabad
002,siddarth,Kolkata
003,Rajesh,Delhi

And, assume we have a script file named sample_script.pig in the


/pig_data/ directory of HDFS with the following content.

Sample_script.pig
dexlab = LOAD 'hdfs://localhost:9000/pig_data/dexlab.txt' USING
PigStorage(',')
as (id:int,name:chararray,city:chararray);

Dump dexlab;

Now, let us execute the above script from the Grunt shell using the exec
command as shown below.
grunt> exec /sample_script.pig

Output
The exec command executes the script in the sample_script.pig. As directed in
the script, it loads the dexlab.txt file into Pig and gives you the result of the
Dump operator displaying the following content.
(1,Rajiv,Hyderabad)
(2,siddarth,Kolkata)
(3,Rajesh,Delhi)

Kill Command
You can kill a job from the Grunt shell using this command.
Syntax
Given below is the syntax of the kill command.
grunt> kill JobId

Example
Suppose there is a running Pig job having id Id_0055, you can kill it from the


Grunt shell using the kill command, as shown below.


grunt> kill Id_0055

Run Command
You can run a Pig script from the Grunt shell using the run command
Syntax
Given below is the syntax of the run command.
grunt> run [-param param_name = param_value] [-param_file file_name] script

Example
Let us assume there is a file named dexlab.txt in the /pig_data/ directory of
HDFS with the following content.

Dexlab.txt
001,Rajiv,Hyderabad
002,siddarth,Kolkata
003,Rajesh,Delhi

And, assume we have a script file named sample_script.pig in the local


filesystem with the following content.

Sample_script.pig
dexlab = LOAD 'hdfs://localhost:9000/pig_data/dexlab.txt' USING
PigStorage(',') as (id:int,name:chararray,city:chararray);

Now, let us run the above script from the Grunt shell using the run
command as shown below.
grunt> run /sample_script.pig

You can see the output of the script using the Dump operator as shown
below.
grunt> Dump;
(1,Rajiv,Hyderabad)
(2,siddarth,Kolkata)
(3,Rajesh,Delhi)

Note − The difference between exec and the run command is that if we use run,
the statements from the script are available in the command history.


13.11 Pig Basics

Pig Latin – Data Model


As discussed in the previous chapters, the data model of Pig is fully nested. A
Relation is the outermost structure of the Pig Latin data model. And it is a bag
where −
 A bag is a collection of tuples.
 A tuple is an ordered set of fields.
 A field is a piece of data.

Pig Latin – Statements


While processing data using Pig Latin, statements are the basic constructs.
● These statements work with relations. They include expressions and
schemas.
● Every statement ends with a semicolon (;).
● We will perform various operations using operators provided by Pig
Latin, through statements.
● Except LOAD and STORE, while performing all other operations, Pig
Latin statements take a relation as input and produce another relation as
output.
● As soon as you enter a Load statement in the Grunt shell, its semantic
checking will be carried out. To see the contents of the schema, you need
to use the Dump operator. Only after performing the dump operation, the
MapReduce job for loading the data into the file system will be carried
out.

Example
Given below is a Pig Latin statement, which loads data to Apache Pig.
grunt> Dexlab_data = LOAD 'dexlab_data.txt' USING PigStorage(',')as
( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

13.12 Pig Latin – Data types


The table given below describes the Pig Latin data types.
S.N. Data Type Description & Example

1 Int Represents a signed 32-bit integer.


Example : 8

2 Long Represents a signed 64-bit integer.


Example : 5L


3 Float Represents a signed 32-bit floating point.


Example : 5.5F

4 Double Represents a 64-bit floating point.


Example : 10.5

5 Chararray Represents a character array (string) in Unicode UTF-8 format.


Example : ‘tutorials point’

6 Bytearray Represents a Byte array (blob).

7 Boolean Represents a Boolean value.


Example : true/ false.

8 Datetime Represents a date-time.


Example : 1970-01-01T00:00:00.000+00:00

9 Biginteger Represents a Java BigInteger.


Example : 60708090709

10 Bigdecimal Represents a Java BigDecimal


Example : 185.98376256272893883

Complex Types

11 Tuple A tuple is an ordered set of fields.


Example : (raja, 30)

12 Bag A bag is a collection of tuples.


Example : {(raju,30),(Mohhammad,45)}

13 Map A Map is a set of key-value pairs.


Example : [ ‘name’#’Raju’, ‘age’#30]

Null Values
Values for all the above data types can be NULL. Apache Pig treats null values
in a similar way as SQL does.
A null can be an unknown value or a non-existent value. It is used as a
placeholder for optional values. These nulls can occur naturally or can be the
result of an operation.
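
For example, a minimal sketch that uses a null check to keep only records with a phone value, assuming the dexlab relation loaded later in this chapter:
grunt> with_phone = FILTER dexlab BY phone IS NOT NULL;
grunt> DUMP with_phone;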

Pig Latin – Arithmetic Operators


The following table describes the arithmetic operators of Pig Latin. Suppose a =
10 and b = 20.
Operator | Description | Example

+ | Addition − Adds values on either side of the operator | a + b will give 30

− | Subtraction − Subtracts right hand operand from left hand operand | a − b will give −10

* | Multiplication − Multiplies values on either side of the operator | a * b will give 200

/ | Division − Divides left hand operand by right hand operand | b / a will give 2

% | Modulus − Divides left hand operand by right hand operand and returns remainder | b % a will give 0

?: | Bincond − Evaluates a Boolean expression. It has three operands, as shown below:
variable x = (expression) ? value1 if true : value2 if false.
Example: b = (a == 1) ? 20 : 30; if a = 1 the value of b is 20, and if a != 1 the value of b is 30.

CASE WHEN THEN ELSE END | Case − The case operator is equivalent to the nested bincond operator.
Example:
CASE f2 % 2
    WHEN 0 THEN 'even'
    WHEN 1 THEN 'odd'
END
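
For example, a minimal sketch that applies an arithmetic operator and the bincond operator in a FOREACH statement, assuming a relation emp with a numeric salary field (the relation name and the 30000 threshold are illustrative):
grunt> graded = FOREACH emp GENERATE name, salary * 0.1 AS bonus, (salary >= 30000 ? 'senior' : 'junior') AS grade;
grunt> DUMP graded;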

Pig Latin – Comparison Operators


The following table describes the comparison operators of Pig Latin.
Operator | Description | Example

== | Equal − Checks if the values of two operands are equal or not; if yes, then the condition becomes true. | (a = b) is not true

!= | Not Equal − Checks if the values of two operands are equal or not. If the values are not equal, then the condition becomes true. | (a != b) is true

> | Greater than − Checks if the value of the left operand is greater than the value of the right operand. If yes, then the condition becomes true. | (a > b) is not true

< | Less than − Checks if the value of the left operand is less than the value of the right operand. If yes, then the condition becomes true. | (a < b) is true

>= | Greater than or equal to − Checks if the value of the left operand is greater than or equal to the value of the right operand. If yes, then the condition becomes true. | (a >= b) is not true

<= | Less than or equal to − Checks if the value of the left operand is less than or equal to the value of the right operand. If yes, then the condition becomes true. | (a <= b) is true

matches | Pattern matching − Checks whether the string on the left-hand side matches the constant on the right-hand side. | f1 matches '.*tutorial.*'
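
For example, a minimal sketch that combines a comparison and pattern matching, assuming the dexlab relation (id, firstname, lastname, phone, city) used later in this chapter; the pattern is illustrative:
grunt> kolkata = FILTER dexlab BY city matches 'Kol.*';
grunt> later_ids = FILTER dexlab BY id >= 4;
grunt> DUMP kolkata;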


Pig Latin – Type Construction Operators


The following table describes the Type construction operators of Pig Latin.
Operator | Description | Example

() | Tuple constructor operator − This operator is used to construct a tuple. | (Raju, 30)

{} | Bag constructor operator − This operator is used to construct a bag. | {(Raju, 30), (Mohammad, 45)}

[] | Map constructor operator − This operator is used to construct a map. | [name#Raja, age#30]

Pig Latin – Relational Operations


The following table describes the relational operators of Pig Latin.
Operator Description

Loading and Storing

LOAD To Load the data from the file system (local/HDFS) into a
relation.

STORE To save a relation to the file system (local/HDFS).

Filtering

FILTER To remove unwanted rows from a relation.

DISTINCT To remove duplicate rows from a relation.

FOREACH, GENERATE To generate data transformations based on columns of data.

STREAM To transform a relation using an external program.

Grouping and Joining

JOIN To join two or more relations.

COGROUP To group the data in two or more relations.

GROUP To group the data in a single relation.

CROSS To create the cross product of two or more relations.

Sorting

ORDER To arrange a relation in a sorted order based on one or more


fields (ascending or descending).

LIMIT To get a limited number of tuples from a relation.

Combining and Splitting

UNION To combine two or more relations into a single relation.


SPLIT To split a single relation into two or more relations.

Diagnostic Operators

DUMP To print the contents of a relation on the console.

DESCRIBE To describe the schema of a relation.

EXPLAIN To view the logical, physical, or MapReduce execution plans to


compute a relation.

ILLUSTRATE To view the step-by-step execution of a series of statements.
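
For example, a minimal sketch that chains a few of these operators on the dexlab relation used in the following sections (grouping by city, counting, sorting, and keeping the top three rows; the relation and field names are illustrative):
grunt> by_city = GROUP dexlab BY city;
grunt> city_counts = FOREACH by_city GENERATE group AS city, COUNT(dexlab) AS cnt;
grunt> ordered = ORDER city_counts BY cnt DESC;
grunt> top3 = LIMIT ordered 3;
grunt> DUMP top3;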

13.13 Pig Reading Data


Preparing HDFS
In MapReduce mode, Pig reads (loads) data from HDFS and stores the results
back in HDFS. Therefore, let us start HDFS and create the following sample
data in HDFS.
Dexlab ID First Name Last Name Phone City

001 Rajiv Reddy 9848022337 Hyderabad

002 Siddarth Battacharya 9848022338 Kolkata

003 Rajesh Khanna 9848022339 Delhi

004 Preethi Agarwal 9848022330 Pune

005 Trupthi Mohanthy 9848022336 Bhuwaneshwar

006 Archana Mishra 9848022335 Chennai

The above dataset contains personal details like id, first name, last name,
phone number and city, of six dexlabs.

Step 1: Verifying Hadoop


First of all, verify the installation using Hadoop version command, as shown
below.
$ hadoop version

If your system contains Hadoop, and if you have set the PATH variable, then
you will get the following output
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r
e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0


From source with checksum 18e43357c8f927c0695f1e9522859d6a


This command was run using /home/dexlab/hadoop/share/hadoop/common/hadoop-common-2.6.0.jar

Step 2: Starting HDFS


Browse through the sbin directory of Hadoop and start yarn and Hadoop dfs
(distributed file system) as shown below.
$ start-all.sh

Step 3: Create a Directory in HDFS


In Hadoop DFS, you can create directories using the command mkdir. Create a
new directory in HDFS with the name Pig_Data in the required path as shown
below.
$ cd $HADOOP_HOME/bin/
$ hdfs dfs -mkdir hdfs://localhost:9000/Pig_Data

Step 4: Placing the data in HDFS


The input file of Pig contains each tuple/record in individual lines. And the
entities of the record are separated by a delimiter (In our example we used “,”).
In the local file system, create an input file dexlab_data.txt containing
data as shown below.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai.

Now, move the file from the local file system to HDFS using put command
as shown below. (You can use copyFromLocal command as well.)
$ cd $HADOOP_HOME/bin
$ hdfs dfs -put /home/dexlab/Pig/Pig_Data/dexlab_data.txt hdfs://localhost:9000/pig_data/

Verifying the file


You can use the cat command to verify whether the file has been moved into the
HDFS, as shown below.
$ cd $HADOOP_HOME/bin
$ hdfs dfs -cat hdfs://localhost:9000/pig_data/dexlab_data.txt

Output


You can see the content of the file as shown below.


15/10/01 12:16:55 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

13.14 The Load Operator


You can load data into Apache Pig from the file system (HDFS/ Local) using
LOAD operator of Pig Latin.
Syntax
The load statement consists of two parts divided by the “=” operator. On the left-
hand side, we need to mention the name of the relation where we want to store
the data, and on the right-hand side, we have to define how we store the data.
Given below is the syntax of the Load operator.
Relation_name = LOAD 'Input file path' USING function as
schema;
Where,
relation_name − We have to mention the relation in which we want to store
the data.
Input file path − We have to mention the HDFS directory where the file is
stored. (In MapReduce mode)
function − We have to choose a function from the set of load functions provided
by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader).
Schema − We have to define the schema of the data. We can define the required
schema as follows:
(column1 : data type, column2 : data type, column3 : data type);

Note − We can also load the data without specifying the schema. In that case, the
columns will be addressed as $0, $1, $2, and so on.

Example
As an example, let us load the data in dexlab_data.txt in Pig under the schema
named Dexlab using the LOAD command.

13.15 Pig Grunt Shell


First of all, open the Linux terminal. Start the Pig Grunt shell in MapReduce
mode as shown below.


$ pig -x mapreduce

It will start the Pig Grunt shell as shown below.


grunt>

Execute the Load Statement


Now load the data from the file dexlab_data.txt into Pig by executing the
following Pig Latin statement in the Grunt shell.
grunt> dexlab = LOAD 'hdfs://localhost:9000/pig_data/dexlab_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray );

Following is the description of the above statement.


Relation name      We have stored the data in the schema dexlab.

Input file path    We are reading data from the file dexlab_data.txt, which is in the /pig_data/ directory of HDFS.

Storage function   We have used the PigStorage() function. It loads and stores data as structured text files. It takes as a parameter the delimiter by which each entity of a tuple is separated. By default, it takes ‘\t’ as the parameter.

Schema             We have stored the data using the following schema.

                   Column     id     firstname    lastname     phone        city
                   Datatype   int    chararray    chararray    chararray    chararray

Note − The load statement will simply load the data into the specified relation
in Pig. To verify the execution of the Load statement, you have to use the
Diagnostic Operators which are discussed in the next chapters.

13.16 Storing Data


Syntax
Given below is the syntax of the Store statement.
STORE Relation_name INTO ' required_directory_path ' [USING
function];

Example
Assume we have a file dexlab_data.txt in HDFS with the following content.


001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai.

And we have read it into a relation dexlab using the LOAD operator as
shown below.
grunt> dexlab = LOAD 'hdfs://localhost:9000/pig_data/dexlab_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray );

Now, let us store the relation in the HDFS directory “/pig_Output/” as shown
below.
grunt> STORE dexlab INTO ' hdfs://localhost:9000/pig_Output/ ' USING PigStorage (',');

Verification
You can verify the stored data as shown below.

Step 1 - First of all, list out the files in the directory named pig_output using
the ls command as shown below.
$ hdfs dfs -ls 'hdfs://localhost:9000/pig_Output/'
Found 2 items
-rw-r--r--   1 Dexlab supergroup          0 2015-10-05 13:03 hdfs://localhost:9000/pig_Output/_SUCCESS
-rw-r--r--   1 Dexlab supergroup        224 2015-10-05 13:03 hdfs://localhost:9000/pig_Output/part-m-00000

You can observe that two files were created after executing the store
statement.

Step 2 - Using cat command, list the contents of the file named part-m-00000 as
shown below.
$ hdfs dfs -cat 'hdfs://localhost:9000/pig_Output/part-m-00000'
1,Rajiv,Reddy,9848022337,Hyderabad
2,siddarth,Battacharya,9848022338,Kolkata
3,Rajesh,Khanna,9848022339,Delhi
4,Preethi,Agarwal,9848022330,Pune
5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
6,Archana,Mishra,9848022335,Chennai


13.17 Pig Diagnostic Operator


The load statement will simply load the data into the specified relation in
Apache Pig. To verify the execution of the Load statement, you have to use the
Diagnostic Operators. Pig Latin provides four different types of diagnostic
operators −
● Dump operator
● Describe operator
● Explanation operator
● Illustration operator
In this chapter, we will discuss the Dump operator of Pig Latin.

Dump Operator
The Dump operator is used to run the Pig Latin statements and display the
results on the screen. It is generally used for debugging purposes.
Syntax
Given below is the syntax of the Dump operator.
grunt> Dump Relation_Name

Example
Assume we have a file dexlab_data.txt in HDFS with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai.
And we have read it into a relation dexlab using the LOAD operator as
shown below.
grunt> dexlab = LOAD 'hdfs://localhost:9000/pig_data/dexlab_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray );

Now, let us print the contents of the relation using the Dump operator as shown
below.
grunt> Dump dexlab
(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)


CHAPTER 14: Apache Flume
14.1 What is Flume?
Apache Flume is a tool/service/data-ingestion mechanism for collecting,
aggregating, and transporting large amounts of streaming data, such as log files
and events, from various sources to a centralized data store.
Flume is a highly reliable, distributed, and configurable tool. It is
principally designed to copy streaming data (log data) from various web servers
to HDFS.

Figure 54: Apache Flume

14.2 Applications of Flume


Assume an e-commerce web application wants to analyze customer behavior
in a particular region. To do so, it would need to move the available log
data into Hadoop for analysis. Here, Apache Flume comes to our rescue.
Flume is used to move the log data generated by application servers into
HDFS at a higher speed.

14.3 Advantages of Flume


Here are the advantages of using Flume
 Using Apache Flume we can store the data into any of the centralized
stores (HBase, HDFS).
 When the rate of incoming data exceeds the rate at which data can be
written to the destination, Flume acts as a mediator between data


producers and the centralized stores and provides a steady flow of data
between them.
 Flume provides the feature of contextual routing.
 The transactions in Flume are channel-based where two transactions (one
sender and one receiver) are maintained for each message. It guarantees
reliable message delivery.
 Flume is reliable, fault tolerant, scalable, manageable, and customizable.

14.4 Features of Flume


Some of the notable features of Flume are as follows −
 Flume ingests log data from multiple web servers into a centralized store
(HDFS, HBase) efficiently.
 Using Flume, we can get the data from multiple servers immediately into
Hadoop.
 Along with the log files, Flume is also used to import huge volumes of
event data produced by social networking sites like Facebook and Twitter,
and e-commerce websites like Amazon and Flipkart.
 Flume supports a large set of sources and destinations types.
 Flume supports multi-hop flows, fan-in fan-out flows, contextual routing,
etc.
 Flume can be scaled horizontally.

14.5 Apache Flume - Data Transfer in


Hadoop
Big Data, as we know, is a collection of large datasets that cannot be processed
using traditional computing techniques. Big Data, when analyzed, gives
valuable results. Hadoop is an open-source framework that allows us to store and
process Big Data in a distributed environment across clusters of computers
using simple programming models.

Streaming / Log Data


Generally, most of the data that is to be analyzed will be produced by various
data sources like applications servers, social networking sites, cloud servers, and
enterprise servers. This data will be in the form of log files and events.

Log file − In general, a log file is a file that lists events/actions that occur in an
operating system. For example, web servers list every request made to the
server in the log files.
On harvesting such log data, we can get information about −


 the application performance and locate various software and hardware


failures.
 the user behavior and derive better business insights.
The traditional method of transferring data into the HDFS system is to
use the put command. Let us see how to use the put command.

HDFS put Command


The main challenge in handling the log data is in moving these logs produced by
multiple servers to the Hadoop environment.
Hadoop File System Shell provides commands to insert data into
Hadoop and read from it. You can insert data into Hadoop using
the put command as shown below.
$ hadoop fs -put <path of the required file> <path in HDFS where to save the file>

Problem with put Command


We can use the put command of Hadoop to transfer data from these sources to
HDFS. But, it suffers from the following drawbacks −
 Using put command, we can transfer only one file at a time while the
data generators generate data at a much higher rate. Since the analysis
made on older data is less accurate, we need to have a solution to transfer
data in real time.
 If we use the put command, the data needs to be packaged and ready for
the upload. Since the web servers generate data continuously, this
is a very difficult task.

What we need here is a solution that can overcome the drawbacks
of the put command and transfer the "streaming data" from data generators to
centralized stores (especially HDFS) with less delay.

Problem with HDFS


In HDFS, the file exists as a directory entry and the length of the file will be
considered as zero till it is closed. For example, if a source is writing data into
HDFS and the network is interrupted in the middle of the operation (without
closing the file), then the data written to the file will be lost.
Therefore, we need a reliable, configurable, and maintainable system to
transfer the log data into HDFS.

Note − In a POSIX file system, whenever we are accessing a file (say, performing a
write operation), other programs can still read this file (at least the saved
portion of the file). This is because the file exists on the disk before it is closed.

Available Solutions
To send streaming data (log files, events, etc.) from various sources to HDFS,
we have the following tools available at our disposal −

Facebook’s Scribe
Scribe is an immensely popular tool that is used to aggregate and stream log
data. It is designed to scale to a very large number of nodes and be robust to
network and node failures.

Apache Kafka
Kafka has been developed by Apache Software Foundation. It is an open-source
message broker. Using Kafka, we can handle feeds with high throughput and
low latency.

Apache Flume
Apache Flume is a tool/service/data ingestion mechanism for collecting,
aggregating, and transporting large amounts of streaming data such as log data and
events from various web servers to a centralized data store.
It is a highly reliable, distributed, and configurable tool that is principally
designed to transfer streaming data from various sources to HDFS.
In this tutorial, we will discuss in detail how to use Flume with some
examples.

14.6 Apache Flume - Architecture


The following illustration depicts the basic architecture of Flume. As shown in
the illustration, data generators (such as Facebook, Twitter) generate data
which gets collected by individual Flume agents running on them. Thereafter,
a data collector (which is also an agent) collects the data from the agents
which is aggregated and pushed into a centralized store such as HDFS or HBase.

Figure 55: Apache Flume Architecture

14.7 Flume Event


An event is the basic unit of the data transported inside Flume. It contains a
payload of byte array that is to be transported from the source to the destination
accompanied by optional headers. A typical Flume event has the
following structure.
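A minimal sketch of that structure is shown below (the header keys are illustrative examples, not a fixed schema):

+----------------------------------------+
| Headers (optional key-value pairs,     |
|          e.g. timestamp, hostname)     |
+----------------------------------------+
| Body (the payload, an array of bytes)  |
+----------------------------------------+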

14.8 Flume Agent


An agent is an independent daemon process (JVM) in Flume. It receives the data
(events) from clients or other agents and forwards it to its next destination
(sink or agent). Flume may have more than one agent. The following diagram
represents a Flume Agent.
As shown in the diagram, a Flume Agent contains three main components,
namely source, channel, and sink.

Source
A source is the component of an Agent which receives data from the data
generators and transfers it to one or more channels in the form of Flume events.
Apache Flume supports several types of sources and each source receives
events from a specified data generator.
Example − Avro source, Thrift source, Twitter 1% source, etc.

Channel
A channel is a transient store which receives the events from the source and
buffers them till they are consumed by sinks. It acts as a bridge between the
sources and the sinks.
These channels are fully transactional and they can work with any
number of sources and sinks.
Example − JDBC channel, File system channel, Memory channel, etc.

Sink
A sink stores the data into centralized stores like HBase and HDFS. It
consumes the data (events) from the channels and delivers it to the destination.
The destination of the sink might be another agent or the central stores.

Example − HDFS sink

Note − A Flume agent can have multiple sources, sinks, and channels. We have
listed all the supported sources, sinks, channels in the Flume configuration
chapter of this tutorial.

14.9 Additional Components of a Flume Agent

What we have discussed above are the primitive components of the agent. In
addition to this, we have a few more components that play a vital role in
transferring the events from the data generator to the centralized stores.

Interceptors
Interceptors are used to alter/inspect Flume events which are transferred
between source and channel.

Channel Selectors
These are used to determine which channel is to be opted to transfer the data in
case of multiple channels. There are two types of channel selectors −

Default channel selectors − These are also known as replicating channel
selectors; they replicate all the events in each channel.

Multiplexing channel selectors − These decide the channel to which an event
is sent based on the address in the header of that event.
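As a hedged illustration (the agent, source, channel, and header names below are hypothetical), a multiplexing selector is configured in the agent's properties file along these lines:

# Route events to a channel based on the value of the "State" header
agent1.sources.src1.selector.type = multiplexing
agent1.sources.src1.selector.header = State
agent1.sources.src1.selector.mapping.CA = MemChannel1
agent1.sources.src1.selector.mapping.NY = MemChannel2
agent1.sources.src1.selector.default = MemChannel1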

Sink Processors
These are used to invoke a particular sink from a selected group of sinks. They
are used to create failover paths for your sinks or to load-balance events
across multiple sinks from a channel.
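For example, a hedged sketch of a failover sink group (the agent and sink names are hypothetical; the sink with the higher priority is tried first):

agent1.sinkgroups = group1
agent1.sinkgroups.group1.sinks = hdfsSink1 hdfsSink2
agent1.sinkgroups.group1.processor.type = failover
agent1.sinkgroups.group1.processor.priority.hdfsSink1 = 10
agent1.sinkgroups.group1.processor.priority.hdfsSink2 = 5
agent1.sinkgroups.group1.processor.maxpenalty = 10000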

14.10 Apache Flume - Data Flow


Flume is a framework which is used to move log data into HDFS. Generally
events and log data are generated by the log servers and these servers have
Flume agents running on them. These agents receive the data from the data
generators.

The data in these agents will be collected by an intermediate node known
as a Collector. Just like agents, there can be multiple collectors in Flume.
Finally, the data from all these collectors will be aggregated and pushed
to a centralized store such as HBase or HDFS. The following diagram explains
the data flow in Flume.

Figure 56: Apache Flume Data Flow

Multi-hop Flow
Within Flume, there can be multiple agents and before reaching the final
destination, an event may travel through more than one agent. This is known
as multi-hop flow.

Fan-out Flow
The dataflow from one source to multiple channels is known as fan-out flow. It
is of two types:

Replicating − The data flow where the data will be replicated in all the
configured channels.
Multiplexing − The data flow where the data will be sent to a selected channel
which is mentioned in the header of the event.

Fan-in Flow
The data flow in which the data will be transferred from many sources to one
channel is known as fan-in flow.

Failure Handling
In Flume, for each event, two transactions take place: one at the sender and one
at the receiver. The sender sends events to the receiver. Soon after receiving the
data, the receiver commits its own transaction and sends a “received” signal to
the sender. After receiving the signal, the sender commits its transaction.
(Sender will not commit its transaction till it receives a signal from the receiver.)

14.11 Apache Flume – Environment


We already discussed the architecture of Flume earlier in this chapter. In this
section, let us see how to download and set up Apache Flume.
Before proceeding further, you need to have a Java environment in your
system. So first of all, make sure you have Java installed in your system. For
some examples in this tutorial, we have used Hadoop HDFS (as sink). Therefore,
we recommend that you install Hadoop along with Java. For more
information, follow the link −
http://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm

Installing Flume
First of all, download the latest version of Apache Flume software from the
website https://flume.apache.org/.

Step 1 - Open the website. Click on the download link on the left-hand side of
the home page. It will take you to the download page of Apache Flume.

Figure 57: Apache Flume Download Screen

Step 2 - In the Download page, you can see the links for binary and source files
of Apache Flume. Click on the link apache-flume-1.6.0-bin.tar.gz
You will be redirected to a list of mirrors where you can start your
download by clicking any of these mirrors. In the same way, you can download
the source code of Apache Flume by clicking on apache-flume-1.6.0-src.tar.gz.

Step 3 - Create a directory with the name Flume in the same directory where
the installation directories of Hadoop, HBase, and other software were
installed (if you have already installed any) as shown below.
$ mkdir Flume

Step 4 - Extract the downloaded tar files as shown below.


$ cd Downloads/
$ tar zxvf apache-flume-1.6.0-bin.tar.gz
$ tar zxvf apache-flume-1.6.0-src.tar.gz

Step 5 - Move the contents of the apache-flume-1.6.0-bin directory to the Flume
directory created earlier as shown below. (Assume we have created the Flume
directory in the home of the local user named Hadoop.)
$ mv apache-flume-1.6.0-bin/* /home/Hadoop/Flume/

14.12 Apache Flume – Configuration


To configure Flume, we have to modify three files, namely flume-env.sh,
flume-conf.properties, and .bashrc.

Setting the Path / Classpath


In the .bashrc file, set the home folder, the path, and the classpath for Flume as
shown below.

Figure 58: Setting home folder, path and classpath for Flume
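The figure above shows these settings. A hedged sketch of the corresponding .bashrc entries is given below (the install path /home/Hadoop/Flume follows the assumption made in Step 5 earlier):

export FLUME_HOME=/home/Hadoop/Flume
export PATH=$PATH:$FLUME_HOME/bin
export CLASSPATH=$CLASSPATH:$FLUME_HOME/lib/*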

conf Folder
If you open the conf folder of Apache Flume, you will have the following four
files −
•	flume-conf.properties.template,
•	flume-env.sh.template,
•	flume-env.ps1.template, and
•	log4j.properties.

Figure 59: Flume Configuration


Now rename
•	flume-conf.properties.template as flume-conf.properties, and
•	flume-env.sh.template as flume-env.sh

flume-env.sh
Open the flume-env.sh file and set JAVA_HOME to the folder where Java is
installed on your system.
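For example (the JDK path below is a hypothetical placeholder; use the path of your own Java installation, as shown in the figure below):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64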

Figure 60: Java Home

Verifying the Installation


Verify the installation of Apache Flume by browsing through the bin folder and
typing the following command.
$ ./flume-ng

If you have successfully installed Flume, you will get a help prompt of
Flume as shown below.

Figure 61: Flume Help

Apache Flume – Configuration


After installing Flume, we need to configure it using the configuration file which
is a Java property file having key-value pairs. We need to pass values to the
keys in the file.
In the Flume configuration file, we need to
•	Name the components of the current agent.
•	Describe/Configure the source.
•	Describe/Configure the sink.
•	Describe/Configure the channel.
•	Bind the source and the sink to the channel.
Usually we can have multiple agents in Flume. We can differentiate each
agent by using a unique name. And using this name, we have to configure each
agent.

Naming the Components


First of all, you need to name/list the components such as sources, sinks, and the
channels of the agent, as shown below.
agent_name.sources = source_name
agent_name.sinks = sink_name
agent_name.channels = channel_name

Flume supports various sources, sinks, and channels. They are listed in
the table given below.

Sources                      Channels                     Sinks
Avro Source                  Memory Channel               HDFS Sink
Thrift Source                JDBC Channel                 Hive Sink
Exec Source                  Kafka Channel                Logger Sink
JMS Source                   File Channel                 Avro Sink
Spooling Directory Source    Spillable Memory Channel     Thrift Sink
Twitter 1% firehose Source   Pseudo Transaction Channel   IRC Sink
Kafka Source                                              File Roll Sink
NetCat Source                                             Null Sink
Sequence Generator Source                                 HBaseSink
Syslog Sources                                            AsyncHBaseSink
Syslog TCP Source                                         MorphlineSolrSink
Multiport Syslog TCP Source                               ElasticSearchSink
Syslog UDP Source                                         Kite Dataset Sink
HTTP Source                                               Kafka Sink
Stress Source
Legacy Sources
Thrift Legacy Source
Custom Source
Scribe Source

You can use any of them. For example, if you are transferring Twitter
data using the Twitter source through a memory channel to an HDFS sink, and the
agent name is TwitterAgent, then
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

After listing the components of the agent, you have to describe the
source(s), sink(s), and channel(s) by providing values to their properties.

Describing the Source


Each source will have a separate list of properties. The property named “type” is
common to every source, and it is used to specify the type of the source we are
using.
Along with the property “type”, it is needed to provide the values of all
the required properties of a particular source to configure it, as shown below.
agent_name.sources.source_name.type = value
agent_name.sources.source_name.property2 = value
agent_name.sources.source_name.property3 = value

For example, if we consider the Twitter source, the following are the properties to
which we must provide values in order to configure it.
TwitterAgent.sources.Twitter.type = Twitter (type name)
TwitterAgent.sources.Twitter.consumerKey =
TwitterAgent.sources.Twitter.consumerSecret =
TwitterAgent.sources.Twitter.accessToken =
TwitterAgent.sources.Twitter.accessTokenSecret =

Describing the Sink


Just like the source, each sink will have a separate list of properties. The
property named “type” is common to every sink, and it is used to specify the type
of the sink we are using. Along with the property “type”, it is needed to provide
values to all the required properties of a particular sink to configure it, as
shown below.
agent_name.sinks.sink_name.type = value
agent_name.sinks.sink_name.property2 = value
agent_name.sinks.sink_name.property3 = value

For example, if we consider the HDFS sink, the following are the properties to
which we must provide values in order to configure it.
TwitterAgent.sinks.HDFS.type = hdfs (type name)
TwitterAgent.sinks.HDFS.hdfs.path = Path of the HDFS directory where the data is to be stored

Describing the Channel


Flume provides various channels to transfer data between sources and sinks.
Therefore, along with the sources and the channels, it is needed to describe the
channel used in the agent.
To describe each channel, you need to set the required properties, as
shown below.
agent_name.channels.channel_name.type = value
agent_name.channels.channel_name.property2 = value
agent_name.channels.channel_name.property3 = value

For example, if we consider the memory channel, the following are the properties to
which we must provide values in order to configure it.
TwitterAgent.channels.MemChannel.type = memory (type name)

Binding the Source and the Sink to the Channel


Since the channels connect the sources and sinks, it is required to bind both of
them to the channel, as shown below.
agent_name.sources.source_name.channels = channel_name
agent_name.sinks.sink_name.channel = channel_name

The following example shows how to bind the sources and the sinks to a channel.
Here, we consider twitter source, memory channel, and HDFS sink.
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel

Starting a Flume Agent


After configuration, we have to start the Flume agent. It is done as follows −
$ bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf
-Dflume.root.logger=DEBUG,console -n TwitterAgent

where −
agent − Command to start the Flume agent
--conf, -c <conf> − Use the configuration files in the conf directory
-f <file> − Specifies the configuration file path
--name, -n <name> − Name of the agent (here, TwitterAgent)
-D property=value − Sets a Java system property value

14.13 Apache Flume - Fetching Twitter Data


Using Flume, we can fetch data from various services and transport it to
centralized stores (HDFS and HBase). This chapter explains how to fetch data
from Twitter service and store it in HDFS using Apache Flume.
As discussed in Flume Architecture, a webserver generates log data and
this data is collected by an agent in Flume. The channel buffers this data to a
sink, which finally pushes it to centralized stores.
In the example provided in this chapter, we will create an application and
get the tweets from it using the experimental twitter source provided by Apache
Flume. We will use the memory channel to buffer these tweets and HDFS sink
to push these tweets into the HDFS.

Figure 62: Flume Twitter Data Fetch

To fetch Twitter data, we will have to follow the steps given below −
•	Create a Twitter application
•	Install / Start HDFS
•	Configure Flume

14.14 Creating a Twitter Application


In order to get the tweets from Twitter, it is needed to create a Twitter
application. Follow the steps given below to create a Twitter application.

Step 1 - To create a Twitter application, click on the following link
https://apps.twitter.com/. Sign in to your Twitter account. You will have a
Twitter Application Management window where you can create, delete, and
manage Twitter Apps.

Figure 63: Creating Twitter Application

Step 2 - Click on the Create New App button. You will be redirected to a
window where you will get an application form in which you have to fill in
your details in order to create the App. While filling in the website address,
give the complete URL pattern, for example, http://example.com.

Figure 64: Application Details

Step 3 - Fill in the details, accept the Developer Agreement and, when finished,
click on the Create your Twitter application button which is at the bottom of
the page. If everything goes fine, an App will be created with the given details
as shown below.

Figure 65: Application Name

Step 4 - Under the Keys and Access Tokens tab at the bottom of the page,
you can observe a button named Create my access token. Click on it to
generate the access token.

Figure 66: Application Access Token

Step 5 - Finally, click on the Test OAuth button which is at the top right of
the page. This will lead to a page which displays your Consumer key,
Consumer secret, Access token, and Access token secret. Copy these
details. These are useful to configure the agent in Flume.

Figure 67: Application Overview

Starting HDFS
Since we are storing the data in HDFS, we need to install / verify Hadoop. Start
Hadoop and create a folder in it to store Flume data. Follow the steps given
below before configuring Flume.

Step 1: Install / Verify Hadoop


Install Hadoop. If Hadoop is already installed on your system, verify the
installation using the hadoop version command, as shown below.
$ hadoop version

If your system contains Hadoop, and if you have set the path variable, then you
will get the following output
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r
e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using /home/Hadoop/hadoop/share/hadoop/common/hadoop-common-
2.6.0.jar

Step 2: Starting Hadoop


Browse through the sbin directory of Hadoop and start yarn and Hadoop dfs
(distributed file system) as shown below.
cd /$Hadoop_Home/sbin/
$ start-dfs.sh
localhost: starting namenode, logging to
/home/Hadoop/hadoop/logs/hadoop-Hadoop-namenode-localhost.localdomain.out
localhost: starting datanode, logging to
/home/Hadoop/hadoop/logs/hadoop-Hadoop-datanode-localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
starting secondarynamenode, logging to
/home/Hadoop/hadoop/logs/hadoop-Hadoop-secondarynamenode-localhost.localdomain.out

$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to
/home/Hadoop/hadoop/logs/yarn-Hadoop-resourcemanager-localhost.localdomain.out
localhost: starting nodemanager, logging to
/home/Hadoop/hadoop/logs/yarn-Hadoop-nodemanager-localhost.localdomain.out

Step 3: Create a Directory in HDFS


In Hadoop DFS, you can create directories using the command mkdir. Browse
through it and create a directory with the name twitter_data in the required
path as shown below.
$cd /$Hadoop_Home/bin/
$ hdfs dfs -mkdir hdfs://localhost:9000/user/Hadoop/twitter_data
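If needed, you can confirm that the directory was created (a hedged check that uses the same path as above):
$ hdfs dfs -ls hdfs://localhost:9000/user/Hadoop/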

Configuring Flume
We have to configure the source, the channel, and the sink using the
configuration file in the conf folder. The example given in this chapter uses an
experimental source provided by Apache Flume named Twitter 1%
Firehose, a memory channel, and an HDFS sink.

Twitter 1% Firehose Source


This source is highly experimental. It connects to the 1% sample Twitter
Firehose using streaming API and continuously downloads tweets, converts
them to Avro format, and sends Avro events to a downstream Flume sink.
We will get this source by default along with the installation of Flume.
The jar files corresponding to this source can be located in the lib folder as
shown below.

Figure 68: Twitter Jarfiles

Setting the classpath


Set the classpath variable to the lib folder of Flume in the flume-env.sh file as
shown below.
export CLASSPATH=$CLASSPATH:$FLUME_HOME/lib/*

This source needs details such as the Consumer key, Consumer secret,
Access token, and Access token secret of a Twitter application.
While configuring this source, you have to provide values to the following
properties −

Channels − The channel(s) to which the source is bound
Source type − org.apache.flume.source.twitter.TwitterSource
consumerKey − The OAuth consumer key
consumerSecret − The OAuth consumer secret
accessToken − The OAuth access token
accessTokenSecret − The OAuth token secret

maxBatchSize − Maximum number of Twitter messages that should be in a
Twitter batch. The default value is 1000 (optional).
maxBatchDurationMillis − Maximum number of milliseconds to wait before
closing a batch. The default value is 1000 (optional).

Channel
We are using the memory channel. To configure the memory channel,
you must provide a value for the type of the channel.

type − It holds the type of the channel. In our example, the type is memory
(the channel itself is named MemChannel).

capacity − It is the maximum number of events stored in the channel. Its
default value is 100 (optional).

transactionCapacity − It is the maximum number of events the channel
accepts or sends per transaction. Its default value is 100 (optional).

HDFS Sink
This sink writes data into the HDFS. To configure this sink, you must provide
the following details.

channel − The channel from which the sink consumes events
type − hdfs
hdfs.path − The path of the directory in HDFS where the data is to be stored
And we can provide some optional values based on the scenario. Given
below are the optional properties of the HDFS sink that we are configuring in
our application.

fileType − This is the required file format of our HDFS file. SequenceFile,
DataStream, and CompressedStream are the three types available with this sink.
In our example, we are using DataStream.
writeFormat − Could be either Text or Writable.
batchSize − It is the number of events written to a file before it is flushed
into HDFS. Its default value is 100.

rollSize − It is the file size (in bytes) that triggers a roll. Its default
value is 100.
rollCount − It is the number of events written into the file before it is
rolled. Its default value is 10.

Example – Configuration File


Given below is an example of the configuration file. Copy this content and save
as twitter.conf in the conf folder of Flume.
# Naming the components on the current agent.
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Describing/Configuring the source


TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = Your OAuth consumer key
TwitterAgent.sources.Twitter.consumerSecret = Your OAuth consumer secret
TwitterAgent.sources.Twitter.accessToken = Your OAuth access token
TwitterAgent.sources.Twitter.accessTokenSecret = Your OAuth access token secret
TwitterAgent.sources.Twitter.keywords = tutorials point,java, bigdata, mapreduce, mahout,
hbase, nosql

# Describing/Configuring the sink

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/Hadoop/twitter_data/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

# Describing/Configuring the channel

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

# Binding the source and sink to the channel


TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel

Execution
Browse through the Flume home directory and execute the application as shown
below.
$ cd $FLUME_HOME
$ bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf
-Dflume.root.logger=DEBUG,console -n TwitterAgent

If everything goes fine, the streaming of tweets into HDFS will start. Given
below is the snapshot of the command prompt window while fetching tweets.

Figure 69: Data Fetching


Verifying HDFS
You can access the Hadoop Administration Web UI using the URL given below.
http://localhost:50070/

Click on the dropdown named Utilities on the right-hand side of the page. You
can see two options as shown in the snapshot given below.

Figure 70: Hadoop WebUI

Click on Browse the file system and enter the path of the HDFS
directory where you have stored the tweets. In our example, the path will
be /user/Hadoop/twitter_data/. Then, you can see the list of twitter log files
stored in HDFS as given below.

Figure 71: Twitter Logs Files
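You can also list these files from the command line (a hedged check using the same HDFS path):
$ hdfs dfs -ls /user/Hadoop/twitter_data/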

Apache Flume - NetCat Source


This chapter takes an example to explain how you can generate events and
subsequently log them into the console. For this, we are using
the NetCat source and the logger sink.

Prerequisites
To run the example provided in this chapter, you need to install Flume.

Configuring Flume
We have to configure the source, the channel, and the sink using the
configuration file in the conf folder. The example given in this chapter uses
a NetCat Source, Memory channel, and a logger sink.

NetCat Source
While configuring the NetCat source, we have to specify a port. The source
(NetCat source) then listens to the given port, receives each line we enter
on that port as an individual event, and transfers it to the sink through the
specified channel.

While configuring this source, you have to provide values to the following
properties −
channels − The channel(s) to which the source is bound
Source type − netcat
bind − Host name or IP address to bind to
port − Port number on which we want the source to listen

Channel
We are using the memory channel. To configure the memory channel,
you must provide a value for the type of the channel. Given below is the list of
properties that you need to supply while configuring the memory channel −

type − It holds the type of the channel. In our example, the type is memory
(the channel itself is named MemChannel).

capacity − It is the maximum number of events stored in the channel. Its
default value is 100 (optional).

transactionCapacity − It is the maximum number of events the channel
accepts or sends per transaction. Its default value is 100 (optional).

Logger Sink
This sink logs all the events passed to it. Generally, it is used for testing or
debugging purposes. To configure this sink, you must provide the following
details.

channel − The channel from which the sink consumes events
type − logger

Example Configuration File


Given below is an example of the configuration file. Copy this content and save
as netcat.conf in the conf folder of Flume.
# Naming the components on the current agent
NetcatAgent.sources = Netcat
NetcatAgent.channels = MemChannel
NetcatAgent.sinks = LoggerSink

# Describing/Configuring the source


NetcatAgent.sources.Netcat.type = netcat
NetcatAgent.sources.Netcat.bind = localhost
NetcatAgent.sources.Netcat.port = 56565
# Describing/Configuring the sink
NetcatAgent.sinks.LoggerSink.type = logger

# Describing/Configuring the channel


NetcatAgent.channels.MemChannel.type = memory
NetcatAgent.channels.MemChannel.capacity = 1000
NetcatAgent.channels.MemChannel.transactionCapacity = 100

# Bind the source and sink to the channel


NetcatAgent.sources.Netcat.channels = MemChannel
NetcatAgent.sinks.LoggerSink.channel = MemChannel

Execution
Browse through the Flume home directory and execute the application as shown
below.
$ cd $FLUME_HOME
$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/netcat.conf
--name NetcatAgent -Dflume.root.logger=INFO,console

If everything goes fine, the source starts listening to the given port. In
this case, it is 56565. Given below is the snapshot of the command prompt
window of a NetCat source which has started and listening to the port 56565.

Figure 72: Netcat Console

Passing Data to the Source


To pass data to NetCat source, you have to open the port given in the
configuration file. Open a separate terminal and connect to the source (56565)
using the curl command. When the connection is successful, you will get a
message “connected” as shown below.
$ curl telnet://localhost:56565

Connected

Now you can enter your data line by line (after each line, you have to press
Enter). The NetCat source receives each line as an individual event and you
will get a received message “OK”.
Whenever you are done with passing data, you can exit the console by
pressing (Ctrl+C). Given below is the snapshot of the console where we have
connected to the source using the curl command.

Figure 73: Passing Data to Source

Each line that is entered in the above console will be received as an
individual event by the source. Since we have used the Logger sink, these
events will be logged on to the console (source console) through the
specified channel (memory channel in this case).
The following snapshot shows the NetCat console where the events are
logged.

Figure 74: Netcat Console

CHAPTER 15
Apache Spark

Spark was introduced by Apache Software Foundation for speeding up the
Hadoop computational computing software process.
Contrary to common belief, Spark is not a modified version of
Hadoop and is not really dependent on Hadoop, because it has its own cluster
management. Hadoop is just one of the ways to implement Spark.
Spark uses Hadoop in two ways – one is storage and the second
is processing. Since Spark has its own cluster management computation, it
uses Hadoop for storage purposes only.

15.1 Apache Spark


Apache Spark is a lightning-fast cluster computing technology, designed for fast
computation. It is based on Hadoop MapReduce and it extends the MapReduce
model to efficiently use it for more types of computations, which includes
interactive queries and stream processing. The main feature of Spark is its in-
memory cluster computing that increases the processing speed of an
application.
Spark is designed to cover a wide range of workloads such as batch
applications, iterative algorithms, interactive queries, and streaming. Apart from
supporting all these workloads in a single system, it reduces the
management burden of maintaining separate tools.

15.2 Evolution of Apache Spark


Spark is one of Hadoop’s sub-projects, developed in 2009 in UC Berkeley’s
AMPLab by Matei Zaharia. It was open sourced in 2010 under a BSD license. It
was donated to the Apache Software Foundation in 2013, and Apache Spark
became a top-level Apache project in February 2014.

15.3 Features of Apache Spark


Apache Spark has the following features.

Speed − Spark helps run an application in a Hadoop cluster up to 100 times
faster in memory, and 10 times faster when running on disk. This is possible by
reducing the number of read/write operations to disk. It stores the intermediate
processing data in memory.

Supports multiple languages − Spark provides built-in APIs in Java, Scala,
and Python. Therefore, you can write applications in different languages. Spark
also comes with 80 high-level operators for interactive querying.

Advanced Analytics − Spark not only supports ‘Map’ and ‘Reduce’. It also
supports SQL queries, streaming data, machine learning (ML), and graph
algorithms.

15.4 Spark Built on Hadoop


The following diagram shows three ways of how Spark can be built with Hadoop
components.

Figure 75: Spark Architecture

There are three ways of Spark deployment as explained below.

Standalone − Spark Standalone deployment means Spark occupies the place on
top of HDFS (Hadoop Distributed File System) and space is allocated for HDFS
explicitly. Here, Spark and MapReduce run side by side to cover all Spark
jobs on the cluster.

Hadoop Yarn − Hadoop Yarn deployment means, simply, that Spark runs on Yarn
without any pre-installation or root access required. It helps to integrate Spark
into the Hadoop ecosystem or Hadoop stack. It allows other components to run on
top of the stack.

Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch a Spark
job in addition to standalone deployment. With SIMR, the user can start Spark and
use its shell without any administrative access.

15.5 Components of Spark


The following illustration depicts the different components of Spark.

Figure 76: Spark Component

15.6 Apache Spark Core


Spark Core is the underlying general execution engine for the Spark platform that
all other functionality is built upon. It provides in-memory computing and the
ability to reference datasets in external storage systems.

Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data
abstraction called SchemaRDD, which provides support for structured and semi-
structured data.

Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform
streaming analytics. It ingests data in mini-batches and performs RDD
(Resilient Distributed Datasets) transformations on those mini-batches of data.

MLlib (Machine Learning Library)


MLlib is a distributed machine learning framework built on top of Spark, taking
advantage of Spark's distributed memory-based architecture. According to
benchmarks done by the MLlib developers against the Alternating Least Squares
(ALS) implementation, Spark MLlib is nine times as fast as the Hadoop disk-based
version of Apache Mahout (before Mahout gained a Spark interface).

GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides
an API for expressing graph computation that can model the user-defined

graphs by using the Pregel abstraction API. It also provides an optimized runtime
for this abstraction.

15.7 Resilient Distributed Datasets


Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark.
It is an immutable distributed collection of objects. Each dataset in RDD is
divided into logical partitions, which may be computed on different nodes of the
cluster. RDDs can contain any type of Python, Java, or Scala objects, including
user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. RDDs
can be created through deterministic operations on either data on stable storage
or other RDDs. RDD is a fault-tolerant collection of elements that can be
operated on in parallel.
There are two ways to create RDDs − parallelizing an existing collection
in your driver program, or referencing a dataset in an external storage
system, such as a shared file system, HDFS, HBase, or any data source offering
a Hadoop Input Format.
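As a brief, hedged sketch in the Scala spark-shell (the HDFS file path below is hypothetical), the two approaches look like this:

// 1. Parallelizing an existing collection in the driver program
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. Referencing a dataset in external storage (here, a hypothetical HDFS file)
val lines = sc.textFile("hdfs://localhost:9000/user/Hadoop/input.txt")

println(numbers.sum())   // acts on the parallelized collection
println(lines.count())   // counts the lines of the external dataset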
Spark makes use of the concept of RDD to achieve faster and efficient
MapReduce operations. Let us first discuss how MapReduce operations take
place and why they are not so efficient.

Data Sharing is Slow in MapReduce


MapReduce is widely adopted for processing and generating large datasets with
a parallel, distributed algorithm on a cluster. It allows users to write parallel
computations, using a set of high-level operators, without having to worry about
work distribution and fault tolerance.
Unfortunately, in most current frameworks, the only way to reuse data
between computations (Ex − between two MapReduce jobs) is to write it to an
external stable storage system (Ex − HDFS). Although this framework provides
numerous abstractions for accessing a cluster’s computational resources, users
still want more.
Both iterative and interactive applications require faster data sharing
across parallel jobs. Data sharing is slow in MapReduce due to replication,
serialization, and disk IO. Regarding the storage system, most Hadoop
applications spend more than 90% of their time doing HDFS read-write
operations.

Iterative Operations on MapReduce


Multi-stage applications reuse intermediate results across multiple
computations. The following illustration explains how the current framework
works while doing iterative operations on MapReduce. This incurs
substantial overheads due to data replication, disk I/O, and serialization, which
makes the system slow.

Figure 77: Iterative Operations on MapReduce

Interactive Operations on MapReduce


User runs ad-hoc queries on the same subset of data. Each query will do the disk
I/O on the stable storage, which can dominate the application execution time.
The following illustration explains how the current framework works
while doing the interactive queries on MapReduce.

Figure 78: Mapreduce Result

Data Sharing using Spark RDD


Data sharing is slow in MapReduce due to replication, serialization,
and disk IO. Most Hadoop applications spend more than 90% of their
time doing HDFS read-write operations.
Recognizing this problem, researchers developed a specialized framework
called Apache Spark. The key idea of Spark is the Resilient Distributed Dataset
(RDD); it supports in-memory processing computation. This means it stores the
state of memory as an object across the jobs, and the object is sharable between
those jobs. Data sharing in memory is 10 to 100 times faster than network and
disk.
Let us now try to find out how iterative and interactive operations take
place in Spark RDD.

Iterative Operations on Spark RDD


The illustration given below shows the iterative operations on Spark RDD. It
will store intermediate results in a distributed memory instead of Stable storage
(Disk) and make the system faster.

Note: If the distributed memory (RAM) is not sufficient to store intermediate
results (the state of the job), then it will store those results on the disk.

Figure 79: Iterative Operations on Spark RDD

Interactive Operations on Spark RDD


This illustration shows interactive operations on Spark RDD. If different queries
are run on the same set of data repeatedly, this particular data can be kept in
memory for better execution times.

Figure 80: Spark Working with RDD

By default, each transformed RDD may be recomputed each time you run
an action on it. However, you may also persist an RDD in memory, in which
case Spark will keep the elements around on the cluster for much faster access,
the next time you query it. There is also support for persisting RDDs on disk, or
replicated across multiple nodes.
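A hedged sketch of this in the spark-shell (reusing the hypothetical lines RDD from the earlier example):

// Keep the RDD in memory so repeated queries avoid re-reading from HDFS
lines.cache()
println(lines.filter(_.contains("error")).count())   // first action: reads the file and caches it
println(lines.filter(_.contains("warn")).count())    // later actions: served from memory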
Spark is Hadoop’s sub-project. Therefore, it is better to install Spark on a
Linux-based system. The following steps show how to install Apache Spark.

15.8 Spark Installation


Step 1: Verifying Java Installation
Java installation is one of the mandatory things in installing Spark. Try the
following command to verify the JAVA version.
$ java -version

If Java is already installed on your system, you will see the following response
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

In case you do not have Java installed on your system, then install Java before
proceeding to the next step.

Step 2: Verifying Scala installation


You need the Scala language to implement Spark. So let us verify the Scala
installation using the following command.
$ scala -version

If Scala is already installed on your system, you will see the following response
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

In case you don’t have Scala installed on your system, then proceed to the next
step for Scala installation.

Step 3: Downloading Scala


Download the latest version of Scala by visiting the following link: Download Scala.
For this tutorial, we are using the scala-2.11.6 version. After downloading, you will
find the Scala tar file in the download folder.

Step 4: Installing Scala


Follow the steps given below for installing Scala.
Extract the Scala tar file
Type the following command for extracting the Scala tar file.
$ tar xvf scala-2.11.6.tgz

Move Scala software files


Use the following commands for moving the Scala software files to the respective
directory (/usr/local/scala).
$ su –
Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit

Set PATH for Scala


Use the following command for setting PATH for Scala.
$ export PATH=$PATH:/usr/local/scala/bin

Verifying Scala Installation


After installation, it is better to verify it. Use the following command for
verifying Scala installation.
$ scala -version

If Scala is already installed on your system, you will see the following response
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

Step 5: Downloading Apache Spark


Download the latest version of Spark by visiting the following link: Download
Spark. For this tutorial, we are using the spark-1.3.1-bin-hadoop2.6 version. After
downloading it, you will find the Spark tar file in the download folder.

Step 6: Installing Spark


Follow the steps given below for installing Spark.
Extracting the Spark tar file
The following command is for extracting the Spark tar file.
$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz

Moving Spark software files


The following commands move the Spark software files to the respective
directory (/usr/local/spark).
$ su –
Password:

# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit

Setting up the environment for Spark


Add the following line to the ~/.bashrc file. It adds the location where the
Spark software files are located to the PATH variable.
export PATH=$PATH:/usr/local/spark/bin

Use the following command for sourcing the ~/.bashrc file.


$ source ~/.bashrc

Step 7: Verifying the Spark Installation


Write the following command to open the Spark shell.
$ spark-shell

If Spark is installed successfully, you will see the following output.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled;
ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions:
Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.4.0
/_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
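Once the scala> prompt appears, you can run a quick sanity check. The one-liner below is only a hedged example (any small computation will do, and the exact result line may be formatted slightly differently):
scala> sc.parallelize(1 to 100).reduce(_ + _)
res0: Int = 5050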
