No part of this book may be reproduced or distributed in any form or by any electronic
or mechanical means, including information storage and retrieval systems, without
permission in writing from the management.
Contents
Chapters Topic Page
1. Big Data 10 - 11
1.1 Big Data 10
1.2 How Data Converts into Big Data 10
1.3 Problems with Big Data 11
2. Hadoop Introduction 12 - 13
2.1 What, When, Why Hadoop? 12
2.2 Modules of Hadoop 12
2.3 Advantages of Hadoop 13
3. HDFS Services 14 - 17
3.1 Namenode 14
3.2 Secondary Namenode 15
3.3 Data Node 15
3.4 Job Tracker 16
3.5 Task Tracker 17
4. Hadoop Admin 18 - 36
4.1 Linux Basic Commands 18
4.2 Some Basic Hadoop Shell Commands 24
4.3 Hadoop Installation 27
4.4 Hadoop Installation & File Configuration 28
4.5 Hadoop Modes 30
4.6 Hadoop Architecture 31
4.7 YARN – Application Startup 32
4.8 Input Splits 33
4.9 Rack Awareness 34
4.10 Hadoop Rack Awareness Vs. Hadoop Namenode 34
4.11 Why We Use Hadoop Rack Awareness 35
4.12 What is a Rack in Hadoop 36
4.13 What is Rack Awareness in Hadoop 36
5. Hadoop Namespace 37 - 39
5.1 Block 37
5.2 Metadata 38
5.3 Namespace 38
5.4 Namespace Issue 39
6. Data Replication 40 - 41
6.1 File Placement 40
6.2 Data Replication 40
6.3 Block Replication 41
6.4 Replication Factor 41
7. Communication 42 - 44
7.1 Name node – Data node 42
7.2 Data Communication 42
7.3 Heart Beat 42
7.4 Block Report 43
8. Failure Management 45 - 48
8.1 Check Point 45
8.2 FSImage 45
8.3 EditLog 46
8.4 Backup Node 46
8.5 Block Scanner 47
8.6 Failure Type 48
9. Map Reduce 49 - 58
9.1 What is MapReduce? 49
9.2 The Algorithm 49
9.3 Inputs & Outputs (Java Perspective) 51
9.4 Terminology 51
9.5 Important Commands 52
9.6 How to interact with MapReduce? 52
9.7 Mapreduce Program for Word Count 54
9.8 The Mapper 54
9.9 The Shuffle 55
9.10 The Reducer 55
9.11 Running the Hadoop Job 56
10. Hive 61 - 83
10.1 Hive Overview 61
10.2 Hive is Not 61
10.3 Merits of Hive 61
10.4 Architecture of Hive 61
10.5 Hive Installation 62
10.6 Hive Data Types 70
10.7 Create Database 72
10.8 Comparison of Hive with Other Databases for Retrieving Information 72
10.9 Metadata 73
10.10 Current SQL Compatibility 74
10.11 Hive DDL Commands 75
10.12 Hive DML Commands 75
10.13 Joins 76
10.14 Hive Bucket 80
10.15 Advantage with Hive Bucket 80
10.16 Creating a View 81
10.17 Dropping a View 82
10.18 Creating an Index 82
CHAPTER 1
Big Data
1.1 Big Data
Data that is very large in size is called Big Data. Normally we work on data of size MB (Word documents, Excel sheets) or at most GB (movies, code), but data of petabyte size, i.e., 10^15 bytes, is called Big Data. It is stated that almost 90% of today's data has been generated in the past 3 years, because we are continuously using more and more data-generating sources.
OR
Data that is beyond your storage capacity and beyond your processing power is called Big Data.
Social Networking Sites: Facebook, Google, and LinkedIn all generate huge amounts of data on a day-to-day basis, as they have billions of users worldwide.
E-Commerce Sites: Sites like Amazon, Flipkart, and Alibaba generate huge volumes of logs from which users' buying trends can be traced.
Weather Stations: All the weather stations and satellites give very large amounts of data, which are stored and manipulated to forecast the weather.
Telecom Companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly, and for this they store the data of their millions of users.
Share Market: Stock exchanges across the world generate huge amounts of data through their daily transactions.
Volume: The amount of data we deal with is very large, on the order of petabytes.
Use case: An e-commerce site XYZ (having 100 million users) wants to offer a gift voucher of $100 to its top 10 customers who have spent the most in the previous year. Moreover, it wants to find the buying trends of these customers so that the company can suggest more items related to them.
Solution
Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System), which uses commodity hardware to form clusters and store data in a distributed fashion. It works on the write once, read many times principle.
CHAPTER 2
Hadoop Introduction
2.1 What, When, Why Hadoop?
Apache Hadoop is an open-source framework that allows storing and processing of big data. Hadoop has its own cluster (a set of machines) built from commodity hardware, in which a number of machines work in a distributed way.
2.2 Modules of Hadoop
YARN: Yet Another Resource Negotiator, which is used for job scheduling and managing the cluster.
MapReduce: A framework which helps Java programs do parallel computation on data using key/value pairs. The Map task takes input data and converts it into a data set which can be computed over as key/value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result.
2.3 Advantages of Hadoop
Scalable: A Hadoop cluster can be extended by just adding nodes to the cluster.
Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost effective compared to a traditional relational database management system.
Resilient to failure: HDFS can replicate data over the network, so if one node goes down or some other network failure happens, Hadoop takes the other copy of the data and uses it. Normally, data is replicated three times, but the replication factor is configurable.
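For example, the replication factor of a file already stored in HDFS can be changed from the shell; the path below is a hypothetical example, and -w waits until the new replication factor is reached.
$ hadoop fs -setrep -w 2 /user/dexlab/sample.txt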
CHAPTER 3
HDFS Services
The Hadoop 1.x core daemons are the NameNode, Secondary NameNode, DataNode, JobTracker, and TaskTracker; in Hadoop 2.x, the JobTracker and TaskTracker are replaced by the ResourceManager and NodeManager.
Operations
Clients contact the NameNode in order to perform common file system operations, such as open, close, rename, and delete. The NameNode does not store HDFS data itself; rather, it maintains a mapping between each HDFS file name, the list of blocks in the file, and the DataNode(s) on which those blocks are stored. The system is designed in such a way that user data never flows through the NameNode.
The NameNode periodically receives a Heartbeat and a Block Report from each of the DataNodes present in the cluster. A Heartbeat from a DataNode means that the DataNode is functioning properly. A Block Report contains a list of all blocks on a DataNode.
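You can see which DataNodes are currently alive (i.e., sending heartbeats) with the dfsadmin report; the exact output depends on your cluster.
$ hadoop dfsadmin -report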
Namenode Format
When the NameNode is formatted a namespace ID is generated, which
essentially identifies that specific instance of the distributed filesystem. When
DataNodes first connect to the NameNode they store that namespace ID along
with the data blocks, because the blocks have to belong to a specific filesystem.
If a DataNode later connects to a NameNode, and the namespace ID
which the NameNode declares does not match the namespace ID stored on the
DataNode, it will refuse to operate with the "incompatible namespace ID" error.
It means that the DataNode has connected to a different NameNode, and the
blocks which it is storing don't belong to that distributed file system.
3.3 Data Node
Each block is saved as a separate file in the node's local file system. Because the DataNode abstracts away the details of the local storage arrangement, all nodes do not have to use the same local file system. Blocks are created or destroyed on DataNodes at the request of the NameNode, which validates and processes requests from clients. Although the NameNode manages the namespace, clients communicate directly with DataNodes in order to read or write data at the HDFS block level. A DataNode normally has no knowledge of HDFS files.
While starting up, it scans through the local file system and creates a list of
HDFS data blocks corresponding to each of these local files and sends this report
to the Name node.
Individual files are broken into blocks of a fixed size and distributed
across multiple DataNodes in the cluster. The Name Node maintains metadata
about the size and location of blocks and their replicas.
Hadoop was designed with an idea that DataNodes are "disposable
workers", servers that are fast enough to do useful work as a part of the cluster,
but cheap enough to be easily replaced if they fail.
The data block is stored on multiple computers, improving both resilience
to failure and data locality, taking into account that network bandwidth is a
scarce resource in a large cluster.
3.4 JobTracker
One of the master components, it is responsible for managing the overall
execution of a job. It performs functions such as scheduling child tasks
(individual Mapper and Reducer) to individual nodes, keeping track of the health
of each task and node, and even rescheduling failed tasks. As we will soon
demonstrate, like the NameNode, the Job Tracker becomes a bottleneck when it
comes to scaling Hadoop to very large clusters. The JobTracker daemon is
responsible for launching and monitoring MapReduce jobs.
The JobTracker process runs on a separate node, and not usually on a DataNode.
JobTracker is an essential Daemon for MapReduce execution in
MRv1. It is replaced by ResourceManager/ApplicationMaster in
MRv2.
JobTracker receives the requests for MapReduce execution from the
client.
JobTracker talks to the NameNode to determine the location of the
data.
JobTracker finds the best TaskTracker nodes to execute tasks based
on the data locality (proximity of the data) and the available slots to
execute a task on a given node.
CHAPTER 4
Hadoop Admin
4.1 Linux Basic Commands
tar command examples
Extract from an existing tar archive.
$ tar xvf archive_name.tar
grep command examples
Print the matched line, along with the 3 lines after it.
$ grep -A 3 -i "example" demo_text
ls command examples
Display file sizes in human-readable format (e.g., KB, MB, etc.)
$ ls -lh
Order files based on last modified time (in reverse order) using ls -ltr
$ ls -ltr
cd command examples
Use cd - to toggle between the last two directories
$ cd -
shutdown command examples
Shut down the system and turn the power off immediately.
$ shutdown -h now
free command examples
If you want to quickly check how many GB of RAM your system has, use the -g option; -b displays the output in bytes, -k in kilobytes, and -m in megabytes.
$ free -g
If you want to see the total memory (including swap), use the -t switch, which will display a total line as shown below.
dexlab@dexlab-laptop:~$ free -t
rm command examples
Get confirmation before removing the file
$ rm -i filename.txt
It is very useful while giving shell metacharacters in the file name argument.
Print the filename and get confirmation before removing the file.
$ rm -i file*
The following example recursively removes all files and directories under the example directory. This also removes the example directory itself.
$ rm -r example
cp command examples
Copy file1 to file2 preserving the mode, ownership and timestamp
$ cp -p file1 file2
Copy file1 to file2. If file2 exists, prompt for confirmation before overwriting it.
$ cp -i file1 file2
mv command examples
Rename file1 to file2. If file2 exists, prompt for confirmation before overwriting it.
$ mv -i file1 file2
Note:
mv -f is just the opposite, which will overwrite file2 without prompting.
mv -v will print what is happening during file rename, which is useful while
specifying shell metacharacters in the file name argument.
$ mv -v file1 file2
cat command examples
While displaying the file, the following cat -n command will prepend the line number to each line of the output.
$ cat -n /etc/logrotate.conf
chmod command examples
Revoke all access for the group (i.e., read, write, and execute) on a specific file.
$ chmod g-rwx file.txt
Apply file permissions recursively to all the files in the sub-directories.
$ chmod -R ug+rwx dir-name
passwd command examples
The super user can use the passwd command to reset another user's password. This will not prompt for the current password of the user.
$ passwd dexlab
Remove the password for a specific user. The root user can disable the password for a specific user; once the password is disabled, the user can log in without entering a password.
$ passwd -d dexlab
mkdir command examples
Create nested directories using a single mkdir command. If any of these directories already exist, no error is displayed.
$ mkdir -p dir1/dir2/dir3/dir4/
uname command examples
Print system information such as the kernel name, hostname, and kernel release.
$ uname -a
whereis command examples
When you want to search for an executable in a path other than the default whereis paths, use the -B option with the path as its argument. The following searches for the executable lsmk in the /tmp directory and displays it, if it is available.
$ whereis -u -B /tmp -f lsmk
tail command examples
View the contents of a file in real time using tail -f. This is useful for viewing log files that keep growing. The command can be terminated with CTRL-C.
$ tail -f log-file
less command examples
Once you open a file using the less command, the following two keys are very helpful:
CTRL+F forward one window
CTRL+B backward one window
su command examples
Switch to a different user account using the su command. The super user can switch to any other user without entering their password.
$ su - dexlab
mysql command examples
If you want to specify the mysql root password on the command line itself, enter it immediately after -p (without any space), replacing YourPassword with the actual password:
$ mysql -u root -pYourPassword
4.2 Some Basic Hadoop Shell Commands
Report the amount of space used and available on the currently mounted filesystem.
$ hadoop fs -df hdfs:/
Count the number of directories, files, and bytes under the paths that match the specified file pattern.
$ hadoop fs -count hdfs:/
Add a sample text file from the local directory to a new directory in HDFS.
Create a new sample file:
$ vim sample.txt -> i -> "text" -> :wq
Put it into HDFS and inspect it:
$ hadoop fs -put /home/dexlab/pg/sample.txt /vivek
$ hadoop fs -ls /vivek
$ hadoop fs -cat /vivek/sample.txt
$ hadoop fs -put data/sample.txt /home/dexlab
Since /user/training is your home directory in HDFS, any command that does not
have an absolute path is interpreted as relative to that directory. The next
command will therefore list your home directory, and should show the items you
have just added there.
$ hadoop fs -ls
Finally, remove the entire retail directory and all of its contents in HDFS.
$ hadoop fs -rm -r hadoop/retail
Add the purchases.txt file from the local directory /home/dexlab/training/purchases.txt to the hadoop directory you created in HDFS.
$ hadoop fs -copyFromLocal /home/dexlab/training/purchases.txt hadoop/
To view the contents of your text file purchases.txt which is present in your
hadoop directory
$ hadoop fs -cat hadoop/purchases.txt
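To copy a file back from HDFS to the local filesystem, the reverse command is copyToLocal; the destination path below is just an example.
$ hadoop fs -copyToLocal hadoop/purchases.txt /home/dexlab/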
Java Installation
$ sudo apt-get install default-jdk
SSH Installation
SSH is used to interact with the master and slave computers without any password prompt. First of all, create a dexlab user on the master and slave systems.
$ useradd dexlab
$ passwd dexlab
To map the nodes, open the hosts file in the /etc/ directory on all the machines and put in each node's IP address along with its host name.
$ vi /etc/hosts
Set up SSH key in every node so that they can communicate among themselves
without password. Commands for the same are:
$ su dexlab
$ ssh-keygen -t rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub dexlab@dexlab-master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub dexlab_tp1@dexlab-slave-1
$ ssh-copy-id -i ~/.ssh/id_rsa.pub dexlab_tp2@dexlab-slave-2
$ chmod 0600 ~/.ssh/authorized_keys
$ exit
Hadoop Installation
Hadoop can be downloaded from:
$ wget http://apache.claz.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
Append the following Hadoop environment variables to the end of the ~/.bashrc file, then save and exit.
export JAVA_HOME=/usr/
export HADOOP_INSTALL=/home/dexlab/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
After this, format the name node and start all the daemons:
$ su dexlab
$ cd /home/dexlab/hadoop
$ bin/hadoop namenode -format
$ start-all.sh
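If the daemons came up, jps should list them; the exact set of daemon names below assumes a pseudo-distributed Hadoop 2.x setup, so your output may differ.
$ jps
2345 NameNode
2456 DataNode
2567 SecondaryNameNode
2678 ResourceManager
2789 NodeManager
2890 Jps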
4.6 Hadoop Architecture
The HDFS federation provides permanent, reliable, and distributed storage. This is typically used for storing inputs and outputs (but not intermediate ones). Alternative storage solutions exist; for instance, Amazon uses the Simple Storage Service (S3).
The MapReduce Framework is the software layer implementing the MapReduce paradigm.
The YARN infrastructure and the HDFS federation are completely decoupled and independent: the first provides resources for running an application, while the second provides storage. The MapReduce framework is only one of many possible frameworks that run on top of YARN (although it is currently the only one implemented).
Figure 5: YARN
4.8 Input Splits
In cases where the last record in a block is incomplete, the input split includes location information for the next block and the byte offset of the data needed to complete the record. The figure shows this relationship between data blocks and input splits.
4.9 Rack Awareness
Hadoop rack awareness lets you manually define the rack number of each slave DataNode in the cluster. The rack numbers are defined manually so that data loss can be prevented and network performance enhanced: each block of data is transmitted to multiple machines on different racks, so that if some machine fails, we do not lose all copies of the data.
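Rack IDs are usually supplied to the NameNode through a user-provided topology script, configured via the net.topology.script.file.name property in core-site.xml (this property name applies to Hadoop 2.x). A minimal sketch of such a script, with made-up subnets, looks like:
#!/bin/bash
# Print one rack path per argument; the NameNode passes DataNode IPs/hostnames.
for node in "$@"; do
  case $node in
    192.168.1.*) echo "/rack1" ;;
    192.168.2.*) echo "/rack2" ;;
    *)           echo "/default-rack" ;;
  esac
done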
CHAPTER 5
Hadoop Namespace
5.1 Block
A disk has a block size, which is the minimum amount of data that it can read or
write. Filesystems for a single disk build on this by dealing with data in blocks,
which are an integral multiple of the disk block size. Filesystem blocks are
typically a few kilobytes in size, whereas disk blocks are normally 512 bytes.
This is generally transparent to the filesystem user who is simply reading or
writing a file of whatever length.
Having a block abstraction for a distributed filesystem brings several
benefits. The first benefit is the most obvious: a file can be larger than any single
disk in the network. There’s nothing that requires the blocks from a file to be
stored on the same disk, so they can take advantage of any of the disks in the
cluster. Like in a filesystem for a single disk, files in HDFS are broken into
block-sized chunks, which are stored as independent units. HDFS, too, has the
concept of a block, but it is a much larger unit—64 MB by default. HDFS blocks
are large compared to disk blocks, and the reason is to minimize the cost of
seeks. By making a block large enough, the time to transfer the data from the
disk can be significantly longer than the time to seek to the start of the block.
Thus the time to transfer a large file made of multiple blocks operates at the disk
transfer rate.
A file can be made up of several blocks, which are stored on different
DataNodes chosen randomly on a block-by-block basis. As a result, access to a
file usually requires access to multiple DataNodes, which means that HDFS
supports file sizes far larger than a single-machine disk capacity. The DataNode
stores each HDFS data block in a separate file on its local filesystem with no
knowledge about the HDFS files themselves.
In fact, it would be possible, if unusual, to store a single file on an HDFS
cluster whose blocks filled all the disks in the cluster.
The default block size and replication factor are specified by the Hadoop configuration, but can be overridden on a per-file basis. An application can specify the block size and the number of replicas for a specific file at its creation time.
There are tools to perform filesystem maintenance, such as df and fsck,
that operate on the filesystem block level.
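For example, fsck can report the blocks and block locations of a particular file; the path below is hypothetical.
$ hadoop fsck /user/dexlab/sample.txt -files -blocks -locations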
5.2 MetaData
Because of the relatively low amount of metadata per file (it only tracks
filenames, permissions, and the locations of each block), the NameNode stores all
of the metadata in the main memory, thus allowing for a fast random access. The
metadata storage is designed to be compact. As a result, a NameNode with 4 GB
of RAM is capable of supporting a huge number of files and directories.
Modern distributed and parallel file systems such as pNFS, PVFS, HDFS, and GoogleFS treat metadata services as an independent system component, separate from data servers. A reason behind this separation is to ensure that
metadata access does not obstruct the data access path. Another reason is design
simplicity and the ability to scale the two parts of the system independently.
Files and directories are represented on the NameNode by inodes, which record attributes like permissions, modification and access times, and namespace and disk space quotas. The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. HDFS keeps the entire namespace in RAM.
Metadata is the most important management information replicated for NameNode failover. In a typical failover solution, the metadata includes initial metadata, which is replicated in an initialization phase, and two types of runtime metadata, which are replicated in a replication phase. The initial metadata includes two types of files: the version file, which contains the version information of the running HDFS, and the file system image (fsimage) file, which is a persistent checkpoint of the file system. Both files are replicated only once, in the initialization phase, because their replication is a time-intensive process. The slave node updates its fsimage file based on runtime metadata to make the file catch up with that of the primary node.
The name node has an in-memory data structure called FsImage that
contains the entire file system namespace and maps the files on to blocks. The
NameNode stores all Metadata in a file called FsImage.
5.3 NameSpace
Traditional local file systems support a persistent namespace. A local file system views devices as being locally attached; the devices are not shared, and hence there is no need in the file system design to enforce device sharing semantics.
HDFS supports a traditional hierarchical file organization. The file system
namespace hierarchy is similar to most other existing file systems; one can
create and remove files, move a file from one directory to another, or rename a
file. The NameNode exposes a file system namespace and allows data to be
stored on a cluster of nodes while allowing the user a single system view of the
file system. HDFS exposes a hierarchical view of the file system with files stored
in directories, and directories can be nested. The NameNode is responsible for
managing the metadata for the files and directories. The current HDFS
architecture allows only a single namespace for the entire cluster. This
namespace is managed by a single namenode. This architectural decision made
HDFS simpler to implement.
Files can be organized under directories, which together form the namespace of a file system. A file system typically organizes its namespace in a tree-structured hierarchy. A distributed file system is a file system that allows access to files from multiple hosts across a network.
A user or an application can create directories and store files inside these
directories.
Namespace partitioning has been a research topic for a long time, and
several methods have been proposed to solve this problem in academia. These
can be generally categorized into four types:
1. Static Subtree Partitioning
2. Hashing
3. Lazy Hybrid
4. Dynamic Subtree Partitioning
CHAPTER 6
Data Replication
6.1 File Placement
HDFS uses replication to maintain at least three copies (one primary and two replicas) of every chunk. Applications that require more copies can specify a higher replication factor, typically at file create time. All copies of a chunk are stored on different data nodes using a rack-aware replica placement policy. The first copy is always written to the local storage of a data node to lighten the load on the network. To handle machine failures, the second copy is placed at random on a different data node on the same rack as the data node that stored the first copy. This improves network bandwidth utilization, because intra-rack communication is faster than cross-rack communication, which often goes through intermediate network switches. To maximize data availability in case of a rack failure, HDFS stores a third copy at random on data nodes in a different rack.
HDFS uses a random chunk layout policy to map the chunks of a file onto different data nodes. At file create time, the name node randomly selects a data node to store a chunk. This random selection may often lead to a sub-optimal file layout that is not uniformly load balanced. The name node is responsible for maintaining the chunk-to-data-node mapping, which is used by clients to access the desired chunk.
CHAPTER 7
Communication
7.1 NameNode-DataNode
DataNode and NameNode connections are established by a handshake, in which the namespace ID and the software version of the DataNode are verified. The namespace ID is assigned to the file system instance when it is formatted, and it is stored persistently on all nodes of the cluster. A node with a different namespace ID cannot join the cluster.
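The namespace ID can be inspected in the VERSION file under the name node's storage directory. The directory below assumes the dfs.name.dir setting used later in this book (/home/dexlab/hadoop/hdfs/namenode), and the values shown are only illustrative.
$ cat /home/dexlab/hadoop/hdfs/namenode/current/VERSION
namespaceID=1778616660
storageType=NAME_NODE
layoutVersion=-60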
CHAPTER 8
Failure Management
8.1 Checkpoint
A checkpoint is an image record written persistently to disk. The NameNode uses two types of files to persist its namespace:
FsImage: a persistent checkpoint of the file system metadata.
Edits: logs containing changes to the namespace; these logs are also called journals.
The NameNode keeps an image of the entire file system namespace and
file Blockmap in memory. This key metadata item is designed to be compact,
such that a NameNode with 4 GB of RAM is plenty to support a huge number of
files and directories. When the NameNode starts up, it reads the FsImage and
EditLog from disk, applies all the transactions from the EditLog to the in-
memory representation of the FsImage, and flushes out this new version into a
new FsImage on disk. It can then truncate the old EditLog because its
transactions have been applied to the persistent FsImage. This process is called
a checkpoint.
The Checkpoint node uses the fs.checkpoint.period parameter to set the interval between two consecutive checkpoints. The interval is given in seconds (default: 3600 seconds). The maximum edit log file size is specified by the fs.checkpoint.size parameter (default: 64 MB), and a checkpoint is triggered if that size is exceeded. Multiple checkpoint nodes may be specified in the cluster configuration file.
8.2 FSImage
The entire filesystem namespace is contained in a file called the FsImage stored
as a file in the NameNode’s local filesystem. The image file represents an HDFS
metadata state at a point in time.
The entire file system namespace, including the mapping of blocks to files
and file system properties, is stored in a file called the FsImage. The FsImage is
stored as a file in the NameNode's local file system. On restart, the NameNode creates up-to-date file system metadata by merging the two files, i.e., fsimage and edits. The NameNode then overwrites fsimage with the new HDFS state and begins a new edits journal. The Checkpoint node periodically downloads the latest fsimage and edits from the active NameNode, creates checkpoints by merging them locally, and then uploads the new checkpoints back to the active NameNode. This requires the same memory space as the NameNode, so checkpointing needs to run on a separate machine. Namespace information is lost if either the checkpoint or the journal is missing, so it is highly recommended to configure HDFS to store the checkpoint and journal in multiple storage directories.
The fsimage file is a persistent checkpoint of the filesystem metadata.
However, it is not updated for every filesystem write operation, because writing
out the fsimage file, which can grow to be gigabytes in size, would be very slow.
This does not compromise resilience, however, because if the namenode fails,
then the latest state of its metadata can be reconstructed by loading the fsimage
from disk into memory, and then applying each of the operations in the edit log.
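The on-disk fsimage can be inspected with the offline image viewer (hdfs oiv). The file name below is illustrative; actual fsimage files carry a transaction-ID suffix, and the XML processor shown assumes Hadoop 2.x.
$ hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml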
8.3 EditLog
The NameNode also uses a transaction log to persistently record every change
that occurs in filesystem metadata (metadata store). This log is stored in the
EditLog file on the NameNode’s local filesystem. Edit log is a transactional log of
every filesystem metadata change since the image file was created.
The Name Node uses a transaction log called the Edit Log to persistently
record every change that occurs to file system metadata. For example, creating a
new file in HDFS causes the Name Node to insert a record into the Edit Log
indicating this. Similarly, changing the replication factor of a file causes a new
record to be inserted into the Edit Log. The Name Node uses a file in its local
host OS file system to store the Edit Log.
8.4 Backup Node
The Backup node does not need to download fsimage and edits files from the
active NameNode in order to create a checkpoint, as would be required with a
Checkpoint node or Secondary NameNode, since it already has an up-to-date
state of the namespace state in memory. The Backup node checkpoint process is
more efficient as it only needs to save the namespace into the local fsimage file
and reset edits. As the Backup node maintains a copy of the namespace in
memory, its RAM requirements are the same as the NameNode. The NameNode
supports one Backup node at a time. No Checkpoint nodes may be registered if a
Backup node is in use. Using multiple Backup nodes concurrently will be
supported in the future.
The Backup node is configured in the same manner as the Checkpoint node. It is started with bin/hdfs namenode -backup. The location of the Backup (or Checkpoint) node and its accompanying web interface are configured via the dfs.backup.address and dfs.backup.http.address configuration variables.
Use of a Backup node provides the option of running the NameNode with no persistent storage, delegating all responsibility for persisting the state of the namespace to the Backup node. To do this, start the NameNode with the -importCheckpoint option, along with specifying no persistent storage directories of type edits (dfs.namenode.edits.dir) for the NameNode configuration.
8.6 Failure Types
The three common types of failures are:
NameNode failures
DataNode failures
Network partitions
Several things can cause loss of connectivity between the name node and the data nodes. Therefore, each data node is expected to send periodic heartbeat messages to its name node, and the name node detects the loss of connectivity when it stops receiving them. The name node marks data nodes that do not respond to heartbeats as dead and refrains from sending further requests to them. Data stored on a dead node is no longer available to an HDFS client from that node, and the node is effectively removed from the system.
CHAPTER 9
MapReduce
9.1 What is MapReduce?
MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). Second, the Reduce task takes the output from a Map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the Reduce task is always performed after the Map job.
The major advantage of MapReduce is that it is easy to scale data
processing over multiple computing nodes. Under the MapReduce model, the
data processing primitives are called mappers and reducers. Decomposing a
data processing application into mappers and reducers is sometimes nontrivial.
But, once we write an application in the MapReduce form, scaling the
application to run over hundreds, thousands, or even tens of thousands of
machines in a cluster is merely a configuration change. This simple scalability
is what has attracted many programmers to use the MapReduce model.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.
The framework manages all the details of data-passing such as issuing
tasks, verifying task completion, and copying data around the cluster
between the nodes.
Most of the computing takes place on nodes with data on local disks that
reduces the network traffic.
After completion of the given tasks, the cluster collects and reduces the
data to form an appropriate result, and sends it back to the Hadoop
server.
9.4 Terminology
PayLoad – Applications implement the Map and the Reduce functions, and form the core of the job.
Mapper – Maps the input key/value pairs to a set of intermediate key/value pairs.
NameNode – Node that manages the Hadoop Distributed File System (HDFS).
DataNode – Node where data is present in advance, before any processing takes place.
MasterNode – Node where the JobTracker runs and which accepts job requests from clients.
SlaveNode – Node where the Map and Reduce programs run.
JobTracker – Schedules jobs and tracks the assigned jobs with the TaskTracker.
TaskTracker – Tracks the tasks and reports status to the JobTracker.
Job – A program that executes a Mapper and Reducer across a dataset.
Task – An execution of a Mapper or a Reducer on a slice of data.
Task Attempt – A particular instance of an attempt to execute a task on a SlaveNode.
9.5 Important Commands
The following table lists the available options and their descriptions.
Option – Description
namenode -format – Formats the DFS filesystem.
secondarynamenode – Runs the DFS secondary namenode.
namenode – Runs the DFS namenode.
datanode – Runs a DFS datanode.
dfsadmin – Runs a DFS admin client.
mradmin – Runs a Map-Reduce admin client.
fsck – Runs a DFS filesystem checking utility.
fs – Runs a generic filesystem user client.
balancer – Runs a cluster balancing utility.
oiv – Applies the offline fsimage viewer to an fsimage.
fetchdt – Fetches a delegation token from the NameNode.
jobtracker – Runs the MapReduce JobTracker node.
pipes – Runs a Pipes job.
tasktracker – Runs a MapReduce TaskTracker node.
historyserver – Runs the job history server as a standalone daemon.
job – Manipulates MapReduce jobs.
queue – Gets information regarding JobQueues.
version – Prints the version.
jar <jar> – Runs a jar file.
distcp <srcurl> <desturl> – Copies a file or directories recursively.
distcp2 <srcurl> <desturl> – DistCp version 2.
archive -archiveName NAME -p <parent path> <src>* <dest> – Creates a hadoop archive.
classpath – Prints the class path needed to get the Hadoop jar and the required libraries.
daemonlog – Gets/sets the log level for each daemon.
GENERIC_OPTIONS – Description
-submit <job-file> – Submits the job.
-status <job-id> – Prints the map and reduce completion percentage and all job counters.
-counter <job-id> <group-name> <countername> – Prints the counter value.
-kill <job-id> – Kills the job.
-events <job-id> <fromevent-#> <#-of-events> – Prints the events' details received by the jobtracker for the given range.
-history [all] <jobOutputDir> – Prints job details, plus failed and killed tip details. More details about the job, such as successful tasks and the task attempts made for each task, can be viewed by specifying the [all] option.
-list [all] – Displays all jobs; -list alone displays only jobs which are yet to complete.
-kill-task <task-id> – Kills the task. Killed tasks are NOT counted against failed attempts.
-fail-task <task-id> – Fails the task. Failed tasks are counted against failed attempts.
-set-priority <job-id> <priority> – Changes the priority of the job. Allowed priority values are VERY_HIGH, HIGH, NORMAL, LOW, and VERY_LOW.
That is, Hadoop Streaming sends raw data to the mapper via stdin, then
sends the mapped key/value pairs to the reducer via stdin.
The mapper script begins with import sys so that it can read from standard input; its logic, sketched after the list below, proceeds as follows:
1. Hadoop sends a line of text from the input file (“line” being defined by a
string of text terminated by a linefeed character, \n)
2. Python strips all leading/trailing whitespace (line.strip())
3. Python splits that line into a list of individual words along whitespace
(line.split())
4. For each word (which will become a key), we assign a value of 1 and then
print the key-value pair on a single line, separated by a tab (\t)
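Putting these four steps together, a minimal word-count mapper consistent with the list above (and with the reducer shown later) would be:
#!/usr/bin/env python
import sys

# Read raw text from stdin; emit one "word<TAB>1" pair per word.
for line in sys.stdin:
    line = line.strip()
    keys = line.split()
    for key in keys:
        value = 1
        print( "%s\t%d" % (key, value) )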
A more detailed explanation of this process can be found in Yahoo’s excellent
Hadoop Tutorial.
Translating this into Python and adding a little extra code to tighten up
the logic, we get
#!/usr/bin/env python
import sys

last_key = None
running_total = 0

# Input arrives from the shuffle sorted by key, so all values for a key are adjacent.
for input_line in sys.stdin:
    input_line = input_line.strip()
    this_key, value = input_line.split("\t", 1)
    value = int(value)

    if last_key == this_key:
        running_total += value
    else:
        if last_key:
            print( "%s\t%d" % (last_key, running_total) )
        running_total = value
        last_key = this_key

if last_key == this_key:
    print( "%s\t%d" % (last_key, running_total) )
Before submitting the Hadoop job, you should make sure your mapper
and reducer scripts actually work.
This is just a matter of running them through pipes on a little bit of sample data
(e.g., the first 1000 lines of Moby Dick):
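For example, assuming mobydick.txt is in the current working directory, a quick local test of the whole pipeline looks like this:
$ head -n 1000 mobydick.txt | python mapper.py | sort | python reducer.py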
Once you know the mapper/reducer scripts work without errors, we can
plug them into Hadoop. We accomplish this by running the Hadoop Streaming
jar file as our Hadoop job. This hadoop-streaming-X.Y.Z.jar file comes with the
standard Apache Hadoop distribution and should be in $HADOOP_HOME/
contrib/streaming where $HADOOP_HOME is the base directory of your Hadoop
installation and X.Y.Z is the version of Hadoop you are running. On Gordon the
location is /opt/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar, so our
actual job launch command would look like
$ hadoop \
jar /opt/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar \
-mapper "python $PWD/mapper.py" \
-reducer "python $PWD/reducer.py" \
-input "wordcount/mobydick.txt" \
-output "wordcount/output"
packageJobJar: [/scratch/glock/819550.gordon-fe2.local/hadoop-glock/data/hadoop-
unjar4721749961014550860/] [] /tmp/streamjob7385577774459124859.jar tmpDir=null
13/07/17 19:26:16 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/07/17 19:26:16 WARN snappy.LoadSnappy: Snappy native library not loaded
13/07/17 19:26:16 INFO mapred.FileInputFormat: Total input paths to process : 1
13/07/17 19:26:16 INFO streaming.StreamJob: getLocalDirs(): [/scratch/glock/819550.gordon-
fe2.local/hadoop-glock/data/mapred/local]
13/07/17 19:26:16 INFO streaming.StreamJob: Running job: job_201307171926_0001
13/07/17 19:26:16 INFO streaming.StreamJob: To kill this job, run:
13/07/17 19:26:16 INFO streaming.StreamJob: /opt/hadoop/libexec/../bin/hadoop job -
Dmapred.job.tracker=gcn-13-34.ibnet0:54311 -kill job_201307171926_0001
13/07/17 19:26:16 INFO streaming.StreamJob: Tracking URL: http://gcn-13-
34.ibnet0:50030/jobdetails.jsp?jobid=job_201307171926_0001
And at this point, the job is running. That “tracking URL” is a bit
deceptive in that you probably won’t be able to access it. Fortunately, there is a
command-line interface for monitoring Hadoop jobs that is somewhat similar
to qstat. Noting the Hadoop jobid (job_201307171926_0001 in the output above), you can do:
$ hadoop job -status job_201307171926_0001
Job: job_201307171926_0001
file: hdfs://gcn-13-34.ibnet0:54310/scratch/glock/819550.gordon-fe2.local/hadoop-
glock/data/mapred/staging/glock/.staging/job_201307171926_0001/job.xml
tracking URL: http://gcn-13-34.ibnet0:50030/jobdetails.jsp?jobid=job_201307171926_0001
map() completion: 1.0
reduce() completion: 1.0
Counters: 30
Job Counters
Launched reduce tasks=1
SLOTS_MILLIS_MAPS=16037
Total time spent by all reduces waiting after reserving slots (ms)=0
Total time spent by all maps waiting after reserving slots (ms)=0
Launched map tasks=2
Data-local map tasks=2
...
Since the hadoop streaming job runs in the foreground, you will have to
use another terminal (with HADOOP_CONF_DIR properly exported) to check on
the job while it runs. However, you can also review the job metrics after the job
has finished. In the example highlighted above, we can see that the job only used
one reduce task and two map tasks despite the cluster having more than two
nodes.
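To list all jobs known to the JobTracker from that second terminal, you can also use the job -list subcommand:
$ hadoop job -list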
Adjusting Parallelism
Unlike with a traditional HPC job, the level of parallelism of a Hadoop job is not necessarily the full size of your compute resource. The number of map tasks is ultimately determined by the nature of your input data, due to how HDFS distributes chunks of data to your mappers. You can "suggest" a number of mappers when you submit the job, though. Doing so is a matter of adding the -D mapred.map.tasks option shown below:
$ hadoop jar /opt/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar \
-D mapred.map.tasks=4 \
-mapper "python $PWD/mapper.py" \
-reducer "python $PWD/reducer.py" \
-input wordcount/mobydick.txt \
-output wordcount/output
...
$ hadoop job -status job_201307172000_0001
...
Job Counters
Launched reduce tasks=1
SLOTS_MILLIS_MAPS=24049
Total time spent by all reduces waiting after reserving slots (ms)=0
Total time spent by all maps waiting after reserving slots (ms)=0
Rack-local map tasks=1
Launched map tasks=4
Data-local map tasks=3
With all this being said, the fact that Hadoop defaults to only two mappers says something about the problem we're trying to solve: this entire example is actually very stupid. While it illustrates the concepts quite neatly, counting words in a 1.2 MB file is a waste of time if done through Hadoop because, by default, Hadoop assigns chunks to mappers in increments of 64 MB.
Hadoop is meant to handle multi-gigabyte files, and actually getting Hadoop
streaming to do something useful for your research often requires a bit more
knowledge than what I’ve presented above.
To fill in these gaps, the next part of this tutorial, Parsing VCF Files with
Hadoop Streaming, shows how I applied Hadoop to solve a real-world problem
involving Python, some exotic Python libraries, and some not-completely-uniform
files.
CHAPTER 10
Hive
10.1 Hive Overview
What is Hive?
Hive is a data warehouse package used for processing, managing, and querying structured data in Hadoop. It eases the process of analyzing and summarizing big data. It acts as a platform for developing SQL-type scripts to perform MapReduce operations.
Hive was initially started by Facebook; after that, the Apache Software Foundation took it up and developed it further as an open-source project named Apache Hive. Several companies now use Hive. For example, Amazon uses it in Amazon Elastic MapReduce.
10.4 Architecture of Hive
User Interface: Hive is data warehouse infrastructure software that can create interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HD Insight (on Windows Server).
Meta Store: Hive chooses respective database servers to store the schema or metadata of tables, databases, and columns in a table, along with their data types and HDFS mappings.
HiveQL Process Engine: HiveQL is similar to SQL for querying schema information on the Metastore. It is one of the replacements for the traditional approach of writing a MapReduce program. Instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.
Execution Engine: The conjunction part of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates the same results as MapReduce. It uses the flavor of MapReduce.
HDFS or HBase: The Hadoop Distributed File System or HBase are the data storage techniques used to store data in the file system.
10.5 Hive Installation
If Java is already installed on your system, you will see the following response:
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
If Java is not installed on your system, then follow the steps given below to install it.
Installing Java
Step I: The following command is used to install Java:
$ sudo apt-get install default-jdk
Step II: For setting up PATH and JAVA_HOME variables, add the
following commands to ~/.bashrc file.
export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
Now apply all the changes into the current running system.
$ source ~/.bashrc
Step III: Now verify the installation using the command java -version from
the terminal as explained above.
If Hadoop is already installed on your system, then you will get the following response:
Hadoop 2.6.0
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
If Hadoop is not installed on your system, then proceed with the following steps:
Downloading Hadoop
Download and extract Hadoop 2.6.0 from Apache Software Foundation using the
following commands.
$ cd /usr/local
$ wget http://apache.claz.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
$ tar xzf hadoop-2.6.0.tar.gz
$ mv hadoop-2.6.0/* hadoop/
Now apply all the changes into the current running system.
$ source ~/.bashrc
In order to develop Hadoop programs using java, you have to reset the java
environment variables in hadoop-env.sh file by replacing JAVA_HOME
value with the location of java in your system.
export JAVA_HOME=/usr/
Given below is the list of files that you have to edit in order to configure Hadoop.
core-site.xml
The core-site.xml file contains information such as the port number used for
Hadoop instance, memory allocated for the file system, memory limit for storing
the data, and the size of Read/Write buffers.
Open the core-site.xml and add the following properties in between the
<configuration> and </configuration> tags.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
The hdfs-site.xml file contains information such as the replication value, the namenode path, and the datanode path of your local file systems, i.e., the places where you want to store the Hadoop infrastructure.
Let us assume the following data.
dfs.replication (data replication value) = 1
Open this file and add the following properties in between the <configuration>,
</configuration> tags in this file.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/dexlab/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/dexlab/hadoop/hdfs/datanode</value>
</property>
</configuration>
Note: In the above file, all the property values are user-defined and you can
make changes according to your Hadoop user.
mapred-site.xml
This file is used to specify which MapReduce framework we are using. By default, Hadoop contains a template of this file named mapred-site.xml.template. First of all, you need to copy mapred-site.xml.template to mapred-site.xml using the following command.
$ cp mapred-site.xml.template mapred-site.xml
Open mapred-site.xml file and add the following properties in between the
<configuration>, </configuration> tags in this file.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Edit hive-site.xml and append the following lines between the <configuration>
and </configuration> tags:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore </description>
</property>
Now set them up in HDFS before verifying Hive. Use the following commands:
$ /home/dexlab/hadoop/bin/hadoop fs -mkdir -p /tmp
$ /home/dexlab/hadoop/bin/hadoop fs -mkdir -p /user/hive/warehouse
$ /home/dexlab/hadoop/bin/hadoop fs -chmod g+w /tmp
$ /home/dexlab/hadoop/bin/hadoop fs -chmod g+w /user/hive/warehouse
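Hive can then be launched from its installation directory; the path below assumes Hive was extracted to /home/dexlab/hive, so adjust it to your setup.
$ /home/dexlab/hive/bin/hive
hive>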
10.6 Hive Data Types
Column Types
The Hive column data types are used as follows.
Integral Types:
Type – Postfix – Example
TINYINT – Y – 10Y
SMALLINT – S – 10S
INT – (no postfix) – 10
BIGINT – L – 10L
String Types
Single quotes ('') or double quotes ("") are used to specify string data types. There are two string data types:
VARCHAR
CHAR
The following table shows their lengths:
Data Type – Length
VARCHAR – 1 to 65535
CHAR – 255
Timestamp
TIMESTAMP supports java.sql.Timestamp and traditional UNIX timestamps with optional nanosecond precision, using the format "YYYY-MM-DD HH:MM:SS.fffffffff" (i.e., yyyy-mm-dd hh:mm:ss.ffffffffff). It is available from Hive 0.8.0.
Dates
DATE values are available from Hive 0.12.0. They are described in year/month/day format in the form {{YYYY-MM-DD}}.
Decimals
The DECIMAL type in Hive corresponds to the BigDecimal format of Java, which is used for representing immutable arbitrary-precision decimal numbers. It is available from Hive 0.11.0; the precision/scale syntax below is from Hive 0.13.0.
The syntax and an example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
Union Types
A union is a collection of heterogeneous data types. You can create an instance using create union. The syntax and an example are as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
Literals
The following types of literals are used in Hive:
Floating Point Types
Floating point types are nothing but numbers with decimal points. Generally,
this type of data is composed of DOUBLE data type.
Decimal Type
Decimal type data is nothing but floating point value with higher range than DOUBLE data
type. The range of decimal type is approximately -10-308 to 10308.
Null Value
Missing values are represented by the special value NULL.
Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax:
ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
Syntax:
MAP<primitive_type, data_type>
Structs
Structs in Hive are similar to using complex data with comments.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
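As an illustrative sketch (the table and column names here are made up), several of these types can be combined in one table definition:
hive> CREATE TABLE employee_demo (
    id INT,
    name STRING,
    salary DECIMAL(10,2),
    skills ARRAY<STRING>,
    address STRUCT<city:STRING, zip:INT>
);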
Retrieving All Values
Hive: SELECT * FROM table;
MySQL: SELECT * FROM table;
Grouping With Counting
Hive: SELECT owner, COUNT(*) FROM table GROUP BY owner;
MySQL: SELECT owner, COUNT(*) FROM table GROUP BY owner;
Selecting from multiple tables (join using an alias)
Hive: SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name);
MySQL: SELECT pet.name, comment FROM pet, event WHERE pet.name = event.name;
10.9 Metadata
Selecting a database
Hive: USE database;
MySQL: USE database;
Listing tables in a database
Hive: SHOW TABLES;
MySQL: SHOW TABLES;
Describing the format of a table
Hive: DESCRIBE (FORMATTED|EXTENDED) table;
MySQL: DESCRIBE table;
Command Line
hive -S -e 'select a.col from tab1 a' – Run a query in silent mode.
Drop database
hive> DROP DATABASE IF EXISTS dexlabdb;
LOAD DATA
hive> LOAD DATA LOCAL INPATH './usr/Desktop/kv1.txt' OVERWRITE INTO TABLE Employee;
GROUP BY
hive> SELECT E.Address, COUNT(*) FROM Employee E GROUP BY E.Address;
10.13 Joins
Hive converts joins over multiple tables into a single map/reduce job when every table uses the same column in its join clauses.
Working of Joins
The join operation is compiled into a MapReduce task.
The mapper traverses the join tables and emits the join key and join value pairs into an intermediate file.
SHUFFLE STAGE – Hadoop sorts and merges these pairs; this is called the shuffle stage. The sorting and merging make the shuffle stage expensive.
REDUCER STAGE – The actual join work is done by the reducer, which takes the sorted result as its input.
This section covers the basics of joins in Hive. We will be working with the two tables, customers and orders, that we imported using Sqoop, and we are going to perform the following joins.
INNER JOIN – Select records that have matching values in both tables.
LEFT JOIN (LEFT OUTER JOIN) – returns all the values from the left table,
plus the matched values from the right table, or NULL in case of no matching
join predicate
RIGHT JOIN (RIGHT OUTER JOIN) – returns all the values from the right table, plus the matched values from the left table, or NULL in case of no matching join predicate.
FULL JOIN (FULL OUTER JOIN) – Selects all records that match either left or
right table records.
LEFT SEMI JOIN – only returns the records from the left-hand table. Hive doesn't support IN subqueries, so you can't write:
SELECT * FROM TABLE_A WHERE TABLE_A.ID IN (SELECT ID FROM TABLE_B);
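The same result is expressed in Hive with a LEFT SEMI JOIN:
hive> SELECT a.* FROM TABLE_A a LEFT SEMI JOIN TABLE_B b ON (a.ID = b.ID);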
Customer Table
Hive Tip: to print column headers in command line
hive> set hive.cli.print.header=true;
hive> select * from customers;
OK
customers.id customers.name
1 John
2 Kevin
19 Alex
3 Mark
4 Jenna
5 Robert
6 Zoya
7 Sam
8 George
9 Peter
Orders Table:
hive> select * from orders;
OK
order_id orders.order_date orders.customer_id orders.amount
101 2016-01-01 7 3540
102 2016-03-01 1 240
103 2016-03-02 6 2340
104 2016-02-12 3 5000
105 2016-02-12 3 5500
Inner Join
Select records that have matching values in both tables.
hive> select c.id, c.name, o.order_date, o.amount from customers c inner join orders o ON (c.id
= o.customer_id);
Output
c.id c.name o.order_date o.amount
7 Sam 2016-01-01 3540
1 John 2016-03-01 240
6 Zoya 2016-03-02 2340
3 Mark 2016-02-12 5000
3 Mark 2016-02-12 5500
9 Peter 2016-02-14 3005
1 John 2016-02-14 20
2 Kevin 2016-02-29 2000
3 Mark 2016-02-29 2500
1 John 2016-02-27 200
Left Join (Left Outer Join)
Returns all the values from the left table, plus the matched values from the right table, or NULL in case of no matching join predicate. A query of this form produces the output below:
hive> select c.id, c.name, o.order_date, o.amount from customers c left outer join orders o ON (c.id = o.customer_id);
Output
c.id c.name o.order_date o.amount
1 John 2016-02-27 200
1 John 2016-02-14 20
1 John 2016-03-01 240
19 Alex NULL NULL
2 Kevin 2016-02-29 2000
3 Mark 2016-02-29 2500
3 Mark 2016-02-12 5500
3 Mark 2016-02-12 5000
4 Jenna NULL NULL
5 Robert NULL NULL
6 Zoya 2016-03-02 2340
7 Sam 2016-01-01 3540
8 George NULL NULL
9 Peter 2016-02-14 3005
Left Semi Join
Returns only the columns of the left-hand table, for customers that have at least one order. A query of this form produces the output below:
hive> select c.* from customers c left semi join orders o ON (c.id = o.customer_id);
OUTPUT
customers.id customers.name
1 John
2 Kevin
3 Mark
6 Zoya
7 Sam
9 Peter
Time taken: 56.362 seconds, Fetched: 6 row(s)
10.14 Hive Bucket
A bucketed (and, here, also partitioned) table is declared with the CLUSTERED BY ... INTO n BUCKETS clause; the beginning of the column list in this example is abridged:
hive> CREATE TABLE ... (
    ...
    orderdate STRING,
    amount DOUBLE,
    tax DOUBLE
) PARTITIONED BY (company STRING)
CLUSTERED BY (username) INTO 25 BUCKETS;
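On older Hive releases, inserts honor the bucket definition only after bucketing enforcement is switched on (from Hive 2.0 onward it is always on):
hive> set hive.enforce.bucketing = true;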
Example
Let us take an example for view. Assume employee table as given below, with
the fields Id, Name, Salary, Designation, and Dept. Generate a query to retrieve
the employee details who earn a salary of more than Rs 30000. We store the
result in a view named emp_30000.
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+------+--------------+-------------+-------------------+--------+
The following query retrieves the employee details using the above scenario:
hive> CREATE VIEW emp_30000 AS
SELECT * FROM employee
WHERE salary>30000;
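The view can then be queried like any table and, matching section 10.17, dropped when no longer needed:
hive> SELECT * FROM emp_30000;
hive> DROP VIEW emp_30000;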
10.18 Creating an Index
The general syntax of CREATE INDEX is as follows (clauses in brackets are optional):
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IN TABLE index_table_name]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
Example:
Let us take an example for index. Use the same employee table that we have
used earlier with the fields Id, Name, Salary, Designation, and Dept. Create an
index named index_salary on the salary column of the employee table.
The following query creates an index:
hive> CREATE INDEX index_salary ON TABLE employee(salary)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
CHAPTER 11
Apache HBase
11.1 HBase Overview
HBase Architectural Components
Physically, HBase is composed of three types of servers in a master-slave architecture. Region Servers serve data for reads and writes; when accessing data, clients communicate with HBase RegionServers directly. Region assignment and DDL (create, delete table) operations are handled by the HBase Master process. ZooKeeper, a distributed coordination service, maintains the live cluster state.
The Hadoop DataNode stores the data that the Region Server is managing. All
HBase data is stored in HDFS files. Region Servers are collocated with the HDFS
DataNodes, which enables data locality (putting the data close to where it is needed) for
the data served by the RegionServers. HBase data is local when it is written, but when a
region is moved, it is not local until compaction.
The NameNode maintains metadata information for all the physical data blocks
that comprise the files.
11.2 Regions
HBase Tables are divided horizontally by row key range into “Regions.” A region
contains all rows in the table between the region’s start key and end key.
Regions are assigned to the nodes in the cluster, called “Region Servers,” and
these serve data for reads and writes. A region server can serve about 1,000
regions.
HBase uses ZooKeeper as a distributed coordination service to maintain server state in the cluster; ZooKeeper uses consensus to guarantee common shared state. Note that there should be three or five machines for consensus.
The active HMaster coordinates the region servers, assigning regions and recovering region servers on failure. An inactive HMaster listens for active HMaster failure, and if the active HMaster fails, the inactive HMaster becomes active.
WAL: Write Ahead Log is a file on the distributed file system. The WAL is used
to store new data that hasn't yet been persisted to permanent storage; it is used
for recovery in the case of failure.
BlockCache: is the read cache. It stores frequently read data in memory. Least
Recently Used data is evicted when full.
MemStore: is the write cache. It stores new data which has not yet been written
to disk. It is sorted before writing to disk. There is one MemStore per column
family per region.
● Edits are appended to the end of the WAL file that is stored on disk.
● The WAL is used to recover not-yet-persisted data in case a server
crashes.
Note that this is one reason why there is a limit to the number of column
families in HBase. There is one MemStore per CF; when one is full, they all
flush. It also saves the last written sequence number so the system knows what
was persisted so far.
The highest sequence number is stored as a meta field in each HFile, to
reflect where persisting has ended and where to continue. On region startup, the
sequence number is read, and the highest is used as the sequence number for
new edits.
When reading a row, the scanner first looks for the row cells in the Block Cache (the read cache) and then in the MemStore (the write cache). If the scanner does not find all of the row cells in the MemStore and Block Cache, HBase uses the Block Cache indexes and bloom filters to load into memory the HFiles that may contain the target row cells.
HBase relies on HDFS to provide data safety as it stores its files. When data is written to HDFS, one copy is written locally, then it is replicated to a secondary node, and a third copy is written to a tertiary node.
Data Recovery
WAL files contain a list of edits, with one edit representing a single put or delete.
Edits are written chronologically, so, for persistence, additions are appended to
the end of the WAL file that is stored on disk.
What happens if there is a failure when the data is still in memory and
not persisted to an HFile? The WAL is replayed. Replaying a WAL is done by
reading the WAL, adding and sorting the contained edits to the current
MemStore. At the end, the MemStore is flushed to write the changes to an HFile.
HBase Installation
We can install HBase in any of the three modes: Standalone mode, Pseudo
Distributed mode, and Fully Distributed mode.
Installing HBase in Standalone Mode
Download the latest stable version of HBase from
http://www.interior-dsgn.com/apache/hbase/stable/ using the "wget" command, and
extract it using the "tar zxvf" command. See the following command.
$ cd /usr/local/
$wget http://www.interior-dsgn.com/apache/hbase/stable/hbase-0.98.8-
hadoop2-bin.tar.gz
$tar -zxvf hbase-0.98.8-hadoop2-bin.tar.gz
Shift to super user mode and move the extracted HBase folder to /home/dexlab as
shown below.
$ su
$ password: enter your password here
$ mv hbase-0.98.8-hadoop2/* /home/dexlab/Hbase/
Before proceeding further, you need to edit the following files to configure HBase.
hbase-env.sh
Set the Java home for HBase by opening the hbase-env.sh file in the conf folder.
Edit JAVA_HOME environment variable and change the existing path to your
current JAVA_HOME variable as shown below.
cd /home/dexlab/Hbase/conf
gedit hbase-env.sh
This will open the env.sh file of HBase. Now replace the existing JAVA_HOME
value with your current value as shown below.
export JAVA_HOME=/usr/
hbase-site.xml
This is the main configuration file of HBase. Set the data directory to an
appropriate location by opening the HBase home folder. Inside the conf folder,
you will find several files; open the hbase-site.xml file as shown below.
$ cd /home/dexlab/HBase/
$ cd conf
$ gedit hbase-site.xml
Inside the hbase-site.xml file, you will find the <configuration> and
</configuration> tags. Within them, set the HBase directory under the property
key with the name “hbase.rootdir” as shown below.
<configuration>
<!-- Here you have to set the path where you want HBase to store its files. -->
<property>
<name>hbase.rootdir</name>
<value>/home/dexlab/HBase/HFiles</value>
</property>
<!-- Here you have to set the path where you want HBase to store its built-in ZooKeeper files. -->
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/dexlab/Zookeeper</value>
</property>
</configuration>
$ bin/start-hbase.sh
If everything goes well, when you run the HBase start script, it will prompt you
with a message saying that HBase has started.
starting master, logging to /home/dexlab/HBase/bin/../logs/hbase-tpmaster-
localhost.localdomain.out
Configuring HBase
Before proceeding with HBase, configure Hadoop and HDFS on your local
system or on a remote system and make sure they are running. Stop HBase if it
is running.
hbase-site.xml
Edit hbase-site.xml file to add the following properties.
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
This property specifies the mode in which HBase should run. In the same file,
change the hbase.rootdir from the local file system path to your HDFS instance
address, using the hdfs:// URI syntax. We are running HDFS on the localhost at port 8030.
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:8030/hbase</value>
</property>
Starting HBase
After configuration is over, browse to HBase home folder and start HBase using
the following command.
$cd /usr/local/HBase
$bin/start-hbase.sh
Start Region Servers
Start the region servers as shown below.
$ ./bin/local-regionservers.sh start 3
Next, start the HBase shell using the "hbase shell" command. This will give you
the HBase shell prompt as shown below.
2014-12-09 14:24:27,526 INFO [main] Configuration.deprecation:
hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.98.8-hadoop2, r6cfc8d064754251365e070a10a82eb169956d5fe, Fri
Nov 14 18:26:29 PST 2014
hbase(main):001:0>
The HBase web interface lists your currently running region servers, backup
masters, and HBase tables.
HBase Tables
Set the classpath for the HBase libraries (the lib folder in HBase) as shown below.
export CLASSPATH=$CLASSPATH:/home/dexlab/hbase/lib/*
This is to prevent the "class not found" exception while accessing HBase
using the Java API.
HBase Shell
HBase provides a shell with which you can communicate with HBase. HBase
uses the Hadoop File System to store its data. It has a master server and
region servers. Tables are split into regions, and these regions are stored in
region servers.
The master server manages these region servers and all these tasks take
place on HDFS. Given below are some of the commands supported by HBase
Shell.
General Commands
status Provides the status of HBase, for example, the number of
servers.
version Provides the version of HBase being used.
table_help Provides help for table-reference commands.
whoami Provides information about the user.
You can start the HBase interactive shell using “hbase shell” command as shown
below.
./bin/hbase shell
If you have successfully installed HBase in your system, then it gives you the
HBase shell prompt as shown below.
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.23, rf42302b28aceaab773b15f234aa8718fff7eea3c, Wed Aug 27
00:54:09 UTC 2014
hbase(main):001:0>
To exit the interactive shell command at any moment, type exit or use <ctrl+c>.
Check the shell functioning before proceeding further. Use the list command for
this purpose. List is a command used to get the list of all the tables in HBase.
First of all, verify the installation and the configuration of HBase in your system
using this command as shown below.
hbase(main):001:0> list
When you type this command, it gives you the following output.
hbase(main):001:0> list
TABLE
Example: Given below is a sample schema of a table named emp. It has two
column families: "dexlab data" and "emp data".
Row key | dexlab data | emp data
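Such a table can be created from the HBase shell with the create command; a
minimal sketch using the table and column family names above:
hbase(main):001:0> create 'emp', 'dexlab data', 'emp data'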
Verification
You can verify whether the table is created using the list command as shown
below. Here you can observe the created emp table.
hbase(main):002:0> list
TABLE
emp
2 row(s) in 0.0340 seconds
Example
Given below is an example that shows how to disable a table.
hbase(main):025:0> disable 'emp'
0 row(s) in 1.2760 seconds
Verification
After disabling the table, you can still verify its existence using the list and
exists commands. However, you cannot scan it; scanning will give you the
following error.
hbase(main):028:0> scan 'emp'
ROW COLUMN + CELL
ERROR: emp is disabled.
is_disabled
This command is used to find whether a table is disabled. Its syntax is as
follows.
hbase> is_disabled 'table name'
The following example verifies whether the table named emp is disabled. If it is
disabled, it will return true and if not, it will return false.
hbase(main):031:0> is_disabled 'emp'
true
0 row(s) in 0.0440 seconds
disable_all
This command is used to disable all the tables matching the given regex. The
syntax for disable_all command is given below.
hbase> disable_all 'r.*'
Suppose there are 5 tables in HBase, namely raja, rajani, rajendra, rajesh, and
raju. The following code will disable all the tables starting with raj.
hbase(main):002:07> disable_all 'raj.*'
raja
rajani
rajendra
rajesh
raju
Disable the above 5 tables (y/n)?
y
5 tables successfully disabled
Example
Given below is an example to enable a table.
hbase(main):005:0> enable 'emp'
0 row(s) in 0.4580 seconds
Verification
After enabling the table, scan it. If you can see the schema, your table is
successfully enabled.
hbase(main):006:0> scan 'emp'
is_enabled
This command is used to find whether a table is enabled. Its syntax is as follows:
hbase> is_enabled 'table name'
The following code verifies whether the table named emp is enabled. If it is
enabled, it will return true and if not, it will return false.
hbase(main):031:0> is_enabled 'emp'
true
0 row(s) in 0.0440 seconds
describe
This command returns the description of the table. Its syntax is as follows:
hbase> describe 'table name'
Given below is the output of the describe command on the emp table.
hbase(main):006:0> describe 'emp'
DESCRIPTION
ENABLED
alter
Alter is the command used to make changes to an existing table. Using this
command, you can change the maximum number of cells of a column family, set
and delete table scope operators, and delete a column family from a table.
In the following example, we have made the emp table read only.
hbase(main):006:0> alter 'emp', READONLY
Updating all regions with the new schema...
0/1 regions updated.
1/1 regions updated.
Done.
0 row(s) in 2.2140 seconds
Given below is an example to delete a column family from the ‘emp’ table.
Assume there is a table named employee in HBase. It contains the
following data:
hbase(main):006:0> scan 'employee'
ROW COLUMN+CELL
Now let us delete the column family named emp using the alter command.
hbase(main):007:0> alter 'employee', 'delete' => 'emp'
Updating all regions with the new schema...
0/1 regions updated.
1/1 regions updated.
Done.
0 row(s) in 2.2380 seconds
Now verify the data in the table after alteration. Observe the column family
‘emp’ is no more, since we have deleted it.
hbase(main):003:0> scan 'employee'
ROW COLUMN + CELL
row1 column = dexlab:city, timestamp = 14181936767, value = hyderabad
drop_all
This command is used to drop the tables matching the “regex” given in the
command. Its syntax is as follows:
hbase> drop_all ‘t.*’
Example
Assume there are tables named raja, rajani, rajendra, rajesh, and raju.
hbase(main):017:0> list
TABLE
raja
rajani
rajendra
rajesh
raju
9 row(s) in 0.0270 seconds
All these tables start with the letters raj. First of all, let us disable all these
tables using the disable_all command as shown below.
hbase(main):002:0> disable_all 'raj.*'
raja
rajani
rajendra
rajesh
raju
Disable the above 5 tables (y/n)?
y
5 tables successfully disabled
Now you can delete all of them using the drop_all command as given below.
hbase(main):018:0> drop_all 'raj.*'
raja
rajani
rajendra
rajesh
raju
Drop the above 5 tables (y/n)?
y
5 tables successfully dropped
exit
You exit the shell by typing the exit command.
hbase(main):021:0> exit
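For reference, a single cell is inserted with the put command, which takes the
table name, row key, column (family:qualifier), and value; a minimal sketch
consistent with the emp table used in this chapter (the shell line numbers are
illustrative):
hbase(main):021:0> put 'emp', '1', 'dexlab data:name', 'raju'
hbase(main):021:0> put 'emp', '1', 'dexlab data:city', 'hyderabad'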
Insert the remaining rows using the put command in the same way. If you insert
the whole table, you will get the following output.
hbase(main):022:0> scan 'emp'
ROW COLUMN+CELL
1 column=dexlab data:city, timestamp=1417524216501, value=hyderabad
1 column=emp data:designation, timestamp=..., value=manager
2 column=emp data:designation, timestamp=..., value=sr:engg
3 column=emp data:designation, timestamp=..., value=jr:engg
3 column=emp data:salary, timestamp=1417524702514, value=25000
The newly given value replaces the existing value, updating the row.
Example
Suppose there is a table in HBase called emp with the following data.
The following command will update the city value of the employee named ‘Raju’
to Delhi.
hbase(main):002:0> put 'emp','row1','dexlab:city','Delhi'
0 row(s) in 0.0400 seconds
The updated table looks as follows where you can observe the city of Raju has
been changed to ‘Delhi’.
hbase(main):003:0> scan 'emp'
ROW COLUMN + CELL
row1 column = dexlab:name, timestamp = 1418035791555, value = raju
row1 column = dexlab:city, timestamp = 1418274645907, value = Delhi
row1 column = emp:designation, timestamp = 141857555,value = manager
row1 column = emp:salary, timestamp = 1418039555, value = 50000
1 row(s) in 0.0100 seconds
Example
The following example shows how to use the get command. Let us read the first
row of the emp table.
hbase(main):012:0> get 'emp', '1'
COLUMN CELL
Example
Given below is the example to read a specific column in HBase table.
hbase(main):015:0> get 'emp', 'row1', {COLUMN => 'dexlab:name'}
COLUMN CELL
dexlab:name timestamp = 1418035791555, value = raju
1 row(s) in 0.0080 seconds
Example
Here is an example to delete a specific cell. Here we are deleting the city cell of row 1.
hbase(main):006:0> delete 'emp', '1', 'dexlab data:city',
1417521848375
0 row(s) in 0.0060 seconds
Example
Here is an example of “deleteall” command, where we are deleting all the cells of
row1 of emp table.
hbase(main):007:0> deleteall 'emp','1'
0 row(s) in 0.0240 seconds
Verify the table using the scan command. A snapshot of the table after deleting
the row is given below.
hbase(main):022:0> scan 'emp'
Example
The following example shows how to read data from a table using the scan
command. Here we are reading the emp table.
hbase(main):010:0> scan 'emp'
count
You can count the number of rows of a table using the count command. Its
syntax is as follows:
count ‘<table name>’
After deleting the first row, emp table will have two rows. Verify it as shown
below.
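A minimal sketch of the count command on the emp table; the output reports
the number of remaining rows:
hbase(main):023:0> count 'emp'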
truncate
This command disables, drops, and recreates a table. The syntax of truncate is as
follows:
hbase> truncate 'table name'
Example
Given below is the example of truncate command. Here we have truncated the
emp table.
hbase(main):011:0> truncate 'emp'
Truncating 'emp' table (it may take a while):
- Disabling table...
- Truncating table...
0 row(s) in 1.5950 seconds
After truncating the table, use the scan command to verify. You will get a table
with zero rows.
hbase(main):017:0> scan ‘emp’
ROW COLUMN + CELL
0 row(s) in 0.3110 seconds
grant
The grant command grants specific rights such as read, write, execute, and
admin on a table to a given user. We can grant zero or more privileges to a user
from the set of RWXCA, where
R - represents read privilege.
W - represents write privilege.
X - represents execute privilege.
C - represents create privilege.
A - represents admin privilege.
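The syntax of grant is as follows; the second line is a minimal sketch granting
all five privileges to a user (the user name is illustrative):
hbase> grant <user> <permissions> [<table> [<column family> [<column qualifier>]]]
hbase(main):005:0> grant 'Dexlabanalytics', 'RWXCA'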
revoke
The revoke command is used to revoke a user's access rights to a table. Its
syntax is as follows:
hbase> revoke <user>
The following code revokes all the permissions from the user named
‘Dexlabanalytics’.
hbase(main):006:0> revoke 'Dexlabanalytics'
user_permission
This command is used to list all the permissions for a particular table.
The syntax of user_permission is as follows:
hbase>user_permission ‘tablename’
The following code lists all the user permissions of ‘emp’ table.
hbase(main):013:0> user_permission 'emp'
CHAPTER
Sqoop
12
12.1 Introduction
Sqoop is a tool designed to transfer data between Hadoop and relational
database servers. It is used to import data from relational databases such as
MySQL, Oracle to Hadoop HDFS, and export from Hadoop file system to
relational databases.
12.6 Sqoop-Import
Sqoop import command imports a table from an RDBMS to HDFS. Each record
from a table is considered as a separate record in HDFS. Records can be stored
as text files, or in binary representation as Avro or SequenceFiles.
Generic Syntax:
$ sqoop import (generic args) (import args)
$ sqoop-import (generic args) (import args)
The Hadoop specific generic arguments must precede any import
arguments, and the import arguments can be of any order.
Verify the Java installation using the command "java -version". If Java is
already installed on your system, you get to see the following response:
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
If Java is not installed on your system, then follow the steps given below.
Installing Java
Follow the simple steps given below to install Java on your system.
Step 1:
$ sudo apt-get install default-jdk
Now apply all the changes into the current running system.
$ source ~/.bashrc
Verify the Hadoop installation using the command "hadoop version". If Hadoop
is already installed on your system, then you will get the following response:
Hadoop 2.6.0
If Hadoop is not installed on your system, then proceed with the following steps:
Downloading Hadoop
Download and extract Hadoop 2.6.0 from Apache Software Foundation using the
following commands.
$ wget http://apache.claz.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
$ tar xzf hadoop-2.6.0.tar.gz
$ mv hadoop-2.6.0/* hadoop/
Now, apply all the changes into the current running system.
$ source ~/.bashrc
core-site.xml
The core-site.xml file contains information such as the port number used for
Hadoop instance, memory allocated for the file system, memory limit for storing
the data, and the size of Read/Write buffers.
Open the core-site.xml and add the following properties in between the
<configuration> and </configuration> tags.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
The hdfs-site.xml file contains information such as the value of replication data,
the namenode path, and the datanode path of your local file systems, that is,
the place where you want to store the Hadoop infrastructure.
Let us assume the following data.
dfs.replication (data replication value) = 1
Open this file and add the following properties in between the <configuration>,
</configuration> tags in this file.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/dexlab/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/dexlab/hadoop/hdfs/datanode</value>
</property>
</configuration>
Note: In the above file, all the property values are user-defined and you can
make changes according to your Hadoop infrastructure.
mapred-site.xml
This file is used to specify which MapReduce framework we are using. By
default, Hadoop contains a template of this file named mapred-site.xml.template.
First of all, you need to copy the file mapred-site.xml.template to mapred-site.xml
using the following command.
$ cp mapred-site.xml.template mapred-site.xml
Open mapred-site.xml file and add the following properties in between the
<configuration>, </configuration> tags in this file.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
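Since the MapReduce framework is set to yarn, the yarn-site.xml file is typically
configured as well; a minimal sketch with the standard shuffle auxiliary service
(add it between the <configuration>, </configuration> tags of that file):
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>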
$ source ~/.bashrc
Example
Let us take an example of three tables named as emp, emp_add, and
emp_contact, which are in a database called dexlabdb in a MySQL database
server. The three tables and their data are as follows.
emp:
Id | Name | Deg | Salary | Dept
emp_add:
Id | Hno | Street | City
emp_contact:
Id | Phno | Email
Importing a Table
The Sqoop tool 'import' is used to import table data from an RDBMS table into
the Hadoop file system as a text file or a binary file.
The following command is used to import the emp table from MySQL
database server to HDFS.
$ sqoop import \
--connect jdbc:mysql://localhost/dexlabdb \
--username root \
--table emp --m 1
It shows you the emp table data, with fields separated by commas (,).
1201, denial, manager, 50000, TP
1202, manish, preader, 50000, TP
1203, kapil, php dev, 30000, AC
1204, prashant, php dev, 30000, AC
1205, kirti, admin, 20000, TP
The following command is used to import emp_add table data into ‘/queryresult’
directory.
$ sqoop import \
--connect jdbc:mysql://localhost/dexlabdb \
--username root \
--table emp_add \
--m 1 \
--target-dir /queryresult
It will show you the emp_add table data with comma (,) separated fields.
1201, 288A, vgiri, jublee
1202, 108I, aoc, sec-bad
1203, 144Z, pgutta, hyd
1204, 78B, oldcity, sec-bad
1205, 720C, hitech, sec-bad
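Sqoop can also import a subset of a table using the 'where' argument; a minimal
sketch that selects the emp_add rows whose city is sec-bad into a separate
target directory (the directory name is illustrative):
$ sqoop import \
--connect jdbc:mysql://localhost/dexlabdb \
--username root \
--table emp_add \
--m 1 \
--where "city ='sec-bad'" \
--target-dir /wherequery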
It will show you the subset of the emp_add table data with comma (,) separated fields.
1202, 108I, aoc, sec-bad
1204, 78B, old city, sec-bad
1205, 720C, hitech, sec-bad
Incremental Import
Incremental import is a technique that imports only the newly added rows in a
table. It is required to add ‘incremental’, ‘check-column’, and ‘last-value’ options
to perform the incremental import.
The following syntax is used for the incremental option in Sqoop import
command.
--incremental <mode>
--check-column <column name>
--last-value <last check column value>
Let us assume the newly added data into emp table is as follows:
1206, bunny p, grp des, 20000, GR
The following command is used to perform the incremental import in the
emp table.
$ sqoop import \
--connect jdbc:mysql://localhost/dexlabdb \
--username root \
--table emp \
--m 1 \
--incremental append \
--check-column id \
--last-value 1205
The following command is used to verify the imported data from emp table to
HDFS emp/ directory.
$ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*
It shows you the emp table data with comma (,) separated fields.
1201, denial, manager, 50000, TP
1202, manish, preader, 50000, TP
1203, kapil, php dev, 30000, AC
1204, prashant, php dev, 30000, AC
1205, kirti, admin, 20000, TP
1206, bunny p, grp des, 20000, GR
The following command is used to see the modified or newly added rows from the
emp table.
$ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*1
It shows you the newly added rows to the emp table with comma (,) separated
fields.
1206, bunny p, grp des, 20000, GR
12.9 Import-all-tables
Syntax
The following syntax is used to import all tables.
$ sqoop import-all-tables (generic-args) (import-args)
$ sqoop-import-all-tables (generic-args) (import-args)
Example
Let us take an example of importing all tables from the dexlabdb database. The
list of tables that the database dexlabdb contains is as follows.
+-----------------+
| Tables          |
+-----------------+
| emp             |
| emp_add         |
| emp_contact     |
+-----------------+
The following command is used to import all the tables from the dexlabdb
database.
$ sqoop import-all-tables \
--connect jdbc:mysql://localhost/dexlabdb \
--username root
Note: If you are using the import-all-tables, it is mandatory that every table in
that database must have a primary key field.
The following command is used to verify all the table data imported from the
dexlabdb database into HDFS.
$ $HADOOP_HOME/bin/hadoop fs -ls
Output
drwxr-xr-x - dexlab supergroup 0 2014-12-22 22:50 _sqoop
drwxr-xr-x - dexlab supergroup 0 2014-12-23 01:46 emp
drwxr-xr-x - dexlab supergroup 0 2014-12-23 01:50 emp_add
drwxr-xr-x - dexlab supergroup 0 2014-12-23 01:52 emp_contact
Example
Let us take an example of the employee data in a file in HDFS. The employee
data is available in the emp_data file in the 'emp/' directory in HDFS. The
emp_data file is as follows.
1201, denial, manager, 50000, TP
1202, manish, preader, 50000, TP
1203, kapil, php dev, 30000, AC
1204, prashant, php dev, 30000, AC
1205, kirti, admin, 20000, TP
1206, bunny p, grp des, 20000, GR
The following command is used to export the table data (which is in the emp_data
file on HDFS) to the employee table in the db database of the MySQL database server.
$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \
--export-dir /emp/emp_data
The following command is used to verify the table in mysql command line.
mysql>select * from employee;
If the given data is stored successfully, then you can find the following table of
given employee data.
+-------+--------------+-----------------+-------------------+--------+
| Id | Name | Designation | Salary | Dept |
+-------+--------------+-----------------+-------------------+--------+
| 1201 | denial | manager | 50000 | TP |
| 1202 | manish | preader | 50000 | TP |
| 1203 | kapil | php dev | 30000 | AC |
| 1204 | prashant | php dev | 30000 | AC |
| 1205 | kirti | admin | 20000 | TP |
| 1206 | bunny p | grp des | 20000 | GR |
+-------+--------------+-----------------+-------------------+--------+
Sqoop Job
Syntax
The following is the syntax for creating a Sqoop job.
$ sqoop job (generic-args) (job-args)
[-- [subtool-name] (subtool-args)]
$ sqoop-job (generic-args) (job-args)
[-- [subtool-name] (subtool-args)]
The following command creates a job named dexjob, which imports data from
the employee table in the db database to HDFS.
$ sqoop job --create dexjob \
--import \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee --m 1
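The saved job can then be inspected with the '--show' argument; a minimal sketch:
$ sqoop job --show dexjob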
It shows the tools and their options, which are used in dexjob.
Job: dexjob
Tool: import Options:
----------------------------
direct.import = true
codegen.input.delimiters.record = 0
hdfs.append.dir = false
db.table = employee
...
incremental.last.value = 1206
...
12.11 List-Databases
Syntax
The following syntax is used for Sqoop list-databases command.
$ sqoop list-databases (generic-args) (list-databases-args)
$ sqoop-list-databases (generic-args) (list-databases-args)
Sample Query
The following command is used to list all the databases in the MySQL database
server.
$ sqoop list-databases \
--connect jdbc:mysql://localhost/ \
--username root
12.12 List-Tables
Syntax
The following syntax is used for Sqoop list-tables command.
$ sqoop list-tables (generic-args) (list-tables-args)
$ sqoop-list-tables (generic-args) (list-tables-args)
Sample Query
The following command is used to list all the tables in the dexlabdb database of
MySQL database server.
$ sqoop list-tables \
--connect jdbc:mysql://localhost/dexlabdb \
--username root
CHAPTER
Apache Pig
13
13.1 What is Apache Pig?
Apache Pig is an abstraction over MapReduce. It is a tool/platform which is used
to analyze large sets of data by representing them as data flows. Pig is generally
used with Hadoop; we can perform all the data manipulation operations in
Hadoop using Apache Pig.
To write data analysis programs, Pig provides a high-level language
known as Pig Latin. This language provides various operators using which
programmers can develop their own functions for reading, writing, and
processing data.
To analyze data using Apache Pig, programmers need to write scripts
using Pig Latin language. All these scripts are internally converted to Map and
Reduce tasks. Apache Pig has a component known as Pig Engine that accepts
the Pig Latin scripts as input and converts those scripts into MapReduce jobs.
The following points set Apache Pig apart from MapReduce.
Apache Pig | MapReduce
Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig. | Exposure to Java is a must to work with MapReduce.
Apache Pig uses a multi-query approach, thereby reducing the length of the code to a great extent. | MapReduce will require almost 20 times more lines of code to perform the same task.
Likewise, the following points differentiate Apache Pig from SQL.
Apache Pig | SQL
In Apache Pig, schema is optional. We can store data without designing a schema (values are stored as $0, $1, etc.). | Schema is mandatory in SQL.
The data model in Apache Pig is nested relational. | The data model used in SQL is flat relational.
Apache Pig provides limited opportunity for query optimization. | There is more opportunity for query optimization in SQL.
In the following table, we have listed a few significant points that set Apache Pig
apart from Hive.
Apache Pig | Hive
Apache Pig uses a language called Pig Latin. It was originally created at Yahoo. | Hive uses a language called HiveQL. It was originally created at Facebook.
Apache Pig can handle structured, unstructured, and semi-structured data. | Hive is mostly for structured data.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out the
logical optimizations such as projection and pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce
jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in a sorted order, where
they are executed to produce the desired results.
Atom
Any single value in Pig Latin, irrespective of its data type, is known as an
Atom. It is stored as a string and can be used as a string or a number. int, long,
float, double, chararray, and bytearray are the atomic values of Pig. A piece of
data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record that is formed by an ordered set of fields is known as a tuple, the fields
can be of any type. A tuple is similar to a row in a table of RDBMS.
Example − (Raja, 30)
Bag
A bag is an unordered set of tuples. In other words, a collection of tuples
(non-unique) is known as a bag. Each tuple can have any number of fields
(flexible schema). A bag is represented by '{}'. It is similar to a table in an
RDBMS but, unlike a table in an RDBMS, it is not necessary that every tuple
contain the same number of fields or that the fields in the same position
(column) have the same type.
Figure 53: Tuple and Bag
Map
A map (or data map) is a set of key-value pairs. The key needs to be of type
chararray and should be unique. The value might be of any type. It is
represented by '[]'.
Example − [name#Raja, age#30]
Step 1 - Create a directory with the name Pig in the same directory where the
installation directories of Hadoop, Java, and other software were installed. (In
our tutorial, we have created the Pig directory in the user named Hadoop).
$ mkdir Pig
Step 2 - Download the tar file pig-0.15.0-src.tar.gz and extract it using the tar
command as shown below.
$ tar -xzf pig-0.15.0-src.tar.gz
Step 3 - Move the extracted content to the Pig directory created earlier as shown
below.
$ mv pig-0.15.0-src/* /home/dexlab/Pig/
.bashrc file
In the .bashrc file, set the following variables −
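A minimal sketch of these variables, assuming Pig was moved to /home/dexlab/Pig
and Hadoop's configuration lives under $HADOOP_HOME/conf:
export PIG_HOME=/home/dexlab/Pig
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf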
Local Mode
In this mode, all the files are installed and run from your local host and local file
system. There is no need of Hadoop or HDFS. This mode is generally used for
testing purpose.
MapReduce Mode
MapReduce mode is where we load or process the data that exists in the Hadoop
File System (HDFS) using Apache Pig. In this mode, whenever we execute the
Pig Latin statements to process the data, a MapReduce job is invoked in the
back-end to perform a particular operation on the data that exists in the HDFS.
Local mode command:
$ ./pig -x local
MapReduce mode command:
$ ./pig -x mapreduce
Either of these commands gives you the Grunt shell prompt as shown below.
grunt>
You can exit the Grunt shell using ‘ctrl + d’.
After invoking the Grunt shell, you can execute a Pig script by directly
entering the Pig Latin statements in it.
grunt> customers = LOAD 'customers.txt' USING PigStorage(',');
grunt> Dump customers;
You can also write the Pig Latin statements in a script file and execute it from
the Linux shell in either mode as shown below.
Local mode: $ pig -x local Sample_script.pig
MapReduce mode: $ pig -x mapreduce Sample_script.pig
Note − We will discuss in detail how to run a Pig script in batch mode and in
embedded mode in subsequent chapters.
Sh Command
Using the sh command, we can invoke any shell command from the Grunt shell.
However, we cannot execute commands that are a part of the shell environment
itself (e.g., cd).
Syntax
Given below is the syntax of sh command.
grunt> sh shell command parameters
Example
We can invoke the ls command of Linux shell from the Grunt shell using the sh
option as shown below. In this example, it lists out the files in the /pig/bin/
directory.
grunt> sh ls
pig
pig_1444799121955.log
pig.cmd
pig.py
fs Command
Using the fs command, we can invoke any Fs Shell commands from the Grunt
shell.
Syntax
Given below is the syntax of fs command.
grunt> fs File System command parameters
Example
We can invoke the ls command of HDFS from the Grunt shell using fs command.
In the following example, it lists the files in the HDFS root directory.
grunt> fs -ls
Found 3 items
drwxrwxrwx - Dexlab supergroup 0 2015-09-08 14:13 Hbase
In the same way, we can invoke all the other file system shell commands
from the Grunt shell using the fs command.
Utility Commands
The Grunt shell provides a set of utility commands. These include utility
commands such as clear, help, history, quit, and set; and commands such as
exec, kill, and run to control Pig from the Grunt shell. Given below is the
description of the utility commands provided by the Grunt shell.
clear Command
The clear command is used to clear the screen of the Grunt shell.
Syntax
You can clear the screen of the grunt shell using the clear command as shown
below.
grunt> clear
help Command
The help command gives you a list of Pig commands or Pig properties.
Usage
You can get a list of Pig commands using the help command as shown below.
grunt> help
history Command
This command displays a list of statements executed/used so far since the
Grunt shell was invoked.
Usage
Assume we have executed a few statements since opening the Grunt shell, for example:
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING
PigStorage(',');
Then, using the history command will produce the following output.
grunt> history
Set Command
The set command is used to show/assign values to keys used in Pig.
Usage
Using this command, you can set values to the following keys.
default_parallel You can set the number of reducers for a MapReduce job by passing any
whole number as a value to this key.
debug You can turn the debugging feature in Pig on or off by passing
on/off to this key.
job.name You can set the Job name to the required job by passing a string value
to this key.
job.priority You can set the job priority to a job by passing one of the following
values to this key −
● very_low
● low
● normal
● high
● very_high
stream.skippath For streaming, you can set the path from where the data is not to be
transferred, by passing the desired path in the form of a string to this
key.
Quit Command
You can quit from the Grunt shell using this command.
Usage
Quit from the Grunt shell as shown below.
grunt> quit
Let us now take a look at the commands using which you can control
Apache Pig from the Grunt shell.
Exec Command
Using the exec command, we can execute Pig scripts from the Grunt shell.
Syntax
Given below is the syntax of the utility command exec.
grunt> exec [-param param_name = param_value] [-param_file file_name] script
Example
Let us assume there is a file named dexlab.txt in the /pig_data/ directory of
HDFS with the following content.
Dexlab.txt
001,Rajiv,Hyderabad
002,siddarth,Kolkata
003,Rajesh,Delhi
Sample_script.pig
dexlab = LOAD 'hdfs://localhost:9000/pig_data/dexlab.txt' USING
PigStorage(',')
as (id:int,name:chararray,city:chararray);
Dump dexlab;
Now, let us execute the above script from the Grunt shell using the exec
command as shown below.
grunt> exec /sample_script.pig
Output
The exec command executes the script in the sample_script.pig. As directed in
the script, it loads the dexlab.txt file into Pig and gives you the result of the
Dump operator displaying the following content.
(1,Rajiv,Hyderabad)
(2,siddarth,Kolkata)
(3,Rajesh,Delhi)
Kill Command
You can kill a job from the Grunt shell using this command.
Syntax
Given below is the syntax of the kill command.
grunt> kill JobId
Example
Suppose there is a running Pig job having the id Id_0055; you can kill it from the
Grunt shell as shown below.
grunt> kill Id_0055
Run Command
You can run a Pig script from the Grunt shell using the run command.
Syntax
Given below is the syntax of the run command.
grunt> run [-param param_name = param_value] [-param_file
file_name] script
Example
Let us assume there is a file named dexlab.txt in the /pig_data/ directory of
HDFS with the following content.
Dexlab.txt
001,Rajiv,Hyderabad
002,siddarth,Kolkata
003,Rajesh,Delhi
Sample_script.pig
dexlab = LOAD 'hdfs://localhost:9000/pig_data/dexlab.txt' USING
PigStorage(',') as (id:int,name:chararray,city:chararray);
Now, let us run the above script from the Grunt shell using the run
command as shown below.
grunt> run /sample_script.pig
You can see the output of the script using the Dump operator as shown
below.
grunt> Dump;
(1,Rajiv,Hyderabad)
(2,siddarth,Kolkata)
(3,Rajesh,Delhi)
Note − The difference between exec and the run command is that if we use run,
the statements from the script are available in the command history.
Example
Given below is a Pig Latin statement, which loads data to Apache Pig.
grunt> Dexlab_data = LOAD 'dexlab_data.txt' USING PigStorage(',')as
( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
Complex Types
The complex data types of Pig Latin are the Tuple, Bag, and Map described
earlier in this chapter.
Null Values
Values for all the above data types can be NULL. Apache Pig treats null values
in a similar way as SQL does.
A null can be an unknown value or a non-existent value. It is used as a
placeholder for optional values. These nulls can occur naturally or can be the
result of an operation.
Given below are some of the arithmetic and comparison operators of Pig Latin,
with examples assuming a = 10 and b = 20.
− | Subtraction − Subtracts the right-hand operand from the left-hand operand. | a − b will give −10
> | Greater than − Checks if the value of the left operand is greater than the value of the right operand. If yes, then the condition becomes true. | (a > b) is not true.
< | Less than − Checks if the value of the left operand is less than the value of the right operand. If yes, then the condition becomes true. | (a < b) is true.
>= | Greater than or equal to − Checks if the value of the left operand is greater than or equal to the value of the right operand. If yes, then the condition becomes true. | (a >= b) is not true.
<= | Less than or equal to − Checks if the value of the left operand is less than or equal to the value of the right operand. If yes, then the condition becomes true. | (a <= b) is true.
matches | Pattern matching − Checks whether the string on the left-hand side matches the constant on the right-hand side. | f1 matches '.*tutorial.*'
LOAD − Loads data from the file system (local/HDFS) into a relation.
Other categories of Pig Latin operators include filtering, sorting, and the
diagnostic operators.
The dataset used in this chapter contains personal details such as id, first
name, last name, phone number, and city of six people.
Verify the installation using the command "hadoop version". If your system
contains Hadoop, and if you have set the PATH variable, then you will get the
following output.
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r
e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
Now, move the file from the local file system to HDFS using put command
as shown below. (You can use copyFromLocal command as well.)
$ cd $HADOOP_HOME/bin
$ hdfs dfs -put /home/dexlab/Pig/Pig_Data/dexlab_data.txt hdfs://localhost:9000/pig_data/
Output
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
Note − We load the data without specifying the schema. In that case, the
columns will be addressed as $0, $1, etc.
Example
As an example, let us load the data in dexlab_data.txt in Pig under the schema
named Dexlab using the LOAD command.
$ pig -x mapreduce
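From the Grunt shell, the LOAD statement for this schema looks as follows; a
sketch consistent with the LOAD statement shown later in this chapter:
grunt> Dexlab = LOAD 'hdfs://localhost:9000/pig_data/dexlab_data.txt'
USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);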
Input file path − We are reading data from the file dexlab_data.txt, which is in
the /pig_data/ directory of HDFS.
Storage function − We have used the PigStorage() function. It loads and stores
data as structured text files. It takes as a parameter the delimiter by which each
entity of a tuple is separated. By default, it takes '\t' as the delimiter.
Note − The load statement will simply load the data into the specified relation
in Pig. To verify the execution of the Load statement, you have to use the
Diagnostic Operators which are discussed in the next chapters.
Example
Assume we have a file dexlab_data.txt in HDFS with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
And we have read it into a relation dexlab using the LOAD operator as
shown below.
grunt> dexlab = LOAD 'hdfs://localhost:9000/pig_data/dexlab_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray );
Now, let us store the relation in the HDFS directory “/pig_Output/” as shown
below.
grunt> STORE dexlab INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');
Verification
You can verify the stored data as shown below.
Step 1 - First of all, list out the files in the directory named pig_output using
the ls command as shown below.
hdfs dfs -ls 'hdfs://localhost:9000/pig_Output/'
Found 2 items
-rw-r--r-- 1 Dexlab supergroup 0 2015-10-05 13:03
hdfs://localhost:9000/pig_Output/_SUCCESS
-rw-r--r-- 1 Dexlab supergroup 224 2015-10-05 13:03
hdfs://localhost:9000/pig_Output/part-m-00000
You can observe that two files were created after executing the store
statement.
Step 2 - Using cat command, list the contents of the file named part-m-00000 as
shown below.
$ hdfs dfs -cat 'hdfs://localhost:9000/pig_Output/part-m-00000'
1,Rajiv,Reddy,9848022337,Hyderabad
2,siddarth,Battacharya,9848022338,Kolkata
3,Rajesh,Khanna,9848022339,Delhi
4,Preethi,Agarwal,9848022330,Pune
5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
6,Archana,Mishra,9848022335,Chennai
Dump Operator
The Dump operator is used to run the Pig Latin statements and display the
results on the screen. It is generally used for debugging purposes.
Syntax
Given below is the syntax of the Dump operator.
grunt> Dump Relation_Name
Example
Assume we have a file dexlab_data.txt in HDFS with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
And we have read it into a relation dexlab using the LOAD operator as
shown below.
grunt> dexlab = LOAD 'hdfs://localhost:9000/pig_data/dexlab_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray );
Now, let us print the contents of the relation using the Dump operator as shown
below.
grunt> Dump dexlab;
(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)
CHAPTER
Apache Flume
14
14.1 What is Flume?
Apache Flume is a tool/service/data ingestion mechanism for collecting,
aggregating, and transporting large amounts of streaming data, such as log files
and events, from various sources to a centralized data store.
Flume is a highly reliable, distributed, and configurable tool. It is
principally designed to copy streaming data (log data) from various web servers
to HDFS.
Flume acts as an intermediary between the data producers and the
centralized stores and provides a steady flow of data between them.
Flume provides the feature of contextual routing.
The transactions in Flume are channel-based where two transactions (one
sender and one receiver) are maintained for each message. It guarantees
reliable message delivery.
Flume is reliable, fault tolerant, scalable, manageable, and customizable.
Log file − In general, a log file is a file that lists events/actions that occur in an
operating system. For example, web servers list every request made to the
server in the log files.
On harvesting such log data, we can get information about −
Note − In a POSIX file system, whenever we are accessing a file (say, performing a
write operation), other programs can still read this file (at least the saved
portion of the file). This is because the file exists on the disk before it is closed.
Available Solutions
To send streaming data (log files, events etc..,) from various sources to HDFS,
we have the following tools available at our disposal −
Facebook’s Scribe
Scribe is an immensely popular tool that is used to aggregate and stream log
data. It is designed to scale to a very large number of nodes and be robust to
network and node failures.
Apache Kafka
Kafka has been developed by Apache Software Foundation. It is an open-source
message broker. Using Kafka, we can handle feeds with high-throughput and
low-latency.
Apache Flume
Apache Flume is a tool/service/data ingestion mechanism for collecting,
aggregating, and transporting large amounts of streaming data, such as log data
and events, from various web servers to a centralized data store.
It is a highly reliable, distributed, and configurable tool that is principally
designed to transfer streaming data from various sources to HDFS.
In this tutorial, we will discuss in detail how to use Flume with some
examples.
Source
A source is the component of an Agent which receives data from the data
generators and transfers it to one or more channels in the form of Flume events.
Apache Flume supports several types of sources and each source receives
events from a specified data generator.
Example − Avro source, Thrift source, twitter 1% source etc.
Channel
A channel is a transient store which receives the events from the source and
buffers them till they are consumed by sinks. It acts as a bridge between the
sources and the sinks.
These channels are fully transactional and they can work with any
number of sources and sinks.
Example − JDBC channel, File system channel, Memory channel, etc.
Sink
A sink stores the data into centralized stores like HBase and HDFS. It
consumes the data (events) from the channels and delivers it to the destination.
The destination of the sink might be another agent or the central stores.
Note − A flume agent can have multiple sources, sinks and channels. We have
listed all the supported sources, sinks, channels in the Flume configuration
chapter of this tutorial.
Interceptors
Interceptors are used to alter/inspect flume events which are transferred
between source and channel.
Channel Selectors
These are used to determine which channel is to be opted to transfer the data in
case of multiple channels. There are two types of channel selectors −
Sink Processors
These are used to invoke a particular sink from the selected group of sinks.
These are used to create failover paths for your sinks or load balance events
across multiple sinks from a channel.
Multi-hop Flow
Within Flume, there can be multiple agents and before reaching the final
destination, an event may travel through more than one agent. This is known
as multi-hop flow.
Fan-out Flow
The dataflow from one source to multiple channels is known as fan-out flow. It
is of two types:
Replicating − The data flow where the data will be replicated in all the
configured channels.
Multiplexing − The data flow where the data will be sent to a selected channel
which is mentioned in the header of the event.
Fan-in Flow
The data flow in which the data will be transferred from many sources to one
channel is known as fan-in flow.
Failure Handling
In Flume, for each event, two transactions take place: one at the sender and one
at the receiver. The sender sends events to the receiver. Soon after receiving the
data, the receiver commits its own transaction and sends a “received” signal to
the sender. After receiving the signal, the sender commits its transaction.
(Sender will not commit its transaction till it receives a signal from the receiver.)
Installing Flume
First of all, download the latest version of Apache Flume software from the
website https://flume.apache.org/.
Step 1 - Open the website. Click on the download link on the left-hand side of
the home page. It will take you to the download page of Apache Flume.
Step 2 - In the Download page, you can see the links for binary and source files
of Apache Flume. Click on the link apache-flume-1.6.0-bin.tar.gz
You will be redirected to a list of mirrors where you can start your
download by clicking any of these mirrors. In the same way, you can download
the source code of Apache Flume by clicking on apache-flume-1.6.0-src.tar.gz.
Step 3 - Create a directory with the name Flume in the same directory where
the installation directories of Hadoop, HBase, and other software were
installed (if you have already installed any) as shown below.
$ mkdir Flume
Figure 58: Setting home folder, path and classpath for Flume
conf Folder
If you open the conf folder of Apache Flume, you will have the following four
files −
flume-conf.properties.template,
flume-env.sh.template,
flume-env.ps1.template, and
log4j.properties.
flume-env.sh
Open the flume-env.sh file and set JAVA_HOME to the folder where Java was
installed on your system.
If you have successfully installed Flume, you will get a help prompt of
Flume as shown below.
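A minimal sketch of the verification, run from the Flume home directory; the
flume-ng script's help command prints the usage text:
$ bin/flume-ng help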
Flume supports various sources, sinks, and channels − for example, Avro,
Thrift, and Twitter sources; memory, JDBC, and file channels; and HDFS and
HBase sinks.
You can use any of them. For example, if you are transferring Twitter
data using a Twitter source through a memory channel to an HDFS sink, and the
agent name is TwitterAgent, then
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
After listing the components of the agent, you have to describe the
source(s), sink(s), and channel(s) by providing values to their properties.
For example, if we consider the twitter source, following are the properties to
which we must provide values to configure it.
TwitterAgent.sources.Twitter.type = Twitter (type name)
TwitterAgent.sources.Twitter.consumerKey =
TwitterAgent.sources.Twitter.consumerSecret =
TwitterAgent.sources.Twitter.accessToken =
TwitterAgent.sources.Twitter.accessTokenSecret =
The following example shows how to bind the sources and the sinks to a channel.
Here, we consider twitter source, memory channel, and HDFS sink.
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channels = MemChannel
where −
agent − Command to start the Flume agent
--conf, -c <conf> − Use the configuration file in the conf directory
--conf-file, -f <file> − Specifies the config file path
--name, -n <name> − Name of the agent
-Dproperty=value − Sets a Java system property value.
To fetch Twitter data, we will have to follow the steps given below
Create a twitter Application
Install / Start HDFS
Configure Flume
Step 1 - To create a Twitter application, click on the following link
https://apps.twitter.com/. Sign in to your Twitter account. You will have a
Twitter Application Management window where you can create, delete, and
manage Twitter Apps.
Figure 63: Creating Twitter Application
Starting HDFS
Since we are storing the data in HDFS, we need to install / verify Hadoop. Start
Hadoop and create a folder in it to store Flume data. Follow the steps given
below before configuring Flume.
Verify the installation using the command "hadoop version". If your system
contains Hadoop, and if you have set the PATH variable, then you will get the
following output.
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r
e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using /home/Hadoop/hadoop/share/hadoop/common/hadoop-common-
2.6.0.jar
Start the HDFS and YARN daemons as shown below.
$ start-dfs.sh
$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to
/home/Hadoop/hadoop/logs/yarn-Hadoop-resourcemanager-localhost.localdomain.out
localhost: starting nodemanager, logging to
/home/Hadoop/hadoop/logs/yarn-Hadoop-nodemanager-localhost.localdomain.out
Configuring Flume
We have to configure the source, the channel, and the sink using the
configuration file in the conf folder. The example given in this chapter uses an
experimental source provided by Apache Flume named Twitter 1% Firehose, a
memory channel, and an HDFS sink.
Twitter 1% Firehose Source
To configure this source, you must provide values to the following properties.
type org.apache.flume.source.twitter.TwitterSource
consumerKey The OAuth consumer key
consumerSecret The OAuth consumer secret
accessToken The OAuth access token
accessTokenSecret The OAuth token secret
Channel
We are using the memory channel. To configure the memory channel,
you must provide value to the type of the channel.
type − It holds the type of the channel. In our example, the channel is named
MemChannel and its type is memory.
HDFS Sink
This sink writes data into HDFS. To configure this sink, you must provide
the following details.
channel
type hdfs
hdfs.path The path of the directory in HDFS where data is to be stored.
And we can provide some optional values based on the scenario. Given
below are the optional properties of the HDFS sink that we are configuring in
our application.
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/Hadoop/twitter_data/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
Execution
Browse through the Flume home directory and execute the application as shown
below.
$ cd $FLUME_HOME
$ bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf
-Dflume.root.logger=DEBUG,console -n TwitterAgent
If everything goes fine, the streaming of tweets into HDFS will start. Given
below is the snapshot of the command prompt window while fetching tweets.
To verify the result, open the Hadoop web interface and click on the dropdown
named Utilities on the right-hand side of the page. You can see two options as
shown in the snapshot given below.
Click on Browse the file system and enter the path of the HDFS
directory where you have stored the tweets. In our example, the path will
be /user/Hadoop/twitter_data/. Then, you can see the list of twitter log files
stored in HDFS as given below.
Prerequisites
To run the example provided in this chapter, you need to install Flume.
Configuring Flume
We have to configure the source, the channel, and the sink using the
configuration file in the conf folder. The example given in this chapter uses
a NetCat Source, Memory channel, and a logger sink.
NetCat Source
While configuring the NetCat source, we have to specify a port. The source then
listens to the given port, receives each line we enter on that port as an individual
event, and transfers it to the sink through the specified channel.
While configuring this source, you have to provide values to the following
properties
channels
type netcat
bind − Host name or IP address to bind.
port − Port number to which we want the source to listen.
Channel
We are using the memory channel. To configure the memory channel,
you must provide a value to the type of the channel. Given below are the list of
properties that you need to supply while configuring the memory channel −
type − It holds the type of the channel. In our example, the channel is named
MemChannel and its type is memory.
Logger Sink
This sink logs all the events passed to it. Generally, it is used for testing or
debugging purposes. To configure this sink, you must provide the following
details.
channel
type logger
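Putting these together, a minimal netcat.conf sketch consistent with the agent
name and port used in the execution below (all component names are
illustrative):
NetcatAgent.sources = Netcat
NetcatAgent.channels = MemChannel
NetcatAgent.sinks = LoggerSink
NetcatAgent.sources.Netcat.type = netcat
NetcatAgent.sources.Netcat.bind = localhost
NetcatAgent.sources.Netcat.port = 56565
NetcatAgent.sources.Netcat.channels = MemChannel
NetcatAgent.sinks.LoggerSink.type = logger
NetcatAgent.sinks.LoggerSink.channel = MemChannel
NetcatAgent.channels.MemChannel.type = memory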
Execution
Browse through the Flume home directory and execute the application as shown
below.
$ cd $FLUME_HOME
$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/netcat.conf
--name NetcatAgent -Dflume.root.logger=INFO,console
If everything goes fine, the source starts listening on the given port. In
this case, it is 56565. Given below is the snapshot of the command prompt
window of a NetCat source which has started and is listening on port 56565.
Connected
Now you can enter your data line by line (after each line, you have to press
Enter). The NetCat source receives each line as an individual event and you will
get a received message "OK".
Whenever you are done with passing data, you can exit the console by
pressing (Ctrl+C). Given below is the snapshot of the console where we have
connected to the source using the curl command.
Figure 73: Passing Data to Source
The following snapshot shows the NetCat console where the events are
logged.
CHAPTER
Apache Spark
15
Spark was introduced by the Apache Software Foundation to speed up the
Hadoop computational process.
Contrary to common belief, Spark is not a modified version of Hadoop
and is not really dependent on Hadoop, because it has its own cluster
management. Hadoop is just one of the ways to implement Spark.
Spark uses Hadoop in two ways − one is storage and the second is
processing. Since Spark has its own cluster-management computation, it uses
Hadoop for storage purposes only.
Hadoop Yarn − Hadoop Yarn deployment means, simply, that Spark runs on
Yarn without any pre-installation or root access required. It helps to integrate
Spark into the Hadoop ecosystem or Hadoop stack, and allows other components
to run on top of the stack.
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data
abstraction called SchemaRDD, which provides support for structured and semi-
structured data.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform
streaming analytics. It ingests data in mini-batches and performs RDD
(Resilient Distributed Datasets) transformations on those mini-batches of data.
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides
an API for expressing graph computation that can model user-defined graphs.
By default, each transformed RDD may be recomputed each time you run
an action on it. However, you may also persist an RDD in memory, in which
case Spark will keep the elements around on the cluster for much faster access,
the next time you query it. There is also support for persisting RDDs on disk, or
replicated across multiple nodes.
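For example, persisting an RDD looks as follows in the Spark shell; a minimal
sketch assuming a text file data.txt is available:
scala> val lines = sc.textFile("data.txt")  // create an RDD from a text file
scala> lines.persist()                      // keep the RDD in memory after first computation
scala> lines.count()                        // an action; subsequent actions reuse the cached data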
Spark is closely tied to the Hadoop ecosystem. Therefore, it is better to install
Spark into a Linux based system. The following steps show how to install Apache Spark.
Verify the Java installation with the "java -version" command. If Java is already
installed on your system, you get to see the following response.
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
In case you do not have Java installed on your system, then install Java before
proceeding to the next step.
Verify the Scala installation with the "scala -version" command. If Scala is
already installed on your system, you get to see the following response.
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
In case you don’t have Scala installed on your system, then proceed to next step
for Scala installation.
After installing Scala, verify the installation again; if it succeeded, you will see
the same version response.
# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit
If Spark is installed successfully, then starting the Spark shell will produce the
following output.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled;
ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions:
Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.4.0
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>