servers and mobile devices, for example to collect data and integrate it into
Hadoop.
DataNode: DataNodes are the slave nodes, responsible for storing the data in
HDFS. The NameNode manages all the DataNodes.
YARN (Yet Another Resource Negotiator) is the processing framework in
Hadoop; it manages resources and provides an execution environment for the
processes.
ResourceManager: It receives the processing requests and passes parts of
each request to the corresponding NodeManagers, where the actual
processing takes place. It allocates resources to applications based on
their needs.
NodeManager: NodeManager is installed on every DataNode and is
responsible for the execution of tasks on that DataNode.
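The division of labour above can be illustrated with a minimal sketch. This is a toy model only, not the real YARN API: the class names, the memory-only resource accounting, and the "most free memory" placement heuristic are all illustrative assumptions (real YARN uses pluggable schedulers, heartbeats, and multi-dimensional resources).

```python
# Toy sketch of YARN-style container allocation (illustrative assumptions,
# not the actual ResourceManager/NodeManager protocol).

class NodeManager:
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.containers = []

    def launch(self, app, mem):
        # Launch a container for the application, reserving memory.
        self.free_mb -= mem
        self.containers.append((app, mem))

class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app, mem):
        # Pick the node with the most free memory (a simple heuristic).
        node = max(self.nodes, key=lambda n: n.free_mb)
        if node.free_mb < mem:
            return None  # in real YARN the request would be queued
        node.launch(app, mem)
        return node.name

nodes = [NodeManager("nm1", 4096), NodeManager("nm2", 2048)]
rm = ResourceManager(nodes)
print(rm.allocate("app-1", 3072))  # nm1
print(rm.allocate("app-2", 2048))  # nm2
```

The point of the sketch is the split of responsibilities: the ResourceManager decides *where* a container goes, while each NodeManager actually launches it and tracks local usage.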
Tell me about the various Hadoop daemons and their roles in a Hadoop
cluster.
Generally, approach this question by first explaining the HDFS daemons, i.e.
NameNode, DataNode and Secondary NameNode, then moving on to the
YARN daemons, i.e. ResourceManager and NodeManager, and lastly explaining
the JobHistoryServer.
NameNode: It is the master node responsible for storing the metadata of all
the files and directories. It holds information about the blocks that make
up a file, and where those blocks are located in the cluster.
DataNode: It is the slave node that contains the actual data.
Secondary NameNode: It periodically merges the changes (edit log) with
the FsImage (Filesystem Image) from the NameNode. It stores the updated
FsImage in persistent storage, which can be used in case of NameNode
failure.
ResourceManager: It is the central authority that manages resources and
schedules applications running on top of YARN.
NodeManager: It runs on the slave machines and is responsible for launching
the applications' containers (where applications execute their parts),
monitoring their resource usage (CPU, memory, disk, network) and reporting
it to the ResourceManager.
JobHistoryServer: It maintains information about MapReduce jobs after the
Application Master terminates.
HBase stores data in the form of key/value pairs in a columnar model. In this
model, related columns are grouped together as column families.
HBase provides a flexible data model and low-latency access to small amounts
of data stored in large data sets.
Running HBase on top of Hadoop increases the throughput and performance of a
distributed cluster set-up, providing faster random read and write
operations.
Other NoSQL databases include Cassandra, MongoDB and CouchDB.
HBase is a NoSQL database; all the data is stored as {key, value} pairs,
where both keys and values are byte arrays.
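The key/value model with column families and timestamps can be sketched as a nested map. This is a toy in-memory model for intuition only; the table name, family name and `ToyTable` class are made up for illustration, and real HBase persists data in HFiles on HDFS rather than Python dicts.

```python
from collections import defaultdict

# Toy sketch of HBase's data model: row key -> column family -> column
# qualifier -> timestamped versions of a byte-array value.
# (Illustrative only; not how HBase actually stores data.)

class ToyTable:
    def __init__(self, families):
        self.families = set(families)
        # rows[row_key][family][qualifier] -> list of (timestamp, value)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, qualifier, value, ts):
        assert family in self.families, "unknown column family"
        cell = self.rows[row][family].setdefault(qualifier, [])
        cell.append((ts, bytes(value, "utf8")))  # values are byte arrays

    def get(self, row, family, qualifier):
        versions = self.rows[row][family].get(qualifier, [])
        # Return the newest version, as HBase does by default.
        return max(versions)[1] if versions else None

t = ToyTable(["info"])
t.put("row1", "info", "city", "Paris", ts=1)
t.put("row1", "info", "city", "London", ts=2)
print(t.get("row1", "info", "city"))  # b'London'
```

Note how the timestamp makes each cell multi-versioned: a `get` returns the latest write, while older versions remain available.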
Column family: a column family groups column names, and each stored value
carries a timestamp. The data is stored in the form of regions (partitions
of tables). These regions are split up and stored in RegionServers. The
master server manages these RegionServers, and all of this takes place on
HDFS. Given below are some of the commands supported by the HBase shell.
HBase Architecture
HBase Shell Commands
1. Create: to create a table
2. Put: to insert records into the table
3. Get: to fetch/extract records from the table
4. Scan: to get all entries of the table [should be avoided if the data set
is very large]
5. Alter: to change the table properties
6. Status: to show system status, such as the number of servers present in
the cluster, the active server count and the average load value
7. Version: to show the HBase version at the command line
HBase also has its own client API, so we can build our own applications on
top of it and load data through them.
To start working with HBase, we need to start the HBase services, beginning
with the HMaster service.
What is Hadoop? Hadoop is an open-source, Java-based programming
framework that supports the processing and storage of extremely large data
sets in a distributed computing environment. It runs on computer
clusters built from commodity hardware. All the modules in Hadoop are
designed with the fundamental assumption that hardware failures are a
common occurrence and should be automatically handled by the framework.
HDFS: It is a filesystem specially designed for storing huge data sets on a
cluster of commodity hardware, with a streaming access pattern [write once,
read many times, without changing the content of the file].
The default block size is 64 MB; we can configure it to 128 MB, 256 MB and
so on. The main reason for having large blocks is to reduce the cost of seek
time.
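The seek-cost argument can be made concrete with some back-of-the-envelope arithmetic. The 10 ms seek time and the 1 GB file are illustrative assumptions, not figures from the text:

```python
# Why large blocks: for a fixed file size, fewer blocks means fewer seeks.

def blocks_needed(file_mb, block_mb):
    # Ceiling division: a partially filled last block still counts.
    return -(-file_mb // block_mb)

def seek_overhead_ms(file_mb, block_mb, seek_ms=10):
    # One seek per block, at an assumed 10 ms per seek.
    return blocks_needed(file_mb, block_mb) * seek_ms

file_mb = 1024  # a 1 GB file (illustrative)
for block_mb in (4, 64, 128):
    print(block_mb, "MB blocks:", blocks_needed(file_mb, block_mb),
          "blocks,", seek_overhead_ms(file_mb, block_mb), "ms of seeks")
```

With 4 MB blocks the 1 GB file needs 256 seeks, whereas with 128 MB blocks it needs only 8, so the time spent seeking becomes negligible compared with the time spent streaming data.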
NameNode: It is the master daemon that maintains and manages the data block present in
the DataNodes.
DataNode: DataNodes are the slave nodes in HDFS. Unlike the NameNode, a
DataNode runs on commodity hardware and is responsible for storing the data as blocks.
Secondary NameNode: The Secondary NameNode works concurrently with the primary
NameNode as a helper daemon. It performs checkpointing.
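Checkpointing means replaying the edit log on top of the last FsImage to produce a fresh FsImage. The sketch below is a toy model under stated assumptions: real edit-log entries are binary transaction records with many more operation types, and the paths, block IDs and `checkpoint` function here are invented for illustration.

```python
# Toy sketch of Secondary NameNode checkpointing: apply the edit log
# to the last FsImage snapshot to produce an up-to-date FsImage.
# (Illustrative only; not the real HDFS on-disk formats.)

def checkpoint(fsimage, edit_log):
    image = dict(fsimage)  # copy the last saved namespace snapshot
    for op, path, meta in edit_log:
        if op == "create":
            image[path] = meta
        elif op == "delete":
            image.pop(path, None)
    return image  # the merged image is written back to persistent storage

fsimage = {"/data/a.txt": {"blocks": ["blk_1"]}}
edits = [("create", "/data/b.txt", {"blocks": ["blk_2"]}),
         ("delete", "/data/a.txt", None)]
print(checkpoint(fsimage, edits))  # {'/data/b.txt': {'blocks': ['blk_2']}}
```

Because the merge happens on the Secondary NameNode, the primary NameNode's edit log stays small and a restart (or recovery) only has to replay recent edits rather than the full history.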