Hadoop Distributed File System: HDFS, the storage layer of Hadoop, is a
distributed, scalable, Java-based file system adept at storing large volumes
of unstructured data.
MapReduce: MapReduce is a software framework that serves as the
compute layer of Hadoop. MapReduce jobs are divided into two (obviously
named) parts. The Map function divides a query into multiple parts and
processes data at the node level. The Reduce function aggregates the
results of the Map function to determine the answer to the query.
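For illustration, here is a minimal word-count sketch against the org.apache.hadoop.mapreduce Java API; the class name and the input/output paths are placeholders, and it assumes the Hadoop client libraries are on the classpath.

// Minimal word-count sketch: the Mapper emits (word, 1) per token at the node
// level, and the Reducer aggregates the counts, mirroring the Map/Reduce split
// described above.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: split each input line into words and emit (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}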
Hive: Hive is a Hadoop-based data warehousing framework originally
developed by Facebook. It allows users to write queries in a SQL-like
language called HiveQL, which are then converted into MapReduce jobs. This allows
SQL programmers with no MapReduce experience to use the warehouse and
makes it easier to integrate with business intelligence and visualization tools
such as MicroStrategy, Tableau, Revolution Analytics, etc.
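As a sketch of how HiveQL is submitted from a program, the snippet below uses Hive's JDBC driver; it assumes a running HiveServer2 instance and the hive-jdbc driver on the classpath, and the host, port, table and column names are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
  public static void main(String[] args) throws Exception {
    // jdbc:hive2://... assumes a HiveServer2 endpoint; credentials are placeholders.
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "user", "");
         Statement stmt = conn.createStatement();
         // Hive compiles this SQL-like HiveQL query into MapReduce work.
         ResultSet rs = stmt.executeQuery(
             "SELECT page, COUNT(*) AS hits FROM page_views GROUP BY page")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}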
Pig: Pig is a high-level scripting language (Pig Latin) and platform used to run
MapReduce jobs on Hadoop. It can be extended with user-defined functions
written in several languages, including Java.
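For instance, Pig Latin statements can be run programmatically through Pig's Java API (PigServer); this is a sketch only, and the input path and field layout are invented for illustration.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
  public static void main(String[] args) throws Exception {
    // Run in MapReduce mode; Pig compiles the Pig Latin statements into MapReduce jobs.
    PigServer pig = new PigServer(ExecType.MAPREDUCE);
    // Input path and schema are hypothetical placeholders.
    pig.registerQuery("lines = LOAD '/data/input.txt' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
    pig.store("counts", "/data/wordcounts");  // write the result back to HDFS
  }
}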
HBase: HBase is an open source, non-relational, distributed database
running on top of Hadoop. Tables in HBase can serve as the input and output
for MapReduce jobs run in Hadoop, and may be accessed through Java APIs
as well as through REST, Avro or Thrift gateway APIs.
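A brief sketch of the Java client API mentioned above; it assumes a running HBase cluster and the HBase client libraries on the classpath, and the table, column family, qualifier and values are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {  // placeholder table
      // Write one cell: row key "row1", column family "info", qualifier "name".
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);

      // Read the cell back by row key.
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(value));
    }
  }
}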
Flume: Flume is a framework for populating Hadoop with data. Agents are
deployed throughout one's IT infrastructure (inside web servers, application
servers and mobile devices, for example) to collect data and integrate it into
Hadoop.
Sqoop: Sqoop is a command-line interface application for transferring data
between relational databases and Hadoop.
It supports:
1. Incremental loads of a single table,
2. A free-form SQL query,
3. Saved jobs which can be run multiple times to import updates made to
a database since the last import.
Imports from Sqoop can be used to populate tables in Hive or HBase, and exports
from it can be used to put data from Hadoop into a relational database.
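Since Sqoop is a command-line tool, one way to drive it from Java is to shell out to it; the sketch below assumes sqoop is installed and on the PATH, and the JDBC URL, table, column and target directory are placeholders for an incremental import like the one described above.

import java.io.IOException;

public class SqoopImportExample {
  public static void main(String[] args) throws IOException, InterruptedException {
    // Launch the sqoop CLI as a child process (hypothetical connection details).
    Process p = new ProcessBuilder(
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/sales",   // source relational database
        "--table", "orders",                        // single table to import
        "--incremental", "append",                  // only rows added since the last import
        "--check-column", "id",
        "--last-value", "0",
        "--target-dir", "/data/orders")             // destination directory in HDFS
        .inheritIO()
        .start();
    System.exit(p.waitFor());
  }
}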

Ambari: Ambari is a web-based set of tools for deploying, administering and
monitoring Apache Hadoop clusters. Its development is led by
engineers from Hortonworks, which includes Ambari in its Hortonworks Data
Platform.
HCatalog: HCatalog is a centralized metadata management and sharing
service for Apache Hadoop. It allows for a unified view of all data in Hadoop
clusters and allows diverse tools, including Pig and Hive, to process any data
elements without needing to know where in the cluster the data is physically
stored.
What is Big Data?
Big Data is a term used for a collection of data sets that are large and
complex, which is difficult to store and process using available database
management tools or traditional data processing applications. The challenge
includes capturing, curating, storing, searching, sharing, transferring,
analyzing and visualizing this data.
Big Data Characteristics: Five Vs
Volume: the huge volume of data.
Velocity: the speed at which data arrives from different sources and keeps
changing from day to day.
Variety: structured, unstructured and semi-structured data.
Veracity: Veracity refers to doubt or uncertainty about the available data
due to data inconsistency and incompleteness.
Value: the value that can be derived from the data.
What is Hadoop and its components?
Storage unit: HDFS (NameNode, DataNode)
Processing framework: YARN (ResourceManager, NodeManager)
What are HDFS and YARN?
HDFS (Hadoop Distributed File System) is the storage unit of Hadoop. It is
responsible for storing different kinds of data as blocks in a distributed
environment. It follows a master/slave topology.
NameNode: NameNode is the master node in the distributed environment
and it maintains the metadata for the blocks of data stored in
HDFS, such as block locations, replication factor, etc.

DataNode: DataNodes are the slave nodes, which are responsible for storing
data in the HDFS. NameNode manages all the DataNodes.
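To make the master/slave split concrete, here is a small sketch using Hadoop's Java FileSystem API; the path and contents are placeholders, and it assumes core-site.xml points at the cluster. The client gets metadata from the NameNode, while the bytes themselves are streamed to and from DataNodes.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/hello.txt");    // placeholder HDFS path

    // Write: the NameNode records the block metadata; DataNodes store the blocks.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }

    // Read: block locations come from the NameNode; data is read from DataNodes.
    try (FSDataInputStream in = fs.open(path)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}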
YARN (Yet Another Resource Negotiator) is the processing framework in
Hadoop, which manages resources and provides an execution environment
to the processes.
ResourceManager: It receives the processing requests, and then passes the
parts of the requests to the corresponding NodeManagers, where the
actual processing takes place. It allocates resources to applications based on
their needs.
NodeManager: NodeManager is installed on every DataNode and it is
responsible for executing tasks on every single DataNode.
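As a small illustration of talking to the ResourceManager programmatically, the sketch below lists the applications it knows about via the YarnClient API; it assumes a reachable ResourceManager configured in yarn-site.xml.

import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnApplicationsExample {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();  // reads yarn-site.xml
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // Ask the ResourceManager for the applications it currently tracks.
    List<ApplicationReport> apps = yarnClient.getApplications();
    for (ApplicationReport app : apps) {
      System.out.println(app.getApplicationId() + "\t"
          + app.getName() + "\t" + app.getYarnApplicationState());
    }
    yarnClient.stop();
  }
}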
Tell me about the various Hadoop daemons and their roles in a Hadoop
cluster.
Generally approach this question by first explaining the HDFS daemons i.e.
NameNode, DataNode and Secondary NameNode, and then moving on to the
YARN daemons i.e. ResourceManager and NodeManager, and lastly explaining
the JobHistoryServer.
NameNode: It is the master node which is responsible for storing the
metadata of all the files and directories. It has information about the blocks
that make up a file, and where those blocks are located in the cluster.
DataNode: It is the slave node that contains the actual data.
Secondary NameNode: It periodically merges the changes (edit log) with
the FsImage (Filesystem Image), present in the NameNode. It stores the
modified FsImage into persistent storage, which can be used in case of
failure of NameNode.
ResourceManager: It is the central authority that manages resources and
schedules applications running on top of YARN.
NodeManager: It runs on slave machines, and is responsible for launching
the application containers (where applications execute their parts),
monitoring their resource usage (CPU, memory, disk, network) and reporting
these to the ResourceManager.
JobHistoryServer: It maintains information about MapReduce jobs after the
Application Master terminates.

Explain the difference between NameNode, Backup Node and
Checkpoint Node.
NameNode: NameNode is at the heart of the HDFS file system. It manages the
metadata, i.e. the data of the files is not stored on the NameNode; rather, it
holds the directory tree of all the files present in the HDFS file system on a
Hadoop cluster. NameNode uses two files for the namespace:
fsimage file - It keeps track of the latest checkpoint of the namespace.
edits file - It is a log of changes that have been made to the namespace since
the last checkpoint.
Checkpoint Node: Checkpoint Node keeps track of the latest checkpoint in
a directory that has the same structure as that of the NameNode's directory.
The Checkpoint Node creates checkpoints for the namespace at regular
intervals (in current Hadoop releases this interval is controlled by the
dfs.namenode.checkpoint.period setting) by downloading the edits and
fsimage files from the NameNode and merging them locally. The new image is
then uploaded back to the active NameNode.
BackupNode: Backup Node also provides checkpointing functionality like
that of the Checkpoint Node, but it also maintains an up-to-date in-memory
copy of the file system namespace that is in sync with the active NameNode.

HBase: HBase is an open-source, distributed, column-oriented database built on top of the
Hadoop file system. It is horizontally scalable. Apache HBase is designed to provide
quick random access to huge volumes of structured data.
HBase stores data in the form of key/value pairs in a columnar model. In this
model, all the columns are grouped together as column families.
HBase provides a flexible data model and low-latency access to small amounts
of data stored in large data sets.
HBase on top of Hadoop increases the throughput and performance of a
distributed cluster set-up. In turn, it provides faster random read and write
operations.
Other NoSQL databases are Cassandra, MongoDB and CouchDB.
HBase is a NoSQL database: all the data is stored as {key, value} pairs in the
form of byte arrays.
Column family: a column family groups columns; each cell value within a
column is versioned with a timestamp. The data is stored in the form of
regions (partitions of tables). These regions are split up and stored in region
servers. The master server manages these region servers, and all these tasks
take place on HDFS. Given below are some of the commands supported by the
HBase shell.

HBase is a column-oriented database and data is stored in tables. The
tables are sorted by RowId: each RowId identifies a row, which is the
collection of the several column families that are present in the table.
[Figure: HBase Architecture]
HBase Shell Commands
1. create: creates a table
2. put: inserts records into the table
3. get: fetches/extracts records from the table
4. scan: lists all entries of the table [should not be used if the dataset is very
large]
5. alter: changes the table properties
6. status: shows the system status, such as the number of servers present in
the cluster, the active server count, and the average load value
7. version: shows the HBase version in command mode


HBase also has its own APIs; we can design our own API on top of them and
load data into HBase.
To start working with HBase we need to start the HBase services, such as the
HMaster service.
What is Hadoop? Hadoop is an open source, Java-based programming
framework that supports the processing and storage of extremely large data
sets in a distributed computing environment. It consists of computer
clusters built from commodity hardware. All the modules in Hadoop are
designed with the fundamental assumption that hardware failures are a
common occurrence and should be automatically handled by the framework.
HDFS: It is a specially designed filesystem for storing huge data sets on a
cluster of commodity hardware with a streaming access pattern [write once,
read many times, without changing the content of the file].
The default block size is 64 MB; it can be configured to 128 MB, 256 MB and
so on. The main reason for having large blocks is to reduce the cost of seek
time. For example, with 128 MB blocks a 1 GB file occupies only 8 blocks, so
far fewer seeks (and far fewer metadata entries on the NameNode) are needed
than with many small blocks.


NameNode: It is the master daemon that maintains and manages the data blocks present in
the DataNodes.
DataNode: DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode is
commodity hardware that is responsible for storing the data as blocks.
Secondary NameNode: The Secondary NameNode works concurrently with the primary
NameNode as a helper daemon. It performs checkpointing.
