
HADOOP ECOSYSTEM

INTRODUCTION:
Apache Hadoop is an open source software framework used to develop data processing applications
which are executed in a distributed computing environment.
“Hadoop is a technology to store massive datasets on a cluster of cheap machines in a distributed
manner”
Applications built using HADOOP are run on large data sets distributed across clusters of
commodity computers. Commodity computers are cheap and widely available. These are mainly
useful for achieving greater computational power at low cost.
Similar to data residing in the local file system of a personal computer, in Hadoop, data
resides in a distributed file system called the Hadoop Distributed File System (HDFS). The
processing model is based on the 'Data Locality' concept, wherein computational logic is sent to the
cluster nodes (servers) that contain the data. This computational logic is nothing but a compiled
version of a program written in a high-level language such as Java. Such a program processes data
stored in Hadoop HDFS.
Apache Hadoop is one of the most powerful Big Data tools. The Hadoop ecosystem revolves around three
main components: HDFS, MapReduce, and YARN. Apart from these core components, there are
other Hadoop ecosystem components that play an important role in extending Hadoop's
functionality.
Let's now understand the different Hadoop components in detail.

2.1. HDFS
The Hadoop Distributed File System (HDFS) is the primary storage system of Hadoop. HDFS stores
very large files on a cluster of commodity hardware. It follows the principle of storing a small
number of large files rather than a huge number of small files. HDFS stores data reliably even in
the case of hardware failure, and it provides high-throughput access to applications by serving
data in parallel.
Components of HDFS:
• NameNode – It acts as the master of the Hadoop cluster. The NameNode stores metadata, i.e. the
number of blocks, replicas, and other details. This metadata is kept in memory on the master.
The NameNode assigns tasks to the slave nodes. It should be deployed on reliable hardware, as it is
the centerpiece of HDFS.
• DataNode – It acts as a slave in the Hadoop cluster. In Hadoop HDFS, the DataNode is
responsible for storing the actual data. DataNodes also serve read and write requests from
clients. DataNodes can be deployed on commodity hardware.
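To make this master/slave split concrete, below is a minimal Java sketch of a client writing and then reading a file through the HDFS FileSystem API; the path /user/hadoop/hello.txt is only a placeholder, and the NameNode address is assumed to come from a core-site.xml on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (the NameNode address) from core-site.xml.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write a small file; HDFS splits large files into blocks,
    // replicates them, and stores the blocks on DataNodes.
    Path path = new Path("/user/hadoop/hello.txt");   // placeholder path
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.writeBytes("Hello HDFS\n");
    }

    // Read the file back through the same FileSystem abstraction.
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(path)))) {
      System.out.println(reader.readLine());
    }
  }
}

The client only contacts the NameNode for metadata such as block locations; the actual bytes are streamed to and from the DataNodes.
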
2.2. MapReduce
Hadoop MapReduce is the data processing layer of Hadoop. It processes large amounts of structured and
unstructured data stored in HDFS, and it processes this data in parallel by dividing the submitted
job into a set of independent tasks (sub-jobs). In Hadoop, MapReduce breaks the processing into two
phases: Map and Reduce.
• Map – It is the first phase of processing, where we specify the complex processing logic.
• Reduce – It is the second phase of processing. Here we specify lightweight processing such as
aggregation or summation.
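As a sketch of these two phases, here is the classic word-count example written against the org.apache.hadoop.mapreduce API: the mapper carries the per-record logic and the reducer does the lightweight summation.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase: split each input line into words and emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: lightweight aggregation that sums the counts per word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}

A driver class (Job configuration) would set the input and output paths and submit the job; many independent map tasks then run in parallel, one per input split.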

2.3. YARN
Hadoop YARN provides resource management and is often described as the operating system of Hadoop. It is
responsible for managing and monitoring workloads and for implementing security controls. It is also a
central platform for delivering data governance tools across Hadoop clusters.
YARN allows multiple data processing engines, such as real-time streaming and batch processing, to run on the same cluster.
Components of YARN:
• Resource Manager – It is a cluster-level component that runs on the master machine. It manages
resources and schedules applications running on top of YARN. It has two sub-components: the
Scheduler and the Application Manager.
• Node Manager – It is a node-level component that runs on each slave machine. It continuously
communicates with the Resource Manager to stay up to date.
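As a small illustration of the Resource Manager's view of the cluster, the sketch below uses the YarnClient API to list the applications it is currently tracking; it assumes a yarn-site.xml on the classpath pointing at the Resource Manager.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListYarnApps {
  public static void main(String[] args) throws Exception {
    // The Configuration picks up the Resource Manager address from yarn-site.xml.
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new Configuration());
    yarnClient.start();

    // Ask the Resource Manager for every application it knows about.
    for (ApplicationReport app : yarnClient.getApplications()) {
      System.out.println(app.getApplicationId() + "  "
          + app.getName() + "  " + app.getYarnApplicationState());
    }
    yarnClient.stop();
  }
}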

2.4. Hive
Apache Hive is an open source data warehouse system used for querying and analyzing large
datasets stored in Hadoop files. It processes structured and semi-structured data in Hadoop. Hive
supports the analysis of large datasets stored in HDFS as well as in the Amazon S3 filesystem.
Hive uses a language called HiveQL (HQL), which is similar to SQL, and it automatically
translates these SQL-like queries into MapReduce jobs.
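As a minimal sketch, the Java snippet below submits a HiveQL query through the HiveServer2 JDBC driver; the server address, user, and the web_logs table are placeholders invented for this example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    // Register the HiveServer2 JDBC driver.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Placeholder connection URL and credentials for a local HiveServer2.
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement();
         // Hive translates this SQL-like query into MapReduce jobs.
         ResultSet rs = stmt.executeQuery(
             "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}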

2.5. Pig
Pig is a high-level language platform developed to execute queries on huge datasets stored in
Hadoop HDFS. Pig Latin, the language used in Pig, provides SQL-like operations. Pig loads the data,
applies the required filters, and dumps the data in the required format. Pig converts all of these
operations into Map and Reduce tasks, which are processed efficiently on Hadoop.
Characteristics of Pig:
• Extensible – Pig users can create custom functions to meet their particular processing
requirements.
• Self-optimizing – Pig optimizes execution automatically, so the user can focus on semantics.
• Handles all kinds of data – Pig analyzes both structured and unstructured data.
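As a rough sketch of that load-filter-store flow, the snippet below embeds Pig in a Java program through PigServer; the input file, schema, and output directory are hypothetical, and local mode is used only for illustration (ExecType.MAPREDUCE would target a cluster).

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigFilterJob {
  public static void main(String[] args) throws Exception {
    // Local mode for illustration; use ExecType.MAPREDUCE on a real cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);
    // Load a comma-separated file (hypothetical path and schema).
    pig.registerQuery("users = LOAD 'users.csv' USING PigStorage(',') AS (name:chararray, age:int);");
    // Keep only users older than 30.
    pig.registerQuery("adults = FILTER users BY age > 30;");
    // Store the result; Pig compiles these statements into Map and Reduce tasks.
    pig.store("adults", "adults_out");
    pig.shutdown();
  }
}
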
2.6. HBase
Apache HBase is a NoSQL database that runs on top of Hadoop. It stores structured data in tables
that can have billions of rows and millions of columns. HBase also provides real-time access to
read or write data in HDFS.
Components of HBase:
• HBase Master – It is not part of the actual data storage, but it performs administration and
provides the interface for creating, updating, and deleting tables.
• Region Server – It is the worker node. It handles read, write, update, and delete requests
from clients. A Region Server process runs on every node in the Hadoop cluster.
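A minimal sketch of that real-time read/write path using the HBase Java client is shown below; the table name "users" and column family "info" are assumptions for the example, and the table is expected to exist already.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
  public static void main(String[] args) throws Exception {
    // Reads the cluster location from hbase-site.xml on the classpath.
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {
      // Write one cell: row key "user1", column family "info", qualifier "city".
      Put put = new Put(Bytes.toBytes("user1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Paris"));
      table.put(put);

      // Read it back; the request is served by the Region Server holding the row.
      Result result = table.get(new Get(Bytes.toBytes("user1")));
      byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
      System.out.println(Bytes.toString(city));
    }
  }
}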

2.7. HCatalog
HCatalog is a table and storage management layer on top of Apache Hadoop. It is a key component
of Hive and enables users to store their data in any format and structure. It also allows different
Hadoop components to easily read and write data on the cluster.
Advantages of HCatalog:
• Provides visibility for data cleaning and archiving tools.
• With the table abstraction, HCatalog frees the user from the overhead of data storage.
• Enables notifications of data availability.

2.8. Avro
Avro is an open source project that provides data serialization and data exchange services for Hadoop.
Using serialization, service programs can serialize data into files or messages. Avro stores the data
definition and the data together in one message or file, which makes it easy for programs to
dynamically understand the information stored in an Avro file or message.
Avro provides:
• Container file, to store persistent data.
• Remote procedure call.
• Rich data structures.
• Compact, fast, binary data format.
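The sketch below shows how the data definition (schema) and the data end up in the same container file, using Avro's generic Java API; the User record and output file name are made up for the example.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteExample {
  public static void main(String[] args) throws Exception {
    // The schema (data definition) is written into the container file itself.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"age\",\"type\":\"int\"}]}");

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "Alice");
    user.put("age", 34);

    // Write a container file holding both the schema and the records,
    // so any reader can interpret the data without external metadata.
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, new File("users.avro"));
      writer.append(user);
    }
  }
}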

2.9. Thrift
Apache Thrift is a software framework for scalable cross-language service development. Thrift is
also used for RPC communication. Apache Hadoop makes many RPC calls, so Thrift can be used to
improve their performance.

2.10. Drill
Apache Drill is used for large-scale data processing. It is designed to scale to several thousand
nodes and to query petabytes of data. Drill is a low-latency distributed query engine for
large-scale datasets, and it is the first distributed SQL query engine with a schema-free
model.
Characteristics of drill:
• Decentralized metadata – Drill has no centralized metadata requirement. Drill users do not need
to create and manage tables in a metastore in order to query data.
• Flexibility – Drill provides a hierarchical columnar data model. It can represent complex,
highly dynamic data and allows efficient processing.
• Dynamic schema discovery – Drill does not require any type specification for the data in order
to start query execution. Instead, it processes the data in units called record batches and
discovers the schema on the fly during processing.
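As a hedged sketch of the schema-free model, the snippet below queries a raw JSON file over Drill's JDBC driver without defining any table first; the Drillbit address is a placeholder, and cp.`employee.json` is assumed to be the sample file Drill ships on its classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillQuery {
  public static void main(String[] args) throws Exception {
    // Connect directly to a local Drillbit (placeholder address).
    Class.forName("org.apache.drill.jdbc.Driver");
    try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
         Statement stmt = conn.createStatement();
         // No table definition or metastore entry is needed;
         // Drill discovers the schema of the JSON file on the fly.
         ResultSet rs = stmt.executeQuery(
             "SELECT first_name, last_name FROM cp.`employee.json` LIMIT 5")) {
      while (rs.next()) {
        System.out.println(rs.getString("first_name") + " " + rs.getString("last_name"));
      }
    }
  }
}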

2.11. Mahout
Mahout is an open source framework used for creating scalable machine learning algorithms. Once we
store data in HDFS, Mahout provides the data science tools to automatically find meaningful
patterns in those Big Data sets.

2.12. Sqoop
Sqoop is mainly used for importing and exporting data. It imports data from external sources into
related Hadoop components such as HDFS, HBase, or Hive, and it also exports data from Hadoop to
other external sources. Sqoop works with relational databases such as Teradata, Netezza, Oracle, and MySQL.

2.13. Flume
Flume efficiently collects, aggregates, and moves large amounts of data from their origin into
HDFS. It has a very simple and flexible architecture based on streaming data flows. Flume is a
fault-tolerant and reliable mechanism that allows data to flow from the source into the Hadoop
environment. It uses a simple, extensible data model that supports online analytic applications.
Hence, using Flume, we can get data from multiple servers into Hadoop immediately.

2.14. Ambari
Ambari is an open source management platform for provisioning, managing, monitoring, and securing
Apache Hadoop clusters. Hadoop management becomes simpler because Ambari provides a consistent,
secure platform for operational control.
Benefits of Ambari:
• Simplified installation, configuration, and management – It can easily and efficiently
create and manage clusters at scale.
• Centralized security setup – Ambari configures cluster security across the entire platform.
It also reduces the complexity of administration.
• Highly extensible and customizable – Ambari is highly extensible for bringing custom
services under management.
• Full visibility into cluster health – Ambari ensures that the cluster is healthy and available
with a holistic approach to monitoring.
2.15. Zookeeper
ZooKeeper in Hadoop is a centralized service. It maintains configuration information and naming, and
it provides distributed synchronization and group services. ZooKeeper also manages and
coordinates a large cluster of machines.
Benefits of Zookeeper:
• Fast – ZooKeeper is fast with workloads where reads of the data are more common than writes.
The ideal read/write ratio is about 10:1.
• Ordered – ZooKeeper maintains a record of all transactions, which can also be used to build
high-level abstractions such as synchronization primitives.
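A minimal sketch of ZooKeeper as a shared configuration store is shown below, using the standard ZooKeeper Java client; the ensemble address, znode path, and stored value are placeholders.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperConfigExample {
  public static void main(String[] args) throws Exception {
    // Connect to a local ensemble (placeholder address), 3 s session timeout.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});
    String path = "/app-config";

    // Publish a piece of configuration as a znode, if it is not there yet.
    if (zk.exists(path, false) == null) {
      zk.create(path, "batch.size=500".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Any machine in the cluster can read the same configuration value.
    byte[] data = zk.getData(path, false, null);
    System.out.println(new String(data));
    zk.close();
  }
}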

2.16. Oozie
Oozie is a workflow scheduler system used to manage Apache Hadoop jobs. It combines multiple jobs
sequentially into one logical unit of work. The Oozie framework is fully integrated with the Apache
Hadoop stack, with YARN as its architectural center, and it supports Hadoop jobs for Apache
MapReduce, Pig, Hive, and Sqoop.
Oozie is scalable and very flexible. One can easily start, stop, suspend, and rerun jobs, which
makes it easy to rerun failed workflows. It is also possible to skip a specific failed node.
There are two basic types of Oozie jobs:
• Oozie workflow – It stores and runs workflows composed of Hadoop jobs, e.g., MapReduce, Pig,
and Hive.
• Oozie coordinator – It runs workflow jobs based on predefined schedules and availability
of data.
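As an illustrative sketch only, the snippet below submits a workflow through the Oozie Java client; the Oozie URL, HDFS paths, and property values are placeholders, and the application path is assumed to contain a workflow.xml definition.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitOozieWorkflow {
  public static void main(String[] args) throws Exception {
    // Placeholder URL of the Oozie server.
    OozieClient client = new OozieClient("http://localhost:11000/oozie");

    Properties conf = client.createConfiguration();
    // HDFS directory containing the workflow.xml for this job (placeholder).
    conf.setProperty(OozieClient.APP_PATH, "hdfs://localhost:8020/user/hadoop/wordcount-wf");
    conf.setProperty("nameNode", "hdfs://localhost:8020");
    conf.setProperty("jobTracker", "localhost:8032");

    // Submit and start the workflow job.
    String jobId = client.run(conf);
    System.out.println("Workflow job submitted: " + jobId);

    // Check the job status (RUNNING, SUCCEEDED, KILLED, ...).
    WorkflowJob job = client.getJobInfo(jobId);
    System.out.println("Status: " + job.getStatus());
  }
}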
