
Framework for Processing Data in Hadoop - YARN and MapReduce


Hadoop-1 Data Processing Architecture
[Diagrams: Hadoop 1 Architecture; Hadoop 1 Daemons and Application Execution]
● Distributed storage – data, interim results, and final results are written to the storage layer (HDFS).

● Resource management – Hadoop resources: CPU cycles, memory, disk, and network bandwidth. These are shared among multiple applications in a predictable and tunable way. Performed by the JobTracker daemon.

● Processing framework – MapReduce, managed by the JobTracker; local execution is handled by TaskTracker daemons.

● API – applications are developed using the MapReduce API. The Pig and Hive projects provide programmers with easier interfaces and are compiled into MapReduce jobs (a minimal example follows).
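As an illustration of the MapReduce API, here is a minimal word-count Mapper and Reducer sketch (the class names are illustrative, not from the original material):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits (word, 1) for every token in the input split.
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);
    }
  }
}

// Sums the counts for each word across all map outputs.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}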
Launching a MapReduce application
1. The client application submits an application request to the JobTracker.

2. The JobTracker determines how many processing resources are needed to execute the entire application. This is done by requesting the locations and names of the files and data blocks that the application needs from the NameNode, and calculating how many map tasks and reduce tasks will be needed to process all this data.

3. The JobTracker looks at the state of the slave nodes and queues all the map tasks and reduce tasks for execution.
4. As processing slots become available on the slave nodes, map tasks are deployed to them. Map tasks assigned to specific blocks of data are scheduled on nodes where that data is stored.

5. The JobTracker monitors task progress; in the event of a task failure or a node failure, the task is restarted on the next available slot.

6. After the map tasks are finished, reduce tasks process the interim result sets from the map tasks.

7. The result set is returned to the client application.
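A minimal driver for the classes sketched earlier shows what the client actually submits in step 1. This uses the org.apache.hadoop.mapreduce API; the two command-line arguments are assumed to be HDFS input and output paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);     // defined in the earlier sketch
    job.setReducerClass(WordCountReducer.class);   // defined in the earlier sketch
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    // Submits the job and blocks until it completes (steps 1-7 above).
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}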


Job Tracker
● The JobTracker process runs on the master node.

● Receives requests for MapReduce execution from the client.

● Talks to the NameNode to determine the location of the data.

● Finds the best TaskTracker nodes to execute tasks, based on data locality (proximity of the data) and the available slots to execute a task on a given node.

● Monitors the individual TaskTrackers and submits the overall status of the job back to the client.

● When the JobTracker is down, HDFS is still functional, but MapReduce execution cannot be started and existing MapReduce jobs are halted.
Task Tracker
● The TaskTracker runs on DataNodes – usually on all DataNodes.

● Mapper and Reducer tasks are executed on DataNodes administered by TaskTrackers.

● TaskTrackers are assigned Mapper and Reducer tasks to execute by the JobTracker.

● The TaskTracker is in constant communication with the JobTracker, signalling the progress of the task in execution.

● TaskTracker failure is not considered fatal. When a TaskTracker becomes unresponsive, the JobTracker assigns the tasks it was executing to another node.
Hadoop 1
● Setting the optimal number of map and reduce slots is critical (see the configuration sketch below).
● Every map/reduce task spawns its own JVM.
● Separate map and reduce slots are defined, as their requirements differ:
  - Map tasks: data locality, heavy disk and CPU usage
  - Reduce tasks: more network bandwidth, as they receive output from the map tasks
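For reference, slot counts in Hadoop 1 are fixed per TaskTracker in mapred-site.xml. A sketch with illustrative values (not tuning recommendations):

<configuration>
  <!-- Maximum concurrent map slots on this TaskTracker -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <!-- Maximum concurrent reduce slots on this TaskTracker -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
  <!-- Heap for the per-task JVM that each slot spawns -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
</configuration>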
Hadoop 1.x Limitations

● Suitable only for batch processing of huge amounts of data that is already in the Hadoop system; not suitable for real-time data processing or data streaming.

● Supports up to 4,000 nodes per cluster.

● A single component, the JobTracker, performs many activities: resource management, job scheduling, job monitoring, rescheduling of jobs, etc.

● Supports only one NameNode and one namespace per cluster.

● Does not support multi-tenancy.

● Runs only MapReduce jobs.
Fundamental Idea of YARN

YARN decouples cluster resource management and scheduling from MapReduce's data processing component, enabling Hadoop to support varied types of processing and a broader array of applications.

Hadoop clusters can now run interactive querying, streaming data, and real-time analytics applications.

There is a global ResourceManager (RM) and a per-application ApplicationMaster (AM).
YARN provides …

● An elegant solution to the drawbacks of Hadoop 1

● More efficient and flexible workload scheduling and resource management

With this, Hadoop has become a general-purpose platform for data processing.
Hadoop-2 Data Processing Architecture
[Diagram: Hadoop with YARN architecture]

● Distributed storage: Nothing has changed here with the shift from MapReduce to YARN – HDFS is still the storage layer for Hadoop.

● Resource management: The key underlying concept in the shift from Hadoop 1 to YARN is decoupling resource management from data processing. This enables YARN to provide resources to any processing framework written for Hadoop, including MapReduce.

● Processing framework: Because YARN is a general-purpose resource management facility, it can allocate cluster resources to any data processing framework written for Hadoop. The processing framework then handles application runtime issues.

● Application Programming Interface (API): With support for additional processing frameworks comes support for additional APIs. Hoya (for running HBase on YARN), Apache Giraph (for graph processing), Open MPI (for message passing in parallel systems), and Apache Storm (for data stream processing) are in active development.
YARN Daemons
● Resource Manager (RM)
● Node Manager (NM)
● Application Master (AM)
Resource Manager
● The core component of YARN
● Runs on the master node
● A critical component in Hadoop
● Maintains a global view of all resources in the cluster
● Handles and schedules resource requests
● Pluggable scheduler – FIFO, Fair Scheduler, Capacity Scheduler, etc.; the Capacity Scheduler is supported in Hadoop 2.0 (example configuration below)
● Completely agnostic about applications and frameworks – the ApplicationMaster and NodeManager take care of those concerns
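As an illustration of pluggable scheduling, a minimal capacity-scheduler.xml defining two queues; the queue names and percentages here are made up for the example:

<configuration>
  <!-- Two top-level queues under the predefined root queue -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>prod,dev</value>
  </property>
  <!-- Guaranteed share of cluster capacity for each queue, in percent -->
  <property>
    <name>yarn.scheduler.capacity.root.prod.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.capacity</name>
    <value>30</value>
  </property>
</configuration>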
Node Manager (NM)
● Each slave node has an NM, which is a slave to the RM
● Tracks the available data processing resources on its slave node and sends regular reports to the RM
● Each NodeManager offers some resources to the cluster; its resource capacity is the amount of memory and the number of cores
● A container is a fraction of the NM capacity, used by the client for running a program
● Containers are similar to slots, but are generic, i.e. they can run any given application
● Containers can be requested with custom amounts of resources, whereas slots are uniform (see the sketch below)
● All containers running on a slave node are provisioned, monitored, and tracked by the NM
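A sketch of how an ApplicationMaster would ask for a custom-sized container through the YARN client API; the 2 GB / 2 vcore figures are arbitrary, and amRMClient is assumed to be an already-initialized, started AMRMClient:

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerRequestSketch {
  // Unlike a fixed-size slot, a container is requested with custom amounts.
  static void requestContainer(AMRMClient<ContainerRequest> amRMClient) {
    Resource capability = Resource.newInstance(2048, 2); // 2048 MB memory, 2 vcores
    Priority priority = Priority.newInstance(0);
    // nodes/racks left null: no placement constraint for this request
    ContainerRequest request = new ContainerRequest(capability, null, null, priority);
    amRMClient.addContainerRequest(request);
  }
}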
Application Master
● Each application running on Hadoop has its own ApplicationMaster, running in a container on a slave node
● Sends heartbeat messages to the RM with its status and the resource needs of its application
● The RM grants it container resource leases on specific slave nodes
● It oversees the full lifecycle of an application – requesting resources from the RM and submitting container leases to the NM

Thus, each application framework written for Hadoop must have its own ApplicationMaster implementation, as sketched below.
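A bare-bones view of that lifecycle using the synchronous AMRMClient API; error handling, the allocate loop, and the actual container launch via an NMClient are omitted:

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AppMasterSketch {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new YarnConfiguration());
    rmClient.start();

    // Register with the ResourceManager (host, RPC port, tracking URL).
    rmClient.registerApplicationMaster("", 0, "");

    // Heartbeat/allocate call: reports progress, receives container leases.
    AllocateResponse response = rmClient.allocate(0.5f);
    for (Container container : response.getAllocatedContainers()) {
      // Launch work on the granted container via an NMClient (not shown).
    }

    // Deregister from the RM and shut down when the application is done.
    rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", "");
    rmClient.stop();
  }
}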
YARN Application Execution
● The client application submits an application request to the Resource Manager.
● The Resource Manager asks a Node Manager to create an Application Master
instance for this application. The Node Manager gets a container for it and
starts it up.
● This new Application Master initializes itself by registering itself with the
Resource Manager.
● The Application Master figures out how many processing resources are
needed to execute the entire application.
● The Application Master then requests the necessary resources from the
Resource Manager.
● The Resource Manager accepts the resource request and queues up the
specific resource requests alongside all the other resource requests that are
already scheduled.

● As the requested resources become available on the slave nodes, the Resource Manager grants the Application Master leases for containers on specific slave nodes.

● The Application Master requests the assigned container from the Node Manager and sends it a Container Launch Context (CLC) (see the client-side sketch below).

● The application executes while the container processes are running.
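These steps map onto the YARN client API roughly as follows. A minimal sketch: the application name is made up, and my.example.AppMaster is a hypothetical AM class assumed to be packaged and shipped separately:

import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmitSketch {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Ask the ResourceManager for a new application.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("demo-app");

    // The CLC tells a NodeManager how to start the ApplicationMaster container.
    ContainerLaunchContext clc = Records.newRecord(ContainerLaunchContext.class);
    clc.setCommands(Collections.singletonList(
        "java -Xmx256m my.example.AppMaster")); // hypothetical AM class
    appContext.setAMContainerSpec(clc);
    appContext.setResource(Resource.newInstance(512, 1)); // container for the AM itself

    // The RM then finds a NodeManager to launch the AM in that container.
    yarnClient.submitApplication(appContext);
  }
}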
● Also, while containers are running, the Resource Manager can send a kill
order to the Node Manager to terminate a specific container.

● In the case of MapReduce applications, after the map tasks are finished, the
Application Master requests resources for a round of reduce tasks to process
the interim result sets from the map tasks.

● When all tasks are complete, the Application Master sends the result set to the
client application, informs the Resource Manager that the application has
successfully completed, deregisters itself from the Resource Manager, and
shuts itself down.
Hadoop 1 vs Hadoop 2

Hadoop 1: Supports the MapReduce (MR) processing model only; does not support non-MR tools.
Hadoop 2: Allows working with MR as well as other distributed computing models like Spark, Hama, Giraph, MPI (Message Passing Interface), and HBase coprocessors.

Hadoop 1: MR does both processing and cluster-resource management.
Hadoop 2: YARN (Yet Another Resource Negotiator) does cluster resource management, and processing is done using different processing models.

Hadoop 1: Limited scaling of nodes – up to 4,000 nodes per cluster.
Hadoop 2: Better scalability – up to 10,000 nodes per cluster.

Hadoop 1: Works on the concept of slots – a slot can run either a map task or a reduce task only.
Hadoop 2: Works on the concept of containers – containers can run generic tasks.
Hadoop 1: A single NameNode manages the entire namespace.
Hadoop 2: Multiple NameNode servers manage multiple namespaces.

Hadoop 1: Has a Single Point of Failure (SPOF) because of the single NameNode; in the case of NameNode failure, manual intervention is needed to recover.
Hadoop 2: Overcomes the SPOF with a standby NameNode; in the case of NameNode failure, it is configured for automatic recovery.

Hadoop 1: Limited as a platform for event processing, streaming, and real-time operations.
Hadoop 2: Can serve as a platform for a wide variety of data analytics – it is possible to run event processing, streaming, and real-time operations.
Real-Time and Streaming Applications
● Tez supports real-time applications, where the user expects an immediate response
● Storm analyzes streaming data; it processes data that hasn't yet been stored to disk

For both to work, YARN provides:

● A dedicated Application Master that stays alive, waiting to coordinate requests
● The Application Master also holds open leases on reusable containers, so it can execute requests as they arrive
