● API: Applications are developed using the MapReduce API. The Pig and Hive
projects provide programmers with easier interfaces, and their programs are
compiled into MapReduce jobs.
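To make the programming model concrete, the following is a minimal sketch in plain Python of the map, shuffle and reduce steps for a word count. It does not use the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Run every line through the mapper, then reduce each group of values.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input, used only to exercise the sketch.
counts = word_count(["pig and hive", "hive compiles to mapreduce"])
```

In a real cluster the framework performs the shuffle and runs many mappers and reducers in parallel; here everything runs in one process.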
Launching a MapReduce application
1. The client application submits an application request to the JobTracker.
3. The JobTracker looks at the state of the slave nodes and queues all the
map tasks and reduce tasks for execution.
4. As processing slots become available on the slave nodes, map tasks are
deployed to them. Where possible, a map task is assigned to a node that
stores the block of data it will process.
5. The JobTracker monitors task progress, and in the event of a task failure
or a node failure, the task is restarted on the next available slot.
6. After the map tasks are finished, reduce tasks process the interim result
sets from the map tasks.
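The launch steps above can be sketched as a small single-process simulation: tasks are queued, run as slots free up, restarted after a failure, and reduce tasks start only once all map tasks have finished. This is an illustrative model, not Hadoop code; `run_job`, the `fail_once` parameter and the task names are hypothetical.

```python
from collections import deque


def run_job(map_tasks, reduce_tasks, slots, fail_once=()):
    """Run queued tasks in rounds of `slots` tasks; return the execution log.

    fail_once: task names that fail on their first attempt (hypothetical
    failure injection used only by this sketch).
    """
    log = []
    already_failed = set()
    # Maps are queued first; reduces start only after every map has finished.
    for phase in (map_tasks, reduce_tasks):
        queue = deque(phase)
        while queue:
            # Fill the currently available slots from the front of the queue.
            batch = [queue.popleft() for _ in range(min(slots, len(queue)))]
            for task in batch:
                if task in fail_once and task not in already_failed:
                    already_failed.add(task)
                    log.append((task, "failed"))
                    queue.append(task)  # restarted on a later free slot
                else:
                    log.append((task, "done"))
    return log


log = run_job(["m1", "m2", "m3"], ["r1"], slots=2, fail_once={"m2"})
```

Here `m2` fails once, is put back in the queue, and completes in the next round before the reduce task `r1` runs.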
Job Tracker
Finds the best TaskTracker nodes to execute tasks, based on data locality
(proximity of the data) and the slots available to execute a task on a given node.
Monitors the individual TaskTrackers and submits the overall status of the job
back to the client.
When the JobTracker is down, HDFS is still functional, but new MapReduce jobs
cannot be started and existing MapReduce jobs are halted.
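The selection rule described above (prefer nodes that hold the data, then nodes with free slots) might look roughly like the sketch below. It is an assumption-laden simplification: `pick_tracker` and the node names are hypothetical, and rack-level locality, which a real scheduler may also consider, is ignored.

```python
def pick_tracker(block_hosts, trackers):
    """Choose a TaskTracker for a map task over one data block.

    block_hosts: set of node names holding a replica of the block.
    trackers: dict mapping node name -> number of free slots.
    Returns the chosen node name, or None if no slot is free anywhere.
    """
    free = [node for node, slots in trackers.items() if slots > 0]
    if not free:
        return None
    # Prefer data-local nodes; among equals, prefer the node with more free slots.
    return max(free, key=lambda node: (node in block_hosts, trackers[node]))


# Hypothetical cluster state: node1 holds the data but has no free slot.
trackers = {"node1": 0, "node2": 1, "node3": 3}
chosen = pick_tracker({"node1", "node2"}, trackers)
```

With this state the task goes to `node2`, which is data-local and has a free slot, even though `node3` has more free slots.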
Task Tracker
A TaskTracker runs on a DataNode, and usually on every DataNode in the cluster.
Hadoop 1 is suitable only for batch processing of large amounts of data that
are already in the Hadoop system.
It is not suitable for real-time data processing or data streaming.
It supports only one NameNode and one namespace per cluster.
It does not support multi-tenancy.
It runs only MapReduce jobs.
Fundamental Idea of YARN
With YARN, Hadoop clusters can now run interactive querying, streaming data and
real-time analytics applications.
With this, Hadoop has become a general-purpose platform for data processing.
Hadoop-2 Data Processing Architecture
Hadoop with YARN architecture
● Distributed storage: Nothing has changed here with the shift from
MapReduce to YARN; HDFS is still the storage layer for Hadoop.
Thus each application framework written for Hadoop must have its own
Application master implementation.
YARN Application Execution
● The client application submits an application request to the Resource Manager.
● The Resource Manager asks a Node Manager to create an Application Master
instance for this application. The Node Manager gets a container for it and
starts it up.
● This new Application Master initializes itself by registering itself with the
Resource Manager.
● The Application Master figures out how many processing resources are
needed to execute the entire application.
● The Application Master then requests the necessary resources from the
Resource Manager.
● The Resource Manager accepts the resource request and queues up the
specific resource requests alongside all the other resource requests that are
already scheduled.
● The Application Master requests the assigned container from the Node
Manager and sends it a Container Launch Context (CLC).
● In the case of MapReduce applications, after the map tasks are finished, the
Application Master requests resources for a round of reduce tasks to process
the interim result sets from the map tasks.
● When all tasks are complete, the Application Master sends the result set to the
client application, informs the Resource Manager that the application has
successfully completed, deregisters itself from the Resource Manager, and
shuts itself down.
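The execution sequence above can be sketched as a toy model in which an Application Master registers with the Resource Manager, requests containers for map tasks and then for reduce tasks, and finally deregisters. This is not the YARN API: the class and method names (`ResourceManager`, `ApplicationMaster`, `submit`, `allocate`, `release`) are hypothetical, containers are treated as identical units, and Node Managers and the Container Launch Context are left out.

```python
class ResourceManager:
    """Toy Resource Manager that tracks a pool of identical containers."""

    def __init__(self, free_containers):
        self.free_containers = free_containers
        self.registered = set()
        self.events = []

    def submit(self, app_name, num_maps, num_reduces):
        # One container is set aside for the Application Master itself.
        self.free_containers -= 1
        self.events.append("AM container started")
        ApplicationMaster(app_name, self).run(num_maps, num_reduces)
        self.free_containers += 1
        return self.events

    def allocate(self, count):
        # Grant what is free right now; the caller asks again for the rest.
        granted = min(count, self.free_containers)
        self.free_containers -= granted
        return granted

    def release(self, count):
        self.free_containers += count


class ApplicationMaster:
    """Toy Application Master for a MapReduce-style application."""

    def __init__(self, name, rm):
        self.name, self.rm = name, rm

    def run(self, num_maps, num_reduces):
        self.rm.registered.add(self.name)
        self.rm.events.append("AM registered")
        # Reduce containers are requested only after all maps are finished.
        for phase, pending in (("map", num_maps), ("reduce", num_reduces)):
            while pending > 0:
                granted = self.rm.allocate(pending)
                if granted == 0:
                    raise RuntimeError("no containers available")
                self.rm.events.append(f"{granted} {phase} container(s) ran")
                self.rm.release(granted)
                pending -= granted
        self.rm.registered.discard(self.name)
        self.rm.events.append("AM deregistered")


events = ResourceManager(free_containers=3).submit("wordcount", 4, 1)
```

With three containers in the pool, one is used by the Application Master, so the four map tasks run in two rounds of two before the single reduce task is scheduled.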
Hadoop1 vs Hadoop2
● Hadoop 1: MR does both processing and cluster resource management.
  Hadoop 2: YARN (Yet Another Resource Negotiator) does cluster resource
  management, and processing is done using different processing models.
● Hadoop 1: Has limited scaling of nodes. Limited to 4000 nodes per cluster.
  Hadoop 2: Has better scalability. Scalable up to 10000 nodes per cluster.
● Hadoop 1: Works on the concept of slots; a slot can run either a Map task
  or a Reduce task only.
  Hadoop 2: Works on the concept of containers. Containers can run generic
  tasks.
● Hadoop 1: Has a limitation to serve as a platform for event processing,
  streaming and real-time operations.
  Hadoop 2: Can serve as a platform for a wide variety of data analytics; it
  is possible to run event processing, streaming and real-time operations.
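The difference between slots and containers can be illustrated with a little arithmetic. The sketch below assumes, purely for illustration, that a container is the same size as a slot; in practice containers are sized by the resources requested. The function names are hypothetical.

```python
def runnable_with_slots(map_slots, reduce_slots, pending_maps, pending_reduces):
    """Slot model: each slot type can run only its own task type."""
    return min(map_slots, pending_maps) + min(reduce_slots, pending_reduces)


def runnable_with_containers(containers, pending_maps, pending_reduces):
    """Container model: any container can run any pending task."""
    return min(containers, pending_maps + pending_reduces)


# A node with capacity for four tasks and four map tasks waiting.
slot_model = runnable_with_slots(2, 2, pending_maps=4, pending_reduces=0)
container_model = runnable_with_containers(4, pending_maps=4, pending_reduces=0)
```

Under these assumptions the slot model runs only two of the four map tasks while the two reduce slots sit idle, whereas the container model runs all four.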
Real-Time and Streaming Applications
● Tez: supports real-time applications, where the user expects an immediate
response
● Storm: analyzes streaming data; processes data that has not yet been stored
to disk