
12/19/2015

Hadoop 2.0 (MRv2 / YARN)

www.Bigdatainpractice.com

Storage and Processing: HDFS and MapReduce (Hadoop 1)

[Diagram: the Hadoop 1 architecture. Storage (HDFS): a single Name Node assisted by a Secondary Name Node. Processing (MapReduce): a single Job Tracker. Worker machines D1 through D8 each run a Data Node and a Task Tracker.]

Big Data: Hadoop Eco-System (MRv1)

[Diagram: the MRv1 eco-system. Diverse business applications (APP1, APP2, web logs, emails and documents (un-structured), flat files, EDW) feed data into the Hadoop cluster, where the MapReduce framework runs on top of the Hadoop Distributed File System. Extracted knowledge is deployed through reporting tools and drives historically justified business decisions.]

Challenges with Hadoop 1

Challenges with HDFS in Hadoop 1:
1. A single Name Node with a single namespace is a single point of failure.
2. The whole namespace must fit in the Name Node's RAM, and the namespace cannot be scaled horizontally (adding Data Nodes does not help).
3. The Secondary Name Node is not a failover node; if the Name Node fails, recovery is a manual process.

[Diagram: one Name Node in front of Data Nodes D1 through D8.]

Challenges with Hadoop 1

Challenges with MapReduce in Hadoop 1:
1. The Job Tracker is a single point of failure.
2. The Job Tracker is overburdened by life-cycle management of all the jobs (CPU).
3. A single Job Tracker has to monitor thousands of MapReduce jobs (network).

[Diagram: one Job Tracker coordinating Task Trackers T1 through T8, with a Task Tracker on every data node.]

Other MRv1 challenges:
1. Cascading failures
2. Limited multi-tenancy (the cluster is tied to the MapReduce model)

Hadoop 2.0: Federation (horizontal scalability)

Name Node Federation:

[Diagram: Name Nodes 1 through n, each managing its own namespace (Namespace 1 ... Namespace n) and its own block pool (Pool 1 ... Pool n); Data Nodes 1 through N store blocks for all block pools.]

- Reduces the load on a single name node, as there are multiple name nodes, each looking after a separate namespace.
- Provides support for cross-data-center deployments of HDFS.
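A minimal hdfs-site.xml sketch for a two-namespace federation (the host names and ports are illustrative assumptions):

<!-- declare the federated nameservices -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<!-- one RPC address per nameservice -->
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>nn2.example.com:8020</value>
</property>

With this configuration every Data Node registers with both name nodes and maintains one block pool per namespace.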


Hadoop 2.0: High Availability

[Diagram: an Active Name Node and a Passive (standby) Name Node on either side of three Journal Nodes; Data Nodes 1 through N report to both name nodes.]

- The Active Name Node writes its edit log to the Journal Nodes.
- The Passive Name Node reads the edit logs and updates itself.
- Data Nodes send heart-beat messages to both Name Nodes.
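A minimal hdfs-site.xml sketch for quorum-journal HA (the nameservice name, host names and ports are illustrative assumptions):

<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>host1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>host2.example.com:8020</value>
</property>
<!-- the active name node writes its edit log to this journal-node quorum -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
</property>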

Big Data: Hadoop Eco-System with Apache YARN

[Diagram: the same eco-system as the MRv1 picture, except that several frameworks now share the cluster: the MapReduce framework, HBase, Apache Spark, Giraph and other frameworks all run on the Hadoop cluster on top of the Hadoop Distributed File System. Data still flows in from diverse business applications (APP1, APP2, web logs, emails and documents (un-structured), flat files, EDW), and extracted knowledge is deployed through reporting tools to drive historically justified business decisions.]

Hadoop 2.0: Apache YARN

Resource Manager:
- Capacity Scheduler: schedules MR or non-MR jobs on the cluster.
- Applications Manager: starts and monitors the Application Masters running on the cluster, and restarts an Application Master on a different node in case of failure.

Application Master:
- Gets the input splits.
- Requests resources and starts containers on different nodes.
- Works with the client through the complete life cycle of the job.

[Diagram: a client submits a job to the Resource Manager, which hosts the Capacity Scheduler and the Applications Manager; Node Managers on the worker nodes host the Application Master and its containers, all backed by HDFS.]
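The scheduler is pluggable; as a sketch, the Capacity Scheduler is selected in yarn-site.xml with the stock Hadoop 2 property and class:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>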

Apache Oozie


Introduction to Oozie
Oozie Important Components
Oozie Workflow Components
Oozie Workflow Features
Oozie Operational Details


Apache Oozie
1. Open-source Java application for scheduling Hadoop jobs.
2. It combines jobs into a logical unit of work called a workflow.
3. A workflow is a set of actions forming a DAG.
4. Oozie is integrated with YARN and supports Hive, Pig and MapReduce actions.
5. Oozie also supports actions specific to a system (like Java programs or shell actions).
6. Oozie supports decision, fork (parallel execution) and join nodes.
7. Workflows, coordinators and bundles are its important components.
8. Complex data transformations can be designed and scheduled using Apache Oozie.

Oozie important components

1) Oozie Workflow:
A set of actions forming a DAG. The actions have to be executed in a certain sequence and/or in parallel.
2) Oozie Coordinator:
A coordinator is useful for scheduling repetitive tasks. It can be set to trigger different workflows on a time frequency or on system events (like data availability).
3) Oozie Bundles:
A bundle is a way to package multiple coordinators and workflow jobs and to manage the life cycle of those jobs.
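As an illustration, a minimal coordinator.xml that runs the OozieSimple workflow once a day could look like this (a sketch; the dates and application path are assumptions):

<coordinator-app name="daily-oozie-simple" frequency="${coord:days(1)}"
    start="2015-12-20T00:00Z" end="2016-12-20T00:00Z" timezone="UTC"
    xmlns="uri:oozie:coordinator:0.2">
  <action>
    <workflow>
      <!-- path of the workflow application directory on HDFS -->
      <app-path>${nameNode}/user/${user.name}/OozieSimple</app-path>
    </workflow>
  </action>
</coordinator-app>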

Oozie important workflow components

Start
End
Action
Decision
Fork
Join

A minimal workflow skeleton using all of these node types is sketched below.
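A sketch of a workflow.xml combining the node types above (the node names, the decision variable run_actions and the fs action bodies are illustrative):

<workflow-app xmlns="uri:oozie:workflow:0.2" name="minimal_example">
  <start to="decision_node"/>
  <!-- decision: run the actions only when run_actions=1 in job.properties -->
  <decision name="decision_node">
    <switch>
      <case to="fork_node">${wf:conf('run_actions') eq 1}</case>
      <default to="end"/>
    </switch>
  </decision>
  <!-- fork: run both actions in parallel -->
  <fork name="fork_node">
    <path start="action_1"/>
    <path start="action_2"/>
  </fork>
  <action name="action_1">
    <fs><mkdir path="${nameNode}/tmp/example1"/></fs>
    <ok to="join_node"/>
    <error to="fail"/>
  </action>
  <action name="action_2">
    <fs><mkdir path="${nameNode}/tmp/example2"/></fs>
    <ok to="join_node"/>
    <error to="fail"/>
  </action>
  <!-- join: wait for both parallel paths before ending -->
  <join name="join_node" to="end"/>
  <kill name="fail">
    <message>Action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>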

12/19/2015

Oozie Workflow Features

1) Complex data transformation tasks can be designed using Oozie workflows.
2) Allows easy re-execution of a workflow in case of failures.
3) Allows parallel execution of tasks wherever possible.
4) Allows partial execution with the use of decision variables.
5) The output of one action can be passed on to the next actions (this is also called a pipeline); see the sketch after this list.
6) Job monitoring and error analysis can easily be done using the Oozie web console.
7) The Oozie framework is designed to be fault resistant: it maintains all state in a database, so all pending tasks are completed when Oozie comes back up (in case of failures).
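A sketch of feature 5 using a shell action (the script get_country.sh and the key country are hypothetical; <capture-output/> requires the script to print key=value lines to stdout):

<action name="read_country">
  <shell xmlns="uri:oozie:shell-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <exec>get_country.sh</exec>
    <file>get_country.sh</file>
    <!-- make the script's stdout available to later nodes -->
    <capture-output/>
  </shell>
  <ok to="next_action"/>
  <error to="fail"/>
</action>

A later action or decision node can then read the captured value with ${wf:actionData('read_country')['country']}.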

Oozie Operational Details

Important directories and files:
1) job.properties: defines all the input parameters for the job
2) workflow.xml: defines all actions and their execution sequence
3) lib: directory for the jar files

Steps:
1) Create a directory for the application (example: OozieSimple).
2) Create a job.properties file for the job that defines all the input parameters.
3) Create a workflow.xml file for the job that defines all actions and the execution sequence of those actions.
4) Create a lib directory (inside OozieSimple) where the actual jar files are copied.
5) Move OozieSimple to HDFS (with all its contents).
6) Trigger the Oozie job with the oozie command (see the sketch after this list).
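For steps 5 and 6, the commands could look like this (a sketch; the HDFS path and Oozie server URL are assumptions, 11000 being Oozie's default port):

# copy the application directory, with all its contents, to HDFS
hdfs dfs -put OozieSimple /user/cloudera/OozieSimple

# submit and start the workflow defined by job.properties
oozie job -oozie http://localhost:11000/oozie -config OozieSimple/job.properties -run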


Oozie Operational Details


Sample job.properties

#*****************cluster settings*******************
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
queueName=default
oozie.wf.application.path=${nameNode}/user/${user.name}/OozieSimple
#****************decision variable*******************
is_mapreduce_simple_run=1
is_python_run=1
#******************Step 1 Python variables***********
#***Set country=NA if it has to be read from file*****
country=IN
FilePath=/user/cloudera/country.txt
#****************mapreduce variables*****************
input=/user/cloudera/INPUT1/SalesData.csv
output=/user/cloudera/OozieDecisionOUT


Sample decision node

<decision name="check_is_python_action_run">
  <switch>
    <case to="python_action">
      ${wf:conf('is_python_run') eq 1}
    </case>
    <default to="check_is_mapreduce_simple_run"/>
  </switch>
</decision>


Sample Fork

<workflow-app xmlns="uri:oozie:workflow:0.2"
    name="pscore_datacheck">
  <start to="fork"/>
  <fork name="fork">
    <path start="decision_check_action_1"/>
    <path start="decision_check_action_2"/>
    <path start="action_3"/>
  </fork>
  <!-- the three parallel paths must later meet at a matching <join> node -->


Sample FS Action

<action name="create_output_directories">
  <fs>
    <mkdir path="${nameNode}/user/hive/warehouse/prep"/>
    <mkdir path="${nameNode}/user/hive/warehouse/scoring"/>
    <mkdir path="${nameNode}/user/hive/warehouse/post"/>
  </fs>
  <ok to="decision_on_next_activity"/>
  <error to="send-email"/>
</action>


Thank You

www.Bigdatainpractice.com
