
12/19/2015

Hadoop 2.0 (MRv2 / YARN)

www.Bigdatainpractice.com

Storage and Processing: HDFS and MapReduce (Hadoop 1)

[Diagram: the Hadoop 1 architecture. Storage (HDFS): a single Name Node assisted by a Secondary Name Node. Processing (MapReduce): a single Job Tracker. Worker machines D1 through D8 each run a Data Node and a Task Tracker.]

Big Data: Hadoop Eco-System (MRv1)

[Diagram: the MRv1 eco-system. Diverse business applications (APP1, APP2, web logs, emails and documents (un-structured), flat files, EDW) feed data into the Hadoop cluster, where the MapReduce framework runs on top of the Hadoop Distributed File System. Extracted knowledge is deployed through reporting tools and drives historically justified business decisions.]

Challenges with Hadoop 1

Challenges with HDFS in Hadoop 1:
1. A single Name Node with a single namespace is a single point of failure.
2. The whole namespace must fit in the Name Node's RAM, and the namespace cannot be scaled horizontally (adding Data Nodes does not help).
3. The Secondary Name Node is not a failover node; if the Name Node fails, recovery is a manual process.

[Diagram: one Name Node in front of Data Nodes D1 through D8.]

Challenges with Hadoop 1

Challenges with MapReduce in Hadoop 1:
1. The Job Tracker is a single point of failure.
2. The Job Tracker is overburdened by life-cycle management of all the jobs (CPU).
3. A single Job Tracker has to monitor thousands of MapReduce jobs (network).

[Diagram: one Job Tracker coordinating Task Trackers T1 through T8, with a Task Tracker on every data node.]

Other MRv1 challenges:
1. Cascading failures
2. Limited multi-tenancy (the cluster is tied to the MapReduce model)

Hadoop 2.0: Federation (horizontal scalability)

Name Node Federation:

[Diagram: Name Nodes 1 through n, each managing its own namespace (Namespace 1 ... Namespace n) and its own block pool (Pool 1 ... Pool n); Data Nodes 1 through N store blocks for all block pools.]

- Reduces the load on a single name node, as there are multiple name nodes, each looking after a separate namespace.
- Provides support for cross-data-center deployments of HDFS.
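A minimal hdfs-site.xml sketch for a two-namespace federation (the host names and ports are illustrative assumptions):

<!-- declare the federated nameservices -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<!-- one RPC address per nameservice -->
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>nn2.example.com:8020</value>
</property>

With this configuration every Data Node registers with both name nodes and maintains one block pool per namespace.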


Hadoop 2.0: High Availability

[Diagram: an Active Name Node and a Passive (standby) Name Node on either side of three Journal Nodes; Data Nodes 1 through N report to both name nodes.]

- The Active Name Node writes its edit log to the Journal Nodes.
- The Passive Name Node reads the edit logs and updates itself.
- Data Nodes send heart-beat messages to both Name Nodes.
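A minimal hdfs-site.xml sketch for quorum-journal HA (the nameservice name, host names and ports are illustrative assumptions):

<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>host1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>host2.example.com:8020</value>
</property>
<!-- the active name node writes its edit log to this journal-node quorum -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
</property>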

Big Data: Hadoop Eco-System with Apache YARN

[Diagram: the same eco-system as the MRv1 picture, except that several frameworks now share the cluster: the MapReduce framework, HBase, Apache Spark, Giraph and other frameworks all run on the Hadoop cluster on top of the Hadoop Distributed File System. Data still flows in from diverse business applications (APP1, APP2, web logs, emails and documents (un-structured), flat files, EDW), and extracted knowledge is deployed through reporting tools to drive historically justified business decisions.]

Hadoop 2.0: Apache YARN

Resource Manager:
- Capacity Scheduler: schedules MR or non-MR jobs on the cluster.
- Applications Manager: starts and monitors the Application Masters running on the cluster, and restarts an Application Master on a different node in case of failure.

Application Master:
- Gets the input splits.
- Requests resources and starts containers on different nodes.
- Works with the client through the complete life cycle of the job.

[Diagram: a client submits a job to the Resource Manager, which hosts the Capacity Scheduler and the Applications Manager; Node Managers on the worker nodes host the Application Master and its containers, all backed by HDFS.]
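The scheduler is pluggable; as a sketch, the Capacity Scheduler is selected in yarn-site.xml with the stock Hadoop 2 property and class:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>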

Apache Oozie


Introduction to Oozie
Oozie Important Components
Oozie Workflow Components
Oozie Workflow Features
Oozie Operational Details


Apache Oozie
1. Open-source Java application for scheduling Hadoop jobs.
2. It combines jobs into a logical unit of work called a workflow.
3. A workflow is a set of actions forming a DAG.
4. Oozie is integrated with YARN and supports Hive, Pig and MapReduce actions.
5. Oozie also supports actions specific to a system (like Java programs or shell actions).
6. Oozie supports decision, fork (parallel execution) and join nodes.
7. Workflows, coordinators and bundles are its important components.
8. Complex data transformations can be designed and scheduled using Apache Oozie.

Oozie important components

1) Oozie Workflow:
A set of actions forming a DAG. The actions have to be executed in a certain sequence and/or in parallel.
2) Oozie Coordinator:
A coordinator is useful for scheduling repetitive tasks. It can be set to trigger different workflows on a time frequency or on system events (like data availability).
3) Oozie Bundles:
A bundle is a way to package multiple coordinators and workflow jobs and to manage the life cycle of those jobs.
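As an illustration, a minimal coordinator.xml that runs the OozieSimple workflow once a day could look like this (a sketch; the dates and application path are assumptions):

<coordinator-app name="daily-oozie-simple" frequency="${coord:days(1)}"
    start="2015-12-20T00:00Z" end="2016-12-20T00:00Z" timezone="UTC"
    xmlns="uri:oozie:coordinator:0.2">
  <action>
    <workflow>
      <!-- path of the workflow application directory on HDFS -->
      <app-path>${nameNode}/user/${user.name}/OozieSimple</app-path>
    </workflow>
  </action>
</coordinator-app>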

Oozie important workflow components

Start
End
Action
Decision
Fork
Join

A minimal workflow skeleton using all of these node types is sketched below.
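A sketch of a workflow.xml combining the node types above (the node names, the decision variable run_actions and the fs action bodies are illustrative):

<workflow-app xmlns="uri:oozie:workflow:0.2" name="minimal_example">
  <start to="decision_node"/>
  <!-- decision: run the actions only when run_actions=1 in job.properties -->
  <decision name="decision_node">
    <switch>
      <case to="fork_node">${wf:conf('run_actions') eq 1}</case>
      <default to="end"/>
    </switch>
  </decision>
  <!-- fork: run both actions in parallel -->
  <fork name="fork_node">
    <path start="action_1"/>
    <path start="action_2"/>
  </fork>
  <action name="action_1">
    <fs><mkdir path="${nameNode}/tmp/example1"/></fs>
    <ok to="join_node"/>
    <error to="fail"/>
  </action>
  <action name="action_2">
    <fs><mkdir path="${nameNode}/tmp/example2"/></fs>
    <ok to="join_node"/>
    <error to="fail"/>
  </action>
  <!-- join: wait for both parallel paths before ending -->
  <join name="join_node" to="end"/>
  <kill name="fail">
    <message>Action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>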

12/19/2015

Oozie Workflow Features

1) Complex data transformation tasks can be designed using Oozie workflows.
2) Allows easy re-execution of a workflow in case of failures.
3) Allows parallel execution of tasks wherever possible.
4) Allows partial execution with the use of decision variables.
5) The output of one action can be passed on to the next actions (this is also called a pipeline); see the sketch after this list.
6) Job monitoring and error analysis can easily be done using the Oozie web console.
7) The Oozie framework is designed to be fault resistant: it maintains all state in a database, so all pending tasks are completed when Oozie comes back up (in case of failures).
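A sketch of feature 5 using a shell action (the script get_country.sh and the key country are hypothetical; <capture-output/> requires the script to print key=value lines to stdout):

<action name="read_country">
  <shell xmlns="uri:oozie:shell-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <exec>get_country.sh</exec>
    <file>get_country.sh</file>
    <!-- make the script's stdout available to later nodes -->
    <capture-output/>
  </shell>
  <ok to="next_action"/>
  <error to="fail"/>
</action>

A later action or decision node can then read the captured value with ${wf:actionData('read_country')['country']}.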

Oozie Operational Details

Important directories and files:
1) job.properties: defines all the input parameters for the job
2) workflow.xml: defines all actions and their execution sequence
3) lib: directory for the jar files

Steps:
1) Create a directory for the application (example: OozieSimple).
2) Create a job.properties file for the job that defines all the input parameters.
3) Create a workflow.xml file for the job that defines all actions and the execution sequence of those actions.
4) Create a lib directory (inside OozieSimple) where the actual jar files are copied.
5) Move OozieSimple to HDFS (with all its contents).
6) Trigger the Oozie job with the oozie command (see the sketch after this list).
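For steps 5 and 6, the commands could look like this (a sketch; the HDFS path and Oozie server URL are assumptions, 11000 being Oozie's default port):

# copy the application directory, with all its contents, to HDFS
hdfs dfs -put OozieSimple /user/cloudera/OozieSimple

# submit and start the workflow defined by job.properties
oozie job -oozie http://localhost:11000/oozie -config OozieSimple/job.properties -run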


Oozie Operational Details


Sample job.properties

#*****************cluster settings*******************
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
queueName=default
oozie.wf.application.path=${nameNode}/user/${user.name}/OozieSimple
#****************decision variable*******************
is_mapreduce_simple_run=1
is_python_run=1
#******************Step 1 Python variables***********
#***Set country=NA if it has to be read from file*****
country=IN
FilePath=/user/cloudera/country.txt
#****************mapreduce variables*****************
input=/user/cloudera/INPUT1/SalesData.csv
output=/user/cloudera/OozieDecisionOUT


Sample decision node

<decision name="check_is_python_action_run">
  <switch>
    <case to="python_action">
      ${wf:conf('is_python_run') eq 1}
    </case>
    <default to="check_is_mapreduce_simple_run"/>
  </switch>
</decision>


Sample Fork

<workflow-app xmlns="uri:oozie:workflow:0.2"
    name="pscore_datacheck">
  <start to="fork"/>
  <fork name="fork">
    <path start="decision_check_action_1"/>
    <path start="decision_check_action_2"/>
    <path start="action_3"/>
  </fork>
  <!-- the three parallel paths must later meet at a matching <join> node -->


Sample FS Action

<action name="create_output_directories">
  <fs>
    <mkdir path="${nameNode}/user/hive/warehouse/prep"/>
    <mkdir path="${nameNode}/user/hive/warehouse/scoring"/>
    <mkdir path="${nameNode}/user/hive/warehouse/post"/>
  </fs>
  <ok to="decision_on_next_activity"/>
  <error to="send-email"/>
</action>


Thank You

www.Bigdatainpractice.com
