
Parallel Processing in Spark
Chapter 14
201509
Course Chapters

Course Introduction
1. Introduction

Introduction to Hadoop
2. Introduction to Hadoop and the Hadoop Ecosystem
3. Hadoop Architecture and HDFS

Importing and Modeling Structured Data
4. Importing Relational Data with Apache Sqoop
5. Introduction to Impala and Hive
6. Modeling and Managing Data with Impala and Hive
7. Data Formats
8. Data File Partitioning

Ingesting Streaming Data
9. Capturing Data with Apache Flume

Distributed Data Processing with Spark
10. Spark Basics
11. Working with RDDs in Spark
12. Aggregating Data with Pair RDDs
13. Writing and Deploying Spark Applications
14. Parallel Processing in Spark (this chapter)
15. Spark RDD Persistence
16. Common Patterns in Spark Data Processing
17. Spark SQL and DataFrames

Course Conclusion
18. Conclusion

Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera.
Parallel Programming with Spark

In this chapter you will learn:
- How RDDs are distributed across a cluster
- How Spark executes RDD operations in parallel
Chapter Topics

Distributed Data Processing with Spark: Parallel Processing in Spark

- Review: Spark on a Cluster (this section)
- RDD Partitions
- Partitioning of File-based RDDs
- HDFS and Data Locality
- Executing Parallel Operations
- Stages and Tasks
- Conclusion
- Homework: View Jobs and Stages in the Spark Application UI
Spark Cluster Review

$ spark-submit \
    --master yarn-client \
    --class MyClass \
    --num-executors 3 \
    MyApp.jar

[Diagram: worker (slave) nodes alongside the Cluster Master node and HDFS Master node]
Spark Cluster Review

$ spark-submit \
    --master yarn-client \
    --class MyClass \
    --num-executors 3 \
    MyApp.jar

[Diagram: the driver program, with its SparkContext, starts; containers are allocated on the worker nodes]
Spark Cluster Review

$ spark-submit \
    --master yarn-client \
    --class MyClass \
    --num-executors 3 \
    MyApp.jar

[Diagram: an executor launches in each of the three containers on the worker nodes]
Chapter Topics (section: RDD Partitions)
RDDs on a Cluster

- Resilient Distributed Datasets: data is partitioned across worker nodes
- Partitioning is done automatically by Spark
- Optionally, you can control how many partitions are created

[Diagram: RDD 1 split into partitions rdd_1_0, rdd_1_1, and rdd_1_2, each held in memory by a different executor]
Chapter Topics (section: Partitioning of File-based RDDs)
File Partitioning: Single Files

sc.textFile("myfile", 3)

- Partitions from single files are based on size
- You can optionally specify a minimum number of partitions: textFile(file, minPartitions)
- Default is 2
- More partitions = more parallelization

[Diagram: the file myfile split into three RDD partitions, one per executor]
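The size-based splitting described above can be approximated without a cluster. The sketch below is plain Python, not Spark's actual InputFormat logic (the function name split_into_partitions is hypothetical): it cuts a file's text into roughly equal byte ranges and extends each cut to the next line boundary, which is approximately how line records end up in partitions.

```python
def split_into_partitions(text, min_partitions=2):
    """Roughly mimic size-based file splitting: cut text into
    min_partitions byte ranges, extending each cut to a line boundary."""
    target = max(1, len(text) // min_partitions)
    partitions, start = [], 0
    while start < len(text):
        end = min(start + target, len(text))
        # Extend the split to the end of the current line
        newline = text.find("\n", end)
        end = len(text) if newline == -1 else newline + 1
        partitions.append(text[start:end])
        start = end
    return partitions

data = "line1\nline2\nline3\nline4\n"
parts = split_into_partitions(data, min_partitions=2)
```

No line is ever split across two partitions, and together the partitions reconstruct the whole file.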
File Partitioning: Multiple Files

sc.textFile("mydir/*")
- Each file becomes (at least) one partition
- File-based operations can be done per-partition, for example parsing XML

sc.wholeTextFiles("mydir")
- For many small files
- Creates a key-value PairRDD: key = file name, value = file contents

[Diagrams: in the first, file1 and file2 each become a partition of the RDD; in the second, the RDD holds (file name, file contents) pairs distributed across executors]
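The wholeTextFiles behavior can be sketched locally. The helper below (whole_text_files is a hypothetical name, not a Spark API) builds the same (file name, contents) pairs from a directory on the local filesystem:

```python
import os
import tempfile

def whole_text_files(dirpath):
    """Mimic sc.wholeTextFiles: one (file name, contents) pair per file."""
    pairs = []
    for name in sorted(os.listdir(dirpath)):
        path = os.path.join(dirpath, name)
        if os.path.isfile(path):
            with open(path) as f:
                pairs.append((name, f.read()))
    return pairs

# Build a throwaway directory with two small files, then read it back
with tempfile.TemporaryDirectory() as d:
    for name, body in [("a.txt", "alpha"), ("b.txt", "beta")]:
        with open(os.path.join(d, name), "w") as f:
            f.write(body)
    pairs = whole_text_files(d)
```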
Operating on Partitions

- Most RDD operations work on each element of an RDD
- A few work on each partition:
  - foreachPartition: call a function for each partition
  - mapPartitions: create a new RDD by executing a function on each partition in the current RDD
  - mapPartitionsWithIndex: same as mapPartitions, but includes the index of the partition
- Functions for partition operations take iterators
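The iterator-based contract can be illustrated in plain Python (a local sketch, not the Spark implementation; map_partitions and parse_with_setup are hypothetical names). A partition is modeled as a list, and the per-partition function receives an iterator and yields results, as a mapPartitions function would:

```python
def map_partitions(partitions, func):
    """Apply func (iterator -> iterator) to each partition independently."""
    return [list(func(iter(part))) for part in partitions]

def parse_with_setup(records):
    # Per-partition setup (e.g., building a parser) happens once here,
    # not once per element -- the main reason to use mapPartitions.
    prefix = ">> "
    for record in records:
        yield prefix + record

partitions = [["a", "b"], ["c"]]
result = map_partitions(partitions, parse_with_setup)
```

Each partition is processed by one call to the function, so any setup cost is paid once per partition rather than once per element.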
Chapter Topics (section: HDFS and Data Locality)
HDFS and Data Locality (1)

[Diagram-only slide]
HDFS and Data Locality (2)

$ hdfs dfs -put mydata

[Diagram: the file mydata stored in HDFS as Block 1, Block 2, and Block 3, each block on a different node]
HDFS and Data Locality (3)

[Diagram: the driver program (SparkContext) with executors running on the nodes that hold the HDFS blocks of mydata, plus one additional executor]
HDFS and Data Locality (4)

sc.textFile("hdfs://mydata").collect()

- By default, Spark partitions file-based RDDs by block
- Each block loads into a single partition

[Diagram: each HDFS block of mydata corresponds to one partition of the RDD]
HDFS and Data Locality (5)

sc.textFile("hdfs://mydata").collect()

- An action triggers execution: tasks on executors load data from blocks into partitions

[Diagram: a task on each executor loads its local HDFS block into an RDD partition]
HDFS and Data Locality (6)

sc.textFile("hdfs://mydata").collect()

- Data is distributed across executors until an action returns a value to the driver

[Diagram: partition data flows from the executors back to the driver program]
Chapter Topics (section: Executing Parallel Operations)
Parallel Operations on Partitions

- RDD operations are executed in parallel on each partition
- When possible, tasks execute on the worker nodes where the data is in memory
- Some operations preserve partitioning, e.g., map, flatMap, filter
- Some operations repartition, e.g., reduce, sort, group
Example: Average Word Length by Letter (1)

> avglens = sc.textFile(file)

[Diagram: an RDD created from the HDFS file mydata]
Example: Average Word Length by Letter (2)

> avglens = sc.textFile(file) \
      .flatMap(lambda line: line.split())

[Diagram: flatMap produces a second RDD from the first]
Example: Average Word Length by Letter (3)

> avglens = sc.textFile(file) \
      .flatMap(lambda line: line.split()) \
      .map(lambda word: (word[0], len(word)))

[Diagram: map produces a third RDD of (first letter, word length) pairs]
Example: Average Word Length by Letter (4)

> avglens = sc.textFile(file) \
      .flatMap(lambda line: line.split()) \
      .map(lambda word: (word[0], len(word))) \
      .groupByKey()

[Diagram: groupByKey repartitions the data into a new RDD, grouping all lengths for each letter]
Example: Average Word Length by Letter (5)

> avglens = sc.textFile(file) \
      .flatMap(lambda line: line.split()) \
      .map(lambda word: (word[0], len(word))) \
      .groupByKey() \
      .map(lambda (k, values): \
          (k, sum(values)/len(values)))

[Diagram: the final map produces an RDD of (letter, average word length) pairs]
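The same pipeline can be checked in plain Python (no Spark needed) to confirm what each step produces. This local sketch mirrors the flatMap / map / groupByKey / map chain above; the sample lines are made up for illustration:

```python
from collections import defaultdict

lines = ["the quick brown fox", "jumped over the lazy dog"]

# flatMap: one word per element
words = [w for line in lines for w in line.split()]

# map: (first letter, word length) pairs
pairs = [(w[0], len(w)) for w in words]

# groupByKey: collect all lengths per first letter
grouped = defaultdict(list)
for letter, length in pairs:
    grouped[letter].append(length)

# map: average word length per first letter
avglens = {k: sum(v) / len(v) for k, v in grouped.items()}
```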
Chapter Topics (section: Stages and Tasks)
Stages

- Operations that can run on the same partition are executed in stages
- Tasks within a stage are pipelined together
- Developers should be aware of stages to improve performance
Spark Execution: Stages (1)

> val avglens = sc.textFile(myfile).
      flatMap(line => line.split("\\W")).
      map(word => (word(0), word.length)).
      groupByKey().
      map(pair => (pair._1, pair._2.sum / pair._2.size.toDouble))

> avglens.saveAsTextFile("avglen-output")

[Diagram: Stage 1 spans the RDDs up to the shuffle; Stage 2 spans the shuffled RDD and the final map]
Spark Execution: Stages (2)

> val avglens = sc.textFile(myfile).
      flatMap(line => line.split("\\W")).
      map(word => (word(0), word.length)).
      groupByKey().
      map(pair => (pair._1, pair._2.sum / pair._2.size.toDouble))

> avglens.saveAsTextFile("avglen-output")

[Diagram: Stage 1 consists of Tasks 1-4, one per partition; Stage 2 consists of Tasks 5 and 6]
Spark Execution: Stages (3)

> val avglens = sc.textFile(myfile).
      flatMap(line => line.split("\\W")).
      map(word => (word(0), word.length)).
      groupByKey().
      map(pair => (pair._1, pair._2.sum / pair._2.size.toDouble))

> avglens.saveAsTextFile("avglen-output")

[Diagram: within each stage, every task pipelines that stage's operations over one partition]
Spark Execution: Stages (4)

> val avglens = sc.textFile(myfile).
      flatMap(line => line.split("\\W")).
      map(word => (word(0), word.length)).
      groupByKey().
      map(pair => (pair._1, pair._2.sum / pair._2.size.toDouble))

> avglens.saveAsTextFile("avglen-output")

[Diagram: Stage 2's tasks run after Stage 1 completes, reading the shuffled data]
Summary of Spark Terminology

- Job: a set of tasks executed as a result of an action
- Stage: a set of tasks in a job that can be executed in parallel
- Task: an individual unit of work sent to one executor
- Application: can contain any number of jobs managed by a single driver

[Diagram: a job composed of two stages; each stage contains tasks operating on RDD partitions]
How Spark Calculates Stages

- Spark constructs a DAG (Directed Acyclic Graph) of RDD dependencies
- Narrow dependencies:
  - Only one child depends on the RDD
  - No shuffle required between nodes
  - Can be collapsed into a single stage
  - e.g., map, filter, union
- Wide (or shuffle) dependencies:
  - Multiple children depend on the RDD
  - Defines a new stage
  - e.g., reduceByKey, join, groupByKey
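The stage boundary rule can be sketched as a tiny scheduler: walk the chain of operations and cut a new stage at each wide (shuffle) dependency. This is a deliberately simplified model (a linear chain rather than a full DAG, and the names WIDE_OPS and split_into_stages are hypothetical):

```python
WIDE_OPS = {"reduceByKey", "join", "groupByKey", "sortByKey"}

def split_into_stages(ops):
    """Group a linear chain of operations into stages, cutting before
    each wide (shuffle) dependency."""
    stages, current = [], []
    for op in ops:
        if op in WIDE_OPS and current:
            stages.append(current)   # close the stage at the shuffle
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

stages = split_into_stages(["textFile", "flatMap", "map", "groupByKey", "map"])
```

Applied to the average-word-length chain, this yields two stages, matching the groupByKey boundary shown on the earlier slides.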
Viewing the Stages Using toDebugString (Scala)

> val avglens = sc.textFile(myfile).
      flatMap(line => line.split("\\W")).
      map(word => (word(0), word.length)).
      groupByKey().
      map(pair => (pair._1, pair._2.sum / pair._2.size.toDouble))

> avglens.toDebugString()

(2) MappedRDD[5] at map at ...              <- Stage 2
 |  ShuffledRDD[4] at groupByKey at ...
 +-(4) MappedRDD[3] at map at ...           <- Stage 1
    |  FlatMappedRDD[2] at flatMap at ...
    |  myfile MappedRDD[1] at textFile at ...
    |  myfile HadoopRDD[0] at textFile at ...

Indents indicate stages (shuffle boundaries)
Viewing the Stages Using toDebugString (Python)

> avglens = sc.textFile(myfile) \
      .flatMap(lambda line: line.split()) \
      .map(lambda word: (word[0], len(word))) \
      .groupByKey() \
      .map(lambda (k, values): \
          (k, sum(values)/len(values)))

> print avglens.toDebugString()

(2) PythonRDD[13] at RDD at ...              <- Stage 2
 |  MappedRDD[12] at values at ...
 |  ShuffledRDD[11] at partitionBy at ...
 +-(4) PairwiseRDD[10] at groupByKey at ...  <- Stage 1
    |  PythonRDD[9] at groupByKey at ...
    |  myfile MappedRDD[7] at textFile at ...
    |  myfile HadoopRDD[6] at textFile at ...

Indents indicate stages (shuffle boundaries)
Spark Task Execution (1)

val avglens = sc.textFile(myfile).
    flatMap(line => line.split("\\W")).
    map(word => (word(0), word.length)).
    groupByKey().
    map(pair => (pair._1, pair._2.sum / pair._2.size.toDouble))

avglens.saveAsTextFile("avglen-output")

[Diagram: on the client, the driver program (SparkContext) holds the queued tasks, Stage 1 (Tasks 1-4) and Stage 2 (Tasks 5-6); four executors each run on a node holding one HDFS block of the input file]
Spark Task Execution (2)

val avglens = sc.textFile(myfile).
    flatMap(line => line.split("\\W")).
    map(word => (word(0), word.length)).
    groupByKey().
    map(pair => (pair._1, pair._2.sum / pair._2.size.toDouble))

avglens.saveAsTextFile("avglen-output")

[Diagram: the driver dispatches Tasks 1-4, one to the executor holding each HDFS block; Stage 2's tasks remain queued]
Spark Task Execution (3)

val avglens = sc.textFile(myfile).
    flatMap(line => line.split("\\W")).
    map(word => (word(0), word.length)).
    groupByKey().
    map(pair => (pair._1, pair._2.sum / pair._2.size.toDouble))

avglens.saveAsTextFile("avglen-output")

[Diagram: each Stage 1 task processes its local block and writes shuffle data on its executor]
Spark Task Execution (4)

val avglens = sc.textFile(myfile).
    flatMap(line => line.split("\\W")).
    map(word => (word(0), word.length)).
    groupByKey().
    map(pair => (pair._1, pair._2.sum / pair._2.size.toDouble))

avglens.saveAsTextFile("avglen-output")

[Diagram: Stage 1 is complete; shuffle data remains on each executor, and Stage 2's tasks are still queued]
Spark Task Execution (5)

val avglens = sc.textFile(myfile).
    flatMap(line => line.split("\\W")).
    map(word => (word(0), word.length)).
    groupByKey().
    map(pair => (pair._1, pair._2.sum / pair._2.size.toDouble))

avglens.saveAsTextFile("avglen-output")

[Diagram: the driver dispatches Tasks 5 and 6 to two executors, which read the shuffle data from all four executors]
Spark Task Execution (6)

val avglens = sc.textFile(myfile).
    flatMap(line => line.split("\\W")).
    map(word => (word(0), word.length)).
    groupByKey().
    map(pair => (pair._1, pair._2.sum / pair._2.size.toDouble))

avglens.saveAsTextFile("avglen-output")

[Diagram: Tasks 5 and 6 write their results to HDFS as part-00000 and part-00001]
Spark Task Execution (alternate ending)

val avglens = sc.textFile(myfile).
    flatMap(line => line.split("\\W")).
    map(word => (word(0), word.length)).
    groupByKey().
    map(pair => (pair._1, pair._2.sum / pair._2.size.toDouble))

avglens.collect()

[Diagram: because the action is collect rather than saveAsTextFile, Tasks 5 and 6 return their results to the driver program instead of writing to HDFS]
Controlling the Level of Parallelism

- Wide operations (e.g., reduceByKey) partition result RDDs
- More partitions = more parallel tasks
- Cluster will be under-utilized if there are too few partitions
- You can control how many partitions:
  - Configure with the spark.default.parallelism property:
      spark.default.parallelism 10
  - Optional numPartitions parameter in the function call:
      > words.reduceByKey(lambda v1, v2: v1 + v2, 15)
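What the numPartitions argument controls can be modeled locally: keys are hash-partitioned into that many buckets, and each bucket is reduced independently. A plain-Python sketch (Spark's partitioner also hashes keys, though not with this exact function; the helper names are hypothetical):

```python
def hash_partition(pairs, num_partitions):
    """Assign each (key, value) pair to a partition by hashing its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

def reduce_by_key(pairs, func, num_partitions):
    """Reduce values per key within each hash partition, as a wide
    operation would after its shuffle."""
    result = []
    for part in hash_partition(pairs, num_partitions):
        acc = {}
        for key, value in part:
            acc[key] = func(acc[key], value) if key in acc else value
        result.extend(acc.items())
    return result

counts = reduce_by_key([("a", 1), ("b", 1), ("a", 1)], lambda x, y: x + y, 4)
```

Because all pairs with the same key hash to the same partition, per-key results are complete even though the partitions are reduced independently.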
Viewing Stages in the Spark Application UI (1)

- You can view jobs and stages in the Spark Application UI
- Jobs are identified by the action that triggered the job execution

[Screenshot: the Jobs page of the Spark Application UI]
Viewing Stages in the Spark Application UI (2)

- Select the job to view its execution stages
- Stages are identified by the last operation in the stage
- Number of tasks = number of partitions
- Data shuffled between stages is shown

[Screenshot: the Stages page of the Spark Application UI]
Chapter Topics (section: Conclusion)
Essential Points

- RDDs are stored in the memory of Spark executor JVMs
- Data is split into partitions, each partition in a separate executor
- RDD operations are executed on partitions in parallel
- Operations that depend on the same partition are pipelined together in stages, e.g., map, filter
- Operations that depend on multiple partitions are executed in separate stages, e.g., join, reduceByKey
Chapter Topics (section: Homework: View Jobs and Stages in the Spark Application UI)
Homework: View Jobs and Stages in the Spark Application UI

In this homework assignment, you will:
- Use the Spark Application UI to view how jobs, stages, and tasks are executed in a job
- Refer to the Homework description for details
