
Parallel Processing in Spark
Chapter 14
201509
Course Chapters

Course Introduction
1. Introduction

Introduction to Hadoop
2. Introduction to Hadoop and the Hadoop Ecosystem
3. Hadoop Architecture and HDFS

Importing and Modeling Structured Data
4. Importing Relational Data with Apache Sqoop
5. Introduction to Impala and Hive
6. Modeling and Managing Data with Impala and Hive
7. Data Formats
8. Data File Partitioning

Ingesting Streaming Data
9. Capturing Data with Apache Flume

Distributed Data Processing with Spark
10. Spark Basics
11. Working with RDDs in Spark
12. Aggregating Data with Pair RDDs
13. Writing and Deploying Spark Applications
14. Parallel Processing in Spark (this chapter)
15. Spark RDD Persistence
16. Common Patterns in Spark Data Processing
17. Spark SQL and DataFrames

Course Conclusion
18. Conclusion

Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera.
Parallel Programming with Spark

In this chapter you will learn:
- How RDDs are distributed across a cluster
- How Spark executes RDD operations in parallel
Chapter Topics

Distributed Data Processing with Spark: Parallel Processing in Spark

- Review: Spark on a Cluster (this section)
- RDD Partitions
- Partitioning of File-based RDDs
- HDFS and Data Locality
- Executing Parallel Operations
- Stages and Tasks
- Conclusion
- Homework: View Jobs and Stages in the Spark Application UI
Spark Cluster Review

$ spark-submit \
    --master yarn-client \
    --class MyClass \
    --num-executors 3 \
    MyApp.jar

[Diagram: worker (slave) nodes alongside the Cluster Master node and HDFS Master node]
Spark Cluster Review

$ spark-submit \
    --master yarn-client \
    --class MyClass \
    --num-executors 3 \
    MyApp.jar

[Diagram: the driver program, with its SparkContext, starts; containers are allocated on the worker nodes]
Spark Cluster Review

$ spark-submit \
    --master yarn-client \
    --class MyClass \
    --num-executors 3 \
    MyApp.jar

[Diagram: an executor launches in each of the three containers on the worker nodes]
Chapter Topics (section: RDD Partitions)
RDDs on a Cluster

- Resilient Distributed Datasets: data is partitioned across worker nodes
- Partitioning is done automatically by Spark
- Optionally, you can control how many partitions are created

[Diagram: RDD 1 split into partitions rdd_1_0, rdd_1_1, and rdd_1_2, each held in memory by a different executor]
Chapter Topics (section: Partitioning of File-based RDDs)
File Partitioning: Single Files

sc.textFile("myfile", 3)

- Partitions from single files are based on size
- You can optionally specify a minimum number of partitions: textFile(file, minPartitions)
- Default is 2
- More partitions = more parallelization

[Diagram: the file myfile split into three RDD partitions, one per executor]
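The size-based splitting described above can be approximated without a cluster. The sketch below is plain Python, not Spark's actual InputFormat logic (the function name split_into_partitions is hypothetical): it cuts a file's text into roughly equal byte ranges and extends each cut to the next line boundary, which is approximately how line records end up in partitions.

```python
def split_into_partitions(text, min_partitions=2):
    """Roughly mimic size-based file splitting: cut text into
    min_partitions byte ranges, extending each cut to a line boundary."""
    target = max(1, len(text) // min_partitions)
    partitions, start = [], 0
    while start < len(text):
        end = min(start + target, len(text))
        # Extend the split to the end of the current line
        newline = text.find("\n", end)
        end = len(text) if newline == -1 else newline + 1
        partitions.append(text[start:end])
        start = end
    return partitions

data = "line1\nline2\nline3\nline4\n"
parts = split_into_partitions(data, min_partitions=2)
```

No line is ever split across two partitions, and together the partitions reconstruct the whole file.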
File Partitioning: Multiple Files

sc.textFile("mydir/*")
- Each file becomes (at least) one partition
- File-based operations can be done per-partition, for example parsing XML

sc.wholeTextFiles("mydir")
- For many small files
- Creates a key-value PairRDD: key = file name, value = file contents

[Diagrams: in the first, file1 and file2 each become a partition of the RDD; in the second, the RDD holds (file name, file contents) pairs distributed across executors]
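The wholeTextFiles behavior can be sketched locally. The helper below (whole_text_files is a hypothetical name, not a Spark API) builds the same (file name, contents) pairs from a directory on the local filesystem:

```python
import os
import tempfile

def whole_text_files(dirpath):
    """Mimic sc.wholeTextFiles: one (file name, contents) pair per file."""
    pairs = []
    for name in sorted(os.listdir(dirpath)):
        path = os.path.join(dirpath, name)
        if os.path.isfile(path):
            with open(path) as f:
                pairs.append((name, f.read()))
    return pairs

# Build a throwaway directory with two small files, then read it back
with tempfile.TemporaryDirectory() as d:
    for name, body in [("a.txt", "alpha"), ("b.txt", "beta")]:
        with open(os.path.join(d, name), "w") as f:
            f.write(body)
    pairs = whole_text_files(d)
```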
Operating on Partitions

- Most RDD operations work on each element of an RDD
- A few work on each partition:
  - foreachPartition: call a function for each partition
  - mapPartitions: create a new RDD by executing a function on each partition in the current RDD
  - mapPartitionsWithIndex: same as mapPartitions, but includes the index of the partition
- Functions for partition operations take iterators
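The iterator-based contract can be illustrated in plain Python (a local sketch, not the Spark implementation; map_partitions and parse_with_setup are hypothetical names). A partition is modeled as a list, and the per-partition function receives an iterator and yields results, as a mapPartitions function would:

```python
def map_partitions(partitions, func):
    """Apply func (iterator -> iterator) to each partition independently."""
    return [list(func(iter(part))) for part in partitions]

def parse_with_setup(records):
    # Per-partition setup (e.g., building a parser) happens once here,
    # not once per element -- the main reason to use mapPartitions.
    prefix = ">> "
    for record in records:
        yield prefix + record

partitions = [["a", "b"], ["c"]]
result = map_partitions(partitions, parse_with_setup)
```

Each partition is processed by one call to the function, so any setup cost is paid once per partition rather than once per element.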
Chapter Topics (section: HDFS and Data Locality)
HDFS and Data Locality (1)

[Diagram-only slide]
HDFS and Data Locality (2)

$ hdfs dfs -put mydata

[Diagram: the file mydata stored in HDFS as Block 1, Block 2, and Block 3, each block on a different node]
HDFS and Data Locality (3)

[Diagram: the driver program (SparkContext) with executors running on the nodes that hold the HDFS blocks of mydata, plus one additional executor]
HDFS and Data Locality (4)

sc.textFile("hdfs://mydata").collect()

- By default, Spark partitions file-based RDDs by block
- Each block loads into a single partition

[Diagram: each HDFS block of mydata corresponds to one partition of the RDD]
HDFS and Data Locality (5)

sc.textFile("hdfs://mydata").collect()

- An action triggers execution: tasks on executors load data from blocks into partitions

[Diagram: a task on each executor loads its local HDFS block into an RDD partition]
HDFS and Data Locality (6)

sc.textFile("hdfs://mydata").collect()

- Data is distributed across executors until an action returns a value to the driver

[Diagram: partition data flows from the executors back to the driver program]
Chapter Topics (section: Executing Parallel Operations)
Parallel Operations on Partitions

- RDD operations are executed in parallel on each partition
- When possible, tasks execute on the worker nodes where the data is in memory
- Some operations preserve partitioning, e.g., map, flatMap, filter
- Some operations repartition, e.g., reduce, sort, group
Example: Average Word Length by Letter (1)

> avglens = sc.textFile(file)

[Diagram: an RDD created from the HDFS file mydata]
Example: Average Word Length by Letter (2)

> avglens = sc.textFile(file) \
      .flatMap(lambda line: line.split())

[Diagram: flatMap produces a second RDD from the first]
Example: Average Word Length by Letter (3)

> avglens = sc.textFile(file) \
      .flatMap(lambda line: line.split()) \
      .map(lambda word: (word[0], len(word)))

[Diagram: map produces a third RDD of (first letter, word length) pairs]
Example: Average Word Length by Letter (4)

> avglens = sc.textFile(file) \
      .flatMap(lambda line: line.split()) \
      .map(lambda word: (word[0], len(word))) \
      .groupByKey()

[Diagram: groupByKey repartitions the data into a new RDD, grouping all lengths for each letter]
Example: Average Word Length by Letter (5)

> avglens = sc.textFile(file) \
      .flatMap(lambda line: line.split()) \
      .map(lambda word: (word[0], len(word))) \
      .groupByKey() \
      .map(lambda (k, values): \
          (k, sum(values)/len(values)))

[Diagram: the final map produces an RDD of (letter, average word length) pairs]
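The same pipeline can be checked in plain Python (no Spark needed) to confirm what each step produces. This local sketch mirrors the flatMap / map / groupByKey / map chain above; the sample lines are made up for illustration:

```python
from collections import defaultdict

lines = ["the quick brown fox", "jumped over the lazy dog"]

# flatMap: one word per element
words = [w for line in lines for w in line.split()]

# map: (first letter, word length) pairs
pairs = [(w[0], len(w)) for w in words]

# groupByKey: collect all lengths per first letter
grouped = defaultdict(list)
for letter, length in pairs:
    grouped[letter].append(length)

# map: average word length per first letter
avglens = {k: sum(v) / len(v) for k, v in grouped.items()}
```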
Chapter Topics (section: Stages and Tasks)
Stages

- Operations that can run on the same partition are executed in stages
- Tasks within a stage are pipelined together
- Developers should be aware of stages to improve performance
Spark Execution: Stages (1)

> val avglens = sc.textFile(myfile).
      flatMap(line => line.split("\\W")).
      map(word => (word(0), word.length)).
      groupByKey().
      map(pair => (pair._1, pair._2.sum / pair._2.size.toDouble))

> avglens.saveAsTextFile("avglen-output")

[Diagram: Stage 1 spans the RDDs up to the shuffle; Stage 2 spans the shuffled RDD and the final map]
Spark Execution: Stages (2)

> val avglens = sc.textFile(myfile).
      flatMap(line => line.split("\\W")).
      map(word => (word(0), word.length)).
      groupByKey().
      map(pair => (pair._1, pair._2.sum / pair._2.size.toDouble))

> avglens.saveAsTextFile("avglen-output")

[Diagram: Stage 1 consists of Tasks 1-4, one per partition; Stage 2 consists of Tasks 5 and 6]
Spark Execution: Stages (3)

> val avglens = sc.textFile(myfile).
      flatMap(line => line.split("\\W")).
      map(word => (word(0), word.length)).
      groupByKey().
      map(pair => (pair._1, pair._2.sum / pair._2.size.toDouble))

> avglens.saveAsTextFile("avglen-output")

[Diagram: within each stage, every task pipelines that stage's operations over one partition]
Spark Execution: Stages (4)

> val avglens = sc.textFile(myfile).
      flatMap(line => line.split("\\W")).
      map(word => (word(0), word.length)).
      groupByKey().
      map(pair => (pair._1, pair._2.sum / pair._2.size.toDouble))

> avglens.saveAsTextFile("avglen-output")

[Diagram: Stage 2's tasks run after Stage 1 completes, reading the shuffled data]
Summary of Spark Terminology

- Job: a set of tasks executed as a result of an action
- Stage: a set of tasks in a job that can be executed in parallel
- Task: an individual unit of work sent to one executor
- Application: can contain any number of jobs managed by a single driver

[Diagram: a job composed of two stages; each stage contains tasks operating on RDD partitions]
How Spark Calculates Stages

- Spark constructs a DAG (Directed Acyclic Graph) of RDD dependencies
- Narrow dependencies:
  - Only one child depends on the RDD
  - No shuffle required between nodes
  - Can be collapsed into a single stage
  - e.g., map, filter, union
- Wide (or shuffle) dependencies:
  - Multiple children depend on the RDD
  - Defines a new stage
  - e.g., reduceByKey, join, groupByKey
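The stage boundary rule can be sketched as a tiny scheduler: walk the chain of operations and cut a new stage at each wide (shuffle) dependency. This is a deliberately simplified model (a linear chain rather than a full DAG, and the names WIDE_OPS and split_into_stages are hypothetical):

```python
WIDE_OPS = {"reduceByKey", "join", "groupByKey", "sortByKey"}

def split_into_stages(ops):
    """Group a linear chain of operations into stages, cutting before
    each wide (shuffle) dependency."""
    stages, current = [], []
    for op in ops:
        if op in WIDE_OPS and current:
            stages.append(current)   # close the stage at the shuffle
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

stages = split_into_stages(["textFile", "flatMap", "map", "groupByKey", "map"])
```

Applied to the average-word-length chain, this yields two stages, matching the groupByKey boundary shown on the earlier slides.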
Viewing the Stages Using toDebugString (Scala)

> val avglens = sc.textFile(myfile).
      flatMap(line => line.split("\\W")).
      map(word => (word(0), word.length)).
      groupByKey().
      map(pair => (pair._1, pair._2.sum / pair._2.size.toDouble))

> avglens.toDebugString()

(2) MappedRDD[5] at map at ...              <- Stage 2
 |  ShuffledRDD[4] at groupByKey at ...
 +-(4) MappedRDD[3] at map at ...           <- Stage 1
    |  FlatMappedRDD[2] at flatMap at ...
    |  myfile MappedRDD[1] at textFile at ...
    |  myfile HadoopRDD[0] at textFile at ...

Indents indicate stages (shuffle boundaries)
Viewing the Stages Using toDebugString (Python)

> avglens = sc.textFile(myfile) \
      .flatMap(lambda line: line.split()) \
      .map(lambda word: (word[0], len(word))) \
      .groupByKey() \
      .map(lambda (k, values): \
          (k, sum(values)/len(values)))

> print avglens.toDebugString()

(2) PythonRDD[13] at RDD at ...              <- Stage 2
 |  MappedRDD[12] at values at ...
 |  ShuffledRDD[11] at partitionBy at ...
 +-(4) PairwiseRDD[10] at groupByKey at ...  <- Stage 1
    |  PythonRDD[9] at groupByKey at ...
    |  myfile MappedRDD[7] at textFile at ...
    |  myfile HadoopRDD[6] at textFile at ...

Indents indicate stages (shuffle boundaries)
Spark Task Execution (1)

val avglens = sc.textFile(myfile).
    flatMap(line => line.split("\\W")).
    map(word => (word(0), word.length)).
    groupByKey().
    map(pair => (pair._1, pair._2.sum / pair._2.size.toDouble))

avglens.saveAsTextFile("avglen-output")

[Diagram: on the client, the driver program (SparkContext) holds the queued tasks, Stage 1 (Tasks 1-4) and Stage 2 (Tasks 5-6); four executors each run on a node holding one HDFS block of the input file]
Spark Task Execution (2)

val avglens = sc.textFile(myfile).
    flatMap(line => line.split("\\W")).
    map(word => (word(0), word.length)).
    groupByKey().
    map(pair => (pair._1, pair._2.sum / pair._2.size.toDouble))

avglens.saveAsTextFile("avglen-output")

[Diagram: the driver dispatches Tasks 1-4, one to the executor holding each HDFS block; Stage 2's tasks remain queued]
Spark Task Execution (3)

val avglens = sc.textFile(myfile).
    flatMap(line => line.split("\\W")).
    map(word => (word(0), word.length)).
    groupByKey().
    map(pair => (pair._1, pair._2.sum / pair._2.size.toDouble))

avglens.saveAsTextFile("avglen-output")

[Diagram: each Stage 1 task processes its local block and writes shuffle data on its executor]
Spark Task Execution (4)

val avglens = sc.textFile(myfile).
    flatMap(line => line.split("\\W")).
    map(word => (word(0), word.length)).
    groupByKey().
    map(pair => (pair._1, pair._2.sum / pair._2.size.toDouble))

avglens.saveAsTextFile("avglen-output")

[Diagram: Stage 1 is complete; shuffle data remains on each executor, and Stage 2's tasks are still queued]
Spark Task Execution (5)

val avglens = sc.textFile(myfile).
    flatMap(line => line.split("\\W")).
    map(word => (word(0), word.length)).
    groupByKey().
    map(pair => (pair._1, pair._2.sum / pair._2.size.toDouble))

avglens.saveAsTextFile("avglen-output")

[Diagram: the driver dispatches Tasks 5 and 6 to two executors, which read the shuffle data from all four executors]
Spark Task Execution (6)

val avglens = sc.textFile(myfile).
    flatMap(line => line.split("\\W")).
    map(word => (word(0), word.length)).
    groupByKey().
    map(pair => (pair._1, pair._2.sum / pair._2.size.toDouble))

avglens.saveAsTextFile("avglen-output")

[Diagram: Tasks 5 and 6 write their results to HDFS as part-00000 and part-00001]
Spark Task Execution (alternate ending)

val avglens = sc.textFile(myfile).
    flatMap(line => line.split("\\W")).
    map(word => (word(0), word.length)).
    groupByKey().
    map(pair => (pair._1, pair._2.sum / pair._2.size.toDouble))

avglens.collect()

[Diagram: because the action is collect rather than saveAsTextFile, Tasks 5 and 6 return their results to the driver program instead of writing to HDFS]
Controlling the Level of Parallelism

- Wide operations (e.g., reduceByKey) partition result RDDs
- More partitions = more parallel tasks
- Cluster will be under-utilized if there are too few partitions
- You can control how many partitions:
  - Configure with the spark.default.parallelism property:
      spark.default.parallelism 10
  - Optional numPartitions parameter in the function call:
      > words.reduceByKey(lambda v1, v2: v1 + v2, 15)
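What the numPartitions argument controls can be modeled locally: keys are hash-partitioned into that many buckets, and each bucket is reduced independently. A plain-Python sketch (Spark's partitioner also hashes keys, though not with this exact function; the helper names are hypothetical):

```python
def hash_partition(pairs, num_partitions):
    """Assign each (key, value) pair to a partition by hashing its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

def reduce_by_key(pairs, func, num_partitions):
    """Reduce values per key within each hash partition, as a wide
    operation would after its shuffle."""
    result = []
    for part in hash_partition(pairs, num_partitions):
        acc = {}
        for key, value in part:
            acc[key] = func(acc[key], value) if key in acc else value
        result.extend(acc.items())
    return result

counts = reduce_by_key([("a", 1), ("b", 1), ("a", 1)], lambda x, y: x + y, 4)
```

Because all pairs with the same key hash to the same partition, per-key results are complete even though the partitions are reduced independently.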
Viewing Stages in the Spark Application UI (1)

- You can view jobs and stages in the Spark Application UI
- Jobs are identified by the action that triggered the job execution

[Screenshot: the Jobs page of the Spark Application UI]
Viewing Stages in the Spark Application UI (2)

- Select the job to view its execution stages
- Stages are identified by the last operation in the stage
- Number of tasks = number of partitions
- Data shuffled between stages is shown

[Screenshot: the Stages page of the Spark Application UI]
Chapter Topics (section: Conclusion)
Essential Points

- RDDs are stored in the memory of Spark executor JVMs
- Data is split into partitions, each partition in a separate executor
- RDD operations are executed on partitions in parallel
- Operations that depend on the same partition are pipelined together in stages, e.g., map, filter
- Operations that depend on multiple partitions are executed in separate stages, e.g., join, reduceByKey
Chapter Topics (section: Homework: View Jobs and Stages in the Spark Application UI)
Homework: View Jobs and Stages in the Spark Application UI

In this homework assignment, you will:
- Use the Spark Application UI to view how jobs, stages, and tasks are executed in a job
- Refer to the Homework description for details
