Anatomy of Spark DataFrame API
A deep dive into the Spark DataFrame API
https://github.com/phatak-dev/anatomy_of_spark_dataframe_api
Madhukara Phatak
Big data consultant and trainer at datamantra.io
Consults in Hadoop, Spark and Scala
www.madhukaraphatak.com
Agenda
Dataframe DSL
CSV
JSON
JDBC
DataFrame API
Single abstraction for representing structured data in Spark
DataFrame = RDD + Schema (aka SchemaRDD)
All data source APIs return a DataFrame
Introduced in Spark 1.3
Inspired by R data frames and Python pandas
Use .rdd to convert to the RDD representation, resulting in an RDD[Row]
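A minimal sketch of these points using the Spark 1.3-era API (the file name sales.json and the local master setting are illustrative assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("df-demo").setMaster("local[2]"))
val sqlContext = new SQLContext(sc)

// Data source APIs return a DataFrame (here: JSON, hypothetical input file)
val df = sqlContext.jsonFile("sales.json")

// DataFrame = RDD + Schema
df.printSchema()

// Drop back to the untyped representation: an RDD[Row]
val rows = df.rdd
```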
Support for DataFrame DSL in Spark
(Diagram: parallel compilation pipelines, Pig Latin / HiveQL on one side, SparkQL on the other)
HiveQL: Hive queries → Hive parser → Logical Plan → Optimizer → Optimized Logical Plan (M/R plan) → Executor
SparkQL: Spark SQL queries → SparkSQL Parser (or DataFrame → Catalyst) → Logical Plan → Optimizer → Optimized Logical Plan → Physical Plan → Spark RDD code
Explain Command
The explain command on a DataFrame lets us look at these plans
There are three types of logical plans:
Parsed logical plan
Analyzed logical plan
Optimized logical plan
Explain also shows the physical plan
DataFrameExample.scala
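A sketch of how those plans can be pulled out of a DataFrame df (extended explain and the queryExecution accessor are both part of the 1.3 DataFrame API):

```scala
// explain(true) prints the logical plans plus the physical plan
df.explain(true)

// The same plans are available programmatically:
val qe = df.queryExecution
println(qe.logical)       // parsed logical plan
println(qe.analyzed)      // analyzed logical plan
println(qe.optimizedPlan) // optimized logical plan
println(qe.executedPlan)  // physical plan
```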
Filter example
In the last example, all plans looked the same, as there were no DataFrame operations
In this example, we are going to apply two filters on the DataFrame
Observe the generated optimized plan
Example : FilterExampleTree.scala
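The two-filter case might be set up like this (the column names c1 and c2 are illustrative, not taken from the actual example file):

```scala
// Apply two separate filters on the same DataFrame
val filtered = df.filter(df("c1") > 10).filter(df("c2") < 100)

// In the optimized plan the two predicates end up combined
// into a single Filter node with &&
println(filtered.queryExecution.optimizedPlan)
```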
Optimized Plan
The optimized plan is where Spark plugs in its set of optimization rules
In our example, when multiple filters are added, Spark combines them with && for better performance
Developers can also plug their own rules into the optimizer
02 LogicalRDD [c1#0,c2#1,c3#2,c4#3]
Manipulating Trees
Every optimization in Spark SQL is implemented as a tree, or logical plan, transformation
A series of these transformations allows for a modular optimizer
All tree manipulations are done using Scala case classes
As developers, we can write these manipulations too
Let's create an OR filter rather than an AND
OrFilter.scala
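A sketch of what such a rule could look like; Filter, And and Or are Catalyst case classes, and this mirrors the idea of OrFilter.scala rather than its exact code:

```scala
import org.apache.spark.sql.catalyst.expressions.{And, Or}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Pattern-match on the plan tree using case classes: wherever two
// predicates were combined with And, rewrite them to use Or instead
object OrFilterRule extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(And(left, right), child) => Filter(Or(left, right), child)
  }
}
```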
Query

select a.customerId from (
  select customerId, amountPaid as amount
  from sales where 1 = '1') a
where amount = 500.0
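One way to run this query, assuming the sales DataFrame has been loaded and registered as a temp table:

```scala
// Make the DataFrame visible to SQL under the name "sales"
salesDf.registerTempTable("sales")

val result = sqlContext.sql(
  """select a.customerId from (
    |  select customerId, amountPaid as amount
    |  from sales where 1 = '1') a
    |where amount = 500.0""".stripMargin)
```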
Parsed Plan
This is the plan generated after parsing the DSL
Normally these plans are generated by format-specific parsers, like the HiveQL parser, the DataFrame DSL parser, etc.
They recognize the different transformations and represent them as tree nodes
It is a straightforward translation without much tweaking
This plan is then fed to the analyzer to generate the analyzed plan
'SubQuery a
  'Projection ['customerId, 'amountPaid]
    'Filter (1 = 1)
      'UnresolvedRelation Sales
Analyzed plan
We use sqlContext.analyzer to access the rules that generate the analyzed plan
These rules have to be run in sequence to resolve the different entities in the logical plan
The entities to be resolved are:
Relations (aka tables)
References, e.g. subqueries, aliases, etc.
Data type casting
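The effect of these rule batches can be inspected on the query's execution object (assuming result holds the DataFrame for the query in this deck; the analyzer itself is package-private to org.apache.spark.sql, so going through queryExecution is the simplest route):

```scala
// Parsed plan: a straight translation of the SQL text
println(result.queryExecution.logical)

// Analyzed plan: after relation resolution, reference resolution
// and type promotion have been applied in sequence
println(result.queryExecution.analyzed)
```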
ResolveRelations Rule
This rule resolves all the relations (tables) specified in the plan
Whenever it finds an unresolved relation, it consults the catalog (the registerTempTable list)
Once it finds the relation, it resolves it to the actual relation
Before:
'Project ['a.customerId]
  'Filter ('amount = 500)
    'SubQuery a
      'Projection ['customerId, 'amountPaid]
        'Filter (1 = 1)
          'UnresolvedRelation Sales

After:
'Project ['a.customerId]
  'Filter ('amount = 500)
    'SubQuery a
      'Projection ['customerId, 'amountPaid]
        'Filter (1 = 1)
          SubQuery sales
            JsonRelation Sales[amountPaid..]
ResolveReferences
This rule resolves all the references in the plan
All aliases and column names get a unique number, which allows the analyzer to locate them irrespective of their position
This unique numbering allows subqueries to be removed later, for better optimization
Before:
'Project ['a.customerId]
  'Filter ('amount = 500)
    'SubQuery a
      'Projection ['customerId, 'amountPaid]
        'Filter (1 = 1)
          SubQuery sales
            JsonRelation Sales[amountPaid..]

After:
Project [customerId#1L]
  Filter (amount#4 = 500)
    SubQuery a
      Projection [customerId#1L, amountPaid#0]
        'Filter (1 = 1)
          SubQuery sales
            JsonRelation Sales[amountPaid#0..]
PromoteString
This rule allows the analyzer to promote strings to the right data types
In our query, Filter(1 = '1') compares a double with a string
This rule inserts a cast from string to double to get the right semantics
Before:
Project [customerId#1L]
  Filter (amount#4 = 500)
    SubQuery a
      Projection [customerId#1L, amountPaid#0]
        'Filter (1 = 1)
          SubQuery sales
            JsonRelation Sales[amountPaid#0..]

After:
Project [customerId#1L]
  Filter (amount#4 = 500)
    SubQuery a
      Projection [customerId#1L, amountPaid#0]
        'Filter (1 = CAST(1, DoubleType))
          SubQuery sales
            JsonRelation Sales[amountPaid#0..]
Optimize
Eliminate Subqueries
This rule allows the optimizer to eliminate superfluous subqueries
This is possible because we have a unique identifier for each of the references
Removing subqueries allows us to do advanced optimizations in subsequent steps
Eliminate subqueries

Before:
Project [customerId#1L]
  Filter (amount#4 = 500)
    SubQuery a
      Projection [customerId#1L, amountPaid#0]
        'Filter (1 = CAST(1, DoubleType))
          SubQuery sales
            JsonRelation Sales[amountPaid#0..]

After:
Project [customerId#1L]
  Filter (amount#4 = 500)
    Projection [customerId#1L, amountPaid#0]
      'Filter (1 = CAST(1, DoubleType))
        JsonRelation Sales[amountPaid#0..]
Constant Folding
Simplifies expressions which evaluate to constant values
In our plan, Filter(1 = 1) always evaluates to true
So constant folding replaces it with True
ConstantFoldingPlan

Before:
Project [customerId#1L]
  Filter (amount#4 = 500)
    Projection [customerId#1L, amountPaid#0]
      'Filter (1 = CAST(1, DoubleType))
        JsonRelation Sales[amountPaid#0..]

After:
Project [customerId#1L]
  Filter (amount#4 = 500)
    Projection [customerId#1L, amountPaid#0]
      'Filter True
        JsonRelation Sales[amountPaid#0..]
Simplify Filters
This rule simplifies filters by:
Removing always-true filters
Removing the entire plan subtree if the filter is always false
In our query, the True filter will be removed
By simplifying filters, we avoid multiple iterations over the data
Before:
Filter (amount#4 = 500)
  Projection [customerId#1L, amountPaid#0]
    'Filter True
      JsonRelation Sales[amountPaid#0..]

After:
Filter (amount#4 = 500)
  Projection [customerId#1L, amountPaid#0]
    JsonRelation Sales[amountPaid#0..]
PushPredicateThroughFilter
It is always good to have filters near the data source, for better optimization
This rule pushes the filters down, close to the JsonRelation
When we rearrange the tree nodes, we need to rewrite the predicate to match the aliases
In our example, the filter is rewritten to use the column amountPaid rather than the alias amount
PushPredicateThroughFilter Plan

Before:
Project [customerId#1L]
  Filter (amount#4 = 500)
    Projection [customerId#1L, amountPaid#0]
      JsonRelation Sales[amountPaid#0..]

After:
Project [customerId#1L]
  Projection [customerId#1L, amountPaid#0]
    Filter (amountPaid#0 = 500)
      JsonRelation Sales[amountPaid#0..]
Project Collapsing
Removes unnecessary projections from the plan
In our plan, we don't need the second projection (customerId, amountPaid), as we only require one projection, i.e. customerId
So we can get rid of the second projection
This gives us the most optimized plan
Before:
Project [customerId#1L]
  Projection [customerId#1L, amountPaid#0]
    Filter (amountPaid#0 = 500)
      JsonRelation Sales[amountPaid#0..]

After:
Project [customerId#1L]
  Filter (amountPaid#0 = 500)
    JsonRelation Sales[amountPaid#0..]
References
https://www.youtube.com/watch?v=GQSNJAzxOr8
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
http://spark.apache.org/sql/