
Anatomy of Data Frame API

A deep dive into the Spark Data Frame API
https://github.com/phatak-dev/anatomy_of_spark_dataframe_api

Madhukara Phatak
Big data consultant and trainer at datamantra.io
Consults in Hadoop, Spark and Scala
www.madhukaraphatak.com

Agenda

Spark SQL library


Dataframe abstraction
Pig/Hive pipeline vs Spark SQL
Logical plan
Optimizer
Different steps in Query analysis

Spark SQL library


Data source API
  Universal API for loading/saving structured data
DataFrame API
  Higher-level representation for structured data
SQL interpreter and optimizer
  Express data transformations in SQL
SQL service
  Hive thrift server

Architecture of Spark SQL

[Architecture diagram: the Dataframe DSL and the Spark SQL / HQL front ends sit on top of the Data Frame API, which in turn builds on the Data Source API (CSV, JSON, JDBC, ...)]

DataFrame API
Single abstraction for representing structured data in Spark
DataFrame = RDD + Schema (aka SchemaRDD)
All data source APIs return a DataFrame
Introduced in 1.3
Inspired by R and Python pandas
.rdd converts to the RDD representation, resulting in an RDD[Row] (see the sketch below)
Support for a DataFrame DSL in Spark
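
A minimal sketch in Scala of loading data through the data source API and dropping back to the RDD representation. It assumes Spark 1.4+, an existing SQLContext named sqlContext, and an illustrative sales.json input file:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

val df = sqlContext.read.json("sales.json")   // data source API returns a DataFrame
df.printSchema()                              // DataFrame = RDD + Schema
val rows: RDD[Row] = df.rdd                   // .rdd gives back the RDD[Row] representation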

Need for a new abstraction

Single abstraction for structured data
Ability to combine data from multiple sources
Uniform access from all the different language APIs
Ability to support multiple DSLs
Familiar interface for data scientists
  Same API as R/pandas
  Easy to convert from an R local data frame to a Spark DataFrame
  The new 1.4 SparkR is built around it

Data structure of the structured world

DataFrame is a data structure for representing structured data, whereas RDD is a data structure for unstructured data
Having a single data structure allows us to build multiple DSLs targeting different developers
All DSLs use the same optimizer and code generator underneath
Compare with Hadoop Pig and Hive

Pig and Hive pipeline

[Diagram: two parallel, independent pipelines]

HiveQL:    Hive queries → Hive parser → Logical Plan → Optimizer → Optimized Logical Plan (M/R plan) → Physical Plan → Executor

Pig Latin: Pig Latin script → Pig parser → Logical Plan → Optimizer → Optimized Logical Plan (M/R plan) → Physical Plan → Executor

Issues with the Pig and Hive flow

Pig and Hive share many similar steps but are independent of each other
Each project implements its own optimizer and executor, which prevents one from benefiting from the other's work
There is no common data structure on which we can build both the Pig and Hive dialects
The optimizer is not flexible enough to accommodate multiple DSLs
A lot of duplicated effort and poor interoperability

Spark SQL pipeline

[Diagram: all front ends converge on a single back end. The Dataframe DSL, Spark SQL queries (via the SparkSQL parser) and Hive queries (HiveQL, via the Hive parser) all produce a DataFrame; the DataFrame is optimized by Catalyst, which generates Spark RDD code.]

Spark SQL flow

Multiple DSLs share the same optimizer and executor
All DSLs ultimately generate DataFrames
Catalyst is a new optimizer, built from the ground up for Spark, which is a rule-based framework
Catalyst allows developers to plug in custom rules specific to their DSL
You can plug in your own DSL too!!

What is a data frame?

A DataFrame is a container for a logical plan
A logical plan is a tree which represents the data and its schema
Every transformation is represented as a tree manipulation
These trees are manipulated and optimized by Catalyst rules
The logical plan is converted to a physical plan for execution

Explain Command
The explain command on a DataFrame allows us to look at these plans (see the sketch below)
There are three types of logical plans
  Parsed logical plan
  Analysed logical plan
  Optimized logical plan
Explain also shows the physical plan
DataFrameExample.scala
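
A hedged sketch of what DataFrameExample.scala presumably demonstrates (df is the DataFrame loaded earlier):

// extended = true prints the parsed, analysed and optimized logical plans,
// followed by the physical plan
df.explain(true)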

Filter example
In the last example, all the plans looked the same as there were no DataFrame operations
In this example, we are going to apply two filters on the DataFrame (see the sketch below)
Observe the generated optimized plan
Example : FilterExampleTree.scala
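
A sketch of the two-filter example, assuming columns named c1 and c2 as in the plan trees shown later; FilterExampleTree.scala may differ in detail:

val filtered = df.filter("c1 != 0").filter("c2 != 0")   // two stacked filters
filtered.explain(true)   // the optimized plan combines both predicates with &&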

Optimized Plan
The optimized plan is where Spark plugs in its set of optimization rules
In our example, when multiple filters are added, Spark combines them with && for better performance
Developers can even plug their own rules into the optimizer

Accessing Plan trees

Every DataFrame carries a queryExecution object which allows us to access these plans individually
We can access the plans as follows
  Parsed plan - queryExecution.logical
  Analysed plan - queryExecution.analyzed
  Optimized plan - queryExecution.optimizedPlan
numberedTreeString on a plan lets us see the tree hierarchy (see the sketch below)
Example : FilterExampleTree.scala
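
A minimal sketch of accessing the individual plans, using the filtered DataFrame from the earlier sketch:

val qe = filtered.queryExecution
println(qe.logical.numberedTreeString)        // parsed plan
println(qe.analyzed.numberedTreeString)       // analysed plan
println(qe.optimizedPlan.numberedTreeString)  // optimized plan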

Filter tree representation

Analysed plan (two stacked filters):

00 Filter NOT (CAST(c2#1, DoubleType) = CAST(0, DoubleType))
01  Filter NOT (CAST(c1#0, DoubleType) = CAST(0, DoubleType))
02   LogicalRDD [c1#0,c2#1,c3#2,c4#3]

Optimized plan (filters merged with &&):

00 Filter (NOT (CAST(c1#0, DoubleType) = 0.0) && NOT (CAST(c2#1, DoubleType) = 0.0))
01  LogicalRDD [c1#0,c2#1,c3#2,c4#3]

Manipulating Trees
Every optimization in Spark SQL is implemented as a tree (logical plan) transformation
A series of these transformations makes for a modular optimizer
All tree manipulations are done using Scala case classes
As developers, we can write these manipulations too
Let's create an OR filter rather than an AND (a sketch follows)
OrFilter.scala
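
A minimal sketch of such a rule, assuming 1.4-era Catalyst APIs; the actual OrFilter.scala may be written differently. It pattern matches two stacked Filter case classes and rebuilds them as a single Filter whose conditions are OR-ed instead of AND-ed:

import org.apache.spark.sql.catalyst.expressions.Or
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

object OrFilter extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // Filter is a case class, so matching and rebuilding the tree is trivial
    case Filter(outerCondition, Filter(innerCondition, child)) =>
      Filter(Or(outerCondition, innerCondition), child)
  }
}

// e.g. OrFilter(filtered.queryExecution.analyzed) rewrites the two stacked filters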

Understanding the steps in a plan

The logical plan goes through a series of rules to resolve and optimize it
Each step is a tree manipulation of the kind we have seen before
We can apply the rules one by one to see how a given plan evolves over time
This understanding allows us to see how to tweak a given query for better performance
Ex : StepsInQueryPlanning.scala

Query

select a.customerId from (
  select customerId, amountPaid as amount
  from sales where 1 = '1') a
where amount = 500.0
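
A sketch of how this query might be set up and run (StepsInQueryPlanning.scala may differ; sales.json is an assumed input file):

val sales = sqlContext.read.json("sales.json")
sales.registerTempTable("sales")          // makes the relation resolvable by name

val result = sqlContext.sql(
  """select a.customerId from (
    |  select customerId, amountPaid as amount from sales where 1 = '1') a
    |where amount = 500.0""".stripMargin)

result.queryExecution                     // holds the plans examined in the following slides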

Parsed Plan
This is the plan generated after parsing the DSL
Normally these plans are generated by the specific parsers, like the HiveQL parser, the Dataframe DSL parser, etc.
Usually they recognize the different transformations and represent them as tree nodes
It is a straightforward translation without much tweaking
This plan is fed to the analyser to generate the analysed plan

Parsed Logical Plan

'Project a.customerId
 'Filter (amount = 500)
  'SubQuery a
   'Projection 'customerId, 'amountPaid
    'Filter (1 = 1)
     UnResolvedRelation Sales

Analyzed plan
We use sqlContext.analyzer to access the rules that generate the analyzed plan
These rules have to be run in sequence to resolve the different entities in the logical plan
The different entities to be resolved are
  Relations (aka tables)
  References, e.g. subqueries, aliases etc.
  Data type casting

ResolveRelations Rule
This rule resolves all the relations (tables) specified in the plan
Whenever it finds an unresolved relation, it consults the catalog (aka the registerTempTable list)
Once it finds the relation, it replaces the unresolved relation with the actual one

Resolved Relation Logical Plan

Before (unresolved relation):

'Project a.customerId
 'Filter (amount = 500)
  'SubQuery a
   'Projection 'customerId, 'amountPaid
    'Filter (1 = 1)
     UnResolvedRelation Sales

After (relation resolved to the JSON data source):

'Project a.customerId
 'Filter (amount = 500)
  'SubQuery a
   'Projection 'customerId, 'amountPaid
    Filter (1 = 1)
     SubQuery - sales
      JsonRelation Sales[amountPaid..]

ResolveReferences
This rule resolves all the references in the plan
All aliases and column names get a unique number, which allows them to be located irrespective of their position in the plan
This unique numbering allows subqueries to be removed later for better optimization

Resolved References Plan

Before:

'Project a.customerId
 'Filter (amount = 500)
  'SubQuery a
   'Projection 'customerId, 'amountPaid
    'Filter (1 = 1)
     SubQuery - sales
      JsonRelation Sales[amountPaid..]

After (every reference gets a unique number):

Project customerId#1L
 Filter (amount#4 = 500)
  SubQuery a
   Projection customerId#1L, amountPaid#0
    'Filter (1 = 1)
     SubQuery - sales
      JsonRelation Sales[amountPaid#0..]

PromoteString
This rule allows the analyser to promote strings to the right data types
In our query, in Filter(1 = '1'), we are comparing a double with a string
This rule inserts a cast from string to double to get the right semantics

Promote String Plan

Before:

Project customerId#1L
 Filter (amount#4 = 500)
  SubQuery a
   Projection customerId#1L, amountPaid#0
    'Filter (1 = 1)
     SubQuery - sales
      JsonRelation Sales[amountPaid#0..]

After (the string is cast to a double):

Project customerId#1L
 Filter (amount#4 = 500)
  SubQuery a
   Projection customerId#1L, amountPaid#0
    'Filter (1 = CAST(1, DoubleType))
     SubQuery - sales
      JsonRelation Sales[amountPaid#0..]

Optimize

Eliminate Subqueries
This rule eliminates superfluous subqueries from the plan
This is possible because we now have a unique identifier for each of the references
Removing subqueries allows more advanced optimizations in the subsequent steps

Eliminate subqueries

Before:

Project customerId#1L
 Filter (amount#4 = 500)
  SubQuery a
   Projection customerId#1L, amountPaid#0
    'Filter (1 = CAST(1, DoubleType))
     SubQuery - sales
      JsonRelation Sales[amountPaid#0..]

After (both subqueries removed):

Project customerId#1L
 Filter (amount#4 = 500)
  Projection customerId#1L, amountPaid#0
   'Filter (1 = CAST(1, DoubleType))
    JsonRelation Sales[amountPaid#0..]

Constant Folding
Simplifies expressions which evaluate to constant values
In our plan, Filter(1 = CAST(1, DoubleType)) always evaluates to true
So constant folding replaces the expression with true

ConstantFoldingPlan

Before:

Project customerId#1L
 Filter (amount#4 = 500)
  Projection customerId#1L, amountPaid#0
   'Filter (1 = CAST(1, DoubleType))
    JsonRelation Sales[amountPaid#0..]

After (the constant expression folded to true):

Project customerId#1L
 Filter (amount#4 = 500)
  Projection customerId#1L, amountPaid#0
   'Filter True
    JsonRelation Sales[amountPaid#0..]

Simplify Filters
This rule simplifies filters by
  removing always-true filters
  removing the entire plan subtree if the filter is always false
In our query, the true filter will be removed
By simplifying filters, we avoid multiple iterations over the data

Simplify Filter Plan

Before:

Project customerId#1L
 Filter (amount#4 = 500)
  Projection customerId#1L, amountPaid#0
   'Filter True
    JsonRelation Sales[amountPaid#0..]

After (the always-true filter removed):

Project customerId#1L
 Filter (amount#4 = 500)
  Projection customerId#1L, amountPaid#0
   JsonRelation Sales[amountPaid#0..]

PushPredicateThroughFilter
It is always good to have filters near the data source for better optimization
This rule pushes the filter down, closer to the JsonRelation
When we rearrange the tree nodes, we need to make sure the rewritten predicate matches the aliases
In our example, the filter is rewritten to use the column amountPaid rather than the alias amount

PushPredicateThroughFilter Plan

Before:

Project customerId#1L
 Filter (amount#4 = 500)
  Projection customerId#1L, amountPaid#0
   JsonRelation Sales[amountPaid#0..]

After (the filter pushed below the projection, rewritten on amountPaid):

Project customerId#1L
 Projection customerId#1L, amountPaid#0
  Filter (amountPaid#0 = 500)
   JsonRelation Sales[amountPaid#0..]

Project Collapsing
Removes unnecessary projections from the plan
In our plan, we don't need the second projection (customerId, amountPaid), as the query only requires one projection, customerId
So we can get rid of the second projection
This gives us the most optimized plan

Project Collapsing Plan

Before:

Project customerId#1L
 Projection customerId#1L, amountPaid#0
  Filter (amountPaid#0 = 500)
   JsonRelation Sales[amountPaid#0..]

After (the two projections collapsed into one):

Project customerId#1L
 Filter (amountPaid#0 = 500)
  JsonRelation Sales[amountPaid#0..]

Generating Physical Plan

Catalyst can take a logical plan and turn it into a physical plan, or Spark plan
On queryExecution, we have a plan called executedPlan which gives us the physical plan
On the physical plan, we can call executeCollect or executeTake to start evaluating the plan (a sketch follows)
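
A minimal sketch, using the result DataFrame from the earlier query; executeCollect/executeTake are developer-facing APIs, so details may vary across Spark versions:

val physicalPlan = result.queryExecution.executedPlan
println(physicalPlan)                      // the SparkPlan that will actually run
val rows = physicalPlan.executeCollect()   // evaluates the plan and returns the rows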

References
https://www.youtube.com/watch?v=GQSNJAzxOr8
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
http://spark.apache.org/sql/
