
Anatomy of Data Frame API

A deep dive into the Spark Data Frame API
https://github.com/phatak-dev/anatomy_of_spark_dataframe_api

Madhukara Phatak
Big data consultant and trainer at datamantra.io
Consults in Hadoop, Spark and Scala
www.madhukaraphatak.com

Agenda

Spark SQL library


Dataframe abstraction
Pig/Hive pipeline vs Spark SQL
Logical plan
Optimizer
Different steps in Query analysis

Spark SQL library


Data source API
  Universal API for loading/saving structured data
DataFrame API
  Higher-level representation for structured data
SQL interpreter and optimizer
  Express data transformations in SQL
SQL service
  Hive thrift server

Architecture of Spark SQL

[Architecture diagram: the Dataframe DSL and the Spark SQL / HQL front ends sit on top of the Data Frame API, which in turn builds on the Data Source API (CSV, JSON, JDBC, ...)]

DataFrame API
Single abstraction for representing structured data in Spark
DataFrame = RDD + Schema (aka SchemaRDD)
All data source APIs return a DataFrame
Introduced in 1.3
Inspired by R and Python pandas
.rdd converts to the RDD representation, resulting in an RDD[Row] (see the sketch below)
Support for a DataFrame DSL in Spark
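
A minimal sketch in Scala of loading data through the data source API and dropping back to the RDD representation. It assumes Spark 1.4+, an existing SQLContext named sqlContext, and an illustrative sales.json input file:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

val df = sqlContext.read.json("sales.json")   // data source API returns a DataFrame
df.printSchema()                              // DataFrame = RDD + Schema
val rows: RDD[Row] = df.rdd                   // .rdd gives back the RDD[Row] representation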

Need for a new abstraction

Single abstraction for structured data
Ability to combine data from multiple sources
Uniform access from all the different language APIs
Ability to support multiple DSLs
Familiar interface for data scientists
  Same API as R/pandas
  Easy to convert from an R local data frame to a Spark DataFrame
  The new 1.4 SparkR is built around it

Data structure of the structured world

DataFrame is a data structure for representing structured data, whereas RDD is a data structure for unstructured data
Having a single data structure allows us to build multiple DSLs targeting different developers
All DSLs use the same optimizer and code generator underneath
Compare with Hadoop Pig and Hive

Pig and Hive pipeline

[Diagram: two parallel, independent pipelines]

HiveQL:    Hive queries → Hive parser → Logical Plan → Optimizer → Optimized Logical Plan (M/R plan) → Physical Plan → Executor

Pig Latin: Pig Latin script → Pig parser → Logical Plan → Optimizer → Optimized Logical Plan (M/R plan) → Physical Plan → Executor

Issues with the Pig and Hive flow

Pig and Hive share many similar steps but are independent of each other
Each project implements its own optimizer and executor, which prevents one from benefiting from the other's work
There is no common data structure on which we can build both the Pig and Hive dialects
The optimizer is not flexible enough to accommodate multiple DSLs
A lot of duplicated effort and poor interoperability

Spark SQL pipeline

[Diagram: all front ends converge on a single back end. The Dataframe DSL, Spark SQL queries (via the SparkSQL parser) and Hive queries (HiveQL, via the Hive parser) all produce a DataFrame; the DataFrame is optimized by Catalyst, which generates Spark RDD code.]

Spark SQL flow

Multiple DSLs share the same optimizer and executor
All DSLs ultimately generate DataFrames
Catalyst is a new optimizer, built from the ground up for Spark, which is a rule-based framework
Catalyst allows developers to plug in custom rules specific to their DSL
You can plug in your own DSL too!!

What is a data frame?

A DataFrame is a container for a logical plan
A logical plan is a tree which represents the data and its schema
Every transformation is represented as a tree manipulation
These trees are manipulated and optimized by Catalyst rules
The logical plan is converted to a physical plan for execution

Explain Command
The explain command on a DataFrame allows us to look at these plans (see the sketch below)
There are three types of logical plans
  Parsed logical plan
  Analysed logical plan
  Optimized logical plan
Explain also shows the physical plan
DataFrameExample.scala
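
A hedged sketch of what DataFrameExample.scala presumably demonstrates (df is the DataFrame loaded earlier):

// extended = true prints the parsed, analysed and optimized logical plans,
// followed by the physical plan
df.explain(true)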

Filter example
In the last example, all the plans looked the same as there were no DataFrame operations
In this example, we are going to apply two filters on the DataFrame (see the sketch below)
Observe the generated optimized plan
Example : FilterExampleTree.scala
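
A sketch of the two-filter example, assuming columns named c1 and c2 as in the plan trees shown later; FilterExampleTree.scala may differ in detail:

val filtered = df.filter("c1 != 0").filter("c2 != 0")   // two stacked filters
filtered.explain(true)   // the optimized plan combines both predicates with &&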

Optimized Plan
The optimized plan is where Spark plugs in its set of optimization rules
In our example, when multiple filters are added, Spark combines them with && for better performance
Developers can even plug their own rules into the optimizer

Accessing Plan trees

Every DataFrame carries a queryExecution object which allows us to access these plans individually
We can access the plans as follows
  Parsed plan - queryExecution.logical
  Analysed plan - queryExecution.analyzed
  Optimized plan - queryExecution.optimizedPlan
numberedTreeString on a plan lets us see the tree hierarchy (see the sketch below)
Example : FilterExampleTree.scala
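
A minimal sketch of accessing the individual plans, using the filtered DataFrame from the earlier sketch:

val qe = filtered.queryExecution
println(qe.logical.numberedTreeString)        // parsed plan
println(qe.analyzed.numberedTreeString)       // analysed plan
println(qe.optimizedPlan.numberedTreeString)  // optimized plan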

Filter tree representation

Analysed plan (two stacked filters):

00 Filter NOT (CAST(c2#1, DoubleType) = CAST(0, DoubleType))
01  Filter NOT (CAST(c1#0, DoubleType) = CAST(0, DoubleType))
02   LogicalRDD [c1#0,c2#1,c3#2,c4#3]

Optimized plan (filters merged with &&):

00 Filter (NOT (CAST(c1#0, DoubleType) = 0.0) && NOT (CAST(c2#1, DoubleType) = 0.0))
01  LogicalRDD [c1#0,c2#1,c3#2,c4#3]

Manipulating Trees
Every optimization in Spark SQL is implemented as a tree (logical plan) transformation
A series of these transformations makes for a modular optimizer
All tree manipulations are done using Scala case classes
As developers, we can write these manipulations too
Let's create an OR filter rather than an AND (a sketch follows)
OrFilter.scala
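
A minimal sketch of such a rule, assuming 1.4-era Catalyst APIs; the actual OrFilter.scala may be written differently. It pattern matches two stacked Filter case classes and rebuilds them as a single Filter whose conditions are OR-ed instead of AND-ed:

import org.apache.spark.sql.catalyst.expressions.Or
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

object OrFilter extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // Filter is a case class, so matching and rebuilding the tree is trivial
    case Filter(outerCondition, Filter(innerCondition, child)) =>
      Filter(Or(outerCondition, innerCondition), child)
  }
}

// e.g. OrFilter(filtered.queryExecution.analyzed) rewrites the two stacked filters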

Understanding the steps in a plan

The logical plan goes through a series of rules to resolve and optimize it
Each step is a tree manipulation of the kind we have seen before
We can apply the rules one by one to see how a given plan evolves over time
This understanding allows us to see how to tweak a given query for better performance
Ex : StepsInQueryPlanning.scala

Query

select a.customerId from (
  select customerId, amountPaid as amount
  from sales where 1 = '1') a
where amount = 500.0
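
A sketch of how this query might be set up and run (StepsInQueryPlanning.scala may differ; sales.json is an assumed input file):

val sales = sqlContext.read.json("sales.json")
sales.registerTempTable("sales")          // makes the relation resolvable by name

val result = sqlContext.sql(
  """select a.customerId from (
    |  select customerId, amountPaid as amount from sales where 1 = '1') a
    |where amount = 500.0""".stripMargin)

result.queryExecution                     // holds the plans examined in the following slides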

Parsed Plan
This is the plan generated after parsing the DSL
Normally these plans are generated by the specific parsers, like the HiveQL parser, the Dataframe DSL parser, etc.
Usually they recognize the different transformations and represent them as tree nodes
It is a straightforward translation without much tweaking
This plan is fed to the analyser to generate the analysed plan

Parsed Logical Plan

'Project a.customerId
 'Filter (amount = 500)
  'SubQuery a
   'Projection 'customerId, 'amountPaid
    'Filter (1 = 1)
     UnResolvedRelation Sales

Analyzed plan
We use sqlContext.analyzer to access the rules that generate the analyzed plan
These rules have to be run in sequence to resolve the different entities in the logical plan
The different entities to be resolved are
  Relations (aka tables)
  References, e.g. subqueries, aliases etc.
  Data type casting

ResolveRelations Rule
This rule resolves all the relations (tables) specified in the plan
Whenever it finds an unresolved relation, it consults the catalog (aka the registerTempTable list)
Once it finds the relation, it replaces the unresolved relation with the actual one

Resolved Relation Logical Plan

Before (unresolved relation):

'Project a.customerId
 'Filter (amount = 500)
  'SubQuery a
   'Projection 'customerId, 'amountPaid
    'Filter (1 = 1)
     UnResolvedRelation Sales

After (relation resolved to the JSON data source):

'Project a.customerId
 'Filter (amount = 500)
  'SubQuery a
   'Projection 'customerId, 'amountPaid
    Filter (1 = 1)
     SubQuery - sales
      JsonRelation Sales[amountPaid..]

ResolveReferences
This rule resolves all the references in the plan
All aliases and column names get a unique number, which allows them to be located irrespective of their position in the plan
This unique numbering allows subqueries to be removed later for better optimization

Resolved References Plan

Before:

'Project a.customerId
 'Filter (amount = 500)
  'SubQuery a
   'Projection 'customerId, 'amountPaid
    'Filter (1 = 1)
     SubQuery - sales
      JsonRelation Sales[amountPaid..]

After (every reference gets a unique number):

Project customerId#1L
 Filter (amount#4 = 500)
  SubQuery a
   Projection customerId#1L, amountPaid#0
    'Filter (1 = 1)
     SubQuery - sales
      JsonRelation Sales[amountPaid#0..]

PromoteString
This rule allows the analyser to promote strings to the right data types
In our query, in Filter(1 = '1'), we are comparing a double with a string
This rule inserts a cast from string to double to get the right semantics

Promote String Plan

Before:

Project customerId#1L
 Filter (amount#4 = 500)
  SubQuery a
   Projection customerId#1L, amountPaid#0
    'Filter (1 = 1)
     SubQuery - sales
      JsonRelation Sales[amountPaid#0..]

After (the string is cast to a double):

Project customerId#1L
 Filter (amount#4 = 500)
  SubQuery a
   Projection customerId#1L, amountPaid#0
    'Filter (1 = CAST(1, DoubleType))
     SubQuery - sales
      JsonRelation Sales[amountPaid#0..]

Optimize

Eliminate Subqueries
This rule eliminates superfluous subqueries from the plan
This is possible because we now have a unique identifier for each of the references
Removing subqueries allows more advanced optimizations in the subsequent steps

Eliminate subqueries

Before:

Project customerId#1L
 Filter (amount#4 = 500)
  SubQuery a
   Projection customerId#1L, amountPaid#0
    'Filter (1 = CAST(1, DoubleType))
     SubQuery - sales
      JsonRelation Sales[amountPaid#0..]

After (both subqueries removed):

Project customerId#1L
 Filter (amount#4 = 500)
  Projection customerId#1L, amountPaid#0
   'Filter (1 = CAST(1, DoubleType))
    JsonRelation Sales[amountPaid#0..]

Constant Folding
Simplifies expressions which evaluate to constant values
In our plan, Filter(1 = CAST(1, DoubleType)) always evaluates to true
So constant folding replaces the expression with true

ConstantFoldingPlan

Before:

Project customerId#1L
 Filter (amount#4 = 500)
  Projection customerId#1L, amountPaid#0
   'Filter (1 = CAST(1, DoubleType))
    JsonRelation Sales[amountPaid#0..]

After (the constant expression folded to true):

Project customerId#1L
 Filter (amount#4 = 500)
  Projection customerId#1L, amountPaid#0
   'Filter True
    JsonRelation Sales[amountPaid#0..]

Simplify Filters
This rule simplifies filters by
  removing always-true filters
  removing the entire plan subtree if the filter is always false
In our query, the true filter will be removed
By simplifying filters, we avoid multiple iterations over the data

Simplify Filter Plan

Before:

Project customerId#1L
 Filter (amount#4 = 500)
  Projection customerId#1L, amountPaid#0
   'Filter True
    JsonRelation Sales[amountPaid#0..]

After (the always-true filter removed):

Project customerId#1L
 Filter (amount#4 = 500)
  Projection customerId#1L, amountPaid#0
   JsonRelation Sales[amountPaid#0..]

PushPredicateThroughFilter
It is always good to have filters near the data source for better optimization
This rule pushes the filter down, closer to the JsonRelation
When we rearrange the tree nodes, we need to make sure the rewritten predicate matches the aliases
In our example, the filter is rewritten to use the column amountPaid rather than the alias amount

PushPredicateThroughFilter Plan

Before:

Project customerId#1L
 Filter (amount#4 = 500)
  Projection customerId#1L, amountPaid#0
   JsonRelation Sales[amountPaid#0..]

After (the filter pushed below the projection, rewritten on amountPaid):

Project customerId#1L
 Projection customerId#1L, amountPaid#0
  Filter (amountPaid#0 = 500)
   JsonRelation Sales[amountPaid#0..]

Project Collapsing
Removes unnecessary projections from the plan
In our plan, we don't need the second projection (customerId, amountPaid), as the query only requires one projection, customerId
So we can get rid of the second projection
This gives us the most optimized plan

Project Collapsing Plan

Before:

Project customerId#1L
 Projection customerId#1L, amountPaid#0
  Filter (amountPaid#0 = 500)
   JsonRelation Sales[amountPaid#0..]

After (the two projections collapsed into one):

Project customerId#1L
 Filter (amountPaid#0 = 500)
  JsonRelation Sales[amountPaid#0..]

Generating Physical Plan

Catalyst can take a logical plan and turn it into a physical plan, or Spark plan
On queryExecution, we have a plan called executedPlan which gives us the physical plan
On the physical plan, we can call executeCollect or executeTake to start evaluating the plan (a sketch follows)
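
A minimal sketch, using the result DataFrame from the earlier query; executeCollect/executeTake are developer-facing APIs, so details may vary across Spark versions:

val physicalPlan = result.queryExecution.executedPlan
println(physicalPlan)                      // the SparkPlan that will actually run
val rows = physicalPlan.executeCollect()   // evaluates the plan and returns the rows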

References
https://www.youtube.com/watch?v=GQSNJAzxOr8
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
http://spark.apache.org/sql/
