don@drakeconsulting.com
@dondrake
Overview
Who am I?
ETL
Daily Fantasy Baseball
Spark
Data Flow
Extracting - web crawler
Parquet
DataFrames
Python or Scala?
Transforming - RDDs
Transforming - Moving Average
Who Am I?
Don Drake @dondrake
Currently: Principal Big Data Consultant @ Allstate
5 years consulting on Hadoop
Independent consultant for last 14 years
Previous clients:
Navteq / Nokia
Sprint
Mobile Meridian - Co-Founder
MailLaunder.com - my SaaS anti-spam service
cars.com
Tribune Media Services
Family Video
Museum of Science and Industry
ETL
Informally: Any repeatable programmed data movement
Extraction - Get data from another source
Oracle/PostgreSQL
CSV
Web crawler
Transform - Normalize formatting of phone #s, addresses
Create surrogate keys
Joining data sources
Aggregate
Load - Load into data warehouse
CSV
Sent to predictive model
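The three ETL stages above can be sketched in plain Python, without Spark. This is a minimal illustration only: the sample data, the function names (`extract`, `normalize_phone`, `transform`, `load`) and the 10-digit phone format are assumptions made for the example, not part of the talk.

```python
import csv
import io
import re

# Hypothetical sample input; a real job would read from a database, file, or crawler.
RAW_CSV = """name,phone
Alice,(312) 555-0100
Bob,312.555.0101
Alice,312-555-0100
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def normalize_phone(raw):
    """Transform helper: keep digits only, then format as NNN-NNN-NNNN."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 10:
        raise ValueError("expected a 10-digit phone number: %r" % raw)
    return "%s-%s-%s" % (digits[:3], digits[3:6], digits[6:])


def transform(rows):
    """Transform: normalize phones, drop duplicates, assign surrogate keys."""
    keys = {}  # natural key (name, phone) -> surrogate key
    out = []
    for row in rows:
        phone = normalize_phone(row["phone"])
        natural = (row["name"], phone)
        if natural not in keys:
            keys[natural] = len(keys) + 1
            out.append({"id": keys[natural], "name": row["name"], "phone": phone})
    return out


def load(rows):
    """Load: serialize the cleaned rows to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["id", "name", "phone"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


result = load(transform(extract(RAW_CSV)))
print(result)
```

The two "Alice" rows collapse into one record once their phone numbers are normalized, which is why normalization usually has to run before keys are assigned or sources are joined.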
Spark
Apache Spark is a fast and general-purpose engine for large-scale data
processing.
It provides high-level APIs in Java, Scala, and Python (REPLs for Scala
and Python)
Includes an advanced DAG execution engine that supports in-memory
computing
RDD - Resilient Distributed Dataset, the core construct of the framework
Includes a set of high-level tools including Spark SQL for SQL and
structured data processing, MLlib for machine learning, GraphX for graph
processing and Spark Streaming
Can run in a cluster (Hadoop (YARN), EC2, Mesos), Standalone, Local
Open source, with core committers from Databricks
Latest version is 1.3.1, which includes DataFrames
Started in 2009 (AMPLab) as research project, Apache project since 2013.
LOTS of momentum.
Daily Fantasy Baseball
Spark 101 - Execution
Driver - your program's main() method
Only 1 per application
Executors - do the distributed work
As many as your cluster can handle
You determine ahead of time how many
You determine the amount of RAM required
Spark 101 - RDD
RDD - Resilient Distributed Dataset
Can be created from Hadoop Input formats (text
file, sequence file, parquet file, HBase, etc.) OR by
transforming other RDDs.
RDDs have transformations, which return pointers to new RDDs,
and actions, which return results to the driver
RDDs can contain anything
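The difference between transformations and actions can be shown with a toy model in plain Python. This is not Spark's API or implementation: `ToyRDD` is a hypothetical class written for this sketch, and it runs on a single machine. It borrows the names `map`, `filter`, `collect` and `count` from the RDD API only to show that transformations hand back a new dataset without doing any work, while actions run the recorded chain and return a value.

```python
class ToyRDD:
    """Toy stand-in for an RDD: records transformations, runs them on demand."""

    def __init__(self, source, steps=()):
        self._source = source        # any re-iterable collection
        self._steps = tuple(steps)   # pending transformations, in order

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._source, self._steps + (("filter", pred),))

    def _run(self):
        """Apply every recorded step to the source, lazily, as an iterator."""
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: execute the chain and return a plain Python value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


numbers = ToyRDD(range(1, 11))
evens_squared = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)

print(evens_squared.collect())  # [4, 16, 36, 64, 100]
print(evens_squared.count())    # 5
print(numbers.count())          # 10, the original dataset is unchanged
```

Because each transformation builds a new object instead of modifying the old one, `numbers` can still be reused after `evens_squared` is derived from it.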