Readme

DATA SCIENCE USING SPARK:
AN INTRODUCTION
TOPICS COVERED
 Introduction to Spark
 Getting Started with Spark
 Programming in Spark
 Data Science with Spark
 What next?
2
DATA SCIENCE PROCESS
Exploratory
Data Analysis
Raw data is Data is Clean

collected processed
Data Machine
Learning
Real World Algorithms
Statistical
Models
Build Communicate
Data Product ----------------
visualizations Make Decisions
----------------Report
Findings
Source: Doing Data Science by Rachel Schutt & Cathy O’Neil
3
DATA SCIENCE & DATA MINING
Distinctions are blurred
Business
Data Science
Analytics
Knowledge Discovery
Domain
Knowledge
Data Mining
Visual Data
Mining
(Structured) (Unstructured) Big Data

Data Mining Data Mining Engineering
Natural Language
Processing
Text Mining Web Mining
Machine Database Management

Statistics
Learning Management Science
4
WHAT DO WE NEED TO SUPPORT DATA SCIENCE WORK?
Data Input /Output

 Ability to read data in multiple formats
 Ability to read data from multiple sources
 Ability to deal with Big Data (Volume, Velocity, Veracity and Variety)
Data Transformations
 Easy to describe and perform transformations on rows and columns of data
 Requires abstraction of data and a dataflow paradigm
Model Development
 Library of Data Science Algorithms
 Ability to import / export models from other sources
 Data Science pipelines / workflow Development
Analytics Applications Development
 Seamless integration with programming languages / IDEs
5
INTRODUCTION TO SPARK
SHARAN KALWANI
6
WHAT IS SPARK?
▪ A distributed computing platform designed to be
▪ Fast
▪ General Purpose
▪ A general engine that allows combination of multiple types of computations

▪ Batch
▪ Interactive
▪ Iterative
▪ SQL Queries
▪ Text Processing
▪ Machine learning
7
Fast/Speed
 Computations in memory
 Faster than MR even for disk computations
Generality
 Designed for a wide range of workloads
 Single Engine to combine batch, interactive,
iterative, streaming algorithms.
 Has rich high-level libraries and simple native
APIs in Java, Scala and Python.
 Reduces the management burden of
maintaining separate tools.
8
SPARK UNIFIED STACK
9
CLUSTER MANAGERS
▪ Can run on a variety of cluster managers
▪ Hadoop YARN - Yet Another Resource Negotiator is a cluster management technology
and one of the key features in Hadoop 2.
▪ Apache Mesos - abstracts CPU, memory, storage, and other compute resources away
from machines, enabling fault-tolerant and elastic distributed systems.
▪ Spark Standalone Scheduler – provides an easy way to get started on an empty set
of machines.
▪ Spark can leverage existing Hadoop infrastructure
10
SPARK HISTORY
▪ Started in 2009 as a research project in UC Berkeley RAD lab which became
AMP Lab.
▪ Spark researchers found that Hadoop MapReduce was inefficient for iterative and
interactive computing.
▪ Spark was designed from the beginning to be fast for interactive, iterative with
support for in-memory storage and fault-tolerance.
▪ Apart from UC Berkeley, Databricks, Yahoo! and Intel are major contributors.
▪ Spark was open sourced in March 2010 and transformed into Apache Foundation
project in June 2013.
11
SPARK VS HADOOP
Hadoop MapReduce
 Mostly suited for batch jobs
 Difficulty to program directly in MR
 Batch doesn’t compose well for large apps
 Specialized systems needed as a workaround
Spark
 Handles batch, interactive, and real-time
within a single framework
 Native integration with Java, Python, Scala
 Programming at a higher level of abstraction
 More general than MapReduce
CONFIDENTIAL AND PROPRIETARY 12

GETTING STARTED WITH SPARK
13
GETTING STARTED WITH SPARK …..NOT COVERED TODAY!
There are multiple ways of using Spark

▪ Certified Spark Distributions
▪ Datastax Enterprise (Cassandra + Spark)
▪ HortonWorks HDP
▪ MAPR
▪ Local/Standalone
▪ Databricks cloud
▪ Amazon AWS EC2
14
LOCAL MODE
▪ Install Java JDK 6/7 on MacOSX or Windows
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-
1880260.html
▪ Install Python 2.7 using Anaconda (only on Windows)

https://store.continuum.io/cshop/anaconda/
▪ Download Apache Spark from Databricks, unzip the downloaded file to a
convenient location
http://training.databricks.com/workshop/usb.zip
▪ Connect to the newly created spark-training directory

▪ Run the interactive Scala shell (REPL)
./spark/bin/spark-shell
val data = 1 to 1000
val distData = sc.parallelize(data)
val filteredData = distData.filter(s => s<25)
filteredData.collect()
15
DATABRICKS CLOUD
▪ A hosted data platform powered by Apache Spark
▪ Features
▪ Exploration and Visualization
▪ Managed Spark Clusters
▪ Production Pipelines
▪ Support for 3rd party apps (Tableau, Pentaho, Qlik View)
▪ Databricks Cloud Trail

http://databricks.com/registration
▪ Demo
16
DATABRICKS CLOUD
▪ Workspace
▪ Tables
▪ Clusters
17
DATABRICKS CLOUD
▪ Notebooks
▪ Python
▪ Scala
▪ SQL
▪ Visualizations
▪ Markup
▪ Comments
▪ Collaboration
18
DATABRICKS CLOUD
▪ Tables
▪ Hive tables
▪ SQL
▪ Dbfs
▪ S3
▪ CSV
▪ Databases
19
AMAZON EC2
▪ Launch a Linux instance on EC2 and setup EC2 Keys
20
AMAZON EC2
▪ Setup an EC2 pair from the AWS console
21
AMAZON EC2
▪ Spark binary ships with a spark-ec2 script to manage clusters on
EC2
▪ Launching Spark cluster on EC2
./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>
▪ Running Applications
./spark-ec2 -k <keypair> -i <key-file> login <cluster-name>
▪ Terminating a cluster
./spark-ec2 destroy <cluster-name>
▪ Accessing data in S3
s3n://<bucket>/path
22
PROGRAMMING IN SPARK
SHARAN KALWANI
23
Spark Cluster
• Mesos
• YARN
• Standalone
24
Scala – Scalable Language
▪ Scala is a multi-paradigm programming language with focus on the functional

programming paradigm.
▪ In functional programming functions are used and they use variables that are
immutable.
▪ Every operator, variable and function is an object.
▪ Scala generates bytecode that runs on the top of any JVM and can also use any of
the java libraries.
▪ Spark is completely written in Scala.
▪ Spark SQL, GraphX, Spark Streaming etc. are libraries written in Scala.
▪ Scala Crash Course by Holden Karau @databricks
lintool.github.io/SparkTutorial/slides/day1_Scala_crash_course.pdf
25
Spark Model
Write programs in terms of transformations on distributed datasets
Resilient Distributed Datasets (RDDs)

 Read-only collections of objects that can be stored in memory or disk across a cluster
 Partitions are automatically rebuilt on failure
 Parallel functional transformations ( map, filter, ..)
 Familiar Scala collections API for distributed data and computation
 Lazy transformations
26
Spark Core
RDD – Resilient Distributed Dataset
▪ A primary abstraction in Spark – a fault-tolerant collection of elements that can be operated on in parallel.
▪ Two Types
▪ Parallelized Scala collections
▪ Hadoop datasets
▪ Transformations and Actions can be performed
on RDDs.
Transformations
▪ Operate on an RDD and return a new RDD.
▪ Are Lazily Evaluated
Actions
▪ Return a value after running a computation on a RDD.
▪ The DAG is evaluated only when an action takes place.
27
Spark Shell
Interactive Queries and prototyping
Local, YARN, Mesos
Static type checking and auto complete
28
Spark compared to Java (native Hadoop)
29
Spark compared to Java (native Hadoop)
30
Spark Streaming
• Real time computation similar to Storm
• Input distributed to memory for fault tolerance
• Streaming input in to sliding windows of RDDs
• Kafka, Flume, Kinesis, HDFS
31
32
Spark Streaming
33
DATA SCIENCE USING SPARK
34
WHAT DO WE NEED TO SUPPORT DATA SCIENCE WORK?
Data Input /Output

 Ability to read data in multiple formats
 Ability to read data from multiple sources
 Ability to deal with Big Data (Volume, Velocity, and Variety)
Data Transformations
 Easy to describe and perform transformations on rows and columns of data
 Requires abstraction of data and a dataflow paradigm
Model Development
 Library of Data Science Algorithms
 Ability to import / export models from other sources
 Data Science pipelines / workflow Development
Analytics Applications Development
 Seamless integration with programming languages / IDEs
35
WHY SPARK FOR DATA SCIENCE?
Distributed In-memory Platform

FAST
Scalable Small to Big Data; Well integrated into the Big Data Ecosystem
SPARK is HOT
Expressive Simple, higher level abstractions for describing computations
Flexible Extendible, Multiple language bindings (Scala, Java, Python, R)
CONFIDENTIAL AND PROPRIETARY 36

Traditional Data Science Tools
SAS RapidMiner
Matlab
And many others….
R
SPSS
 Designed to work on single machines

 Proprietary & Expensive
37
What is available in Spark?
Analytics Workflows (ML Pipeline)
Library of Algorithms(MLlib, R packages, Mahout?, Graph Algorithms)
Extensions to RDD (SchemaRDD, RRDD, RDPG, DStreams)
Basic RDD (Transformations & Actions)
38
DATA TYPES FOR DATA SCIENCE (MLLIB)
Single Machine Data Types Distributed Data Types

(supported by RDDs)
Distributed Matrix
Local Vector
 RowMatrix
Labeled Point  IndexedRowMatrix
Local Matrix  CoordinateMatrix
39
Schema RDDs
40
41
R to Spark Dataflow
Worker
Local
tasks
Spark
R
Executor
broadcast vars
R pacakges
Spark Java
R Context Spark
(ref. in R) Context
Worker
Spark tasks
Executor R
broadcast vars
R pacakges
42
43
MESOS
SHARK
SPARK
44

Readme

Transféré par

Informations du document

Copyright

Formats disponibles

Partager ce document

Partager ou intégrer le document

Options de partage

Avez-vous trouvé ce document utile ?

Ce contenu est-il inapproprié ?

Droits d'auteur :

Formats disponibles

Readme

Transféré par

Droits d'auteur :

Formats disponibles

DATA SCIENCE USING SPARK:

 Getting Started with Spark

 Data Science with Spark

Raw data is Data is Clean

Source: Doing Data Science by Rachel Schutt & Cathy O’Neil

(Structured) (Unstructured) Big Data

Machine Database Management

Data Input /Output

▪ A general engine that allows combination of multiple types of computations

▪ Spark can leverage existing Hadoop infrastructure

CONFIDENTIAL AND PROPRIETARY 12

There are multiple ways of using Spark

▪ Install Python 2.7 using Anaconda (only on Windows)

▪ Connect to the newly created spark-training directory

▪ Databricks Cloud Trail

▪ Scala is a multi-paradigm programming language with focus on the functional

Resilient Distributed Datasets (RDDs)

▪ Transformations and Actions can be performed

Data Input /Output

Distributed In-memory Platform

Flexible Extendible, Multiple language bindings (Scala, Java, Python, R)

CONFIDENTIAL AND PROPRIETARY 36

 Designed to work on single machines

Analytics Workflows (ML Pipeline)

Library of Algorithms(MLlib, R packages, Mahout?, Graph Algorithms)

Extensions to RDD (SchemaRDD, RRDD, RDPG, DStreams)

Basic RDD (Transformations & Actions)

Single Machine Data Types Distributed Data Types

Vous aimerez peut-être aussi