
Real-Time Analytics with Apache Cassandra and Apache Spark


Guido Schmutz

Guido Schmutz

• Working for Trivadis for more than 18 years
• Oracle ACE Director for Fusion Middleware and SOA
• Author of different books
• Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
• Technology Manager @ Trivadis
• More than 25 years of software development experience

• Contact: guido.schmutz@trivadis.com
• Blog: http://guidoschmutz.wordpress.com
• Twitter: gschmutz
Agenda

1. Introduction
2. Apache Spark
3. Apache Cassandra
4. Combining Spark & Cassandra
5. Summary
Big Data Definition (4 Vs)

Characteristics of Big Data: its Volume, Velocity and Variety in combination.
+ Time to action? – Big Data + Real-Time = Stream Processing


What is Real-Time Analytics?

What is it? A short time to analyze & respond.

Why do we need it?
§ Required - for new business models
§ Desired - for competitive advantage

How does it work?
• Collect real-time data
• Process data as it flows in
• Data in Motion over Data at Rest
• Reports and Dashboards access processed data

[Diagram: Events → Analyze → Respond along a time axis]
Real-Time Analytics Use Cases

• Algorithmic Trading
• Online Fraud Detection
• Geo Fencing
• Proximity/Location Tracking
• Intrusion Detection Systems
• Traffic Management
• Recommendations
• Churn Detection
• Internet of Things (IoT) / Intelligent Sensors
• Social Media/Data Analytics
• Gaming Data Feed
• …
Apache Spark
Motivation – Why Apache Spark?

Hadoop MapReduce: Data Sharing on Disk
Input → HDFS read → map → HDFS write → HDFS read → reduce → HDFS write → . . . → Output

Spark: Speed up processing by using Memory instead of Disks
Input → op1 → op2 → . . . → Output (intermediate results stay in memory)
Apache Spark

Apache Spark is a fast and general engine for large-scale data processing
• The hot trend in Big Data!
• Originally developed in 2009 in UC Berkeley's AMPLab
• Based on the 2007 Microsoft Dryad paper
• Written in Scala, supports Java, Python, SQL and R
• Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
• One of the largest OSS communities in big data, with over 200 contributors in 50+ organizations
• Open sourced in 2010 – since 2014 part of the Apache Software Foundation
Apache Spark
Libraries:
• Spark SQL (Batch Processing)
• Spark Streaming (Real-Time)
• MLlib (Machine Learning)
• GraphX (Graph Processing)
• BlinkDB (Approximate Querying)
• Spark R

Core Runtime: Spark Core API and Execution Model

Cluster Resource Managers: Spark Standalone, MESOS, YARN
Data Stores: HDFS, NoSQL, Elastic Search, S3
Resilient Distributed Dataset (RDD)

RDDs are:
• Immutable
• Re-computable
• Fault tolerant
• Reusable

RDDs have Transformations:
• Produce a new RDD
• Rich set of transformations available: filter(), flatMap(), map(), distinct(), groupBy(), union(), join(), sortByKey(), reduceByKey(), subtract(), ...

RDDs have Actions:
• Start cluster computing operations
• Rich set of actions available: collect(), count(), fold(), reduce(), ...

Input Sources:
• File
• Database
• Stream
• Collection

[Diagram: an action such as .count() -> 100 runs over the RDD]
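A minimal Scala sketch of this transformation/action split (the sample data and names are illustrative, not from the talk):

import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-basics").setMaster("local[2]"))

    // Build an RDD from a collection; a file, database or stream works the same way
    val readings = sc.parallelize(Seq(18.2, 21.5, 19.9, 23.1, 21.5))

    // Transformations are lazy: they only describe a new (immutable) RDD
    val warm = readings.filter(_ > 20.0).distinct()

    // Actions start the cluster computation
    println(warm.count())           // 2
    println(warm.collect().toList)  // e.g. List(21.5, 23.1) – order not guaranteed

    sc.stop()
  }
}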

Data Partitions of an RDD

[Diagram, three animation steps: the RDD's data is split into Partitions 0–9 spread across Servers 1–5, two partitions per server; when a server (here Server 1) disappears, its partitions are redistributed across the remaining servers]
Data
Spark Workflow Input HDFS File

sc.hapoopFile()
Stage 1 – flatMap() + map()
HadoopRDD MappedRDD

flatMap() P0

MappedRDD P1
Master
Transformations P3
map() DAG
(Lazy) Scheduler
MappedRDD

reduceByKey() Stage 1 – reduceByKey()

ShuffledRDD ShuffledRDD

Action sc.saveAsTextFile() P0
(Execute Text File Output
Transformations)
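As a sketch, the canonical word count produces exactly this lineage (sc.textFile() is the usual shorthand around sc.hadoopFile() for text input; paths are illustrative):

val lines  = sc.textFile("hdfs:///data/input.txt")  // HadoopRDD -> MappedRDD
val words  = lines.flatMap(_.split("\\s+"))         // Stage 1: flatMap()
val pairs  = words.map(word => (word, 1))           // Stage 1: map()
val counts = pairs.reduceByKey(_ + _)               // Stage 2: reduceByKey() -> ShuffledRDD
counts.saveAsTextFile("hdfs:///data/output")        // Action: runs the whole DAG

Nothing is read or computed until saveAsTextFile() is called.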
Spark Workflow – joining two inputs

HDFS File Input 1 → SparkContext.hadoopFile() → HadoopRDD → filter() → FilteredRDD → map() → MappedRDD
HDFS File Input 2 → SparkContext.hadoopFile() → HadoopRDD → map() → MappedRDD
join() → ShuffledRDD
Action: SparkContext.saveAsHadoopFile() → HDFS File Output
(Transformations are lazy; the action executes them.)
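A possible Scala rendering of this two-input workflow (key extraction and paths are assumptions for illustration; on Spark before 1.3 add import org.apache.spark.SparkContext._ for the pair-RDD functions):

val input1 = sc.textFile("hdfs:///data/input1")
  .filter(_.nonEmpty)                         // FilteredRDD
  .map(line => (line.split(",")(0), line))    // MappedRDD, keyed by first field

val input2 = sc.textFile("hdfs:///data/input2")
  .map(line => (line.split(",")(0), line))    // MappedRDD, keyed the same way

val joined = input1.join(input2)              // ShuffledRDD
joined.saveAsTextFile("hdfs:///data/joined")  // Action: executes both lineages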
Spark Execution Model

[Diagram: the Master talks to a Worker on each server; the Worker starts Executors next to the node-local Data Storage]
Spark Execution Model – Narrow Transformations

Narrow transformations work on data within one partition, so no data has to move between workers: filter(), map(), sample(), flatMap()

[Diagram: the Master schedules Stage 1 – flatMap() + map(); each Worker's Executor processes its local RDD partitions (P0, P1, P3) against node-local Data Storage]
Spark Execution Model – Wide Transformations

Wide transformations need data from other partitions and therefore cause a shuffle across workers: join(), reduceByKey(), union(), groupByKey()

[Diagram: Stage 2 – reduceByKey(); the shuffle moves data between the Workers' Executors before the result partitions (P0, …) are produced]
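The narrow/wide split is visible from the driver: toDebugString prints an RDD's lineage with its shuffle boundaries (a small sketch, path illustrative):

val counts = sc.textFile("hdfs:///data/input.txt")
  .flatMap(_.split("\\s+"))   // narrow: handled inside each partition
  .map(word => (word, 1))     // narrow: still no data movement
  .reduceByKey(_ + _)         // wide: forces a shuffle across workers

// Prints the RDD lineage; indentation changes mark the stage boundaries
println(counts.toDebugString)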
Batch vs. Real-Time Processing

Batch: Petabytes of Data
Real-Time: Gigabytes Per Second, from Various Input Sources
Apache Kafka

A distributed publish-subscribe messaging system
• Designed for processing of real-time activity stream data (logs, metrics collections, social media streams, …)
• Initially developed at LinkedIn, now part of Apache
• Does not use the JMS API and standards
• Kafka maintains feeds of messages in topics

[Diagram: Producers publish to the Kafka Cluster; Consumers subscribe to it]


Apache Kafka

[Diagram: a Weather Station produces messages 1–6 into a Temperature Topic and a Rainfall Topic on a Kafka Broker; a Temperature Processor and a Rainfall Processor each consume their topic]
Apache Kafka

[Diagram: the Temperature Topic is now split into Partition 0 and Partition 1, each consumed by its own Temperature Processor instance; the Rainfall Topic keeps a single Partition 0]
Apache Kafka

[Diagram: two Kafka Brokers each hold the Temperature Topic partitions P0 and P1 and the Rainfall Topic partition P0, i.e. partitions are replicated across brokers; the Temperature and Rainfall Processors consume from either broker]
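A hedged producer sketch matching these diagrams, using Kafka's newer Java producer API from Scala (broker address and station id are made up):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")
props.put("key.serializer",   "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)

// The message key (station id) determines the partition, so the readings
// of one weather station stay ordered within a single partition
producer.send(new ProducerRecord("temperature", "station-7", "21.5"))
producer.send(new ProducerRecord("rainfall",    "station-7", "0.2"))
producer.close()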
Discretized Stream (DStream)

[Diagram, animated: Weather Stations publish individual events into Kafka; Spark Streaming groups them discrete by time, so each micro-batch of the DStream = one RDD]
Discretized Stream (DStream)

Every X seconds, a transformation turns each batch of one DStream into a batch of a new DStream:
.countByValue(), .reduceByKey(), .join(), .map()
Discretized Stream (DStream) – time increasing →

Input Stream: messages arrive continuously and are cut into batches at time 1, time 2, time 3, …, time n.

Event DStream: one RDD per batch (RDD @time 1 … RDD @time n), each holding the messages (message 1 … message n) of its interval.

map() → MappedDStream: one RDD per batch, each holding f(message 1) … f(message n) – the DStream transformation lineage mirrors the RDD lineage.

Actions trigger one Spark job per batch, e.g. saveAsHadoopFiles() producing result 1 … result n.

Adapted from Chris Fregly: http://slidesha.re/11PP7FV


Apache Spark Streaming – Core Concepts

Discretized Stream (DStream)
• Core Spark Streaming abstraction
• Micro-batches of RDDs
• Operations similar to RDDs

Input DStreams
• Represent the stream of raw data received from streaming sources
• Data can be ingested from many sources: Kafka, Kinesis, Flume, Twitter, ZeroMQ, TCP Socket, Akka actors, etc.
• Custom sources can easily be written for custom data sources

Operations
• Same as Spark Core + additional stateful transformations (window, reduceByWindow)
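A sketch tying these concepts together with the Kafka source (the receiver-based API from the Spark 1.x spark-streaming-kafka artifact; ZooKeeper address, consumer group and topic layout are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("weather"), Seconds(10))

// Input DStream: (key, value) pairs from the "temperature" topic
val stream = KafkaUtils.createStream(ssc, "zk1:2181", "weather-group", Map("temperature" -> 1))

// Core operations plus a stateful window: max temperature per station
// over the last 60 seconds, re-evaluated every 10 seconds
val maxPerStation = stream
  .map { case (stationId, temp) => (stationId, temp.toDouble) }
  .reduceByKeyAndWindow((a: Double, b: Double) => math.max(a, b), Seconds(60), Seconds(10))

maxPerStation.print()
ssc.start()
ssc.awaitTermination()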
Apache Cassandra
Apache Cassandra

Apache Cassandra™ is a free


• Distributed…
• High performance…
• Extremely scalable…
• Fault tolerant (i.e. no single point of failure)…
post-relational database solution

Optimized for high write throughput


Apache Cassandra – History

Cassandra combines ideas from Google's Bigtable and Amazon's Dynamo.

Motivation – Why NoSQL Databases?

• Dynamo Paper (2007)
• How to build a data store that is
  • Reliable
  • Performant
  • "Always On"
• Nothing new and shiny
• 24 other papers cited
• Evolutionary
Motivation – Why NoSQL Databases?

• Google Bigtable (2006)
• Richer data model
  • 1 key and lots of values
  • Fast sequential access
• 38 other papers cited

Motivation – Why NoSQL Databases?

• Cassandra Paper (2008)
  • Distributed features of Dynamo
  • Data Model and storage from Bigtable
• February 2010: graduated to a top-level Apache Project
Apache Cassandra – More than one server

• All nodes participate in a cluster
• Shared nothing
• Add or remove nodes as needed
• More capacity? Add more servers
• A node is the basic unit inside a cluster
• Each node owns a range of partitions
• Consistent Hashing

[Diagram: a ring of Node 1–4; each node owns one primary token range ([0-25], [26-50], [51-75], [76-100]) and holds replicas of its neighbors' ranges]
Apache Cassandra – Fully Replicated

• A client writes locally
• Data syncs across the WAN
• Replication per Data Center

[Diagram: two data centers, West and East, each a ring of Node 1–4; the Client writes to West and the data replicates to East]
Apache Cassandra

What is Cassandra NOT?
• A Data Ocean
• A Data Lake
• A Data Pond
• An In-Memory Database
• A Key-Value Store
• Not for Data Warehousing

What are good use cases?
• Product Catalog / Playlists
• Personalization (Ads, Recommendations)
• Fraud Detection
• Time Series (Finance, Smart Meter)
• IoT / Sensor Data
• Graph / Network data
How Cassandra stores data

• Model brought from Google Bigtable
• Row Key and a lot of columns
• Column names sorted (UTF8, Int, Timestamp, etc.)

[Diagram: billions of rows, each identified by a Row Key; per row up to 2 billion columns, each with a Column Name, Column Value, Timestamp and TTL, sorted by Column Name]
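In CQL terms this model surfaces as a partition key ("row key") plus sorted clustering columns. A sketch using the DataStax Java driver from Scala (keyspace, table and values are invented for illustration):

import com.datastax.driver.core.Cluster

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()

session.execute("CREATE KEYSPACE IF NOT EXISTS weather " +
  "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}")

// One partition per station and day; within the partition the readings
// are the sorted columns, clustered by timestamp
session.execute(
  "CREATE TABLE IF NOT EXISTS weather.temperature (" +
  "  station_id text, day text, ts timestamp, value double," +
  "  PRIMARY KEY ((station_id, day), ts))")

// Each inserted cell carries a timestamp and, optionally, a TTL
session.execute(
  "INSERT INTO weather.temperature (station_id, day, ts, value) " +
  "VALUES ('station-7', '2015-06-01', dateof(now()), 21.5) USING TTL 86400")

cluster.close()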
Combining Spark & Cassandra
Spark and Cassandra Architecture – Great Combo

Spark: good at analyzing a huge amount of data
Cassandra: good at storing a huge amount of data
Spark and Cassandra Architecture

Spark Streaming (Near Real-Time) | SparkSQL (Structured Data) | MLlib (Machine Learning) | GraphX (Graph Analysis)
Spark and Cassandra Architecture

[Diagram: Weather Stations feed events into Spark Streaming, SparkSQL, MLlib and GraphX; the Spark Connector links these libraries to Cassandra as the storage layer]
Spark and Cassandra Architecture

• A single node runs Cassandra
• The Spark Worker is really small
• The Spark Master lives outside a node
• The Spark Worker starts each Spark Executor in a separate JVM
• Node-local
Spark and Cassandra Architecture

• Each node runs Spark and Cassandra
• The Spark Master can make decisions based on Token Ranges: the Worker on the node owning range 0-25 will only have to analyze 25% of the data
• Spark likes to work on small partitions of data across a large cluster
• Cassandra likes to spread out data in a large cluster
Spark and Cassandra Architecture – Transactional + Analytics

[Diagram: two rings with the same token ranges (0-25, 26-50, 51-75, 76-100): a Transactional data center running Cassandra only, replicating to an Analytics data center where every node also runs a Spark Worker coordinated by the Spark Master]
Cassandra and Spark

                          Cassandra   Cassandra & Spark
Joins and Unions          No          Yes
Transformations           Limited     Yes
Outside Data Integration  No          Yes
Aggregations              Limited     Yes
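A sketch of how the DataStax Spark Cassandra Connector fills these gaps (keyspace/table names continue the weather example and are assumptions; the result table must already exist):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("spark-on-cassandra")
  .set("spark.cassandra.connection.host", "127.0.0.1")  // any Cassandra node
val sc = new SparkContext(conf)

// Read a Cassandra table as an RDD; the connector maps Cassandra token
// ranges to Spark partitions so executors can read node-locally
val temps = sc.cassandraTable[(String, Double)]("weather", "temperature")
  .select("station_id", "value")

// An aggregation plain Cassandra cannot do: max temperature per station
val maxPerStation = temps.reduceByKey((a, b) => math.max(a, b))

// Write the result back to Cassandra
maxPerStation.saveToCassandra("weather", "max_temperature",
  SomeColumns("station_id", "value"))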


Summary
Summary

Kafka
• Topics store information broken into partitions
• Brokers store partitions
• Partitions are replicated for data resilience

Spark
• Replacement for Hadoop MapReduce
• In memory
• More operations than just Map and Reduce
• Makes data analysis easier
• Spark Streaming can take a variety of sources

Cassandra
• The goals of Apache Cassandra are all about staying online and performant
• Best for applications close to your users
• Partitions are similar data grouped by a partition key

Spark + Cassandra
• Cassandra acts as the storage layer for Spark
• Deploy in a mixed cluster configuration
• Spark executors access Cassandra using the DataStax connector
Lambda Architecture with Spark/Cassandra

[Diagram: Data Sources (Social, Channel, Messaging) feed Data Collection; the (Analytical) Batch Data Processing layer keeps Raw Data (Reservoir) and batch-computes it into Result Stores of Computed Information; the (Analytical) Real-Time Data Processing layer runs Stream/Event Processing Tools into its own Result Store; Data Access serves Reports, a Query Engine, Services, Analytic Tools and Alerting from the Result Stores]
Guido Schmutz
Technology Manager
guido.schmutz@trivadis.com