6 DataStream

CS 4440 Lecture 6
Data Stream Processing: An Overview

Sangeetha Seshadri sangeeta@cc.gatech.edu
Agenda
• Data Streams
  • What are they?
  • Why now? Applications
• DSMS: Architecture & Issues
• Query Processing
Data Streams: What and Where?

• Continuous, unbounded, rapid, time-varying streams of data elements (tuples).
• Occur in a variety of modern applications:
  • Network monitoring and traffic engineering
  • Sensor networks, RFID tags
  • Telecom call records
  • Financial applications
  • Web logs and click-streams
  • Manufacturing processes
• DSMS = Data Stream Management System

[Logo: Stanford Stream Data Manager]
DBMS versus DSMS

DBMS:
• Persistent relations
• One-time queries
• Random access
• Access plan determined by query processor and physical DB design

DSMS:
• Transient streams (and persistent relations)
• Continuous queries
• Sequential access
• Unpredictable data characteristics and arrival patterns
Continuous Queries
• One-time queries
  Run once to completion over the current data set.

• Continuous queries
  Issued once and then continuously evaluated over the data.

• Examples:
  • Notify me when the temperature drops below X
  • Tell me when the price of stock Y > 300
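The difference between the two query styles can be sketched in a few lines of Python. This is only an illustration: the `(timestamp, temperature)` schema, the function names, and the sample readings are assumptions, not part of any DSMS.

```python
from typing import Iterable, Iterator, List, Tuple

Reading = Tuple[int, float]  # (timestamp, temperature); hypothetical schema


def one_time_query(snapshot: Iterable[Reading], threshold: float) -> List[Reading]:
    """Run once to completion over the current data set."""
    return [r for r in snapshot if r[1] < threshold]


def continuous_query(stream: Iterable[Reading], threshold: float) -> Iterator[Reading]:
    """Issued once, then evaluated on every tuple as it arrives."""
    for reading in stream:           # the stream may never end
        if reading[1] < threshold:
            yield reading            # notify as soon as the condition holds


# Made-up sample data for illustration.
readings = [(1, 21.5), (2, 19.0), (3, 22.1), (4, 18.4)]
alerts = list(continuous_query(iter(readings), threshold=20.0))
```

On a finite input both functions return the same tuples; the continuous version differs in that it produces each answer incrementally and never needs the whole input at once.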
The (Simplified) Big Picture
[Figure: DSMS overview. Labels: Register Query, Streamed Result, Stored Result, Input streams, Archive, Scratch Store, Stored Relations]
(Simplified) Network Monitoring
[Figure: DSMS for network monitoring. Labels: Register Monitoring Queries, Intrusion Warnings, Online Performance Metrics, Network measurements and packet traces (input), Archive, Scratch Store, Lookup Tables]
Triggers?
• Recall triggers in traditional DBMSs?
• Why not use triggers to process continuous queries over data streams?
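Making Things Concrete

[Figure: ALICE calls BOB through two central offices; each central office feeds a stream into the DSMS]
• Central Office (caller side): Outgoing (call_ID, caller, time, event)
• Central Office (callee side): Incoming (call_ID, callee, time, event)
• event = start or end

Source: R. Motwani, Models & Issues in Data Streams, PODS 2002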
Query 1 (self-join)
• Find all outgoing calls longer than 2 minutes

  SELECT O1.call_ID, O1.caller
  FROM   Outgoing O1, Outgoing O2
  WHERE  (O2.time - O1.time > 2
          AND O1.call_ID = O2.call_ID
          AND O1.event = 'start'
          AND O2.event = 'end')

• Result requires unbounded storage
• Can provide result as data stream
• Can output after 2 min, without seeing end
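A minimal sketch of how Query 1 could be evaluated incrementally over the Outgoing stream. It assumes time is measured in minutes and that events arrive in time order; the function name and sample events are hypothetical. For simplicity it reports a call when its end event arrives rather than as soon as 2 minutes have elapsed.

```python
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[int, str, float, str]  # (call_ID, caller, time in minutes, event)


def long_calls(outgoing: Iterable[Event], min_len: float = 2.0) -> Iterator[Tuple[int, str]]:
    """Stream out (call_ID, caller) for calls longer than min_len minutes."""
    open_calls: Dict[int, Tuple[str, float]] = {}  # call_ID -> (caller, start time)
    for call_id, caller, time, event in outgoing:
        if event == "start":
            open_calls[call_id] = (caller, time)   # state grows with the number of open calls
        elif event == "end" and call_id in open_calls:
            start_caller, start_time = open_calls.pop(call_id)
            if time - start_time > min_len:
                yield call_id, start_caller        # result is itself a stream


# Made-up sample events for illustration.
events = [
    (1, "alice", 0.0, "start"),
    (2, "bob", 0.5, "start"),
    (2, "bob", 1.0, "end"),      # 0.5 min: filtered out
    (1, "alice", 3.5, "end"),    # 3.5 min: reported
]
result = list(long_calls(events))
```

The `open_calls` dictionary is the storage the slide warns about: nothing bounds how many calls can be open at once.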
Query 2 (join)
• Pair up callers and callees

  SELECT O.caller, I.callee
  FROM   Outgoing O, Incoming I
  WHERE  O.call_ID = I.call_ID

• Can still provide result as data stream
• Requires unbounded temporary storage, unless streams are near-synchronized
Query 3 (group-by aggregation)

• Total connection time for each caller

  SELECT   O1.caller, sum(O2.time - O1.time)
  FROM     Outgoing O1, Outgoing O2
  WHERE    (O1.call_ID = O2.call_ID
            AND O1.event = 'start'
            AND O2.event = 'end')
  GROUP BY O1.caller

• Cannot provide result in (append-only) stream
  • Output updates?
  • Provide current value on demand?
  • Memory?
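One of the options listed above, outputting updates, can be sketched as follows. This is an illustrative assumption about what an update stream might look like, not the behavior of a specific system: each completed call emits the caller's new running total, which supersedes any earlier value for that caller.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[int, str, float, str]  # (call_ID, caller, time in minutes, event)


def connection_time_updates(outgoing: Iterable[Event]) -> Iterator[Tuple[str, float]]:
    """Emit (caller, new total) each time one of the caller's calls ends."""
    starts: Dict[int, Tuple[str, float]] = {}      # open calls
    totals: Dict[str, float] = defaultdict(float)  # one entry per caller: memory grows with callers
    for call_id, caller, time, event in outgoing:
        if event == "start":
            starts[call_id] = (caller, time)
        elif event == "end" and call_id in starts:
            start_caller, start_time = starts.pop(call_id)
            totals[start_caller] += time - start_time
            yield start_caller, totals[start_caller]  # an update, not an append


# Made-up sample events for illustration.
events = [
    (1, "alice", 0.0, "start"), (1, "alice", 3.0, "end"),
    (2, "alice", 5.0, "start"), (2, "alice", 6.5, "end"),
    (3, "bob", 7.0, "start"), (3, "bob", 8.0, "end"),
]
updates = list(connection_time_updates(events))
```

A consumer has to treat the second `alice` tuple as replacing the first, which is why the result does not fit an append-only stream.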
DSMS Architecture & Issues

• Data streams and stored relations: architectural differences.
• Declarative language for registering continuous queries
• Flexible query plans and execution strategies
• Centralized? Distributed?
Agenda
• Data Streams
• DSMS: Architecture & Issues
• Query Processing
DSMS Issues
• Relation: Tuple Set or Sequence?
• Updates: Modifications or Appends?
• Query Answer: Exact or Approximate?
• Query Evaluation: One or Multiple Passes?
• Query Plan: Fixed or Adaptive?
Architectural Issues

DSMS:
• Resource (memory, per-tuple computation) limited
• Reasonably complex, near real time, query processing
• Useful to identify what data to populate in database
• Query Evaluation: One pass
• Query Plan: Adaptive

DBMS:
• Resource (memory, disk, per-tuple computation) rich
• Extremely sophisticated query processing, analysis
• Useful to audit query results of data stream systems
• Query Evaluation: Arbitrary
• Query Plan: Fixed

Source: N. Koudas, D. Srivastava (2003), AT&T Labs-Research
STREAM System Challenges

Must cope with:
• Stream rates that may be high, variable, bursty
• Stream data that may be unpredictable, variable
• Continuous query loads that may be high, variable

Consequences:
• Overload: need to use resources very carefully.
• Changing conditions: adaptive strategy.
Query Model
[Figure labels: User/Application, Query Processor, DSMS]
• Query Registration: predefined; ad-hoc; predefined, inactive until invoked
• Answer Availability: one-time; event/timer based; multiple-time, periodic; continuous (stored or streamed)
• Stream Access: arbitrary; weighted history; sliding window (special case: size = 1)

Source: R. Motwani, Models & Issues in Data Streams, PODS 2002
Agenda
• Data Streams
• DSMS: Architecture & Issues
• Query Processing
  • Language
  • Operators
  • Optimization
  • Multi-Query Optimization
Stream Query Language

• SQL extension
• Queries reference/produce relations or streams
• Examples: GSQL [Gigascope], CQL [STREAM]

[Figure: Stream or Finite Relation (input), Stream Query Language, Stream or Finite Relation (output)]
Example: Continuous Query Language (CQL)

Start with SQL, then add:
• Streams as new data type
• Continuous instead of one-time semantics
• Windows on streams (derived from SQL-99)
• Sampling on streams (basic)
Impact of Limited Memory

• Continuous streams grow unboundedly
• Queries may require unbounded memory
• One solution: approximate query evaluation
Approximate Query Evaluation

• Why?
  • Handling load: streams coming too fast
  • Avoid unbounded storage and computation
  • Ad hoc queries need approximate history
• How? Sliding windows, synopsis, samples, load-shed
• Major Issues?
  • Metric for set-valued queries
  • Composition of approximate operators
  • How is it understood/controlled by user?
  • Integrate into query language
  • Query planning and interaction with resource allocation
  • Accuracy-efficiency-storage tradeoff and global metric
Windows
• Mechanism for extracting a finite relation from an infinite stream
• Various window proposals for restricting operator scope:
  • Windows based on ordering attribute (e.g. time)
  • Windows based on tuple counts
  • Windows based on explicit markers (e.g. punctuations)
  • Variants (e.g., partitioning tuples in a window)

[Figure: Stream, window specifications, finite relations manipulated using SQL, streamify, Stream]
Windows
• Terminology
[Figure: two timelines running from the start time through t1, t2, t3, t4, t5 to the current time, one illustrating a Sliding Window and one a Tumbling Window]
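The two window types named in the figure can be sketched over a time-ordered stream of `(timestamp, value)` pairs. The function names, the half-open interval conventions, and the sample stream below are assumptions chosen for illustration; real systems differ in how they define window boundaries.

```python
from collections import deque
from typing import Any, Iterable, Iterator, List, Tuple

Item = Tuple[int, Any]  # (timestamp, value); timestamps assumed non-decreasing


def tumbling_windows(stream: Iterable[Item], size: int) -> Iterator[Tuple[int, List[Any]]]:
    """Non-overlapping windows [k*size, (k+1)*size); emits (window start, values)."""
    current: List[Any] = []
    window_id = None
    for ts, value in stream:
        wid = ts // size
        if window_id is not None and wid != window_id:
            yield window_id * size, current   # previous window is complete
            current = []
        window_id = wid
        current.append(value)
    if current:
        yield window_id * size, current       # flush the last (possibly partial) window


def sliding_window(stream: Iterable[Item], size: int) -> Iterator[Tuple[int, List[Any]]]:
    """After each arrival at time ts, emit the values with timestamp in (ts - size, ts]."""
    window: deque = deque()
    for ts, value in stream:
        window.append((ts, value))
        while window[0][0] <= ts - size:      # expire tuples that slid out of range
            window.popleft()
        yield ts, [v for _, v in window]


# Made-up sample stream for illustration.
stream = [(1, "a"), (2, "b"), (4, "c"), (5, "d"), (9, "e")]
tumbling = list(tumbling_windows(stream, 3))
sliding = list(sliding_window(stream, 3))
```

Each tuple belongs to exactly one tumbling window, whereas consecutive sliding windows overlap and a tuple stays visible until it ages out.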
Query Operators
• Selections: Where clause
• Projections: Select clause
• Joins: From clause
• Group-by (Aggregations): Group-by clause
Query Operators
• Selections and projections on streams: straightforward
  • Local per-element operators
  • Projection may need to include ordering attribute.
• Joins: problematic
  • May need to join tuples that are arbitrarily far apart.
  • Equijoin on stream ordering attributes may be tractable.
• Majority of the work focuses on joins using windows.
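As a rough sketch of a join restricted by a window, the following symmetric join keeps only tuples from the last `window` time units on each side. The interleaved `(ts, side, key, payload)` input format, the function name, and the sample data are assumptions made for this example.

```python
from collections import deque
from typing import Any, Dict, Iterable, Iterator, Tuple

Arrival = Tuple[int, str, int, Any]  # (timestamp, side "L" or "R", join key, payload)


def window_join(events: Iterable[Arrival], window: int) -> Iterator[Tuple[Any, Any]]:
    """Equijoin two time-ordered streams, matching only tuples within `window` of each other."""
    buffers: Dict[str, deque] = {"L": deque(), "R": deque()}
    for ts, side, key, payload in events:
        other = "R" if side == "L" else "L"
        for buf in buffers.values():               # expire old tuples on both sides
            while buf and buf[0][0] <= ts - window:
                buf.popleft()
        for _, o_key, o_payload in buffers[other]:  # probe the opposite buffer
            if o_key == key:
                yield (payload, o_payload) if side == "L" else (o_payload, payload)
        buffers[side].append((ts, key, payload))    # remember for future probes


# Made-up arrivals: L plays the role of Outgoing (caller), R of Incoming (callee).
arrivals = [
    (0, "L", 1, "alice"),
    (1, "R", 1, "bob"),      # joins with alice (1 time unit apart)
    (2, "L", 2, "carol"),
    (10, "R", 2, "dave"),    # carol has expired by now: no result
]
pairs = list(window_join(arrivals, window=5))
```

The buffers stay bounded by the window, at the price of missing matches such as carol and dave that are further apart than the window allows.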
Blocking Operators
• Blocking
  • No output until entire input seen
  • Streams: input never ends
• Simple Aggregates: output update stream
• Set Output (sort, group-by)
  • Root could maintain output data structure
  • Intermediate nodes try non-blocking analogs
• Join
  • Apply sliding-window restrictions
Optimization in DSMS
• Traditionally, table-based cardinalities are used in the query optimizer.
• Goal of query optimizer: minimize the size of intermediate results.
• Problematic in a streaming environment: all streams are unbounded = infinite size!
• Need novel optimization objectives that are relevant when the input sources are streams.
Query Optimization in DSMS

• Novel notions of optimization:
  • Stream rate based [e.g. NiagaraCQ]
  • Resource-based [e.g. STREAM]
  • QoS based [e.g. Aurora]
• Continuous adaptive optimization
• Possibilities that objectives cannot be met:
  • Resource constraints
  • Bursty arrivals under limited processing capabilities.
Stream Projects
• Amazon/Cougar (Cornell): sensors
• Aurora (Brown/MIT): sensor monitoring, dataflow
• Hancock (AT&T): telecom streams
• Niagara (OGI/Wisconsin): Internet XML databases
• OpenCQ (Georgia): triggers, incr. view maintenance
• Stream (Stanford): general-purpose DSMS
• Tapestry (Xerox): pub/sub content-based filtering
• Telegraph (Berkeley): adaptive engine for sensors
• Tribeca (Bellcore): network monitoring
Optimizing Multiple Distributed Stream Queries Using Hierarchical Network Partitions

Sangeetha Seshadri*
Jointly with: Vibhore Kumar*, Brian F. Cooper†, Ling Liu* and Karsten Schwan*
*College of Computing, Georgia Tech
†Yahoo! Research
IPDPS '07, March 29th, 2007
Talk Outline
• Motivation
• Challenges
• Our Approach
• Experimental Results
• Future Work
Distributed Data Stream Systems

[Figure: distributed stream sources and consumers. Sources: Weather, Local Weather, Flight information, Web sources. Consumers: Travel Agent, Centralized DB. Example queries: "Can low-capacity flights be cancelled?", "What is the status of my flight?"]
Motivation
• Lots of data produced in lots of places
• Examples: operational information systems, scientific collaborations, web traffic data, financial applications
• Centralized processing does not scale
Challenges
• Choosing efficient deployments.
  • Fast and efficient initial deployments.
  • Utilize reuse opportunities.
• Handling dynamic nature of system.
  • Queries arrive or leave.
  • Nodes join (recover) or leave (fail).
  • Network conditions change.
  • Data conditions (e.g. rate) change.
Approach Outline
• Typical approaches: Query Planning, then Deployment, then Adaptivity (three separate phases)
• Our approach: Query Planning & Deployment (integrated), then Adaptivity
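Query Planning
[Figure: alternative join trees over streams A, B and C delivering results to a sink, e.g. (A ⋈ B) ⋈ C and (B ⋈ C) ⋈ A; query text as extracted: SELECT * FROM A]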
Query Deployment
[Figure: the plan (A ⋈ B) ⋈ C mapped onto a network of nodes N1 to N5, with stream sources A, B and C and sinks Sink1 to Sink5]
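An Illustrative Example
[Figure: two queries deployed on the same network; query text as extracted: "SELECT * FROM A C" and "SELECT * FROM A" (the join symbols were lost in extraction)]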
Why an integrated approach?

• Integrated approach decreases cost by > 50%
[Figure: setup is a 64-node network with 100 queries over 5 stream sources each; the Y-axis represents communication costs]
Problem
• Massive search space.
• Example: 5 stream sources, 64 nodes
  • 2,880,000,000 (approx.) plans considered.
• Lemma 1 (K = number of sources, N = number of nodes):
  $O_{exhaustive} = \frac{K!\,(K-1)!}{2^{K-1}} \times N^{K-1}$

Our Solution: Trade some optimality for smaller search space
Solution
• Organize the nodes into a virtual Network Hierarchy.
• Operator reuse through Stream Advertisements
• Two approximation-based algorithms:
  • Top-Down
  • Bottom-Up
Optimization Metric
• Minimize "network usage"
• Network usage: total amount of data in transit at any point in time.
• Encapsulates both bandwidth and latency of links.
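One common way to compute such a metric is to sum, over all links carrying data, the data rate multiplied by the link latency. The sketch below assumes that formulation; the node names, rates, and latencies are made up, and the talk may define the cost differently in detail.

```python
from typing import Dict, List, Tuple

Flow = Tuple[str, str, int]  # (source node, destination node, rate in KB/s)


def network_usage(flows: List[Flow], latency_ms: Dict[Tuple[str, str], int]) -> int:
    """Data in transit, in bytes: sum of rate (KB/s) * latency (ms) over all flows."""
    return sum(rate * latency_ms[(src, dst)] for src, dst, rate in flows)


# Hypothetical setup: stream A (100 KB/s) originates at N1, stream B (50 KB/s) at N2,
# the join output (10 KB/s) must reach a sink at N3.
latency_ms = {("N1", "N3"): 50, ("N2", "N3"): 10, ("N2", "N1"): 20}

# Deployment X: join placed at the sink node N3, so both inputs travel to N3.
join_at_sink = [("N1", "N3", 100), ("N2", "N3", 50)]

# Deployment Y: join placed at N1, so B travels to N1 and only the result goes to N3.
join_at_source = [("N2", "N1", 50), ("N1", "N3", 10)]

usage_x = network_usage(join_at_sink, latency_ms)
usage_y = network_usage(join_at_source, latency_ms)
```

In this made-up setup, placing the join next to the high-rate source keeps far less data in flight, which is the kind of difference a deployment optimizer would look for.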
Network Hierarchy
• Cluster network nodes based on cost.
• User-defined parameter max_cs
[Figure: nodes grouped into clusters, each with a Coordinator Node]
Stream Advertisements for Reuse

[Figure: coordinator nodes advertise the streams available in their clusters, e.g. A, C and A ⋈ C, alongside base streams A, B and C]
Optimization Algorithms
• Top-Down
• Bottom-Up
Planning algorithms
• Top-down
[Figure: top-down planning over the hierarchy, with streams A and B]
Top-Down Algorithm: Features

• Reduced search space:
  • Search space reduced by a factor $h \left( \frac{max_{cs}}{N} \right)^{K-1}$ (h = height of hierarchy, N = network size, K = number of sources).
  • User-defined parameter max_cs allows tuning the trade-off between search space and sub-optimality.
• Operators re-used when beneficial through stream advertisements.
Planning algorithms
• Bottom-up
[Figure: bottom-up planning over the hierarchy, with streams A, B, C and D combined step by step]
Bottom-Up Algorithm: Features

• Reduced search space.
  • Deploys only sub-queries within current cluster.
  • Analytical bounds: search space reduced by a factor [formula missing in source].
• Operators re-used when beneficial.
• But, may choose sub-optimal join orders.
Experiments
• Simulation and prototype based experiments.
• 128-node network: used GT-ITM internetwork topology generator.
• Uniformly random workload generator: 10 sources, 100 queries, 2-5 join operators, random sink placements.
Cost with Bottom-Up Algorithm
[Figure]
Comparison with existing approaches
[Figure]
Comparison of Search Space
[Figure]
Future Work
• We have built a prototype based on IFLOW, a distributed data stream system built at Georgia Tech.
• Aggregations
• Modifying existing deployments at runtime
  • Relaxing filter conditions
  • Modifying join ordering at runtime.
Related Work
• Distributed query optimization
  • Distributed INGRES, R*, SDD-1
• Stream data processing engines
  • Centralized: STREAM, Aurora, TelegraphCQ
  • Distributed: Borealis, Flux
Conclusion
• Integrated approach to query optimization
• Hierarchical clustering of network and stream advertisements.
• Approximation-based algorithms
  • Top-Down
  • Bottom-Up
• Design Highlights
  • Trade some optimality for smaller search space.
  • Decrease search space while offering bounds on the sub-optimality.
For further information

• http://www.cc.gatech.edu/~sangeeta
• Contact: sangeeta@cc.gatech.edu
Thank You!
Deployment Times
[Figure]
Example
Simple use-case for pushing down selections:

Query 1:
  SELECT FLIGHTS.Number, FLIGHTS.Status, CARRIER_CODES.Name
  FROM   FLIGHTS, CARRIER_CODES
  WHERE  FLIGHTS.Departing = 'ATLANTA'
    AND  FLIGHTS.Carrier_Code = CARRIER_CODES.Code
    AND  FLIGHTS.Departure_terminal = 'TERMINAL SOUTH'

Query 2:
  SELECT FLIGHTS.Number, FLIGHTS.Status, CARRIER_CODES.Name
  FROM   FLIGHTS, CARRIER_CODES
  WHERE  FLIGHTS.Departing = 'ATLANTA'
    AND  FLIGHTS.Carrier_Code = CARRIER_CODES.Code
    AND  FLIGHTS.Departure_terminal = 'TERMINAL NORTH'
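The two queries differ only in the terminal predicate, so the common selection and the join can be evaluated once and the results split afterwards. The sketch below illustrates that idea in Python under assumed record layouts (dictionaries keyed by the column names above); the function name and sample rows are hypothetical, and this is not the deployment algorithm from the talk.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]
Result = Tuple[str, str, str]  # (Number, Status, carrier Name)


def shared_plan(flights: List[Row], carrier_codes: List[Row]) -> Dict[str, List[Result]]:
    """Evaluate Query 1 and Query 2 together, sharing the common selection and join."""
    names = {c["Code"]: c["Name"] for c in carrier_codes}   # lookup side of the join
    results: Dict[str, List[Result]] = {"TERMINAL SOUTH": [], "TERMINAL NORTH": []}
    for f in flights:
        if f["Departing"] != "ATLANTA":      # shared selection, applied once and early
            continue
        terminal = f["Departure_terminal"]   # the only predicate that differs
        if terminal in results and f["Carrier_Code"] in names:
            results[terminal].append((f["Number"], f["Status"], names[f["Carrier_Code"]]))
    return results


# Made-up sample rows for illustration.
flights = [
    {"Number": "DL100", "Status": "ON TIME", "Departing": "ATLANTA",
     "Carrier_Code": "DL", "Departure_terminal": "TERMINAL SOUTH"},
    {"Number": "FL200", "Status": "DELAYED", "Departing": "ATLANTA",
     "Carrier_Code": "FL", "Departure_terminal": "TERMINAL NORTH"},
    {"Number": "DL300", "Status": "ON TIME", "Departing": "BOSTON",
     "Carrier_Code": "DL", "Departure_terminal": "TERMINAL SOUTH"},
]
carrier_codes = [{"Code": "DL", "Name": "Delta"}, {"Code": "FL", "Name": "AirTran"}]
answers = shared_plan(flights, carrier_codes)
```

Filtering on the departure city before the join means tuples for other cities never travel further through the plan, which matters when the operators sit on different network nodes.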
The Big Picture

• Large number of possibilities
• System Model
  • Stream processing systems (SQL-style queries)
  • Pub-sub systems
  • Runtime annotators (keyword-based queries).
• Trade-offs: cost with
  • Search space
  • Reliability
  • Availability.
• Adaptivity
  • Admission Control
  • Moving operators
  • Dropping data
  • Migrating plans.
Real Enterprise Workload

• Delta Airlines Operational information system
  • Q1 (15%): Terminal Overhead Display (Lifetime = 12 hours)
  • Q2 (80%): Gate Agent Query (Lifetime = 2 hours)
  • Q3 (5%): Ad-hoc flight status monitoring queries (Lifetime = 6 hours)
[Figure: results for the real enterprise workload]
Backups
Data Model
• Append-only
  • Call records
• Updates
  • Stock tickers
• Deletes
  • Transactional data
• Meta-Data
  • Control signals, punctuations
System Internals: probably need all of the above

Source: R. Motwani, PODS 2002
Aurora/STREAM Overview
[Figure: system overview. Input streams flow through query plans (operators marked Running Op, Ready Op, Waiting Op) with Synopses and Historical Storage to output streams]
• Applications register continuous queries
• Users issue continuous and ad-hoc queries
• Administrator monitors query execution and adjusts runtime parameters

Source: R. Motwani, Models & Issues in Data Streams, PODS 2002
Sliding Window Approximation

011000011100000101010
• Why?
  • Approximation technique for bounded memory
  • Natural in applications (emphasizes recent data)
  • Well-specified and deterministic semantics
• Issues
  • Extend relational algebra, SQL, query optimization
  • Algorithmic work
  • Timestamps?

Source: R. Motwani, PODS 2002
Adaptivity (Telegraph)
[Figure: input streams R, S and T feed an EDDY computing R x S x T, which routes tuples among STeMs for join, a grouped filter (R.A) and a grouped filter (S.B), then to output queues]
• Runtime Adaptivity
• Multi-query Optimization
• Framework: implements arbitrary schemes

Source: R. Motwani, PODS 2002
Query-Split Scheme (Niagara)

[Figure: scan of Quotes.XML joined with a scan of a constant table on Symbol = Const.Value (constants IBM, MSFT); a split operator writes file i and file j, whose scans feed trig.Act.i and trig.Act.j]
• Aggregate subscription for efficiency
• Split: evaluate trigger only when file updated
• Triggers: multi-query optimization

Source: R. Motwani, PODS 2002
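Shared Predicates [Niagara, Telegraph]

Predicates for R.A: R.A > 1, R.A > 7, R.A > 11, R.A < 3, R.A < 5, R.A = 6, R.A = 8, R.A ≠ 9
[Figure: predicate index grouped by operator (">" with constants 1, 7, 11; "<" with constants 3, 5; equality and inequality constants 6, 8, 9), probed with a tuple where A = 8]

Source: R. Motwani, PODS 2002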
