
CS 4440 Lecture 6

Data Stream Processing: An Overview


Sangeetha Seshadri sangeeta@cc.gatech.edu

Agenda
- Data Streams
  - What are they?
  - Why now? Applications
- DSMS: Architecture & Issues
- Query Processing

Data Streams: What and Where?

- Continuous, unbounded, rapid, time-varying streams of data elements (tuples).
- Occur in a variety of modern applications:
  - Network monitoring and traffic engineering
  - Sensor networks, RFID tags
  - Telecom call records
  - Financial applications
  - Web logs and click-streams
  - Manufacturing processes
- DSMS = Data Stream Management System


stanfordstreamdatamanager

DBMS versus DSMS


DBMS:
- Persistent relations
- One-time queries
- Random access
- Access plan determined by query processor and physical DB design

DSMS:
- Transient streams (and persistent relations)
- Continuous queries
- Sequential access
- Unpredictable data characteristics and arrival patterns


Continuous Queries
- One-time queries: run once to completion over the current data set.
- Continuous queries: issued once and then continuously evaluated over the data.
- Examples:
  - Notify me when the temperature drops below X
  - Tell me when the price of stock Y > 300
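
The difference between the two query styles can be sketched in a few lines of Python. This is only an illustration under assumed names: `continuous_query`, the `(symbol, price)` tuple layout, and the sample data are hypothetical and do not come from any particular DSMS.

```python
from typing import Callable, Iterable, Iterator, Tuple

Quote = Tuple[str, float]  # (symbol, price); hypothetical tuple layout


def continuous_query(stream: Iterable[Quote],
                     predicate: Callable[[Quote], bool]) -> Iterator[Quote]:
    """Registered once, then evaluated on every tuple as it arrives."""
    for item in stream:
        if predicate(item):
            yield item  # the answer is itself a stream


# "Tell me when the price of stock Y > 300"
quotes = [("Y", 295.0), ("Y", 301.0), ("Z", 500.0), ("Y", 310.0)]
alerts = list(continuous_query(quotes, lambda q: q[0] == "Y" and q[1] > 300))
```

A one-time query would instead scan a stored table once and return; here the generator keeps producing answers for as long as the input keeps arriving.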

The (Simplified) Big Picture

[Figure: a user registers a query with the DSMS and receives a streamed result or a stored result. The DSMS consumes input streams and maintains an archive, a scratch store, and stored relations.]

(Simplified) Network Monitoring


[Figure: monitoring queries are registered with the DSMS, which produces intrusion warnings and online performance metrics. Its inputs are network measurements and packet traces; it maintains an archive, a scratch store, and lookup tables.]

Triggers?
- Recall triggers in traditional DBMSs?
- Why not use triggers to process continuous queries over data streams?

Making Things Concrete


[Figure: ALICE calls BOB. The central office on each side sends a stream to the DSMS.]

- Central office (ALICE): Outgoing (call_ID, caller, time, event)
- Central office (BOB): Incoming (call_ID, callee, time, event)
- event = start or end


R. Motwani, Models & Issues in Data Streams, PODS 2002

Query 1 (self-join)
- Find all outgoing calls longer than 2 minutes

    SELECT O1.call_ID, O1.caller
    FROM Outgoing O1, Outgoing O2
    WHERE (O2.time - O1.time > 2
           AND O1.call_ID = O2.call_ID
           AND O1.event = 'start'
           AND O2.event = 'end')

- Result requires unbounded storage
- Can provide result as data stream
- Can output after 2 min, without seeing end
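
The last point can be illustrated with a small Python sketch. It assumes that tuples arrive in timestamp order and that time is measured in minutes; the function name `long_calls` and the tuple layout are hypothetical. A call is reported as soon as any later tuple shows that more than 2 minutes have passed since its start, so the end event is never needed.

```python
from typing import Dict, Iterable, Iterator, Tuple

# (call_ID, caller, time in minutes, event); hypothetical layout
Event = Tuple[int, str, float, str]


def long_calls(outgoing: Iterable[Event],
               threshold: float = 2.0) -> Iterator[Tuple[int, str]]:
    """Emit (call_ID, caller) once a call is known to exceed `threshold`."""
    open_calls: Dict[int, Tuple[str, float]] = {}  # call_ID -> (caller, start)
    for call_id, caller, time, event in outgoing:
        # Stream time has advanced: report every open call that is now too long.
        overdue = [c for c, (_, start) in open_calls.items()
                   if time - start > threshold]
        for c in overdue:
            yield c, open_calls.pop(c)[0]
        if event == "start":
            open_calls[call_id] = (caller, time)
        elif event == "end":
            # A short call ends unreported; a long one was already emitted above.
            open_calls.pop(call_id, None)


events = [(1, "alice", 0.0, "start"), (2, "bob", 1.0, "start"),
          (2, "bob", 2.5, "end"), (1, "alice", 5.0, "end")]
reported = list(long_calls(events))
```

Under these assumptions the operator state only holds calls that are still open and not yet reported, while the result stream itself grows without bound.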


Query 2 (join)
- Pair up callers and callees

    SELECT O.caller, I.callee
    FROM Outgoing O, Incoming I
    WHERE O.call_ID = I.call_ID

- Can still provide result as data stream
- Requires unbounded temporary storage
  - unless streams are near-synchronized
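
A minimal sketch of why the temporary storage depends on how synchronized the streams are, written as a symmetric hash join in Python. It assumes, for simplicity, that each call_ID appears at most once per stream (for example, only start events are joined); the name `pair_calls` and the merged `(stream, call_ID, party)` input format are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

# (stream, call_ID, party): stream is "O" (Outgoing) or "I" (Incoming)
Tagged = Tuple[str, int, str]


def pair_calls(events: Iterable[Tagged]) -> Iterator[Tuple[str, str]]:
    """Emit (caller, callee) as soon as both sides of a call have been seen."""
    pending_out: Dict[int, str] = {}  # callers still waiting for a callee
    pending_in: Dict[int, str] = {}   # callees still waiting for a caller
    for stream, call_id, party in events:
        if stream == "O":
            if call_id in pending_in:
                yield party, pending_in.pop(call_id)
            else:
                pending_out[call_id] = party
        else:
            if call_id in pending_out:
                yield pending_out.pop(call_id), party
            else:
                pending_in[call_id] = party


merged = [("O", 1, "alice"), ("I", 1, "bob"), ("I", 2, "dave"), ("O", 2, "carol")]
pairs = list(pair_calls(merged))
```

If matching tuples arrive close together, the two dictionaries stay small; if one stream lags arbitrarily far behind the other, they grow without bound.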


Query 3 (group-by aggregation)


- Total connection time for each caller

    SELECT O1.caller, sum(O2.time - O1.time)
    FROM Outgoing O1, Outgoing O2
    WHERE (O1.call_ID = O2.call_ID
           AND O1.event = 'start'
           AND O2.event = 'end')
    GROUP BY O1.caller

- Cannot provide result in (append-only) stream
  - Output updates?
  - Provide current value on demand?
  - Memory?
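
The "output updates" option might look like the following Python sketch: every completed call produces a new (caller, running total) tuple that supersedes the previous one for that caller. The function name and tuple layout are hypothetical, and the sketch assumes each call_ID has one start followed by one end.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[int, str, float, str]  # (call_ID, caller, time, event)


def connection_time_updates(outgoing: Iterable[Event]) -> Iterator[Tuple[str, float]]:
    """Emit an updated total for a caller each time one of their calls ends."""
    starts: Dict[int, float] = {}                    # open calls: call_ID -> start time
    totals: Dict[str, float] = defaultdict(float)    # one entry per caller seen so far
    for call_id, caller, time, event in outgoing:
        if event == "start":
            starts[call_id] = time
        elif event == "end" and call_id in starts:
            totals[caller] += time - starts.pop(call_id)
            yield caller, totals[caller]  # update, not an append-only answer


events = [(1, "alice", 0.0, "start"), (2, "bob", 1.0, "start"),
          (2, "bob", 2.5, "end"), (1, "alice", 5.0, "end"),
          (3, "alice", 6.0, "start"), (3, "alice", 7.0, "end")]
updates = list(connection_time_updates(events))
```

The `totals` dictionary is the memory concern from the slide: it needs one entry per distinct caller, however long the stream runs.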


DSMS Architecture & Issues


- Data streams and stored relations: architectural differences
- Declarative language for registering continuous queries
- Flexible query plans and execution strategies
- Centralized? Distributed?


DSMS Issues
- Relation: Tuple Set or Sequence?
- Updates: Modifications or Appends?
- Query Answer: Exact or Approximate?
- Query Evaluation: One or Multiple Passes?
- Query Plan: Fixed or Adaptive?

Architectural Issues
DSMS:
- Resource (memory, per-tuple computation) limited
- Reasonably complex, near real time, query processing
- Useful to identify what data to populate in database
- Query Evaluation: One pass
- Query Plan: Adaptive

DBMS:
- Resource (memory, disk, per-tuple computation) rich
- Extremely sophisticated query processing, analysis
- Useful to audit query results of data stream systems
- Query Evaluation: Arbitrary
- Query Plan: Fixed

N.Koudas, D. Srivastava (2003) AT&T Labs-Research


STREAM System Challenges


Must cope with:
- Stream rates that may be high, variable, bursty
- Stream data that may be unpredictable, variable
- Continuous query loads that may be high, variable

=> Overload: need to use resources very carefully.
=> Changing conditions: adaptive strategy.

Query Model
[Figure: a user/application interacts with the query processor of the DSMS along three dimensions.]

Query Registration
- Predefined
- Ad-hoc
- Predefined, inactive until invoked

Answer Availability
- One-time
- Event/timer based
- Multiple-time, periodic
- Continuous (stored or streamed)

Stream Access
- Arbitrary
- Weighted history
- Sliding window (special case: size = 1)

Agenda
- Data Streams
  - What are they?
  - Why now? Applications
- DSMS: Architecture & Issues
- Query Processing
  - Language
  - Operators
  - Optimization
  - Multi-Query Optimization

Stream Query Language


- SQL extension
- Queries reference/produce relations or streams
- Examples: GSQL [Gigascope], CQL [STREAM]

[Figure: Stream or Finite Relation -> Stream Query Language -> Stream or Finite Relation]

Example: Continuous Query Language (CQL)

Start with SQL, then add:
- Streams as new data type
- Continuous instead of one-time semantics
- Windows on streams (derived from SQL-99)
- Sampling on streams (basic)

Impact of Limited Memory


- Continuous streams grow unboundedly
- Queries may require unbounded memory
- One solution: Approximate query evaluation

Approximate Query Evaluation


- Why?
  - Handling load: streams coming too fast
  - Avoid unbounded storage and computation
  - Ad hoc queries need approximate history
- How? Sliding windows, synopsis, samples, load-shed
- Major Issues?
  - Metric for set-valued queries
  - Composition of approximate operators
  - How is it understood/controlled by user?
  - Integrate into query language
  - Query planning and interaction with resource allocation
  - Accuracy-efficiency-storage tradeoff and global metric
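
As one concrete instance of the "samples" option, the sketch below uses reservoir sampling, a standard technique that is not named on the slide and is chosen here only for illustration. It keeps a uniform random sample of fixed size k over a stream of unknown length, so memory stays bounded no matter how long the stream runs. The function name and the `rng` parameter are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Keep a uniform sample of at most k items using O(k) memory."""
    rng = rng or random.Random()
    sample: List[T] = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)       # fill the reservoir first
        else:
            j = rng.randrange(n)      # uniform in [0, n)
            if j < k:                 # happens with probability k / n
                sample[j] = item      # evict a random resident
    return sample


picked = reservoir_sample(range(1000), 5, random.Random(0))
```

An ad hoc query issued later can then be answered approximately from `picked` instead of from the full history.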


Windows
- Mechanism for extracting a finite relation from an infinite stream
- Various window proposals for restricting operator scope:
  - Windows based on ordering attribute (e.g. time)
  - Windows based on tuple counts
  - Windows based on explicit markers (e.g. punctuations)
  - Variants (e.g., partitioning tuples in a window)

[Figure: Stream -> window specifications -> finite relations manipulated using SQL -> streamify -> Stream]

Windows
- Terminology

[Figure: a timeline running from the start time through t1, t2, t3, t4, t5 to the current time, contrasting a sliding window with a tumbling window.]
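
The two window types can be contrasted with a short Python sketch over tuple-count windows. The function names are hypothetical; time-based windows would follow the same pattern with a timestamp test in place of the length test.

```python
from collections import deque
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def sliding_windows(stream: Iterable[T], size: int) -> Iterator[List[T]]:
    """Advance by one tuple at a time; consecutive windows overlap."""
    window: deque = deque(maxlen=size)  # oldest tuple is dropped automatically
    for item in stream:
        window.append(item)
        if len(window) == size:
            yield list(window)


def tumbling_windows(stream: Iterable[T], size: int) -> Iterator[List[T]]:
    """Advance by a whole window at a time; windows never overlap."""
    batch: List[T] = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []


sliding = list(sliding_windows([1, 2, 3, 4, 5], 3))
tumbling = list(tumbling_windows([1, 2, 3, 4, 5, 6], 3))
```

Each yielded list is the finite relation that an ordinary SQL operator could then be applied to.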

Query Operators
- Selections: WHERE clause
- Projections: SELECT clause
- Joins: FROM clause
- Group-by (Aggregations): GROUP BY clause

Query Operators
- Selections and projections on streams: straightforward
  - Local per-element operators
  - Projection may need to include ordering attribute.
- Joins: problematic
  - May need to join tuples that are arbitrarily far apart.
  - Equijoin on stream ordering attributes may be tractable.
- Majority of the work focuses on joins using windows.
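
A windowed join bounds the state by only matching tuples whose timestamps lie within a fixed distance of each other. The Python sketch below is one simple way to do this, assuming a single merged input in timestamp order; the name `window_join` and the `(side, key, ts, payload)` layout are hypothetical.

```python
from collections import deque
from typing import Any, Deque, Dict, Iterable, Iterator, Tuple

# (side, key, ts, payload): side is "L" or "R"; input is ordered by ts
Tagged = Tuple[str, Any, float, Any]


def window_join(events: Iterable[Tagged], width: float) -> Iterator[Tuple[Any, Any]]:
    """Equijoin on key, restricted to tuples at most `width` time units apart."""
    windows: Dict[str, Deque[Tuple[Any, float, Any]]] = {"L": deque(), "R": deque()}
    for side, key, ts, payload in events:
        # Expire tuples that have fallen out of the window on both sides.
        for w in windows.values():
            while w and ts - w[0][1] > width:
                w.popleft()
        # Probe the opposite window, then remember the new tuple.
        other = "R" if side == "L" else "L"
        for okey, _, opayload in windows[other]:
            if okey == key:
                yield (payload, opayload) if side == "L" else (opayload, payload)
        windows[side].append((key, ts, payload))


events = [("L", 1, 0, "a"), ("R", 1, 1, "x"), ("R", 2, 2, "y"),
          ("L", 2, 3, "b"), ("R", 1, 10, "z")]
joined = list(window_join(events, width=5))
```

The tuple `("R", 1, 10, "z")` finds no partner because its only match, `"a"` at time 0, has already expired: this is the approximation a window introduces.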

Blocking Operators
- Blocking
  - No output until entire input seen
  - Streams: input never ends
- Simple Aggregates: output update stream
- Set Output (sort, group-by)
  - Root: could maintain output data structure
  - Intermediate nodes: try non-blocking analogs
- Join
  - Apply sliding-window restrictions

Optimization in DSMS
- Traditionally, table-based cardinalities are used in the query optimizer.
  - Goal of query optimizer: minimize the size of intermediate results.
- Problematic in a streaming environment: all streams are unbounded = infinite size!
- Need novel optimization objectives that are relevant when the input sources are streams.

Query Optimization in DSMS


- Novel notions of optimization:
  - Stream rate based [e.g. NiagaraCQ]
  - Resource-based [e.g. STREAM]
  - QoS based [e.g. Aurora]
- Continuous adaptive optimization
- Possibilities that objectives cannot be met:
  - Resource constraints
  - Bursty arrivals under limited processing capabilities

Stream Projects
- Amazon/Cougar (Cornell): sensors
- Aurora (Brown/MIT): sensor monitoring, dataflow
- Hancock (AT&T): telecom streams
- Niagara (OGI/Wisconsin): Internet XML databases
- OpenCQ (Georgia): triggers, incr. view maintenance
- Stream (Stanford): general-purpose DSMS
- Tapestry (Xerox): pub/sub content-based filtering
- Telegraph (Berkeley): adaptive engine for sensors
- Tribeca (Bellcore): network monitoring

Optimizing Multiple Distributed Stream Queries Using Hierarchical Network Partitions


Sangeetha Seshadri*
Jointly with: Vibhore Kumar*, Brian F. Cooper†, Ling Liu* and Karsten Schwan*

* College of Computing, Georgia Tech
† Yahoo! Research

IPDPS '07, March 29th 2007

Talk Outline
- Motivation
- Challenges
- Our Approach
- Experimental Results
- Future Work

Distributed Data Stream Systems


[Figure: distributed sources and consumers (weather, flight information, web sources, local weather, travel agent, centralized DB) serving queries such as "Can low-capacity flights be cancelled?" and "What is the status of my flight?"]

Motivation
- Lots of data produced in lots of places
- Examples: operational information systems, scientific collaborations, web traffic data, financial applications
- Centralized processing does not scale

Challenges
- Choosing efficient deployments.
  - Fast and efficient initial deployments.
  - Utilize reuse opportunities.
- Handling dynamic nature of system.
  - Queries arrive or leave.
  - Nodes join (recover) or leave (fail).
  - Network conditions change.
  - Data conditions (e.g. rate) change.

Approach Outline

- Typical approaches: Query Planning, then Deployment, then Adaptivity
- Our approach: Query Planning & Deployment together, then Adaptivity

Query Planning
[Figure: alternative join plans for a three-way join over streams A, B and C, such as (A ⋈ B) ⋈ C and A ⋈ (B ⋈ C), each delivering results to a sink.]

Query Deployment
[Figure: the plan (A ⋈ B) ⋈ C deployed over a network of nodes N1 to N5, with stream sources A, B and C and sinks Sink1 to Sink5.]

An Illustrative Example..
[Figure: two example queries over the stream sources, one of them SELECT * FROM A ⋈ C.]

Why an integrated approach?


- Integrated approach decreases cost by > 50%

Setup: 64-node network, 100 queries over 5 stream sources each. Y-axis represents communication costs.

Problem
- Massive search space.
- Example: 5 stream sources, 64 nodes
  - 2,880,000,000 (approx.) plans considered.
- Lemma 1: [Equation: size of the exhaustive search space, expressed in terms of the number of sources K and the network size N.]

Our solution: trade some optimality for smaller search space

Solution
- Organize the nodes into a virtual Network Hierarchy.
- Operator reuse through Stream Advertisements
- Two approximation-based algorithms:
  - Top-Down
  - Bottom-Up

Optimization Metric
- Minimize "network usage"
- Network usage: total amount of data in transit at any point in time.
- Encapsulates both bandwidth and latency of links.
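
One common way to make this metric concrete, assumed here rather than taken from the slides, is to sum data rate times latency over every link a deployed plan uses: a link carrying r units per second with latency l seconds holds r * l units in flight. The function name and the `(rate, latency)` pair layout are hypothetical.

```python
from typing import Iterable, Tuple

Link = Tuple[float, float]  # (data rate in KB/s, latency in s); assumed units


def network_usage(links: Iterable[Link]) -> float:
    """Total data in transit: sum of rate * latency over all used links."""
    return sum(rate * latency for rate, latency in links)


# Two candidate deployments of the same plan (made-up numbers):
near_sources = network_usage([(100.0, 0.01), (50.0, 0.02)])
far_from_sources = network_usage([(100.0, 0.05), (50.0, 0.02)])
```

Under this reading, a deployment that moves a high-rate stream over a high-latency link is penalized on both dimensions at once.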


Network Hierarchy
- Cluster network nodes based on cost.
- User-defined parameter max_cs

[Figure: nodes grouped into clusters, each with a coordinator node.]

Stream Advertisements for Reuse


[Figure: coordinator nodes advertise the streams available in their clusters, for example A, C and A ⋈ C, alongside sources A, B and C.]

Optimization Algorithms
- Top-Down
- Bottom-Up

Planning algorithms
- Top down

[Figure: top-down planning example over streams A and B.]

Top-Down Algorithm: Features


- Reduced search space:
  - Search space reduced by a factor that depends on max_cs, h, N and K (h = height of hierarchy, N = network size, K = number of sources).
  - User-defined parameter max_cs allows tuning the trade-off between search space and sub-optimality.
- Operators re-used when beneficial through stream advertisements.

Planning algorithms
- Bottom up

[Figure: bottom-up planning example over streams A, B, C and D.]

Bottom-Up Algorithm: Features


- Reduced search space.
  - Deploys only sub-queries within current cluster.
  - Analytical bounds on the factor by which the search space is reduced.
- Operators re-used when beneficial.
- But may choose sub-optimal join orders.

Experiments
- Simulation and prototype based experiments.
- 128-node network: used GT-ITM internetwork topology generator.
- Uniformly random workload generator: 10 sources, 100 queries, 2-5 join operators, random sink placements.

Cost with Bottom-Up Algorithm

Comparison with existing approaches


Comparison of Search Space


Future Work
- We have built a prototype based on IFLOW, a distributed data stream system built at Georgia Tech.
- Aggregations
- Modifying existing deployments at runtime
  - Relaxing filter conditions
  - Modifying join ordering at runtime

Related Work
- Distributed query optimization
  - Distributed INGRES, R*, SDD-1
- Stream data processing engines
  - Centralized: STREAM, Aurora, TelegraphCQ
  - Distributed: Borealis, Flux

Conclusion
- Integrated approach to query optimization
- Hierarchical clustering of network and stream advertisements.
- Approximation-based algorithms
  - Top-Down
  - Bottom-Up
- Design Highlights
  - Trade some optimality for smaller search space.
  - Decrease search space while offering bounds on the sub-optimality.

For further information


- http://www.cc.gatech.edu/~sangeeta
- Contact: sangeeta@cc.gatech.edu

Thank You!

Deployment Times


Example
Simple use-case for pushing down selections:

Query 1:

    SELECT FLIGHTS.Number, FLIGHTS.Status, CARRIER_CODES.Name
    FROM FLIGHTS, CARRIER_CODES
    WHERE FLIGHTS.Departing = 'ATLANTA'
      AND FLIGHTS.Carrier_Code = CARRIER_CODES.Code
      AND FLIGHTS.Departure_terminal = 'TERMINAL SOUTH'

Query 2:

    SELECT FLIGHTS.Number, FLIGHTS.Status, CARRIER_CODES.Name
    FROM FLIGHTS, CARRIER_CODES
    WHERE FLIGHTS.Departing = 'ATLANTA'
      AND FLIGHTS.Carrier_Code = CARRIER_CODES.Code
      AND FLIGHTS.Departure_terminal = 'TERMINAL NORTH'
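
The two queries differ only in the terminal predicate, so the selection on the departure city and the join with the carrier table can be evaluated once and shared. The Python sketch below illustrates that idea under simplifying assumptions: tuples are plain dictionaries keyed by the column names above, CARRIER_CODES is small enough to hold as a lookup table, and the function name `shared_plan` is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, str]
Result = Tuple[str, str, str]  # (Number, Status, carrier Name)


def shared_plan(flights: Iterable[Row],
                carrier_codes: Iterable[Row]) -> Dict[str, List[Result]]:
    """Evaluate both queries with one shared selection and one shared join."""
    names = {c["Code"]: c["Name"] for c in carrier_codes}
    results: Dict[str, List[Result]] = {"TERMINAL SOUTH": [], "TERMINAL NORTH": []}
    for f in flights:
        if f["Departing"] != "ATLANTA":          # selection pushed below the join
            continue
        terminal = f["Departure_terminal"]
        if terminal in results and f["Carrier_Code"] in names:
            # Only the final terminal test is query-specific.
            results[terminal].append((f["Number"], f["Status"], names[f["Carrier_Code"]]))
    return results


flights = [
    {"Number": "DL1", "Status": "ON TIME", "Departing": "ATLANTA",
     "Carrier_Code": "DL", "Departure_terminal": "TERMINAL SOUTH"},
    {"Number": "DL2", "Status": "DELAYED", "Departing": "ATLANTA",
     "Carrier_Code": "DL", "Departure_terminal": "TERMINAL NORTH"},
    {"Number": "DL3", "Status": "ON TIME", "Departing": "BOSTON",
     "Carrier_Code": "DL", "Departure_terminal": "TERMINAL SOUTH"},
]
carriers = [{"Code": "DL", "Name": "Delta"}]
answers = shared_plan(flights, carriers)
```

Here `answers["TERMINAL SOUTH"]` plays the role of the Query 1 result and `answers["TERMINAL NORTH"]` that of Query 2.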


The Big Picture


- Large number of possibilities
- System Model
  - Stream processing systems (SQL-style queries)
  - Pub-sub systems
  - Runtime annotators (keyword-based queries)
- Trade-offs: cost with
  - Search space
  - Reliability
  - Availability
- Adaptivity
  - Admission control
  - Moving operators
  - Dropping data
  - Migrating plans

Real Enterprise Workload


- Delta Airlines Operational information system
  - Q1 (15%): Terminal Overhead Display (Lifetime = 12 hours)
  - Q2 (80%): Gate Agent Query (Lifetime = 2 hours)
  - Q3 (5%): Ad-hoc flight status monitoring queries (Lifetime = 6 hours)


Backups

Data Model
- Append-only
  - Call records
- Updates
  - Stock tickers
- Deletes
  - Transactional data
- Meta-Data
  - Control signals, punctuations

System internals probably need all of the above.

Aurora/STREAM Overview
[Figure: applications register continuous queries and users issue continuous and ad-hoc queries. Query plans made of running, ready and waiting operators, with their synopses, read input streams and historical storage and produce output streams. An administrator monitors query execution and adjusts runtime parameters.]

Sliding Window Approximation


011000011100000101010

- Why?
  - Approximation technique for bounded memory
  - Natural in applications (emphasizes recent data)
  - Well-specified and deterministic semantics
- Issues
  - Extend relational algebra, SQL, query optimization
  - Algorithmic work
  - Timestamps?

Adaptivity (Telegraph)
[Figure: input streams R, S and T are routed by an eddy (R x S x T) through STeMs for join and grouped filters (R.A), (S.B) to output queues.]

- Runtime adaptivity
- Multi-query optimization
- Framework: implements arbitrary schemes

Query-Split Scheme (Niagara)


[Figure: a scan of Quotes.XML is joined (Symbol = Const.Value) with a scan of a constant table mapping IBM to file i and MSFT to file j. A split operator writes the results to file i and file j, and scans of those files fire trig.Act.i and trig.Act.j.]

- Aggregate subscription for efficiency
- Split: evaluate trigger only when file updated
- Triggers: multi-query optimization

Shared Predicates [Niagara, Telegraph]


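Predicates for R.A: R.A > 1, R.A > 7, R.A > 11, R.A < 3, R.A < 5, R.A = 6, R.A = 8, R.A ≠ 9

[Figure: a tuple with A = 8 is matched against shared index structures, one over the ">" constants (1, 7, 11), one over the "<" constants (3, 5), and lookups for the equality and inequality constants (8, 9).]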
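
The idea of sharing predicate evaluation across queries can be sketched as follows. This is not the NiagaraCQ or Telegraph implementation; it is a minimal Python illustration in which the constants of all ">" and "<" predicates on one attribute are kept sorted, so a single binary search per operator finds every satisfied predicate. The class name `GroupedFilter` and its methods are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import Iterable, List


class GroupedFilter:
    """Index many single-attribute predicates so one tuple is tested once."""

    def __init__(self, greater: Iterable[int], less: Iterable[int],
                 equal: Iterable[int]) -> None:
        self.greater = sorted(greater)  # constants c of predicates A > c
        self.less = sorted(less)        # constants c of predicates A < c
        self.equal = set(equal)         # constants c of predicates A = c

    def matches(self, a: int) -> List[str]:
        # A > c holds for every constant strictly below a.
        out = [f"A>{c}" for c in self.greater[:bisect_left(self.greater, a)]]
        # A < c holds for every constant strictly above a.
        out += [f"A<{c}" for c in self.less[bisect_right(self.less, a):]]
        if a in self.equal:
            out.append(f"A={a}")
        return out


gf = GroupedFilter(greater=[1, 7, 11], less=[3, 5], equal=[6, 8])
hits_for_8 = gf.matches(8)
hits_for_2 = gf.matches(2)
```

With the predicate constants from the slide, a tuple with A = 8 satisfies A>1, A>7 and A=8 after two binary searches and one hash lookup, rather than one test per registered query.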
