Agenda
• Data Streams
  - What are they?
  - Why now?
  - Applications
    - Network monitoring and traffic engineering
    - Sensor networks, RFID tags
    - Telecom call records
    - Financial applications
    - Web logs and click-streams
    - Manufacturing processes
stanfordstreamdatamanager
Continuous Queries
• One-time queries
[Figure: a DSMS takes registered queries and input streams and returns streamed or stored results; its storage comprises an archive, a scratch store, and stored relations.]
[Figure: a DSMS for network monitoring. Inputs are network measurements and packet traces; storage comprises an archive, a scratch store, and lookup tables.]
Triggers?
• Recall triggers in traditional DBMSs?
• Why not use triggers to process continuous queries over data streams?
Query 1 (self-join)
• Find all outgoing calls longer than 2 minutes

    SELECT O1.call_ID, O1.caller
    FROM Outgoing O1, Outgoing O2
    WHERE (O2.time - O1.time > 2
           AND O1.call_ID = O2.call_ID
           AND O1.event = 'start'
           AND O2.event = 'end')

• Result requires unbounded storage
• Can provide result as data stream
• Can output after 2 min, without seeing end
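A minimal sketch of the last point, showing how a long call can be reported before its end event arrives. The event layout (time in minutes, call_ID, caller, event) and the function name long_calls are assumptions made for illustration, and events are assumed to arrive in timestamp order.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Assumed event layout: (time_in_minutes, call_id, caller, event)
Event = Tuple[float, int, str, str]


def long_calls(events: Iterable[Event], threshold: float = 2.0) -> Iterator[Tuple[int, str]]:
    """Yield (call_id, caller) as soon as a call is known to exceed `threshold`."""
    open_calls: Dict[int, Tuple[float, str]] = {}  # call_id -> (start time, caller)
    for time, call_id, caller, event in events:
        # The stream clock has advanced to `time`: every call still open that
        # started more than `threshold` minutes ago is already a long call.
        overdue = [cid for cid, (start, _) in open_calls.items() if time - start > threshold]
        for cid in overdue:
            yield cid, open_calls.pop(cid)[1]
        if event == "start":
            open_calls[call_id] = (time, caller)
        elif event == "end":
            # A call that ended within the threshold is dropped; a long call
            # was already reported above, so there is nothing left to remove.
            open_calls.pop(call_id, None)
```

State is kept only for calls that are open and younger than the threshold, so memory is bounded by the number of concurrent calls rather than by the length of the stream.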
Query 2 (join)
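• Pair up callers and callees
• The query text that survives computes the total connection time for each caller (group-by aggregation over the same self-join):

    SELECT O1.caller, sum(O2.time - O1.time)
    FROM Outgoing O1, Outgoing O2
    WHERE (O1.call_ID = O2.call_ID
           AND O1.event = 'start'
           AND O2.event = 'end')
    GROUP BY O1.caller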
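A minimal sketch of how a per-caller sum of call durations could be maintained incrementally and emitted as an update stream. The event layout (time, call_ID, caller, event) and the name total_call_time are assumptions for illustration, not part of the original slides.

```python
from collections import defaultdict


def total_call_time(events):
    """Yield (caller, running_total) each time one of the caller's calls ends."""
    starts = {}                  # call_id -> (start time, caller)
    totals = defaultdict(float)  # caller -> accumulated connection time
    for time, call_id, caller, event in events:
        if event == "start":
            starts[call_id] = (time, caller)
        elif event == "end" and call_id in starts:
            start, who = starts.pop(call_id)
            totals[who] += time - start
            # Emit an updated aggregate instead of waiting for the stream to end.
            yield who, totals[who]
```

Because the input never ends, the aggregate is reported as a sequence of updates rather than as one final answer.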
Architectural differences
• Declarative language for registering continuous queries
• Flexible query plans and execution strategies
• Centralized? Distributed?
DSMS Issues
• Relation: Tuple Set or Sequence?
• Updates: Modifications or Appends?
• Query Answer: Exact or Approximate?
• Query Evaluation: One or Multiple Passes?
• Query Plan: Fixed or Adaptive?
Architectural Issues
DBMS
• Resource (memory, disk, per-tuple computation) rich
• Extremely sophisticated query processing and analysis
• Useful to audit query results of data stream systems
• Query evaluation: arbitrary
• Query plan: fixed

DSMS
• Resource (memory, per-tuple computation) limited
• Reasonably complex, near-real-time query processing
• Useful to identify what data to populate in the database
• Query evaluation: one pass
• Query plan: adaptive
Overload
• Overload: resources must be used very carefully.
• Changing conditions: an adaptive strategy is needed.
Query Model
[Figure: a User/Application interacts with the Query Processor of the DSMS along three dimensions.]
• Query Registration
  - Predefined
• Answer Availability
  - One-time
  - Event/timer based
  - Multiple-time, periodic
  - Continuous (stored or streamed)
• Stream Access
  - Arbitrary
  - Weighted history
  - Sliding window (special case: size = 1)
R. Motwani, Models & Issues in Data Streams, PODS 2002
Windows
• Mechanism for extracting a finite relation from an infinite stream
Windows
• Terminology
[Figure: a sliding window and a tumbling window drawn on a time axis relative to the current time.]
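A minimal sketch contrasting the two window types. It uses count-based windows for brevity, which is an assumption (the figure is drawn against a time axis), and the function names are illustrative only.

```python
from collections import deque


def sliding_window(stream, size):
    """After each arrival, yield the most recent `size` items (windows overlap)."""
    window = deque(maxlen=size)  # the oldest item is evicted automatically
    for item in stream:
        window.append(item)
        yield list(window)


def tumbling_window(stream, size):
    """Yield consecutive, non-overlapping batches of exactly `size` items."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    # A trailing partial batch is not emitted in this sketch.
```

A sliding window advances by one element per arrival, so successive windows share elements; a tumbling window advances by its full size, so each element belongs to exactly one window.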
Query Operators
• Selections - Where clause
• Projections - Select clause
• Joins - From clause
• Group-by (Aggregations) - Group-by clause
Query Operators
• Selections and projections on streams - straightforward
  - Local per-element operators
• Joins - problematic
  - May need to join tuples that are arbitrarily far apart.
  - Equijoin on stream ordering attributes may be tractable.
Blocking Operators
• Blocking
  - No output until entire input seen
  - Streams: input never ends
• Simple Aggregates
• Set Output (sort, group-by)
  - Root could maintain output data structure
  - Intermediate nodes try non-blocking analogs
• Join
  - Apply sliding-window restrictions
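A minimal sketch of a join made non-blocking by a sliding-window restriction: each arriving tuple is matched only against tuples from the other stream that arrived within the last `window` time units. The event layout (time, side, key, payload) and the name window_join are assumptions for illustration; events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque


def window_join(events, window):
    """Symmetric equijoin of streams "L" and "R" restricted to a time window.

    Yields (left_payload, right_payload) pairs as soon as they can be formed.
    """
    buffers = {"L": defaultdict(deque), "R": defaultdict(deque)}
    for time, side, key, payload in events:
        other = "R" if side == "L" else "L"
        partners = buffers[other][key]
        # Expire partners that have slid out of the window.
        while partners and time - partners[0][0] > window:
            partners.popleft()
        for _, other_payload in partners:
            yield (payload, other_payload) if side == "L" else (other_payload, payload)
        buffers[side][key].append((time, payload))
        # Note: this sketch only expires entries for keys that are probed again;
        # a fuller version would also evict idle keys periodically.
```

Output is produced on every arrival, so the operator never has to see the whole input before answering.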
Optimization in DSMS
• Traditionally, table-based cardinalities are used in the query optimizer.
• Goal of the query optimizer: minimize the size of intermediate results.
• Problematic in a streaming environment
Stream Projects
• Amazon/Cougar (Cornell): sensors
• Aurora (Brown/MIT): sensor monitoring, dataflow
• Hancock (AT&T): telecom streams
• Niagara (OGI/Wisconsin): Internet XML databases
• OpenCQ (Georgia): triggers, incremental view maintenance
• Stream (Stanford): general-purpose DSMS
• Tapestry (Xerox): pub/sub content-based filtering
• Telegraph (Berkeley): adaptive engine for sensors
• Tribeca (Bellcore): network monitoring
Yahoo! Research
Talk Outline
• Motivation
• Challenges
• Our Approach
• Experimental Results
• Future Work
[Figure labels: Web sources, Local Weather, Travel Agent, Centralized DB]
Motivation
• Lots of data produced in lots of places
• Examples: operational information systems, scientific collaborations, web traffic data, financial applications
• Centralized processing does not scale
Challenges
• Choosing efficient deployments.
• Fast and efficient initial deployments.
• Utilize reuse opportunities.
Approach Outline
[Figure: typical approaches treat Query Planning, Deployment, and Adaptivity as separate stages.]
Query Planning
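[Figure: alternative plans combining streams A, B, and C on the way to a sink, grouped as (A B) C or A (B C); the operator symbols and the rest of the query text "SELECT * FROM A" are lost in the source.]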
Query Deployment
[Figure: the operators of the plan (A B) C placed on network nodes N1 to N4, delivering results to Sink1 and Sink3.]
An Illustrative Example..
[Figure: example queries; the query text is truncated in the source: "SELECT * FROM A C" and "SELECT * FROM A".]
Setup: 64-node network, 100 queries over 5 stream sources each. The Y-axis represents communication cost.
Problem
• Massive search space.
• Example: 5 stream sources, 64 nodes
  - 2,880,000,000 (approx.) plans considered.
• Lemma 1:
Solution
• Organize the nodes into a virtual Network Hierarchy.
• Operator reuse through Stream Advertisements
• Two approximation-based algorithms:
  - Top-Down
  - Bottom-Up
Optimization Metric
• Minimize 'network usage'
• Network usage: total amount of data in transit at any given time
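A minimal sketch of how such a metric could be evaluated for two candidate deployments. It assumes that the data in transit on a link can be approximated as data rate multiplied by a per-link cost such as delay; that approximation, the function name, and all numbers below are illustrative assumptions rather than the definition used in the talk.

```python
def network_usage(transfers):
    """Sum of rate * link_cost over all (rate, link_cost) transfers in a deployment."""
    return sum(rate * cost for rate, cost in transfers)


# Hypothetical deployment 1: both sources ship raw data to the sink and join there.
join_at_sink = [(10.0, 5.0), (10.0, 5.0)]

# Hypothetical deployment 2: join on a node close to the sources, then ship the
# (smaller) join result over the long link to the sink.
join_near_sources = [(10.0, 1.0), (10.0, 1.0), (2.0, 4.0)]
```

Under these made-up numbers the second deployment scores lower, which is the kind of difference the optimizer is looking for when it places operators.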
Network Hierarchy
• Cluster network nodes based on cost.
• User-defined parameter max_cs
[Figure label: Coordinator Nodes]
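A minimal sketch of one way a cost-based hierarchy with coordinator nodes could be built. The greedy nearest-neighbour grouping, the coordinator rule (the member with the lowest total cost to its cluster), and the reading of max_cs as an upper bound on cluster size are all assumptions made for illustration; the slides do not specify the clustering algorithm.

```python
def cluster(nodes, cost, max_cs):
    """Greedily group nodes into clusters of at most `max_cs` members."""
    remaining = list(nodes)
    clusters = []
    while remaining:
        seed = remaining.pop(0)
        remaining.sort(key=lambda n: cost(seed, n))  # nearest to the seed first
        clusters.append([seed] + remaining[: max_cs - 1])
        remaining = remaining[max_cs - 1:]
    return clusters


def coordinator(members, cost):
    """Pick the member with the smallest total cost to the rest of its cluster."""
    return min(members, key=lambda m: sum(cost(m, n) for n in members))


def build_hierarchy(nodes, cost, max_cs):
    """Return the list of levels; each level is a list of clusters."""
    if max_cs < 2:
        raise ValueError("max_cs must be at least 2 for the hierarchy to shrink")
    levels, current = [], list(nodes)
    while True:
        clusters = cluster(current, cost, max_cs)
        levels.append(clusters)
        if len(clusters) == 1:
            return levels
        # Coordinators of this level become the nodes of the next level up.
        current = [coordinator(c, cost) for c in clusters]
```

For example, with nodes placed at positions 0, 1, 2, 10, 11, 12 on a line, cost equal to distance, and max_cs = 3, the sketch produces two bottom-level clusters whose coordinators form a single top-level cluster, giving a hierarchy of height 2.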
Optimization Algorithms
• Top-Down
• Bottom-Up
Planning algorithms
• Top-down
  - [Bound garbled in source: an expression in max_cs, h, and N with an exponent in K], where h = height of hierarchy, N = network size, K = number of sources.
  - The user-defined parameter max_cs tunes the trade-off between search space and sub-optimality.
  - … advertisements.
Planning algorithms
• Bottom-up
[Figure: bottom-up planning over streams A, B, C, and D.]
Experiments
• Simulation and prototype-based experiments.
• 128-node network generated with the GT-ITM internetwork topology generator.
• Uniformly random workload generator: 10 sources, 100 …
Future Work
• We have built a prototype based on IFLOW, a distributed data stream system built at Georgia Tech.
• Aggregations
• Modifying existing deployments at runtime
  - Relaxing filter conditions
  - Modifying join ordering at runtime
Related Work
• Distributed query optimization
  - Distributed INGRES, R*, SDD-1
Conclusion
• Integrated approach to query optimization
• Hierarchical clustering of network and stream …
• Design Highlights
  - Trade some optimality for smaller search space.
  - Decrease search space while offering bounds on the sub-optimality.
Thank You!
Deployment Times
Example
Simple use case for pushing down selections:

Query 1:

    SELECT FLIGHTS.Number, FLIGHTS.Status, CARRIER_CODES.Name
    FROM FLIGHTS, CARRIER_CODES
    WHERE FLIGHTS.Departing = 'ATLANTA'
      AND FLIGHTS.Carrier_Code = CARRIER_CODES.Code
      AND FLIGHTS.Departure_terminal = 'TERMINAL SOUTH'

Query 2:

    SELECT FLIGHTS.Number, FLIGHTS.Status, CARRIER_CODES.Name
    FROM FLIGHTS, CARRIER_CODES
    WHERE FLIGHTS.Departing = 'ATLANTA'
      AND FLIGHTS.Carrier_Code = CARRIER_CODES.Code
      AND FLIGHTS.Departure_terminal = 'TERMINAL NORTH'
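A minimal sketch of one reading of this use case: the selection both queries share (Departing = 'ATLANTA') is applied once near the FLIGHTS source, the join is computed once, and only the terminal filter differs per query. The helper names and the in-memory row format are assumptions for illustration; the slides do not show how the system actually shares operators.

```python
def select(rows, predicate):
    """Keep only the rows that satisfy the predicate."""
    return [r for r in rows if predicate(r)]


def join_carriers(flights, carriers):
    """Equijoin FLIGHTS.Carrier_Code = CARRIER_CODES.Code, adding the carrier Name."""
    names = {c["Code"]: c["Name"] for c in carriers}
    return [dict(f, Name=names[f["Carrier_Code"]])
            for f in flights if f["Carrier_Code"] in names]


def project(rows):
    """SELECT FLIGHTS.Number, FLIGHTS.Status, CARRIER_CODES.Name."""
    return [(r["Number"], r["Status"], r["Name"]) for r in rows]


def run_both_queries(flights, carriers):
    # Shared work: the common selection is pushed down to the source,
    # and the join runs once on the reduced input.
    atlanta = select(flights, lambda f: f["Departing"] == "ATLANTA")
    joined = join_carriers(atlanta, carriers)
    # Per-query work: only the terminal predicate differs.
    q1 = project(select(joined, lambda r: r["Departure_terminal"] == "TERMINAL SOUTH"))
    q2 = project(select(joined, lambda r: r["Departure_terminal"] == "TERMINAL NORTH"))
    return q1, q2
```

Filtering before the join means non-Atlanta flights never travel across the network or enter the join, and the second query reuses the first query's intermediate result instead of recomputing it.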
• Trade-offs: cost with
  - Search space
  - Reliability
  - Availability
• Adaptivity
  - Admission control
  - Moving operators
  - Dropping data
  - Migrating plans
Backups
Data Model
• Append-only
  - Call records
• Updates
  - Stock tickers
• Deletes
  - Transactional data
• Meta-Data
  - Control signals, punctuations
Aurora/STREAM Overview
[Figure: system overview. Applications register continuous queries and users issue continuous and ad-hoc queries; input streams flow through query plans (operators marked running, ready, or waiting, with synopses) to output streams, alongside historical storage.]
• Why?
  - Approximation technique for bounded memory
  - Natural in applications (emphasizes recent data)
  - Well-specified and deterministic semantics
• Issues
  - Extend relational algebra, SQL, query optimization
  - Algorithmic work
  - Timestamps?
Adaptivity (Telegraph)
[Figure: an EDDY routes tuples from input streams R, S, T through STeMs for the join R x S x T to output queues.]
• Aggregate subscription for efficiency
• Split: evaluate trigger only when file updated
• Triggers: multi-query optimization
[Figure labels: operators > and <; constants 1, 3, 8, 9; predicates A<3 and A<5; tuple A=8]