
Distributed Query Processing

Donald Kossmann
University of Heidelberg
kossmann@informatik.uni-heidelberg.de

Agenda
Query Processing 101
centralized query processing
distributed query processing

Middleware
SQL and XML data integration

The Role of Web Services

Problem Statement
Input: Query
How many times has the moon circled around the earth in the last twenty years?

Output: Answer
240!

Objectives:
response time, throughput, first answers, low I/O, ...

Centralized vs. Distributed Query Processing


same problem, but different parameters and objectives

Query Processing 101


Input: Declarative Query
SQL, OQL, XQuery, ...

Step 1: Translate Query into Algebra


Tree of operators

Step 2: Optimize Query (physical and logical)


Tree of operators (Compilation)

Step 3: Interpretation
Query result

Algebra
SELECT A.d FROM A, B WHERE A.a = B.b AND A.c = 35

[Figure: operator tree for the query: a projection on A.d over a selection A.a = B.b, A.c = 35 over the cross product A × B]

relational algebra for SQL: very well understood
algebra for OQL: fairly well understood
algebra for XQuery: work in progress
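As a concrete sketch, the example query's operator tree could be represented like this in Python (the node classes are hypothetical, not from any particular system):

```python
# Hypothetical node classes for the operator tree; not from any real system.
from dataclasses import dataclass

@dataclass
class Scan:
    table: str

@dataclass
class CrossProduct:
    left: object
    right: object

@dataclass
class Select:                  # selection: filter rows by a predicate
    predicate: str
    child: object

@dataclass
class Project:                 # projection: keep only the listed columns
    columns: list
    child: object

# SELECT A.d FROM A, B WHERE A.a = B.b AND A.c = 35
plan = Project(["A.d"],
               Select("A.a = B.b AND A.c = 35",
                      CrossProduct(Scan("A"), Scan("B"))))
```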

Query Optimization
[Figure: the naive plan (projection on A.d, selection A.a = B.b, A.c = 35, cross product of A and B) is rewritten into an optimized plan that uses an index on A.c and a hash join on A.a = B.b]

no-brainers (e.g., push down cheap predicates)
enumerate alternative plans, apply cost model
use search heuristics to find cheapest plan
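A toy illustration of cost-based plan choice; both plans and all cost numbers here are invented:

```python
# Toy cost-based choice between two plans for the example query;
# the plans and every cost number here are invented.
plans = {
    "scan(A) x scan(B), then filter": 10_000,   # naive cross product
    "index(A.c=35) hashjoin scan(B)": 40,       # after pushdown and rewrite
}
best = min(plans, key=plans.get)                # apply the cost model
print("chosen plan:", best, "| estimated cost:", plans[best])
```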

Query Execution
[Figure: execution of the optimized plan. The index on A.c yields (John, 35, CS) and (Mary, 35, EE); the scan of B yields (Edinburgh, CS, 5.0) and (Edinburgh, AS, 6.0); the hash join on B.b matches the CS tuples, and the projection on A.d emits John]
library of operators (hash join, merge join, ...)
pipelining (iterator model)
lazy evaluation
exploit indexes and clustering in database
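Python generators give a compact sketch of the iterator model: each operator lazily pulls tuples from its child, so tuples stream through the plan one at a time (data and predicate invented for the example):

```python
# Each operator is a generator that lazily pulls from its child.
def scan(rows):
    for row in rows:
        yield row

def select(pred, child):
    for row in child:
        if pred(row):
            yield row

def project(cols, child):
    for row in child:
        yield {c: row[c] for c in cols}

A = [{"a": 1, "c": 35, "d": "John"}, {"a": 2, "c": 20, "d": "Mary"}]
plan = project(["d"], select(lambda r: r["c"] == 35, scan(A)))
for row in plan:        # tuples flow through the pipeline one at a time
    print(row)          # {'d': 'John'}
```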

Summary: Centralized Queries


Basic SQL (SPJG, nesting) well understood
Very good extensibility
nearest neighbor search, spatial joins, time series, UDF, roll-up, cube, ...

Current problems
statistics, cost model for optimization
physical database design expensive

Trends
interactivity during execution
approximate answers
more and more functionality, powerful models (XML)

Distributed Query Processing 101


Idea:
This is just an extension of centralized query processing. (System R* et al. in the early 80s)

What is different?
extend physical algebra: send & receive operators
resource vectors, network interconnect matrix
caching and replication
optimize for response time
less predictability in cost model (adaptive algorithms)
heterogeneity in data formats and data models

Distributed Query Plan


[Figure: distributed plan. The projection on A.d and the hash join run at the query site; receive operators pull tuples from send operators at the remote sites, one running the index scan on A.c, the other scanning B (B.b)]
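A minimal sketch of send & receive as ordinary operators extending the physical algebra; a queue stands in for the network and a thread for the remote site, and all names are invented:

```python
# send & receive as plan operators; queue = network, thread = remote site.
import queue, threading

def send(child, chan):
    """Producer site: drain the local subplan into the channel."""
    for row in child:
        chan.put(row)
    chan.put(None)                             # end-of-stream marker

def receive(chan):
    """Consumer site: an iterator that streams tuples off the network."""
    while (row := chan.get()) is not None:
        yield row

chan = queue.Queue()
remote_scan = iter([("Edinburgh", "CS", 5.0), ("Edinburgh", "AS", 6.0)])
threading.Thread(target=send, args=(remote_scan, chan)).start()
for row in receive(chan):                      # feeds the local hash join
    print(row)
```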

Cost
Total Cost = Sum of Cost of Ops

[Figure: the distributed plan annotated with per-operator costs (1, 8, 1, 1, 6, 6, 2, 5, 10); summing over all operators gives Cost = 40]

Response Time
[Figure: the same plan annotated with (first tuple, last tuple) intervals per operator, e.g. (0, 5) and (0, 10) at the leaf scans, (0, 18) and (0, 24) mid-plan, and (24, 32) and (25, 33) at the top. With independent and pipelined parallelism, Total Cost = 40 but first tuple = 25 and last tuple = 33]
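To make the contrast concrete, here is a toy recomputation; the per-operator costs are the plan's annotations, but their assignment to operators and the critical-path reasoning are illustrative assumptions:

```python
# The nine operator costs come from the annotated plan (which operator
# carries which number is my assumption); the response times are the slide's.
op_cost = [1, 8, 1, 1, 6, 6, 2, 5, 10]
print("total cost =", sum(op_cost))   # 40: what sum-based optimizers minimize

# Response time is NOT this sum: independent subplans run concurrently
# (their elapsed times combine by max, not +), and pipelined operators
# overlap with their consumers. Hence first tuple = 25, last tuple = 33.
```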

Adaptive Algorithms
Deal with unpredictable events at run time
delays in arrival of data, burstiness of network
autonomy of nodes, changes in policies

Example: double pipelined hash joins


build hash tables for both input streams
read inputs in separate threads
good for bursty arrival of data (sketched below)
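A minimal sketch of the symmetric (double pipelined) hash join; the single merged arrival stream is a simplification of the two-thread version described above:

```python
# Hash tables are built on BOTH inputs, so every arriving tuple can
# immediately probe the other side's table and emit results, no matter
# which input happens to be delivering at the moment.
from collections import defaultdict

def symmetric_hash_join(stream, key):
    """`stream` yields ('L', row) or ('R', row) in arrival order."""
    left_tab, right_tab = defaultdict(list), defaultdict(list)
    for side, row in stream:
        mine, other = (left_tab, right_tab) if side == "L" else (right_tab, left_tab)
        mine[key(row)].append(row)            # build: insert into own table
        for match in other[key(row)]:         # probe: the other side's table
            yield (row, match) if side == "L" else (match, row)

# bursty arrivals, interleaved from both inputs:
arrivals = [("L", (1, "a")), ("R", (1, "x")), ("R", (2, "y")), ("L", (2, "b"))]
for pair in symmetric_hash_join(iter(arrivals), key=lambda r: r[0]):
    print(pair)    # ((1, 'a'), (1, 'x')) then ((2, 'b'), (2, 'y'))
```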

re-optimization at run time


monitor execution of query
adjust estimates of cost model
re-optimize if delta is too large
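A sketch of that monitoring loop; the factor-of-delta trigger and the `replan` helper are invented for illustration:

```python
# Hypothetical monitoring hook: `estimated` comes from the optimizer,
# `observed` from instrumented operators; `replan` rebuilds the remaining
# plan with corrected statistics.
def maybe_reoptimize(plan, estimated, observed, replan, delta=2.0):
    error = observed / max(estimated, 1)
    if error > delta or error < 1.0 / delta:   # estimates off by > delta
        return replan(plan, observed)          # re-optimize mid-query
    return plan                                # keep running current plan
```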

Heterogeneity
Use wrappers to hide heterogeneity
Wrappers take care of data format, packaging
Wrappers map from local to global schema
Wrappers carry out caching
connections, cursors, data, ...

Wrappers map queries into local dialect
Wrappers participate in query planning!!!
define the subset of queries that can be handled
give cost information, statistics
capability-based rewrite (HKWY, VLDB 1997)
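A minimal sketch of such a wrapper interface (the method names are hypothetical, not from a specific system):

```python
# Hypothetical wrapper interface; method names invented for illustration.
class Wrapper:
    def can_handle(self, subquery) -> bool:
        """Declare the subset of queries the source supports
        (the basis for capability-based rewrite)."""
        ...
    def cost(self, subquery) -> float:
        """Supply statistics / cost estimates to the global optimizer."""
        ...
    def translate(self, subquery) -> str:
        """Map a global-schema query into the source's local dialect."""
        ...
    def execute(self, subquery):
        """Run the translated query; yield tuples mapped back to the
        global schema, possibly served from a cache."""
        ...
```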

Data Cleaning
Are two objects the same?
Is D. A. Kossman the same as Kossmann?
Is the object that was at position x 10 min. ago the same as the object at position y now?
Approaches (a combination of):
statistical methods
domain knowledge
human inspection

Very Expensive
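A toy illustration of why this is hard and costly, combining a statistical similarity measure with a simple domain rule; the normalization and threshold are invented for the example:

```python
# Invented example: edit-distance-based similarity plus a domain rule
# (compare surnames, ignore initials). The threshold is arbitrary; real
# pipelines add human inspection on top, which is what makes this expensive.
from difflib import SequenceMatcher

def probably_same(name1, name2, threshold=0.7):
    surname = lambda n: n.replace(".", " ").split()[-1].lower()
    return SequenceMatcher(None, surname(name1), surname(name2)).ratio() >= threshold

print(probably_same("D. A. Kossman", "Kossmann"))   # True (ratio ~0.93)
```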

Summary
Theory very well understood
extend traditional (centralized) query processing
add some bells and whistles
heterogeneity needs manual work and wrappers

Problems in Practice
cost model, statistics
architectures are not fit for adaptivity, heterogeneity
optimizers do not scale to 10,000s of sites
autonomy of sites; systems not built for asynchronous communication
data cleaning

Middleware
Two kinds of middleware
data warehouses
virtual integration

Data Warehouses
good: query response times
good: materializes results of data cleaning
bad: high resource requirements in middleware
bad: staleness of data

Virtual Integration
the opposite trade-offs
caching possible to improve response times

Virtual Integration
[Figure: a query enters the middleware (query decomposition, result composition), which sends a subquery through a wrapper to DB1 and another subquery through a wrapper to DB2]
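Purely as a sketch (all classes, names, and data invented), the decompose-then-compose flow might look like this: selections are pushed to the sources through wrappers, and the middleware performs the final join:

```python
# Invented illustration: two toy "sources" behind wrappers, a decomposed
# query (selections pushed down), and result composition in the middleware.
class ListWrapper:
    def __init__(self, rows):
        self.rows = rows
    def execute(self, predicate):              # subquery shipped to the source
        return [r for r in self.rows if predicate(r)]

db1 = ListWrapper([{"name": "John", "dept": "CS"}])
db2 = ListWrapper([{"dept": "CS", "city": "Edinburgh"}])

students = db1.execute(lambda r: r["dept"] == "CS")    # subquery at DB1
depts    = db2.execute(lambda r: True)                 # subquery at DB2
# result composition (a join) in the middleware:
result = [{**s, **d} for s in students for d in depts if s["dept"] == d["dept"]]
print(result)   # [{'name': 'John', 'dept': 'CS', 'city': 'Edinburgh'}]
```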

IBM Data Joiner


[Figure: the same architecture with SQL end to end: an SQL query enters the Data Joiner, which sends SQL subqueries through wrappers to the SQL databases DB1 and DB2]

Adding XML
[Figure: an XML query passes through an XML publishing layer into SQL middleware, which sends subqueries through wrappers to DB1 and DB2]

XML Data Integration


[Figure: an XML query enters XML middleware, which sends XML subqueries through wrappers to DB1 and DB2]

XML Data Integration


Example: BEA Liquid Data
Advantage: availability of XML wrappers for all major databases

Problems
XML-SQL mapping is very difficult
XML is not always the right language (e.g., decision-support style queries)

Summary
Middleware looks like a homogeneous, centralized database
location transparency
data model transparency

Middleware provides global schema


data sources map local schemas to global schema

Various kinds of middleware (SQL, OQL, XML)
Stacks of middleware possible
Data cleaning requires special attention

A Note on Web Services


Idea: Encapsulate Data Source
provide WSDL interface to access data
works very well if query pattern is known

Problem: Exploit Capability of Source


WSDL limits the capabilities of a data source
good optimization requires a white box
example: access by id, access by name, full scan
should all combinations be listed in the WSDL?

Solution: WSDL for Query Planning


Details ???
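The details are indeed open; purely as a sketch (service name, fields, and costs all invented), a source might publish a compact capability description that the optimizer consumes during planning, instead of one WSDL operation per access pattern:

```python
# All names, fields, and costs here are invented. The idea: publish access
# paths and pushable predicates as data, and let the optimizer pick.
capabilities = {
    "service": "CustomerData",
    "access_paths": [
        {"by": ["id"],   "cost": 1},     # key lookup
        {"by": ["name"], "cost": 5},     # secondary index
        {"by": [],       "cost": 1000},  # full scan, always available
    ],
    "pushable_predicates": ["=", "<", ">"],
}

def cheapest_path(bound_columns):
    usable = [p for p in capabilities["access_paths"]
              if set(p["by"]) <= set(bound_columns)]
    return min(usable, key=lambda p: p["cost"])

print(cheapest_path(["id", "name"]))     # {'by': ['id'], 'cost': 1}
```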
