Distributed Query Processing

Al-Quds University
Distributed Query Processing
Database
For: Dr. Badee Sartawi
Name: Hiba Jafar
Date: 30-5-2003
Table of contents:
Sec.1 An Introduction to (DQP).
Sec.2 The Objectives.
Sec.3 General area of research.
Sec.4 History.
Sec.5 What is (DQP).
Sec.6 Why use (DQP); issues that make distributed data processing
still complex.
Sec.7 Where and How (DQP) works.
Sec.8 Query processing.
Sec.9 Architecture of the system.
Sec.10 Some advantages of (DQP).
Sec.11 Running example.
Sec.12 Summary.
Sec.13 References.
Abstract:
The paper presents the “textbook” architecture for distributed query processing and a
series of techniques that are particularly useful for distributed database systems, and
shows how query processing works in these systems.
A very large body of work exists in the general area of database systems. This
work can be roughly classified into work on architectures and techniques for
transaction processing (i.e., quickly processing small update operations), work on
query processing (i.e., mostly read operations that explore large amounts of data), and
work on data models, languages and user interfaces for advanced applications. In this
paper, we focus primarily on query processing. A discussion of transaction
processing and of alternative data models is beyond the scope of this paper, and the
paper cannot give full coverage of all query processing techniques
used today.
1. INTRODUCTION
A distributed database (DDB) is a collection of multiple, logically interrelated databases
distributed over a computer network. Distributing databases over a network brings
the advantages of performance, reliability, availability and modularity that are inherent in
distributed systems.
As with traditional centralized databases, distributed database systems (DDBSs) must
provide an efficient user interface that hides all of the underlying data distribution details of
the DDB from the users. A relational query lets the user describe
the data that is required without having to know where the data is physically located.
The retrieval of data from different sites in a DDB is referred to as distributed query
processing (DQP).
Oracle distributed database systems employ a distributed processing architecture: an
Oracle database server acts as a client when it requests data that another Oracle database
server manages. For example, the following query accesses data from the local database as
well as the remote sales database. The first table (EMP) is stored at site 1 and the second table
(DEPT) at site 2:
SELECT ename, dname
FROM scott.emp e, scott.dept@sales.us.americas.acme_auto.com d
WHERE e.deptno = d.deptno
A distributed query is thus one that selects data from databases located at multiple sites in a
network, while distributed processing performs computations on multiple CPUs to achieve a
single result. Any SQL data manipulation statement that references tables at sites other than
the site to which an application program is submitted for compilation (i.e., the query site) is called
a distributed query and must be processed accordingly.
Query processing is much more difficult in a distributed environment than in a centralized
environment because:
• A large number of parameters affect the performance of distributed queries.
• Relations involved in a distributed query may be fragmented and/or replicated.
• With many sites to access, query response time may become very high.
The performance of a DDBS therefore depends critically on the ability of
the query optimization algorithm to derive efficient query processing strategies. DDBMS
query optimization algorithms attempt to reduce the quantity of data transferred.
Minimizing the quantity of data transferred is a desirable optimization criterion, since
transporting more data across telecommunications networks takes more time and effort.
Distributed query optimization faces several problems that relate to the cost model, the larger set of
queries, the tradeoff between optimization cost and execution cost, and the optimization/reoptimization
interval.
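The criterion of minimizing transferred data can be illustrated with a small sketch. It is not a real optimizer: the class RelationStats, the function choose_join_site and all sizes are assumptions made up for this example, and only network volume is counted. For a two-site join like the EMP/DEPT query above, it compares shipping either relation to the other site and keeps the cheaper plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationStats:
    """Hypothetical catalog statistics for one relation stored at one site."""
    name: str
    site: str
    row_count: int
    row_bytes: int

    @property
    def size_bytes(self) -> int:
        return self.row_count * self.row_bytes


def choose_join_site(left: RelationStats, right: RelationStats) -> tuple[str, int]:
    """Return (join_site, bytes_shipped) for the cheaper of two plans.

    Plan A ships `left` to the site of `right`; plan B ships `right` to the
    site of `left`. Only transfer volume is counted; local CPU and I/O
    costs are ignored in this sketch.
    """
    if left.site == right.site:
        return left.site, 0  # co-located: nothing crosses the network
    if left.size_bytes <= right.size_bytes:
        return right.site, left.size_bytes
    return left.site, right.size_bytes


# Assumed sizes: EMP is large and lives at site1, DEPT is small at site2.
emp = RelationStats("EMP", site="site1", row_count=10_000, row_bytes=100)
dept = RelationStats("DEPT", site="site2", row_count=50, row_bytes=60)
join_site, shipped = choose_join_site(emp, dept)
print(join_site, shipped)
```

With these assumed numbers the small DEPT relation is shipped to site1, so 3,000 bytes cross the network instead of 1,000,000.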
2. The Objectives:
* The objectives of this paper are:
• To learn what distributed query processing is.
• To understand the main idea behind (DQP).
• To understand how it works.
• To know the goal of (DQP).
* The objectives of using (DQP) itself:
The goal is to execute such queries as efficiently as possible in order to minimize the
response time that users must wait for answers or the time application programs are
delayed. Further goals are to minimize the total communication costs associated with a query
and to improve throughput via parallel processing, sharing of data and equipment, and
modular expansion of data management capacity. In addition, when redundant data is
maintained, one also achieves increased data reliability and improved response time.
3. General area of research:
This paper shows that there are many different reasons to
rely on distributed architectures, and correspondingly many different kinds of distributed
systems exist. Sometimes it is only the software and not the hardware that is distributed. The
purpose of this paper is to give a comprehensive overview of the query processing
techniques needed to implement any kind of distributed database and information system.
It is assumed that users and application programs issue queries using a declarative query
language such as SQL or OQL, without knowing where and in which format the data is
stored in the distributed system.
4. History (Background and Motivation):
Researchers and practitioners have been interested in distributed database systems
since the 1970s. At that time, the main focus was on supporting distributed
data management for large corporations and organizations that kept their data at
different offices or subsidiaries.
Although there was a clear need and many good ideas and prototypes (e.g., Distributed
Ingres), the early efforts in building distributed database systems were never
commercially successful. In some respects, the early distributed database systems were
ahead of their time. First, communication technology was not stable enough to ship
megabytes of data as required for these systems. Second, large businesses somehow
managed to survive without sophisticated distributed database technology by sending
tapes, diskettes, or just paper to exchange data between their offices. Today, the
situation has changed dramatically.
Distributed data processing is both feasible and needed. Almost all major database
system vendors offer products to support distributed data processing (e.g., IBM,
Informix, Microsoft, Oracle, Sybase), and large database application systems have a
distributed architecture (e.g., business application systems such as Baan IV, Oracle
Finance, Peoplesoft 7.5, and SAP R/3). Distributed data processing is feasible because
of recent technological advances (e.g., hardware, software protocols,
standards). Distributed data processing is needed because of changing business
requirements, which have made distributed data processing cost-effective and in
certain situations the only viable option.
5. What is Distributed Query Processing:
A distributed database (DDB) consists of copies of data files (often redundant)
distributed on a network of computers.
Query processing (or data retrieval) is an important problem in distributed
databases. Accessing data distributed over different computer sites necessitates
the transmission of data over communication links. Since communication delay is
substantial, the database management system must devise an efficient strategy to
coordinate data processing at local computer sites and data transmission between
sites. The problem is harder in a redundant database, because choosing which of the
redundant copies to access becomes an important issue.
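The choice among redundant copies can be sketched as picking the copy that is cheapest to reach from the query site. This is only an illustration under simple assumptions: the function pick_replica, the site names and the delay table are hypothetical, and a real system would also weigh site load and data freshness.

```python
def pick_replica(replica_sites, query_site, link_delay_ms):
    """Pick the copy of a relation that is cheapest to reach from query_site.

    replica_sites: sites that hold a copy of the relation.
    link_delay_ms: dict mapping (from_site, to_site) to an assumed delay.
    A copy stored at the query site itself costs nothing to reach.
    """
    def delay(site):
        if site == query_site:
            return 0
        return link_delay_ms[(site, query_site)]

    return min(replica_sites, key=delay)


# Assumed delays towards the query site "berlin".
delays = {("paris", "berlin"): 20, ("tokyo", "berlin"): 250}
print(pick_replica(["tokyo", "paris"], "berlin", delays))
```

Here the copy in paris is chosen because its assumed delay to the query site is the lowest.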
6. Why use (DQP):
Distributed data processing is becoming a reality. Businesses want to do it for many
reasons, and they often must do it in order to stay competitive. Much of the
infrastructure for distributed data processing is already there (e.g., modern network
technology). Specifically, businesses are beginning to rely on distributed rather than
centralized databases for the following reasons:
1. Cost and scalability.
Today, one thousand PC processors are cheaper and significantly more powerful than
one big mainframe computer. So, it makes economic sense to replace a mainframe by
a network of small, off-the-shelf processors. Furthermore, it is very difficult to “up-size”
a mainframe computer if a company grows, while new PCs can be added to the
network at any time in order to meet a company’s new requirements. High availability
can be achieved by mirroring (replicating) data.
2. Integration of different software modules.
It has become clear that no single software package can meet all the requirements of a
company. Companies must, therefore, install several different packages, each
potentially with its own database, and the result is a distributed database system. Even
single software packages offered by one vendor have a distributed, component-based
architecture so that the vendor can market and offer upgrades for every component
individually.
3. Integration of legacy systems.
The integration of legacy systems is one particular example that demonstrates how
some companies are forced to rely on distributed data processing in which their old
legacy systems need to coexist with new modern systems.
4. New applications.
There are a number of new emerging applications that rely heavily on distributed
database technology; examples are workflow management, computer-supported
collaborative work, tele-conferencing, and electronic commerce.
5. Market forces.
Many companies are forced to reorganize their businesses and use state-of-the-art
distributed information technology in order to remain competitive. As an example,
people will probably not eat more pizza because of the Internet, but a pizza delivery
service is definitely going to lose some of its market share if it does not allow people
to order pizza on the Web.
Nevertheless, a number of issues still make distributed data processing a complex undertaking:
(1) distributed systems can become very large, involving thousands of
heterogeneous sites including PCs and mainframe server machines; (2) the state of a
distributed system changes rapidly because the load of sites varies over time and new
sites are added to the system; (3) legacy systems need to be integrated; such legacy
systems usually have not been designed for distributed data processing and now need
to interact with other (modern) systems in a distributed environment.
7. Where and How (DQP) works:
In the field of data management tools, developments in distributed computing
technologies have led to distributed database management systems. These systems
should shield the users from the complexities of distribution. Distributed query
processing refers to the process that obtains an answer to a global query from
distributed sources. A pre-existing local database should be usable as one of the
sources. This paper deals with principles of query processing in distributed database
systems and describes some specific architectural issues that enable the integration
of pre-existing local databases into a distributed system.
A distributed database is a collection of multiple, logically interrelated local
databases distributed over a computer network. A distributed database
management system (distributed DBMS) is a software system that permits the
management of a distributed database and makes the distribution transparent to users.
A distributed database together with a distributed DBMS is called a distributed
database system. Every local database has its own exported scheme that describes
the local data available to the system. An exported scheme should be a subset of the
local scheme. At the global system level, the exported schemes of all local databases are
integrated into a global data scheme.
Different authors classify distributed DBMSs differently. Usually the classification
deals with three properties: autonomy, integration, and heterogeneity. Autonomy
refers to the degree to which local DBMSs can operate independently. Integration
describes the degree to which local databases are integrated into the global system,
that is, whether any distribution is transparent to the user or not. Heterogeneity
covers differences in hardware, networking protocols, and local DBMSs (including data model,
query language, and interface).
8. QUERY PROCESSING
We deal with queries over the relational data model. A distributed DBMS provides
transparent access to distributed resources. There must be a module in the system
architecture that receives a global query and manages its distributed evaluation. The whole
process usually goes through the following steps:
• parsing the global query,
• query optimization,
• query execution.
When the global query is parsed, each global relation is replaced by its expression
over local relations according to the global scheme. Then the query is simplified by
eliminating redundant predicates. Finally, the query is transformed into a relational
algebra expression. This intermediate expression of the query is called its canonical
form. During the query optimization step, a distributed execution plan that obtains the
answer is prepared. Although query languages are usually non-procedural, the
execution plan gives the procedure for extracting the data. The execution plan says
which local data are required, how to access them, and which operations must be done at
which sites. Moreover, the execution plan should be optimized, i.e., it should minimize the
“execution cost”. Finally, the plan is executed in the query execution step.
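The three steps can be sketched in a few lines. This is a toy model and not the system described here: the dictionaries GLOBAL_SCHEME and LOCAL_DATA, the fragment names and the rows are invented for illustration, every global relation is assumed to be a plain union of local relations, and the only "optimization" shown is pushing the selection to each site so that fewer rows are shipped.

```python
# Hypothetical global scheme: a global relation is a union of local relations.
GLOBAL_SCHEME = {
    "EMP": [("site1", "emp_east"), ("site2", "emp_west")],
}

# Toy local databases: site -> relation -> rows.
LOCAL_DATA = {
    "site1": {"emp_east": [{"ename": "Ali", "deptno": 10}]},
    "site2": {"emp_west": [{"ename": "Sara", "deptno": 20},
                           {"ename": "Omar", "deptno": 10}]},
}


def parse(global_relation, predicate):
    """Step 1: replace the global relation by its local relations."""
    return {"fragments": GLOBAL_SCHEME[global_relation], "predicate": predicate}


def optimize(canonical):
    """Step 2: build a plan; the selection is pushed to every site so that
    only matching rows have to be shipped."""
    return [{"site": site, "relation": rel, "filter": canonical["predicate"]}
            for site, rel in canonical["fragments"]]


def execute(plan):
    """Step 3: run each local step and union the partial results."""
    result = []
    for step in plan:
        rows = LOCAL_DATA[step["site"]][step["relation"]]
        result.extend(row for row in rows if step["filter"](row))
    return result


plan = optimize(parse("EMP", lambda row: row["deptno"] == 10))
names = sorted(row["ename"] for row in execute(plan))
print(names)
```

The plan produced in step 2 states, for each site, which local relation is read and which filter is applied there, which is the kind of information an execution plan has to carry.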
9. ARCHITECTURE OF THE SYSTEM
Every local database has its exported relational scheme that describes the local data
accessible from the global system. The local exported scheme should be just a subset
of the local scheme used by a local user. The exported schemes of all local databases are
integrated into a global relational scheme. Every global relation is expressed by a
relational algebra expression over relations from the exported schemes. The relational
algebra expression may be arbitrarily complicated, but the MINUS operator must not
be used. The global user expresses queries over global relations only.
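A global relation defined this way can be pictured as a small expression tree. The sketch below is an assumption about one possible representation (the classes Rel and Op and the helper names are hypothetical); it checks the rule that MINUS must not appear and lists the exported relations an expression depends on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rel:
    """Leaf: a relation from some site's exported scheme."""
    site: str
    name: str


@dataclass(frozen=True)
class Op:
    """Inner node: an operator applied to sub-expressions."""
    operator: str      # e.g. "UNION", "JOIN", "SELECT", "PROJECT"
    operands: tuple


def uses_minus(expr) -> bool:
    """True if the MINUS operator appears anywhere in the expression."""
    if isinstance(expr, Rel):
        return False
    return expr.operator == "MINUS" or any(uses_minus(o) for o in expr.operands)


def exported_relations(expr) -> set:
    """Collect the exported relations that the expression refers to."""
    if isinstance(expr, Rel):
        return {expr}
    found = set()
    for operand in expr.operands:
        found |= exported_relations(operand)
    return found


# Hypothetical global relation EMP: the union of two exported relations.
emp_east = Rel("site1", "emp_east")
emp_west = Rel("site2", "emp_west")
global_emp = Op("UNION", (emp_east, emp_west))
rejected = Op("MINUS", (global_emp, emp_west))
print(uses_minus(global_emp), uses_minus(rejected))
```

A definition such as rejected would have to be refused when the global scheme is built, while global_emp is acceptable.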
There are four different functional units (processes) in the functional model of the
system: the client, the distributed data server, the local database unit, and the partial
query integrator. Each unit may run at a different site of the network. The client accepts
the user’s query and presents the gathered partial results back to the user. After the
client has accepted the user’s query, it sends the query to the distributed data server and waits
for the server’s answer. The server’s answer only states which sites
will deliver partial results to the client. The client waits for these partial results and
presents their union to the user. Notice that UNION is the only operation that the
client must be able to execute. The distributed data server manages distributed
query execution. It parses the query and generates the execution plan for extracting the
required data from local databases. The server informs the client about the plan and
sends requests to the local database units and partial query integrators. Then the
server waits and informs the client about any errors during execution.
The local database unit and the partial query integrator are the only units that
access the local data. They are based on the underlying local DBMSs. The unit’s
process runs on the same site as the local DBMS and manages local database
access and communication with other sites in the system. The local database unit
accesses only the local data. The distributed data server’s request encodes which local data
are required and where to send them. The addressee of the acquired local
data is either the client or a partial query integrator. In addition to the
function of the local database unit, the partial query integrator integrates incoming
partial results and possibly local data according to the distributed data server’s requests.
Again, the addressee of the results is either another partial query integrator or
the client.
Fig. 1 Functional model
When any error emerges in the system during distributed execution, the
distributed data server is notified immediately. The server then notifies the waiting
client and aborts all related requests.
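The interplay of client, server and local units can be approximated in a single process. The sketch below is a simplification under stated assumptions: the names ExecutionError and run_distributed are hypothetical, network messages are replaced by plain function calls, partial query integrators are left out, and rows are tuples so that the client's UNION can be a set union.

```python
class ExecutionError(Exception):
    """Raised when a local unit reports a failure (hypothetical name)."""


def run_distributed(plan, local_units):
    """Play the distributed data server and the client in one function.

    plan: list of site names that must deliver a partial result.
    local_units: dict mapping a site name to a zero-argument callable that
        returns that site's rows or raises ExecutionError.
    Returns the union of all partial results, or raises after aborting.
    """
    partial_results = []
    for site in plan:
        try:
            partial_results.append(local_units[site]())
        except ExecutionError as err:
            # Server side: drop everything and tell the waiting client.
            partial_results.clear()
            raise ExecutionError(f"query aborted, {site} failed: {err}") from err
    # Client side: UNION is the only operation the client has to perform.
    union = set()
    for rows in partial_results:
        union |= set(rows)
    return union


def broken_unit():
    raise ExecutionError("local DBMS unavailable")


units = {
    "site1": lambda: [("Ali", 10)],
    "site2": lambda: [("Ali", 10), ("Sara", 20)],
    "site3": broken_unit,
}
print(sorted(run_distributed(["site1", "site2"], units)))
```

Calling run_distributed with a plan that includes site3 raises ExecutionError instead of returning a partial answer, which mirrors the abort behaviour described above.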
10. Some advantages of (DQP):
The distribution of data in a network also offers advantages over the centralization of
data at one computer. These advantages include improved throughput via parallel
processing, sharing of data and equipment, and modular expansion of data
management capacity. In addition, when redundant data is maintained, one also
achieves increased data reliability and improved response time.
Electronic market places and virtual enterprises have become very important
applications for query processing. Building a scalable virtual business-to-business
(B2B) market place with hundreds or thousands of participating suppliers requires
highly flexible, distributed query processing capabilities. Architecting such an
electronic market place as a data warehouse by integrating all the data from all
participating enterprises in one centralized data repository incurs severe problems:
• Security and privacy violations: The participants of the market place have to
relinquish control over their data and entrust sensitive information (e.g., pricing
conditions) to the market place host.
• Coherence problems: The coherence of highly dynamic data, such as availability
and shipping information, may be violated due to outdated materialized data in the
market place’s data warehouse.
• Schema integration problems: Using the warehouse approach, all relevant data from
all participants have to be converted a priori into the same format. Often, it would be
easier to leave the data inside the participants’ information systems, e.g., legacy
systems, within the local sites, and apply particular local wrapper/transformer
operations. This way, data is only converted on demand and the most recent coherent
state of the data is returned.
• Fixed query operators: In a fully integrated (data warehouse-like) electronic market
place, all information is converted into materialized data. This is often not desirable in
complex applications such as electronic procurement/bidding. For example, in
pricing offers one would like to have vastly different choices:
– fixed pricing via materialized data
– operators which calculate the prices based on a multitude of local and global
parameters (identity of the consumer company, availability, local plant utilization,
subcontractor prices, etc.)
– human interaction during the processing of such complex e-procurement
queries. In some participating enterprises the pricing could be done
by a human via an interactive “query operator”.
11. Running Example:
We demonstrate the HyperQuery technique with a scenario from the car manufacturing
industry. We assume a hierarchical supply chain of suppliers and sub-contractors. A
typical e-procurement process for covering unscheduled production demands is
to query a market place for the needed products and to select among the incoming offers by price,
terms of delivery, available quantity, etc. The price of the needed products can vary
with customer/supplier-specific sales discounts, the quantity of materials to be provided,
duties, plant utilization, etc. Thus the price cannot be a materialized attribute as in
traditional query processing systems. Instead it is an individually calculated,
dynamically changing attribute: the catalog contains a hyperlink to the supplier, where
the price is computed on demand.
In traditional distributed query processing systems such a query can only be executed
if a global schema exists or all local databases are replicated at the market place.
In an environment where hundreds of suppliers participate in a market
place, one global query that integrates the sub-queries for all participants would be
too complex and error-prone; for example, if one supplier’s host is down, the whole query
execution would fail. Following our approach, the suppliers have to register their
products at the market place in which they want to participate and specify the
sub-plans by which the price information can be computed at their sites. This calculation can be
arbitrarily complex and can involve their subcontractors, too. The allocation schema given
by the data at the market place is exploited for execution.
Figure 1 shows an SQL-like query that returns the prices and suppliers of all needed
products. Execution stops at the latest at the time given by the expires
attribute; only the results gathered so far are considered. Figure 2 shows two possible
execution traces of this query; both are supported by our evaluation technique. In the
hierarchical execution of Figure 2(a) the resulting objects flow back to the sites
where the original input objects came from, whereas in the broadcast execution of
Figure 2(b) the objects do not flow all the way back through intermediate sites to the
client, but are routed directly to the client that issued the query.
select p.ProductDescription, c.Supplier, c.AdditionalData, c.Price
from NeededProducts p, Catalog@MarketPlace c
where p.ProductDescription = c.ProductDescription
order by p.ProductDescription, c.Price
expires Friday, May 18, 2001 5:00:00 PM CET
Figure 1: Example Query of the Car Manufacturer
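The behaviour of the query in Figure 1 can be imitated with a short sketch. It is not the HyperQuery implementation: the function collect_offers, the catalog entries and the tick-based clock are assumptions for illustration only. A price is either a materialized number or a callable standing in for the sub-plan evaluated at the supplier, and collection stops once the expiry time is reached.

```python
from itertools import count


def ticking_clock():
    """Return a fake clock that advances by one tick per call."""
    ticks = count()
    return lambda: next(ticks)


def collect_offers(catalog, needed, deadline, clock):
    """Gather offers for the needed products until the deadline passes.

    catalog: list of dicts with "product", "supplier" and "price", where
        "price" is either a number (materialized) or a zero-argument
        callable standing in for the supplier-side sub-plan.
    clock: zero-argument callable returning the current time; it is
        injected so that the sketch stays deterministic.
    """
    offers = []
    for entry in catalog:
        if clock() >= deadline:
            break  # expired: keep only what was gathered so far
        if entry["product"] not in needed:
            continue
        price = entry["price"]() if callable(entry["price"]) else entry["price"]
        offers.append((entry["product"], entry["supplier"], price))
    # Mirrors "order by p.ProductDescription, c.Price".
    return sorted(offers, key=lambda offer: (offer[0], offer[2]))


catalog = [
    {"product": "brake", "supplier": "S1", "price": 120},
    {"product": "brake", "supplier": "S2", "price": lambda: 100 - 10},
    {"product": "tire", "supplier": "S3", "price": 80},
]
print(collect_offers(catalog, {"brake"}, deadline=10, clock=ticking_clock()))
```

With a generous deadline both brake offers are returned, cheapest first; with deadline=1 only the first catalog entry is evaluated before the query expires.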
12. Summary:
A distributed database (DDB) is a collection of multiple, logically interrelated databases
distributed over a computer network. The retrieval of data from different sites in a DDB is
referred to as distributed query processing.
Query processing is much more difficult in a distributed environment than in a centralized
environment because: 1) a large number of parameters affect the performance of
distributed queries; 2) relations involved in a distributed query may be fragmented and/or
replicated; 3) with many sites to access, query response time may become very high.
Businesses are beginning to rely on distributed rather than centralized databases for the
following reasons: cost and scalability, integration of different software modules, integration of
legacy systems, new applications, and market forces.
A distributed database management system (distributed DBMS) is
a software system that permits the management of a distributed
database and makes the distribution transparent to users. A distributed
database together with a distributed DBMS is called a distributed
database system. Different authors classify distributed
DBMSs differently. Usually the classification deals with three properties: autonomy,
integration, and heterogeneity. Distributed query processing goes
through three steps: parsing the global query, query optimization, and
query execution.
Some advantages of (DQP): improved throughput via parallel
processing, sharing of data and equipment, and modular expansion of
data management capacity. In addition, when redundant data is
maintained, one also achieves increased data reliability and improved
response time.
Electronic market places and virtual enterprises have become very important
applications for query processing. Building a scalable virtual business-to-business
(B2B) market place with hundreds or thousands of participating suppliers requires
highly flexible, distributed query processing capabilities. Architecting such an
electronic market place as a data warehouse by integrating all the data from all
participating enterprises in one centralized data repository incurs severe problems:
• Security and privacy violations
• Coherence problems
• Schema integration problems
• Fixed query operators
13. References:
All the content of this paper is taken from internet papers about the subject. The
following are the names of these papers:
• p-1107-10619292.pdf
• p422-kossmann.pdf
• vldb2001.hperqueries.pdf
• papadimos2003
• doctorweek
• distributed_query.pdf
