
Al-Quds University

Distributed Query Processing

Database

For: Dr. Badee Sartawi

Name: Hiba Jafar

Date: 30-5-2003

Table of contents:

Sec.1 An Introduction to (DQP).

Sec.2 The Objectives.

Sec.3 General area of research.

Sec.4 History.

Sec.5 What is (DQP).

Sec.6 Why use (DQP). Issues that still make distributed data processing complex.

Sec.7 Where and how (DQP) works.

Sec.8 Query processing.

Sec.9 Architecture of the system.

Sec.10 Some advantages of (DQP).

Sec.11 Running example.

Sec.12 Summary.

Sec.13 References.

Abstract:

This paper presents the “textbook” architecture for distributed query processing and a series of techniques that are particularly useful for distributed database systems, and shows how query processing works in these systems.

A very large body of work exists in the general area of database systems. This work can be roughly classified into work on architectures and techniques for transaction processing (i.e., quickly processing small update operations), work on query processing (i.e., mostly read operations that explore large amounts of data), and work on data models, languages, and user interfaces for advanced applications. This paper focuses primarily on query processing. A discussion of transaction processing and of alternative data models is beyond its scope, and the paper cannot give full coverage of all query processing techniques in use today.

1. INTRODUCTION

A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network. The distribution of databases on a network achieves the advantages of performance, reliability, availability and modularity that are inherent in distributed systems.

As with traditional centralized databases, distributed database systems (DDBS's) must

provide an efficient user interface that hides all of the underlying data distribution details of

the DDB from the users. The use of a relational query allows the user to specify a description

of the data that is required without having to know where the data is physically located.

The retrieval of data from different sites in a DDB is referred to as distributed query

processing.

Oracle distributed database systems employ a distributed processing architecture. Thus, an

Oracle database server acts as a client when it requests data that another Oracle database

server manages. For example, the following query accesses data from the local database as well as the remote sales database. The first table (EMP) is stored at site1 and the second table (DEPT) at site2:

SELECT ename, dname
FROM scott.emp e, scott.dept@sales.us.americas.acme_auto.com d
WHERE e.deptno = d.deptno

A distributed query is thus one that selects data from databases located at multiple sites in a network, while distributed processing performs computations on multiple CPUs to achieve a single result. Any SQL data manipulation statement that references tables at sites other than the query site (i.e., the site to which the application program is submitted for compilation) is called a distributed query and must be processed accordingly.
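The idea of one statement joining a local table with a table held elsewhere can be tried out on a single machine. The following is only a minimal sketch: it assumes that two in-memory SQLite databases stand in for site1 and site2, and the attached alias `sales` plays the role of the remote database link. The table contents are invented for illustration, and nothing here reflects how Oracle implements database links.

```python
import sqlite3

# Assumption: "main" stands in for site1, the attached database "sales" for site2.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS sales")

# Illustrative data only (not taken from the paper).
conn.execute("CREATE TABLE emp (ename TEXT, deptno INTEGER)")
conn.execute("CREATE TABLE sales.dept (deptno INTEGER, dname TEXT)")
conn.executemany("INSERT INTO emp VALUES (?, ?)",
                 [("SMITH", 20), ("ALLEN", 30)])
conn.executemany("INSERT INTO sales.dept VALUES (?, ?)",
                 [(20, "RESEARCH"), (30, "SALES")])

# One query that references tables in both databases.
rows = conn.execute(
    "SELECT e.ename, d.dname FROM emp e, sales.dept d "
    "WHERE e.deptno = d.deptno ORDER BY e.ename"
).fetchall()
print(rows)
```

The user writes a single join and does not have to fetch the two tables separately, which is the transparency the section describes.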

Query processing is much more difficult in a distributed environment than in a centralized environment because:

• A large number of parameters affect the performance of distributed queries.

• Relations involved in a distributed query may be fragmented and/or replicated.

• With many sites to access, query response time may become very high.

The performance of a DDBS depends critically on the ability of the query optimization algorithm to derive efficient query processing strategies. DDBMS query optimization algorithms attempt to reduce the quantity of data transferred. Minimizing the quantity of data transferred is a desirable optimization criterion because transporting more data across telecommunications networks requires more time and resources. Distributed query optimization faces several problems, which relate to the cost model, the larger set of queries, the tradeoff between optimization cost and execution cost, and the optimization/reoptimization interval.
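To make the "quantity of data transferred" criterion concrete, the sketch below compares two ways of joining EMP (at one site) with DEPT (at another). Either the whole DEPT relation is shipped, or a semijoin is used: only the distinct join keys are sent, and only the matching DEPT rows come back. The semijoin is a well-known technique but is not named in this paper; the byte sizes and table contents are assumptions chosen for illustration.

```python
def bytes_ship_whole(relation, row_bytes):
    """Cost of shipping every row of a relation to the other site."""
    return len(relation) * row_bytes


def bytes_semijoin(emp, dept, key_bytes, dept_row_bytes):
    """Cost of sending emp's distinct join keys, then receiving matching dept rows."""
    keys = {e["deptno"] for e in emp}
    matching = [d for d in dept if d["deptno"] in keys]
    return len(keys) * key_bytes + len(matching) * dept_row_bytes


# Illustrative data: 3 employees in 2 departments, 100 departments overall.
emp = [{"ename": "SMITH", "deptno": 20},
       {"ename": "ALLEN", "deptno": 30},
       {"ename": "WARD", "deptno": 30}]
dept = [{"deptno": n, "dname": f"D{n}"} for n in range(10, 1010, 10)]

whole = bytes_ship_whole(dept, row_bytes=50)          # assumed 50 bytes per row
semi = bytes_semijoin(emp, dept, key_bytes=4, dept_row_bytes=50)
print(whole, semi)
```

With these assumed sizes the semijoin moves far less data, but the comparison can go the other way when most rows match, which is why the optimizer needs a cost model.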

2. The Objectives:

* The objectives of this paper are:

• To learn what distributed query processing is.
• To understand the main idea behind (DQP).
• To understand how it works.
• To know the goals of (DQP).

* The objectives of using (DQP) itself:

The goal is to execute distributed queries as efficiently as possible in order to minimize the response time that users must wait for answers or the time application programs are delayed. Further goals are to minimize the total communication cost associated with a query and to improve throughput via parallel processing, sharing of data and equipment, and modular expansion of data management capacity. In addition, when redundant data is maintained, one also achieves increased data reliability and improved response time.

3. General area of research:

There are many different reasons to rely on distributed architectures, and correspondingly many different kinds of distributed systems exist. Sometimes it is only the software, and not the hardware, that is distributed. The purpose of this paper is to give a comprehensive overview of the query processing techniques needed to implement any kind of distributed database and information system. It is assumed that users and application programs issue queries in a declarative query language such as SQL or OQL, without knowing where and in which format the data is stored in the distributed system.

4. History (Background and Motivation):

Researchers and practitioners have been interested in distributed database systems since the 1970s. At that time, the main focus was on supporting distributed data management for large corporations and organizations that kept their data at different offices or subsidiaries.

Although there was a clear need and there were many good ideas and prototypes (e.g., Distributed Ingres), the early efforts in building distributed database systems were never commercially successful. In some respects, the early distributed database systems were ahead of their time. First, communication technology was not stable enough to ship megabytes of data as required for these systems. Second, large businesses somehow managed to survive without sophisticated distributed database technology by sending tapes, diskettes, or just paper to exchange data between their offices. Today, the situation has changed dramatically.

Distributed data processing is both feasible and needed. Almost all major database system vendors offer products to support distributed data processing (e.g., IBM, Informix, Microsoft, Oracle, Sybase), and large database application systems have a distributed architecture (e.g., business application systems such as Baan IV, Oracle Finance, Peoplesoft 7.5, and SAP R/3). Distributed data processing is feasible because of recent technological advances (e.g., hardware, software protocols, standards). It is needed because of changing business requirements, which have made distributed data processing cost-effective and, in certain situations, the only viable option.

5. What is Distributed Query Processing:

A distributed database (DDB) consists of copies of data files (often redundant) distributed over a network of computers.

Query processing (or data retrieval) is an important problem in distributed databases. Accessing data distributed across different computer sites requires transmitting data over communication links. Since communication delay is substantial, the database management system must devise an efficient strategy to coordinate data processing at local computer sites and data transmission between sites. This problem is compounded in a redundant database, because choosing which of the redundant copies to access becomes an important issue.
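The choice among redundant copies can be pictured as picking the copy with the lowest estimated cost. The sketch below is hypothetical: the `Replica` fields, the cost formula (network delay plus transfer time plus load-adjusted local processing), and all numbers are assumptions made for illustration and are not taken from any particular system.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    site: str
    latency_ms: float            # assumed fixed delay to reach the site
    bandwidth_kb_per_ms: float   # assumed link speed
    load_factor: float           # assumed multiplier (>= 1.0) for a busy site


def estimated_cost_ms(replica, result_kb, local_ms):
    """Delay + transfer time + local processing slowed down by site load."""
    transfer = result_kb / replica.bandwidth_kb_per_ms
    return replica.latency_ms + transfer + local_ms * replica.load_factor


def choose_replica(replicas, result_kb, local_ms):
    """Pick the copy with the smallest estimated cost."""
    return min(replicas, key=lambda r: estimated_cost_ms(r, result_kb, local_ms))


replicas = [Replica("near_but_busy", 5, 10, 3.0),
            Replica("far_but_idle", 40, 10, 1.0)]

light = choose_replica(replicas, result_kb=100, local_ms=10)
heavy = choose_replica(replicas, result_kb=100, local_ms=50)
print(light.site, heavy.site)
```

The same two copies lead to different choices depending on how much local work the query needs, which illustrates why copy selection is part of the optimization problem rather than a fixed setting.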

6. Why use (DQP):

Distributed data processing is becoming a reality. Businesses want to do it for many reasons, and they often must do it in order to stay competitive. Much of the infrastructure for distributed data processing is already in place (e.g., modern network technology). Specifically, businesses are beginning to rely on distributed rather than centralized databases for the following reasons:

1.Cost and scalability.

Today, one thousand PC processors are cheaper and significantly more powerful than one big mainframe computer. So, it makes economic sense to replace a mainframe by a network of small, off-the-shelf processors. Furthermore, it is very difficult to “up-size” a mainframe computer if a company grows, while new PCs can be added to the network at any time in order to meet a company’s new requirements. High availability can be achieved by mirroring (replicating) data.

2.Integration of different software modules.

It has become clear that no single software package can meet all the requirements of a

company. Companies must, therefore, install several different packages, each

potentially with its own database, and the result is a distributed database system. Even single software packages offered by one vendor have a distributed, component-based architecture so that the vendor can market and offer upgrades for every component individually.

3.Integration of legacy systems.

The integration of legacy systems is one particular example that demonstrates how

some companies are forced to rely on distributed data processing in which their old

legacy systems need to coexist with new modern systems.

4.New applications.

There are a number of new emerging applications that rely heavily on distributed

database technology; examples are workflow management, computer-supported

collaborative work, tele-conferencing, and electronic commerce.

5.Market forces.

Many companies are forced to reorganize their businesses and use state-of-the-art

distributed information technology in order to remain competitive. As an example,

people will probably not eat more Pizza because of the Internet, but a Pizza delivery

service is definitely going to lose some of its market share if it does not allow people

to order Pizza on the Web.

However, a number of issues still make distributed data processing a complex undertaking: (1) distributed systems can become very large, involving thousands of heterogeneous sites, including PCs and mainframe server machines; (2) the state of a distributed system changes rapidly because the load of sites varies over time and new sites are added to the system; (3) legacy systems need to be integrated; such legacy systems usually were not designed for distributed data processing and now need to interact with other (modern) systems in a distributed environment.

7. Where and how (DQP) works:

In the field of data management tools, developments in distributed computing technologies have led to distributed database management systems. These systems should shield users from the complexities of distribution. Distributed query processing refers to the process that obtains an answer to a global query from distributed sources. It should be possible to use a pre-existing local database as one of the sources. This paper deals with the principles of query processing in distributed database systems and describes some specific architectural issues that enable the integration of pre-existing local databases into a distributed system.

A distributed database is a collection of multiple, logically interrelated local databases distributed over a computer network. A distributed database management system (distributed DBMS) is a software system that permits the management of a distributed database and makes the distribution transparent to users. A distributed database together with a distributed DBMS is called a distributed database system. Every local database has its own exported scheme that describes the local data available to the system. An exported scheme should be a subset of a local scheme. At the global system level, the exported schemes of all local databases are integrated into a global data scheme.

Authors differ in how they classify distributed DBMSs. Usually the classification deals with three properties: autonomy, integration, and heterogeneity. Autonomy refers to the degree to which local DBMSs can operate independently. Integration describes the degree to which local databases are integrated into the global system, that is, whether the distribution is transparent to the user or not. Heterogeneity covers differences in hardware, networking protocols, and local DBMSs (including data model, query language, and interface).

8. QUERY PROCESSING

We deal with queries over the relational data model. A distributed DBMS provides transparent access to distributed resources. There must be a module in the system architecture that accepts a global query and manages its distributed evaluation. The whole process usually goes through the following steps:

•parsing the global query,

•query optimization,

•query execution.

When the global query is parsed, each global relation is replaced by an expression over local relations according to the global scheme. Then the query is simplified by eliminating redundant predicates. Finally, the query is transformed into a relational algebra expression. This intermediate expression of the query is called its canonical form. During the query optimization step, a distributed execution plan that obtains the answer is prepared. Although query languages are usually non-procedural, the execution plan gives the procedure for extracting the data. The execution plan states which local data are required, how to access them, and which operations must be done at which sites. Moreover, the execution plan should be optimized, that is, it should minimize the “execution cost”. Finally, the plan is executed in the query execution step.
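The three steps can be sketched in a few lines. This is a toy illustration, not the implementation of any real system. It assumes a hypothetical global scheme in which the global relation EMP is the union of two local relations, a query consisting of a single equality selection, and an "optimization" that does only one thing, namely pushing the selection down to each site. All names and data are invented.

```python
# Hypothetical global scheme: global relation EMP = emp_a (site1) UNION emp_b (site2).
GLOBAL_SCHEME = {"EMP": [("site1", "emp_a"), ("site2", "emp_b")]}

# Invented local data, keyed by (site, local relation).
LOCAL_DATA = {
    ("site1", "emp_a"): [{"ename": "SMITH", "deptno": 20},
                         {"ename": "ALLEN", "deptno": 30}],
    ("site2", "emp_b"): [{"ename": "WARD", "deptno": 30}],
}


def parse(global_relation, column, value):
    """Step 1: replace the global relation by an expression over local relations."""
    return {"op": "select", "column": column, "value": value,
            "input": {"op": "union", "inputs": GLOBAL_SCHEME[global_relation]}}


def optimize(canonical):
    """Step 2: push the selection below the union so each site filters locally."""
    return [{"site": site, "relation": relation,
             "column": canonical["column"], "value": canonical["value"]}
            for site, relation in canonical["input"]["inputs"]]


def execute(plan):
    """Step 3: run each plan step on its site's data and union the partial results."""
    result = []
    for step in plan:
        rows = LOCAL_DATA[(step["site"], step["relation"])]
        result.extend(r for r in rows if r[step["column"]] == step["value"])
    return result


plan = optimize(parse("EMP", "deptno", 30))
answer = execute(plan)
print([row["ename"] for row in answer])
```

The plan produced by `optimize` states which local data are required and at which site the filtering happens, which is what the text asks of an execution plan. A real optimizer would compare many alternative plans by cost.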

9. ARCHITECTURE OF THE SYSTEM

Every local database has an exported relational scheme that describes the local data accessible from the global system. The local exported scheme should be just a subset of the local scheme used by a local user. The exported schemes of all local databases are integrated into a global relational scheme. Every global relation is expressed by a relational algebra expression over relations from the exported schemes. This expression may be arbitrarily complicated, but the MINUS operator must not be used. The global user expresses queries over global relations only.

There are four different functional units (processes) in the functional model of the system: the client, the distributed data server, the local database unit, and the partial query integrator. Each unit may run at a different site of the network. The client accepts the user’s query and presents the gathered partial results back to the user. After the client has accepted the user’s query, it sends the query to the distributed data server and waits for the server’s answer. The server’s answer contains only the information about which sites will deliver partial results to the client. The client waits for these partial results and presents their union to the user. Notice that UNION is the only operation that the client must be able to execute. The distributed data server manages distributed query execution. It parses the query and generates the execution plan for extracting the required data from the local databases. The server informs the client about the plan and sends requests to the local database units and partial query integrators. Then the server waits and informs the client about any errors that occur during execution.

The local database unit and the partial query integrator are the only units that access local data. They are built on the underlying local DBMSs. The unit’s process runs at the same site as the local DBMS and manages local database access and communication with other sites in the system. The local database unit accesses only local data. The distributed data server’s request encodes which local data are required and where to send them. The recipient of the acquired local data may be either the client or a partial query integrator. In addition to the functions of the local database unit, the partial query integrator integrates incoming partial results, and possibly local data, according to the distributed data server’s requests. Again, the recipient of the results may be either another partial query integrator or the client.

Fig. 1 Functional model

When any error occurs in the system during distributed execution, the distributed data server is notified immediately. The server then notifies the waiting client and aborts all related requests.
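The division of labour among the four units can be sketched as plain classes running in one process. This is a simplified illustration under several assumptions: the class and method names are invented, the "plan" is hard-wired (one unit feeds the integrator, the others deliver to the client), and the server relays the data itself, whereas in the model described above the units send their results directly to the recipient.

```python
class LocalDatabaseUnit:
    """Accesses only the data stored at its own site."""
    def __init__(self, site, rows):
        self.site = site
        self.rows = rows

    def handle(self, predicate):
        return [row for row in self.rows if predicate(row)]


class PartialQueryIntegrator(LocalDatabaseUnit):
    """A local database unit that also merges incoming partial results."""
    def integrate(self, predicate, incoming):
        return incoming + self.handle(predicate)


class DistributedDataServer:
    """Manages execution and reports errors (in this sketch it also relays data)."""
    def __init__(self, feeder, integrator, direct_units):
        self.feeder = feeder
        self.integrator = integrator
        self.direct_units = direct_units

    def run(self, predicate):
        try:
            shipped = self.feeder.handle(predicate)        # unit -> integrator
            deliveries = {self.integrator.site:
                          self.integrator.integrate(predicate, shipped)}
            for unit in self.direct_units:                 # unit -> client
                deliveries[unit.site] = unit.handle(predicate)
            return deliveries
        except Exception as exc:
            # Any error aborts the whole request and is reported to the client.
            raise RuntimeError(f"distributed execution aborted: {exc}") from exc


class Client:
    """Only needs to compute the UNION of the partial results it receives."""
    def query(self, server, predicate):
        result = []
        for rows in server.run(predicate).values():
            for row in rows:
                if row not in result:      # UNION drops duplicates
                    result.append(row)
        return result


server = DistributedDataServer(
    feeder=LocalDatabaseUnit("site1", [("SMITH", 20)]),
    integrator=PartialQueryIntegrator("site2", [("ALLEN", 30)]),
    direct_units=[LocalDatabaseUnit("site3", [("WARD", 30), ("ALLEN", 30)])],
)
answer = Client().query(server, lambda row: row[1] == 30)
print(answer)
```

Note that the client performs no joins or selections; it only unions what arrives, and a failure anywhere surfaces as a single error from the server.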

10. Some advantages of (DQP):

Distributing data over a network also offers advantages over centralizing data at one computer. These advantages include improved throughput via parallel processing, sharing of data and equipment, and modular expansion of data management capacity. In addition, when redundant data is maintained, one also achieves increased data reliability and improved response time.

Electronic market places and virtual enterprises have become very important

applications for query processing. Building a scalable virtual business-to-business

(B2B) market place with hundreds or thousands of participating suppliers requires

highly flexible, distributed query processing capabilities. Architecting such an

electronic market place as a data warehouse by integrating all the data from all

participating enterprises in one centralized data repository incurs severe problems:

¤ Security and privacy violations: The participants of the market place have to

relinquish the control over their data and entrust sensitive information (e.g., pricing

conditions) to the market place host.

¤ Coherence problems: The coherence of highly dynamic data, such as availability

and shipping information, may be violated due to outdated materialized data in the

market place’s data warehouse.

¤ Schema integration problems: With the warehouse approach, all relevant data from all participants have to be converted a priori into the same format. Often, it would be easier to leave the data inside the participants’ information systems (e.g., legacy systems) at the local sites and apply particular local wrapper/transformer operations. This way, data is only converted on demand, and the most recent coherent state of the data is returned.

¤ Fixed query operators: In a fully integrated (data warehouse-like) electronic market place, all information is converted into materialized data. This is often not desirable in complex applications such as electronic procurement/bidding. For example, for pricing offers one would like to have vastly different choices:

– fixed pricing via materialized data

– operators which calculate the prices based on a multitude of local and global

parameters (identity of the consumer company, availability, local plant utilization,

subcontractor prices, etc.)

– even human interaction during the processing of such complex e-procurement

queries is desirable. In some participating enterprises the pricing could be done

by a human via an interactive “query operator”.
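The first two pricing choices above can be contrasted in a short sketch. It treats a price as a function that the market place calls, so a supplier can plug in either a fixed (materialized) value or an operator that computes the price from parameters. The function names, the discount table, and the 10% surcharge rule for high plant utilization are all assumptions invented for illustration.

```python
def fixed_price(unit_price):
    """Pricing via a materialized value: the same unit price for everyone."""
    return lambda customer, quantity: unit_price * quantity


def computed_price(base, discounts, utilization):
    """Pricing via an operator evaluated with the supplier's own parameters."""
    def operator(customer, quantity):
        discount = discounts.get(customer, 0.0)
        # Assumed rule: 10% surcharge when the plant is more than 90% utilized.
        surcharge = 0.10 if utilization > 0.9 else 0.0
        return round(base * quantity * (1 - discount) * (1 + surcharge), 2)
    return operator


# Hypothetical suppliers and parameters.
catalog = {
    "SupplierA": fixed_price(12.0),
    "SupplierB": computed_price(10.0, {"CarCo": 0.2}, utilization=0.95),
}

offers = {supplier: price("CarCo", 100) for supplier, price in catalog.items()}
print(offers)
```

Because both kinds of supplier expose the same call, the market place does not need to know how a price is produced. A human-in-the-loop operator would fit the same interface but is not sketched here.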

11. Running Example:

We demonstrate the HyperQuery technique with a scenario from the car manufacturing industry. We assume a hierarchical supply chain of suppliers and sub-contractors. A typical e-procurement process for covering unscheduled production demands is to query a market place for the needed products and to select among the incoming offers by price, terms of delivery, available quantity, etc. The price of the needed products can vary with customer/supplier-specific sales discounts, the quantity of materials to be provided, duties, plant utilization, etc. Thus the price cannot be a materialized attribute as in traditional query processing systems. Instead it is an individually calculated, dynamically changing attribute: a hyperlink to the supplier is stored in its place, and the price is computed at the supplier on demand.

In traditional distributed query processing systems, such a query can only be executed if a global schema exists or all local databases are replicated at the market place. In an environment where hundreds of suppliers participate in a market place, one global query that integrates the sub-queries for all participants would be too complex and error-prone; for example, if one supplier’s host is down, the whole query execution would fail. In the HyperQuery approach, the suppliers register their products at the market place in which they want to participate and specify the sub-plans by which the price information can be computed at their sites. This calculation can be arbitrarily complex and can involve their subcontractors, too. The allocation schema given by the data at the market place is exploited for execution.

Figure 1 shows an SQL-like query that returns the prices and suppliers of all needed products. Execution stops, at the latest, at the time given by the expires attribute; only the results gathered by then are considered. Figure 2 shows two possible execution traces of this query; both are supported by the HyperQuery evaluation technique. In the hierarchical execution of Figure 2(a), the resulting objects flow back to the sites from which the original input objects came, whereas in the broadcast execution of Figure 2(b), the objects do not flow all the way back through the intermediate sites but are routed directly to the client that issued the query.

select p.ProductDescription, c.Supplier, c.AdditionalData, c.Price
from NeededProducts p, Catalog@MarketPlace c
where p.ProductDescription = c.ProductDescription
order by p.ProductDescription, c.Price
expires Friday, May 18, 2001 5:00:00 PM CET

Figure 1: Example Query of the Car Manufacturer
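The effect of the expires clause together with the ordering can be imitated as follows. This is a hypothetical sketch, not the HyperQuery implementation: offers are plain dictionaries with an invented `arrived` timestamp, time zones are ignored (naive datetimes), and the supplier names and prices are made up.

```python
from datetime import datetime


def collect_offers(offers, expires):
    """Keep offers that arrived by the deadline, ordered by product and price."""
    on_time = [o for o in offers if o["arrived"] <= expires]
    return sorted(on_time, key=lambda o: (o["ProductDescription"], o["Price"]))


# Deadline from the example query (time zone ignored in this sketch).
expires = datetime(2001, 5, 18, 17, 0, 0)

# Invented offers for illustration.
offers = [
    {"ProductDescription": "brake pad", "Supplier": "S1", "Price": 30.0,
     "arrived": datetime(2001, 5, 17, 9, 0, 0)},
    {"ProductDescription": "brake pad", "Supplier": "S2", "Price": 25.0,
     "arrived": datetime(2001, 5, 18, 16, 59, 0)},
    {"ProductDescription": "tyre", "Supplier": "S3", "Price": 80.0,
     "arrived": datetime(2001, 5, 18, 17, 5, 0)},   # arrives after the deadline
]

result = collect_offers(offers, expires)
print([(o["Supplier"], o["Price"]) for o in result])
```

The late offer is simply left out instead of making the whole query fail, which matches the statement that only the results gathered by the deadline are considered.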

12. Summary:

A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network. The retrieval of data from different sites in a DDB is referred to as distributed query processing.

Query processing is much more difficult in a distributed environment than in a centralized environment because: 1) a large number of parameters affect the performance of distributed queries; 2) relations involved in a distributed query may be fragmented and/or replicated; 3) with many sites to access, query response time may become very high.

Businesses are beginning to rely on distributed rather than centralized databases for the following reasons: cost and scalability, integration of different software modules, integration of legacy systems, new applications, and market forces.

A distributed database management system (distributed DBMS) is a software system that permits the management of a distributed database and makes the distribution transparent to users. A distributed database together with a distributed DBMS is called a distributed database system. Authors differ in how they classify distributed DBMSs; usually the classification deals with three properties: autonomy, integration, and heterogeneity. Distributed query processing goes through three steps: parsing the global query, query optimization, and query execution.

Some advantages of (DQP): improved throughput via parallel processing, sharing of data and equipment, and modular expansion of data management capacity. In addition, when redundant data is maintained, one also achieves increased data reliability and improved response time.

Electronic market places and virtual enterprises have become very important applications for query processing. Building a scalable virtual business-to-business (B2B) market place with hundreds or thousands of participating suppliers requires highly flexible, distributed query processing capabilities. Architecting such an electronic market place as a data warehouse by integrating all the data from all participating enterprises in one centralized data repository incurs severe problems:

¤ Security and privacy violations

¤ Coherence problems

¤ Schema integration problems

¤ Fixed query operators

13. References:

All the content of this paper is taken from papers on the subject found on the Internet. The following are the names of these papers:

• p-1107-10619292.pdf

• p422-kossmann.pdf

• vldb2001.hperqueries.pdf

• papadimos2003

• doctorweek

• distributed_query.pdf
