
Proceedings of 16th IEEE Int’l Conf. on Data Eng. (ICDE2000), San Diego, Feb. 29 - March 3, 2000, p. 413

Developing Cost Models with Qualitative Variables for Dynamic Multidatabase Environments

Qiang Zhu Yu Sun S. Motheramgari


Department of Computer and Information Science
The University of Michigan, Dearborn, MI 48128, U.S.A.
{qzhu, yusun, motheram}@umich.edu

Research supported by the US National Science Foundation under Grant # IIS-9811980 and The University of Michigan under OVPR and UMD grants.

Abstract

A major challenge for global query optimization in a multidatabase system (MDBS) is lack of local cost information at the global level due to local autonomy. A number of methods to derive local cost models have been suggested recently. However, these methods are only suitable for a static multidatabase environment. In this paper, we propose a new multi-states query sampling method to develop local cost models for a dynamic environment. The system contention level at a dynamic local site is divided into a number of discrete contention states based on the costs of a probing query. To determine an appropriate set of contention states for a dynamic environment, two algorithms based on iterative uniform partition and data clustering, respectively, are introduced. A qualitative variable is used to indicate the contention states for the dynamic environment. The techniques from our previous (static) query sampling method, including query sampling, automatic variable selection, regression analysis, and model validation, are extended so as to develop a cost model incorporating the qualitative variable for a dynamic environment. Experimental results demonstrate that this new multi-states query sampling method is quite promising in developing useful cost models for a dynamic multidatabase environment.

1. Introduction

A multidatabase system (MDBS) integrates data from multiple local (component) databases and provides users with a uniform global view of data. A global user can issue a (global) query on an MDBS to retrieve data from multiple databases without having to know where the data is stored and how the data is retrieved. How to process such a global query efficiently is the task of global query optimization.

A major challenge, among others [4, 7, 8, 9, 14], for global query optimization in an MDBS is that some necessary local information, such as local cost models, may not be available at the global level due to local autonomy preserved in the system. However, the global query optimizer needs such information to decide how to decompose a global query into local (component) queries and where to execute the local queries. Hence, methods to derive cost models for an autonomous local database system (DBS) at the global level are required. Several such methods have been proposed in the literature recently.

In [3], Du et al. proposed a calibration method to deduce necessary local cost parameters. The key idea is to construct a special local synthetic calibrating database and use the costs of some special queries run on the database to deduce the parameters in cost models. In [5], Gardarin et al. extended the above method so as to calibrate cost models for object-oriented local database systems in an MDBS.

In [17, 18, 19], Zhu and Larson proposed a query sampling method. The key idea is as follows. It first groups local queries that can be performed on a local DBS in an MDBS into homogeneous classes, based on some information available at the global level in an MDBS such as the characteristics of queries, operand tables and the underlying local DBS. A sample of queries are then drawn from each query class and run against the user local database. The costs of sample queries are used to derive a cost model for each query class by multiple regression analysis. The cost model parameters are kept in the MDBS catalog and utilized during query optimization. To estimate the cost of a local query, the class to which the query belongs is first identified. The corresponding cost model is retrieved from the catalog and used to estimate the cost of the query. Based on the estimated local costs, the global query optimizer chooses a good execution plan for a global query.

There are several other approaches to tackling this problem. In [16], Zhu and Larson introduced a fuzzy method based on fuzzy set theory to derive cost models in an MDBS. In [10], Naacke et al. suggested an approach to
combining a generic cost model with specific cost information exported by wrappers for local DBSs. In [1], Adali et al. suggested to maintain a cost vector database to record cost information for every query issued to a local DBS. Cost estimation for a new query is based on the costs of similar queries. In [13], Roth et al. introduced a framework for costing in the Garlic federated system.

All the methods proposed so far only considered a static system environment, i.e., assuming that it does not change significantly over time. However, in reality, many factors in an MDBS environment such as contention factors (e.g., number of concurrent processes), database physical characteristics (e.g., index clustering ratio), and hardware configurations (e.g., memory size) may change significantly over time. Hence, a cost model derived for a static system environment cannot give good cost estimates for queries in a dynamic environment. Figure 1 shows how the cost of a sample query is affected by the number of concurrent processes in a dynamic system environment. We can see that the cost of the same query can dramatically change (from 3.80 sec. to 124.02 sec.) in a dynamic environment. This raises an interesting research issue, that is, how to derive cost models that can capture the performance behavior of queries in a dynamic environment.

Figure 1. Effect of Dynamic Factor on Query Cost. [Plot of query cost (elapsed time in sec.) on Oracle 8.0 against the number of concurrent processes on a SUN UltraSparc 2, for the query "select a1, a5, a7 from R7 where a3 > 300 and a8 < 2000", where table R7(a1, a2, ..., a9) has 50,000 tuples of random numbers.]

In this paper, we propose a new qualitative approach to deriving a cost model that can capture the performance behavior of queries in a dynamic environment. We notice that there are numerous dynamic factors that affect query costs. To simplify the development of a cost model for a dynamic environment, our approach considers the combined net effect of dynamic factors on a query cost together rather than individually. The system contention level that reflects such a combined effect is gauged by the cost of a probing query. To capture such contention information in a cost model, we divide the system contention level (based on the costs of a probing query) in a dynamic environment into a number of discrete contention states and use a qualitative variable to indicate the contention states in the cost model. To determine an appropriate set of contention states for a dynamic environment, two algorithms, called the iterative uniform partition with merging adjustment (IUPMA) and the iterative clustering with merging adjustment (ICMA) respectively, are introduced. The former is used for general cases, while the latter is specifically designed for a dynamic environment with the contention level following a non-uniform distribution with clusters. Our previous query sampling method in [17, 18, 19] is extended so as to develop a regression cost model incorporating the qualitative variable for a dynamic environment. Our approach in this paper is therefore an extension of our previous query sampling method. In this paper, we call our previous method the static query sampling method and the new approach in this paper the multi-states query sampling method. In fact, the static method is a special case of the multi-states one when only one contention state is allowed.

The rest of the paper is organized as follows. Section 2 analyzes the dynamic factors at a local site in an MDBS. Section 3 discusses how to develop a regression model with a qualitative variable and how to determine contention states of a qualitative variable for a dynamic environment. Section 4 extends our previous static query sampling method so as to derive cost models with a qualitative variable for different query classes in a dynamic environment. Section 5 shows some experimental results. Section 6 summarizes the conclusions.

2. Dynamic environmental factors

In an MDBS, many environmental factors may change over time1. Some may change more often than others. They can be classified into the following three types based on their changing frequencies.

1 Since we concern ourselves with local cost models for an MDBS, only dynamic factors at local sites are considered. In general, there are also dynamic network environmental factors in an MDBS. Some of them were considered in [15].

Frequently-changing factors. The main characteristic of this type of factors is that they change quite often. Examples of such factors are CPU load, number of I/Os per second, and size of memory space being used, etc. The operating system at a local site typically provides commands (such as , , and in Unix) to display system statistics reflecting such environmental factors. Table 1 lists some system statistics in Unix.

Occasionally-changing factors. These factors change occasionally. Examples of such factors are local database management system (DBMS) configuration parameters (e.g., number of buffer blocks, and shared pool size), local database physical/conceptual schema (e.g., new indexes, new tables/columns), and local hardware configurations (e.g., physical memory size). Note that some other factors such as local database
size, physical data distribution, and index clustering ratio may change quite frequently. However, they may not have an immediate significant impact on query cost until such changes accumulate to a certain degree. Thus we also consider these factors as occasionally-changing factors. The changes of occasionally-changing factors can be found via checking the local database catalog and/or system configuration files.

Table 1. System Stats for Frequently-Changing Factors in Unix
Types: Statistics for Frequently-Changing Environmental Factors
CPU Statistics: number of running processes; number of stopped processes; number of sleeping processes; number of zombie processes; percentage of user time; percentage of system time; percentage of idle time; load averages for the past 1, 5, and 15 minutes, respectively
Memory Statistics: available memory; shared memory; used memory; buffer memory; available swap; used swap; free swap; cached swap; amount of memory swapped in; amount of memory swapped out
I/O Statistics: number of reads per sec.; number of writes per sec.; percentage of disk utilization
Other Statistics: number of current users; number of interrupts per sec.; number of context switches per sec.; number of system calls per sec.

Steady factors. These factors rarely change. Examples of such factors are local DBMS type (e.g., relational or object-oriented), local database location (e.g., local or remote), and local CPU speed (e.g., 300MHz). Although these factors may have an impact on a cost model, the chance for them to change is very small.

Clearly, the steady factors usually do not cause a problem for a query cost model. If significant changes for such factors occur at a local site, they can be handled in a similar way as described below for the occasionally-changing factors.

For the occasionally-changing factors, a simple and effective approach to capturing them in a cost model is to invoke the static query sampling method periodically or whenever a significant change for the factors occurs. Since these factors do not change very often, rebuilding cost models from time to time to capture them is acceptable. However, this approach cannot be used for the frequently-changing factors because frequent invocations of the static query sampling method would significantly increase the system load and the cost model maintenance overhead. On the other hand, if a cost model cannot capture the dramatic changes in a system environment, poor query cost estimates may be used by the query optimizer, resulting in inefficient query execution plans.

Theoretically speaking, to capture the frequently-changing factors in a cost model, one approach is to include all explanatory variables that reflect such factors in the cost model. However, this approach encounters several difficulties. First, the ways in which these factors affect a query cost are not clear. As a result, the appropriate format of a cost model that directly includes the relevant variables is hard to determine. Second, the large number of such factors (see Table 1) makes a cost model too complicated to derive and maintain even if the previous difficulty could be overcome. In the rest of this paper, we introduce a feasible method to capture the frequently-changing factors in a cost model.

3. Regression with qualitative variable

As mentioned before, the key idea of our method is to determine a number of contention states for a dynamic environment and use a qualitative variable to indicate the states. A cost model with the qualitative variable can be used to estimate the cost of a query in different contention states. The issues on how to include a qualitative variable in a cost model and how to determine an appropriate set of system contention states are discussed in this section.

3.1. Qualitative variable

To simplify the problem, we consider the combined effect of all the frequently-changing factors on a query cost together rather than individually. Although these dynamic factors may change differently in terms of the changing frequency and degree, they all contribute to the contention level of the underlying system environment. The cost of a query increases with the contention level. The system contention level can be divided into a number of discrete states (categories) such as “ ” ( ), “ ” ( ), “ ” ( ), and “ ” ( ). A qualitative variable is used to indicate the contention states. This qualitative variable, therefore, reflects the combined effect of the foregoing frequently-changing environmental factors. A cost model incorporating such a qualitative variable can capture the dynamic environmental factors to a certain degree.

As shown in [17, 19], a statistical relationship between query costs and their affecting factors such as operand and result table sizes can be established by multiple regression. The established relationship can then be used as a cost model to estimate query costs.

Usually, only quantitative variables are considered in a regression model. These variables such as operand table size take values on a well-defined scale. However, many variables of interest may not be quantitative but qualitative. Qualitative variables only have several discrete categories (states). For example, the foregoing qualitative variable indicating system contention states may have states , , , and . Such a qualitative variable can also be incorporated into a regression model.

A qualitative variable can be represented by a set of indicator variables. For example, the above contention state variable with four states can be represented by three indicator variables: , , and , where indicates , while indicates ; indicates , while indicates ;
indicates , while indicates . Clearly, indicate . Note that no more than one indicator variable can be 1 simultaneously (i.e., can only take one state at a time). In general, a qualitative variable with a given number of categories (states) needs one fewer indicator variable than it has categories to represent it.

3.2. General regression model

Let and be the response variable and (quantitative) explanatory variables in a regression model, respectively. Let a qualitative variable with states (categories) be represented by indicator variables . The qualitative variable can influence the regression model in the following four different ways (see Table 2):

Table 2. Qualitative Regression Equation Forms
Type: Regression Equation
Coincident:
Parallel:
Concurrent:
General:

Coincident. The relationship between the response and explanatory variables stays the same for all states of . In other words, the equations for all states are coincident. This in fact is the situation for a static system environment assumed by the static query sampling method.

Parallel. The relationship between the response and explanatory variables may differ in the intercept term but not the slope terms for different states of . The relevant equation in Table 2 shows that the intercept term for the th state of the qualitative variable is ( ; and ). Since the slope terms remain the same for all states, the equations for different states are parallel.

Concurrent. The relationship between the response and explanatory variables may differ in the slope terms but not the intercept term for different states of . The relevant equation in Table 2 shows that the th slope term ( ) for the th state of the qualitative variable is ( ; and ). The equations for different states have the same intercept term. They are said to be concurrent.

General. The relationship between the response and explanatory variables may differ in both the intercept term and the slope terms for different states of the qualitative variable. This is the most general case.

Note that the cost of a query usually consists of (1) initialization cost such as moving a disk head to the right position; (2) I/O cost such as fetching a tuple from an operand table; and (3) CPU cost such as evaluating the qualification condition for a given tuple. A typical cost model for a unary query class may look like:

(1)

where and are the cardinalities of the operand table and result table, respectively; , and are the parameters representing the initialization cost, the cost of retrieving a tuple from the operand table, and the cost of processing a tuple in the result table, respectively. Both and may reflect I/O as well as CPU costs. Therefore, the initialization cost affects the intercept term in a cost model, while the I/O and CPU costs affect the slope terms in the cost model. Clearly, the contention level of a system can significantly affect not only the initialization cost but also the I/O and CPU costs of a query because the resources like the disk, I/O bandwidth and CPU are shared by multiple processes. As a result, both the intercept and slope terms in a query cost model may change when the system contention level changes. Therefore, to incorporate a qualitative variable representing the system contention states into a query cost model, the general qualitative regression model is more appropriate.

3.3. Determining system contention states

Combining multiple dynamic environmental factors into a composite qualitative variable with a number of discrete contention states greatly simplifies the development of a cost model for a dynamic environment. The question now is how to determine an appropriate set of system contention states for a dynamic environment.

Two extremes

There are two extremes in determining a set of contention states. One extreme is to consider only one contention state for the system environment. A cost model developed in such a case is useful if the system environment is static. This, in fact, was the case that the static query sampling method assumed. However, as pointed out before, a real system environment may change dynamically over time. Using one contention state is obviously insufficient to describe the dynamic environment. For a dynamic environment, usually, the more contention states are considered, the better the cost model. In principle, as long as we consider a sufficient number of contention states for the environment, we can get a satisfactory cost model. Another extreme is to consider an infinite number of contention states. However, the more contention states are considered, the more indicator variables are used in the cost model. The number
of coefficients that need to be determined in a cost model therefore increases. Hence, if too many contention states are considered, the cost model can be very complicated, which is not good for either the development or maintenance of the cost model. In practice, as we will see in Section 5, a small number of contention states (three to six) are usually sufficient to yield a good cost model.

Determining states via iterative uniform partition

Notice that, for a given query, its cost increases as the system contention level increases (see Figure 1). Based on this observation, we can use the cost of a probing query to gauge the system contention level2. The range of probing costs (therefore, the contention level) is divided into subranges, each of which represents a contention state for the dynamic environment.

Let the cost of probing query fall in the range in a dynamic environment. A simple way to determine the system contention states is to partition range into subranges with an equal size. In other words, to determine contention states3 , we divide range into subranges and where and . The system environment is said to be in contention state if ( ). To obtain more system contention states, we can simply increase . Hence, yields a set of the system contention states for the dynamic environment.

Using this partition, it is easy to determine the system contention state in which a query is executed. Let be a set of sample queries which are performed in a dynamic environment and whose observed data (costs, result table sizes, etc.) are to be used to derive a regression cost model for a query class. To determine the system contention state in which is executed, the cost of probing query in the same environment is measured. if ( ). We call the costs of a probing query associated with the sample queries sampled probing query costs.

One basic question is how to determine a proper . Another question is how to eliminate some unnecessary separations of subranges. Clearly, if the performance behaviors of queries in contention states and (for some ) are similar, separating and is unnecessary. The determination of system contention states should balance the accuracy and simplicity (hence low maintenance overhead) of a derived cost model.

2 Our experiments showed that most queries, except the ones with extremely small cost (e.g., several hundredths of a second), can well serve as a probing query to gauge the system contention level.
3 A decreasing index is used here to simplify the descriptions of the algorithms and derived cost models.

To solve these two problems, the following algorithm is used to improve the above straightforward uniform partition:

ALGORITHM 3.1: Contention States Determination via Iterative Uniform Partition with Merging Adjustment (IUPMA)
Input: Observed data of sample queries and their associated probing query costs
Output: A set of system contention states4
Method:
1. begin
2.   Derive a qualitative regression model with one contention state using the sample query data;
3.   Let be the coefficient of total determination of the current regression model;
4.   Let be the standard error of estimation of the current regression model;
5.   ;
6.   do
7.     ;
8.     ;
9.     Obtain a set of contention states for the system environment via the straightforward uniform partition;
10.    Derive a qualitative regression model with contention states using sample query data;
11.    Let be the coefficient of total determination for the current regression model;
12.    Let be the standard error of estimation of the current regression model;
13.  until ( and ) are sufficiently small or is too large;
14.  ;
15.  Let ( ) represent the current contention states in ;
16.  Let ( ) be the adjusted coefficient of th variable for state in the general model in Table 2, where is a dummy variable for the intercept term;
17.  for down to do
18.
19.    if is too small then
20.      tag that states and should be merged;
21.  end for
22.  if some states are tagged to be merged then
23.    Derive a qualitative regression model with new merged states using sample query data;
24.    goto step 15;
25.  end if;
26.  return the current set of contention states;
27. end.

4 In fact, the algorithm integrates the contention states determination procedure with the cost model development procedure (to be discussed in the next section). As a result, a cost model is also produced as an output of the algorithm.
5 The coefficient of total determination measures the proportion of variability in the response variable explained by the explanatory variables in a regression model [12]. The higher, the better.
6 The standard error of estimation is an indication of the accuracy of estimation given by the model [12]. The smaller, the better.

There are two phases in Algorithm 3.1. The first phase is to determine a set of contention states via the uniform partition. The algorithm iteratively checks each qualitative regression model with an incremental number of contention states until (1) the model cannot be significantly improved in terms of the coefficient of total determination5 and the standard error of estimation6; or (2) too many contention states have been determined. Condition (2) is used here to prevent a derived cost model from becoming too complicated (in terms of the number of variables involved). The set of contention states obtained from the first phase are based on
the uniform partition of the probing query cost range (see Figure 2). The partition does not consider whether two states actually have significantly different effects on the cost model or not. It is possible that some neighboring states have only slightly different effects on the cost model. If so, the states should be merged into one to simplify the cost model. Such a merging adjustment is done during the second phase of the algorithm. If the maximum of relative errors of the corresponding pairs of adjusted coefficients (i.e., , and , ) for two states and is too small, these two states are considered not to have significantly different effects on the cost model. The subranges in the final adjusted partition of probing query cost range may not have an equal size.

Figure 2. Contention States Determination via IUPMA. [Diagram of the probing query cost range from Cmin to Cmax: after the 1st phase, the uniform partition gives subranges Im, Im-1, Im-2, Im-3, ..., I2, I1; after the 2nd phase, the adjusted partition gives subranges I'k, I'k-1, ..., I'1.]

Determining states via data clustering

To capture the effect of every contention level on query costs for a dynamic environment in a cost model, we can let each contention level point have an equal chance to be chosen for running a given sample query. In other words, the probing query costs associated with the sample queries to indicate the sampled contention level points follow the uniform distribution within their range. A cost model derived by using such sample data can be used to estimate the cost of a query executed at any contention level. However, in a real dynamic application environment, the contention level may occur more often in some subranges than the others. To better capture the performance behavior of a dynamic environment, we can choose the contention level points for running sample queries based on the actual distribution of the contention level in the dynamic environment. As a result, the associated probing query costs may not follow the uniform distribution in their range. More often they are grouped into clusters.

Although Algorithm 3.1 is designed for uniformly distributed probing query costs, it usually can also handle clustered probing query costs well due to its iterating and adjusting mechanisms. However, the resulting partition of the probing query cost range for the clustered cases may not be the best since the boundaries considered at each iteration in the algorithm are fixed, regardless of the distribution of the system contention level. To overcome the problem, a data mining algorithm for data clustering can be incorporated into the contention states determination procedure here.

An agglomerative hierarchical algorithm is often used for data clustering [6]. The main idea of the algorithm is to place each data object in its own cluster initially and then gradually merge clusters into larger and larger clusters until a desired number of clusters have been found. The criterion used to merge two clusters and is to make their distance minimized. One widely used distance measure is the distance between the centroids or means and of two clusters, i.e., .

Let be the maximum allowed number of system contention states. The above clustering algorithm can be used to obtain clusterings ( ; ’s are clusters such that for ) for sampled probing query costs. Let subranges and , where and , here and are the minimum and maximum probing query costs in cluster . Clearly, gives a set of the system contention states for the dynamic environment, which reflects the distribution information of probing query costs (the contention level). If we use such in Line 9 in Algorithm 3.1, we get a new algorithm, termed the Contention States Determination via Iterative Clustering with Merging Adjustment (ICMA).

Note that, for clustered probing query costs, it is possible that a cluster may not have a sufficient number of sampled data points to meet the minimum requirement for regression analysis. In such a case, we draw additional sample data points (via executing more sample queries) to make the cluster meet the minimum requirement rather than simply treat the data points in the cluster as outliers and ignore them. Although this may change the distribution of the contention level slightly, no useful contention level points are ignored in the derived cost model.

Probing costs estimation

To minimize the overhead for determining a system contention state, a query with a small cost is preferred as a probing query. To further reduce the overhead, estimated costs (rather than observed costs) of probing query can be used to determine the contention states of a dynamic environment. The idea is to first develop a regression equation between the probing query cost and some major system contention parameters7 (such as CPU load , I/O utilization , and size of used memory space for a dynamic environment in Table 1), i.e.,

(2)

7 A standard statistical procedure can be used to determine the significant parameters for a system environment.
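The estimation idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the regression in (2) is linear in the three example parameters (CPU load, I/O utilization, used memory), that the subranges come from the straightforward uniform partition, and that states follow the decreasing index of Figure 2. All function names, coefficient values, and the boundary-handling rule are hypothetical.

```python
from bisect import bisect_right


def estimate_probing_cost(coeffs, cpu_load, io_util, mem_used):
    """Estimate the probing query cost from system statistics.

    Assumes a linear regression form: an intercept plus one term per
    contention parameter. The coefficients would come from a prior fit.
    """
    a0, a1, a2, a3 = coeffs
    return a0 + a1 * cpu_load + a2 * io_util + a3 * mem_used


def uniform_boundaries(c_min, c_max, m):
    """Interior boundaries that split [c_min, c_max] into m equal subranges."""
    width = (c_max - c_min) / m
    return [c_min + i * width for i in range(1, m)]


def contention_state(cost, boundaries):
    """Map a probing cost to a contention state index.

    Uses the decreasing index of Figure 2: the lowest-cost subrange is
    state m and the highest-cost subrange is state 1. A cost equal to a
    boundary is assigned to the higher-cost subrange (an assumption).
    Costs outside the range fall into the nearest end state.
    """
    m = len(boundaries) + 1
    position = bisect_right(boundaries, cost)  # 0 for the lowest subrange
    return m - position


# Hypothetical coefficients and statistics, for illustration only.
coeffs = (1.0, 0.5, 2.0, 0.01)
boundaries = uniform_boundaries(0.0, 10.0, 4)  # [2.5, 5.0, 7.5]
cost = estimate_probing_cost(coeffs, cpu_load=2.0, io_util=0.5, mem_used=100.0)
state = contention_state(cost, boundaries)
print(cost, state)
```

The lookup only needs the sorted subrange boundaries, so boundaries taken from merged or clustered subranges (as in IUPMA's second phase or ICMA) could be passed to `contention_state` in place of the uniform ones.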

418
where ( ) are regression coefficients. Afterwards, whenever we want to determine the system contention state in which a query is executed, we only need to check which subrange the estimated cost of the probing query lies in by using (2), without actually executing the probing query. Since obtaining the parameter values ( ) in (2) usually requires less overhead than executing a probing query, using the estimated costs of a probing query to determine system contention states is usually more efficient. However, estimation errors may introduce some inaccuracy.

4. Development of cost models

As mentioned before, we extend the query sampling method for a static environment in [17] so as to develop cost models for a dynamic environment by introducing a qualitative variable. Such extensions are discussed in this section.

4.1. Query classification and sampling

Similar to the static query sampling method, we group local queries on a local database system into classes based on the access methods that are likely to be employed. The previous classification rules and procedures in [17] can be utilized. For example,

is a class of unary queries that are most likely performed by using a clustered-index scan access method in a DBMS. Hence a similar performance behavior is shared among queries in the class and can be described by a common cost model.

A sample of queries is then drawn from each query class in a similar way as before. However, since more parameters associated with the indicator variables are included in a cost model, more sample queries need to be drawn in order to meet the commonly-used rule for sampling in statistics, i.e., sample at least 10 observations for every parameter to be estimated [12]. The following proposition gives a guideline on the minimum number of sample queries needed for regression analysis.

PROPOSITION 4.1 For the general qualitative regression cost model in Table 2 with quantitative explanatory variables and one qualitative variable for states, at least observations need to be sampled.

PROOF. Notice that there are groups of regression coefficients in the cost model, one for each independent quantitative variable plus the intercept term. Each group has coefficients, one for each state of the qualitative variable. In addition, the variance of the error terms also needs to be estimated.

Sample queries drawn from a query class are performed in a dynamic environment. Their observed data as well as their associated probing query costs are recorded and used to derive a regression cost model for the query class. A load builder, which is part of the MDBS agent for each local DBS [2], is used to simulate a dynamic application environment at a local site in an MDBS during the query sampling procedure. The MDBS agent may also have an environment monitor which collects system statistics used for estimating the probing query costs when the estimation approach in Section 3.3 is employed.

4.2. Regression cost models

A qualitative regression cost model contains a set of quantitative explanatory variables and a set of indicator variables for a qualitative variable indicating system contention states. Similar to the static query sampling method, we divide the cost model into two parts: a basic model and a secondary part. The basic model represents the essential part of the model, while the secondary part is used to further improve the model. The qualitative variable (i.e., the indicator variables) is included in both parts of the cost model to capture the dynamic environmental factors. Set is split into two subsets and , where contains basic (quantitative) explanatory variables in the basic model, while contains secondary (quantitative) explanatory variables in the secondary part. Table 3 lists potential explanatory variables in each of the subsets for a unary query class and a join query class. If all variables (including the indicator variables) are included, the full cost model is:

However, usually, not all variables are necessary for a given cost model.

To determine the variables to be included in a regression cost model for a query class, a mixed backward and forward procedure described below is adopted. We start with the full basic model which includes all variables in and use a backward procedure to eliminate insignificant basic explanatory variables one by one. Note that, in our algorithm, if an explanatory variable is removed from the

model, its coefficients for all contention states (determined by indicator variables ’s) are removed. We then use a forward selection procedure to add more significant secondary explanatory variables from into the cost model. This procedure tries to further improve the cost model. Similar to the backward procedure, if a secondary variable is added into the model, its coefficients for all contention states are included.

Class              Basic Explanatory Variables               Secondary Explanatory Variables
Unary Query Class  – size (cardinality) of operand table     – tuple length of operand table
                   – size of intermediate table              – tuple length of result table
                   – size of result table                    – operand table length
                                                             – result table length
Join Query Class   – size of 1st operand table               – tuple length of 1st operand table
                   – size of 2nd operand table               – tuple length of 2nd operand table
                   – size of 1st intermediate table          – tuple length of result table
                   – size of 2nd intermediate table          – 1st operand table length
                   – size of result table                    – 2nd operand table length
                   – size of Cartesian product of            – result table length
                     intermediate tables

Table 3. Potential Explanatory Variables for Cost Models

Since it is expected that most basic variables are important to a cost model and only a few secondary explanatory variables are important, both the backward elimination and the forward selection procedures most likely terminate soon after they start.

Assume that we have sampling observations in contention state ( ), with observations in total. Consider the simple correlation coefficient between variables and in contention state :

where are the values from the th sampling observation ( ) in state . For any explanatory variable , if its maximum simple correlation coefficient with response variable is too small, it has little linear relationship with in any state. Such explanatory variables should be removed from consideration.

In the backward elimination procedure, the next variable to be removed from the current model is the one which satisfies two conditions: (i) its average simple correlation coefficient with response variable for all contention states is the smallest among all explanatory variables in the current model; (ii) it makes or , where is the standard error of estimation for the reduced model (i.e., with removed) given by:

(3)

here denote the observed query cost, estimated query cost given by the reduced model, and number of explanatory variables in the model, respectively; is the standard error of estimation for the original model given by a formula similar to (3); is a given small positive constant.

Since the average simple correlation coefficient indicates the degree of linear relationship between and on average in all states, the foregoing condition (i) selects an explanatory variable that contributes the least (on average in all states) in explaining the response variable . Since the standard error of estimation is an indication of estimation accuracy, the foregoing condition (ii) ensures that removing variable from the model improves the estimation accuracy or affects the model very little. Removing a variable that has little effect on the model can reduce the complexity and maintenance overhead of the model.

In the forward selection procedure, the next variable from to be added into the current model is the one that satisfies: (i) its average simple correlation coefficient with the residuals of the current model for all states is the largest among all explanatory variables in the model; i.e., it can explain the most (on average for all states) about the variations that the current model cannot explain; and (ii) it significantly improves the estimation accuracy, i.e., and , where denote the standard errors of estimation for the augmented model (i.e., with included) and the original model, respectively; and is a given small positive constant.

Note that the exact number of explanatory variables in a cost model is determined after the above mixed backward and forward procedure is done. However, we need such information to determine the query sample size from Proposition 4.1 at the beginning of the cost model development. Since it is expected that most basic explanatory variables in are selected and only a few secondary explanatory variables in are used for a cost model, we expect the number of explanatory variables in a cost model usually not to exceed . Based on experiments, the maximum number of contention states for a dynamic environment in practice can also be estimated. Hence, a reasonable query sample size is:

(4)

from Proposition 4.1.

4.3. Measures for developing useful models

Multicollinearity occurs when explanatory variables are highly correlated among themselves. In such a case, the estimated regression coefficients tend to have large sampling variability. It is better to avoid multicollinearity.

The presence of multicollinearity is detected by means of the variance inflation factor [11]. When an explanatory variable has a strong linear relationship with the other explanatory variables, its variance inflation factor is large. In a dynamic environment with multiple contention states, let ( ) be the variance inflation factor of explanatory variable

in state . If is large, is not included in a cost model to avoid multicollinearity.

The F-test, the standard error of estimation, the coefficient of multiple determination, as well as the percentage of good cost estimates for test queries are used to validate the significance of a developed regression cost model.

5. Experimental results

To verify the feasibility of our multi-states query sampling method for developing cost models in a dynamic environment, experiments were conducted in a multidatabase environment using a research prototype called CORDS-MDBS [2]. Two commercial DBMSs, i.e., Oracle v8.0 and DB2 v5.0, were used as local database systems running under Solaris 5.1 on two SUN UltraSparc 2 workstations. Figure 3 shows the experimental environment. Local queries are submitted to a local DBS via an MDBS agent. The MDBS agent provides a uniform relational ODBC interface for the global server. It also contains a load builder which generates dynamic loads to simulate dynamic application environments.

[Figure 3. Experimental Environment. The CORDS-MDBS Server sends local queries through one MDBS Agent per site to Local DBS 1 (Oracle 8.0), Local DBS 2 (DB2 5.0), ..., Local DBS n, each consisting of a local DBMS and a local DB.]

The local databases used in the experiments were the same as those in [17, 19], except that each table is ten times larger than before due to the improved space availability and CPU capability in our experimental environment. More specifically, each local database has 12 randomly-generated tables ( ) with cardinalities ranging from 3,000 to 250,000. Each table has a number of indexed columns and various selectivities for different columns.

In the experiments, queries on each local DBS were classified first according to the same rules in the static query sampling method. A sample of queries with the size meeting condition (4) were then drawn from each query class and performed in the simulated dynamic environments at the local sites. Their observed costs together with the associated probing query costs are used to derive a cost model with a qualitative variable for each query class using the techniques suggested in the previous sections. Some randomly-generated test queries [17] in the relevant query classes were also performed in the dynamic environment, and their observed costs were compared with the estimated costs given by the derived cost models. Note that, unlike the scientific computation in engineering, the accuracy of cost estimation in query optimization is not required to be very high. The estimated costs with relative errors within 30% are considered to be very good, and the estimated costs that are within the range of one-time larger or smaller than the corresponding observed costs (e.g., 2 minutes vs. 4 minutes) are considered to be good. Only those estimated costs which are not of the same order of magnitude as the observed costs (e.g., 2 minutes vs. 3 hours) are not acceptable.

[Table 4. Multi-State Cost Models for DB2 and Oracle. Columns: Query Class; Cost Estimation Model with Qualitative Variable (i.e., Multi-States Cost Models).]

Table 4 shows the cost models derived by applying the multi-states query sampling method suggested in this paper for three representative query classes for each local DBS, namely⁸, a unary query class without usable indexes, a unary query class with usable non-clustered indexes for ranges, and a join query class without usable indexes. Table 5 shows some statistical measures for the derived cost models⁹.

⁸ The three query classes correspond to , , and in [17].
⁹ The number in parentheses beside ‘multi-states’ in Table 5 indicates the number of contention states used for the relevant cost model.

For comparison purposes, two static experimental cases were also considered. In the first case, cost models were derived by applying the static query sampling method to sampling data obtained from a static environment (Static Approach 1). In the second static case,

cost models were derived by applying the static query sampling method to sampling data obtained from a dynamic environment (Static Approach 2). This in effect restricts the multi-states query sampling method to consider only one contention state.

query   cost model         coeff. of total   standard error   average     very good   good
class   type               determination     of estimation    cost        estimates   estimates
for     multi-states (3)   0.972             0.157e+2         0.528e+2    55%         78%
        one-state          0.798             0.363e+2         0.511e+2    30%         58%
        static             0.972             0.672e+0         0.290e+1    3%          5%
for     multi-states (6)   0.994             0.997e+1         0.620e+2    60%         76%
        one-state          0.779             0.620e+2         0.690e+2    24%         48%
        static             0.986             0.733e+0         0.359e+1    7%          14%
for     multi-states (3)   0.996             0.230e+3         0.735e+3    37%         62%
        one-state          0.910             0.254e+3         0.431e+3    27%         45%
        static             0.992             0.116e+2         0.381e+2    9%          13%
for     multi-states (3)   0.982             0.160e+2         0.680e+2    69%         81%
        one-state          0.876             0.576e+2         0.865e+2    35%         60%
        static             0.999             0.917e-1         0.402e+1    3%          6%
for     multi-states (6)   0.993             0.143e+2         0.873e+2    63%         74%
        one-state          0.901             0.672e+2         0.108e+3    35%         62%
        static             0.999             0.301e+0         0.493e+1    4%          8%
for     multi-states (4)   0.999             0.148e+3         0.998e+3    51%         67%
        one-state          0.951             0.507e+3         0.882e+3    22%         44%
        static             0.999             0.503e+1         0.492e+2    0%          1%

Table 5. Statistics for Cost Models

From the experimental results, we can make the following observations:

• The multi-states query sampling method presented in this paper can derive good cost models in a dynamic environment. The coefficients of total determination in Table 5 indicate that all derived models can capture 98.9% of the variations in query cost on average. The standard errors of estimation are acceptable, compared with the magnitude of the average cost of relevant sample queries (only 22% of average costs on average). The statistical F-tests at significance level were also conducted, which showed that all cost models are useful for estimating query costs in a dynamic environment.

• The (static) cost models derived by the static query sampling method for a static environment (i.e., Static Approach 1) are not suitable for estimating query costs in a dynamic environment. Although such cost models may have good coefficients of total determination (99.1% on average in Table 5) for the sampling data in a static environment, they can hardly give good cost estimates in a dynamic environment (only 7.8% good cost estimates on average in Table 5 for the test queries in our experiments).

• The (multi-states) cost models derived by using the multi-states query sampling method for a dynamic environment significantly improve on the (one-state) cost models derived by applying the static query sampling method for the dynamic environment (i.e., Static Approach 2). In fact, compared with the one-state cost models, the multi-states cost models increase the number of very good cost estimates (i.e., with relative errors ≤ 0.3) and the number of good cost estimates (i.e., within one time range) by 27.0% and 20.2% (on average) respectively for the test queries. Figures 4 to 9 show comparisons among the observed costs, estimated costs by the multi-states cost models, and estimated costs by the one-state cost models for the test queries in a dynamic environment.

• The more contention states are considered, the better the derived cost model usually is. For example, the coefficients of total determination for the cost models for query class with 1 to 6 contention states are 0.7788, 0.9636, 0.9674, 0.9899, 0.9922, respectively. However, the improvement may be very small after the number of contention states reaches a certain point. Table 5 shows that usually considering 3 to 6 contention states for a dynamic environment is sufficient to obtain a good cost model.

• Like static techniques [3, 17], it is also true of the multi-states query sampling method that small-cost queries usually have worse cost estimates than large-cost queries. The main reason for this is that even a small momentary change in the system environment may have a significant impact on the cost of a small-cost query. It is not easy to capture all such small environmental changes in a cost model. Fortunately, estimating the costs of small-cost queries is not as important as estimating the costs of large-cost queries because it is more important to identify large-cost queries so that inefficient execution plans can be avoided.

• Contention states determination algorithm IUPMA works well for both uniformly-distributed and clustered probing query costs, while algorithm ICMA can determine an even better set of system contention states for the clustered cases. Note that the sampled probing query costs were drawn by following the distribution of the contention level in a dynamic environment. In fact, the experimental results shown in Tables 4 and 5 and Figures 4 to 9 were obtained for the uniform case. Extensive experiments were also conducted for clustered cases. The experimental results showed that, for a given query class, the cost model derived in the clustered cases is usually better than the one derived for the uniform case even if IUPMA is used. This is because the cost models for the clustered cases only need to capture performance behavior of queries in more focused and narrower subrange(s) of the contention level. Table 6 shows some typical experimental results for a query class in a dynamic environment with clustered contention levels (see Figure 10 for the relevant frequency distribution of the contention level).

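The sample-size guideline of Proposition 4.1 in Section 4.1 can be worked through numerically. The sketch below is one reading of the proof as it survives here: one group of coefficients per quantitative variable plus the intercept, one coefficient per contention state in each group, one extra parameter for the error variance, and 10 observations per parameter [12]. The function and argument names are hypothetical, and the exact expression printed in the proposition is not reproduced.

```python
def parameter_count(num_quantitative_vars, num_states):
    """Number of parameters to estimate, following the proof of Proposition 4.1."""
    if num_quantitative_vars < 0 or num_states < 1:
        raise ValueError("need at least one state and a non-negative variable count")
    coefficient_groups = num_quantitative_vars + 1   # one per variable, plus the intercept
    coefficients = coefficient_groups * num_states   # one coefficient per state in each group
    return coefficients + 1                          # plus the variance of the error terms


def min_sample_size(num_quantitative_vars, num_states, per_parameter=10):
    """Minimum number of sample queries under the 10-observations-per-parameter rule."""
    return per_parameter * parameter_count(num_quantitative_vars, num_states)


if __name__ == "__main__":
    # Example: 3 quantitative variables and 3 contention states.
    print(min_sample_size(3, 3))
    # Example: 5 quantitative variables and 6 contention states.
    print(min_sample_size(5, 6))
```

With one contention state the count reduces to the static case, which illustrates why a dynamic environment needs more sample queries than a static one.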
[Figures 4 to 9 each plot query cost (elapsed time in sec.) against the number of result tuples (×10^5) for the test queries. Solid line: observed cost; dashed line (o): estimated cost by the qualitative approach (multi-states); dotted line (+): estimated cost by the static approach (one-state).]

Figure 4. Costs for Test Queries in on DB2 5.0
Figure 5. Costs for Test Queries in on Oracle 8.0
Figure 6. Costs for Test Queries in on DB2 5.0
Figure 7. Costs for Test Queries in on Oracle 8.0
Figure 8. Costs for Test Queries in on DB2 5.0
Figure 9. Costs for Test Queries in on Oracle 8.0
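The comparison drawn in Figures 4 to 9 between a multi-states model and a one-state model can be imitated on synthetic data. The sketch below is not the paper's procedure or data: the boundaries, the "true" coefficients, and all names are hypothetical. It uses a single explanatory variable (result size) and fits one line per contention state, which yields the same coefficient estimates as a regression with indicator variables when every coefficient is allowed to vary by state. The standard error uses the common form sqrt(SSE / (observations - coefficients)), which may differ from formula (3).

```python
import math
import random


def contention_state(probe_cost, boundaries):
    """Return the 0-based index of the subrange that contains probe_cost."""
    state = 0
    for upper in boundaries:        # boundaries must be sorted in ascending order
        if probe_cost >= upper:
            state += 1
    return state


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_multi_states(xs, ys, states, num_states):
    """Fit one (intercept, slope) pair per contention state."""
    model = []
    for s in range(num_states):
        xs_s = [x for x, st in zip(xs, states) if st == s]
        ys_s = [y for y, st in zip(ys, states) if st == s]
        model.append(fit_line(xs_s, ys_s))
    return model


def standard_error(model, xs, ys, states):
    """sqrt(SSE / (observations - coefficients)), a common definition."""
    sse = sum((y - (model[s][0] + model[s][1] * x)) ** 2
              for x, y, s in zip(xs, ys, states))
    return math.sqrt(sse / (len(xs) - 2 * len(model)))


# Synthetic data: cost grows faster with result size under heavier contention.
rng = random.Random(0)
BOUNDARIES = [20.0, 40.0]                          # assumed probing-cost cut points (sec.)
TRUE_MODEL = [(5.0, 1e-3), (10.0, 2e-3), (20.0, 4e-3)]
sizes, costs, states = [], [], []
for _ in range(300):
    size = rng.uniform(1e3, 2e5)                   # number of result tuples
    state = contention_state(rng.uniform(0.0, 60.0), BOUNDARIES)
    intercept, slope = TRUE_MODEL[state]
    sizes.append(size)
    states.append(state)
    costs.append(intercept + slope * size + rng.gauss(0.0, 5.0))

multi = fit_multi_states(sizes, costs, states, num_states=3)
one = fit_multi_states(sizes, costs, [0] * len(sizes), num_states=1)
se_multi = standard_error(multi, sizes, costs, states)
se_one = standard_error(one, sizes, costs, [0] * len(sizes))
print(f"standard error: multi-states {se_multi:.1f}, one-state {se_one:.1f}")
```

Because the one-state fit has to average over all contention levels, its standard error on this synthetic data is much larger than that of the multi-states fit, which mirrors the qualitative pattern reported in Table 5.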

query   states          # of     coeff. of total   standard error   average     very good   good
class   determination   states   determination     of estimation    cost        estimates   estimates
for     IUPMA           3        0.978             0.128e+2         0.488e+2    58%         82%
        ICMA            3        0.991             0.740e+1         0.465e+2    82%         95%

Table 6. Statistics for Cost Models in a Clustered Case

[Figure 10. Histogram of Contention Level in a Clustered Case. Horizontal axis: System Contention Level (Probing Query Cost in Sec.), 0 to 60; vertical axis: Frequency, 0 to 25.]

6. Conclusions

The techniques proposed so far in the literature to develop local cost models in an MDBS are only suitable for a static environment. Many dynamically-changing environmental factors have significant effects on query cost. To develop a cost model for a dynamic environment, we have proposed a new qualitative approach, called the multi-states query sampling method, in this paper. This method solves the dynamic problem by dividing the system contention level, which reflects the combined net effect of dynamic factors on query cost, in a dynamic environment into a number of discrete contention states based on the costs of a probing query and then incorporating a qualitative variable indicating the contention states into a cost model. The costs of a probing query can be either observed or estimated. An appropriate set of system contention states can be determined based on either an iterative uniform partition with merging adjustment or a clustering-based partition. The former is designed for a dynamic environment with the contention level following the uniform distribution, while the latter is suitable for a dynamic environment with the contention level following a non-uniform distribution with clusters. Due to the iterating and adjusting mechanisms, the former usually can also handle the cases with non-uniform distributions although the latter may do a better job. The development of regression cost models for a dynamic environment is based on the extensions of techniques from our previous static query sampling method. Our experimental results demonstrate that the multi-states query sampling method presented in this paper is quite promising in developing useful cost models in a dynamic environment. It represents a significant improvement over the static techniques in a dynamic environment. Usually, considering a small number of contention states is sufficient to yield a good cost model.

Although dynamic environmental factors have significant effects on query cost, they were ignored in most existing cost models for MDBSs or other database systems due to lack of appropriate techniques. This paper introduces a promising approach to tackling the problem. However, further research needs to be done in order to fully solve all relevant issues.

References

[1] S. Adali et al. Query caching and optimization in distributed mediator systems. In Proc. of SIGMOD, pp 137–48, 1996.
[2] G.K. Attaluri, D.P. Bradshaw, N. Coburn, P.-Å. Larson, P. Martin, A. Silberschatz, J. Slonim, and Q. Zhu. The CORDS multidatabase project. IBM Systems Journal, 34(1):39–62, 1995.
[3] W. Du, et al. Query optimization in heterogeneous DBMS. In Proc. of VLDB, pp 277–91, 1992.
[4] W. Du, M. C. Shan, and U. Dayal. Reducing Multidatabase Query Response Time by Tree Balancing. In Proc. of SIGMOD, pp 293–303, 1995.
[5] G. Gardarin, et al. Calibrating the query optimizer cost model of IRO-DB, an object-oriented federated database system. In Proc. of VLDB, pp 378–89, 1996.
[6] S. Guha, et al. CURE: An Efficient Clustering Algorithm for Large Databases. In Proc. of SIGMOD, pp 73–84, 1998.
[7] C. Lee and C.-J. Chen. Query Optimization in Multidatabase Systems Considering Schema Conflicts. IEEE Trans. on Knowledge and Data Eng., 9(6):941–55, 1997.
[8] W. Litwin, et al. Interoperability of multiple autonomous databases. ACM Comp. Surveys, 22(3):267–293, 1990.
[9] H. Lu and M.-C. Shan. On global query optimization in multidatabase systems. In 2nd Int’l workshop on Research Issues on Data Eng., pp 217, Tempe, Arizona, USA, 1992.
[10] H. Naacke, G. Gardarin, and A. Tomasic. Leveraging mediator cost models with heterogeneous data sources. In Proc. of 14th Int’l Conf. on Data Eng., pp 351–60, 1998.
[11] J. Neter, et al. Applied Linear Statistical Models, 3rd Ed. Richard D. Irwin, Inc., 1990.
[12] R. Pfaffenberger et al. Statistical Methods for Business and Economics. Richard D. Irwin, Inc., 1987.
[13] M. T. Roth, F. Ozcan, and L. M. Haas. Cost models DO matter: providing cost information for diverse data sources in a federated system. In Proc. of VLDB, pp 599–610, 1999.
[14] A. P. Sheth, et al. Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Computing Surveys, 22(3):183–236, Sept. 1990.
[15] T. Urhan, et al. Cost-based query scrambling for initial delays. In Proc. of SIGMOD, pp 130–41, 1998.
[16] Q. Zhu and P.-Å. Larson. A fuzzy query optimization approach for multidatabase systems. Int’l J. of Uncertainty, Fuzziness and Knowledge-Based Sys., 5(6):701–22, 1997.
[17] Q. Zhu and P.-Å. Larson. Solving local cost estimation problem for global query optimization in multidatabase systems. Distributed and Parallel Databases, 6(4):373–420, 1998.
[18] Q. Zhu and P.-Å. Larson. Building regression cost models for multidatabase systems. In Proc. of 4th IEEE Int’l Conf. on Paral. and Distr. Inf. Syst., pp 220–31, Dec. 1996.
[19] Q. Zhu and P.-Å. Larson. A query sampling method for estimating local cost parameters in a multidatabase system. In Proc. of 10th IEEE Int’l Conf. on Data Eng., pp 144–53, Feb. 1994.
