SQL Optimization

I. Overview

After learning the fundamentals of SQL, it's fairly easy to create queries that allow you to
selectively view the contents of one or more tables. However, even though these queries may
have taken very little time to create, you may end up being surprised at how long they can take to
execute. In fact, it's very commonplace for the initial delight of SQL to vanish quickly when the
first query takes hours to complete! This is due to two factors. First of all, SQL access through an
ODBC Driver will never be quite as fast as native manipulation of the database. This makes
sense as the ODBC Driver adds another layer (or middleman) to the whole process. The more
layers that are added, the slower the process will be. However, it's also very common to over-
estimate this impact. Most of the time, the poor performance is actually due to poorly optimized
queries rather than a slow ODBC Driver. The lack of optimization in a query is the number one
reason for those lengthy queries. It's very possible for an optimized query to run over 6000%
faster than a non-optimized query. Therefore, as speed (and therefore usability) is always an
issue, it's well worth it to spend some time examining and optimizing your queries. Because it is
so important, the process of creating a query should include careful planning up front and
optimization verification before the process is complete.

Because there are only a few techniques for optimizing queries, it would seem to be an easy job
at first. After all, there are only a few things to check, right? Although that's correct, some of the
techniques can be very time-consuming and require a good understanding of the tables, Data
Dictionary, and index theory.

The techniques for optimization are outlined below.

II. Ensuring Proper Use of Indices

Effectively using a table's indices is probably the most difficult technique to employ. However, it
also produces the best results and is the most important technique for this reason. Before
continuing, it's worthwhile to quickly go over an index's purpose.

Index overview:

An index, or a key, is a field or combination of fields used to identify a record. There are two
main types of indexes - a primary index and a secondary index. A primary index uniquely
identifies a record. In other words, no other record in the table will have that same index. A non-
unique or secondary index does not uniquely identify a record. Two classic examples illustrate
the point. A customer number field uniquely identifies the customer, as no two customers can
have the same customer number. Therefore, the customer number field can be used for the
primary index. A customer's Zip Code, on the other hand, is not unique. This makes sense, as
many customers can be located within the same Zip Code. Therefore, the customer's Zip Code
may be used as a secondary index.
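To make the distinction concrete, here is a small hypothetical example written as generic SQL
DDL (with the BASIS tools these definitions would normally be made in the Data Dictionary and
in the keyed data file, but the idea is the same; the table and field names are invented):

CREATE TABLE CUSTOMER (
    CUST_NUM   CHAR(10) NOT NULL,
    LAST_NAME  CHAR(30),
    FIRST_NAME CHAR(30),
    ZIP_CODE   CHAR(10),
    PRIMARY KEY (CUST_NUM)
)
CREATE INDEX ZIP_IDX ON CUSTOMER (ZIP_CODE)

The PRIMARY KEY on CUST_NUM is the primary index - it is unique, so it identifies exactly
one record. ZIP_IDX is a secondary index - many customers can share a Zip Code, so it narrows
a search without pinpointing a single record.
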
Now that we've covered the two types of indices, what are they used for? In general, indices are
used to enable the database engine to find and sort records quickly. An index allows the engine
to jump directly to where the data is located, as opposed to reading every single record to see if it
matches your search criteria. As an example, most books have an index. A book's index allows
you to look up a particular word, and informs you of every page that the word exists on.
Therefore, using a book's index allows you to quickly find all occurrences of that particular word
in the book. It's much quicker to find the word this way than it would be to read the entire book,
marking every page that contains the word. It's very similar with databases. An index allows the
database engine to find records quickly, as it enables the engine to jump to the locations in the
file that contain a desired record. Without the use of indices, the engine would have to read every
record to see if it matched the required criteria. Armed with this knowledge, it's easy to see that
indices are imperative for good performance of a database, particularly with a relational database
or when dealing with a very large table.

Indices can vastly improve the speed of database operations - but how and when they are
implemented plays an integral part in realizing performance gains. Because their purpose is to
enable the database engine to find records and sort them quickly, the likely candidates for
indexing are those fields that are frequently searched, sorted, or used to join tables. However, it's
still a good idea to be somewhat judicious with the choice of fields to index. Whenever a record
is added or modified, the database engine has more work to do if there are many indexed fields.
This is because the indices must be updated every time this sort of action takes place. If most of
the table access is spent querying, this won't impact performance. However, if the table's records
are frequently modified, there should be a reasonable limit to the number of fields indexed.

Indices may be comprised of a single field, or multiple fields. Which type of index to use should
be determined by the types of SELECTs, ORDER BYs, and JOINs that are used most often. In
other words, multiple field indices are useful when searching on multiple fields (as in a
LastName+FirstName example), and single field indices are useful when searching on a single
field (as in a Customer_Number example).
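As a sketch, using the hypothetical CUSTOMER table from the earlier example, the two cases
map to index definitions and queries along these lines:

CREATE INDEX NAME_IDX ON CUSTOMER (LAST_NAME, FIRST_NAME)
select * from CUSTOMER where LAST_NAME = 'SMITH' and FIRST_NAME = 'JOHN'

select * from CUSTOMER where CUST_NUM = '0000012345'

The first query is served by the multiple-field NAME_IDX because it searches on both name
fields; the second needs no additional index, as CUST_NUM is already the single-field primary
index.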

Once the table's indices have been defined, queries to the table can be optimized to take
advantage of these indices. Generally speaking, all of your SQL queries should be optimized so
that the database engine can use existing indices to find, sort, and join records and tables quickly.
This means that indexed fields should always be used in the SQL statement's WHERE and
ORDER BY clauses. If the WHERE clause references an index, the records can be retrieved
from the file quickly, as in the book's index example. If the result set is to be sorted, then the
ORDER BY clause should also reference an indexed field, as the data will be sorted much more
quickly that way.
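For instance, with the CALL_LOG table used in the examples below (whose primary index is
based on the call ID), a query like the following lets the engine satisfy both the filter and the sort
from that index:

select * from CALL_LOG where ID > '0000022000' order by ID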

After the query has been created, the next step is to execute it and verify that the database engine
was able to optimize the query by using the proper indices. With the BASIS ODBC Driver, this
process is accomplished by examining the log file after a query has been performed. To illustrate
this, we can perform both an indexed and non-indexed query:

Indexed query:
SELECT * from CALL_LOG where ID > '0000022000' and TIME < 9
This will select all of the records from the CALL_LOG table where the call came in before 9:00
AM and the call ID is greater than '0000022000'. Because the primary index is based on the call
ID, the ODBC Driver can retrieve the records quickly. Here's the excerpt from the log file:

OPTIMIZATION STRATEGY:
(file 1) = f:/odbc/Local_Call_Hist/data/call_log
order_knum=-1
[Selected] Predicate: 1 constraints
*!: (file 1) (knum=0, kseg=1) ID (bracket head)
Predicate: 1 constraints
* : (file 1) (knum=-1, kseg=-1) ID (no bracketing)

Note that the log file contains several pieces of information to indicate the optimization strategy
used. First of all, it says [Selected] to indicate that it was able to select an index to optimize
on. Secondly, it lists two possible options to optimize off of - one for the ID field and one for the
TIME field. The ID field is listed first, and the knum=0, kseg=1 indicates that it is the primary
key (as it's the first key in the first keychain). The second line listed is for the TIME field. As its
knum and kseg are -1, this indicates that there aren't any indices that are based on the TIME
field. In the above example, the ODBC Driver used the primary index based on the call ID field
to optimize on. This is indicated by the '!' preceding its line.

Non-Indexed query:
SELECT * from CALL_LOG where TIME < 9

This will select all of the records from the CALL_LOG table where the call came in before 9:00
AM. As there aren't any indexes based on the time field, the ODBC Driver isn't able to optimize
the query. Here's the excerpt from the log file:

OPTIMIZATION STRATEGY:
(file 1) = f:/odbc/Local_Call_Hist/data/call_log
order_knum=-1
Predicate: 1 constraints
* : (file 1) (knum=-1, kseg=-1) ID (no bracketing)

Note that it doesn't have a [Selected] or a '!' to indicate a selected index.

A more complex example:

SELECT CALL_TRN.TECH, CALL_TRN.TRN_DATE from CALL_LOG, CALL_TRN where
CALL_LOG.ID between '0000020000' and '0000022000' and CALL_LOG.TIME < 9 and
CALL_LOG.TECH='ABC' and CALL_LOG.ID=CALL_TRN.ID order by CALL_TRN.TECH

This query is more complex than the previous examples as it does a join on the CALL_LOG and
CALL_TRN tables, it references the CALL_LOG.ID field twice (for the between), and it does an
order by. Here's the log file excerpt:

OPTIMIZATION STRATEGY:
(file 1) = f:/odbc/Local_Call_Hist/data/call_log
order_knum=-1
[Selected] Predicate: 1 constraints
*!: (file 1) (knum=0, kseg=1) ID (bracket head)
Predicate: 1 constraints
* : (file 1) (knum=0, kseg=1) ID (bracket tail)
Predicate: 1 constraints
* : (file 1) (knum=-1, kseg=-1) ID (no bracketing)
Predicate: 1 constraints
* : (file 1) (knum=0, kseg=1) ID (partial key: knum=0)

(file 2) = f:/odbc/Local_Call_Hist/data/call_trn
order_knum=-1
[Selected] Predicate: 2 constraints
: (file 1) (knum=0, kseg=1) ID (primary key)
*!: (file 2) (knum=0, kseg=1) ID (partial key: knum=0)

The log file shows that the ODBC Driver did more work with this query in order to figure out the
best field to optimize on. In fact, it checked almost every field mentioned in the query. The first
two entries for the first file relate to the ID field. There are two entries for this field, a bracket
head and a bracket tail. This is because the field was used twice for comparison. Additionally,
note that all entries list the primary index (ID) after their knum and kseg portion to indicate
the primary key. The third entry could not be used as there isn't an index associated with that
field, as mentioned above. The fourth entry says that it could have chosen to do a partial key read
based on the primary key. For the second file, it lists two entries. The first is the primary key for
the first table. The second entry (which was picked) shows that it was able to do a partial key
read off of the primary key. Note that this is quite a bit different from the first table. For the first
table, it keyed directly off of ID, but it says that it can only do a partial key read off of ID for the
second table. The reason for this is that the primary key for the second file is based off of
multiple fields. So, since CALL_TRN's primary index was comprised of the ID field followed by
two others, the ODBC Driver couldn't key directly off of the primary key. Instead, it could do a
partial key read because the query referenced the first portion of the primary key.

Comparison of indexed vs. non-indexed queries:

To demonstrate the importance of indices, two identical queries were performed on a table and
timed. The only difference between the tests is that the first query referenced the primary index,
and the second query did not reference any indices. Here is the query:

select * from CALL_TRN where ID > '0000022670'

To query on the primary index, the table was defined in the Data Dictionary to have an index for
the ID field, and the physical data file used the ID field as its primary key. Before the query was
run a second time, the Data Dictionary was modified so that the table's index used a different
field. The physical data file was also re-written so that the ID field was not specified as a key.
Here are the results:

Query on indexed field: 0.25 seconds
Query on non-indexed field: 14.71 seconds

This means that the indexed query was 5884% faster than the non-indexed query. Quite
impressive, isn't it? It's worthwhile to mention that this query was structured specifically to
provide a large discrepancy in the result times. In other words, performance increases aren't
going to always be in the 6000% range - it can vary greatly depending on the query and the data
in question. For this example, the select will only return records where the record ID is greater
than '0000022670'. When looking at the data file, it turns out that only 239 records out of 48,145
match this criterion. This means that the query will only return about one half of one percent of
the records - a fairly small amount. As the indexed query can jump directly to the records in the
data file that match the criterion, it only has to deal with 239 records. The non-indexed query, on
the other hand, has to wade through 48,145 records to see if they match the criterion.

Multiple-field indices:

If the index for a table is comprised of multiple fields that have been concatenated together, then
some optimization can take place if one rule is adhered to:

In order to include a field that is part of the index, you have to include every field before it

As an example, let's say the index is comprised of three fields: Customer_Number, Last_Name,
and First_Name. The ODBC driver could not perform any optimizations if the WHERE clause
just included the Last_Name field. If a query has to be done on the Last_Name field, then the
Customer_Number field should be included as well. Likewise, if the query used the First_Name
field it should include the other two in order to ensure proper optimization.
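As a sketch of the rule, assuming those three fields belong to a hypothetical CUSTOMER table:

CREATE INDEX CUST_IDX ON CUSTOMER (Customer_Number, Last_Name, First_Name)

select * from CUSTOMER where Customer_Number = '0000012345' and Last_Name = 'SMITH'
select * from CUSTOMER where Last_Name = 'SMITH'

The first query can be optimized because it supplies the leading field of the index; the second
cannot, because it skips Customer_Number.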

What about adding indices to a Single Keyed MKEYED file?

With a multi-keyed MKEYED file, it is possible to have several indices defined. However, with a
single-keyed MKEYED file, only one index is allowed. This means all queries run against that
table must reference one of the fields that make up the primary key, according to the rule
outlined above. If a query is executed that does not follow that rule, no optimization can take
place. There are two possible workarounds to this problem. The first is to start using multi-keyed
MKEYED files instead, as they are not limited to a single index. The second method involves
creating another file. This new file will have an index defined using the first table's non-indexed
field and one of its indexed fields. The idea is that the new query can do a join on the old table
and the new table. The select will reference the desired field in the new table, which is part of its
index. As it will also return the value of an indexed field, this information can be used to find the
desired records in the original table via the index.
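Here is a sketch of the second workaround (the companion table name is invented). Suppose
CALL_LOG is a single-keyed file whose only index is the call ID, so queries on the TIME field
cannot be optimized. A companion table - call it CALL_TIME_X - keyed on TIME plus ID lets a
join reach the original records through both indices:

select CALL_LOG.* from CALL_TIME_X, CALL_LOG
where CALL_TIME_X.TIME < 9 and CALL_LOG.ID = CALL_TIME_X.ID

The comparison on TIME is satisfied by the companion table's index, and the join on ID is
satisfied by CALL_LOG's primary key.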

Structuring the query properly:

Even though the query may be constructed to take advantage of indexed fields, there are two
other parameters that determine how quickly the query can be performed. The two
parameters are the order of the fields in the WHERE clause, and how simple the comparison
clause is. This is best demonstrated by an example. The goal of the query is to find all of the
information for each call during a particular date range where the call is of a particular priority
level. Here's the first query:
select * from hist_log where PRIORITY= '4' and CALL_DATE between {d '1997-07-07'} and
{d '1997-07-07'}
OPTIMIZATION STRATEGY:
(file 1) = f:/odbc/Local_Call_Hist/data/hist_log
order_knum=-1
[Selected] Predicate: 1 constraints
*!: (file 1) (knum=7, kseg=1) ID (Alternate key: knum=7)
Predicate: 1 constraints
* : (file 1) (knum=2, kseg=1) ID (bracket head)
Predicate: 1 constraints
* : (file 1) (knum=2, kseg=1) ID (bracket tail)

From the log file, we can tell that it chose the PRIORITY field to optimize on. For this particular
file, both the PRIORITY field and the CALL_DATE field are secondary indices. The ODBC
Driver chose the PRIORITY field for two reasons:

1. That field was listed first in the WHERE clause

2. The comparison for that field was much simpler than for the CALL_DATE field

The latter reason is particularly important. Because the CALL_DATE field was used for a more
complex comparison (thanks to the BETWEEN), the PRIORITY field was chosen for
optimization. This is made clearer by the following query:

select * from hist_log where CALL_DATE between {d '1997-07-07'} and {d '1997-07-07'} and
PRIORITY = '4'
OPTIMIZATION STRATEGY:
(file 1) = f:/odbc/Local_Call_Hist/data/hist_log
order_knum=-1
Predicate: 1 constraints
* : (file 1) (knum=2, kseg=1) ID (bracket head)
Predicate: 1 constraints
* : (file 1) (knum=2, kseg=1) ID (bracket tail)
[Selected] Predicate: 1 constraints
*!: (file 1) (knum=7, kseg=1) ID (Alternate key: knum=7)

In this case, the PRIORITY field was listed last in the WHERE clause, but was still chosen over
the CALL_DATE field as the comparison was much simpler.

Now, what does all of this have to do with optimization? The answer is that even though these
two indices appear equal (as they are both secondary indices), we can get far better performance
by optimizing off of the CALL_DATE field. The reason is simple - there are thousands of
records in the data file with a priority of '4', but only a couple dozen within the specified date
range. In other words, if the optimization is based on the PRIORITY field, the ODBC Driver will
end up reading and comparing thousands of records to the specified date. However, if the
optimization is based on the CALL_DATE field, the ODBC Driver will read and compare about
30 records to the specified priority level. To put this in perspective, the query is tried a third
time:
select * from hist_log where CALL_DATE between {d '1997-07-07'} and {d '1997-07-07'} and
PRIORITY like '4%'
OPTIMIZATION STRATEGY:
(file 1) = f:/odbc/Local_Call_Hist/data/hist_log
order_knum=-1
[Selected] Predicate: 1 constraints
*!: (file 1) (knum=2, kseg=1) ID (bracket head)
Predicate: 1 constraints
* : (file 1) (knum=2, kseg=1) ID (bracket tail)
Predicate: 1 constraints
* : (file 1) (knum=-1, kseg=-1) ID (no bracketing)
Predicate: 1 constraints
* : (file 1) (knum=7, kseg=1) ID (bracket head)
Predicate: 1 constraints
* : (file 1) (knum=7, kseg=1) ID (bracket tail)

This time, the query was able to 'force' the ODBC driver into selecting the CALL_DATE field
for optimization. To accomplish this, the CALL_DATE field was listed first in the WHERE
clause. However, as demonstrated in the second query, that wasn't quite enough. The final step
was to make the PRIORITY field look much less attractive to the ODBC Driver. This was done
by making the comparison more complex - it was changed from an equals to a LIKE. Because of
this, the CALL_DATE field was chosen for optimization. As mentioned above, this can
drastically reduce the amount of work that the ODBC Driver has to do. Here are the timing
results:

Query optimized on PRIORITY: 3.25 seconds
Query optimized on CALL_DATE: 0.05 seconds

The timings indicate that tuning the query resulted in a 6500% increase in speed - quite a
dramatic improvement! In fact, the performance gain was better with this query than it was for
the query that was comparing an indexed and non-indexed WHERE clause. This is important
because it points out that even though a query may be referencing and optimizing off of indexed
fields, it still may not be performing as well as it could be. This type of tweaking can have a
profound effect on the speed of the query, but it also requires a great deal of knowledge about the
table and the data that is being queried. The previous example illustrated that nicely - the
performance gain was only realized because the ODBC Driver was forced to optimize on a
particular field. This means that the user had to do some extra work (and thinking!) to make the
ODBC Driver optimize off of a particular field. Additionally, the user knew ahead of time that
the CALL_DATE field would be a far better choice, as it would substantially limit the amount of
data that the ODBC Driver had to deal with. Without a good knowledge of the table's contents,
this type of optimization would have never taken place.

Review:

To recap, referencing indexed fields in your queries is the most effective way to ensure fast
performance. If frequently run queries can't be optimized because they don't reference indexed
fields, it may be worthwhile to restructure the table by adding indices to ensure faster
performance. Note that this is normally a two-step process. The first step is to create the index in
the Data Dictionary, and the next is to re-key the physical BBx data file so that it has the new
field defined as a key. When done via PRO/5, this involves creating a new file with a new key
definition in the MKEYED statement that includes an index for the desired field. Once the new
file has been created, a short program must be run - one that reads all of the records from the old
file and writes them into the new file. With the BASIS ODBC Driver, however, it is possible to
complete this operation in a single step via the CREATE INDEX SQL statement. When this type
of SQL statement is executed, the ODBC Driver will not only modify the Data Dictionary to
include the new index, but it will also re-key the file.
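For example, a statement along the following lines (the index name is arbitrary, and the TECH
field of HIST_LOG is used purely for illustration) would add the index to the Data Dictionary
and re-key the data file in one step:

CREATE INDEX TECH_IDX ON HIST_LOG (TECH)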

III. Limiting the Amount of Data Returned

An often-overlooked way of improving query times is to only return the data that is actually
needed. It's very common to browse the entire table, or simply do a 'SELECT * from table'.
While it may sometimes be necessary to view every single field of every single record, in many
cases it's done just because it's easier, regardless of whether all of the data is going to be used or
not. Therefore, before doing a query it's wise to plan out what fields and records are actually
needed, and have the query return only the necessary data. This means that the WHERE clause
should be constructed to filter out as much un-needed data as possible. Additionally, if only a
portion of the fields are required, stay away from the common 'SELECT * ...' type of query.
Instead, have the SELECT reference only the necessary fields, resulting in much less data being
processed and faster queries. This is especially important when the table has a large number of
columns, or columns that are defined as large character fields. If some of these columns are not
necessary for the result table, eliminating them from the query will result in much less data being
transferred - which translates into faster queries. Following is an example that demonstrates this:

The goal of this query is to find out all of the call reference numbers that were handled by a
particular technician. This query could be accomplished two ways:

select * from hist_log where TECH = 'ABC'

select id from hist_log where TECH = 'ABC'

The first way asks for every field for each record returned. The goal of the query was just to find
out which calls were taken by the technician, though, so only the ID field needs to be returned.
The second query takes advantage of that fact. Here are the result times:

Selecting all fields: 8.09 seconds
Selecting the ID field: 1.28 seconds

In this case, retrieving only the necessary data made the query run 632% faster!

IV. SQLPASSTHROUGH Mode

If at all possible, a SQLPASSTHROUGH type query should be performed. This is usually an
option for the type of query that the 3rd party application (like MS Access or MS Visual Basic)
will perform. With this type of query, the BASIS ODBC Driver is in charge of optimizing the
query instead of the other application. As the ODBC Driver has a better knowledge of the PRO/5
file's indices and its own capabilities for optimization, it will usually do a much better job of
selecting and sorting data than the other application would.

V. Data Source Configurations

The 'Fast Connect' option can help reduce connection time, especially with larger Data Dictionaries.
Turning on this option does several things to improve the connection time to the data source. The
option that makes the biggest time difference is the 'No Shadow Data Dictionary Consistency
Check', which is also available as a separate option.

The shadow DD files represent a form of the Data Dictionary that is customized for the ODBC
Driver. They present the data in a format that the ODBC Driver is more readily able to digest.
They store type information about the DD and information about the data types that are
supported. They also leave out a lot of information from the DD that is Taos-specific. Because of
this, the shadow DD files are typically much smaller than the full-blown DD files.

When a BASIS ODBC Data source is initially connected to, the ODBC Driver will read
information from the Shadow DD files to find out information about the database's tables, fields,
indices, etc. If neither the 'No Shadow Data Dictionary Consistency Check' nor the 'Fast Connect'
option is checked, then the ODBC driver will re-create the Shadow DD files. This means that
it will read the entire DD, figure out all of the pertinent information for the Shadow DD files,
then create and populate the Shadow DD files. Needless to say, this process can take some time.
The larger the Data Dictionary, the longer it takes to create the Shadow DD files. So, if the
ODBC Driver is allowed to bypass the create process, it will save quite a bit of time on the initial
connect. The Shadow DD files only need to be recreated if the original DD changes. In general,
the Shadow DD files shouldn't be created on every connection to the Data Source. Overall, it's
best (from the performance standpoint) to only recreate the Shadow DD files when necessary.
Therefore, this option should only be turned off if the original DD was modified (for example, if
tables, indices, or fields have been added, dropped or modified).

VI. Data Server versus Mapped Drives

If the data that is being queried resides on a remote machine, it's usually possible to access it
either via the BASIS Data Server or via a mapped drive. When possible, the BASIS Data Server
should be used instead of a mapped drive, as queries will be several times faster via the Data
Server. This is because the Data Server can reduce the amount of data that is sent over the
network. When reading information from a share (a mapped drive), protocol demands that
locking information be sent back and forth so that the ODBC driver can read the information.
However, when the Data Server is used, it handles the locking calls. This means that all of that
handshaking is done locally (to that remote machine), and the Data Server will send far less
information over the network. As the network link is relatively slow, reducing the amount of
information passed through this channel can greatly improve response times.

VII. Machine Speed


ODBC queries can be quite demanding on the workstation's resources. In particular, the
machine's CPU speed and hard disk speed (for local queries and ORDER BYs) can have a large
impact on the length of time it takes for a query to complete. As a test, a report consisting of ten
queries was run from two machines. The first machine was a 486 DX2-66 and it took 117
seconds to complete the report. The second machine was a Pentium 133 and it took 56 seconds to
complete. For this test, both machines were on the same network so that didn't play a part in the
speed increase, which was more than 200%. The deciding factor for this test was the CPU and
hard disk speed. This is due to the fact that the queries polled large amounts of data, and ORDER
BYs and GROUP BYs were used.
