CS 4320 PS 2 Index Selection Analysis

CS 4320 PS 2
Dhiraj Gupta (dg523) and Rowan Meara (rrm89)

March 17, 2017
1 Selecting Indices
Some initial assumptions for this problem: index entries are of the form
⟨key k, rid of data entry with search key value k⟩ (so if we anticipate 100
tuples in our result, those tuples will each have a corresponding entry in the
index, for a total of 100 index entries). Also, when fetching pages using an
index, we assume that the entries are held in the minimum number of pages
(i.e., if pages hold 100 tuples and our query is expected to return 100 tuples,
we assume that we'll examine only one page and not two). The same holds for
the data pages.
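The packing assumption amounts to a ceiling division. A minimal Python sketch of it follows; `pages_needed` is a hypothetical helper name introduced here for illustration and does not come from the problem statement.

```python
def pages_needed(num_entries: int, entries_per_page: int) -> int:
    """Minimum number of pages holding num_entries, assuming entries are
    packed as tightly as possible (the assumption stated above)."""
    # Integer ceiling division: avoids floating-point rounding.
    return (num_entries + entries_per_page - 1) // entries_per_page


# 100 result tuples on pages that hold 100 tuples fit on one page.
one_page = pages_needed(100, 100)
# A single extra tuple spills onto a second page.
two_pages = pages_needed(101, 100)
print(one_page, two_pages)
```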
a There are two possibilities: we can use index 2 (clustered B+ tree index
on column age) or index 3 (unclustered B+ tree index on columns (age,
nrOrders)). We can calculate the cost of each possibility to find the
optimal choice. Note that we want to retrieve all entries of Customers
where age is 20. Since we assume the data is distributed uniformly, this
means our result will be $1{,}000{,}000 \times \frac{1}{80} = 12{,}500$ tuples.
So for the first possibility, we have to perform 4 I/O operations to
retrieve a leaf page, which contains 1,000 entries. Note that since the
index is clustered, we need only retrieve 1 leaf page and we can simply
scan the data pages to find our other entries. We find that there are
$\frac{1{,}000{,}000 \text{ tuples}}{10{,}000 \text{ disk pages}} = 100$ tuples
per page, so we need to fetch $\lceil \frac{12{,}500}{100} \rceil = 125$ pages
from disk. From these pages we find the desired entries. So in total the
first possibility costs 4 + 125 = 129 I/O operations.
For the second possibility, the process is similar. For an unclustered index,
though, we need to retrieve all of the leaf pages holding matching entries,
since each index entry could point to a different data page. So we take 4
I/O operations to find the first leaf page. Since we anticipate 12,500
tuples in the result and index pages hold 1,000 entries, we need to check a
total of $\lceil \frac{12{,}500}{1{,}000} \rceil = 13$ index pages. Then our
total cost of using the index is 17 I/Os. In the worst-case unclustered
scenario, we have 1 I/O operation for every data entry, so we have in total
12,500 I/O operations for this. Then our total incurred cost is
17 + 12,500 = 12,517 I/O operations. There is also the opportunity to use
index 1, the unclustered hash index on name, and retrieve tuples for every
possible name, then select only those where age = 20. This is completely
unreasonable, so we won't calculate the cost for this method. A sequential
scan is also possible (ignoring all indices), and this is simply examining
all pages of the dataset (assuming it is not sorted) for a total of 10,000
I/Os. Clearly the first alternative is the best, so the best use of indices
here is to use the clustered B+ tree index on age.
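The comparison in part (a) can be reproduced with a short Python sketch. It follows the accounting used in the text (4 I/Os to reach the first leaf page, then leaf or data pages on top of that) and the tight-packing assumption. All constant and function names are hypothetical, chosen here for illustration.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


NUM_TUPLES = 1_000_000
NUM_DATA_PAGES = 10_000
AGE_VALUES = 80                 # uniform ages, so selectivity is 1/80
INDEX_ENTRIES_PER_PAGE = 1_000
TRAVERSAL_COST = 4              # I/Os to reach the first leaf page

result_tuples = NUM_TUPLES // AGE_VALUES          # 12,500
tuples_per_page = NUM_TUPLES // NUM_DATA_PAGES    # 100


def clustered_cost() -> int:
    # Traverse to the leaf, then read the contiguous data pages.
    return TRAVERSAL_COST + ceil_div(result_tuples, tuples_per_page)


def unclustered_cost() -> int:
    # Traversal plus leaf pages (the text's 4 + 13 = 17), then the
    # worst case of one data-page I/O per matching entry.
    leaf_pages = ceil_div(result_tuples, INDEX_ENTRIES_PER_PAGE)
    return TRAVERSAL_COST + leaf_pages + result_tuples


def sequential_scan_cost() -> int:
    # Read every data page once.
    return NUM_DATA_PAGES


costs = {
    "clustered B+ tree on age": clustered_cost(),
    "unclustered B+ tree on (age, nrOrders)": unclustered_cost(),
    "sequential scan": sequential_scan_cost(),
}
best = min(costs, key=costs.get)
for plan, cost in costs.items():
    print(f"{plan}: {cost} I/Os")
print("cheapest:", best)
```

Under these assumptions the sequential scan beats the unclustered index, because the unclustered plan pays one I/O per matching tuple in the worst case.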
b To retrieve all customers who ordered more than 20 items, we could use
any of the indexes. However, none of these uses is reasonable. For index 1,
the unclustered hash index on name, we can fetch individuals with each
possible name (using an equality check), then for each of those individuals
select those where nrOrders > 20. For index 2, the clustered B+ tree index
on age, we can query for all ages between 6 and 85, and for each of those
select only the tuples where nrOrders > 20. Finally, for the unclustered
B+ tree index on (age, nrOrders), we must look up all ages and from
there find those whose order numbers are greater than 20. We cannot
directly apply the index to nrOrders because that field is not a prefix of
the index. Clearly, this alternative is unreasonable too. Therefore, the
best approach is likely a sequential scan with a cost of 10,000 I/Os.
c Since we are retrieving customers of a specific age (20), gender (male),
and an order range bounded at 50, we multiply the selectivities of the
predicates to obtain the overall selectivity. So we have a selectivity of
$\frac{1}{80} \times \frac{1}{2} \times \frac{51}{100} = \frac{51}{16{,}000}$,
meaning that we can anticipate a result size (in tuples) of
$\lceil 1{,}000{,}000 \times \frac{51}{16{,}000} \rceil = 3{,}188$ tuples
(about 32 data pages at 100 tuples per page, or 4 index pages at 1,000
entries per page). We could use the unclustered hash index on name and then
filter using the age, gender, and nrOrders fields (clearly unreasonable).
However, we can retrieve all customers of age 20 using the clustered B+
tree index on age at a cost of 129 I/O operations, as found in (a). While
these tuples are in the buffer, we can keep only those matching our other
selection criteria and discard the rest at no additional I/O cost, so this
alternative costs us 129 I/Os.
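The selectivity arithmetic in part (c) can be checked exactly with a small Python sketch using rational numbers. Multiplying the per-predicate selectivities assumes the predicates are independent; the variable names below are hypothetical and chosen for illustration.

```python
import math
from fractions import Fraction

NUM_TUPLES = 1_000_000

# Per-predicate selectivities taken from the text.
sel_age = Fraction(1, 80)       # age = 20
sel_gender = Fraction(1, 2)     # gender = male
sel_orders = Fraction(51, 100)  # nrOrders range bounded at 50

# Independence assumption: the combined selectivity is the product.
selectivity = sel_age * sel_gender * sel_orders

# Expected result size, rounded up to whole tuples.
expected_tuples = math.ceil(NUM_TUPLES * selectivity)

print(selectivity, expected_tuples)
```

Using `Fraction` keeps the product exact, so the ceiling is not affected by floating-point rounding.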
