
# Solutions for CS 4320/5320 Homework 2

Fall 2014
Due October 8, 2014

# 1 Written Part

## 1.1 Question 1: Indexing Basics (9 points)
Consider a relation with this schema:

Movies(name: string, budget: integer, length: integer, genre: string)

For each of the following indexes, state whether the index matches the given selection condition (1 point each; 8 points in total).
Note: A simple yes/no answer is sufficient.

(a) A B+-tree index on the search key (Movies.budget, Movies.length)

(1) σ_{Movies.budget < 5,000,000 ∧ Movies.length = 180}(Movies)
(2) σ_{Movies.budget = 6,000,000 ∧ Movies.length > 120}(Movies)
(3) σ_{Movies.length = 60}(Movies)
(4) σ_{Movies.budget = 5,000,000}(Movies)
(b) A hash index on the search key (Movies.budget, Movies.length)

(1) σ_{Movies.budget = 8,000,000 ∧ Movies.length > 100}(Movies)
(2) σ_{Movies.budget = 12,000,000 ∧ Movies.length = 80}(Movies)
(3) σ_{Movies.budget = 7,500,000}(Movies)
(4) σ_{Movies.length = 60}(Movies)

From these observations, explain the advantage of tree-based indexes over hash-based indexes. (1 point)

Answer: In general, a tree index matches (a conjunction of) terms that involve only attributes in a prefix of the search key, whereas a hash index matches (a conjunction of) terms only if it contains a term of the form attribute = value for every attribute in the search key of the index. Thus, tree-based indexes can be used in more cases than hash-based indexes. (1 point)

(a) (4 * 1 points)

(1) Yes
(2) Yes
(3) No
(4) Yes

(b) (4 * 1 points)
(1) No
(2) Yes
(3) No
(4) No
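The matching rules in the answer above can be sketched as a small checker. This is only an illustration: the function names and the representation of a condition as a dict from attribute to comparison operator are my own, not part of the assignment.

```python
def tree_matches(search_key, condition):
    # A tree index matches if the condition's attributes form a prefix
    # of the search key; both equality and range terms qualify.
    return any(set(search_key[:i]) == set(condition)
               for i in range(1, len(search_key) + 1))

def hash_matches(search_key, condition):
    # A hash index matches only if there is an equality term for
    # every attribute in the search key.
    return all(condition.get(attr) == '=' for attr in search_key)

key = ('Movies.budget', 'Movies.length')
print(tree_matches(key, {'Movies.budget': '<', 'Movies.length': '='}))  # (a)(1): True
print(tree_matches(key, {'Movies.length': '='}))                        # (a)(3): False
print(hash_matches(key, {'Movies.budget': '=', 'Movies.length': '='}))  # (b)(2): True
print(hash_matches(key, {'Movies.budget': '='}))                        # (b)(3): False
```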

## 1.2 Question 2: Tree-based Indexing (11 points)

Assume that you have just built a dense B+ tree index using Alternative (2), on a heap file containing
10,000 records. The key field for this B+ tree index is a 20-byte string, and it is a candidate key. Pointers
(i.e., record ids and page ids) are (at most) 8-byte values. The size of one disk page is 560 bytes. The index
was built in a bottom-up fashion using the bulk-loading algorithm, and the nodes at each level were filled
up as much as possible. For all cases, please show your work.
Note: Alternative (2) means that every data entry consists of a key and a single record id pointing to the corresponding record.
(a) How many levels does the resulting tree have? (4 points)

(1) Since the index is a dense primary index, there are as many data entries in the B+ tree as records in the heap file. An index page consists of at most 2d keys and 2d + 1 pointers, so we have to maximize d under the condition 2d · 20 + (2d + 1) · 8 ≤ 560. (1 point)
(2) The solution is d = 9, which means that we can have 18 keys and 19 pointers on an index page. A data entry on a leaf page consists of the key field and a pointer. (1 point)
(3) Its size is 20 + 8 = 28 bytes, so a leaf page has space for ⌊560/28⌋ = 20 data entries. (1 point)
(4) The resulting tree has ⌈log₁₉(10000/20) + 1⌉ = 4 levels. (1 point)
(b) For each level of the tree, how many nodes are at that level? (2 points)

(1) Since the nodes at each level are filled as much as possible, there are ⌈10000/20⌉ = 500 leaf nodes (on level 4). (A full index node has 2d + 1 = 19 children.) (1 point)
(2) Therefore there are ⌈500/19⌉ = 27 index pages on level 3, ⌈27/19⌉ = 2 index pages on level 2, and one index page on level 1 (the root of the tree). (1 point)
(c) How many levels would the resulting tree have if key compression is used and it reduces the average
size of each key in an entry to 10 bytes? (2 points)

(1) The solution is similar to part (a), except that the key has size 10 instead of size 20.
(2) An index page consists of at most 2d keys and 2d + 1 pointers, so we have to maximize d under the condition 2d · 10 + (2d + 1) · 8 ≤ 560. The solution is d = 15, which means that we can have 30 keys and 31 pointers on an index page. (1 point)
(3) A data entry on a leaf page consists of the key field and a pointer; its size is 10 + 8 = 18 bytes. Therefore a leaf page has space for ⌊560/18⌋ = 31 data entries. (1 point) The resulting tree has ⌈log₃₁(10000/31) + 1⌉ = 3 levels.
(d) How many levels would the resulting tree have without key compression but with all pages 70 percent
full? (3 points)
(1) Since each page should be filled only 70 percent, the usable size of a page is 560 · 0.70 = 392 bytes. The calculation is now the same as in part (a), but using pages of size 392 instead of 560. (1 point)
(2) An index page consists of at most 2d keys and 2d + 1 pointers, so we have to maximize d under the condition 2d · 20 + (2d + 1) · 8 ≤ 392. The solution is d = 6, which means that we can have 12 keys and 13 pointers on an index page. (1 point)
(3) A data entry on a leaf page consists of the key field and a pointer; its size is 20 + 8 = 28 bytes. Therefore a leaf page has space for ⌊392/28⌋ = 14 data entries. The resulting tree has ⌈log₁₃(10000/14) + 1⌉ = 4 levels. (1 point)
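The calculations in parts (a), (c), and (d) all follow the same pattern and can be checked with a short script. This is a sketch; the function and parameter names are my own.

```python
import math

def btree_levels(key_bytes, page_bytes=560, ptr_bytes=8, records=10000):
    # Index page: 2d keys and 2d + 1 pointers must fit on one page, so we
    # maximize d under 2d*key + (2d + 1)*ptr <= page.
    d = (page_bytes - ptr_bytes) // (2 * (key_bytes + ptr_bytes))
    fanout = 2 * d + 1
    # Leaf page: Alternative (2) data entries of (key, record id).
    per_leaf = page_bytes // (key_bytes + ptr_bytes)
    leaves = math.ceil(records / per_leaf)
    levels = 1 + math.ceil(math.log(leaves, fanout))
    return d, fanout, per_leaf, levels

print(btree_levels(20))                  # part (a): (9, 19, 20, 4)
print(btree_levels(10))                  # part (c): (15, 31, 31, 3)
print(btree_levels(20, page_bytes=392))  # part (d): (6, 13, 14, 4)
```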

| h(1) | h(0) | Primary Pages | Overflow Pages |
|------|------|----------------|----------------|
| 000 | 00 | 32, 8, 24 | |
| 001 | 01 | 9, 25, 41, 17 | |
| 010 | 10 | 14, 18, 10, 30 | |
| 011 | 11 | 31, 35, 7, 11 | |
| 100 | 00 | 44, 36 | |

Figure 1: Hash index for Question 3

## 1.3 Question 3: Linear Hashing (14 points)

Consider the Linear Hashing index shown in Figure 1. The round-number counter Level is 0 and the split counter Next is 1. Assume that we split whenever an overflow page is created. Answer the following questions.

(a) What can you say about the last entry that was inserted into the index? (1 point)

(1) Nothing can be said about the last entry inserted into the index: it could be any of the data entries in the index. (1 point)
(b) What can you say about the last entry that was inserted into the index if you know that there have
been no deletions from this index so far? (2 points)

(1) If the last item that was inserted had a hash code h(0) = 00, then it caused a split. (1 point)
(2) Otherwise, any value could have been inserted. (1 point)
(c) Suppose you know that there have been no deletions from this index so far. What can you say about
the last entry whose insertion into the index caused a split? (1 point)
(1) The last data entry that caused a split satisfies the condition h(0)(key value) = 00, as there are no overflow pages for any of the other buckets. (1 point)
(d) Show the index after inserting an entry with hash value 4. Indicate any change in round number or
split counter. (2 points)

Neither Next nor Level changes: entry 4 hashes to bucket 100 (h(1)(4) = 100), which still fits on the primary page, so no overflow page is created and no split occurs.

| h(1) | h(0) | Primary Pages | Overflow Pages |
|------|------|----------------|----------------|
| 000 | 00 | 32, 8, 24 | |
| 001 | 01 | 9, 25, 41, 17 | |
| 010 | 10 | 14, 18, 10, 30 | |
| 011 | 11 | 31, 35, 7, 11 | |
| 100 | 00 | 44, 36, 4 | |

Figure 2: Answer for question (d) (1 point)

(e) Show the original index after inserting an entry with hash value 15. Indicate any change in round number or split counter. (2 points)

Next is incremented to 2; Level remains 0. Entry 15 goes to bucket 011 (h(0)(15) = 11), which needs an overflow page, so bucket 01 is split; all of its entries (9, 25, 41, 17) have h(1) = 001, so the new bucket 101 remains empty.

| h(1) | h(0) | Primary Pages | Overflow Pages |
|------|------|----------------|----------------|
| 000 | 00 | 32, 8, 24 | |
| 001 | 01 | 9, 25, 41, 17 | |
| 010 | 10 | 14, 18, 10, 30 | |
| 011 | 11 | 31, 35, 7, 11 | 15 |
| 100 | 00 | 44, 36 | |
| 101 | 01 | | |

Figure 3: Answer for question (e) (1 point)

(f) Show the original index after deleting entries with hash values 36 and 44. Assume that the full deletion
algorithm is used. Indicate any change in round number or split counter. (2 points)

Next is 0 again, because bucket 100 became empty and was removed; Level remains 0. (1 point)

| h(1) | h(0) | Primary Pages | Overflow Pages |
|------|------|----------------|----------------|
| 000 | 00 | 32, 8, 24 | |
| 001 | 01 | 9, 25, 41, 17 | |
| 010 | 10 | 14, 18, 10, 30 | |
| 011 | 11 | 31, 35, 7, 11 | |

Figure 4: Answer for question (f) (1 point)

(g) Find a list of entries whose insertion into the original index would lead to a bucket with two overflow
pages. Use as few entries as possible to accomplish this. What is the maximum number of entries that
can be inserted into this bucket before a split occurs that reduces the length of this overflow chain? (4
points)

(1) The following constitutes a minimum list of entries that causes two overflow pages in the index: 63, 127, 255, 511, 1023. (1 point)
(2) The first insertion (63) causes a split and updates Next to 2. The insertion of 1023 causes a subsequent split, and Next is updated to 3, which points to this bucket. This overflow chain will not be redistributed until three more insertions (a total of 8 entries) are made. (1 point)
(3) In principle, if we choose data entries with key values of the form 2^k + 3 for sufficiently large k, the number of entries that can be inserted before the overflow chain shrinks can be made larger than any given number. This is because the initial index has 31 (binary 11111), 35 (binary 100011), 7 (binary 111), and 11 (binary 1011) in this bucket. (1 point)
(4) So by an appropriate choice of data entries as above, we can make a split of this bucket move just two values (7 and 31) to the new bucket. By choosing a sufficiently large k, we can delay the reduction of the length of the overflow chain through any number of splits of this bucket. (1 point)
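The splitting behaviour used throughout this question can be checked with a minimal Linear Hashing simulation. This is a sketch under the question's assumptions: a page holds 4 entries (read off Figure 1), a split happens whenever an overflow page is created, and the class and method names are my own.

```python
PAGE = 4  # entries per page, inferred from Figure 1

class LinearHash:
    def __init__(self, buckets, level=0, next_=1):
        self.buckets = [list(b) for b in buckets]  # primary + overflow, flattened
        self.level, self.next = level, next_

    def n(self):
        return 4 * 2 ** self.level  # number of buckets at the start of this round

    def bucket_of(self, key):
        i = key % self.n()
        # already-split buckets are addressed with one more hash bit
        return key % (2 * self.n()) if i < self.next else i

    def insert(self, key):
        b = self.bucket_of(key)
        self.buckets[b].append(key)
        # crossing a page boundary means a new overflow page: split bucket Next
        if len(self.buckets[b]) > PAGE and len(self.buckets[b]) % PAGE == 1:
            self.split()

    def split(self):
        old, n2 = self.next, 2 * self.n()
        image = [k for k in self.buckets[old] if k % n2 != old]
        self.buckets[old] = [k for k in self.buckets[old] if k % n2 == old]
        self.buckets.append(image)
        self.next += 1
        if self.next == self.n():
            self.level, self.next = self.level + 1, 0

FIG1 = [[32, 8, 24], [9, 25, 41, 17], [14, 18, 10, 30], [31, 35, 7, 11], [44, 36]]

lh = LinearHash(FIG1)
lh.insert(4)                   # part (d): 4 goes to bucket 100, no split
print(lh.buckets[4], lh.next)  # [44, 36, 4] 1

lh = LinearHash(FIG1)
lh.insert(15)                  # part (e): bucket 011 overflows, bucket 01 splits
print(lh.buckets[3], lh.next)  # [31, 35, 7, 11, 15] 2

lh = LinearHash(FIG1)
for k in (63, 127, 255, 511, 1023):  # part (g): two overflow pages on bucket 011
    lh.insert(k)
print(len(lh.buckets[3]), lh.next)   # 9 3
```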

## 1.4 Question 4: Indexing Costs (16 points)

You are given the following information:
Executives has attributes ename, title, dname, and address; all are string fields of the same length.
The ename attribute is a candidate key.
The relation contains 8,000 pages.
There are 5 buffer pages.
Suppose that the query is as follows:

SELECT E.title, COUNT(*) FROM Executives E
WHERE E.dname > 'W%' GROUP BY E.title

Assume that only 10% of Executives tuples meet the selection condition. Please show your work for all
questions.

(a) Suppose that a clustered B+ tree index on title is (the only index) available. What is the cost of the
best plan? If an additional index (on any search key you want) is available, would it help to produce a
better plan? (4 points)

(1) Using a clustered B+ tree index on title, the cost of the given query is 8000 I/Os. (2 points)
(2) The addition of another index would not lower the cost of any evaluation strategy that also uses the given index.
(3) However, the query would be significantly cheaper if a clustered index on dname were available.
(4) It would be even better to have a composite index on <title, dname> for this particular case (as in part (e)). (2 points)
(b) Suppose that an unclustered B+ tree index on title is (the only index) available. What is the cost of
the best plan? (3 points)

(1) The cost is also 8000 I/Os, but we additionally need to sort the result. (1 point)
(2) The page count for sorting is reduced to 1/4, because the unwanted attributes are projected out. So we only have to sort 8000 · 1/4 · 10% = 200 pages. (1 point)
(3) We use external sorting, so we first have to calculate the number of passes. In the first pass we can use all 5 buffer pages to create sorted runs; in the merge passes we use 4 input pages and 1 output page.
Pass 1: ⌈200/5⌉ = 40 runs
Pass 2: ⌈40/4⌉ = 10 runs
Pass 3: ⌈10/4⌉ = 3 runs
Pass 4: merge the remaining three runs
(4) This means we need 4 passes in total. (1 point)
(5) Each pass reads and writes every page (×2), so the sorting cost is 4 · 2 · 200 = 1600. (1 point)
(6) The total cost is 8000 + 1600 = 9600 I/Os.
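The pass counting above generalizes to a small helper. This is a sketch; the function name and the run-counting convention (pass 1 builds runs, later passes merge) are my own.

```python
import math

def external_sort_cost(pages, buffers):
    runs = math.ceil(pages / buffers)  # pass 1: sorted runs of `buffers` pages
    passes = 1
    while runs > 1:                    # later passes: (buffers - 1)-way merges
        runs = math.ceil(runs / (buffers - 1))
        passes += 1
    return passes, 2 * passes * pages  # each pass reads and writes every page

print(external_sort_cost(200, 5))  # (4, 1600)
```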
(c) Suppose that a clustered B+ tree index on dname is (the only index) available. What is the cost of the
best plan? If an additional index (on any search key you want) is available, would it help to produce a
better plan? (4 points)
(1) The optimal plan with the given index scans the dname index and sorts the title fields of the records that satisfy the WHERE condition. (2 points)
(2) The index has a size of 8000 · 1/4 = 2000 pages (as it only contains the dname field). (1 point)
(3) Scanning the relevant portion costs 2000 · 10% = 200 I/Os.
(4) Retrieving the qualifying records: 8000 · 10% = 800 I/Os. (1 point)
(5) Writing out the title records: 8000 · 10% · 0.25 = 200 I/Os. (1 point)
(6) Finally we have to sort those 200 pages (same as above, 1600 I/Os). (1 point)
(7) So we have a total cost of 200 [scan] + 800 [retrieve] + 200 [write out] + 1600 [sort] = 2800 I/Os.
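The cost breakdown for this plan can be tallied directly. A sketch; the variable names are my own, and the sort cost is the 1600 I/Os computed in part (b).

```python
pages, selectivity, field_fraction = 8000, 0.10, 0.25
sort_cost = 1600  # from part (b): 4 passes over 200 pages, read + write

scan     = pages * field_fraction * selectivity  # scan dname-only index leaves
retrieve = pages * selectivity                   # fetch the qualifying records
project  = pages * selectivity * field_fraction  # write out the title fields
total = scan + retrieve + project + sort_cost
print(int(scan), int(retrieve), int(project), int(total))  # 200 800 200 2800
```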

(d) Suppose that a clustered B+ tree index on <dname, title> is (the only index) available. What is the
cost of the best plan? (3 points)