Solutions Exam Advanced Databases June 2008

1. (1 point)
Consider the following XML document:

<Vehicles>
<Car manf="Hyundai">
<Model>Azera</Model>
<HorsePower>240</HorsePower>
</Car>
<Car manf="Toyota">
<Model>Camry</Model>
<HorsePower>240</HorsePower>
</Car>
<Truck manf="Toyota">
<Model>Tundra</Model>
<HorsePower>240</HorsePower>
</Truck>
<Car manf="Hyundai">
<Model>Elantra</Model>
<HorsePower>120</HorsePower>
</Car>
<Car manf="Toyota">
<Model>Prius</Model>
<HorsePower>120</HorsePower>
</Car>
</Vehicles>

What are the results of the following XPath expressions?


(a) /Vehicles/Car[@manf="Toyota"]/HorsePower
(b) /Vehicles/*[HorsePower>200]/Model
(c) //*[@manf="Toyota"]/@manf
(d) /*/*/../..

Solution:
(a) <HorsePower>240</HorsePower>,
<HorsePower>120</HorsePower>
(b) <Model>Azera</Model>,
<Model>Camry</Model>,
<Model>Tundra</Model>
(c) manf="Toyota",
manf="Toyota",
manf="Toyota"
(d) The complete document is returned once.
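
The answers to (a), (b) and (c) can be checked mechanically. The following is a minimal sketch in Python, assuming the standard-library module xml.etree.ElementTree. That module implements only a subset of XPath, so (a) is evaluated as a path expression, while the numeric comparison in (b) and the attribute step in (c) are emulated with ordinary Python filters. (d) is left out because the sketch only works with the element tree below the document node. The names DOC, a, b and c are illustrative.

```python
import xml.etree.ElementTree as ET

DOC = """<Vehicles>
<Car manf="Hyundai"><Model>Azera</Model><HorsePower>240</HorsePower></Car>
<Car manf="Toyota"><Model>Camry</Model><HorsePower>240</HorsePower></Car>
<Truck manf="Toyota"><Model>Tundra</Model><HorsePower>240</HorsePower></Truck>
<Car manf="Hyundai"><Model>Elantra</Model><HorsePower>120</HorsePower></Car>
<Car manf="Toyota"><Model>Prius</Model><HorsePower>120</HorsePower></Car>
</Vehicles>"""

root = ET.fromstring(DOC)  # root is the <Vehicles> element

# (a) /Vehicles/Car[@manf="Toyota"]/HorsePower
a = [hp.text for hp in root.findall("Car[@manf='Toyota']/HorsePower")]

# (b) /Vehicles/*[HorsePower>200]/Model (comparison emulated in Python)
b = [v.findtext("Model") for v in root if int(v.findtext("HorsePower")) > 200]

# (c) //*[@manf="Toyota"]/@manf (attribute selection emulated in Python)
c = [e.get("manf") for e in root.iter() if e.get("manf") == "Toyota"]

print(a)
print(b)
print(c)
```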

2. (1 point)
Consider an XML document containing course information for students and
that conforms to the following DTD:

<!DOCTYPE Classes [
<!ELEMENT Classes (Class*)>
<!ELEMENT Class (Topic, Students)>
<!ELEMENT Topic (#PCDATA)>
<!ELEMENT Students (Student+)>
<!ELEMENT Student EMPTY>
<!ATTLIST Student Name CDATA #REQUIRED> ]>

We have the following two queries:

- Query Q1 in XPath:
/Classes/Class
[Students/Student/@Name != Students/Student/@Name]/Topic
- and query Q2 in XQuery:
for $c in /Classes/Class
for $s1 in $c/Students/Student
for $s2 in $c/Students/Student
where $s1/@Name != $s2/@Name
return $c/Topic

What is the difference in meaning between these two queries?

Solution: Q1 returns the topic of every class in which at least two students
have different names. The predicate is existentially quantified, so each such
class contributes its topic once. Q2 selects the same classes, but returns the
topic once for every ordered pair ($s1, $s2) of students with different names.
The result of Q2 hence potentially contains many duplicate topics.
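
The difference can be made concrete with a small emulation. This is only a sketch in Python, not an XPath or XQuery engine: the sample data and the function names q1 and q2 are hypothetical, and each class is modelled as a pair of a topic and a list of student names.

```python
from itertools import product

# Hypothetical sample data: each class is (topic, list of student names).
classes = [
    ("Databases", ["Ann", "Bob", "Ann"]),
    ("Logic", ["Eve", "Eve"]),
]

def q1(classes):
    """Existential semantics: one topic per class that has some differing pair."""
    return [topic for topic, names in classes
            if any(n1 != n2 for n1, n2 in product(names, repeat=2))]

def q2(classes):
    """Nested for-loops: one topic per ordered pair of differing students."""
    return [topic for topic, names in classes
            for n1, n2 in product(names, repeat=2) if n1 != n2]

print(q1(classes))
print(q2(classes))
```

With this data, q1 yields the topic "Databases" once, whereas q2 yields it four times, once for each ordered pair of students whose names differ.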

3. (1 point)
Explain the following rule in the formal semantic of LiXQuery:

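\[
\frac{
  \begin{array}{c}
  St, En \vdash e \Rightarrow (St_1, \langle s \rangle) \qquad s \in S - \{\text{``''}\} \qquad St_3 = St_1 \cup St_2 \qquad V_{St_2} = \{r\} \\
  r \in V^t \qquad \sigma_{St_2}(r) = s \qquad \forall n, n' \in V \; ((n \ll_{St_1} n') \Rightarrow (n \ll_{St_3} n'))
  \end{array}
}{
  St, En \vdash \mathtt{text}\{e\} \Rightarrow (St_3, \langle r \rangle)
}
\]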

Solution: The rule describes what happens when a text node is constructed.
Given a store St and an environment En, evaluating expression e yields a new
store St1 and the result sequence ⟨s⟩, where s is a non-empty string. St2 is a
store that contains exactly one node, the text node r, whose value is s. St1
and St2 together form the store St3, in which the document order of St1 is
preserved. Evaluating text{e} on St, En therefore results in the store St3 and
the result sequence ⟨r⟩, that is, the new text node r with the non-empty value s.

4. (2.5 points)
Consider a datacube with dimensions Student, Course, and Semester, and
measure Grade. There is no hierarchy defined over the dimensions. The
cube contains information for 200 students, 6 courses and 6 semesters. The
number of students that followed a particular course in a particular semester
is given in the following table (an empty cell indicates that a particular course
was not taught during that semester):

Semesters
Courses S1 S2 S3 S4 S5 S6 Total
C1 20 35 25 15 95
C2 40 35 45 120
C3 29 30 28 17 104
C4 50 69 46 165
C5 40 35 42 117
C6 12 15 27
Total 110 69 169 47 144 89 628

The following tables summarize how many students followed the different
courses (some students followed a course more than once), and how many
students were subscribed in the different semesters (most students are
enrolled in more than one course per semester):

     C1   C2   C3   C4   C5   C6   Total          S1   S2   S3   S4   S5   S6   Total
#    80   105  90   150  102  27   554       #    50   30   85   20   72   40   297

(a) Consider the 8 views:


SELECT D1, ..., Dn, COUNT(*)
FROM CUBE
GROUP BY D1, ..., Dn
where for D1, ..., Dn we consider the 8 subsets of the dimensions.
Determine the number of tuples in these different views.
(b) Apply the greedy algorithm presented by Harinarayan, Rajaraman, and
Ullman in their ACM SIGMOD 1996 paper "Implementing data cubes
efficiently" to select 2 views to materialize. What is the total benefit of
materializing these 2 views?

Solution:

(a) These queries are just plain SQL queries, so no null values are introduced
in their answers. There might have been some confusion with the way the null
value is used in ROLAP. The goal of this question was to determine the sizes
of all views needed in part (b). In case of a wrong answer in part (a), part
(b) was corrected as if the numbers in the answer to (a) were the right ones;
i.e., it was perfectly possible to get part (a) wrong and still get full
marks on part (b).

The sizes of the views:

Grouping attributes   Size   Explanation

()                    1      This is the total count; 1 tuple
(St)                  200    There are 200 students
(Se)                  6      There are 6 semesters
(C)                   6      There are 6 courses
(St,Se)               297    There are 297 student-semester combinations
                             (bottom-right table)
(St,C)                554    There are 554 student-course combinations
                             (bottom-left table)
(Se,C)                19     There are 19 course-semester combinations
                             (top table, number of filled cells)
(St,Se,C)             628    There are 628 tuples in total in the database
                             (top table, grand total)

Se denotes Semester, St denotes Student, and C denotes Course.

(b) First we start by visualizing the partial order between the different views:

                          (St,Se,C) : 628 : 628

(St,Se) : 297 : 628       (St,C) : 554 : 628       (Se,C) : 19 : 628

(St) : 200 : 628          (Se) : 6 : 628           (C) : 6 : 628

                          () : 1 : 628

Arrows:
(St,Se) → (St,Se,C)    (St,C) → (St,Se,C)    (Se,C) → (St,Se,C)
(St) → (St,Se)         (St) → (St,C)
(Se) → (St,Se)         (Se) → (Se,C)
(C) → (St,C)           (C) → (Se,C)
() → (St)              () → (Se)             () → (C)

The numbers : x : y after the views indicate the size x of the view and
the current cost y of computing the view, taking into account that only
(St,Se,C) is materialized. An arrow from view V to view W denotes that for
the computation of V we can use view W .
We compute the benefits of materializing the views. Initially we start with
only view (St,Se,C) materialized; this is the base table, which always needs
to be present as no other views can be used to compute it.
The benefits are:

()       1 × (628 − 1)   = 627
(St)     2 × (628 − 200) = 856
(Se)     2 × (628 − 6)   = 1244
(C)      2 × (628 − 6)   = 1244
(St,Se)  4 × (628 − 297) = 1324
(St,C)   4 × (628 − 554) = 296
(Se,C)   4 × (628 − 19)  = 2436

Of all views, it is obvious that (Se,C) gives the highest benefit. Hence, in
the first step (Se,C) is selected. With the views in S = {(Se, C), (St, Se, C)}
materialized, the costs become:

                          (St,Se,C) : 628 : 628

(St,Se) : 297 : 628       (St,C) : 554 : 628       (Se,C) : 19 : 19

(St) : 200 : 628          (Se) : 6 : 19            (C) : 6 : 19

                          () : 1 : 19

Arrows:
(St,Se) → (St,Se,C)    (St,C) → (St,Se,C)    (Se,C) → (St,Se,C)
(St) → (St,Se)         (St) → (St,C)
(Se) → (St,Se)         (Se) → (Se,C)
(C) → (St,C)           (C) → (Se,C)
() → (St)              () → (Se)             () → (C)

Which results in the following benefits:

()       1 × (19 − 1)    = 18
(St)     1 × (628 − 200) = 428
(Se)     2 × (19 − 6)    = 26
(C)      2 × (19 − 6)    = 26
(St,Se)  2 × (628 − 297) = 662
(St,C)   2 × (628 − 554) = 148

Clearly, (St,Se) gives the largest benefit. The views selected by the
greedy algorithm are hence (Se,C) and (St,Se), giving a total benefit of
4 × (628 − 19) + 2 × (628 − 297) = 2436 + 662 = 3098. It was not necessary to
completely work out the numbers, as the relative magnitude of the benefits
was obvious.
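
The two greedy rounds can be replayed in code. The following is a minimal sketch in Python of the greedy selection as it is applied in this solution, using the view sizes from part (a). The names SIZE, TOP and greedy are illustrative, and a view is represented by the frozenset of its grouping attributes, so that "W can be computed from V" becomes the subset test W <= V.

```python
DIMS = ("St", "Se", "C")

# View sizes from part (a), keyed by the set of grouping attributes.
SIZE = {
    frozenset(): 1,
    frozenset({"St"}): 200,
    frozenset({"Se"}): 6,
    frozenset({"C"}): 6,
    frozenset({"St", "Se"}): 297,
    frozenset({"St", "C"}): 554,
    frozenset({"Se", "C"}): 19,
    frozenset(DIMS): 628,
}
TOP = frozenset(DIMS)  # the base table, always materialized

def greedy(k):
    """Select k views greedily; return the chosen views and the total benefit."""
    # cost[w]: size of the smallest materialized view from which w can be computed
    cost = {w: SIZE[TOP] for w in SIZE}
    chosen, total = [], 0
    for _ in range(k):
        def benefit(v):
            # every view w whose attributes are a subset of v can be answered from v
            return sum(max(cost[w] - SIZE[v], 0) for w in SIZE if w <= v)
        candidates = [v for v in SIZE if v != TOP and v not in chosen]
        best = max(candidates, key=benefit)
        total += benefit(best)
        chosen.append(best)
        for w in SIZE:
            if w <= best:
                cost[w] = min(cost[w], SIZE[best])
    return chosen, total

chosen, total = greedy(2)
print([sorted(v) for v in chosen], total)
```

The sketch selects (Se,C) in the first round and (St,Se) in the second, with a total benefit of 3098.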

5. (2 points)
Consider the following Datalog-program.

u = {(a), (b), (c)}, and v = {(b), (c), (d)}

are the extensional relations. r, s, and t are intensionally defined as follows:

t(X,Y) :- u(X), u(Y), not(v(X)).


r(X) :- u(X), v(Y), not(t(X,Y)).
s(X) :- r(Y), t(Y,X), not(r(X)).

(a) Is this program safe? Is it stratified?


(b) Give the contents of the intensional relations in the stratified model;
that is, give the tuples in the intensional relations in the stratified
semantics. Explain briefly the different steps in the computation of the
stratified model.
(c) Give one other minimal model of this program. Explain why this is a
minimal model.

Solution:

(a) The program is safe; for every rule, every variable in the head occurs
positively in the body and every variable in the body occurs positively in a
non-arithmetic literal.
The program is also stratified, as there is no pair of intensional relations
such that the first one negatively depends on the second one and vice versa
(the program does not even contain any recursion at all).

(b) Stratum 0 contains the extensional relations u and v and is fixed.


Stratum 1 contains the intensional relation t. Using the rule for t, we get
the following instantiation:

t = {(a, a), (a, b), (a, c)}

Stratum 2 contains the intensional relation r. r is not in stratum 1 as it
depends negatively on a relation in this stratum, namely t. Using the rule
for r, we get the following instantiation:

r = {(a), (b), (c)}

The correctness can be seen as follows: for X we can pick any element in u,
and for Y we can always pick d, since t contains no tuple of the form (X, d).
Stratum 3 contains the intensional relation s. s is not in stratum 2 as it
depends negatively on a relation in this stratum, namely r. Using the rule
for s, we get the following instantiation:

s = {}
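
The stratum-by-stratum computation can be written down directly. The following is a minimal sketch in Python, assuming the relations are modelled as sets and that each stratum is evaluated only after the previous one is complete, so that negation is always applied to a finished relation.

```python
# Stratum 0: extensional relations (fixed).
u = {"a", "b", "c"}
v = {"b", "c", "d"}

# Stratum 1: t(X,Y) :- u(X), u(Y), not(v(X)).
t = {(x, y) for x in u for y in u if x not in v}

# Stratum 2: r(X) :- u(X), v(Y), not(t(X,Y)).
r = {x for x in u for y in v if (x, y) not in t}

# Stratum 3: s(X) :- r(Y), t(Y,X), not(r(X)).
s = {x for (y, x) in t if y in r and x not in r}

print(sorted(t), sorted(r), sorted(s))
```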

(c) Another minimal model of this Datalog-program is the following:

u = {(a), (b), (c)}
v = {(b), (c), (d)}
t = {(a, a), (a, b), (a, c), (a, d)}
r = {(b), (c)}
s = {}

This instantiation clearly is a model of the program. It can be seen to be
minimal as follows. The extensional relations cannot be touched; they are
fixed by the program. For the relation t, due to the rule for t, (a, a), (a, b),
and (a, c) have to be in it. If they are not in t, the instantiation is not
consistent with the rules. Since (b, b) and (c, c) are not in t, the rule for r
requires (b) and (c) to be in r. Finally, if we remove the tuple (a, d) from t,
the rule for r additionally requires (a) to be in r, which is not the case.
Hence no proper subset of the instantiation is still a model of the program.
Therefore, the proposed model is also minimal.
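
Both claims (being a model, and being minimal) can be verified by brute force, since the instantiation is tiny. The following is a sketch in Python under the assumption that a model is minimal when no proper subset of its intensional facts, with u and v kept fixed, is again a model. The helper names is_model and is_minimal_model are illustrative.

```python
from itertools import combinations, product

U = {"a", "b", "c"}
V = {"b", "c", "d"}
DOMAIN = sorted(U | V)

def is_model(t, r, s):
    """True if (t, r, s) satisfies all three rules for the fixed u and v."""
    for x, y in product(DOMAIN, repeat=2):
        # t(X,Y) :- u(X), u(Y), not(v(X)).
        if x in U and y in U and x not in V and (x, y) not in t:
            return False
        # r(X) :- u(X), v(Y), not(t(X,Y)).
        if x in U and y in V and (x, y) not in t and x not in r:
            return False
        # s(X) :- r(Y), t(Y,X), not(r(X)).
        if y in r and (y, x) in t and x not in r and x not in s:
            return False
    return True

def is_minimal_model(t, r, s):
    """Brute force: no proper subset of the intensional facts may be a model."""
    if not is_model(t, r, s):
        return False
    facts = [("t", f) for f in t] + [("r", f) for f in r] + [("s", f) for f in s]
    for k in range(len(facts)):  # k < len(facts), so only proper subsets
        for subset in combinations(facts, k):
            t2 = {f for rel, f in subset if rel == "t"}
            r2 = {f for rel, f in subset if rel == "r"}
            s2 = {f for rel, f in subset if rel == "s"}
            if is_model(t2, r2, s2):
                return False
    return True

alt_t = {("a", "a"), ("a", "b"), ("a", "c"), ("a", "d")}
alt_r = {"b", "c"}
alt_s = set()
print(is_minimal_model(alt_t, alt_r, alt_s))
```

The same check also accepts the stratified model, which illustrates that a program with negation can have several incomparable minimal models.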

6. (2.5 points)
Consider the following database consisting of only one relation:

R1
a b
b c
c d
d e
e f
f g
g h

(a) Construct the Gaifman graph for this database.


(b) Give all pairs x, y such that (x) ≈_2^D (y).

(c) Show that there cannot exist a relational algebra query Q that returns
the middle element a_{(l+1)/2} of a chain
R_l = {(a_1, a_2), (a_2, a_3), ..., (a_{l−1}, a_l)}
for all l ≥ 3, l odd. That is, Q(R_3) = {(a_2)}, Q(R_5) = {(a_3)},
Q(R_7) = {(a_4)}, ...

Solution:

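(a) a - b - c - d - e - f - g - h
(an undirected path: two elements are adjacent if and only if they occur
together in a tuple of R1)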

(b) Obviously all pairs x, x with x ∈ {a, b, c, d, e, f, g, h} are 2-equivalent in D.
Furthermore, the pairs (x, y) with x, y ∈ {c, d, e, f} are also 2-equivalent in
D. The different neighborhoods are:

N_2^D((a)) = ⟨{(a, b), (b, c)}, a⟩
N_2^D((b)) = ⟨{(a, b), (b, c), (c, d)}, b⟩
N_2^D((c)) = ⟨{(a, b), (b, c), (c, d), (d, e)}, c⟩
N_2^D((d)) = ⟨{(b, c), (c, d), (d, e), (e, f)}, d⟩
N_2^D((e)) = ⟨{(c, d), (d, e), (e, f), (f, g)}, e⟩
N_2^D((f)) = ⟨{(d, e), (e, f), (f, g), (g, h)}, f⟩
N_2^D((g)) = ⟨{(e, f), (f, g), (g, h)}, g⟩
N_2^D((h)) = ⟨{(f, g), (g, h)}, h⟩

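Notice that for the pairs c − d, d − e, e − f the following bijection π can be
used:

a ↦ b, b ↦ c, c ↦ d, d ↦ e, e ↦ f, f ↦ g, g ↦ h, h ↦ a

For the remaining pairs c − e, d − f, and c − f, π can be applied two or three
times in a row.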
For most pairs x − y not in the answer, the non-equivalence follows obviously
from the number of elements in the part of R1 that is "visible" from (x) and
(y). For the pair a − h, the reason lies in the role the constant a plays in
N_2^D((a)), and h in N_2^D((h)); any bijection from N_2^D((a)) to N_2^D((h)) must
map a to h, leading to a tuple (π(a), π(b)) = (h, ∗) that is not in R1, and
hence surely not in the part of R1 that is "visible" from (h).

Visually, we can represent the different neighborhoods as follows (the
distinguished element is drawn filled):

N_2^D((a)) ≡ • → ◦ → ◦
N_2^D((b)) ≡ ◦ → • → ◦ → ◦
N_2^D((c)) ≡ ◦ → ◦ → • → ◦ → ◦
N_2^D((d)) ≡ ◦ → ◦ → • → ◦ → ◦
N_2^D((e)) ≡ ◦ → ◦ → • → ◦ → ◦
N_2^D((f)) ≡ ◦ → ◦ → • → ◦ → ◦
N_2^D((g)) ≡ ◦ → ◦ → • → ◦
N_2^D((h)) ≡ ◦ → ◦ → •
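
The list of 2-equivalent pairs can be double-checked by brute force. The following is a sketch in Python: it computes the 2-neighborhood of each element through the Gaifman graph and then tries every bijection between two neighborhoods, which is feasible here because a neighborhood has at most five elements. The function names neighborhood and equivalent are illustrative.

```python
from itertools import combinations, permutations

R1 = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"),
      ("e", "f"), ("f", "g"), ("g", "h")}

def neighborhood(rel, x, radius):
    """Elements within `radius` of x in the Gaifman graph, plus the induced tuples."""
    adj = {}
    for p, q in rel:
        adj.setdefault(p, set()).add(q)
        adj.setdefault(q, set()).add(p)
    seen, frontier = {x}, {x}
    for _ in range(radius):
        frontier = {n for f in frontier for n in adj.get(f, ())} - seen
        seen |= frontier
    return seen, {(p, q) for p, q in rel if p in seen and q in seen}

def equivalent(rel, x, y, radius):
    """Brute-force search for an isomorphism of neighborhoods that maps x to y."""
    nx, tx = neighborhood(rel, x, radius)
    ny, ty = neighborhood(rel, y, radius)
    if len(nx) != len(ny) or len(tx) != len(ty):
        return False
    xs = sorted(nx)
    for perm in permutations(sorted(ny)):
        pi = dict(zip(xs, perm))
        if pi[x] == y and {(pi[p], pi[q]) for p, q in tx} == ty:
            return True
    return False

elements = sorted({e for pair in R1 for e in pair})
pairs = [(x, y) for x, y in combinations(elements, 2) if equivalent(R1, x, y, 2)]
print(pairs)
```

Apart from the trivial pairs x, x, the sketch reports exactly the six unordered pairs over {c, d, e, f}.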

(c) We give a proof by contradiction. Suppose that there is a relational
algebra query Q that always returns the middle element of a chain with an odd
number of elements.

1. As every relational algebra query is local, Q must have a finite locality
rank, let's say r.
2. Consider now the chain R_{2r+3}. The middle element is a_{r+2}, and hence
Q(R_{2r+3}) = {(a_{r+2})}.
3. Notice, however, that (a_{r+1}) ≈_r^D (a_{r+2}), where D = R_{2r+3}. The
r-neighborhoods of (a_{r+1}) and (a_{r+2}) are as follows:

N_r^D((a_{r+1})) = ⟨{(a_1, a_2), (a_2, a_3), ..., (a_{2r}, a_{2r+1})}, a_{r+1}⟩
N_r^D((a_{r+2})) = ⟨{(a_2, a_3), (a_3, a_4), ..., (a_{2r+1}, a_{2r+2})}, a_{r+2}⟩

The bijection that maps a_i to a_{i+1} for all i = 1 ... 2r + 2, and a_{2r+3} to
a_1, is an isomorphism from N_r^D((a_{r+1})) to N_r^D((a_{r+2})). Using a similar
visualization as above, with r nodes on either side of the distinguished
element, we get:

N_r^D((a_{r+1})) ≡ ◦ → ... → ◦ → • → ◦ → ... → ◦
N_r^D((a_{r+2})) ≡ ◦ → ... → ◦ → • → ◦ → ... → ◦

4. Hence, (a_{r+1}) is in the answer Q(R_{2r+3}) if and only if (a_{r+2}) is.
This is, however, in contradiction with 2.

As our assumption leads to a contradiction, we can conclude that it is incorrect
and hence that Q cannot be expressed as a relational algebra query.
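
The key step of the proof can be illustrated for small locality ranks. The following Python sketch relies on a shortcut that holds only for chains: the r-neighborhood of a_i is again a chain, so its isomorphism type is determined by the number of nodes before and after the distinguished element. The function name neighborhood_type is illustrative.

```python
def neighborhood_type(length, i, r):
    """Isomorphism type of the r-neighborhood of a_i in a_1 -> ... -> a_length.

    On a chain the r-neighborhood is itself a chain, so its type is fixed by
    the number of nodes before and after the distinguished element a_i.
    """
    return (min(i - 1, r), min(length - i, r))

for r in range(1, 6):
    length = 2 * r + 3   # the chain R_{2r+3} used in the proof
    middle = r + 2       # index of the middle element
    same = neighborhood_type(length, middle - 1, r) == neighborhood_type(length, middle, r)
    print(r, length, same)
```

For every r tried, the middle element and its predecessor have the same neighborhood type, so a query of locality rank r cannot tell them apart. On the shorter chain R_{2r+1} the two types differ, which is why the proof uses a chain of length 2r + 3.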

BONUS (1 point) A safe Datalog-program is always domain independent. Is the
opposite direction also true? That is, is a domain independent query always
safe? Prove or give a counter-example.

Solution: The opposite direction is not true. (Domain independence is even
undecidable, whereas safety involves only simple syntactic checks of the program.)
One example of a program that is not safe according to the definitions but that is
domain-independent:

r(X) :- s(Y), not(s(Y)).

The crux in this example is that the variable X in the head does not occur in a
positive non-arithmetic literal in the body, but the body is constructed in such
a way that this does not matter; the body will never be true anyway. As such, for
the result of the program it does not matter what the actual domains of X and Y
are. There were also constructions possible with arithmetic operators. One such
solution is:

r(X) :- plus(X,0,0).

The arithmetic operator plus is introduced as if it were an infinite extensional
relation, and the safety condition rather stringently requires every variable in
the head to appear in a positive non-arithmetic literal in the body, so the
program is not safe. Since we can reasonably assume that the infinite extensional
relation contains (X,0,0) for only one X, the program is nevertheless obviously
domain-independent.
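
The first counter-example can be illustrated by evaluating it over different domains. This is only a sketch in Python under the assumption that an unsafe rule is evaluated by letting its variables range over an explicitly given domain; the function name evaluate_r and the sample relation s are hypothetical.

```python
def evaluate_r(domain, s):
    """Evaluate r(X) :- s(Y), not(s(Y)) with X and Y ranging over `domain`."""
    return {x for x in domain for y in domain if y in s and y not in s}

s = {1, 2}
small = {1, 2}
large = set(range(100))
print(evaluate_r(small, s), evaluate_r(large, s))
```

The body is unsatisfiable, so the result is empty for both domains, which is what domain independence requires even though X is not bound by any positive literal.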
