CS240A: Databases and Knowledge Bases

Carlo Zaniolo
Department of Computer Science
University of California, Los Angeles
December, 2001
Notes from the textbook:
Advanced Database Systems, by Zaniolo, Ceri, Faloutsos, Snodgrass, Subrahmanian and Zicari. Morgan Kaufmann, 1997.

A relational DB about students and courses

student
Name          Major   Year
'Joe Doe'     cs      senior
'Jim Jones'   cs      junior
'Jim Black'   ee      junior

took
Name          Course  Grade
'Joe Doe'     cs123   2.7
'Jim Jones'   cs101   3.0
'Jim Jones'   cs143   3.3
'Jim Black'   cs143   3.3
'Jim Black'   cs101   2.7

The same fact base for Datalog

student('Joe Doe', cs, senior).
student('Jim Jones', cs, junior).
student('Jim Black', ee, junior).

took('Joe Doe', cs123, 2.7).
took('Jim Jones', cs101, 3.0).
took('Jim Jones', cs143, 3.3).
took('Jim Black', cs143, 3.3).
took('Jim Black', cs101, 2.7).
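
Rules
How to express logical conjunction:
Find the names of junior-level students who have taken both cs101 and cs143.

firstreq(Name) ← student(Name, Major, junior),
                 took(Name, cs101, Grade1),
                 took(Name, cs143, Grade2).

The atom to the left of ← is the rule head; the goals to its right form the rule body.
Names starting with an upper-case letter are variables, names starting with a lower-case letter are constants, and _ denotes an anonymous variable.
The commas in the body represent logical conjunction.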
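
A minimal Python sketch of how the firstreq rule can be evaluated over the fact base. Storing each predicate as a set of tuples and writing the rule as a set comprehension are illustration choices, not part of the notes; each comma in the rule body becomes one more nested loop with a filter.

```python
# Facts from the student and took tables, stored as sets of tuples.
student = {
    ("Joe Doe", "cs", "senior"),
    ("Jim Jones", "cs", "junior"),
    ("Jim Black", "ee", "junior"),
}
took = {
    ("Joe Doe", "cs123", 2.7),
    ("Jim Jones", "cs101", 3.0),
    ("Jim Jones", "cs143", 3.3),
    ("Jim Black", "cs143", 3.3),
    ("Jim Black", "cs101", 2.7),
}


def firstreq():
    """firstreq(Name) <- student(Name, Major, junior),
    took(Name, cs101, Grade1), took(Name, cs143, Grade2)."""
    return {
        name
        for (name, _major, year) in student
        if year == "junior"
        for (n1, c1, _g1) in took
        if n1 == name and c1 == "cs101"
        for (n2, c2, _g2) in took
        if n2 == name and c2 == "cs143"
    }


print(sorted(firstreq()))
```

Sharing the variable Name across the three goals shows up as the explicit tests n1 == name and n2 == name.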

Disjunction
Junior-level students who took course cs131 or course cs151 with a grade better than 3.0:

scndreq(Name) ← took(Name, cs131, Grade),
                Grade > 3.0, student(Name, Major, junior).
scndreq(Name) ← took(Name, cs151, Grade),
                Grade > 3.0, student(Name, _, junior).

Disjunction is expressed by writing several rules with the same head.
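
A small Python sketch of the two scndreq rules, assuming predicates are stored as sets of tuples. The took facts here are hypothetical, since the original tables contain no cs131 or cs151 grades. Each rule is evaluated separately and the results are united.

```python
# student facts from the tables in these notes.
student = {
    ("Joe Doe", "cs", "senior"),
    ("Jim Jones", "cs", "junior"),
    ("Jim Black", "ee", "junior"),
}
# Hypothetical took facts, invented for this illustration only.
took = {
    ("Jim Jones", "cs131", 3.3),
    ("Jim Black", "cs131", 2.7),
    ("Jim Black", "cs151", 3.7),
    ("Joe Doe", "cs151", 4.0),
}


def scndreq_rule(course):
    """One rule: took(Name, course, Grade), Grade > 3.0, student(Name, _, junior)."""
    return {
        name
        for (name, c, grade) in took
        if c == course and grade > 3.0
        for (sname, _major, year) in student
        if sname == name and year == "junior"
    }


# Two rules with the same head: the predicate is the union of their answers.
scndreq = scndreq_rule("cs131") | scndreq_rule("cs151")
print(sorted(scndreq))
```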

Queries
A closed query: the answer to such a query is either yes or no. For instance,
? firstreq('Jim Black')
An open query: ? firstreq(X)
and its answer: firstreq('Jim Jones')
                firstreq('Jim Black')
Cascading rules gives much power and convenience.
E.g., both requirements must be satisfied to enroll in cs298:
reqcs298(Name) ← firstreq(Name), scndreq(Name).
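
A rough Python sketch of closed queries, open queries and cascading, assuming derived predicates have already been computed as sets of names. The firstreq set is the answer listed in the notes; the scndreq set and the helper names closed_query and open_query are hypothetical.

```python
firstreq = {"Jim Jones", "Jim Black"}  # answer to ? firstreq(X) given in the notes
scndreq = {"Jim Black"}                # hypothetical content, assumed for illustration


def closed_query(predicate, name):
    """? p('name'): a yes/no answer."""
    return name in predicate


def open_query(predicate):
    """? p(X): every binding of X that satisfies p."""
    return sorted(predicate)


# reqcs298(Name) <- firstreq(Name), scndreq(Name).
# A derived predicate is used as a goal exactly like a base predicate.
reqcs298 = firstreq & scndreq

print(closed_query(firstreq, "Jim Black"))
print(open_query(firstreq))
print(open_query(reqcs298))
```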

The Relational Model versus Datalog

Terminology
Relational Model      Datalog
Table or Relation     Base Predicate
Row or Tuple          Fact
Column                Argument
View                  Derived Predicate

Negation in Datalog
Only goals can be negated.
Negated heads are not allowed!
Junior-level students who did not take course cs143:

hastaken(Name, Course) ← took(Name, Course, Grade).
lacks_cs143(Name) ← student(Name, _, junior),
                    ¬hastaken(Name, cs143).
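
A Python sketch of the negated goal, assuming predicates are sets of tuples. In the original tables both juniors took cs143, so a hypothetical student 'Ann Lee' is added to make the answer non-empty. The negated goal becomes a "not in" test on the derived predicate hastaken.

```python
student = {
    ("Joe Doe", "cs", "senior"),
    ("Jim Jones", "cs", "junior"),
    ("Jim Black", "ee", "junior"),
    ("Ann Lee", "cs", "junior"),  # hypothetical student, not in the original tables
}
took = {
    ("Joe Doe", "cs123", 2.7),
    ("Jim Jones", "cs101", 3.0),
    ("Jim Jones", "cs143", 3.3),
    ("Jim Black", "cs143", 3.3),
    ("Jim Black", "cs101", 2.7),
    ("Ann Lee", "cs101", 3.7),    # hypothetical fact
}

# hastaken(Name, Course) <- took(Name, Course, Grade).
hastaken = {(name, course) for (name, course, _grade) in took}

# lacks_cs143(Name) <- student(Name, _, junior), not hastaken(Name, cs143).
lacks_cs143 = {
    name
    for (name, _major, year) in student
    if year == "junior" and (name, "cs143") not in hastaken
}
print(sorted(lacks_cs143))
```

The positive goal student(Name, _, junior) binds Name before the negated goal is tested, which keeps the answer finite.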

Universal Quantification by Double Negation
Find the senior students who completed all the requirements for the cs major: ? all_req_sat(X)
The first step is to formulate the complementary query: find the students who did not take some of the courses required for a cs major.
We can now re-express the original query as: find the senior students who are NOT missing any requirement.

req_missing(Name) ← student(Name, _, senior),
                    req(cs, Course), ¬hastaken(Name, Course).
all_req_sat(Name) ← student(Name, _, senior),
                    ¬req_missing(Name).
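
A Python sketch of the double-negation pattern, assuming predicates are sets of tuples. The req facts and the senior 'Sue Kim' are hypothetical, because the notes give no contents for req.

```python
student = {
    ("Joe Doe", "cs", "senior"),
    ("Sue Kim", "cs", "senior"),   # hypothetical student
    ("Jim Jones", "cs", "junior"),
}
took = {
    ("Joe Doe", "cs123", 2.7),
    ("Sue Kim", "cs101", 3.7),     # hypothetical fact
    ("Sue Kim", "cs143", 3.3),     # hypothetical fact
    ("Jim Jones", "cs101", 3.0),
    ("Jim Jones", "cs143", 3.3),
}
req = {("cs", "cs101"), ("cs", "cs143")}  # hypothetical requirements

hastaken = {(name, course) for (name, course, _grade) in took}

# req_missing(Name) <- student(Name, _, senior), req(cs, Course), not hastaken(Name, Course).
req_missing = {
    name
    for (name, _major, year) in student
    if year == "senior"
    for (major, course) in req
    if major == "cs" and (name, course) not in hastaken
}

# all_req_sat(Name) <- student(Name, _, senior), not req_missing(Name).
all_req_sat = {
    name
    for (name, _major, year) in student
    if year == "senior" and name not in req_missing
}
print(sorted(req_missing), sorted(all_req_sat))
```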

Domain Relational Calculus
Relational calculus comes in two main flavors:
1. in the Domain Relational Calculus (DRC), variables denote values of attributes,
2. in the Tuple Relational Calculus (TRC), variables denote whole tuples.
In DRC, the query "Find the names of junior-level students who have taken both cs101 and cs143" is expressed as:

{ (N) | ∃G1 (took(N, cs101, G1)) ∧ ∃G2 (took(N, cs143, G2)) ∧ ∃M (student(N, M, junior)) }

Domain Relational Calculus (cont.)
The query ? scndreq(N) can be expressed as follows:

{ (N) | ∃G, M (took(N, cs131, G) ∧ G > 3.0 ∧ student(N, M, junior)) ∨
        ∃G, M (took(N, cs151, G) ∧ G > 3.0 ∧ student(N, M, junior)) }

DRC presents several syntactic differences w.r.t. Datalog:
set definition by abstraction (rather than rules),
conjunctions and disjunctions in the same formula,
nesting of parentheses, and
explicit quantifiers.

Explicit Quantifiers
Existential and universal quantification are both allowed in DRC.
A query such as ? all_req_sat(N) can be expressed either by using double negation (and only existential quantifiers), or directly using the universal quantifier.
Example: find the seniors who completed all cs requirements:

{ (N) | ∃M (student(N, M, senior)) ∧ ∀C (req(cs, C) → ∃G (took(N, C, G))) }

The implication sign: p → q is a shorthand for ¬p ∨ q.
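
A Python sketch of the universally quantified formula, using all() for ∀, any() for ∃, and an implies helper defined as ¬p ∨ q. The req facts, the student 'Sue Kim' and the finite course domain are assumptions made for illustration; the notes give no contents for req.

```python
student = {
    ("Joe Doe", "cs", "senior"),
    ("Sue Kim", "cs", "senior"),   # hypothetical student
    ("Jim Jones", "cs", "junior"),
}
took = {
    ("Joe Doe", "cs123", 2.7),
    ("Sue Kim", "cs101", 3.7),     # hypothetical fact
    ("Sue Kim", "cs143", 3.3),     # hypothetical fact
    ("Jim Jones", "cs101", 3.0),
}
req = {("cs", "cs101"), ("cs", "cs143")}  # hypothetical requirements

# Finite domain for the variable C: every course mentioned in the database.
courses = {c for (_n, c, _g) in took} | {c for (_m, c) in req}


def implies(p, q):
    """p -> q is shorthand for (not p) or q."""
    return (not p) or q


# { (N) | exists M student(N, M, senior) and forall C (req(cs, C) -> exists G took(N, C, G)) }
all_req_sat = {
    n
    for (n, _m, year) in student
    if year == "senior"
    and all(
        implies(("cs", c) in req, any((tn, tc) == (n, c) for (tn, tc, _g) in took))
        for c in courses
    )
}
print(sorted(all_req_sat))
```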

Tuple Relational Calculus (TRC)
In TRC, variables range over the tuples of a relation. For instance, the TRC expression for the query ? firstreq(N) is:

{ (t[1]) | ∃u ∃s (took(t) ∧ took(u) ∧ student(s) ∧
           t[2] = cs101 ∧ u[2] = cs143 ∧ t[1] = u[1] ∧
           s[3] = junior ∧ s[1] = t[1]) }

The variables t and u denote tuples ranging over took, and s denotes a tuple ranging over student. t[1] denotes the first component of t (corresponding to Name).
TRC requires an explicit statement of equality (e.g., s[1] = t[1]), while in DRC equality is denoted implicitly by the presence of the same variable in different places.
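
A Python sketch that mirrors the TRC expression for firstreq, assuming relations are sets of tuples. The loop variables t, u and s range over whole tuples, and every equality is stated explicitly. Python indexes from 0, so t[0] plays the role of t[1] in the formula.

```python
student = {
    ("Joe Doe", "cs", "senior"),
    ("Jim Jones", "cs", "junior"),
    ("Jim Black", "ee", "junior"),
}
took = {
    ("Joe Doe", "cs123", 2.7),
    ("Jim Jones", "cs101", 3.0),
    ("Jim Jones", "cs143", 3.3),
    ("Jim Black", "cs143", 3.3),
    ("Jim Black", "cs101", 2.7),
}

# { (t[1]) | exists u, s (took(t) and took(u) and student(s) and ... ) }
firstreq = {
    t[0]
    for t in took
    for u in took
    for s in student
    if t[1] == "cs101" and u[1] == "cs143" and t[0] == u[0]
    and s[2] == "junior" and s[0] == t[0]
}
print(sorted(firstreq))
```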

Relational DB Languages
The various languages are quite different, but they have the same expressive power:
Safe TRC and DRC expressions are equivalent, and there are mappings that transform any formula in one language into an equivalent one in the other.
For each TRC or DRC formula there is an equivalent nonrecursive Datalog program. The converse is also true, since a nonrecursive Datalog program can be mapped into an equivalent DRC query.
Another language equivalent to these is relational algebra (RA). RA is an operator-based language, and thus provides a useful link to the concrete implementation of these logic-based languages.
Languages that can express every query expressible in these languages are called relationally complete.
Relational completeness is necessary, but it is much less than Turing completeness and is no longer sufficient in the commercial world (ergo the mission of CS240A).

Commercial DB Languages
The actual query languages of commercial RDBMSs are largely based on the formal query languages just discussed. For instance:
Query-By-Example (QBE) is a visual query language based on DRC.
Languages such as QUEL and SQL are instead based on TRC.
In QUEL and SQL, the notations t.Name and t.Course are used instead of t[1] and t[2]; also, existential quantification is replaced by the constructs RANGE (in QUEL) and FROM (in SQL).
Relational algebra provides a good basis for the efficient implementation of these relational languages.

Relational Algebra (RA)
A family of operators on relations that have the closure property: they take relations as arguments and return relations as results.
Union. The union of relations R and S, denoted R ∪ S, is the set of tuples that are in R, or in S, or in both:
R ∪ S = { t | t ∈ R ∨ t ∈ S }
This operation is defined only if R and S have the same number of columns.
Set difference. The tuples that belong to R but not to S:
R − S = { t | t ∈ R ∧ ¬∃r (r ∈ S ∧ t = r) }
This operation is defined only if R and S have the same number of columns. Say that that number is n. Then t = r denotes that t[1] = r[1] ∧ … ∧ t[n] = r[n].
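
A short Python sketch of union and set difference over relations stored as sets of tuples. The function names and the arity check are illustration choices; the difference spells out the component-wise test t = r from the definition rather than relying on Python's built-in set subtraction.

```python
def check_same_arity(r, s):
    """Both operators are defined only for relations with the same number of columns."""
    if len({len(t) for t in r | s}) > 1:
        raise ValueError("R and S must have the same number of columns")


def union(r, s):
    """R u S = { t | t in R or t in S }."""
    check_same_arity(r, s)
    return {t for t in r} | {t for t in s}


def difference(r, s):
    """R - S = { t | t in R and there is no r' in S with t = r' component by component }."""
    check_same_arity(r, s)
    return {
        t for t in r
        if not any(all(a == b for a, b in zip(t, x)) for x in s)
    }


R = {(1, "a"), (2, "b")}
S = {(2, "b"), (3, "c")}
print(sorted(union(R, S)))
print(sorted(difference(R, S)))
```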

Relational Operators
Cartesian product.
R × S = { t | (∃r ∈ R)(∃s ∈ S)(t[1, …, n] = r ∧ t[n+1, …, n+m] = s) }
If R has n columns and S has m columns, then R × S contains all the possible (n+m)-tuples whose first n components form a tuple in R and whose last m components form a tuple in S. Thus, R × S has n+m columns and |R| × |S| tuples, where |R| and |S| denote the respective cardinalities of the two relations.
Projection. Let L be a sublist of the columns of R (with possible reordering):
π_L R = { r[L] | r ∈ R }
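
A Python sketch of Cartesian product and projection on relations stored as sets of tuples. The function names are hypothetical, and columns are numbered from 1 to match the $i notation of the notes.

```python
def product(r, s):
    """R x S: every tuple of R concatenated with every tuple of S."""
    return {t + u for t in r for u in s}


def project(r, columns):
    """pi_L R, where columns is the list L of 1-based column numbers (reordering allowed)."""
    return {tuple(t[i - 1] for i in columns) for t in r}


R = {(1, 2), (3, 4)}          # n = 2 columns
S = {("a",), ("b",), ("c",)}  # m = 1 column

RxS = product(R, S)
print(len(RxS), sorted(project(R, [2, 1])))
```

Because relations are sets, projecting onto fewer columns can return fewer tuples than the input holds.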

Relational Operators (cont.)
Selection. σ_F R denotes the selection on R according to the selection formula F, where F obeys one of the following patterns:
(i) $i θ C, where $i is a column of R, θ is an arithmetic comparison operator, and C is a constant, or
(ii) $i θ $j, where $i and $j are columns of R, and θ is an arithmetic comparison operator, or
(iii) an expression built from terms such as those described in (i) and (ii) above, and the logical connectives ∨, ∧, and ¬.
Then:
σ_F R = { t | t ∈ R ∧ F' }
where F' denotes the formula obtained from F by replacing $i and $j with t[i] and t[j]. For example, if F is "$2 = $3 ∧ $1 = bob", then F' is "t[2] = t[3] ∧ t[1] = bob". Thus:
σ_{$2 = $3 ∧ $1 = bob} R = { t | t ∈ R ∧ t[2] = t[3] ∧ t[1] = bob }
All previous operators, except set difference, are monotonic.
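
A Python sketch of selection, with the formula F' written as a function on a tuple t, followed by a small check of the monotonicity remark. The relations and names are made up for illustration. Python indexes from 0, so t[1] == t[2] stands for $2 = $3.

```python
def select(r, f_prime):
    """sigma_F R = { t | t in R and F'(t) }."""
    return {t for t in r if f_prime(t)}


def f_prime(t):
    """F' for F = ($2 = $3 and $1 = bob), using 0-based indexes."""
    return t[1] == t[2] and t[0] == "bob"


R = {("bob", 1, 1), ("bob", 1, 2), ("amy", 3, 3)}
R_bigger = R | {("bob", 7, 7)}

# Monotonic: a larger input can only produce a larger (or equal) output.
print(select(R, f_prime) <= select(R_bigger, f_prime))

# Set difference is not monotonic in its second argument:
# enlarging S removes tuples from R - S.
S = {("bob", 1, 1)}
S_bigger = S | {("bob", 7, 7)}
print((R_bigger - S) <= (R_bigger - S_bigger))
```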

Additional Operators
Additional operators of frequent use can be derived from these. For instance, we have join, semijoin, intersection, division and generalized projection.
The join operator, R ⋈_F S, can be constructed using Cartesian product and selection:
R ⋈_F S = σ_F' (R × S)
where F = $i1 θ1 $j1 ∧ … ∧ $ik θk $jk; i1, …, ik are columns of R; j1, …, jk are columns of S; and θ1, …, θk are comparison operators. Then, if R has arity m, we define
F' = $i1 θ1 $(j1+m) ∧ … ∧ $ik θk $(jk+m).
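
A Python sketch of the join built from Cartesian product and selection, assuming relations are sets of tuples. The function name theta_join and the encoding of F as a list of (i, θ, j) triples are illustration choices. The offset j + m reflects that column $j of S sits at position $(j+m) in R × S.

```python
import operator


def theta_join(r, s, conditions, m):
    """R join_F S = sigma_F'(R x S).

    conditions: list of (i, theta, j) with $i a column of R and $j a column of S (1-based).
    m: the arity of R.
    """
    def f_prime(t):
        # In R x S, column $j of S has moved to position $(j + m).
        return all(theta(t[i - 1], t[j + m - 1]) for (i, theta, j) in conditions)

    cartesian = {x + y for x in r for y in s}
    return {t for t in cartesian if f_prime(t)}


took = {
    ("Joe Doe", "cs123", 2.7),
    ("Jim Jones", "cs101", 3.0),
    ("Jim Jones", "cs143", 3.3),
    ("Jim Black", "cs143", 3.3),
    ("Jim Black", "cs101", 2.7),
}
student = {
    ("Joe Doe", "cs", "senior"),
    ("Jim Jones", "cs", "junior"),
    ("Jim Black", "ee", "junior"),
}

# took joined with student on $1 = $1 (equal names).
joined = theta_join(took, student, [(1, operator.eq, 1)], m=3)
print(len(joined))
```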

Additional Operators (cont.)
The intersection of two relations can be constructed either by taking the equijoin of the two relations in every column (and then projecting out duplicate columns) or by using the following property: R ∩ S = R − (R − S) = S − (S − R).
The generalized projection of a relation R is denoted π_L(R), where L is a list of column numbers and constants. Unlike ordinary projection, components might appear more than once, and constants are permitted as components of the list L; e.g., π_{$1, c, $1} is a valid generalized projection.
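
A Python sketch that checks the intersection identity on a small example and implements generalized projection. Encoding the list L as strings such as "$1" for columns, with anything else treated as a constant, is an assumption of this sketch (a constant that itself starts with "$" would be misread).

```python
def generalized_project(r, items):
    """pi_L(R), where each item of L is "$i" (1-based column) or a constant."""
    def component(t, item):
        if isinstance(item, str) and item.startswith("$"):
            return t[int(item[1:]) - 1]
        return item  # a constant is copied into every output tuple

    return {tuple(component(t, item) for item in items) for t in r}


R = {(1, 2), (3, 4)}
S = {(3, 4), (5, 6)}

# R n S = R - (R - S) = S - (S - R)
left = R - (R - S)
right = S - (S - R)
print(left, right)

# pi_{$1, c, $1}: a column may repeat, and constants are allowed.
print(sorted(generalized_project(R, ["$1", "c", "$1"])))
```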

Unsafe Rules
An unsafe rule: to find grades better than the grade Joe Doe got in cs143, a user might write

bettergrade(G1) ← took('Joe Doe', cs143, G), G1 > G.

Infinite answers. Assuming that, say, Joe Doe got the grade of 3.3 (i.e., B+) in course cs143, there are infinitely many numbers that satisfy the condition of being greater than 3.3.
Lack of domain independence. A query formula is said to be domain independent when its answer depends only on the database and the constants in the query, but not on the domain of interpretation. The set of values for G1 satisfying the rule above depends on what domain we assume for numbers: e.g., integer, rational or real.
No relational algebra equivalent. RA expressions take DB tables and constants as operands and return finite relations.
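
A Python sketch of why the rule is not domain independent. Since an infinite domain cannot be enumerated, two small finite domains stand in for "integers" and "grade points"; both domains and the single took fact (the grade assumed in the text) are illustration choices. The database stays fixed while the answer changes with the domain.

```python
# The grade assumed in the text: Joe Doe got 3.3 in cs143.
took = {("Joe Doe", "cs143", 3.3)}


def bettergrade(domain):
    """bettergrade(G1) <- took('Joe Doe', cs143, G), G1 > G.
    G1 occurs in no positive goal, so its candidates must come from the domain."""
    return {
        g1
        for g1 in domain
        for (name, course, g) in took
        if name == "Joe Doe" and course == "cs143" and g1 > g
    }


small_integers = set(range(0, 7))           # stand-in for the integers
grade_points = {2.7, 3.0, 3.3, 3.7, 4.0}    # stand-in for a grade-point domain

print(sorted(bettergrade(small_integers)))
print(sorted(bettergrade(grade_points)))
# Enlarging the domain enlarges the answer, with the database unchanged.
print(len(bettergrade(set(range(0, 100)))))
```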

Safety
In practical languages, it is desirable to allow only safe formulas, which avoid the problems of infinite answers and loss of domain independence.
But the problems of domain independence and finiteness of answers are undecidable even for nonrecursive queries. Therefore, necessary and sufficient syntactic conditions that characterize safe formulas cannot be given in general.
In practice, therefore, sufficient conditions are defined that might be more restrictive than necessary.

Safe Datalog: an inductive definition
1. Safe Predicates. A predicate q of P is safe if
   (i) q is a database predicate, or
   (ii) every rule defining q is safe.
2. Safe Variables. A variable X in rule r is safe if
   (i) X is contained in some positive goal q(t1, ..., tn), where the predicate q(A1, ..., An) is safe, or
   (ii) r contains some equality goal X = Y, where Y is safe.
3. Safe Rules. A rule r is safe if all its variables are safe.
4. The goal ? q(t1, ..., tn) is safe when the predicate q(A1, ..., An) is safe.
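
A simplified Python sketch of conditions 2 and 3 for a single rule. It assumes every predicate in the positive goals is already known to be safe (e.g., a database predicate), and the rule encoding (lists of goals, pairs for equalities and comparisons) is hypothetical.

```python
def is_variable(term):
    """Datalog convention: variables start with an upper-case letter."""
    return isinstance(term, str) and term[:1].isupper()


def safe_variables(positive_goals, equalities):
    """Variables made safe by 2(i) positive goals and 2(ii) equality goals.
    Assumes the predicates in positive_goals are already known to be safe."""
    safe = {a for (_pred, args) in positive_goals for a in args if is_variable(a)}
    changed = True
    while changed:  # propagate safety through chains such as X = Y, Y = Z
        changed = False
        for x, y in equalities:
            for a, b in ((x, y), (y, x)):
                if is_variable(a) and a not in safe and b in safe:
                    safe.add(a)
                    changed = True
    return safe


def rule_is_safe(head_args, positive_goals, equalities, comparisons):
    """Condition 3: a rule is safe if all of its variables are safe."""
    terms = list(head_args)
    for _pred, args in positive_goals:
        terms += list(args)
    for x, y in equalities + comparisons:
        terms += [x, y]
    variables = {t for t in terms if is_variable(t)}
    return variables <= safe_variables(positive_goals, equalities)


# bettergrade(G1) <- took('Joe Doe', cs143, G), G1 > G.
unsafe = rule_is_safe(["G1"], [("took", ["'Joe Doe'", "cs143", "G"])], [], [("G1", "G")])

# s(Z, b, W) <- q(X, X, Y), p(Y, Z, a), W = Z, W > 24.3.
safe = rule_is_safe(
    ["Z", "b", "W"],
    [("q", ["X", "X", "Y"]), ("p", ["Y", "Z", "a"])],
    [("W", "Z")],
    [("W", 24.3)],
)
print(unsafe, safe)
```

In the first rule G1 appears only in a comparison, so it is never made safe; in the second, W becomes safe through the equality goal W = Z.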

From Safe Rules to RA [Step 1]
P is transformed into an equivalent program P' that does not contain any equality goal, by replacing equals with equals and removing the equality goals. For example:
s(Z, b, W) ← q(X, X, Y), p(Y, Z, a), W = Z, W > 24.3.
is translated into:
s(Z, b, Z) ← q(X, X, Y), p(Y, Z, a), Z > 24.3.
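
A Python sketch of Step 1 on the example rule. The rule encoding is hypothetical: each goal is a (predicate, arguments) pair, and each equality goal is a pair of variables whose left side is replaced by its right side. The sketch handles only equalities between variables.

```python
def eliminate_equalities(head, body, equalities):
    """Replace X by Y for every equality goal X = Y, then drop the equality goals."""
    subst = {}
    for x, y in equalities:
        x, y = subst.get(x, x), subst.get(y, y)  # apply earlier replacements first
        if x != y:
            # Redirect anything already mapped to x, then map x itself to y.
            subst = {k: (y if v == x else v) for k, v in subst.items()}
            subst[x] = y

    def rename(goal):
        pred, args = goal
        return (pred, tuple(subst.get(a, a) for a in args))

    return rename(head), [rename(g) for g in body]


# s(Z, b, W) <- q(X, X, Y), p(Y, Z, a), W = Z, W > 24.3.
head = ("s", ("Z", "b", "W"))
body = [("q", ("X", "X", "Y")), ("p", ("Y", "Z", "a")), (">", ("W", 24.3))]

new_head, new_body = eliminate_equalities(head, body, [("W", "Z")])
print(new_head)
print(new_body)
```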

Mapping [Step 2]
The body of each rule r is translated into the RA expression Body_r.
Body_r is the Cartesian product of all (base or derived) relations in the body, followed by a selection σ_F, where F is the conjunction of the following conditions:
(i) an inequality for each such goal (e.g., Z > 24.3),
(ii) equality between columns containing the same variable,
(iii) equality between a column and the constant therein.
E.g., for the rule
s(Z, b, Z) ← q(X, X, Y), p(Y, Z, a), Z > 24.3.
(i) Z > 24.3 translates into the selection condition $5 > 24.3,
(ii) the two occurrences of X translate into $1 = $2, while the two Ys map into $3 = $4, and
(iii) the constant in the last column of P maps into $6 = a. Thus:
Body_r = σ_{$1 = $2, $3 = $4, $6 = a, $5 > 24.3} (Q × P)

Mapping Datalog into RA [Steps 3 & 4]
Step 3: Each rule r is translated into a generalized projection on Body_r, according to the patterns in the head of r. For the rule at hand
s(Z, b, Z) ← q(X, X, Y), p(Y, Z, a), Z > 24.3.
we obtain:
S = π_{$5, b, $5} (Body_r)
Step 4: Multiple rules with the same head are translated into the union of their equivalent expressions.
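
A Python sketch of Steps 2 to 4 for the example rule, assuming relations are sets of tuples. The contents of Q and P are hypothetical, as is the second rule used to illustrate the union in Step 4. Python indexes from 0, so t[4] stands for $5.

```python
# Hypothetical contents for the relations q and p.
Q = {(1, 1, 2), (1, 3, 2), (5, 5, 7)}
P = {(2, 30.0, "a"), (2, 10.0, "a"), (7, 99.0, "b"), (7, 25.0, "a")}

# Step 2: Body_r = sigma_{$1 = $2, $3 = $4, $6 = a, $5 > 24.3}(Q x P)
q_times_p = {q + p for q in Q for p in P}
body_r = {
    t for t in q_times_p
    if t[0] == t[1] and t[2] == t[3] and t[5] == "a" and t[4] > 24.3
}

# Step 3: generalized projection pi_{$5, b, $5} on Body_r, following the head s(Z, b, Z).
s_rule1 = {(t[4], "b", t[4]) for t in body_r}

# Step 4: a hypothetical second rule with the same head, s(X, b, X) <- q(X, X, Y),
# contributes its own expression; the predicate s is the union of the two.
s_rule2 = {(t[0], "b", t[0]) for t in Q if t[0] == t[1]}
S = s_rule1 | s_rule2

print(sorted(body_r))
print(sorted(s_rule1))
```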

Equivalence of RA and Safe Nonrecursive Datalog Programs
Negated goals: a little more complex; see homework!
Theorem: Let P be a safe Datalog program without recursion or function symbols. Then, for each predicate in P, there exists an equivalent relational algebra expression.
