CS240A: Databases and Knowledge Bases

Carlo Zaniolo
Department of Computer Science
University of California, Los Angeles
December, 2001

Notes From Textbook


Advanced Database Systems
by Zaniolo, Ceri, Faloutsos,
Snodgrass, Subrahmanian and
Zicari
Morgan Kaufmann, 1997

A relational DB about students and courses


student

Name          Major   Year
'Joe Doe'     cs      senior
'Jim Jones'   cs      junior
'Jim Black'   ee      junior

took

Name          Course  Grade
'Joe Doe'     cs123   2.7
'Jim Jones'   cs101   3.0
'Jim Jones'   cs143   3.3
'Jim Black'   cs143   3.3
'Jim Black'   cs101   2.7

The same fact base for Datalog

student('Joe Doe', cs, senior).
student('Jim Jones', cs, junior).
student('Jim Black', ee, junior).

took('Joe Doe', cs123, 2.7).
took('Jim Jones', cs101, 3.0).
took('Jim Jones', cs143, 3.3).
took('Jim Black', cs143, 3.3).
took('Jim Black', cs101, 2.7).

Rules

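How to express logical conjunction:

Find the names of junior-level students who have taken both cs101 and cs143:

firstreq(Name) ← student(Name, Major, junior),
                 took(Name, cs101, Grade1),
                 took(Name, cs143, Grade2).

Each rule has a head (to the left of ←) and a body (to the right). Names beginning with an upper-case letter are variables; names beginning with a lower-case letter are constants or predicates; the underscore _ denotes an anonymous variable.

The commas in the body represent logical conjunction.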
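As a rough illustration of how such a conjunctive rule can be evaluated, the Python sketch below stores the fact base as sets of tuples and computes firstreq with a nested comprehension. The encoding (Python sets, the names `student`, `took` and `firstreq`) is an assumption of this sketch, not part of Datalog; a variable shared by several goals becomes an equality test.

```python
# Fact base from the slides, one Python tuple per Datalog fact.
student = {
    ("Joe Doe", "cs", "senior"),
    ("Jim Jones", "cs", "junior"),
    ("Jim Black", "ee", "junior"),
}
took = {
    ("Joe Doe", "cs123", 2.7),
    ("Jim Jones", "cs101", 3.0),
    ("Jim Jones", "cs143", 3.3),
    ("Jim Black", "cs143", 3.3),
    ("Jim Black", "cs101", 2.7),
}

# firstreq(Name) <- student(Name, Major, junior),
#                   took(Name, cs101, Grade1), took(Name, cs143, Grade2).
firstreq = {
    name
    for (name, _major, year) in student
    if year == "junior"
    for (n1, c1, _grade1) in took
    if n1 == name and c1 == "cs101"
    for (n2, c2, _grade2) in took
    if n2 == name and c2 == "cs143"
}

print(sorted(firstreq))  # ['Jim Black', 'Jim Jones']
```

Each `for` clause plays the role of one goal, and each `if` clause checks the constants and the shared variable Name.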

Disjunction
Junior-level students who took course cs131 or course cs151 with a grade better than 3.0. Disjunction is expressed by writing several rules with the same head:

scndreq(Name) ← took(Name, cs131, Grade),
                Grade > 3.0, student(Name, Major, junior).
scndreq(Name) ← took(Name, cs151, Grade),
                Grade > 3.0, student(Name, _, junior).
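The effect of two rules with the same head can be sketched in Python as the union of the results of each rule. The fact base on the slides contains no cs131 or cs151 entries, so the `took` facts below are hypothetical and serve only to make the example produce a non-empty answer; the helper name `scndreq_rule` is also an assumption of this sketch.

```python
student = {
    ("Joe Doe", "cs", "senior"),
    ("Jim Jones", "cs", "junior"),
    ("Jim Black", "ee", "junior"),
}
# Hypothetical facts, not taken from the slides.
took = {
    ("Jim Jones", "cs131", 3.3),
    ("Jim Black", "cs151", 2.7),
    ("Joe Doe", "cs151", 3.7),
}

def scndreq_rule(course):
    """Evaluate one rule: juniors who took `course` with a grade above 3.0."""
    return {
        name
        for (name, c, grade) in took
        if c == course and grade > 3.0
        for (n, _major, year) in student
        if n == name and year == "junior"
    }

# Two rules with the same head: the derived relation is the union of both.
scndreq = scndreq_rule("cs131") | scndreq_rule("cs151")
print(scndreq)  # {'Jim Jones'}
```

Jim Black is excluded because his grade is not above 3.0, and Joe Doe because he is not a junior.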

Queries
A closed query: the answer to such a query is either yes or no. For instance:
? firstreq('Jim Black')
An open query: ? firstreq(X)
and its answer: firstreq('Jim Jones')
                firstreq('Jim Black')
Cascading rules gives much power and convenience. E.g., both requirements must be satisfied to enroll in cs298:
reqcs298(Name) ← firstreq(Name), scndreq(Name).

The Relational Model versus Datalog


Terminology

Relational Model      Datalog
Table or Relation     Base Predicate
Row or Tuple          Fact
Column                Argument
View                  Derived Predicate

Negation in Datalog

Only goals can be negated. Negated heads are not allowed!

Junior-level students who did not take course cs143:

hastaken(Name, Course) ← took(Name, Course, Grade).
lacks_cs143(Name) ← student(Name, _, junior),
                    ¬hastaken(Name, cs143).
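A negated goal can be read as a membership test that fails. The Python sketch below evaluates the two rules over the fact base of the slides plus one hypothetical junior, 'Ann Lee', who took no courses (both juniors on the slides did take cs143, so without her the answer would be empty).

```python
student = {
    ("Joe Doe", "cs", "senior"),
    ("Jim Jones", "cs", "junior"),
    ("Jim Black", "ee", "junior"),
    ("Ann Lee", "cs", "junior"),   # hypothetical, not on the slides
}
took = {
    ("Joe Doe", "cs123", 2.7),
    ("Jim Jones", "cs101", 3.0),
    ("Jim Jones", "cs143", 3.3),
    ("Jim Black", "cs143", 3.3),
    ("Jim Black", "cs101", 2.7),
}

# hastaken(Name, Course) <- took(Name, Course, Grade).
hastaken = {(name, course) for (name, course, _grade) in took}

# lacks_cs143(Name) <- student(Name, _, junior), not hastaken(Name, cs143).
# The positive goal binds Name first; the negated goal only filters.
lacks_cs143 = {
    name
    for (name, _major, year) in student
    if year == "junior" and (name, "cs143") not in hastaken
}
print(lacks_cs143)  # {'Ann Lee'}
```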

Universal Quantification by Double Negation

Find the senior students who completed all the requirements for the cs major: ?all_req_sat(X)

The first step is to formulate the complementary query: find the students who did not take some of the courses required for a cs major.

We can now re-express the original query as: find the senior students who are NOT missing any requirement.

req_missing(Name) ← student(Name, _, senior),
                    req(cs, Course), ¬hastaken(Name, Course).
all_req_sat(Name) ← student(Name, _, senior),
                    ¬req_missing(Name).
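The double negation can be traced step by step in Python. The slides do not list any req facts, so the `req` relation below is hypothetical, as are the senior 'Ann Lee' and her courses; they are assumptions made only to give the query one positive and one negative case.

```python
student = {
    ("Joe Doe", "cs", "senior"),
    ("Jim Jones", "cs", "junior"),
    ("Jim Black", "ee", "junior"),
    ("Ann Lee", "cs", "senior"),        # hypothetical
}
took = {
    ("Joe Doe", "cs123", 2.7),
    ("Jim Jones", "cs101", 3.0),
    ("Jim Jones", "cs143", 3.3),
    ("Jim Black", "cs143", 3.3),
    ("Jim Black", "cs101", 2.7),
    ("Ann Lee", "cs101", 3.7),          # hypothetical
    ("Ann Lee", "cs143", 3.0),          # hypothetical
}
req = {("cs", "cs101"), ("cs", "cs143")}  # hypothetical requirements

hastaken = {(name, course) for (name, course, _grade) in took}
seniors = {name for (name, _major, year) in student if year == "senior"}

# req_missing(Name) <- student(Name, _, senior), req(cs, Course),
#                      not hastaken(Name, Course).
req_missing = {
    name
    for name in seniors
    for (major, course) in req
    if major == "cs" and (name, course) not in hastaken
}

# all_req_sat(Name) <- student(Name, _, senior), not req_missing(Name).
all_req_sat = seniors - req_missing
print(all_req_sat)  # {'Ann Lee'}
```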

Domain Relational Calculus

Relational calculus comes in two main flavors:

1. in the Domain Relational Calculus (DRC), variables denote values of attributes;
2. in the Tuple Relational Calculus (TRC), variables denote whole tuples.

In DRC, the query "Find the names of junior-level students who have taken both cs101 and cs143" is expressed as:

{ (N) | ∃G1 (took(N, cs101, G1)) ∧ ∃G2 (took(N, cs143, G2)) ∧
        ∃M (student(N, M, junior)) }

Domain Relational Calculus (cont.)


The query ?scndreq(N) can be expressed as follows:

{ (N) | ∃G, ∃M (took(N, cs131, G) ∧ G > 3.0 ∧ student(N, M, junior)) ∨
        ∃G, ∃M (took(N, cs151, G) ∧ G > 3.0 ∧ student(N, M, junior)) }

DRC presents several syntactic differences w.r.t. Datalog:

set-definition by abstraction (rather than rules)

conjunctions and disjunctions in the same formula,

nesting of parentheses, and

explicit quantifiers.

Explicit Quantifiers
Existential and universal quantification are both allowed in DRC. A query such as ?all_req_sat(N) can be expressed either by using double negation (and only existential quantifiers) or directly using the universal quantifier.

Example: find the seniors who completed all cs requirements:

{ (N) | ∃M (student(N, M, senior)) ∧
        ∀C (req(cs, C) → ∃G (took(N, C, G))) }

The implication sign: p → q is a shorthand for ¬p ∨ q.
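The universally quantified formula can be checked in Python with `all`, reading p → q as `(not p) or q`, and compared with the double-negation form written with `not any`. The data is a small hypothetical instance (the `req` facts and the student 'Ann Lee' are not on the slides), and the quantified variable C ranges over the courses mentioned in the data, which is an assumption of this sketch.

```python
student = {("Joe Doe", "cs", "senior"), ("Ann Lee", "cs", "senior")}
took = {
    ("Joe Doe", "cs123", 2.7),
    ("Ann Lee", "cs101", 3.7),   # hypothetical
    ("Ann Lee", "cs143", 3.0),   # hypothetical
}
req = {("cs", "cs101"), ("cs", "cs143")}   # hypothetical

# Values that C may take: every course appearing in the data.
courses = {c for (_n, c, _g) in took} | {c for (_m, c) in req}
seniors = {n for (n, _m, year) in student if year == "senior"}

def implies(p, q):
    """p -> q is a shorthand for (not p) or q."""
    return (not p) or q

def took_course(n, c):
    """Exists G such that took(n, c, G)."""
    return any(tn == n and tc == c for (tn, tc, _g) in took)

# Direct form: for all C (req(cs, C) -> exists G took(N, C, G)).
direct = {
    n for n in seniors
    if all(implies(("cs", c) in req, took_course(n, c)) for c in courses)
}

# Double negation: there is no C that is required and not taken.
double_neg = {
    n for n in seniors
    if not any(("cs", c) in req and not took_course(n, c) for c in courses)
}

print(direct, direct == double_neg)  # {'Ann Lee'} True
```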

Tuple Relational Calculus (TRC)

In TRC, variables range over the tuples of a relation. For instance, the TRC expression for the query ?firstreq(N) is:

{ (t[1]) | ∃u ∃s (took(t) ∧ took(u) ∧ student(s) ∧
           t[2] = cs101 ∧ u[2] = cs143 ∧ t[1] = u[1] ∧
           s[3] = junior ∧ s[1] = t[1]) }

The variables t and u denote tuples ranging over took, and s denotes a tuple ranging over student. t[1] denotes the first component of t (corresponding to Name).

TRC requires an explicit statement of equality (e.g., s[1] = t[1]), while in DRC equality is denoted implicitly by the presence of the same variable in different places.
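The tuple-variable style carries over almost literally to a Python comprehension in which t, u and s range over whole tuples and every equality is written out. This is only a sketch: Python indexes from 0, so the component t[1] of the formula appears as `t[0]` in the code.

```python
student = {
    ("Joe Doe", "cs", "senior"),
    ("Jim Jones", "cs", "junior"),
    ("Jim Black", "ee", "junior"),
}
took = {
    ("Joe Doe", "cs123", 2.7),
    ("Jim Jones", "cs101", 3.0),
    ("Jim Jones", "cs143", 3.3),
    ("Jim Black", "cs143", 3.3),
    ("Jim Black", "cs101", 2.7),
}

# t and u range over took, s ranges over student; all equalities are explicit.
firstreq_trc = {
    t[0]
    for t in took
    for u in took
    for s in student
    if t[1] == "cs101" and u[1] == "cs143" and t[0] == u[0]
    and s[2] == "junior" and s[0] == t[0]
}
print(sorted(firstreq_trc))  # ['Jim Black', 'Jim Jones']
```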

Relational DB Languages

The various languages are quite different, but they have the same expressive power:

Safe TRC and DRC expressions are equivalent, and there are mappings that transform any formula in one language into an equivalent one in the other.
For each TRC or DRC formula there is an equivalent nonrecursive Datalog program. The converse is also true, since a nonrecursive Datalog program can be mapped into an equivalent DRC query.
Another language equivalent to these is relational algebra (RA). RA is an operator-based language, and thus provides a useful link to the concrete implementation of these logic-based languages.

Languages that can express every query expressible in these languages are called relationally complete. Relational completeness is necessary, but it falls well short of Turing completeness and is no longer sufficient in the commercial world (hence the mission of CS240A).

Commercial DB Languages

The actual query languages of commercial RDBMSs are largely based on the formal query languages just discussed. For instance:

Query-By-Example (QBE) is a visual query language based on DRC.

Languages such as QUEL and SQL are instead based on TRC.

In QUEL and SQL, the notation t.Name and t.Course is used instead of t[1] and t[2]; also, existential quantification is replaced by the constructs RANGE (in QUEL) and FROM (in SQL).

RA provides a good basis for the efficient implementation of these relational languages.
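To make the link between TRC and SQL concrete, here is one possible SQL rendering of the firstreq query, run through Python's built-in `sqlite3` module on an in-memory database. The table definitions and column types are assumptions of this sketch; the tuple variables t, u and s are declared in the FROM clause and their components are written t.Name, t.Course, and so on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (Name TEXT, Major TEXT, Year TEXT);
CREATE TABLE took (Name TEXT, Course TEXT, Grade REAL);
INSERT INTO student VALUES
    ('Joe Doe', 'cs', 'senior'),
    ('Jim Jones', 'cs', 'junior'),
    ('Jim Black', 'ee', 'junior');
INSERT INTO took VALUES
    ('Joe Doe', 'cs123', 2.7),
    ('Jim Jones', 'cs101', 3.0),
    ('Jim Jones', 'cs143', 3.3),
    ('Jim Black', 'cs143', 3.3),
    ('Jim Black', 'cs101', 2.7);
""")

# FROM introduces the tuple variables; WHERE carries the explicit equalities.
rows = conn.execute("""
    SELECT DISTINCT t.Name
    FROM took t, took u, student s
    WHERE t.Course = 'cs101' AND u.Course = 'cs143' AND t.Name = u.Name
      AND s.Year = 'junior' AND s.Name = t.Name
    ORDER BY t.Name
""").fetchall()

names = [row[0] for row in rows]
print(names)  # ['Jim Black', 'Jim Jones']
conn.close()
```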

Relational Algebra (RA)


A family of operators on relations that have the closure property: they take relations as arguments and return relations as results.

Union. The union of relations R and S, denoted R ∪ S, is the set of tuples that are in R, or in S, or in both:
R ∪ S = { t | t ∈ R ∨ t ∈ S }
This operation is defined only if R and S have the same number of columns.

Set difference. The tuples that belong to R but not to S:
R − S = { t | t ∈ R ∧ ¬∃r (r ∈ S ∧ t = r) }
This operation is defined only if R and S have the same number of columns. If that number is n, then t = r denotes that t[1] = r[1] ∧ … ∧ t[n] = r[n].
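A minimal Python sketch of these two operators, assuming relations are represented as sets of equal-length tuples. The function names and the sample relations R and S are hypothetical; the arity check mirrors the condition that both relations have the same number of columns.

```python
def check_same_arity(r, s):
    """Union and difference are defined only for relations of equal arity."""
    arities = {len(t) for t in r} | {len(t) for t in s}
    if len(arities) > 1:
        raise ValueError("R and S must have the same number of columns")

def union(r, s):
    check_same_arity(r, s)
    return {t for t in r} | {t for t in s}

def difference(r, s):
    check_same_arity(r, s)
    # t is kept when there is no r' in S with t = r' (componentwise equality).
    return {t for t in r if not any(t == other for other in s)}

# Hypothetical sample relations with two columns each.
R = {(1, "a"), (2, "b")}
S = {(2, "b"), (3, "c")}

print(sorted(union(R, S)))   # [(1, 'a'), (2, 'b'), (3, 'c')]
print(difference(R, S))      # {(1, 'a')}
```

Both results are again sets of tuples, which is the closure property in action.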

Relational Operators
Cartesian product.
R × S = { t | (∃r ∈ R) (∃s ∈ S) (t[1, …, n] = r ∧ t[n+1, …, n+m] = s) }
If R has n columns and S has m columns, then R × S contains all the possible (n+m)-tuples whose first n components form a tuple in R and whose last m components form a tuple in S. Thus, R × S has n+m columns and |R| × |S| tuples, where |R| and |S| denote the respective cardinalities of the two relations.

Projection. Let L be a sublist of the columns of R (with possible reordering):

π_L R = { r[L] | r ∈ R }
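The following Python sketch, with relations again taken to be sets of tuples, shows Cartesian product and projection. Column lists are 1-based as in the text; the function names and sample data are hypothetical.

```python
def cartesian(r, s):
    """R x S: concatenate every tuple of R with every tuple of S."""
    return {a + b for a in r for b in s}

def project(r, cols):
    """pi_L R, where cols is the list L of 1-based column numbers."""
    return {tuple(t[i - 1] for i in cols) for t in r}

# Hypothetical relations: R has n = 2 columns, S has m = 1 column.
R = {("a", 1), ("b", 2)}
S = {("x",), ("y",), ("z",)}

RxS = cartesian(R, S)
print(len(RxS))                    # 6, i.e. |R| * |S|
print({len(t) for t in RxS})       # {3}, i.e. n + m columns

# Projection with reordering: column 3 first, then column 1.
print(len(project(RxS, [3, 1])))   # 6
# Projection can shrink a relation, since duplicates collapse in a set.
print(project(RxS, [1]))
```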

Relational Operators (cont)


Selection. σ_F R denotes the selection on R according to the selection formula F, where F obeys one of the following patterns:

(i) $i θ C, where i is a column of R, θ is an arithmetic comparison operator, and C is a constant, or
(ii) $i θ $j, where $i and $j are columns of R, and θ is an arithmetic comparison operator, or
(iii) an expression built from terms such as those described in (i) and (ii) above, and the logical connectives ∨, ∧, and ¬.

Then:

σ_F R = { t | t ∈ R ∧ F′ }

where F′ denotes the formula obtained from F by replacing $i and $j with t[i] and t[j]. For example, if F is "$2 = $3 ∧ $1 = bob", then F′ is "t[2] = t[3] ∧ t[1] = bob". Thus:
σ_{$2 = $3 ∧ $1 = bob} R = { t | t ∈ R ∧ t[2] = t[3] ∧ t[1] = bob }.

All the previous operators, except set difference, are monotonic.
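In a Python sketch, the formula F′ can be represented as a predicate on a tuple t, so that selection is a filtered comprehension. The sample relation is hypothetical. The last lines illustrate monotonicity on this one instance: adding tuples to the input never removes tuples from a selection, whereas enlarging the second argument of a set difference can shrink the result.

```python
def select(r, f_prime):
    """sigma_F R, with F' given as a Python predicate over a tuple t."""
    return {t for t in r if f_prime(t)}

# F is "$2 = $3 and $1 = bob"; F' uses 0-based indexes t[1], t[2], t[0].
def f_prime(t):
    return t[1] == t[2] and t[0] == "bob"

# Hypothetical three-column relation.
R = {("bob", 1, 1), ("bob", 1, 2), ("amy", 3, 3)}
print(select(R, f_prime))  # {('bob', 1, 1)}

# Monotonic: a larger input yields a result that contains the old one.
bigger_R = R | {("bob", 7, 7)}
print(select(R, f_prime) <= select(bigger_R, f_prime))  # True

# Not monotonic: growing S can remove tuples from R - S.
S = {("amy", 3, 3)}
bigger_S = S | {("bob", 1, 1)}
print((R - S) <= (R - bigger_S))  # False
```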

Additional Operators

Additional operators of frequent use can be derived from these. For instance, we have join, semijoin, intersection, division and generalized projection.

The join operator, R ⋈_F S, can be constructed using Cartesian product and selection:

R ⋈_F S = σ_F′ (R × S)

where F = $i1 θ1 $j1 ∧ … ∧ $ik θk $jk; i1, …, ik are columns of R; j1, …, jk are columns of S; and θ1, …, θk are comparison operators. Then, if R has arity m, we define
F′ = $i1 θ1 $(j1+m) ∧ … ∧ $ik θk $(jk+m).
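The construction of the join from product and selection, including the shift of S's column numbers by m, can be sketched in Python as follows. The function `join`, the `OPS` table and the way conditions are passed as triples (i, θ, j) are assumptions of this sketch.

```python
import operator

# Comparison operators allowed in the join condition.
OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def join(r, s, conds, m):
    """Join of R and S on F, built as a selection on F' over R x S.

    conds is a list of triples (i, op, j): column i of R, comparison op,
    column j of S (both 1-based). m is the arity of R.
    """
    # F': column j of S becomes column j + m of the product.
    shifted = [(i, OPS[op], j + m) for (i, op, j) in conds]
    product = (a + b for a in r for b in s)
    return {
        t for t in product
        if all(cmp(t[i - 1], t[j - 1]) for (i, cmp, j) in shifted)
    }

student = {
    ("Joe Doe", "cs", "senior"),
    ("Jim Jones", "cs", "junior"),
    ("Jim Black", "ee", "junior"),
}
took = {
    ("Joe Doe", "cs123", 2.7),
    ("Jim Jones", "cs101", 3.0),
    ("Jim Jones", "cs143", 3.3),
    ("Jim Black", "cs143", 3.3),
    ("Jim Black", "cs101", 2.7),
}

# Equijoin of student and took on Name: F is "$1 = $1", so F' is "$1 = $4".
joined = join(student, took, [(1, "=", 1)], m=3)
print(len(joined))  # 5
```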

Additional Operators (cont.)


The intersection of two relations can be constructed either by taking the equijoin of the two relations in every column (and then projecting out duplicate columns) or by using the following property: R ∩ S = R − (R − S) = S − (S − R).

The generalized projection of a relation R is denoted π_L(R), where L is a list of column numbers and constants. Unlike ordinary projection, components might appear more than once, and constants are permitted as components of the list L; e.g., π_{$1, c, $1} is a valid generalized projection.
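Both derived operators are easy to try out in Python on sets of tuples. In this sketch the list L is written with strings such as "$1" for column references and any other value for a constant; that convention, the function name `gen_project` and the sample relations are assumptions made for illustration.

```python
def gen_project(r, items):
    """Generalized projection: items mixes column refs ("$i") and constants."""
    def pick(t, item):
        if isinstance(item, str) and item.startswith("$"):
            return t[int(item[1:]) - 1]   # 1-based column reference
        return item                        # constant, copied as-is
    return {tuple(pick(t, item) for item in items) for t in r}

# Hypothetical two-column relations.
R = {("a", 1), ("b", 2)}
S = {("b", 2), ("c", 3)}

# Intersection through set difference only.
print(R - (R - S))                          # {('b', 2)}
print(R - (R - S) == S - (S - R) == R & S)  # True

# pi_{$1, c, $1}: column 1, the constant c, then column 1 again.
print(sorted(gen_project(R, ["$1", "c", "$1"])))
# [('a', 'c', 'a'), ('b', 'c', 'b')]
```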

Unsafe Rules
An unsafe rule: to find grades better than the grade Joe Doe got in cs143, a user might write

bettergrade(G1) ← took('Joe Doe', cs143, G), G1 > G.

Infinite answers. Assuming that, say, Joe Doe got a grade of 3.3 (i.e., B+) in course cs143, there are infinitely many numbers that satisfy the condition of being greater than 3.3.

Lack of domain independence. A query formula is said to be domain independent when its answer depends only on the database and the constants in the query, but not on the domain of interpretation. The set of values for G1 satisfying the rule above depends on what domain we assume for numbers: e.g., integer, rational or real.

No relational algebra equivalent. RA expressions take DB tables and constants as operands and return finite relations.
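The dependence on the domain can be made tangible with a small Python experiment. Since an infinite domain cannot be enumerated, the sketch restricts G1 to two finite candidate domains, both hypothetical: the integers 0 to 5 and the multiples of 0.1 between 0 and 5. The database is the same in both runs, yet the answers differ.

```python
# Single database fact assumed in the text: took('Joe Doe', cs143, 3.3).
took = {("Joe Doe", "cs143", 3.3)}

def bettergrade(domain):
    """Values G1 in `domain` with G1 > G for took('Joe Doe', cs143, G)."""
    return {
        g1
        for (name, course, g) in took
        if name == "Joe Doe" and course == "cs143"
        for g1 in domain
        if g1 > g
    }

integers = range(0, 6)                    # hypothetical domain 0, 1, ..., 5
tenths = [k / 10 for k in range(0, 51)]   # hypothetical domain 0.0, ..., 5.0

print(sorted(bettergrade(integers)))  # [4, 5]
print(len(bettergrade(tenths)))       # 17
```

G1 never occurs in a positive goal, so nothing in the database bounds it; the candidate values have to come from outside, which is exactly the loss of domain independence.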

Safety

In practical languages, it is desirable to allow only safe formulas, which avoid the problems of infinite answers and loss of domain independence.

But the problems of domain independence and finiteness of answers are undecidable even for nonrecursive queries. Therefore, necessary and sufficient syntactic conditions that characterize safe formulas cannot be given in general.

In practice, therefore, sufficient conditions are defined that might be more restrictive than necessary.

Safe Datalog: an inductive definition


1. Safe Predicates. A predicate q of P is safe if
   (i) q is a database predicate, or
   (ii) every rule defining q is safe.
2. Safe Variables. A variable X in rule r is safe if
   (i) X is contained in some positive goal q(t1, ..., tn), where the predicate q(A1, ..., An) is safe, or
   (ii) r contains some equality goal X = Y, where Y is safe.
3. Safe Rules. A rule r is safe if all its variables are safe.
4. The goal ?q(t1, ..., tn) is safe when the predicate q(A1, ..., An) is safe.
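Parts 2 and 3 of the definition can be turned into a small checker. The Python sketch below is a simplification under stated assumptions: the set of safe predicates (part 1) is supplied by the caller rather than computed inductively, a rule is a (head, body) pair, each goal is a triple (kind, name, args) with kind "pos", "neg" or "cmp", variables are strings starting with an upper-case letter, and quoted constants keep their quotes. All of these encodings are hypothetical.

```python
def is_var(term):
    """Variables are strings starting with an upper-case letter."""
    return isinstance(term, str) and term[:1].isupper()

def safe_variables(body, safe_preds):
    # 2(i): variables occurring in a positive goal over a safe predicate.
    safe = set()
    for kind, name, args in body:
        if kind == "pos" and name in safe_preds:
            safe |= {a for a in args if is_var(a)}
    # 2(ii): propagate safety through equality goals X = Y until stable.
    changed = True
    while changed:
        changed = False
        for kind, name, args in body:
            if kind == "cmp" and name == "=":
                x, y = args
                for a, b in ((x, y), (y, x)):
                    if is_var(a) and a not in safe and b in safe:
                        safe.add(a)
                        changed = True
    return safe

def is_safe_rule(head, body, safe_preds):
    # 3: every variable of the rule (head and body) must be safe.
    terms = list(head[1])
    for _kind, _name, args in body:
        terms.extend(args)
    all_vars = {t for t in terms if is_var(t)}
    return all_vars <= safe_variables(body, safe_preds)

base = {"student", "took"}   # database predicates, safe by 1(i)

firstreq = (("firstreq", ("Name",)), [
    ("pos", "student", ("Name", "Major", "junior")),
    ("pos", "took", ("Name", "cs101", "Grade1")),
    ("pos", "took", ("Name", "cs143", "Grade2")),
])
bettergrade = (("bettergrade", ("G1",)), [
    ("pos", "took", ("'Joe Doe'", "cs143", "G")),
    ("cmp", ">", ("G1", "G")),
])

print(is_safe_rule(*firstreq, base))     # True
print(is_safe_rule(*bettergrade, base))  # False: G1 is not safe
```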

From Safe Rules to RA [Step 1]

P is transformed into an equivalent program P′ that does not contain any equality goal, by replacing equals with equals and removing the equality goals. For example:
s(Z,b,W) ← q(X,X,Y), p(Y,Z,a), W=Z, W > 24.3.
is translated into:
s(Z,b,Z) ← q(X,X,Y), p(Y,Z,a), Z > 24.3.

Mapping [Step 2]
The body of r is translated into the RA expression Body_r. Body_r is the Cartesian product of all the (base or derived) relations in the body, followed by a selection σ_F, where F is the conjunction of the following conditions:
(i) an inequality for each such goal (e.g., Z > 24.3),
(ii) equality between columns containing the same variable,
(iii) equality between a column and the constant therein.

E.g., for the rule
s(Z,b,Z) ← q(X,X,Y), p(Y,Z,a), Z > 24.3.
(i) Z > 24.3 translates into the selection condition $5 > 24.3,
(ii) the two occurrences of X translate into $1 = $2, while the two Ys map into $3 = $4, and
(iii) the constant in the last column of P maps into $6 = a.
Thus:

Body_r = σ_{$1 = $2, $3 = $4, $6 = a, $5 > 24.3} (Q × P)

Mapping Datalog into RA [Steps 3 & 4]


Step 3: Each rule r is translated into an extended (i.e., generalized) projection on Body_r, according to the patterns in the head of r. For the rule at hand
s(Z,b,Z) ← q(X,X,Y), p(Y,Z,a), Z > 24.3.
we obtain:

S = π_{$5, b, $5} (Body_r)

Step 4: Multiple rules with the same head are translated into the union of their equivalent expressions.
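Steps 2 and 3 can be followed on concrete data. The Python sketch below uses small hypothetical relations Q and P (none are given on the slides), builds Body_r as a selection over Q × P, applies the generalized projection, and compares the result with a direct evaluation of the rule. Column $k of the text is index k-1 in the code.

```python
# Hypothetical instances of the relations q and p.
Q = {(1, 1, "y1"), (1, 2, "y1"), (3, 3, "y2")}
P = {("y1", 30.0, "a"), ("y1", 10.0, "a"), ("y2", 50.0, "b")}

# Step 2: Cartesian product followed by the selection
# $1 = $2, $3 = $4, $6 = a, $5 > 24.3.
product = {q + p for q in Q for p in P}
body_r = {
    t for t in product
    if t[0] == t[1] and t[2] == t[3] and t[5] == "a" and t[4] > 24.3
}

# Step 3: generalized projection pi_{$5, b, $5}.
S = {(t[4], "b", t[4]) for t in body_r}
print(S)  # {(30.0, 'b', 30.0)}

# Direct reading of s(Z,b,Z) <- q(X,X,Y), p(Y,Z,a), Z > 24.3.
direct = {
    (z, "b", z)
    for (x1, x2, y) in Q
    if x1 == x2
    for (y2, z, c) in P
    if y2 == y and c == "a" and z > 24.3
}
print(S == direct)  # True
```

With several rules for s, Step 4 would amount to taking the union of the sets computed for each rule.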

Equivalence of RA and Safe Nonrecursive Datalog Programs

Negated goals: a little more complex; see homework!

Equivalence of RA and Safe Nonrecursive Datalog


Theorem: Let P be a safe Datalog program without recursion or function symbols. Then, for each predicate in P, there exists an equivalent relational algebra expression.
