
Achieving Data Privacy through Secrecy Views and Null-Based Virtual Updates

Leopoldo Bertossi and Lechen Li Carleton University, School of Computer Science, Ottawa, Canada

arXiv:1105.1364v1 [cs.DB] 6 May 2011

Abstract. There may be sensitive information in a relational database, and we might want to keep it hidden from a user or group thereof. In this work, sensitive data is characterized as the contents of a set of secrecy views. For a user without permission to access that sensitive data, the database instance he queries is updated to make the contents of the views empty or contain only tuples with null values. In particular, if this user poses a query about any of these views, no meaningful information is returned. Since the database is not expected to be physically changed to produce this result, the updates are only virtual, and also minimal in a precise way. These minimal updates are reflected in the secrecy view contents, and also in the fact that query answers, while being privacy preserving, are also maximally informative. Virtual updates are based on the use of null values as used in the SQL standard. We provide the semantics of secrecy views and the virtual updates. The different ways in which the underlying database is virtually updated are specified as the models of a logic program with stable model semantics. The program becomes the basis for the computation of the secret answers to queries, i.e. those that do not reveal the sensitive information.

I. Introduction

Database management systems allow for massive storage of data, which can be easily and efficiently accessed and manipulated. However, at the same time, the problems of data privacy are increasingly important. For example, for commercial or legal reasons, administrators of sensitive information may not want or be allowed to release certain portions of the data. It becomes crucial to address database privacy issues. In this scenario, certain users should have access to only certain portions of a database. Preferably, what a particular user (or class of them) is allowed or not allowed to access should be specified in a declarative manner. This specification should be used by the database engine when queries are processed and answered. We would expect the database to return answers that do not reveal anything that should be kept protected from a particular user. At the same time, the database should return answers that are as informative as possible once the privacy conditions have been taken care of.

Some recent papers approach data privacy and access control on the basis of authorization views [24], [29]. View-based data privacy usually approaches the problem by specifying which views a user is allowed to access. For example, when the database receives a query from that user, it checks if the query can be answered using those views alone, more precisely, if the query can be rewritten in terms of the views, for every possible instance [24]. If no complete rewriting is possible, the query is rejected. In [29] the problem of the existence of a conditional rewriting is investigated, i.e. relative to an instance at hand.

Our approach to the data protection problem is based on specifications of what users are not allowed to access through query answers, which is quite natural. Data owners usually have a clearer picture of the data that is sensitive than of the data that can be publicly released. Dealing with our problem as the complement of the problem formulated in terms of authorization views is not natural, and not necessarily easy, especially considering that complements of database views would be involved [18], [19].

According to our approach, the information to be protected is declared as a secrecy view, or a collection of them. Each user or class of them may have an associated set of secrecy views. When a user poses a query to the database, the system virtually updates some of the attribute values on the basis of the set of secrecy views associated to that user. In this work, we consider updates that modify attribute values by assigning null values. As a consequence, in each of the resulting updated instances, the extension of each of the secrecy views either becomes empty or contains tuples showing only null values. Then, the original query is posed to the resulting class of updated instances. This amounts to: (a) Posing the query to each instance in the class. (b) Answering it as usual from each of them. (c) Collecting the answers that are shared by all the instances in the class. In this way, the system will return answers to the query that do not reveal the secret data. We illustrate the gist of our approach by means of an example.
Contact author: bertossi@scs.carleton.ca. Faculty Fellow of the IBM CAS, Toronto.

Example 1. Consider the following relational database D:

    Marks   studentID   courseID   mark
            001         01         56
            001         02         90
            002         02         70

The secrecy view $V_s$ defined below specifies that a student with her course mark must be kept secret when the mark is less than 60:

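$V_s(\mathit{sid}, \mathit{cid}, \mathit{mark}) \leftarrow \mathit{Marks}(\mathit{sid}, \mathit{cid}, \mathit{mark}),\ \mathit{mark} < 60.$¹

Now, a user of the database wants to obtain the students' marks, posing the following query:

$Q(\mathit{sid}, \mathit{cid}, \mathit{mark}) \leftarrow \mathit{Marks}(\mathit{sid}, \mathit{cid}, \mathit{mark}).$  (1)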

Through this query the user can obtain the first record Marks(001, 01, 56), which is sensitive information. A way to solve this problem consists in virtually updating the base relation according to the definition of the secrecy view. In this way, the secret information, i.e. the extension of the secrecy view, cannot be revealed to the user. Here, in order to protect the tuple Marks(001, 01, 56), the new instance $D'$ below is obtained by virtually updating the original instance, changing the attribute value 56 to null. Now, by posing the query about the secrecy view, i.e. $Q_1(\mathit{sid}, \mathit{cid}, \mathit{mark}) \leftarrow \mathit{Marks}(\mathit{sid}, \mathit{cid}, \mathit{mark}),\ \mathit{mark} < 60$, to $D'$, the user gets an empty answer.

    Marks   studentID   courseID   mark
            001         01         null
            001         02         90
            002         02         70

This is because the comparison of null with any value is evaluated as unknown. Similarly, query (1) will return the first tuple with null instead of 56. Furthermore, the user cannot obtain the mark 56 by combining any answers obtained through other queries on $D'$. Notice that, among other elements, there are two that are crucial for this approach to work: (a) The given database may contain null values, and whether it has them or not is not known to the user, and (b) The semantics of null values, including the logical operations with them. In this regard, we can say for the moment and in intuitive terms, that we will base our work on the SQL semantics of nulls, or, more precisely, on a logical reconstruction of this semantics (more details below). Hiding sensitive information is one of the concerns. Another one is about still providing as much information as possible to the user. In consequence, the virtual updates have to be minimal in some sense, while still doing their job of protecting data. In the previous example, we might consider virtually deleting the whole tuple Marks(001, 01, 56) to protect secret information, but we may lose some useful information, like the student ID and the course ID. Furthermore, the user should not be able to guess the protected information by combining information obtained from different queries. As illustrated above, null values will be used to virtually update the database instance. Null values have received the attention of the database community [28], [25], [16], and may have several possible interpretations, e.g. as a replacement for a real value that is non-existent, missing,
¹ We use Datalog notation for view definitions, and sometimes also for queries.

unknown, inapplicable, etc. Several formal semantics have been proposed for them. Furthermore, it is possible to consider different, coexisting null values. In this work, we will use a single null value, denoted, as above and in the rest of this paper, by null. Furthermore, we will treat null as the NULL in SQL relational databases. Since the SQL standard does not provide a precise, formal semantics for NULL, we will adopt here a formal, logical reconstruction of SQL nulls, first proposed in [7] and refined here (cf. Section II-B). It captures the logics and the semantics of the SQL NULL that are relevant for our work, namely (conjunctive) query answering.² This makes our approach to secrecy compatible with and implementable on top of commercial DBMSs. In this paper, we consider only conjunctive secrecy views and conjunctive queries. The semantics of null-based virtual updates for data privacy that we provide is model-theoretic, in the sense that the possible admissible instances after the update, the so-called secrecy instances, are defined and characterized. This definition captures the requirement that, on a secrecy instance, the extensions of the secrecy views contain only tuples with null or become empty. Furthermore, the secrecy instances do not depart from the original instance by more than what is needed to protect the secret data. Next, the semantics of secret answers to a query is introduced. Those answers are invariant under the class of secrecy instances. More precisely, a ground tuple $\bar{t}$ to a first-order query $Q(\bar{x})$ is a secret answer from instance $D$ if it is an answer to $Q(\bar{x})$ in every possible secrecy instance for $D$. Of course, explicitly computing and materializing all the secrecy instances to secretly answer a query is too costly. Ways around this naive approach have to be found.
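The gist of Example 1 can be tried out on an SQL engine. The following is a minimal sketch, assuming SQLite (through Python's standard `sqlite3` module) as the DBMS. The view name `MarksVirtual` is hypothetical and simply hard-codes the single virtual update of this example, with the base table left untouched. It is not the general mechanism of the paper, where several secrecy instances may have to be considered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Marks (studentID TEXT, courseID TEXT, mark INTEGER)")
conn.executemany(
    "INSERT INTO Marks VALUES (?, ?, ?)",
    [("001", "01", 56), ("001", "02", 90), ("002", "02", 70)],
)

# Hypothetical view playing the role of the virtually updated instance D':
# marks below 60 are shown as NULL, the stored data is not modified.
conn.execute(
    """
    CREATE VIEW MarksVirtual AS
    SELECT studentID, courseID,
           CASE WHEN mark < 60 THEN NULL ELSE mark END AS mark
    FROM Marks
    """
)

# The query about the secrecy view: NULL < 60 evaluates to unknown,
# so the protected tuple is not selected.
secret = conn.execute("SELECT * FROM MarksVirtual WHERE mark < 60").fetchall()

# Query (1): the first tuple is returned with NULL in place of 56.
visible = conn.execute(
    "SELECT * FROM MarksVirtual ORDER BY studentID, courseID"
).fetchall()

print(secret)
print(visible)
```

The student ID and course ID of the protected tuple remain visible, which is the sense in which the update is less drastic than deleting the whole tuple.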
We show that the class of secrecy instances, for a given instance $D$ and set of secrecy views $\mathcal{V}^s$, can be captured in terms of a disjunctive logic program with stable model semantics [13], [14]. More precisely, there is a one-to-one correspondence between the secrecy instances and the stable models of the program. The logic program uses extra annotation constants, to keep track of intermediate update steps. As a consequence, the logic programs can be used to: (a) Compactly specify (axiomatize) the class of secrecy instances; and (b) Compute secret answers to queries by running the program on top of the original instance. The structure of the rest of this paper is as follows. Section II introduces some basic definitions and notation needed in the later sections. This includes a precise definition of conjunctive query answering in databases with nulls. In Section III, a precise semantics for secrecy based on the use of null values is proposed. Section IV presents the notion of secret answer to a first-order query. Section V presents secrecy logic programs. Section VI presents some connections to database repairs and consistent query answering. Section VII discusses related work. Section VIII presents conclusions, and points to future work.
2 The main issue in [7] was integrity constraint satisfaction in the presence of nulls, for database repair and consistent query answering [2].

II. Preliminaries
Consider a relational schema $\Sigma = (U, \mathcal{R}, \mathcal{B})$, where $U$ is the possibly infinite database domain, with $null \in U$, $\mathcal{R}$ is a finite set of database predicates, and $\mathcal{B}$ is a finite set of built-in predicates, say $\mathcal{B} = \{=, \neq, >, <\}$. For an $n$-ary predicate $R \in \mathcal{R}$, $R[i]$ denotes the $i$th position or attribute of $R$, with $1 \leq i \leq n$. The schema determines a language $L(\Sigma)$ of first-order (FO) predicate logic. A relational instance $D$ for schema $\Sigma$ can be seen as a finite set of ground atoms of the form $R(\bar{a})$, with $R \in \mathcal{R}$, and $\bar{a}$ a tuple of constants from $U$. Data will be protected via a fixed set $\mathcal{V}^s$ of secrecy views $V_s$. They are associated to a particular user or class of them.

Definition 1. A secrecy view $V_s$ is defined by a Datalog rule of the form³

$V_s(\bar{x}) \leftarrow R_1(\bar{x}_1), \ldots, R_n(\bar{x}_n), \varphi,$  (2)

with $R_i \in \mathcal{R}$, $\bar{x} \subseteq \bigcup_i \bar{x}_i$,⁴ and $\bar{x}_i$ a tuple of variables. Formula $\varphi$ is a conjunction of built-in atoms containing variables or domain constants.

We can see that secrecy views are defined by conjunctive queries with built-in predicates. $V_s(D)$ denotes the extension of view $V_s$ when computed on an instance $D$ for $\Sigma$. Alternatively, the view in (2) can be seen as defined by a conjunctive query written in FO relational calculus, namely $Q^{V_s}(\bar{x}) : \exists \bar{y}(R_1(\bar{x}_1) \wedge \cdots \wedge R_n(\bar{x}_n) \wedge \varphi)$, where $\bar{y} = (\bigcup_i \bar{x}_i) \setminus \bar{x}$. Here, the variables in $\bar{x}$ are the free variables of the query. Similarly, $Q(D)$ denotes the set of answers to query $Q$ from $D$. In Section II-B we provide a precise definition of conjunctive query with built-ins as needed for our work and applications to privacy.

Example 2. Consider the database instance $D = \{R(a, b), R(c, d), S(b, f), S(d, g), S(e, e)\}$, and the secrecy view defined by $V_s(x) \leftarrow R(x, y), S(y, z)$. Here, the data protected by the view is the one that belongs to its extension, namely $V_s(D) = \{(a), (c)\}$. Sometimes, to emphasize the view predicate involved, we write $V_s(D) = \{V_s(a), V_s(c)\}$.

A query is a formula $Q(\bar{x})$ of $L(\Sigma)$, with free variables $\bar{x}$. $D \models Q[\bar{a}]$ denotes that the instance $D$ makes the query true when the free variables take the values as in $\bar{a}$, a tuple of constants from $U$. In this case, $\bar{a}$ is an answer to the query. Finally, an integrity constraint (IC) is a sentence $\psi$ of $L(\Sigma)$. $D \models \psi$ denotes that instance $D$ satisfies $\psi$. Here, $\models$ is the usual notion of satisfaction found in predicate logic. According to it, the constant null is just another element of the database domain. We will use this notion at some points. However, to capture the special role of null among the constants, we will introduce next a different notion, the one of formula satisfaction under nulls. It will coexist with the classical notion.

³ We will frequently use Datalog notation for view definitions and queries.
⁴ When there is no possible confusion, we treat sequences of variables as sets of variables, i.e. $x_1 \cdots x_n$ as $\{x_1, \ldots, x_n\}$.

A. Null value semantics: The basics

In [10], Codd proposed three-valued logic with truth values true, false, and unknown for relational databases with NULL. Basically, when a NULL is involved in a comparison operation, the result is always considered to be unknown. He also gave a three-valued semantics to $L(\Sigma)$-formulas containing boolean operators, like $\wedge$, $\vee$, and $\neg$. A three-valued logic was adopted by the SQL standard, and it has been partially implemented in most common commercial DBMSs, with some particular modifications in each case. As a result, the semantics of NULL in both the SQL standard and the commercial DBMSs is not quite clear. In particular, this is the case of IC satisfaction in the presence of NULL.

The semantics for IC satisfaction with NULL introduced in [7], [8] presents a FO semantics of nulls in SQL databases. It is a reconstruction in classical logic of the treatment of NULL in SQL DBs. More precisely, this semantics captures the notion of satisfaction of ICs, and also of query answering for a broad class of queries in relational databases. In the rest of this section, we sketch some of the elements of this notion of query answer. It will be the one used in the rest of this work.

In this section we provide the basics of query answering with null, which could be good enough for a first reading. The details can be found in Section II-B. In the following, database instances are assumed to have a single constant null to represent a null value. A tuple $\bar{a}$ of elements of $U$ is an answer to query $Q(\bar{x})$, denoted $D \models_N Q[\bar{a}]$, if the formula (that represents) $Q$ is classically true when the quantifiers over its relevant variables (attributes) run over $(U \setminus \{null\})$, and those over the non-relevant variables run over $U$. The relevant free variables in the query are not allowed to take the value null either. Cf. Section II-B for a precise definition (and also [7], [8]). Here we give just an example.

Example 3. Consider the instance $D$ and query below:

    R   A      B      C          S   B
        1      1      1              null
        2      null   null           1
        null   3      3              3

$Q(x) : \exists y \exists z (R(x, y, z) \wedge S(y) \wedge y > 2).$  (3)

A variable $v$ (quantified or not) in a conjunctive query is relevant if it appears (non-trivially) twice in the formula [7]. Occurrences of the form $v = null$ and $v \neq null$ do not count though. In this query, the only relevant quantified variable is $y$, because it participates in a join and a built-in in the quantifier-free matrix of (3). So, there are two reasons for $y$ to be relevant. The only free variable is $x$, which is not relevant since it does not satisfy the same criterion applied to $y$. As for query answers, the only candidate values for $x$ are: null, 2, 1. In this case, null is a candidate value because $x$ is not a relevant variable. First, $x = null$ is an answer to the query, because the formula $\exists y \exists z (R(x, y, z) \wedge S(y) \wedge y > 2)$ is true in $D$, with

a non-null witness value for $y$ and a witness value for $z$ that combined make the (non-quantified) formula true, namely $y = 3$, $z = 3$. So, it holds $D \models_N Q[null]$. Next, $x = 2$ is not an answer, because for this value of $x$ the only candidate value for $y$, namely the null that accompanies 2 in $R$, makes the formula $(R(x, y, z) \wedge S(y) \wedge y > 2)$ false. Even if it were true, this value for $y$ would not be allowed. Finally, $x = 1$ is not an answer, because the only candidate value for $y$, namely 1, makes the formula false. In consequence, null is the only answer. This notion of query answer coincides with the classical first-order semantics for queries and databases without null values [7], [8].

B. Semantics of query answers with nulls

Here we introduce the semantics of first-order conjunctive query answering in relational databases with null values. This semantics can be extended to a broader class of queries and also to integrity constraint satisfaction. It builds upon a similar and more general semantics first introduced in [7], [8]. As in SQL, we assume we have a single null value, null. Queries may contain the special unary predicates IsNull and IsNotNull, to avoid using explicit (in)equality in combination with nulls in them. IsNull(null) is true, but IsNull(c) is false for any other constant $c$ in the database domain. For any constant $d \in U$, IsNotNull(d) is true iff IsNull(d) is false. For example, we could have the query $Q(x) : \exists y (P(x, y) \wedge IsNull(y))$. However, we can also have the query $Q'(x) : \exists y (P(x, y) \wedge y = null)$. As in SQL relational databases, these two queries have different semantics, which is illustrated in the following example.

Example 4. Consider the schema $S = \{R(A, B)\}$ and the instance in the table below. In it NULL is the SQL null.
    R   A   B
        a   b
        a   c
        d   NULL
        d   e
        u   u
        v   NULL
        v   r

If this instance is stored in an SQL database, we can observe the behavior of the following queries when they are directly translated into SQL and run on an SQL DB:

(a) $Q_1(y) : \exists x (R(x, y) \wedge y = null)$
    SQL: Select * from R where B = NULL
    Result: No tuple
(b) $Q_1'(y) : \exists x (R(x, y) \wedge IsNull(y))$
    SQL: Select * from R where B IS NULL (now uses the IS NULL predicate)
    Result: The expected two tuples
(c) $Q_2(y) : \exists x (R(x, y) \wedge y \neq null)$
    SQL: Select * from R where B <> NULL
    Result: No row
(d) $Q_2'(y) : \exists x (R(x, y) \wedge IsNotNull(y))$
    SQL: Select * from R where B IS NOT NULL (now uses the IS NOT NULL predicate)
    Result: The expected five tuples
(e) $Q_3(x, y) : (R(x, y) \wedge x = y)$
    SQL: Select * from R where A = B
    Result: One row: (u,u)
(f) $Q_4(x, y) : (R(x, y) \wedge x \neq y)$
    SQL: Select * from R where A <> B
    Result: Four tuples: (a,b), (a,c), (d,e), (v,r)
(g) $Q_5(x, y, x, z) : (R(x, y) \wedge R(x, z) \wedge y \neq z)$
    SQL: Select * from R r1, R r2 where r1.A = r2.A and r1.B <> r2.B
    Result: Two tuples: (a,b,a,c), (a,c,a,b)

This semantics is captured in Definition 4 below. In the previous example each query $Q$ is defined by the formula on the right-hand side. Below, we will identify the query with its defining first-order formula.

Definition 2. A conjunctive query is an $L(\Sigma)$ formula of the form

$Q(\bar{x}) : \exists \bar{y}(A_1(\bar{x}_1) \wedge \cdots \wedge A_n(\bar{x}_n)),$  (4)

where $\bar{y} \subseteq \bigcup_i \bar{x}_i$, $\bar{x} = (\bigcup_i \bar{x}_i) \setminus \bar{y}$, and the $A_i$ are atoms containing any of the predicates in $\mathcal{R} \cup \{=, \neq, <, >, IsNull, IsNotNull\}$ plus terms, i.e. variables or constants in $U$. Furthermore, those atoms are never of the form $t = null$, $null = t$, $t \neq null$, $null \neq t$, with $t$ a term, null or not.

The idea here is to define queries that explicitly mention the null value in terms of the built-ins IsNull or IsNotNull. If any of the excluded atoms above is intended to be in a query, it has to be replaced by IsNull(t) or IsNotNull(t), resp. However, an arbitrary $L(\Sigma)$ formula may contain those excluded atoms, but they should not be in a conjunctive query (as needed in this work).

Definition 3. Consider a conjunctive query of the form $Q(\bar{x}) : \exists \bar{y}\, \psi(\bar{x}, \bar{y})$, where $\exists \bar{y}$ is a possibly empty prefix of existential quantifiers, and $\psi$ is a (quantifier-free) conjunction of atoms. The set of relevant variables of $Q$ is [8]: $VR(Q) := \{v \mid v$ is a variable that occurs at least twice in $\psi$, without counting $IsNull(v)$/$IsNotNull(v)\}$.

Of course, we make the assumption that the occurrences are non-trivial, i.e. we do not find redundant atoms, as in $P(x, y) \wedge P(x, y)$ or $P(x, y) \wedge y = y$. For example, for query $Q(x) : \exists y (P(x, y, z) \wedge Q(y) \wedge IsNull(y))$, $VR(Q(x)) = \{y\}$, because $y$ is used twice in $P(x, y, z) \wedge Q(y) \wedge IsNull(y)$.

As usual in logic, we consider assignments from the set, $Var$, of variables to the underlying database domain $U$ (that contains the constant null), i.e. $s : Var \rightarrow U$. Such an assignment can be extended to terms, as $\bar{s}$. It maps every variable $x$ to $s(x)$, and every element $c$ of $U$ to $c$. For an assignment $s$, a variable $y$ and a constant $c$, $s\frac{y}{c}$ denotes the assignment that coincides with $s$ everywhere, possibly except for $y$, that takes the value $c$. We define similarly the extension $\bar{s}\frac{y}{c}$. Given a formula $\varphi$, $\varphi[s]$ denotes the formula

obtained from $\varphi$ by replacing its free variables by their values according to $s$. Now, given a formula (query) $\varphi$ and a variable assignment function $s$, we verify if instance $D$ satisfies $\varphi[s]$ by assuming that the quantifiers on relevant variables range over $(U \setminus \{null\})$, and those on non-relevant variables range over $U$. More precisely, we define, by induction on $\varphi$, when $D$ satisfies $\varphi$ with assignment $s$, denoted $D \models_N \varphi[s]$.

Definition 4. Assume that $\mathcal{B} = \{=, \neq, <, >\}$. Let $\varphi$ be a conjunctive query, and $s$ an assignment. The pair $D, s$ satisfies $\varphi$ under the null-semantics, denoted $D \models_N \varphi[s]$, when $\varphi$ falls in one of the following cases (an inductive definition):
1. $\varphi$ is $IsNull(t)$ ($IsNotNull(t)$), $t$ a term, and $\bar{s}(t) = null$ ($\bar{s}(t) \neq null$).
2. $\varphi$ is $t_1 \odot t_2$ with $\odot \in \{<, >\}$ and $t_1, t_2$ terms, and $\bar{s}(t_1) \neq null$, $\bar{s}(t_2) \neq null$, and $D \models (t_1 \odot t_2)[s]$.
3. $\varphi$ is $t_1 \odot t_2$ with $\odot \in \{=, \neq\}$, $t_1, t_2$ are terms, and one of the following holds:⁵
   (a) $\odot$ is $=$, $t_1 \in Var$, $t_2$ is null, and $s(t_1) = null$ (or symmetrically).
   (b) $\odot$ is $=$, $t_1, t_2 \in Var$, and $s(t_1) = s(t_2)$.
   (c) $\odot$ is $\neq$, $t_1 \in Var$, $t_2$ is null, and $s(t_1) \neq null$ (or symmetrically).
   (d) $\odot$ is $\neq$, $t_1, t_2 \in Var$, and $s(t_1) \neq s(t_2)$.
4. $\varphi$ is $R(t_1, \ldots, t_n)$, with $R \in \mathcal{R}$, and $R(\bar{s}(t_1), \ldots, \bar{s}(t_n)) \in D$.
5. $\varphi$ is $(\alpha \wedge \beta)$, and $D \models_N \alpha[s]$ and $D \models_N \beta[s]$.
6. $\varphi$ is $\exists y\, \alpha$, and one of the following holds:
   (a) $y \in VR(\alpha)$, and there is a $c$ in $(U \setminus \{null\})$ with $D \models_N \alpha[s\frac{y}{c}]$.
   (b) $y \notin VR(\alpha)$, and there is a $c$ in $U$ with $D \models_N \alpha[s\frac{y}{c}]$.

This semantics can be applied in particular to conjunctive queries. The reason for giving it for arbitrary formulas is that it can be applied also to the satisfaction of integrity constraints under null values [8], [7].

Definition 5. [8] Let $Q(\bar{x}) : \exists \bar{y}\, \psi(\bar{x}, \bar{y})$ be a conjunctive query, with $\bar{x} = x_1, \ldots, x_n$.
(a) A tuple $(t_1, \ldots, t_n) \in U^n$ is an answer from $D$ under the null query answering semantics to $Q$, in short, an $N$-answer, denoted $D \models_N Q[t_1, \ldots, t_n]$, iff there exists an assignment $s$ such that:
   i. $s(x_i) = t_i$, for $i = 1, \ldots, n$;
   ii. $t_i \neq null$, for each $x_i \in VR(Q)$; and
   iii. $D \models_N (\exists \bar{y}\, \psi)[s]$.
(b) $Ans^N(Q, D)$ denotes the set of $N$-answers to $Q$ from database $D$.
(c) If $Q$ is a sentence (boolean query), the $N$-answer is yes iff $D \models_N Q$, and no, otherwise.

Notice that $D \models_N (\exists \bar{y}\, \psi)[s]$ in (a) above requires, according to Definition 4, that the existentially quantified variables $y_j$ in $\exists \bar{y}$ that are relevant do not take the value null. Variables $x_i$ can take the value null only when they are not relevant in the query. Example 3 illustrates this
⁵ Here we will use the symbols $=$ and $\neq$ both at the object and the meta levels, but there should not be a confusion since valuations are involved.

definition. In it, since the free variable $x$ is not relevant, $Ans^N(Q, D) = \{null\}$. The $N$-query answering semantics coincides with classical first-order query answering semantics in databases without null values [8], [7]. More precisely, if $null \notin U$ (and then it does not appear in $D$ either): $D \models_N Q[\bar{t}]$ iff $D \models Q[\bar{t}]$. Furthermore, every conjunctive query can be syntactically transformed into a new FO query for which the evaluation can be done by treating null as any other constant [8], [7]. The transformation is similar to the one in Proposition 1. More precisely, a conjunctive query $Q(\bar{x})$ of the form (4) can be rewritten into the classical conjunctive query

$Q^N(\bar{x}) : \exists \bar{y}\Big(A_1'(\bar{x}_1) \wedge \cdots \wedge A_n'(\bar{x}_n) \wedge \bigwedge_{v \in VR(Q)} v \neq null\Big),$

where $A_i'(\bar{x}_i)$ is $A_i(\bar{x}_i)$ when the predicate belongs to $\mathcal{R} \cup \{=, \neq, <, >\}$, but when $A_i$ is of the form $IsNull(t)$ or $IsNotNull(t)$, then $A_i'$ is $t = null$ or $t \neq null$, resp. It holds: $D \models_N Q[\bar{t}]$ iff $D \models Q^N[\bar{t}]$. On the right-hand side, we have classical first-order satisfaction, and null is treated as an ordinary constant in the domain. This transformation ensures that relevant variables range over $(U \setminus \{null\})$.

Example 5. (example 3 continued) The query $Q$ in Example 3 can be rewritten as $Q^N(x) : \exists y \exists z (R(x, y, z) \wedge S(y) \wedge y > 2 \wedge y \neq null)$. Now, $D \not\models Q^N[1]$, that is, $D \not\models \exists y \exists z (R(1, y, z) \wedge S(y) \wedge y > 2 \wedge y \neq null)$ under classical query evaluation, with null treated as an ordinary constant. Similarly, $D \not\models Q^N[2]$ due to the new conjunct $y \neq null$. Finally, $D \models Q^N[null]$ because $D \models (R(null, 3, 3) \wedge S(3) \wedge 3 > 2 \wedge 3 \neq null)$. Since null is treated as any other constant, we can compare it with 3. By the unique names assumption, it holds $null \neq 3$.
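The SQL behavior reported in Example 4, and the answer to the rewritten query of Example 5, can be reproduced with a small script. This is only a sketch under stated assumptions: SQLite (via Python's standard `sqlite3` module) stands in for the SQL DBMS, the table names `R3` and `S3` are hypothetical names for the relations R and S of Example 3 (to avoid a clash with the binary R of Example 4), and the conjunct $y \neq null$ of $Q^N$ is rendered with `IS NOT NULL`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Instance of Example 4: R(A, B), with None stored as SQL NULL.
conn.execute("CREATE TABLE R (A TEXT, B TEXT)")
conn.executemany("INSERT INTO R VALUES (?, ?)", [
    ("a", "b"), ("a", "c"), ("d", None), ("d", "e"),
    ("u", "u"), ("v", None), ("v", "r"),
])

def count(where):
    """Number of rows of R selected by the given WHERE condition."""
    return conn.execute(f"SELECT COUNT(*) FROM R WHERE {where}").fetchone()[0]

eq_null = count("B = NULL")            # query (a)
is_null = count("B IS NULL")           # query (b)
neq_null = count("B <> NULL")          # query (c)
is_not_null = count("B IS NOT NULL")   # query (d)
equal_cols = conn.execute("SELECT A, B FROM R WHERE A = B").fetchall()  # (e)
diff_cols = conn.execute(
    "SELECT A, B FROM R WHERE A <> B ORDER BY A, B").fetchall()          # (f)
self_join = conn.execute(
    "SELECT r1.A, r1.B, r2.A, r2.B FROM R r1, R r2 "
    "WHERE r1.A = r2.A AND r1.B <> r2.B ORDER BY r1.B").fetchall()       # (g)

# Instance of Example 3 under the hypothetical names R3 and S3.
conn.execute("CREATE TABLE R3 (A INTEGER, B INTEGER, C INTEGER)")
conn.execute("CREATE TABLE S3 (B INTEGER)")
conn.executemany("INSERT INTO R3 VALUES (?, ?, ?)",
                 [(1, 1, 1), (2, None, None), (None, 3, 3)])
conn.executemany("INSERT INTO S3 VALUES (?)", [(None,), (1,), (3,)])

# Q^N of Example 5: join on y, y > 2, and y required to be non-null.
n_answers = conn.execute(
    "SELECT DISTINCT R3.A FROM R3, S3 "
    "WHERE R3.B = S3.B AND R3.B > 2 AND R3.B IS NOT NULL").fetchall()

print(eq_null, is_null, neq_null, is_not_null)
print(equal_cols, diff_cols, self_join, n_answers)
```

The last query returns a single row whose only value is NULL, in line with $Ans^N(Q, D) = \{null\}$: the free variable x is not relevant, so it is allowed to take the value null.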

III. Secrecy Instances


In this work we will make use of null to protect secret information. The idea is that the extensions of the secrecy views will contain only tuples with null or will become empty. View evaluation corresponds to conjunctive query evaluation, which will be based on the notion of query answering with nulls we just introduced.

Example 6. (example 3 continued) Consider the secrecy view $V_s(x) \leftarrow R(x, y, z), S(y), y > 2$. This view can be expressed as the FO query $Q^{V_s}(x) : \exists y \exists z (R(x, y, z) \wedge S(y) \wedge y > 2)$, that we found in (3). Under the semantics of secrecy in the presence of null, we expect that the extensions of the secrecy views contain tuples only with null or become empty. This implies that the values in attribute A associated with variable $x$ in $Q^{V_s}$ are null, or the values in B associated with variable $y$ in $Q^{V_s}$ are null, or the negation of the comparison is true.

These three cases correspond to the three assignments in Example 3. This example shows that there are some attributes that are particularly relevant for the view extensions, A and B in that case. In the following, we make precise this notion of relevant attribute.

Definition 6. Consider a view $V_s$ defined by (2).
(a) For a predicate $R \in \mathcal{R}$ in the body of (2) and a term $t$ (i.e. a variable or constant), $pos^R(V_s, t)$ denotes the set of positions in $R$ where $t$ appears in the body of the definition of $V_s$.
(b) The set of combination attributes for $V_s$ is: $C(V_s) = \{R[i] \mid$ there is a relevant variable $v$ with $i \in pos^R(V_s, v)\}$.
(c) The set of secrecy attributes for $V_s$ is: $S(V_s) = \{R[i] \mid$ there is a variable $v$ in the head of (2), and $i \in pos^R(V_s, v)\}$.

Combination attributes for a secrecy view $V_s$ are those involved in joins or built-in predicates. Secrecy attributes are those appearing in the head of the definition of $V_s$. They correspond to the free variables in the associated query $Q^{V_s}$.

Definition 7. The set of s-relevant⁶ attributes for a secrecy view $V_s$ are those (associated to positions) in the set $A(V_s) = C(V_s) \cup S(V_s)$.

Example 7. (example 6 continued) Consider again the secrecy view $V_s(x) \leftarrow R(x, y, z), S(y), y > 2$. Since $y$ is the only relevant variable, $C(V_s) = \{R[2], S[1]\}$. Since $x$ is the only free variable, $S(V_s) = \{R[1]\}$. Therefore, $A(V_s) = \{R[1], R[2], S[1]\}$. The value in attribute C is not relevant for obtaining the view extension. This makes sense, because atoms are joined via attribute B, whose values must also be greater than 2, and then the projection is on attribute A.

Definition 8. A database instance $D$ is admissible for a set $\mathcal{V}^s$ of secrecy views of the form (2), denoted $D \in Admiss(\mathcal{V}^s)$, if under the $\models_N$ semantics, each $V_s(D)$ is empty or in all its tuples only null appears.

From the example above, we can see that, given a database instance $D$ and a secrecy view $V_s$ of the form (2), $D$ is admissible when one of the following three cases occurs: (a) There is a null in any of the combination attributes; (b) Values in the secrecy attributes are all null; (c) The negation of the built-ins is true. Below we provide an alternative, actually a classical characterization of admissible instance. We need some notation first.

Definition 9. For a set of attributes $A$ and a predicate $R \in \mathcal{R}$, $R^A$ denotes the predicate $R$ restricted to the attributes in $A$. $D^A$ denotes the database $D$ with all its database atoms projected onto the attributes in $A$, i.e. $D^A = \{R^A(\Pi_A(\bar{t})) \mid R(\bar{t}) \in D\}$, where $\Pi_A(\bar{t})$ is the projection on $A$ of tuple $\bar{t}$. $D^A$ has the same underlying domain $U$ as $D$.

⁶ For distinction from the notion of relevant attribute/variable used in Sections II-A and II-B.

Example 8. (example 3 continued) For the given instance $D$ and the set of attributes $A = \{R[2], R[3], S[1]\}$, we obtain the following instance $D^A$:

    R^A   B      C          S^A   B
          1      1                null
          null   null             1
          3      3                3

The following proposition provides a characterization of admissible instance for a set of secrecy views in terms of classical FO satisfaction. Its proof is rather straightforward [22, Proposition 1].

Proposition 1. Let $\mathcal{V}^s$ be a set of secrecy views $V_s$ of the form (2). Let $Q^{V_s}(\bar{x})$ be their expressions as conjunctive queries of the form $\exists \bar{y}(\bigwedge_{i=1}^{n} R_i(\bar{x}_i) \wedge \varphi)$. For an instance $D$, $D \in Admiss(\mathcal{V}^s)$ iff for each $V_s \in \mathcal{V}^s$, $D^{A(V_s)} \models Q_N^{V_s}$, where $Q_N^{V_s}$ is the sentence

$\bar{\forall}\Big(\bigwedge_{i=1}^{n} R_i^{A(V_s)}(\bar{x}_i) \rightarrow \bigvee_{v \in C(V_s)} v = null \ \vee \bigwedge_{u \in S(V_s)} u = null \ \vee\ \neg\varphi\Big).$

Here, $\bar{\forall}$ denotes the universal closure of the formula that follows it; and $D^{A(V_s)} \models Q_N^{V_s}$ refers to classical satisfaction in FO predicate logic, where null is treated as any other constant.

Example 9. (example 7 continued) According to the above definition, in order to check whether the database instance $D$ is admissible, the following must hold: $D^{A(V_s)} \models \forall x \forall y (R^{A(V_s)}(x, y) \wedge S^{A(V_s)}(y) \rightarrow y = null \vee x = null \vee y \leq 2)$, with $D^{A(V_s)}$ given below. When checking this, null is treated as any other constant.

    R^{A(V_s)}   A      B          S^{A(V_s)}   B
                 1      1                       null
                 2      null                    1
                 null   3                       3

For $x = 1$, $y = 1$, the antecedent of the implication is satisfied since $R^{A(V_s)}(1, 1) \in D^{A(V_s)}$, $S^{A(V_s)}(1) \in D^{A(V_s)}$. For these values, the consequent is also satisfied, because $y = 1 \leq 2$. For $x = 2$, $y = null$, the consequent is satisfied since $y$ is null. For $x = null$, $y = 3$, the antecedent is satisfied since $R^{A(V_s)}(null, 3) \in D^{A(V_s)}$, $S^{A(V_s)}(3) \in D^{A(V_s)}$. For these values, the consequent is also satisfied, because $x = null$ is true. So, $D^{A(V_s)} \models Q_N^{V_s}$, and the given instance $D$ is admissible.

Since we will virtually change attribute values by null in order to enforce privacy on instance $D$, we need to be able to compare database instances (for the same schema of $D$) that have been subject to this kind of changes. We have to introduce notions of comparison and minimality. A secrecy instance for $D$ will be admissible and will also minimally differ from $D$. As expected, the comparison of instances will consider the presence of null in their tuples.

Definition 10. [21] (a) The binary relation $\sqsubseteq$ on the database domain $U$ is defined as follows: $c \sqsubseteq d$ iff $c = d$ or $c = null$, where $c, d \in U$. (b) Let $\bar{t}_1 = (c_1, \ldots, c_n)$ and $\bar{t}_2 = (d_1, \ldots, d_n)$ be sequences of constants of $U$. $\bar{t}_1 \sqsubseteq \bar{t}_2$ iff $c_i \sqsubseteq d_i$ for each $i \in \{1, \ldots, n\}$. Also, $\bar{t}_1 \sqsubset \bar{t}_2$ iff $\bar{t}_1 \sqsubseteq \bar{t}_2$ and $\bar{t}_1 \neq \bar{t}_2$.

This partial order relationship $\bar{t}_1 \sqsubseteq \bar{t}_2$ indicates that $\bar{t}_1$ is less or equally informative than $\bar{t}_2$. For example, tuple $(a, null)$ provides less information than tuple $(a, b)$. Then $(a, null) \sqsubset (a, b)$ holds. In order to capture the fact that we are just modifying attribute values (as opposed to inserting or deleting tuples), we assume that database tuples have tuple identifiers. More precisely, we assume that each predicate has an additional, first, attribute TID, which is a key for the relation. Its values are taken in $\mathbb{N}$. In consequence, tuples in an instance, say our given instance $D$, will be of the form $R(k, \bar{t})$, with $k \in \mathbb{N}$, $\bar{t} \in U^n$, and $R \in \mathcal{R}$ of arity $n$. Below, we will consider only instances $D'$ that are correlated to $D$: There is a surjective function $\kappa$ from $D$ to $D'$, such that $\kappa(R(k, \bar{t})) = R(k, \bar{t}')$, for some $\bar{t}'$. That is, the mapping respects the predicate name and the tuple identifier. We say that $D'$ is $D$-correlated (via $\kappa$). In the rest of this section, $D$ is a fixed instance, the one under privacy protection.

Definition 11. (a) For database tuples $R_1(k_1, \bar{t}_1)$, $R_2(k_2, \bar{t}_2)$, with $R_1, R_2 \in \mathcal{R}$, $R_1(k_1, \bar{t}_1) \sqsubseteq R_2(k_2, \bar{t}_2)$ iff $R_1 = R_2$, $k_1 = k_2$, and $\bar{t}_1 \sqsubseteq \bar{t}_2$. (b) For $D$-correlated instances $D_1, D_2$, $D_1 \leq_D D_2$ iff for every $R(k, \bar{t}_1) \in D_1$, $R(k, \bar{t}_2) \in D_2$ it holds $R(k, \bar{t}_2) \sqsubseteq R(k, \bar{t}_1)$. As usual, $D_1 <_D D_2$ iff $D_1 \leq_D D_2$, but not $D_2 \leq_D D_1$.

Notice that the partial order $\leq_D$ between $D$-correlated instances inverts the partial order $\sqsubseteq$ between tuples.
The reason is that we want the secrecy instances to be minimal, as is customary for database repairs [2]. A minimal instance is closest to the original instance.

Definition 12. Given a set $\mathcal{V}^s$ of secrecy views, an instance $D_s$ is a secrecy instance for $D$ wrt $\mathcal{V}^s$ iff: (a) $D_s \in Admiss(\mathcal{V}^s)$, and (b) $D_s$ is $\leq_D$-minimal in the class of $D$-correlated database instances that satisfy (a), i.e. there is no instance $D'$ in that class with $D' <_D D_s$. $Sec(D, \mathcal{V}^s)$ denotes the set of all the secrecy instances for $D$ wrt $\mathcal{V}^s$.
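The orderings of Definitions 10 and 11 can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: null is modelled by Python's `None`, an instance is a dictionary from (predicate, tuple identifier) pairs to value tuples, and the function names `value_leq`, `tuple_leq`, `instance_leq` and `instance_lt` are hypothetical names chosen here, not notation from the paper.

```python
NULL = None  # assumption of this sketch: Python's None stands for the SQL null


def value_leq(c, d):
    """c is less or equally informative than d: c = d or c = null."""
    return c == d or c is NULL


def tuple_leq(t1, t2):
    """Componentwise extension of value_leq to tuples of equal length."""
    return len(t1) == len(t2) and all(value_leq(c, d) for c, d in zip(t1, t2))


def instance_leq(d1, d2):
    """D1 <=_D D2: every tuple of D2 is less or equally informative than
    the tuple of D1 with the same predicate and tuple identifier."""
    return d1.keys() == d2.keys() and all(tuple_leq(d2[k], d1[k]) for k in d1)


def instance_lt(d1, d2):
    """Strict version: D1 <=_D D2 holds, but D2 <=_D D1 does not."""
    return instance_leq(d1, d2) and not instance_leq(d2, d1)


# Two updated versions of D = {P(1, 1, 2), R(1, 2, 1)}; keys are (predicate, tid).
D3 = {("P", 1): (1, 2), ("R", 1): (NULL, 1)}
D4 = {("P", 1): (1, NULL), ("R", 1): (NULL, 1)}
print(instance_lt(D3, D4))
```

Because the instance order inverts the tuple order, the instance with fewer nulls is the smaller one, which is why minimal instances are the ones closest to $D$.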

Example 10. Consider the instance $D = \{P(1, 2), R(2, 1)\}$ for schema $\mathcal{R} = \{P(A, B), R(B, C)\}$. With tuple identifiers, it takes the form $D = \{P(1, 1, 2), R(1, 2, 1)\}$. Consider also the secrecy view: $V_s(x, z) \leftarrow P(x, y), R(y, z), y < 3$.$^7$ $D$ itself is not a secrecy instance, because it is not admissible. Now, consider the following alternative updated instances $D_i$:

    D_1    {P(1, null, 2), R(1, 2, null)}
    D_2    {P(1, 1, null), R(1, 2, 1)}
    D_3    {P(1, 1, 2), R(1, null, 1)}
    D_4    {P(1, 1, null), R(1, null, 1)}

They are all admissible, that is: $D_i^{A(V_s)} \models \forall x \forall y \forall z (P^{A(V_s)}(x, y) \wedge R^{A(V_s)}(y, z) \rightarrow (y = null \vee (x = null \wedge z = null) \vee y \geq 3))$, with $D_i^{A(V_s)} = \{P^{A(V_s)}(1, 2), R^{A(V_s)}(2, 1)\}$. $D_1$, $D_2$, and $D_3$ are the only three secrecy instances. $D_4$ is not, because $P(1, 1, null) \sqsubset P(1, 1, 2)$; and then, $D_3 <_D D_4$.

7 It would be easy to consider tuple ids in queries and view definitions, but they do not contribute to the final result and will only complicate the notation. So, we skip tuple ids whenever possible.

IV. Privacy Preserving Query Answers

Now we want to define and compute the secret answers to queries from a given database $D$ that is subject to privacy constraints. They will be defined on the basis of the class of secrecy instances for $D$. Its members will be queried instead of directly querying $D$. We may consider the class of secrecy instances as representing a logical database, given through its models. In such a case, the expected answers are those that are true of all the instances in the class, and become the so-called certain answers [16].

Definition 13. Let $Q(\bar{x}) \in L(\Sigma)$ be a conjunctive query. A tuple $\bar{a}$ of constants in $U$ is a secret answer to $Q$ from $D$ wrt a set of secrecy views $\mathcal{V}^s$ iff $D_s \models_N Q[\bar{a}]$ for each $D_s \in Sec(D, \mathcal{V}^s)$. $SA(Q, D, \mathcal{V}^s)$ denotes the set of all secret answers.

Example 11. (example 10 continued) Consider the query $Q(x, z) : \exists y (P(x, y) \wedge R(y, z) \wedge y < 3)$. According to the definition of query answer introduced in Sections II-A and II-B, it holds: $Q(D_1) = \{(null, null)\}$, $Q(D_2) = \emptyset$, and $Q(D_3) = \emptyset$. These answers can also be obtained by rewriting $Q$ into a new query $Q^N$ according to the methodology introduced in Section II-B (which is similar to the one applied in Proposition 1). In this case, we obtain $Q^N(x, z) : \exists y (P(x, y) \wedge R(y, z) \wedge y < 3 \wedge y \neq null)$. This query can be evaluated on each of the secrecy instances treating null as any other constant. Finally, we obtain $SA(Q, D, \{V_s\}) = Q(D_1) \cap Q(D_2) \cap Q(D_3) = \emptyset$. This is as expected, because $Q$ is $Q^{V_s}$, the query associated to the secrecy view.

Secret answers (SAs) are based on a skeptical or cautious semantics that considers as true what is true of all the secrecy instances. A more relaxed alternative is the possible or brave semantics, which considers an answer as valid if it is an answer from at least one of the secrecy instances. For instance, in Example 10, $(1, 2)$ is a possibly secret answer to the query $Q_1(x, y) : P(x, y)$, because it is an answer from instance $D_3$. Similarly, $(2, 1)$ is a possibly secret answer to $Q_2(x, y) : R(x, y)$. The possibly secret answers provide more information about the original database than the (certainly) secret answers. Actually, they do not help us solve the privacy problem. In the example just presented, from the possibly secret answers $P(1, 2)$ and $R(2, 1)$, the user can obtain the contents of the secrecy view. On the other side, even under the skeptical semantics, the user may try to pose queries to obtain sensitive information, as the following example shows.

Example 12. Consider instance $D = \{P(1, 2), P(3, 4), R(2, 1), R(3, 3)\}$ for schema $\mathcal{R} = \{P(A, B), R(B, C)\}$, and the secrecy view $V_s(x, z) \leftarrow P(x, y), R(y, z)$. $D$ has the following three secrecy instances:

    D_1    {P(null, 2), P(3, 4), R(2, null), R(3, 3)}
    D_2    {P(1, null), P(3, 4), R(2, 1), R(3, 3)}
    D_3    {P(1, 2), P(3, 4), R(null, 1), R(3, 3)}

The user may pose the queries $Q_1(x, y) : P(x, y)$ and $Q_2(x, y) : R(x, y)$, trying to reconstruct the original database $D$. For query $Q_1(x, y)$, it holds $Q_1(D_1) = \{(null, 2), (3, 4)\}$, $Q_1(D_2) = \{(1, null), (3, 4)\}$, and $Q_1(D_3) = \{(1, 2), (3, 4)\}$. In consequence, $SA(Q_1, D, \{V_s\}) = \{(3, 4)\}$. Now, for query $Q_2(x, y)$, $Q_2(D_1) = \{(2, null), (3, 3)\}$, $Q_2(D_2) = \{(2, 1), (3, 3)\}$, and $Q_2(D_3) = \{(null, 1), (3, 3)\}$. In consequence, $SA(Q_2, D, \{V_s\}) = \{(3, 3)\}$. By combining the secret answers to $Q_1$ and $Q_2$, it is not possible to obtain the extension $\{(1, 1)\}$ of $V_s(D)$. Actually, if the user poses the queries $Q_1$ and $Q_2$, for him the relations would look as follows:

    P   A   B        R   B   C
        3   4            3   3

In this case, any other conjunctive query posed to detect the presence of initial nulls, like $\exists y (P(x, y) \wedge x = null)$, will get an empty set of secret answers, and the user will not know anything more about the contents of the original instance.

Now, we analyze in more general terms this impossibility to obtain the contents of the secrecy views through the use of secret answers to queries.

Definition 14. Let $\mathcal{V}^s$ be a set of secrecy views $V_s$. The secrecy answer instance for $\mathcal{V}^s$ from $D$ is $D_{\mathcal{V}^s} = \{R(\bar{t}) \mid R \in \mathcal{R} \text{ and } \bar{t} \in SA(R(\bar{x}), D, \mathcal{V}^s)\}$.

Here, we are building a database instance by collecting the SAs to all the atomic queries of the form $R(\bar{x})$, with $R \in \mathcal{R}$. This instance has the same schema as $D$.

Example 13. (example 12 continued) Consider the secrecy view $V_s(x, z) \leftarrow P(x, y), R(y, z)$. It holds: $D_{\{V_s\}} = \{P(3, 4)\} \cup \{R(3, 3)\} = \{P(3, 4), R(3, 3)\}$.

The following proposition states that by combination of SAs to queries, there is no way for the user to discover the extensions of the secrecy views on the original instance $D$. The proof can be found in [22, Proposition 2]. It implies that from the reconstructed database we cannot obtain more information than the one provided by the secret answers.

Proposition 2. For every $V_s$ of the form (2) in $\mathcal{V}^s$, $SA(Q^{V_s}, D, \mathcal{V}^s) = V_s(D_{\mathcal{V}^s})$.

Notice that for our whole approach to work, we are assuming that the original database may be incomplete, in the sense that some information may be originally represented using null values. In consequence, a user who obtains nulls from a query will not know if the nulls were already there or were (virtually) introduced for privacy purposes. However, if the user knows the definition of the secrecy views, it is possible for him to combine SAs with the definition of the secrecy view, to determine the original contents of the view. This is shown in the next example.

Example 14. Consider the secrecy view $V_s(x) \leftarrow P(x, y), x = 1$, and the database instance $D = \{P(1, 1)\}$. $D$ has only one secrecy instance $D_s$:

    P   A      B
        null   1

For the query $Q(x) : \exists y (P(x, y) \wedge x = 1)$, the secret answer to $Q(x)$ on $D$ is $\emptyset$. If the user knows that there exists one tuple in $D$, and also the secrecy view definition, he could infer that $(1) \in V_s(D)$.

In summary, for our approach to work, we rely on the following assumptions: (a) The user interacts with a possibly incomplete database. (b) The interaction is via query answering. (c) The user does not know the secrecy view definitions. From these assumptions and Proposition 2, we can conclude that the user cannot obtain information about the secrecy views through a combination of SAs to conjunctive queries. Therefore, there is no leakage of sensitive information.
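The secret answers of Example 12 can be reproduced by brute force. The Python sketch below is not the method of this paper (which specifies secrecy instances with a logic program in Section V); it simply enumerates every way of replacing attribute values by null, keeps the admissible results for the view $V_s(x, z) \leftarrow P(x, y), R(y, z)$, selects those with a subset-minimal set of nulled cells, and intersects the answers. It assumes that the original instance contains no nulls, so that minimality wrt $\leq_D$ coincides with subset-minimality of the nulled cells. All function names are hypothetical.

```python
from itertools import combinations

NULL = None  # assumption of this sketch: None stands for the SQL null

# Example 12; keys are (predicate, tuple id), values are the attribute values.
D = {("P", 1): (1, 2), ("P", 2): (3, 4), ("R", 1): (2, 1), ("R", 2): (3, 3)}


def view_violated(inst):
    """True iff the view returns a tuple that is not made of nulls only.
    A join on y succeeds only for a non-null y (null-based semantics)."""
    ps = [t for (rel, _), t in inst.items() if rel == "P"]
    rs = [t for (rel, _), t in inst.items() if rel == "R"]
    for x, y in ps:
        for y2, z in rs:
            if y is not NULL and y == y2 and (x is not NULL or z is not NULL):
                return True
    return False


def null_out(inst, cells):
    """Copy of inst with every cell (key, position) in cells replaced by null."""
    return {
        key: tuple(NULL if (key, i) in cells else v for i, v in enumerate(tup))
        for key, tup in inst.items()
    }


def secrecy_instances(inst):
    """Admissible null-updates of inst whose set of nulled cells is minimal."""
    cells = [(key, i) for key, tup in inst.items() for i in range(len(tup))]
    admissible = [
        frozenset(subset)
        for k in range(len(cells) + 1)
        for subset in combinations(cells, k)
        if not view_violated(null_out(inst, frozenset(subset)))
    ]
    minimal = [s for s in admissible if not any(o < s for o in admissible)]
    return [null_out(inst, s) for s in minimal]


def secret_answers(instances, rel):
    """Answers to the atomic query rel(x, y) shared by all given instances."""
    answer_sets = [
        {t for (r, _), t in inst.items() if r == rel} for inst in instances
    ]
    return set.intersection(*answer_sets)


instances = secrecy_instances(D)
print(len(instances))                  # 3
print(secret_answers(instances, "P"))  # {(3, 4)}
print(secret_answers(instances, "R"))  # {(3, 3)}
```

The enumeration is exponential in the number of cells, so it is only meant to make the definitions concrete on very small instances.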

V. Secrecy Instances and Logic Programs


As indicated above, the updates leading to the secrecy instances have to be virtual. Actually, the secrecy instances are mainly an auxiliary notion to define the right secret answer semantics. In general, we expect not to have to compute all the secrecy instances, materialize them, and then cautiously query them. We would rather stick to the original instance, and use it as it is to obtain the secret answers.

One way to approach this problem is via query rewriting. Ideally, a query $Q$ posed to $D$ and expecting secret answers should be rewritten into another query $Q'$. This new query would be posed to $D$, and the usual answers returned by $D$ should be the secret answers to $Q$. We would like $Q'$ to be still a simple query that can be easily evaluated. However, the possibility of being able to do this is restricted by the intrinsic complexity of the problem of computing secret answers, which is likely to be higher than polynomial time in data (cf. Section VI). In consequence, $Q'$ may not be a conjunctive query, actually not even a FO query.

An alternative is to specify the secrecy instances in a compact manner, by means of a logical theory, and do reasoning from that theory. This will not decrease a high intrinsic complexity, but can be much more efficient than computing all the secrecy instances and querying them in turn. However, secret query answering (SQA) is a non-monotonic process, as the following example shows.

Example 15. Consider the instance $D_0 = \{P(a)\}$, the secrecy view $V(x) \leftarrow P(x), R(x)$, and the query $Q : Ans(x) \leftarrow P(x)$. Here, $V(D_0) = \emptyset$, and then, $D_0$ itself is the only secrecy instance. Therefore, $SA(Q, D_0, \{V\}) = \{(a)\}$. Let us update the instance to $D_1 = \{P(a), R(a)\}$. Now, $V(D_1) = \{(a)\}$. The secrecy instances for $D_1$ are: $D_1' = \{P(null), R(a)\}$ and $D_1'' = \{P(a), R(null)\}$. It holds $Q(D_1') = \{(null)\}$ and $Q(D_1'') = \{(a)\}$. In consequence, $SA(Q, D_1, \{V\}) = \emptyset$. The previous secret answer is lost.

The non-monotonicity of SQA tells us that a non-monotonic formalism is required to logically specify the secrecy instances of a given instance. Actually, the secrecy instances for an original database can be specified as the stable models of a disjunctive logic program. Those programs use annotation constants with the intended, informal semantics shown in the table below. More precisely, for each database predicate $R \in \mathcal{R}$, we introduce a copy of it that contains an extra, and last, attribute (or argument) to allocate in it an annotation constant.
So, a tuple of the form $R(\bar{t})$ would become an annotated atom of the form $R(\bar{t}, a)$.$^8$ The possible annotations are those in the table, and are used to keep track of virtual updates, i.e. of old and new data:

    Annotation   Atom                 The tuple R(a̅) ...
    bu           R(a̅, bu)            has already been updated
    au           R(a̅, au)            is updated tuple in database
    t            R(a̅, t)             is new or old tuple
    s            R(a̅, s)             stays in the final database

In $R(\bar{a}, bu)$, annotation bu means that the atom has already been updated, and au should appear in the new, updated atom. For example, consider a tuple $R(a, b) \in D$. A new tuple $R(a, null)$ is obtained via the update of $b$ to null. Therefore, $R(a, b, bu)$ denotes the old atom before updating, while $R(a, null, au)$ denotes the new atom after the update. The logic program uses these annotations to go through different steps; and finally, to specify and read off the atoms in the secrecy instances. The secrecy instances are obtained by restricting the models of the program to those atoms with the annotation constant s. As expected, the official semantics of the annotations is captured through the logic program; the table above is just for motivation.

In Section V-A we provide the general form of $\Pi(D, \mathcal{V}^s)$, the secrecy logic program that specifies the secrecy instances for an instance $D$ and set of secrecy views $\mathcal{V}^s$. Here, we illustrate the program by means of an example.

Example 16. (example 10 continued) Consider $D = \{P(1, 2), R(2, 1)\}$ and the secrecy view $V_s(x, z) \leftarrow P(x, y), R(y, z), y < 3$. The secrecy instance program $\Pi(D, \{V_s\})$ is as follows:

1. $P(1, 2, t)$, $R(2, 1, t)$. (the initial database contents)

2. $P(null, y, au) \vee P(x, null, au) \vee R(null, z, au) \leftarrow P(x, y, t), R(y, z, t), y < 3, y \neq null, aux(x, z)$.
   $R(y, null, au) \vee P(x, null, au) \vee R(null, z, au) \leftarrow P(x, y, t), R(y, z, t), y < 3, y \neq null, aux(x, z)$.
   $aux(x, z) \leftarrow P(x, y, t), R(y, z, t), y < 3, x \neq null$.
   $aux(x, z) \leftarrow P(x, y, t), R(y, z, t), y < 3, z \neq null$.

3. $P(x, y, bu) \leftarrow P(x, y, t), R(y, z, t), y < 3, y \neq null, aux(x, z), P(null, y, au), x \neq null$.
   $R(y, z, bu) \leftarrow P(x, y, t), R(y, z, t), y < 3, y \neq null, aux(x, z), R(y, null, au), z \neq null$.
   $P(x, y, bu) \leftarrow P(x, y, t), R(y, z, t), y < 3, y \neq null, aux(x, z), P(x, null, au)$.
   $R(y, z, bu) \leftarrow P(x, y, t), R(y, z, t), y < 3, y \neq null, aux(x, z), R(null, z, au)$.

4. $P(x, y, t) \leftarrow P(x, y)$. $P(x, y, t) \leftarrow P(x, y, au)$.
   $R(x, y, t) \leftarrow R(x, y)$. $R(x, y, t) \leftarrow R(x, y, au)$.

5. $P(x, y, s) \leftarrow P(x, y, t), not\ P(x, y, bu)$.
   $R(x, y, s) \leftarrow R(x, y, t), not\ R(x, y, bu)$.

The facts in 1. are the elements of the initial database. The most important rules of the program are those in 2. and 3. They enforce the update semantics of secrecy in the presence of null and using null. Rules in 2. capture in the body the violation of secrecy; and in the head, the intended way of restoring secrecy. The main idea behind this rule is that we can either update a combination of attributes or single secrecy attributes with null. In this example, we need to update with null values in attribute B or update values in attributes A and C, simultaneously. Since disjunctive programs do not allow conjunctions in the head, the intended head $(P(null, z) \wedge P(y, null)) \vee P(x, null) \vee Q(null, z) \leftarrow Body$ is represented by means of two rules: $P(null, z) \vee P(x, null) \vee Q(null, z) \leftarrow Body$ and $P(y, null) \vee P(x, null) \vee Q(null, z) \leftarrow Body$. Furthermore, we need to restore secrecy only if the given database is not already a secrecy instance, which happens when the combination attribute B is not null, the secrecy attributes A and C are not both null, and formula $\varphi$ is true.
8 We should use a new predicate, e.g. $R'$, but to keep the notation simple, we will reuse the predicate. We also omit tuple ids.
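The bookkeeping role of the annotations can be illustrated with a small Python sketch. It does not compute stable models and is not a substitute for a system such as DLV; it only mimics, on a given set of annotated atoms, the effect of the rules of type 4 (every au atom is also a t atom) and of type 5 (an s atom is a t atom that is not marked bu). The representation of atoms as (predicate, arguments, annotation) triples and the function name `read_off_secrecy_atoms` are assumptions of this sketch.

```python
NULL = None  # assumption of this sketch: None stands for the SQL null


def read_off_secrecy_atoms(annotated_atoms):
    """Return the atoms that would carry annotation s.

    annotated_atoms is a set of (predicate, args, annotation) triples,
    with annotation one of "t", "au", "bu".
    """
    atoms = set(annotated_atoms)
    # Rule type 4: an updated atom (au) is also a current atom (t).
    atoms |= {(p, args, "t") for (p, args, ann) in atoms if ann == "au"}
    # Rule type 5: keep the t atoms that have not been marked as bu.
    superseded = {(p, args) for (p, args, ann) in atoms if ann == "bu"}
    return {
        (p, args)
        for (p, args, ann) in atoms
        if ann == "t" and (p, args) not in superseded
    }


# The tuple R(a, b) is virtually updated to R(a, null).
atoms = {
    ("R", ("a", "b"), "t"),    # original tuple
    ("R", ("a", "b"), "bu"),   # it has been updated, so it is superseded
    ("R", ("a", NULL), "au"),  # the new, updated tuple
}
print(read_off_secrecy_atoms(atoms))
```

In the actual program the choice of which au atoms to introduce is made by the disjunctive rules under the stable model semantics; the sketch takes that choice as given.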


Here, we use $aux(x, z)$ to capture the condition $not\ (x = null \wedge z = null)$. The rules in 3. collect the tuples in the database that have already been updated and (virtually) no longer exist in the database. Rules 4. capture the atoms that are part of the database or updated atoms in the process of restoring secrecy. Rules in 5. collect the tuples that stay in the final state of the updated database.

The secrecy instances are represented by the stable models of program $\Pi(D, \mathcal{V}^s)$: Given a stable model, the associated secrecy instance is obtained by collecting the atoms that are annotated with s. The proof of this statement is rather long, and is similar in spirit to the proof of the fact that database repairs wrt integrity constraints [2] can be specified by means of disjunctive logic programs with stable model semantics (cf. [8], [1]). The program can be evaluated, for example, using the DLV system, which computes the disjunctive stable model semantics. It offers a nice and effective interface to commercial DBMSs [20].

Example 17. (example 16 continued) The program has three stable models (the facts of the program are omitted for simplicity):

$M_1 = \{P(1, 2, t), R(2, 1, t), aux(1, 1), P(1, 2, s), R(2, 1, bu), R(null, 1, au), R(null, 1, t), R(null, 1, s)\}$.

$M_2 = \{P(1, 2, t), R(2, 1, t), aux(1, 1), P(1, 2, bu), R(2, 1, s), P(1, null, au), P(1, null, t), P(1, null, s)\}$.

$M_3 = \{P(1, 2, t), R(2, 1, t), aux(1, 1), P(1, 2, bu), R(2, 1, bu), P(null, 2, au), R(2, null, au), P(null, 2, t), R(2, null, t), aux(1, null), aux(null, 1), P(null, 2, s), R(2, null, s)\}$.

The secrecy instances are built by selecting the atoms annotated with s: $D_1 = \{P(1, 2), R(null, 1)\}$, $D_2 = \{P(1, null), R(2, 1)\}$, and $D_3 = \{P(null, 2), R(2, null)\}$. These are the secrecy instances obtained in Example 10.

If we want to obtain the secret answers to a query, it is not necessary to explicitly compute all the stable models.
Instead, the query can be posed directly on top of the program and answered according to the skeptical semantics. This will return the secret answers to the query. Of course, the query has to be formulated as a top-layer program and with atoms annotated with s, since those are the atoms that affect the query. Optimization methods are available that bypass the full computation of all stable models [9].

Example 18. (example 17 continued) We want the secret answers to the conjunctive query $Q(x, z) : \exists y (P(x, y) \wedge R(y, z) \wedge y < 3)$. This requires first rewriting it as the query $Q^N$ obtained in Example 11. This new query can be evaluated against instances with null treated as any other constant. Now, query $Q^N$ has to be transformed into a query program with all the atoms in it of the form $R(\bar{x})$, with $R \in \mathcal{R}$, replaced by $R(\bar{x}, s)$. In this case we obtain a simple query program: $Ans(x, z) \leftarrow P(x, y, s), R(y, z, s), y < 3, y \neq null$, which has to be evaluated in combination with the program in Example 16, under the skeptical semantics. In this evaluation, null is treated as an ordinary constant.

A. The general secrecy logic program

To provide the general form of the secrecy logic program, we need to introduce some notation first. We recall that our view definitions are of the form

$V_s(\bar{x}) \leftarrow R_1(\bar{x}_1), \ldots, R_n(\bar{x}_n), \varphi.$ (5)

Some of the variables$^9$ in atoms in the body of the definitions are relevant or sensitive, in the sense defined above, and will be replaced by nulls. Those atoms and variables (or, better, arguments) have a crucial role in the logic program. For an atom of the form $R(\bar{x})$, $\bar{y} \subseteq \bar{x}$, and $t$ a variable or constant, $R(\bar{x})\frac{\bar{y}}{t}$ denotes $R(\bar{x})$ with all the variables in $\bar{y}$ replaced by $t$. Now, we define

$CP(V_s) = \{R_i(\bar{x}_i)\frac{\bar{y}}{null} \mid R_i(\bar{x}_i) \text{ is in the body of (5)}, \bar{y} = \{y_1, \ldots, y_n\} \subseteq \bar{x}_i, \text{ and } y_i \in C(V_s)\}$.

$SP(V_s) = \{R_i(\bar{x}_i)\frac{\bar{y}}{null} \mid R_i(\bar{x}_i) \text{ is in the body of (5)}, \bar{y} = \{y_1, \ldots, y_n\} \subseteq \bar{x}_i, \text{ and } y_i \in S(V_s)\}$.

The sets of predicate positions, $C(V_s)$ and $S(V_s)$, are introduced in Definition 6. Sets $CP(V_s)$ and $SP(V_s)$ will be used below in the head of the disjunctive program rule which is used to impose secrecy by changing some relevant values into nulls.

Example 19. Consider the secrecy view: $V_s(x, z, w) \leftarrow P(x, y), Q(y, z, w)$. According to Definition 6, $C(V_s) = \{P[2], Q[1]\}$, resp. $S(V_s) = \{P[1], Q[2], Q[3]\}$. Thus, $CP(V_s) = \{P(x, null), Q(null, z, w)\}$, and $SP(V_s) = \{P(null, y), Q(y, null, null)\}$.

Now we can give the general form of the secrecy program. Given a database instance $D$ and a set $\mathcal{V}^s$ of secrecy views $V_s$, each of them of the form (5), the secrecy program $\Pi(D, \mathcal{V}^s)$ contains the following rules:

1. Facts: $R(\bar{a}, t)$ for each atom $R(\bar{a}) \in D$.

2. For every $V_s$ of the form (5), if $SP(V_s) = \{R^{s_1}(\bar{x}_{s_1}), \ldots, R^{s_a}(\bar{x}_{s_a})\}$, and $CP(V_s) = \{R^{c_1}(\bar{x}_{c_1}), \ldots, R^{c_b}(\bar{x}_{c_b})\}$, then the program contains the rules:
9 To be more precise, we should talk about variables in relevant positions or arguments, as we did before, e.g. in Section III, but the description would be less intuitive.



(a) If $S(V_s) \cap C(V_s) \neq \emptyset$, the rule:

$\bigvee_{R^c \in CP(V_s)} R^c(\bar{x}_c, au) \leftarrow \bigwedge_{i=1}^{n} R_i(\bar{x}_i, t), \varphi, \bigwedge_{c_l \in C(V_s)} c_l \neq null.$

(b) If $S(V_s) \cap C(V_s) = \emptyset$, for each $R^{s_j} \in SP(V_s)$, $1 \leq j \leq a$, the rule:

$R^{s_j}(\bar{x}_{s_j}, au) \vee \bigvee_{R^c \in CP(V_s)} R^c(\bar{x}_c, au) \leftarrow \bigwedge_{i=1}^{n} R_i(\bar{x}_i, t), \varphi, \bigwedge_{c_l \in C(V_s)} c_l \neq null, aux_{V_s}(\bar{s}).$

Plus rules defining the auxiliary predicates: If $S(V_s) = \{s_1, \ldots, s_k\}$ and $\bar{s} = \bigcup s_i$, then for $1 \leq i \leq k$:

$aux_{V_s}(\bar{s}) \leftarrow \bigwedge_{i=1}^{n} R_i(\bar{x}_i, t), \varphi, s_i \neq null.$

3. The collection rules:

(a) For each $R^{s_j} \in SP(V_s)$, $1 \leq j \leq a$:

$R^{s_j}(\bar{x}_{s_j}, bu) \leftarrow \bigwedge_{i=1}^{n} R_i(\bar{x}_i, t), \varphi, aux_{V_s}(\bar{s}), \bigwedge_{c_l \in C(V_s)} c_l \neq null, R^{s_j}(\bar{x}_{s_j}, au), \bigwedge_{s_l \in S(V_s) \cap \bar{x}_{s_j}} s_l \neq null.$

(b) For each $R^{c_k} \in CP(V_s)$, $1 \leq k \leq b$:

$R^{c_k}(\bar{x}_{c_k}, bu) \leftarrow \bigwedge_{i=1}^{n} R_i(\bar{x}_i, t), \varphi, aux_{V_s}(\bar{s}), \bigwedge_{c_l \in C(V_s)} c_l \neq null, R^{c_k}(\bar{x}_{c_k}, au).$

4. For each $R \in \mathcal{R}$, the rule: $R(\bar{x}, t) \leftarrow R(\bar{x}, au).$

5. For each $R \in \mathcal{R}$, the rule: $R(\bar{x}, s) \leftarrow R(\bar{x}, t), not\ R(\bar{x}, bu).$

Rules in 1. create program facts from the database atoms. Rules in 2. are the most important and express how to impose secrecy by changing attribute values into nulls. The body of the rule becomes true when the database instance is not a secrecy instance, and the head captures the intended ways of imposing secrecy. Rules in 3. collect the tuples in the database that have already been updated and (virtually) no longer exist in the database. Rules 4. capture the atoms that are part of the database or updated atoms in the process of imposing secrecy. Rules in 5. collect the tuples that stay in the final state of the updated, secrecy instance.

VI. The CQA Connection

Consider a database instance $D$ that fails to satisfy a given set of integrity constraints $IC$. It still contains useful and some semantically correct information. The area of consistent query answering (CQA) [2] has to do with: (a) Characterizing the information in $D$ that is still semantically correct wrt $IC$, and (b) Characterizing, and computing, in particular, the semantically correct, i.e. consistent, answers to a query $Q$ from $D$ wrt $IC$. The first goal is achieved by proposing a repair semantics, i.e. a class of alternative instances to $D$ that are consistent wrt $IC$ and minimally depart from $D$. The consistent information in $D$ is the one that is invariant under all the repairs in the class. This applies in particular to the consistent answers: They should hold in every minimally repaired instance.

There are some connections between CQA and our treatment of privacy preserving query answering. Notice that every view definition of the form (2) can be seen as an integrity constraint expressed in the FO language $L(\Sigma \cup \{V_s\})$:

$\forall \bar{x} (V_s(\bar{x}) \leftrightarrow \exists \bar{y} (R_1(\bar{x}_1) \wedge \cdots \wedge R_n(\bar{x}_n) \wedge \varphi)),$ (6)

with $\bar{y} = (\bigcup \bar{x}_i) \setminus \bar{x}$. From this perspective, the problem of view maintenance, i.e. of maintaining the view defined by (6) synchronized with the base relations [15], becomes a problem of database maintenance, i.e. maintenance of the consistency of the database wrt (6) seen as an IC. This also works in the other direction, since every IC can be associated to a violation view, which has to stay empty for the IC to stay satisfied.

Actually, we want more than maintaining the view defined in (6). We want it to be empty or returning only tuples with null values. In consequence, we have to impose the following ICs on $D$, which are obtained from the RHS of (6): If $\bar{x}$ is $x_1, \ldots, x_k$, then for $1 \leq i \leq k$,

$\neg \exists \bar{y} \bar{x} (R_1(\bar{x}_1) \wedge \cdots \wedge R_n(\bar{x}_n) \wedge \varphi \wedge x_i \neq null).$ (7)

That is, from each view definition (6) we obtain $k$ denial constraints, i.e. prohibited conjunctions of (positive) database atoms and built-ins. This class of constraints has been investigated in CQA [12], [2], but under different repair semantics. In our case, the secrecy instances correspond to the repairs of $D$ wrt the set $IC$ of ICs in (7). These repairs are defined according to the null-based repair semantics introduced in Section III, i.e. $\leq_D$-minimality. Through this correspondence we can benefit from concepts and techniques developed for CQA.

Example 20. The secrecy view defined by $V_s(x, z) \leftarrow P(x, y), R(y, z), y < 3$ gives rise to the following denial constraints: $\neg \exists x y z (P(x, y) \wedge R(y, z) \wedge y < 3 \wedge x \neq null)$ and $\neg \exists x y z (P(x, y) \wedge R(y, z) \wedge y < 3 \wedge z \neq null)$. The initial instance $D$ has to be minimally repaired in order to satisfy them.
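The two denial constraints of Example 20, obtained from the view $V_s(x, z) \leftarrow P(x, y), R(y, z), y < 3$, can be checked directly on small instances. The following Python sketch is only illustrative: it hard-codes this particular view, models null as `None`, lets a join or a comparison succeed only on non-null values (following the null-based semantics used throughout), and uses the hypothetical names `violates` and `satisfies_denials`.

```python
NULL = None  # assumption of this sketch: None stands for the SQL null


def violates(instance, secrecy_attr):
    """True iff some P(x, y), R(y, z) with a non-null join value y < 3
    leaves the chosen secrecy attribute ("x" or "z") non-null."""
    for x, y in instance["P"]:
        for y2, z in instance["R"]:
            joins = y is not NULL and y == y2 and y < 3
            exposed = x if secrecy_attr == "x" else z
            if joins and exposed is not NULL:
                return True
    return False


def satisfies_denials(instance):
    """True iff both denial constraints of Example 20 hold."""
    return not violates(instance, "x") and not violates(instance, "z")


D = {"P": [(1, 2)], "R": [(2, 1)]}           # original instance
D1 = {"P": [(1, 2)], "R": [(NULL, 1)]}       # join attribute nulled in R
D2 = {"P": [(1, NULL)], "R": [(2, 1)]}       # join attribute nulled in P
D3 = {"P": [(NULL, 2)], "R": [(2, NULL)]}    # both secrecy attributes nulled

for name, inst in [("D", D), ("D1", D1), ("D2", D2), ("D3", D3)]:
    print(name, satisfies_denials(inst))
```

The check only decides whether an instance is consistent wrt the denial constraints; selecting the repairs additionally requires the $\leq_D$-minimality condition of Section III.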

VII. Related Work


Other researchers have investigated the problem of data privacy and access control in relational databases. We described in Section I the approach based on authorization views [24], [29]. In [17], privacy is specified through the values in cells within tables that can be accessed by a user. To answer a query $Q$ without violation of privacy, they propose the table and query semantics models, which generate masked versions of the tables by replacing all the cells that are not allowed to be accessed with NULL. When the user issues $Q$, $Q$ is posed to the masked versions of the tables, and answered as usual. The table semantics is independent of any queries and views. However, the query semantics takes queries into account. [17] shows the implementation of the two models based on query rewriting.

Recent work [26] has presented a labeling approach for masking unauthorized information by using two types of special variables: Type-1/Type-2 variables. Based on this approach, they proposed a secure and sound query evaluation algorithm in the case of cell-level disclosure policies, which determine for each cell whether the cell is allowed to be accessed or not. The algorithm is based on query modification, into one that returns less information than the original one. Those approaches propose query rewriting to enforce fine-grained access control in databases. Their approach is mainly algorithmic.

Data privacy and access control in incomplete databases have been studied in [2, 3, 20]. They address privacy issues in incomplete propositional databases. In [4], [5], [27], the authors take a different approach to fine-grained access control. The approach, controlled query evaluation (CQE), is policy-driven, and aims to ensure confidentiality on the basis of a logical framework. A security policy specifies the facts that a certain user is not allowed to access. Each query posed to the database by that user is checked, as to whether the answers to it would allow the user to infer any sensitive information. If that is the case, the answer is distorted by either lying or refusal or combined lying and refusal. In [6], they extend CQE to restricted incomplete first-order logic databases. In that case, CQE is based on a transformation of the specification language into a corresponding propositional language. This approach does not seem to be comparable to ours. They do not use null values, and the issue of maximality of answers that do not compromise privacy is not explicitly addressed.

Our approach is based on producing virtual updates on the database, by forcing the secrecy views to be empty or contain only tuples with nulls. This is clearly reminiscent of the older, but still challenging database problem of updating a database through views [11]. Here we confront new difficulties, namely the occurrence of SQL nulls with a special semantics, and the corresponding notion of null-based minimality of changes on the base relations. In [7] a null-based repair semantics was introduced, but it differs from the one introduced in Section III. The former was proposed for enforcing satisfaction of sets of ICs that include referential ICs, which require the possible insertion of new tuples with nulls. The comparison between instances is based on both sets of full tuples and the occurrence of nulls in them. Here, we enforce secrecy by changes of attribute values only.

VIII. Conclusions
In this work, we have developed a logical framework and a methodology to answer conjunctive queries that do not reveal secret information as specified by secrecy views. We have concentrated on the case of conjunctive secrecy views and conjunctive queries, but it is possible to relax these restrictions. We have assumed that the databases may contain nulls, and also that nulls are used to protect secret information, by virtually updating with nulls some of the attribute values. In each of the resulting alternative virtual instances, the secrecy views either become empty or contain tuples showing only null values. The queries can be posed against any of these virtual instances or cautiously against all of them, simultaneously.

The update semantics enforces (or captures) two natural requirements: that the updates are based on null values, and that the updated instances stay close to the given instance. In this way, the query answers become implicitly maximally informative, while not revealing the original contents of the secrecy views.

The null values are treated as in the SQL standard, which in our case is reconstructed in classical logic. This logical reconstruction captures well the semantics of SQL nulls (which is not clear or complete in the standard), at least for the case of conjunctive query answering, and some extensions thereof. This is the main reason for concentrating on conjunctive queries and views. In this case, queries, views and ICs can be syntactically transformed into new FO formulas for which the evaluation or verification can be done by treating nulls as any other constant.

We introduced disjunctive logic programs with stable model semantics to specify the secrecy instances. This is a single program that can be used to compute secret answers to any conjunctive query. This provides a general mechanism, but may not be the most efficient way to go for some classes of secrecy views and queries, and ad hoc methods can be proposed for them, as has been the case in CQA [3].

Our work leaves several open problems, and they are a matter of ongoing and future research. Complexity issues have to be explored, for example, the complexity of deciding whether or not a particular instance is a secrecy instance of an original instance, and of deciding if a tuple is a secret answer to a query. The connection with CQA, where similar problems have been investigated, looks very promising in this regard. Another problem is about query rewriting, i.e. about the possibility of rewriting the original query into a new FO query, in such a way that the new query, when answered by the given instance, returns the secret answers. From the connection with CQA we can predict that this approach has limited applicability, but whenever possible, it should be used, for its simplicity and lower complexity.

For future work, it would be interesting to investigate the connections with view determinacy [23], which has to do with the possible determination of extensions of query answers by a set of views with fixed contents. The occurrence of SQL nulls and their semantics introduces a completely new dimension into this problem. A natural extension of this work would be to define more expressive secrecy views, e.g. secrecy views with negation. Another direction consists in adding ICs to the schema. If they are known to the user, and it is also known that they are satisfied by the database, then privacy could be compromised. Also, the updates leading to the virtual instances should take these ICs into account, to produce consistent secrecy instances.
Acknowledgements: This research started when Leo Bertossi was spending his sabbatical at the Technical University of Vienna in 2006. The support provided by Georg Gottlob, Thomas Eiter and a Pauli Fellowship of the Wolfgang Pauli Institute, Vienna is highly appreciated. We are indebted to Thomas Eiter and Loreto Bravo for technical conversations at a preliminary stage of this research, and to Sina Ariyan for some computational experiments. This research was funded by an NSERC Discovery grant and the NSERC/IBM CRDPJ/371084-2008.

References
[1] Barcelo, P. Applications of Annotated Predicate Calculus and Logic Programs to Querying Inconsistent Databases. MSc Thesis PUC, 2002. http://people.scs.carleton.ca/~bertossi/papers/tesisk.pdf
[2] Bertossi, L. Consistent Query Answering in Databases. ACM Sigmod Record, June 2006, 35(2):68-76.
[3] Bertossi, L. From Database Repair Programs to Consistent Query Answering in Classical Logic (extended abstract). In Proc. The Alberto Mendelzon International Workshop on Foundations of Data Management (AMW09), CEUR-WS, Vol-450, 15 pp.
[4] Biskup, J. and Weibert, T. Confidentiality Policies for Controlled Query Evaluation. In Data and Applications Security, Springer LNCS 4602, 2007, pp. 1-13.
[5] Biskup, J. and Weibert, T. Keeping Secrets in Incomplete Databases. International Journal of Information Security, 2008, 7(3):199-217.
[6] Biskup, J., Tadros, C. and Wiese, L. Towards Controlled Query Evaluation for Incomplete First-Order Databases. In Proc. FoIKS10, Springer LNCS 5956, 2010, pp. 230-247.
[7] Bravo, L. and Bertossi, L. Semantically Correct Query Answers in the Presence of Null Values. Proc. EDBT WS on Inconsistency and Incompleteness in Databases (IIDB06), J. Chomicki and J. Wijsen (eds.), Springer LNCS 4254, 2006, pp. 336-357.
[8] Bravo, L. Handling Inconsistency in Databases and Data Integration Systems. PhD. Thesis, Carleton University, Department of Computer Science, 2007. http://people.scs.carleton.ca/~bertossi/papers/Thesis36.pdf
[9] Caniupan, M. and Bertossi, L. The Consistency Extractor System: Answer Set Programs for Consistent Query Answering in Databases. Data & Knowledge Engineering, 2010, 69(6):545-572.
[10] Codd, E.F. Extending the database relational model to capture more meaning. ACM Trans. Database Syst., 1979, 4(4):397-434.
[11] Cosmadakis, S. and Papadimitriou, Ch. Updates of Relational Views. Journal of the ACM, 1984, 31(4):742-760.
[12] Chomicki, J. and Marcinkowski, J. Minimal-Change Integrity Maintenance Using Tuple Deletions. Information and Computation, 2005, 197(1-2):90-121.
[13] Gelfond, M. and Lifschitz, V. Classical Negation in Logic Programs and Disjunctive Databases. New Generation Computing, 1991, 9:365-385.
[14] Gelfond, M. and Leone, N. Logic Programming and Knowledge Representation: The A-Prolog Perspective. Artificial Intelligence, 2002, 138(1-2):3-38.
[15] Gupta, A. and Singh Mumick, I. Maintenance of Materialized Views: Problems, Techniques, and Applications. IEEE Data Engineering Bulletin, 1995, 18(2):3-18.
[16] Imielinski, T. and Lipski, W. Jr. Incomplete Information in Relational Databases. Journal of the ACM, 1984, 31(4):761-791.
[17] LeFevre, K., Agrawal, R., Ercegovac, V., Ramakrishnan, R., Xu, Y. and DeWitt, D. Limiting Disclosure in Hippocratic Databases. In Proc. International Conference on Very Large Data Bases (VLDB04), 2004, pp. 108-119.
[18] Lechtenbörger, J. and Vossen, G. On the Computation of Relational View Complements. Proc. ACM Symposium on Principles of Database Systems (PODS02), 2002, pp. 142-149.
[19] Lechtenbörger, J. The Impact of the Constant Complement Approach towards View Updating. Proc. ACM Symposium on Principles of Database Systems (PODS03), 2003, pp. 49-55.
[20] Leone, N., Pfeifer, G., Faber, W., Eiter, T., Gottlob, G., Perri, S. and Scarcello, F. The DLV System for Knowledge Representation and Reasoning. ACM Transactions on Computational Logic, 2006, 7(3):499-562.
[21] Levene, M. and Loizou, G. Null Inclusion Dependencies in Relational Databases. Information and Computation, 1997, 136(2):67-108.
[22] Li, L. Achieving Data Privacy Through Virtual Updates. MSc. Thesis, Carleton University, Department of Computer Science, 2011. http://people.scs.carleton.ca/~bertossi/papers/thesisLechen.pdf
[23] Nash, A., Segoufin, L. and Vianu, V. Views and Queries: Determinacy and Rewriting. ACM Transactions on Database Systems, 2010, 35(3).
[24] Rizvi, S., Mendelzon, A., Sudarshan, S. and Roy, P. Extending Query Rewriting Techniques for Fine-Grained Access Control. In Proc. ACM International Conference on Management of Data (SIGMOD04), 2004, pp. 551-562.
[25] Vassiliou, Y. Null Values in Data Base Management: A Denotational Semantics Approach. In Proc. ACM International Conference on Management of Data (SIGMOD79), 1979, pp. 162-169.
[26] Wang, Q., Yu, T., Li, N., Lobo, J., Bertino, E., Irwin, K. and Byun, J.-W. On the Correctness Criteria of Fine-Grained Access Control in Relational Databases. In Proc. International Conference on Very Large Data Bases (VLDB07), 2007, pp. 555-566.
[27] Weibert, T. A Framework for Inference Control in Incomplete Logic Databases. PhD thesis, Technische Universität Dortmund, 2008.
[28] Zaniolo, C. Database Relations with Null Values. In Proc. ACM Symposium on Principles of Database Systems (PODS82), 1982, pp. 27-33. ACM.
[29] Zhang, Z. and Mendelzon, A. Authorization Views and Conditional Query Containment. In Proc. International Conference on Database Theory (ICDT05), Springer LNCS 3363, 2005, pp. 259-273.
