
Association Rule Hiding

V.S. Verykios (1), A.K. Elmagarmid (2), E. Bertino (3), Y. Saygın (4), and E. Dasseni (3)
(1) College of Information Science and Technology, Drexel University
(2) Department of Computer Sciences, Purdue University
(3) Department of Computer Sciences, University of Milan, Italy
(4) Faculty of Engineering and Natural Sciences, Sabancı University, Turkey
Jan 7, 2003
Abstract

Large repositories of data contain sensitive information that must be protected against unauthorized access. The protection of the confidentiality of this information has been a long-term goal for the database security research community and for government statistical agencies. Recent advances in data mining and machine learning algorithms have increased the disclosure risks that one may encounter when releasing data to outside parties. A key problem, still not sufficiently investigated, is the need to balance the confidentiality of the disclosed data with the legitimate needs of the data users. Every disclosure limitation method affects, and in some way modifies, true data values and relationships. In this paper, we investigate confidentiality issues of a broad category of rules, the association rules. In particular, we present three strategies and five algorithms for hiding a group of association rules, which is characterized as sensitive. A rule is characterized as sensitive if its disclosure risk is above a certain privacy threshold. Sometimes, sensitive rules should not be disclosed to the public since, among other things, they may be used for inferring sensitive data, or they may provide business competitors with an advantage. We also perform an extensive validation and evaluation study of the hiding algorithms, in order to analyze their time complexity and the impact that they have on the original database.

Index Terms: Privacy preserving data mining, association rule mining, sensitive rule hiding.

Introduction

Many government agencies, businesses, and non-profit organizations, in order to support their short- and long-term planning activities, are searching for a way to collect, store, analyze, and report data about individuals, households, or businesses. Information systems, therefore,

contain confidential information such as social security numbers, income, credit ratings, type of disease, customer purchases, etc., that must be properly protected. Securing data against unauthorized access has been a long-term goal of the database security research community and of government statistical agencies. Solutions to such a problem require combining several techniques and mechanisms [1, 11, 9]. In an environment where data have different sensitivity levels, the data may be classified at different levels and made available only to those subjects with an appropriate clearance. A multilevel secure system ensures that data at classification levels above the user's login level are invisible to that user. Such a system is designed to prevent higher-level data from being directly encoded in lower-level data. This can be done by changing the values of lower-level data in a particular pattern, or by selecting certain visible data for output based upon the values of higher-level data. It is, however, well known that simply restricting access to sensitive data does not ensure complete sensitive data protection. For example, sensitive or "high" data items may be inferred from non-sensitive, or "low", data through some inference process based on knowledge of the semantics of the application that the user has. Such a problem, known as the inference problem, has been widely investigated and possible solutions have been identified. The proposed solutions address the problem of how to prevent the disclosure of sensitive data through the combination of known inference rules with non-sensitive data. Examples of inference rules are deductive rules, functional dependencies, or material implications [9]. Recent advances in Data Mining (DM) techniques and related applications have, however, increased the security risks that one may incur when releasing data. The elicitation of knowledge that can be attained by such techniques has been the focus of the Knowledge Discovery in Databases (KDD) researchers' effort for years, and by now it is a well understood problem [7]. On the other hand, the impact of these techniques on information confidentiality has not been considered until very recently. The process of uncovering hidden patterns from large databases was first indicated as a threat to database security by O'Leary [10], in a paper presented at the 1st International Conference on Knowledge Discovery and Databases.

Piatetsky-Shapiro, at GTE Laboratories, was the chair of a mini-symposium on knowledge discovery in databases and privacy, organized around the issues raised in O'Leary's paper in 1991. The focal point discussed by the panel was the limitation of disclosure of personal information, which is not different in principle from the focal point of statisticians and database researchers, since in many fields, like medical and socio-economic research, the goal is not to discover patterns about specific individuals but patterns about groups. The compromise of the confidentiality of sensitive information that is not limited to patterns specific to individuals is another form of threat, which is analyzed in a recent paper by Clifton from Mitre Corporation and Marks from the Department of Defense [6]. The authors provide a well designed scenario of how different data mining techniques can be used in a business setting to provide business competitors with an advantage. For completeness, we report such a scenario below. Let us suppose that we are negotiating a deal with the Dedtrees Paper Company, as purchasing directors of BigMart, a large supermarket chain. They offer their products at a reduced price if we agree to give them access to our database of customer purchases. We accept the deal and Dedtrees starts mining our data. By using an association rule mining tool, they find that people who purchase skim milk also purchase Green paper. Dedtrees now runs a coupon marketing campaign saying that you can get 50 cents off skim milk with every purchase of a Dedtrees product. This campaign cuts heavily into the sales of Green paper, who increase their prices to us, based on the lower sales. During our next negotiation with Dedtrees, we find out that with reduced competition they are unwilling to offer us a low price. Finally, we start losing business to our competitors, who were able to negotiate a better deal with Green paper. The scenario just presented indicates the need to prevent the disclosure not only of confidential personal information from summarized or aggregated data, but also to prevent data mining techniques from discovering sensitive knowledge which is not even known to the database owners. The necessity to combine confidentiality with the legitimate needs of data users is imperative. Every disclosure limitation method has an impact, which is not always a positive

one, on true data values and relationships. Ideally, these effects can be quantified so that their anticipated impact on the completeness and validity of the data can guide the selection and use of the disclosure limitation method. In this paper, we propose new strategies and a suite of algorithms for hiding sensitive knowledge from data by minimally perturbing their values. The hiding strategies that we propose are based on reducing the support and confidence of rules, the measures that specify how significant they are (confidence and support will be formally defined in Section 3). In order to achieve this, transactions are modified by removing some items, or by inserting new items, depending on the hiding strategy. The constraint on the algorithms is that the changes introduced to the database by the hiding process should be limited, in such a way that the information loss incurred by the process is minimal. The selection of the items in a rule to be hidden and the selection of the transactions that will be modified are crucial factors for achieving the minimal information loss constraint. We also perform a detailed performance evaluation and validation study in order to show that the proposed algorithms are computationally efficient and provide certain guarantees on the changes that they impose on the original database. According to this, we try to apply minimal changes to the database at every step of the hiding algorithms that we propose. The rest of this paper is organized as follows: in Section 2 we present an overview of the current approaches to the problem of DM and security. Section 3 gives a formalization of the problem, while solutions are presented in Section 4. Section 5 discusses performance results obtained from the application of the devised algorithms. Concluding remarks and future extensions are listed in Section 6. In Appendix A, a discussion of the possible damage to the data, in terms of possible queries, is provided. In Appendix B, we present analysis and validation results. Issues concerning the implementation of the algorithms are discussed in Appendix C.

Background and Related Work

The security impact of DM is analyzed in [6], and some possible approaches to the problem of inference and discovery of sensitive knowledge in a data mining context are suggested. The proposed strategies include fuzzifying and augmenting the source database, and also limiting

the access to the source database by releasing only samples of the original data. Clifton [5] adopts the last approach, as he studies the correlation between the amount of released data and the significance of the patterns that are discovered. He also shows how to determine the sample size in such a way that data mining tools cannot obtain reliable results. Clifton and Marks in [6] also recognize the necessity of analyzing the various data mining algorithms in order to increase the efficiency of any adopted strategy that deals with the disclosure limitation of sensitive data and knowledge. The analysis of the data mining techniques should be considered as the first step in the problem of security maintenance: if the selection criteria that are used to measure the significance of the induced patterns are known, it will be easier to identify what must be done to protect sensitive information from being disclosed. While the solution proposed by Clifton in [5] is independent from any specific data mining technique, other researchers [8, 4] propose solutions that prevent the disclosure of confidential information for specific data mining algorithms, such as association rule mining and classification rule mining. Classification mining algorithms may use sensitive data to rank objects; each group of objects has a description given by a combination of non-sensitive attributes. The sets of descriptions obtained for a certain value of the sensitive attribute are referred to as the description space. For Decision-Region based algorithms, the description space generated by each value of the sensitive attribute can be determined a priori. The authors in [8] first identify two major criteria which can be used to assess the output of a classification inference system, and then they use these criteria, in the context of Decision-Region based algorithms, to inspect and to modify, if necessary, the description of a sensitive object, so that they can be sure that it is not sensitive. Agrawal and Srikant used data perturbation techniques to modify the confidential data values in such a way that approximate data mining results could be obtained from the modified version of the database [3]. They considered applications where the individual data values are confidential rather than the data mining results, and concentrated on a specific data mining model, namely classification by decision trees. Agrawal and Aggarwal, in their recent paper, enhance the data perturbation methods by using expectation maximization for reconstructing the original data distribution, which is further used to construct the classification model [2].

Disclosure limitation of sensitive knowledge by data mining algorithms, based on the retrieval of association rules, has also been recently investigated [4]. The authors in [4] propose to prevent the disclosure of sensitive knowledge by decreasing the significance of the rules induced by such algorithms. Towards this end, they apply a group of heuristic solutions for reducing the number of occurrences (also referred to as support) of some frequent groups of items, called large itemsets, which are selected by the database security administrator, below a minimum user-specified threshold. Since an association rule mining algorithm discovers association rules from large itemsets only, decreasing the support of the selected sets of items has as a consequence that the selected rules escape the mining. This approach focuses on the first step of the rule mining process, which is the discovery of large itemsets. In this paper, we investigate both the first and the second step of the association rule discovery process (i.e., the derivation of strong rules from large itemsets).

Problem Formulation

Let I = {i1, ..., in} be a set of literals, called items. Let D be a set of transactions; this is the database that is going to be disclosed. Each transaction t ∈ D is an itemset such that t ⊆ I. A unique identifier, which we call TID, is associated with each transaction. We say that a transaction t supports X, a set of items in I, if X ⊆ t. We assume that the items in a transaction or an itemset are sorted in lexicographic order. A sample database of transactions is shown in Table 1(a). Each row in the table represents a transaction. There are 3 items and 6 transactions. AB is an itemset, and transaction T1 supports that itemset. An itemset X has support s if s% of the transactions support X. The support of X is denoted as Supp(X). All possible itemsets and their supports obtained from the database in Table 1(a) are listed in Table 1(b). For example, itemset BC is supported by 3 transactions out of 6, and therefore has 50% support.

(a)
TID   Items
T1    ABC
T2    ABC
T3    ABC
T4    AB
T5    A
T6    AC

(b)
Itemset   Support
A         100%
B         66%
C         66%
AB        66%
AC        66%
BC        50%
ABC       50%

Table 1: (a) Sample database D, (b) Large itemsets obtained from D.
Rule       Confidence   Support
B ⇒ A      100%         66%
B ⇒ C      75%          50%
C ⇒ A      100%         66%
C ⇒ B      75%          50%
B ⇒ AC     75%          50%
C ⇒ AB     75%          50%
AB ⇒ C     75%          50%
AC ⇒ B     75%          50%
BC ⇒ A     100%         50%

Table 2: The rules derived from the large itemsets of Table 1(b)


An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. We say that the rule X ⇒ Y holds in the database D with confidence c if

    (|X ∪ Y| × 100) / |X| ≥ c

where |A| is the number of occurrences of the set of items A in the set of transactions D, and A occurs in a transaction t if and only if A ⊆ t. We also say that the rule X ⇒ Y has support s if

    (|X ∪ Y| × 100) / N ≥ s

where N is the number of transactions in D. Note that while the support is a measure of the frequency of a rule, the confidence is a measure of the strength of the relation between sets of items. All possible rules obtained from the large itemsets in Table 1(b) are shown in Table 2. For example, rule BC ⇒ A has support 50%, since it appears in 3 out of 6 transactions in D, and it has confidence 100%, since all the transactions that contain BC also contain A. Association rule mining algorithms scan the database of transactions and calculate the support and confidence of the candidate rules to determine whether they are significant. A rule is significant if its support and confidence are higher than the user-specified minimum

support and minimum confidence thresholds. In this way, algorithms do not retrieve all the association rules that may be derivable from a database, but only the very small subset that satisfies the minimum support and minimum confidence requirements set by the users. An association rule mining algorithm works as follows. It finds all the sets of items that appear frequently enough to be considered relevant, and then it derives from them the association rules that are strong enough to be considered interesting. We aim at preventing some of these rules, which we refer to as sensitive rules, from being disclosed. The problem can be stated as follows: given a database D, a set R of relevant rules that are mined from D and a subset RH of R, how can we transform D into a database D' in such a way that the rules in R can still be mined, except for the rules in RH? In [4] the authors demonstrate that solving this problem by reducing the support of the large itemsets via removing items from transactions (also referred to as the sanitization problem) is NP-hard. Thus, we are looking for a transformation of D (the source database) into D' (the released database) that maximizes the number of rules in R − RH that can still be mined. There are two main approaches that can be adopted when we try to hide a set RH of rules (i.e., prevent them from being discovered by association rule mining algorithms): (a) we can prevent the rules in RH from being generated, by hiding the frequent itemsets from which they are derived, or (b) we can reduce the confidence of the sensitive rules, by bringing it below a user-specified threshold (min_conf). In this paper we propose five strategies to hide rules using both approaches; work related to the first approach can also be found in [4].
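As a concrete illustration of the support and confidence definitions used above, the following short Python sketch (ours, not part of the paper; the helper names supp and conf are illustrative) computes both measures for the sample database of Table 1 and reproduces two entries of Table 2.

    # Sample database D from Table 1(a): each transaction is a set of items.
    D = [set("ABC"), set("ABC"), set("ABC"), set("AB"), set("A"), set("AC")]

    def supp(itemset, D):
        """Support of an itemset: percentage of transactions containing it."""
        itemset = set(itemset)
        return 100.0 * sum(1 for t in D if itemset <= t) / len(D)

    def conf(lhs, rhs, D):
        """Confidence of the rule lhs => rhs: supp(lhs U rhs) / supp(lhs)."""
        return 100.0 * supp(set(lhs) | set(rhs), D) / supp(lhs, D)

    print(supp("BC", D))                        # 50.0, as in Table 1(b)
    print(supp("ABC", D), conf("BC", "A", D))   # rule BC => A: support 50%, confidence 100%
    print(supp("ABC", D), conf("AC", "B", D))   # rule AC => B: support 50%, confidence 75%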

Proposed Strategies and Algorithms

This section is organized as follows. In Section 4.1 we introduce the required notation, and in Section 4.2 we introduce three strategies that solve the problem of hiding association rules by tuning their confidence and their support. A few assumptions that we make are presented in Section 4.3, while the building blocks of the algorithms that implement those strategies are presented in Section 4.4.

TID   Items   Size
T1    111     3
T2    111     3
T3    111     3
T4    110     2
T5    100     1
T6    101     2

Table 3: The sample database represented using the proposed notation.

4.1 Notation and Preliminary Definitions

Before presenting the strategies and the algorithms, we introduce some notation. We use a bitmap notation with a few extensions to represent a database of transactions. Bitmap notation is commonly used in the association rule mining context. In this representation, each transaction t in the database D is a triple: t = <TID, values_of_items, size>, where TID is the identifier of the transaction t and values_of_items is a list of values, with one value for each item in the list of items I. An item is represented by one of the initial capital letters of the English alphabet. The items in an itemset are sorted in lexicographic order. An item is supported by a transaction t if its value in values_of_items is 1, and it is not supported by t if its value in values_of_items is 0. Size is the number of 1 values which appear in values_of_items (i.e., the number of items supported by transaction t). For example, Table 3 shows a database of transactions where the set of items is I = {A, B, C}. If t is a transaction that contains the items {A, C}, it is represented as t = <T6, [101], 2>, as shown in the last row of Table 3. Given a set P, we adopt the conventional notation |P| to indicate the number of elements of the set P (also referred to as the size or cardinality of P). According to this notation, the number of transactions stored in a database D is indicated as |D|, while |I| represents the number of different items appearing in D. The set of rules that can be mined from the database will be indicated by R, and the subset of these rules that we are interested in hiding will be referred to as RH. For each rule r in RH we use the compact notation lhs(r)

or lr to indicate the itemset which appears on the left hand side of a rule r (also referred to as the rule antecedent), and rhs(r) or rr to indicate the itemset which appears on the right hand side of the rule (also referred to as the rule consequent). Before going into the details of the proposed strategies, it is necessary to give some definitions. Given a transaction t and an itemset S, we say that t fully supports S if the values of the items of S in t.values_of_items are all 1; t is said to partially support S if the values of the items of S in t.values_of_items are not all 1s. For example, if S = {A, B, C} = [1110], p = <T1, [1010], 2> and q = <T2, [1110], 3>, then we say that q fully supports S while p partially supports S. A rule r corresponds to an itemset; this itemset is the union of the items in the left hand side and the right hand side of the rule. We denote the itemset that corresponds to rule r as Ir, and we refer to it as the generating itemset of r. Notice that two different rules may have the same generating itemset. We use the notation Tr to denote the set of transactions that fully support the generating itemset of a rule r. We also denote by Tlr the set of transactions that fully support the left hand side, or antecedent, of the rule r, while by Trr we denote the set of transactions that fully support the right hand side of the rule r. We slightly change the notation to represent a set of transactions that partially supports an itemset: we add a prime symbol to all occurrences of T, so the set of transactions that partially support the antecedent of the rule becomes T'lr. In this study, along with the confidence and the support, we assume that each rule is assigned a sensitivity level. The sensitivity level is determined based on the impact that a certain rule has in the environment that the rule is a part of. For example, in a retail environment, a rule that can be used to boost the sales of a set of items could be a sensitive rule. The impact of a rule in the retail environment is the degree to which the rule increases the sales and consequently the profit. Since only frequent and strong rules can be extracted by the data mining algorithms, we assume that the sensitivity levels of only the frequent and strong rules are of interest to us. If a strong and frequent rule is above a certain sensitivity level, the hiding process should be applied in such a way that either the frequency or the


strength of the rule will be reduced, to bring the support and the confidence of the rule below min_supp and min_conf, correspondingly. Let L also be the set of large itemsets, with support greater than min_supp, and let LH ⊆ L be the set of large itemsets that we want to hide from D. Note that LH is the set of generating itemsets of the rules in RH. We denote by TZ the set of transactions that support an itemset Z. Finally, we denote by AD = |D| × ATL the number of items in the database, where ATL is the average transaction length.
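To make the bitmap notation concrete, the following small Python sketch (not part of the paper; the names ITEMS, make_transaction, fully_supports and partially_supports are ours) builds the transactions of Table 3 as <TID, values_of_items, size> triples and implements the full/partial support tests defined above.

    ITEMS = ["A", "B", "C"]          # the item list I, in lexicographic order

    def make_transaction(tid, items):
        values = [1 if i in items else 0 for i in ITEMS]
        return {"TID": tid, "values_of_items": values, "size": sum(values)}

    # Table 3: the sample database in the proposed notation
    D = [make_transaction("T%d" % (k + 1), items)
         for k, items in enumerate(["ABC", "ABC", "ABC", "AB", "A", "AC"])]

    def fully_supports(t, S):
        """True if every item of the itemset S has value 1 in t."""
        return all(t["values_of_items"][ITEMS.index(i)] == 1 for i in S)

    def partially_supports(t, S):
        """True if the items of S are not all 1 in t (the definition above)."""
        return not fully_supports(t, S)

    print(D[5])                           # {'TID': 'T6', 'values_of_items': [1, 0, 1], 'size': 2}
    print(fully_supports(D[0], "AC"))     # True: T1 = [111]
    print(partially_supports(D[4], "AC")) # True: T5 = [100] supports only A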

4.2 The Hiding Strategies

The hiding strategies that we propose heavily depend on finding transactions that fully or partially support the generating itemsets of a rule. The reason for this is that if we want to hide a rule, we need to change the support of some part of the rule (i.e., decrease the support of the generating itemset). Another issue is that the changes introduced to the database by the hiding process should be limited, in such a way that the information loss incurred by the process is minimal. According to this, we try to apply minimal changes to the database at every step of the hiding algorithms that we propose. The decrease in the support of an itemset S can be done by selecting a transaction t that supports S and setting to 0 at least one of the non-zero values of t.values_of_items that represent items in S. The increase in the support of an itemset S can be accomplished by selecting a transaction t that partially supports it and setting to 1 the values of all the items of S in t.values_of_items. In order to identify viable ways for reducing either the support or the confidence of a rule, we need to analyze the formulas that we have already presented for the confidence and the support. Both the confidence and the support are expressed as ratios of the supports of itemsets that relate to the two parts of a rule or its generating itemset. In this way, if we want to lower the value of a ratio, we can adopt either one of the following options: (a) we can decrease the numerator while keeping the denominator fixed, or (b) we can increase the denominator while keeping the numerator fixed.


First we consider which one of the above options can be adopted for decreasing the support of a rule X ⇒ Y. The support of this rule is (|X ∪ Y| × 100) / N, where N is the number of transactions in D. Since N is constant, this means that we can only change the numerator. For this reason we can only adopt option (a) from above, which is to decrease the numerator (support) of the generating itemset of the rule. As an example, consider the database in Table 3 and the rules in Table 2 that are obtained from this database. Given that min_supp = 33% and min_conf = 70%, we are interested in hiding the rule AC ⇒ B, with support = 50% and confidence = 75%. In order to decrease the support of the rule AC ⇒ B, we select the transaction t = <T1, [111], 3> and turn to 0 one of the elements in the list of items that corresponds to A, B, or C. We decide to set to 0 the element corresponding to C, obtaining t = <T1, [110], 2>. The rule AC ⇒ B has been hidden (support = 33%, confidence = 66%). Next, we need to investigate which of the above options can be adopted for decreasing the confidence of a rule X ⇒ Y. The confidence of this rule is written as (|X ∪ Y| × 100) / |X|. In order

to see whether any of the options that were mentioned previously can be applied to this case, we need to investigate whether the pair of actions stated in each option are conflicting, based on the relationship between the numerator and the denominator in the confidence ratio. Option (a) implies that we need to decrease the numerator (which is the support of the generating itemset of the rule), while the support of the itemset in the left hand side of the rule remains fixed. In order to do that, we can decrease the support of the generating itemset of the entire rule by modifying the transactions that support this itemset, making sure that we hide items from the consequent, or right hand side, of the rule. This will decrease the support of the generating itemset of the rule, while it will leave unchanged the support of the left hand side, which is the denominator. Again, let us consider the rule AC ⇒ B in Table 2. In order to decrease the confidence, we select the transaction t = <T1, [111], 3> and turn to 0 the element of the list of items that corresponds to B. The transaction becomes t = <T1, [101], 2> and we obtain AC ⇒ B with support = 33% and confidence = 50%, which means that the rule is hidden. Option (b) implies that we need to increase the denominator (which is the support of the itemset in the antecedent) of the rule, while the support of the generating itemset of the rule remains fixed. This option is also applicable, since we can increase the support of the rule antecedent while keeping the support of the generating itemset fixed, by modifying the transactions that partially support the itemset in the antecedent of the rule but do not fully support the itemset in the consequent. Let us consider the rule AC ⇒ B in Table 2 one more time. In order to decrease the confidence, we select the transaction t = <T5, [100], 1> and turn to 1 the element of the list of items that corresponds to C. We obtain t = <T5, [101], 2>. Now, the rule AC ⇒ B has support = 50% and confidence = 60%, which means that the rule has been hidden, since its confidence is below the min_conf threshold. The strategies that we developed, based on the observations above, can be summarized as follows. Given a rule X ⇒ Y on a database D, the support of the rule in D expresses the probability of finding transactions containing all the items in X ∪ Y. The confidence of X ⇒ Y is, instead, the probability of finding transactions containing all the items in X ∪ Y (i.e., supporting X ⇒ Y) once we know that they contain X. This suggests another way to express the confidence of a rule, based on support:
    Conf(X ⇒ Y) = (Supp(X ∪ Y) × 100) / Supp(X)

where Supp(X ∪ Y) is the support of the rule and Supp(X) is the support of the antecedent of the rule. The strategies are:

1. We decrease the confidence of the rule:
   (a) by increasing the support of the rule antecedent X through transactions that partially support it;
   (b) by decreasing the support of the rule consequent Y in transactions that support both X and Y.

2. We decrease the support of the rule by decreasing the support of either the rule antecedent X or the rule consequent Y, through transactions that fully support the rule.
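The following Python fragment (ours; a sketch that recomputes supports by scanning the whole database, with illustrative helper names) replays the three modifications discussed above on the database of Table 3 and reproduces the support/confidence values reported for the rule AC ⇒ B.

    def supp(S, D):
        return 100.0 * sum(1 for t in D if set(S) <= t) / len(D)

    def conf(lhs, rhs, D):
        return 100.0 * supp(set(lhs) | set(rhs), D) / supp(lhs, D)

    def fresh_db():
        # Table 3 / Table 1(a): transactions T1..T6 as sets of items
        return [set("ABC"), set("ABC"), set("ABC"), set("AB"), set("A"), set("AC")]

    # Support reduction (strategy 2): remove C from T1.
    D = fresh_db(); D[0].discard("C")
    print(supp("ABC", D), conf("AC", "B", D))   # ~33%, ~66%, as in the text

    # Confidence reduction, option (a)/strategy 1(b): remove the consequent B from T1.
    D = fresh_db(); D[0].discard("B")
    print(supp("ABC", D), conf("AC", "B", D))   # ~33%, 50%

    # Confidence reduction, option (b)/strategy 1(a): add C to T5, which partially supports AC.
    D = fresh_db(); D[4].add("C")
    print(supp("ABC", D), conf("AC", "B", D))   # 50%, 60%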

4.3 Assumptions

We make the following assumptions in the development of the algorithms:
1. We hide only rules that are supported by disjoint large itemsets.
2. We hide association rules by decreasing either their support or their confidence.

3. We select to decrease either the support or the confidence based on the side effects on the information that is not sensitive.
4. We hide one rule at a time.
5. We decrease either the support or the confidence one unit at a time.

Without the first assumption, i.e., if we try to hide overlapping rules, hiding a rule may have side effects on the other rules to be hidden. This may increase the time complexity of the algorithms, since hiding a rule may cause an already hidden rule to haunt back. Therefore, the algorithms would have to reconsider previously hidden rules and hide them again if they are no longer hidden. However, we can solve this problem of rules haunting back in the case of hiding overlapping rules. Assume that we want to hide r1: AB ⇒ CD and r2: AB ⇒ EF, which are overlapping. In order to hide r1, we should consider only the items in H = r1 − r2, which are C and D, since they are the ones that will have no side effects on r2. In this way, we restrict the set of items that we are going to use for hiding to H. We can then choose an item from H using the item selection heuristics we propose (a small illustrative sketch is given at the end of this subsection). Alternatively, we can modify the transaction selection mechanism for hiding overlapping rules. Consider two transactions T1: ABCDEF and T2: ABCD. Both T1 and T2 support r1, while only T1 supports r2 (T2 supports only its antecedent). If we remove A or B from T1, then the confidence and support of r1 will decrease, as well as the confidence and support of r2, which is desirable since we want both of them to be hidden. If we remove C or D, then the support and confidence of r1 will decrease while the confidence and support of r2 stay the same. However, if we remove A or B from T2, then the support and confidence of r1 will decrease, while this time the support of r2 stays the same and its confidence increases, which is not desirable. So the transaction selection mechanism should not choose the transactions that support only the antecedent of the overlapping rules. According to the second assumption, we can choose to hide a rule by changing either its confidence or its support, but not both. The widely used data mining algorithms first find the large itemsets whose supports are greater than a threshold value. This is the first step in pruning rules. Without support pruning, the number of large itemsets to check for rule confidence becomes too large. Rules are then generated and their confidence is

calculated. Rules with confidence above the confidence threshold are the significant ones. When we are hiding a rule by decreasing its support below the threshold, we do not need to further reduce its confidence. Also, after decreasing the confidence of a rule below the threshold, we do not need to decrease its support, since the rule is no longer significant. So reducing either the support or the confidence is enough for hiding a rule. Therefore we assume that either the support or the confidence reduction algorithms are used. Note that while decreasing the support of a rule, its confidence will also decrease. Also, when decreasing the confidence by removing items of the consequent, the support of the rule will decrease as well. But this does not guarantee that both the support and the confidence will go below the thresholds while reducing only one of them. According to the third assumption, we select to decrease either the support or the confidence based on the side effects on the information that is not sensitive. This is actually the main constraint of the algorithms that aim to maximize the data quality in terms of non-sensitive information. If we relax this assumption, we can just randomly choose a transaction and an item to hide from the database, without the need for any kind of heuristic approach. The fourth assumption states that hiding one rule must be considered as an atomic operation. This is actually related to the first assumption. Since the rules to be hidden are assumed to be disjoint, the items chosen for hiding a rule will also be different for different rules. Therefore, hiding a rule will not have a side effect on the rest of the rules, so considering the rules one at a time or all together will not make any difference. If we relax the first assumption, we should reconsider the fourth assumption as well, since the items for the hiding process should be selected carefully, taking into account the overlapping rules, as explained in the discussion of the first assumption. The proposed heuristics for rule hiding work step by step, considering an item and a transaction at each step, as stated by the fifth assumption. If we relax this assumption, we can assume that we remove clusters of transactions and remove items from the whole cluster, which is an entirely different approach.
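As an illustration of the restriction discussed under the first assumption, the following sketch (ours; rule and transaction names are only illustrative) computes the safe item set H = r1 − r2 for two overlapping rules, and shows why removing antecedent items from a transaction that supports only the antecedent of r2 is undesirable.

    # Overlapping rules r1: AB => CD and r2: AB => EF.
    r1 = {"lhs": set("AB"), "rhs": set("CD")}
    r2 = {"lhs": set("AB"), "rhs": set("EF")}

    # Items of r1 that do not occur in r2: removing these cannot affect r2.
    H = (r1["lhs"] | r1["rhs"]) - (r2["lhs"] | r2["rhs"])
    print(sorted(H))                           # ['C', 'D']

    def count(S, D):
        return sum(1 for t in D if S <= t)

    def conf(rule, D):
        return count(rule["lhs"] | rule["rhs"], D) / count(rule["lhs"], D)

    # T1 supports both rules; T2 supports r1 but only the antecedent of r2.
    D = [set("ABCDEF"), set("ABCD"), set("ABEF")]
    print(conf(r2, D))                         # confidence of r2 before any change
    D[1].discard("A")                          # removing A from T2 ...
    print(conf(r2, D))                         # ... raises the confidence of r2 (undesirable)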


4.4 Algorithms

We now present in detail five algorithms for the strategies that we developed and have already described in Section 4.2, and we analyze the time complexity of those algorithms.

4.4.1 Algorithm 1.a

The first algorithm hides the sensitive rules according to the first strategy: for each selected rule, it increases the support of the rule's antecedent until the rule confidence decreases below the min_conf threshold. An overview of this algorithm is depicted in Figure 1a.

INPUT: a set RH of rules to hide, the source database D, the number |D| of transactions in D, the min_conf threshold, the min_supp threshold
OUTPUT: the database D transformed so that the rules in RH cannot be mined

(a) Sketch:

Begin
Foreach rule r in RH do {
  1. T'lr = { t in D | t partially supports lr }
  2. for each transaction of T'lr count the number of items of lr in it
  3. sort the transactions in T'lr in descending order of the number of items of lr supported
  4. repeat until Conf(r) < min_conf {
  5.   choose the transaction t in T'lr with the highest number of items of lr supported (t is the first transaction in T'lr)
  6.   modify t to support lr
  7.   increase the support of lr by 1
  8.   recompute the confidence of r
  9.   remove t from T'lr
  }
  10. remove r from RH
}
End

(b) Pseudo code:

Begin
Foreach rule r in RH do {
  1. T'lr = { t in D | t partially supports lr }
  // count how many items of lr are in each transaction of T'lr
  2. foreach transaction t in T'lr do {
  3.   t.num_items = |I| - Hamming_dist(lr, t.values_of_items)
  }
  // sort transactions of T'lr in descending order of number of items of lr contained
  4. sort(T'lr)
  5. N_iterations = ⌈|D| × (supp(r)/min_conf - supp(lr))⌉
  6. For i = 1 to N_iterations do {
     // pick the transaction of T'lr with the highest number of items
  7.   t = T'lr[1]
     // set to one all the bits of t that represent items in lr
  8.   set_all_ones(t.values_of_items, lr)
  9.   supp(lr) = supp(lr) + 1
  10.  conf(r) = supp(r)/supp(lr)
  11.  T'lr = T'lr - t
  }
  12. RH = RH - r
}
End

Figure 1: Sketch and pseudo code of algorithm 1.a

Given a rule r, algorithm 1.a starts by generating the set T'lr of transactions partially supporting the antecedent of r (referred to as lr) and counting, for each transaction, the number of items of lr that it contains. The transaction that supports the highest number

of items of lr is then selected as the one through which the increase of the support of the rule's antecedent is accomplished. This step is performed by sorting the transactions in T'lr in decreasing order of the number of items of lr contained, and then selecting the first transaction. Since, in order to increase the support of the rule's antecedent through t, algorithm 1.a has to set to 1 the elements in t corresponding to items of lr, by choosing the transaction containing the largest subset of items in lr the number of changes made to the database is minimal. Once the support of the rule's antecedent has been updated, the confidence of the rule is recomputed. A check is then applied in order to find out whether the rule is still sensitive. If no further steps are required for the current rule, another rule is selected from the set RH. If, on the other hand, the confidence of r is still greater than (or equal to) min_conf, algorithm 1.a keeps executing the operations inside the inner loop before selecting a new rule from RH. The number of executions of the inner loop that are performed by algorithm 1.a so as to hide a rule r can be determined a priori, as the following Lemma proves.
Lemma 4.1. Given a rule r, algorithm 1.a performs ⌈|D| × (supp(r)/min_conf − supp(lr))⌉ executions of the inner loop.

Proof. During each execution of the inner loop, the number of transactions supporting the antecedent of the rule (lr) is increased by one. After k iterations, the confidence of r will be Conf(r)(k) = Nr / (Nlr + k), where Nr is the number of transactions supporting r and Nlr is the number of transactions supporting lr. The inner loop is executed until Conf(r)(k) < min_conf, or more specifically Nr / (Nlr + k) < min_conf. This inequality can be rewritten as Nr/min_conf − Nlr < k. From the last inequality we can derive that k = ⌈Nr/min_conf − Nlr⌉. The formula for k can be easily derived by observing that the iterations in the inner loop terminate as soon as the first integer value greater than (or equal to) Nr/min_conf − Nlr is reached. By incorporating the size of the database into the formula for k, we can prove that k = ⌈|D| × (Nr/(|D| × min_conf) − Nlr/|D|)⌉, or that k = ⌈|D| × (supp(r)/min_conf − supp(lr))⌉. Figure 1b depicts the pseudo-code for algorithm 1.a, where the number of executions of the inner loop (i.e., k) has been indicated by N_iterations.
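A compact Python rendering of algorithm 1.a, written by us as a sketch of the loop structure of Figure 1a (the data layout and helper names are our own, and supports are recomputed naively rather than maintained incrementally as in Figure 1b):

    def supp(S, D):
        return sum(1 for t in D if S <= t) / len(D)

    def conf(lhs, rhs, D):
        return supp(lhs | rhs, D) / supp(lhs, D)

    def hide_rule_1a(D, lhs, rhs, min_conf):
        """Algorithm 1.a (sketch): raise the support of the antecedent lhs through
        transactions that partially support it, until conf(lhs => rhs) < min_conf."""
        # T'_lr: transactions partially supporting lhs, sorted by decreasing
        # number of antecedent items they already contain (fewest changes first).
        T_partial = sorted((t for t in D if not lhs <= t),
                           key=lambda t: len(lhs & t), reverse=True)
        for t in T_partial:
            if conf(lhs, rhs, D) < min_conf:
                break
            t |= lhs                     # modify t so that it fully supports lhs
        return D

    # Example: hide AC => B in the database of Table 3 with min_conf = 0.70.
    D = [set("ABC"), set("ABC"), set("ABC"), set("AB"), set("A"), set("AC")]
    hide_rule_1a(D, set("AC"), set("B"), 0.70)
    print(conf(set("AC"), set("B"), D))  # now below 0.70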


4.4.2 Algorithm 1.b

INPUT: a set RH of rules to hide, the source database D, the size of the database |D|, the min_conf threshold, the min_supp threshold
OUTPUT: the database D transformed so that the rules in RH cannot be mined

(a) Sketch:

Begin
Foreach rule r in RH do {
  1. Tr = { t in D | t fully supports r }
  2. for each transaction of Tr count the number of items in it
  3. sort the transactions in Tr in ascending order of the number of items supported
  4. repeat until (conf(r) < min_conf or supp(r) < min_supp) {
  5.   choose the transaction t in Tr with the lowest number of items (the first transaction in Tr)
  6.   choose the item j in rr with the minimum impact on the (|rr| - 1)-itemsets
  7.   delete j from t
  8.   decrease the support of r by 1
  9.   recompute the confidence of r
  10.  remove t from Tr
  }
  11. remove r from RH
}
End

(b) Pseudo code:

Begin
Foreach rule r in RH do {
  1. Tr = { t in D | t fully supports r }
  // count how many items are in each transaction of Tr
  2. foreach transaction t in Tr do {
  3.   t.num_items = count(t)
  }
  // sort Tr in ascending order of transaction size
  4. sort(Tr)
  5. N_iter_conf = ⌈|D| × (supp(r)/min_conf - supp(lr))⌉
  6. N_iter_supp = ⌈|D| × (supp(r) - min_supp)⌉
  7. N_iterations = min(N_iter_conf, N_iter_supp)
  8. For i = 1 to N_iterations do {
     // choose the transaction in Tr with the lowest size
  9.   t = Tr[1]
     // choose the item of rr with the minimum impact on the (|rr| - 1)-itemsets
  10.  j = choose_item(rr)
     // set to zero the bit of t.values_of_items that represents item j
  11.  set_to_zero(j, t.values_of_items)
  12.  supp(r) = supp(r) - 1
  13.  conf(r) = supp(r)/supp(lr)
  14.  Tr = Tr - t
  }
  15. RH = RH - r
}
End

Figure 2: Sketch and pseudo code of algorithm 1.b

This algorithm hides sensitive rules according to the second of the proposed strategies. It reduces the support of each selected rule by decreasing the frequency of the consequent through transactions that support the rule. This process goes on until either the confidence or the support of the rule drops below the corresponding minimum threshold. An overview of this algorithm is depicted in Figure 2a. Algorithm 1.b starts by generating the set Tr of transactions supporting the rule r and counting the number of items supported by each one of them. To select the transaction through which to reduce the support of rr, algorithm 1.b sorts Tr in ascending order of transaction size. Then, it chooses the first

transaction in the so-ordered Tr. By selecting the transaction supporting the lowest number of items, algorithm 1.b minimizes the impact that the applied changes will have on the database. The minimum impact criterion is also adopted to select the item to be deleted from the designated transaction: algorithm 1.b generates all the (|rr| − 1)-subsets of rr and chooses the first item of the (|rr| − 1)-itemset with the highest support, thus minimizing the impact on the (|rr| − 1)-itemsets. Once the selected item has been deleted from t and the latter has been removed from Tr, algorithm 1.b updates the support and the confidence of the rule and checks whether it is still sensitive: if no further steps are required, another rule is selected from the set RH. A new rule is selected from RH when either the support or the confidence of the current rule decreases below the minimum threshold. Therefore, the number of executions of the inner loop that are performed by algorithm 1.b in order to hide a rule r is given by the minimum of the number of iterations required to bring conf(r) < min_conf and the number of iterations required to bring supp(r) < min_supp. The number of iterations required to reduce conf(r) below the threshold is given by Lemma 4.1. Observe that we can use the same result obtained for algorithm 1.a, since it is independent from the specific algorithm. The number of iterations required to reduce supp(r) below the minimum threshold is given by the following Lemma.
Lemma 4.2. Given a rule r, algorithm 1.b takes ⌈|D| × (supp(r) − min_supp)⌉ steps to reduce supp(r) below the min_supp threshold.

Proof. During each execution of the inner loop, the number of transactions supporting the rule r is decreased by one. Therefore, after k iterations, the support will be Supp(r)(k) = (Nr − k)/|D|, where Nr is the number of transactions supporting r. The number of steps required to reduce the support below the min_supp threshold is therefore given by the relation (Nr − k)/|D| < min_supp. The previous inequality can be rewritten as Nr/|D| − min_supp < k/|D|, or |D| × (Nr/|D| − min_supp) < k. We can then derive the value for k as follows: k = ⌈|D| × (supp(r) − min_supp)⌉, since k is an integer.

Figure 2b depicts the pseudo code of algorithm 1.b; in this figure, N_iter_conf and


N_iter_supp are the numbers of inner loop executions required to reduce the confidence and the support of a rule below the min_conf and min_supp thresholds, respectively. Since a rule is hidden when either its confidence or its support is smaller than the corresponding minimum threshold, algorithm 1.b needs to repeat the inner loop only N_iterations times, where N_iterations is the smallest between N_iter_conf and N_iter_supp.
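A corresponding Python sketch of algorithm 1.b (ours, with naive support recomputation; the item selection heuristic is simplified to "lowest-support consequent item", and algorithm 2.a below differs mainly in that the item is chosen from the whole generating itemset rather than from the consequent):

    def supp(S, D):
        return sum(1 for t in D if S <= t) / len(D)

    def hide_rule_1b(D, lhs, rhs, min_supp, min_conf):
        """Algorithm 1.b (sketch): delete consequent items from the shortest
        transactions that fully support the rule, until the rule's support or
        confidence drops below the corresponding threshold."""
        rule = lhs | rhs
        Tr = sorted((t for t in D if rule <= t), key=len)   # supporting transactions, shortest first
        for t in Tr:
            if supp(rule, D) < min_supp or supp(rule, D) / supp(lhs, D) < min_conf:
                break
            # Simplified minimum-impact choice: the consequent item with the lowest support.
            victim = min(rhs, key=lambda i: supp({i}, D))
            t.discard(victim)
        return D

    # Example: hide AC => B in Table 3 with min_supp = 0.33 and min_conf = 0.70.
    D = [set("ABC"), set("ABC"), set("ABC"), set("AB"), set("A"), set("AC")]
    hide_rule_1b(D, set("AC"), set("B"), 0.33, 0.70)
    print(supp(set("ABC"), D), supp(set("ABC"), D) / supp(set("AC"), D))  # ~0.33, 0.5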

4.4.3 Algorithm 2.a

This algorithm decreases the support of the sensitive rules until either their confidence is below the min_conf threshold or their support is below the min_supp threshold. Figure 3a shows the sketch of algorithm 2.a; more details about the steps it requires are given in Figure 3b. Algorithm 2.a first generates the set Tr of transactions supporting the rule r and counts the number of items supported by each of them. Then, it sorts Tr in ascending order of transaction size and chooses the first transaction in the so-ordered Tr: this transaction will be modified to reduce the support of r. By selecting the transaction supporting the lowest number of items, algorithm 2.a minimizes the impact that the changes applied will have on the database. To reduce the support of r through t, algorithm 2.a has to turn to 0 the value of one item of r in t; this item is selected according to the minimum impact criterion: algorithm 2.a generates all the (|r| − 1)-subsets of r and chooses the first item of the (|r| − 1)-itemset with the lowest support, thus minimizing the impact on the (|r| − 1)-itemsets. Once the chosen item has been deleted from t and the latter has been removed from Tr, algorithm 2.a updates the support and the confidence of the rule and checks whether it is still sensitive. If no further steps are required, another rule is selected from the set RH. This happens when either the support or the confidence of the current rule is decreased below the minimum threshold. The number of executions of the inner loop performed by algorithm 2.a is therefore given by the smallest between the number of iterations required to bring conf(r) < min_conf and the number of iterations required to bring supp(r) < min_supp. According to the results of Lemma 4.1 and Lemma 4.2, we can state that this
number is the minimum between ⌈|D| × (supp(r)/min_conf − supp(lr))⌉ and ⌈|D| × (supp(r) − min_supp)⌉. Figure 3b gives the pseudo code for algorithm 2.a; as shown, this algorithm performs N_iterations

INPUT: a set RH of rules to hide, the source database D, the size of the database |D|, the min_conf threshold, the min_supp threshold
OUTPUT: the database D transformed so that the rules in RH cannot be mined

(a) Sketch:

Begin
Foreach rule r in RH do {
  1. Tr = { t in D | t fully supports r }
  2. for each transaction of Tr count the number of items in it
  3. sort the transactions in Tr in ascending order of the number of items supported
  4. repeat until (conf(r) < min_conf) {
  5.   choose the transaction t in Tr with the lowest number of items (the first transaction in Tr)
  6.   choose the item j in r with the minimum impact on the (|r| - 1)-itemsets
  7.   delete j from t
  8.   decrease the support of r by 1
  9.   recompute the confidence of r
  10.  remove t from Tr
  }
  11. remove r from RH
}
End

(b) Pseudo code:

Begin
Foreach rule r in RH do {
  1. Tr = { t in D | t fully supports r }
  // count how many items are supported by each transaction of Tr
  2. foreach transaction t in Tr do {
  3.   t.num_items = count(t)
  }
  // sort Tr in ascending order of transaction size
  4. sort(Tr)
  5. N_iter_conf = ⌈|D| × (supp(r)/min_conf - supp(lr))⌉
  6. N_iter_supp = ⌈|D| × (supp(r) - min_supp)⌉
  7. N_iterations = min(N_iter_conf, N_iter_supp)
  8. For i = 1 to N_iterations do {
     // choose the transaction in Tr with the lowest size
  9.   t = Tr[1]
     // choose the item of r with the minimum impact on the (|r| - 1)-itemsets
  10.  j = choose_item(r)
     // set to zero the bit of t.values_of_items that represents item j
  11.  set_to_zero(j, t.values_of_items)
  12.  supp(r) = supp(r) - 1
  13.  conf(r) = supp(r)/supp(lr)
  14.  Tr = Tr - t
  }
  15. RH = RH - r
}
End

Figure 3: Sketch of Algorithm 2.a


executions of the inner loop, where N_iterations is the minimum between ⌈|D| × (supp(r)/min_conf − supp(lr))⌉ and ⌈|D| × (supp(r) − min_supp)⌉.

4.4.4 Algorithm 2.b

This algorithm hides sensitive rules by decreasing the support of their generating itemsets until their support is below the minimum support threshold. The item with maximum support is hidden from the minimum length transaction. The generating itemsets of the rules in RH are considered for hiding; they are stored in LH and are hidden one by one by decreasing their support. The itemsets in LH are first sorted in descending order of their size and support. Then they are hidden starting from

the largest itemset. If there is more than one itemset with maximum size, then the one with the highest support is selected for hiding. The algorithm works as follows. Let Z be the next itemset to be hidden. The algorithm hides Z by decreasing its support. It first sorts the items in Z in descending order of their support, and sorts the transactions in TZ in ascending order of their size. The size of a transaction is determined by the number of items it contains. At each step, the item i ∈ Z with maximum support is selected and removed from the transaction with minimum size. The execution stops after the support of the current itemset to be hidden goes below the minimum support threshold. Given a large itemset Z to be hidden, algorithm 2.b performs (|TZ| − min_supp × |D|) executions of the inner loop to hide Z. An overview of this algorithm is depicted in Figure 4, where the generating itemsets of all the rules specified to be hidden are stored in LH. After hiding an item from a transaction, algorithm 2.b updates the supports of the remaining itemsets in LH, along with the list of transactions that support them. Algorithm 2.b chooses the item with maximum support for removal, with the intention that a high support item will have fewer side effects, since it has many more transactions that support it compared to a low support item. The idea behind choosing the smallest transaction for removal is that a smaller transaction will possibly have fewer side effects on the other itemsets than a longer transaction.
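A minimal Python sketch of algorithm 2.b (ours; it recomputes supports by scanning rather than maintaining the TZ lists incrementally as the pseudo code in Figure 4 does, and uses a while loop in place of the closed-form iteration count):

    def count(S, D):
        return sum(1 for t in D if S <= t)

    def hide_itemsets_2b(D, LH, min_supp):
        """Algorithm 2.b (sketch): for each large itemset Z (largest / most supported
        first), repeatedly remove the maximum-support item of Z from the smallest
        transaction still supporting Z, until supp(Z) < min_supp."""
        LH = sorted(LH, key=lambda Z: (len(Z), count(Z, D)), reverse=True)
        for Z in LH:
            while count(Z, D) / len(D) >= min_supp:
                TZ = sorted((t for t in D if Z <= t), key=len)   # supporting transactions, smallest first
                victim = max(Z, key=lambda i: count({i}, D))     # item of Z with maximum support
                TZ[0].discard(victim)
        return D

    # Example: hide the generating itemset ABC of rule AC => B, with min_supp = 0.33.
    D = [set("ABC"), set("ABC"), set("ABC"), set("AB"), set("A"), set("AC")]
    hide_itemsets_2b(D, [set("ABC")], 0.33)
    print(count(set("ABC"), D) / len(D))   # below 0.33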

4.4.5 Algorithm 2.c

This algorithm hides sensitive rules by decreasing the support of their generating itemsets until the support is below the minimum support threshold. If there is more than one large itemset to hide, the algorithm first sorts the large itemsets with respect to their size and support, as algorithm 2.b does. Formally, let the next itemset to be hidden be Z. Algorithm 2.c hides Z from D by removing the items in Z from the transactions in TZ in a round robin fashion. The algorithm starts with a random order of items in Z and a random order of transactions in TZ. Assume that the order of items in Z is i0, i1, ..., i(n−1), and let the order of transactions in TZ be t0, t1, ..., t(m−1). In step 0 of the algorithm, the item i0 is removed from t0. At step 1, i1 is removed from t1, and in general, at step k, item i(k mod n) is removed from transaction tk. The execution stops after the support of the current itemset to be hidden goes below the minimum support threshold.

INPUT: a set L of large itemsets, the set LH of large itemsets to hide, the database D and the min_supp threshold
OUTPUT: the database D modified by the deletion of the large itemsets in LH

(a) Sketch:

Begin
1. Sort LH in descending order of size and support of the large itemsets
Foreach Z in LH do {
  2. Sort the transactions in TZ in ascending order of transaction size
  3. N_iterations = |TZ| - min_supp × |D|
  For k = 1 to N_iterations do {
    4. Remove the maximal support item of Z from the next transaction in TZ
    5. Update the database D
  }
}
End

(b) Pseudo code:

Begin
// sort LH in descending order of size and support
1. sort(LH)
// TH is the data structure that keeps the transactions that support LH
2. TH = { TZ | Z ∈ LH and ∀ t ∈ D : t ∈ TZ ⟺ t supports Z }
Foreach Z in LH do {
  // sort TZ in ascending order of transaction size
  3. sort(TZ)
  4. N_iterations = |TZ| - min_supp × |D|
  For k = 1 to N_iterations do {
    // get the top transaction of TZ and delete it from TZ
    5. t = pop_from(TZ)
    6. a = maximal support item in Z
    // aS is the collection of itemsets that contain item a
    7. aS = { X ∈ LH | a ∈ X }
    // propagate the effects of deleting a from t to the other itemsets in LH supported by t
    foreach X in aS {
      8. if (t in TX)
      9.   delete(t, TX)
    }
    // delete item a from transaction t in database D
    10. delete(a, t, D)
  }
}
End

Figure 4: Sketch and pseudo code of Algorithm 2.b

So, given an itemset Z, algorithm 2.c performs (|TZ| − min_supp × |D|) executions of the inner loop, as algorithm 2.b does. An overview of this algorithm is depicted in Figure 5, where the generating itemsets of the rules to be hidden are stored in LH. After hiding an item from a transaction, algorithm 2.c updates the supports of the remaining itemsets in LH, along with the list of transactions that support them, as in algorithm 2.b. Algorithm 2.c is the base algorithm which is going to be used for comparison with the more sophisticated algorithms for reducing the support and confidence of the rules. The intuition behind the idea of hiding in round robin fashion is fairness. This way, no item is over-killed


and the chance of having a smaller number of side effects is higher than when choosing an item at random and always trying to hide it.
INPUT: a set L of large itemsets, the set LH of large itemsets to hide, the database D and the min_supp threshold
OUTPUT: the database D modified by the deletion of the large itemsets in LH

(a) Sketch:

Begin
1. Sort LH in descending order of size and support of the large itemsets
Foreach Z in LH do {
  2. i = 0
  3. j = 0
  4. <TZ> = sequence of transactions that support Z
  5. <Z> = sequence of items in Z
  6. N_iterations = |TZ| - min_supp × |D|
  For k = 1 to N_iterations do {
    7. a = i-th item in <Z>
    8. t = j-th transaction in <TZ>
    9. Delete a from t
    10. Update the database D
    11. i = (i + 1) modulo size(Z)
    12. j = j + 1
  }
}
End

(b) Pseudo code:

Begin
// sort the large itemsets in LH in descending order of support, descending order of size
1. sort(LH)
Foreach Z in LH do {
  2. i = 0
  3. j = 0
  // find the transactions supporting Z
  4. <TZ> = sequence of transactions that support Z
  5. <Z> = sequence of items in Z
  6. N_iterations = |TZ| - min_supp × |D|
  // hide Z by deleting its i-th item from the j-th transaction supporting Z
  For k = 1 to N_iterations do {
    // a is the 1-itemset containing the i-th item appearing in Z
    7. a = <Z>[i]
    // t is the j-th transaction supporting Z
    8. t = <TZ>[j]
    // aS is the collection of itemsets that contain item a
    9. aS = { X ∈ LH | a ∈ X }
    // propagate the effects of deleting a from t to the other itemsets in LH supported by t
    foreach X in aS {
      10. if (t in TX) then
      11.   delete(t, TX)
    }
    12. delete(a, t, D)
    13. i = (i + 1) modulo size(Z)
    14. j = j + 1
  }
}
End

Figure 5: Sketch of Algorithm 2.c
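A sketch of the round-robin deletion of algorithm 2.c in Python (ours; the item and transaction orders are taken as given rather than randomized, and supports are recomputed by scanning):

    def count(S, D):
        return sum(1 for t in D if S <= t)

    def hide_itemset_2c(D, Z, min_supp):
        """Algorithm 2.c (sketch): remove the items of Z from the transactions
        that support Z in round-robin fashion until supp(Z) < min_supp."""
        items = sorted(Z)                              # i0, i1, ..., i(n-1)
        TZ = [t for t in D if Z <= t]                  # t0, t1, ..., t(m-1)
        k = 0
        while count(Z, D) / len(D) >= min_supp and k < len(TZ):
            TZ[k].discard(items[k % len(items)])       # step k: remove i(k mod n) from tk
            k += 1
        return D

    D = [set("ABC"), set("ABC"), set("ABC"), set("AB"), set("A"), set("AC")]
    hide_itemset_2c(D, set("ABC"), 0.33)
    print(count(set("ABC"), D) / len(D))   # below 0.33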

Performance Evaluation

We performed our experiments on a SPARC workstation with a 400 MHz processor and 1 GB of main memory, under the SunOS 5.6 operating system. In order to generate the source databases, we made use of the IBM synthetic data generator. The performance of the developed algorithms has been measured according to two criteria: time requirements and side effects produced. As time requirements, we considered the time needed by each algorithm

to hide a specified set of rules; as side effects, we considered the number of lost rules and the number of new rules introduced by the hiding process. Hiding rules produces some changes in the original database, changes that affect the set of rules mined. In particular, not all the rules that can be mined from the source DB can still be retrieved in the released DB, and some rules that could not be mined in the source DB can be retrieved after the hiding process. We call the former lost rules and the latter new rules.

|D|    |I|   ATL
10k    50    5
50k    50    5
100k   50    5

Table 4: The datasets used in the evaluation trials

To assess the performance of the proposed algorithms, we used them to hide sets of 5 and 10 rules mined from the datasets of Table 4, each time measuring the time required by the hiding process. For each dataset, we generated all the rules that satisfy the minimum support and minimum confidence requirements and stored them in an appropriate data structure. After the completion of the hiding process, we mined the released database and then compared the rules generated by the two databases in the following way: we checked whether the non-sensitive rules mined from the source database could still be mined in the released database. To do so, we compared each rule mined from the original database with each rule mined from the released DB. If the rule was not found, we considered it lost. Note that this process of rule checking tends to consider the rules selected for hiding as lost, since they cannot be retrieved from the released database. However, since they have been hidden on purpose, we excluded them from the set of lost rules. Once all the lost rules had been determined, we checked whether any new rule had been introduced by the hiding process. To do so, we computed the difference between the number of rules mined from the released DB and the number of rules mined from the source database, and subtracted from this difference the number of rules lost. If the result is greater than zero, it gives the number of new rules introduced.
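The bookkeeping described above reduces to set operations over the mined rule sets; the following small Python sketch (ours, with illustrative rule encodings) captures it.

    def side_effects(rules_source, rules_released, rules_hidden):
        """Count lost and new rules, excluding the deliberately hidden ones.
        Each argument is a set of rules, e.g. strings or (lhs, rhs) tuples."""
        lost = (rules_source - rules_released) - rules_hidden
        new = rules_released - rules_source
        return len(lost), len(new)

    # Toy example with rules written as strings.
    source   = {"B=>A", "C=>A", "AC=>B", "AB=>C"}
    hidden   = {"AC=>B"}                        # selected for hiding
    released = {"B=>A", "AB=>C", "BC=>A"}       # mined after sanitization
    print(side_effects(source, released, hidden))   # (1, 1): C=>A lost, BC=>A new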


|DB|   Rules (items, as listed)             Support                              Confidence
10k    8 25 45 12 27 21 6 43 37 38 28 31    2.25%, 2.15%, 2.44%, 2.04%, 2.57%    40.34%, 37.95%, 30.56%, 27.09%, 23.87%
50k    8 25 45 12 27 21 6 43 1 38 10 27     2.18%, 2.40%, 2.30%, 4.65%, 2.24%    39.71%, 36.83%, 29.14%, 25.86%, 21.60%
100k   8 25 45 12 27 21 6 43 37 38 1 10     2.19%, 2.36%, 2.33%, 2.02%, 2.27%    39.91%, 36.35%, 29.87%, 26.47%, 21.59%

Table 5: 5 rules selected for hiding from databases of size 10k, 50k and 100k

5.1 Performance Evaluation of Algorithm 1.a

The time requirements and side effects evaluation for algorithm 1.a are shown in Figures 6a, 6b and 6c, respectively. As depicted in Figure 6a, the time needed by algorithm 1.a is linear in |D| and |RH|, in accordance with the theoretical analysis and the validation results. The increase of the support of the antecedents of a set of rules increases the number and the size of the frequent itemsets; the effect is two-fold: some existing non-sensitive rules can no longer reach the minimum requirements in the released database (lost rules), and rules whose support or confidence in the source database did not reach the minimum thresholds now do in the released database (new rules). As Figure 6b shows, the number of non-sensitive rules lost depends neither on the number of transactions in the database nor on the number of rules selected for hiding. On the contrary, the number of new rules generated depends on |RH|: as Figure 6c shows, the number of new rules introduced by algorithm 1.a increases when |RH| increases. We observed that if we hide larger sets of rules, a larger number of new frequent itemsets is introduced, and therefore an increasing number of new rules is generated.


|DB|    Rules                              Support                           Confidence
10k     33 25 46 10 48 01 4 35 3 32        2.02% 2.16% 2.77% 2.55% 2.22%     22.82% 21.21% 20.65% 18.76% 18.02%
50k     46 48 4 35 32 39 02 35 3 27        3.50% 2.63% 2.02% 2.19% 2.20%     20.58% 19.11% 18.19% 16.05% 15.89%
100k    46 48 4 35 2 32 05                 2.59% 3.46% 2.64% 2.58% 2.11%     20.74% 20.24% 18.96% 16.27% 15.41%

Table 6: 5 additional rules selected for hiding from databases of size 10k, 50k, and 100k

[Figure 6: Evaluation of algorithm 1.a for ATL = 5, |I| = 50 and |Rh| = 5, 10 - (a) time requirements, (b) number of rules lost, and (c) number of new rules, plotted against the number of transactions.]

5.2 Performance Evaluation of Algorithm 1.b

The time needed by algorithm 1.b increases proportionally to |D| and |RH|, as Figure 7a shows. This is in accordance with the theoretical analysis and the validation results. Figure 7b gives the number of rules mined from the source database that cannot be mined after the hiding process; as can be seen, the number of non-sensitive rules lost by algorithm 1.b increases when |RH| increases. The peak at the 50k dataset can be explained by looking at the characteristics of the hidden rules. As Table 5 shows, the fourth rule selected in the 50k dataset has a very high support, and there are no such rules in the other two datasets.

Moreover, the support of the antecedent of this rule was also very high. Hiding this rule required the deletion of its consequent from numerous transactions supporting it, which increased the number of non-sensitive rules whose support or confidence dropped below the threshold. For algorithm 1.b, new rules can be generated starting from those rules that have support greater than (or equal to) min_supp but confidence lower than min_conf: if the deletion of some literals decreases the support of the antecedent of such a rule so that its confidence exceeds the minimum threshold, the rule becomes minable. As depicted in Figure 7c, the number of new rules is quite low and tends to decrease when the number of transactions in the database increases. We can therefore say that the changes introduced by algorithm 1.b to hide the rules of Tables 5 and 6 reduce the number of non-sensitive rules mined from the original database proportionally to |RH|, but only slightly affect the number of new rules that appear.
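
The condition under which such a new rule appears can be checked directly from the itemset supports; the following is a small illustrative sketch (not the authors' code), with supports expressed as fractions of |D|:

# A rule X -> Y that was frequent but below min_conf in the source database
# becomes a new rule if its antecedent support drops enough for the confidence
# supp(X u Y) / supp(X) to reach min_conf while supp(X u Y) stays >= min_supp.
def becomes_new_rule(supp_xy_before, supp_x_before,
                     supp_xy_after, supp_x_after, min_supp, min_conf):
    visible_before = (supp_xy_before >= min_supp and
                      supp_xy_before / supp_x_before >= min_conf)
    visible_after = (supp_xy_after >= min_supp and
                     supp_xy_after / supp_x_after >= min_conf)
    return (not visible_before) and visible_after

# Made-up example: supp(X u Y) stays at 2.5%, supp(X) drops from 30% to 20%,
# min_supp = 2%, min_conf = 10%: confidence rises from 8.3% to 12.5%.
print(becomes_new_rule(0.025, 0.30, 0.025, 0.20, 0.02, 0.10))   # True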

[Figure 7: Evaluation of algorithm 1.b for ATL = 5, |I| = 50 and |Rh| = 5, 10 - (a) time requirements, (b) number of rules lost, and (c) number of new rules, plotted against the number of transactions.]

5.3 Performance Evaluation of Algorithm 2.a

The time needed by algorithm 2.a increases proportionally to |D| and |RH|, as Figure 8a shows. This is consistent with the theoretical analysis and the validation results. Figure 8b gives the number of rules mined from the source database that cannot be mined after the hiding process. As can be seen, the number of non-sensitive rules lost by algorithm 2.a increases when |RH| increases. Regarding the peak of algorithm 2.a at the 50k dataset, the same considerations made for algorithm 1.b apply. As with algorithm 1.b, new rules are generated by algorithm 2.a when the deletion of some literals decreases the support
of the antecedent of some rules (with minimum support but not minimum confidence) so that their confidence exceeds the minimum threshold. Figure 8c gives the number of new rules mined from the database after hiding the rules of Tables 5 and 6. As shown, the number of new rules introduced tends to decrease when the number of transactions in the database increases. Furthermore, as with algorithm 1.b, the changes introduced by algorithm 2.a reduce the number of non-sensitive rules mined from the original database proportionally to |RH|, but only slightly affect the number of new rules that appear.

[Figure 8: Evaluation of algorithm 2.a for ATL = 5, |I| = 50 and |Rh| = 5, 10 - (a) time requirements, (b) number of rules lost, and (c) number of new rules, plotted against the number of transactions.]

5.4 Performance Evaluation of Algorithm 2.b

The time requirements of algorithm 2.b are shown in Figure 9a for large scale data. Figure 9a shows the results when 5 and 10 rules are hidden for data sets of size 10k, 50k, and 100k. As can be seen from the figure, the time requirements for hiding a set of rules increase linearly with the database size for large data sets as well; in fact, the linear behavior is even more evident at larger scale. Another observation is that the time required for hiding 10 rules is higher than for hiding 5 rules, which is also an expected result. The evaluation of side effects, in terms of the lost rules and the new rules resulting from the hiding process, is shown in Figures 9b and 9c. We can see from Figure 9b that the number of rules lost after hiding 10 rules is higher than the number of rules lost after hiding 5 rules. This is an intuitive result, since hiding more rules means deleting more items and, therefore, more side effects in terms of the number of rules lost. One would expect that the size of the database would not change the characteristics of the rules in the database, and therefore that the side effects would be more or less similar for different database sizes.

[Figure 9: Evaluation of algorithm 2.b for ATL = 5, |I| = 50 and |Rh| = 5, 10 - (a) time requirements, (b) number of rules lost, and (c) number of new rules, plotted against the number of transactions.]

However, we observed an unexpected behavior in the number of rules lost, which is shown in Figure 9b. When we look closely at the rules selected to be hidden, we can see why this happens. Since we choose the rules to be hidden randomly, the selected rule set affects the performance of the algorithms in terms of side effects. The sets of rules hidden for the different database sizes are shown in Table 5 and Table 6. Table 5 shows the 5 rules selected to be hidden for the different databases, together with their support and confidence values. Table 6 shows the additional 5 rules selected to bring the number of rules to be hidden to 10. In the set of 5 rules selected for hiding in the database of 50k transactions, there is a rule, 1 38, with support 4.65%, and in the additional 5 rules there is a rule, 46 48, with support 3.50%. When we look at the rule sets for 10k and 100k transactions, we can see that all the rules have support below 4%. To hide a rule with high support, more items need to be removed from transactions, which increases the number of rules lost. The rule with high support selected for hiding in the 50k case explains the abnormal behavior in the number of rules lost, which is more evident in the case of hiding 5 rules.

5.5 Performance Evaluation of Algorithm 2.c

The time requirements of algorithm 2.c are shown in Figure 10a for large scale data. Figure 10a shows the results when 5 and 10 rules are hidden for data sets of size 10k, 50k, and 100k. As can be seen from the figure, the time requirement for hiding a set of rules increases linearly with the database size. Also, the time requirement for hiding 10 rules is
higher than for hiding 5 rules, which is also an expected result.


[Figure 10: Evaluation of algorithm 2.c for ATL = 5, |I| = 50 and |Rh| = 5, 10 - (a) time requirements and (b) number of rules lost, plotted against the number of transactions.]

The evaluation of side effects, in terms of the rules lost as a result of the hiding process, is shown in Figure 10b. The number of rules lost when hiding 10 rules is higher than in the case of hiding 5 rules. This is an intuitive result, since hiding more rules affects more rules. We did not include graphs depicting the number of new rules introduced by algorithm 2.c, since no new rules are introduced. This may at first seem counter-intuitive, since algorithm 2.c is a naive algorithm that emphasizes fairness by removing items from transactions in a round-robin fashion. Algorithm 2.c chooses the transactions randomly, as opposed to choosing the smallest transaction; therefore it will mostly choose average-size transactions, which have small side effects on the confidence of the other rules. This becomes clearer with an example. Suppose that we would like to decrease the support of a rule A -> B. The smallest possible transaction that supports this rule is {A, B}, and suppose that such a transaction exists. Removing A from that transaction causes the confidence of the rules (other than A -> B) that contain A in their antecedent to increase, which can cause the introduction of new rules satisfying the minimum confidence requirement. For average-size transactions, however, this will usually not be the case, since they tend to contain both the antecedent and the consequent of such rules, and the confidence of these rules will decrease upon the removal of A.
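
A small numeric illustration of this effect (the transactions and items are made up):

# Removing A from the minimal transaction {A, B} lowers supp(A) without touching
# supp(A u C), so the confidence of a rule such as A -> C increases.
transactions = [{"A", "B"}, {"A", "B", "C"}, {"A", "B", "C"}, {"B", "C"}]

def supp(itemset, db):
    return sum(itemset <= t for t in db) / len(db)

def conf(antecedent, consequent, db):
    return supp(antecedent | consequent, db) / supp(antecedent, db)

print(conf({"A"}, {"C"}, transactions))      # 0.67 before hiding
transactions[0].discard("A")                 # hiding step removes A from {A, B}
print(conf({"A"}, {"C"}, transactions))      # 1.00 after: confidence increased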


5.6 Comparative Evaluation of the Algorithms

The time requirements of the proposed algorithms are shown in Figure 11a and Figure 11b for hiding 5 and 10 rules, respectively. In both cases algorithm 1.b and algorithm 2.c perform very similarly to each other and have the lowest time requirements, while algorithm 1.a has the highest time requirement. The evaluation of side effects, in terms of the new rules introduced, is shown in Figure 12a and Figure 12b for hiding 5 and 10 rules, respectively. We did not include the number of rules introduced by algorithm 1.a, since it introduces new rules in the order of thousands. Algorithm 2.c, on the contrary, does not introduce any new rules, and therefore it is not included in Figure 12a and Figure 12b either. In terms of the number of new rules introduced, all the algorithms behave similarly, introducing only a few rules, except algorithm 2.c, which introduces no new rules at all, and algorithm 1.a, which introduces an unacceptable number of new rules. Performance results in terms of the number of rules lost are shown in Figure 13a and Figure 13b for 5 and 10 rules to be hidden. In terms of the number of rules lost, algorithm 1.a is the best performing algorithm in both Figure 13a and Figure 13b. Algorithm 2.b also loses fewer rules than the other algorithms in both figures. Algorithm 2.c is the worst with respect to the number of rules lost (see Figure 13b), but it is still close in performance to the other algorithms.

[Figure 11: Time requirements of algorithms 1.a, 1.b, 2.a, 2.b and 2.c for large scale data (ATL = 5, |I| = 50), for hiding (a) 5 rules and (b) 10 rules, plotted against the number of transactions.]

[Figure 12: New rules introduced by algorithms 1.b, 2.a and 2.b after hiding (a) 5 rules and (b) 10 rules (ATL = 5, |I| = 50), plotted against the number of transactions.]

[Figure 13: Rules lost by algorithms 1.a, 1.b, 2.a, 2.b and 2.c after hiding (a) 5 rules and (b) 10 rules (ATL = 5, |I| = 50), plotted against the number of transactions.]

Conclusions

In this paper we presented two fundamental approaches for protecting sensitive rules from disclosure. The first approach prevents rules from being generated, by hiding the frequent itemsets from which they are derived. The second approach reduces the importance of the rules by setting their confidence below a user-specified threshold. We developed five algorithms that hide sensitive association rules based on these two approaches. The first three algorithms are rule-oriented; that is, they decrease either the confidence or the support of a set of sensitive rules until the rules are hidden. This happens either because the large itemsets associated with the rules become small or because the rule
confidence goes below the threshold. The last two algorithms are itemset-oriented: they decrease the support of a set of large itemsets until it falls below a user-specified threshold, so that no rules can be derived from the selected itemsets. We also measured the performance of the proposed algorithms according to two criteria: (a) the time required by the hiding process and (b) the side effects that are produced. As side effects, we considered both the loss and the introduction of information in the database. We lose information whenever some rules, originally mined from the database, cannot be retrieved after the hiding process. We add information whenever some rules, that could not be retrieved before the hiding process, can be mined from the released database. We compared the proposed algorithms on the basis of the results of these experiments, and we concluded that there is no single best solution for all the metrics. The choice of the algorithm to adopt depends on which criterion one considers the most relevant: the time required, the information lost, or the information added. Some assumptions were made in the development of the proposed algorithms; we are currently considering extensions of these algorithms that drop these assumptions. Another interesting issue, which we plan to investigate, is the application of the ideas introduced in this paper to other data mining contexts, such as classification mining, clustering, etc.

References
[1] Nabil R. Adam and John C. Wortmann. Security-Control Methods for Statistical Databases: A Comparative Study. ACM Computing Surveys, 21(4):515-556, 1989.
[2] D. Agrawal and C. C. Aggarwal. On the Design and Quantification of Privacy Preserving Data Mining Algorithms. In Proceedings of the ACM PODS Conference, 2001.
[3] R. Agrawal and R. Srikant. Privacy Preserving Data Mining. In Proceedings of the ACM SIGMOD Conference, 2000.


[4] M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim, and V. Verykios. Disclosure Limitation of Sensitive Rules. In Proceedings of the Knowledge and Data Engineering Exchange Workshop, 1999.
[5] Chris Clifton. Protecting against Data Mining through Samples. In Proceedings of the 13th IFIP WG11.3 Conference on Database Security, 1999.
[6] Chris Clifton and Donald Marks. Security and Privacy Implications of Data Mining. In Proceedings of the 1996 ACM Workshop on Data Mining and Knowledge Discovery, 1996.
[7] Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI Press/The MIT Press, 1996.
[8] T. Johnsten and Vijay V. Raghavan. Impact of Decision-Region Based Classification Mining Algorithms on Database Security. In Proceedings of the 13th IFIP WG11.3 Conference on Database Security, 1999.
[9] Donald G. Marks. Inference in MLS Database Systems. IEEE Transactions on Knowledge and Data Engineering, 8(1):46-55, 1996.
[10] Daniel E. O'Leary. Knowledge Discovery as a Threat to Database Security. In Proceedings of the 1st International Conference on Knowledge Discovery and Databases, pages 507-516, 1991.
[11] Bhavani Thuraisingham and William Ford. Security Constraint Processing in a Multilevel Secure Distributed Database Management System. IEEE Transactions on Knowledge and Data Engineering, 7(2):274-293, 1995.


Appendix
A Possible Damage to the Data in Terms of Query Results

Possible queries on a database of transactions include selection and projection, possibly involving statistical operations, and perhaps a temporal extension of those. In terms of data mining, users would like to know the maximal sets of items purchased together with a count greater than a threshold value. These types of queries cannot be answered by standard querying tools, but queries such as "what is the count of transactions in which milk and bread are purchased together" can be answered by standard querying tools. Users may also ask queries that return a set of transactions in the database. Given the time-stamp of each transaction, users may want to write queries with a temporal dimension, such as "how many customer transactions for April 2002 contain both milk and bread". Among statistical operations, min, max, and average do not make sense in a database of binary values where quantities of items sold or price information are not involved; therefore we do not consider these types of queries in our discussion. Queries that return a set of transactions may be considered micro queries, while queries that return only statistical information can be considered macro level queries. Given a set of rules Rh that are hidden from the database D, we can construct a view Dv of the database, where Dv is the subset of D consisting of the transactions not modified by the hiding process. The IDs of the modified transactions could be released so that Dv can easily be constructed. Macro level queries whose results are a subset of Dv return correct results; the rest of the queries may return incorrect results. This is also true for queries that involve a temporal dimension. In order to improve the correctness of temporal queries, the hiding process may be biased towards older transactions, thus ensuring the correctness of queries over more recent transactions. Queries that contain a count operation return correct results if they are issued over items that are not used by the hiding process. We can also give a maximum error range for count queries, which the user can employ to get a rough idea of the error in the returned count values.

This maximum error could be the highest support reduction percentage among the items used by the hiding strategies. In terms of data mining, a user would like to obtain association rules from the database. In the previous section, we already discussed the side effects of the hiding process in terms of the hidden or newly appearing association rules.
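
A minimal sketch of this error bound (illustrative only; the item supports below are made up):

# The bound suggested above: the maximum error reported for count queries is the
# highest support reduction percentage among the items touched by the hiding process.
def max_count_error(support_before, support_after, items_used_by_hiding):
    reductions = [(support_before[i] - support_after[i]) / support_before[i]
                  for i in items_used_by_hiding]
    return max(reductions)

before = {"milk": 0.40, "bread": 0.30}        # supports as fractions of |D|
after = {"milk": 0.37, "bread": 0.24}
print(max_count_error(before, after, ["milk", "bread"]))   # 0.2, i.e. at most 20%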

B  Analysis and Validation of the Proposed Algorithms

We performed three trials for each of the five algorithms: the goal of the first trial was to analyze the behavior of the developed algorithms when the average transaction length increases; the second trial studied their behavior when the number of items in the database increases; and the third trial analyzed the algorithms when the number of rules chosen to be hidden increases. We performed our experiments on a SPARC workstation with a 400 MHz processor and 1 GB of main memory, under the SunOS 5.6 operating system. In order to generate the source databases, we used the IBM synthetic data generator. The generator creates output files in a text format that is compatible with the format understood by the code implementing our algorithms. The input to the synthetic data generator includes, among other parameters, the number of transactions in the database (|D|), the number of literals appearing in the database (|I|) and the average number of items per transaction (ATL).

Series    |D|            |I|    |RH|    ATL
1         10:[1k..10k]   50     5       5
2         10:[1k..10k]   50     5       10
3         10:[1k..10k]   50     5       15

Table 7: The datasets used in the first trial

The values for the support and the confidence thresholds were fixed at 2% and 10%, respectively, during the three trials. For the first trial we generated 10 datasets of size 1K, 2K, 3K, 4K, 5K, 6K, 7K, 8K, 9K and 10K, respectively, for each tested value of ATL. The number of items in each dataset was fixed at |I| = 50 in this trial.

Series    |D|            |I|    |RH|    ATL
1         10:[1k..10k]   50     5       15
2         10:[1k..10k]   100    5       15
3         10:[1k..10k]   150    5       15

Table 8: The datasets used in the second trial

Since we tested our algorithms for 3 different values of ATL (5, 10, and 15 items per transaction, on average), we generated 3 groups (also referred to as series) of 10 databases, one for each value of ATL. Details about the characteristics of the datasets used in our first trial are given in Table 7. For the representation of the datasets in each series we used the compact notation 10:[1k..10k] to indicate that each series is made up of 10 databases and that the size of each database lies in the range 1k-10k. The parameter |RH| in Table 7 indicates that in our first trial we ran the proposed algorithms in order to hide a set of 5 disjoint rules. We selected the 5 rules with the highest confidence.

Series    |D|            |I|    |RH|    ATL
1         10:[1k..10k]   100    5       15
2         10:[1k..10k]   100    10      15
3         10:[1k..10k]   100    20      15

Table 9: The datasets used in the third trial

For the second trial, we generated 3 series of 10 datasets; each dataset has an average transaction length of 15 items and an increasing number of items: 50 for the first series, 100 for the second and 150 for the third. Table 8 shows the values of the parameters |D|, |I|, ATL and |RH| for our second trial. For the third trial, we used the datasets of the second series of Table 8 to hide sets of rules of increasing size. Details about the datasets used in this experiment are given in Table 9. The results obtained from the average case analysis of the proposed algorithms are presented in the following sections.


B.1 Analysis and Validation of Algorithm 1.a

Complexity. Algorithm 1.a performs in O(|RH| · AD + C) time.

Proof. For each rule r, algorithm 1.a takes O(AD) to generate the set Tlr of transactions partially supporting the antecedent of r, O(|Tlr|) to sort Tlr (using bucket sort, sorting Tlr in decreasing order of the number of items of lr supported by each transaction takes O(|Tlr| + |I|) = O(|Tlr|)), and O(|Tlr| · ATL) to count the items of lr that are in each transaction of Tlr. To hide a rule r, algorithm 1.a executes the inner loop until the confidence of r becomes lower than the min_conf threshold. According to Lemma 4.1, this requires |D| · (supp(r)/min_conf - supp(lr)) iterations. The cost of each iteration is given by the set_all_ones operation, which takes O(ATL) time. Since |D| · (supp(r)/min_conf - supp(lr)) · ATL ≤ |D| · (1/min_conf) · ATL = O(|D| · ATL) = O(AD), and considering that the number of transactions in Tlr is always less than or equal to the number of transactions in D, we can state that algorithm 1.a requires O(|RH| · AD) time to hide all the rules in RH.

To validate the theoretical results, we ran algorithm 1.a on the datasets of Table 9; Figure 14c shows the results of this experiment. As expected, algorithm 1.a is linear in |D| and |RH|. Since the confidence and the support of rules depend on the average transaction length and on the number of items in the database, we checked the impact that changes in these two parameters have on our algorithm, by using it to hide 5 rules in datasets with different values of ATL and |I|. In particular, increasing ATL (while keeping |I| constant) increases the size of the database, while increasing |I| (while keeping ATL constant) does not change the average size of D. Since the average database size AD is a primary factor in the performance of algorithm 1.a, we expect the time required to hide the rules to grow as ATL increases, while not being significantly affected by changes in |I|. The characteristics of the datasets used in these experiments are given in Tables 7 and 8, respectively. The results of these trials are given in Figures 14a and 14b.

As shown in these figures, increasing the number of items appearing in the database does not significantly affect the time requirements of algorithm 1.a, while increasing the average transaction length does.
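
For completeness, the iteration count quoted from Lemma 4.1 can be recovered from the definition of confidence; this is only a sketch, assuming (as the count implies) that each set_all_ones call raises supp(lr) by 1/|D| and leaves supp(r) unchanged:

conf(r) after k iterations = supp(r) / (supp(lr) + k/|D|) < min_conf
<=> supp(lr) + k/|D| > supp(r) / min_conf
<=> k > |D| · (supp(r)/min_conf - supp(lr)),

which matches the number of inner loop executions used in the proof above.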

[Figure 14: The validation results of Algorithm 1.a - (a) varying ATL (5, 10, 15) with |I| = 50, |Rh| = 5; (b) varying |I| (50, 100, 150) with ATL = 15, |Rh| = 5; (c) varying |Rh| (5, 10, 20) with ATL = 15, |I| = 100; time required vs. number of transactions.]

B.2 Analysis and Validation of Algorithm 1.b

Lemma 4.3. The number of inner loop executions performed by Algorithm 1.b is O(|D|).

Proof. According to Lemmas 4.1 and 4.2, the numbers of iterations required to reduce the confidence or the support of r below the minimum thresholds are |D| · (supp(r)/min_conf - supp(lr)) and |D| · (supp(r) - min_supp), respectively. When algorithm 1.b executes the inner loop until either the confidence or the support is reduced below the minimum threshold, this means that it executes either |D| · (supp(r)/min_conf - supp(lr)) or Ks = |D| · (supp(r) - min_supp) iterations, which can be assessed as being O(|D|) in both cases.

Complexity. Algorithm 1.b performs in O(|RH| · AD) time.

Proof. For each rule r, algorithm 1.b takes O(AD) to generate the set Tr of transactions supporting
r, O(|Tr|) to sort Tr, and O(|Tr| · ATL) to count the items supported by each transaction of Tr. To hide a rule r, algorithm 1.b executes the inner loop until either the confidence or the support of r becomes lower than the corresponding threshold. From Lemma 4.3 we know that this requires O(|D|) iterations. The cost of each iteration is given by the choose_item operation, which takes O(1) time. Therefore algorithm 1.b takes O(AD + |D| + |Tr| · ATL) = O(AD) to process each rule. Since there are |RH| rules to hide, we can state that algorithm 1.b performs in O(|RH| · AD) time.

We ran algorithm 1.b on the three trials discussed at the beginning of this section; the results of these experiments are given in Figures 15a, 15b and 15c. The observed behavior of algorithm 1.b is consistent with the theoretical analysis: algorithm 1.b is linear in |D| and |RH|, as Figure 15c shows. Algorithm 1.b is also linear with respect to the average transaction length (see Figure 15a), while it is not significantly affected by changes in |I| (Figure 15b).

[Figure 15: The validation results of algorithm 1.b - (a) varying ATL (5, 10, 15) with |I| = 50, |Rh| = 5; (b) varying |I| (50, 100, 150) with ATL = 15, |Rh| = 5; (c) varying |Rh| (5, 10, 20) with ATL = 15, |I| = 100; time required vs. number of transactions.]

B.3 Analysis and Validation of Algorithm 2.a

Complexity. Algorithm 2.a performs in O(|RH| · AD) time.

Proof. For each rule r, algorithm 2.a takes O(AD) to generate the set Tr of transactions supporting r, O(|Tr|) to sort Tr, and O(|Tr| · ATL) to count the items supported by each transaction of
Tr. To hide a rule r, algorithm 2.a executes the inner loop until either the confidence or the support of r becomes lower than the corresponding threshold. This takes O(|D|) executions, each requiring O(1) time. Therefore, algorithm 2.a takes O(AD + |D| + |Tr| · ATL) = O(AD) to process each rule; since there are |RH| rules to hide, we can state that algorithm 2.a performs in O(|RH| · AD) time.

In order to validate the theoretical results, we ran algorithm 2.a on the datasets of Table 9. Figure 16c illustrates the obtained results. As expected, algorithm 2.a is linear in |D| and |RH|. As Figure 16a shows, algorithm 2.a is also linear in ATL. The required time is substantially affected by changes in this parameter. The main reason is that increasing ATL increases the size of the large itemsets. As already mentioned, in order to choose the item to delete from a transaction, we find all the (|r| - 1)-subsets of r and select the first item from the (|r| - 1)-subset with the highest support. This requires time proportional to the size of the rule. The results of the second trial are given in Figure 16b. As can be seen, the time requirements of algorithm 2.a decrease when |I| increases.
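
One possible reading of this selection step, sketched below (illustrative only, not the authors' code; the tie-breaking and the support lookup are assumptions):

from itertools import combinations

# Enumerate the (|r|-1)-subsets of the rule's itemset, take the subset with the
# highest support, and return the first item appearing in that subset.
def choose_item_to_delete(rule_items, support_of):
    size = len(rule_items) - 1
    subsets = [tuple(c) for c in combinations(sorted(rule_items), size)]
    best = max(subsets, key=lambda s: support_of(frozenset(s)))
    return best[0]

# Hypothetical supports for the 2-subsets of the rule's itemset {A, B, C}.
supports = {frozenset({"A", "B"}): 0.04, frozenset({"A", "C"}): 0.03,
            frozenset({"B", "C"}): 0.02}
print(choose_item_to_delete({"A", "B", "C"}, lambda s: supports[s]))   # 'A'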

[Figure 16: The validation results of algorithm 2.a - (a) varying ATL (5, 10, 15) with |I| = 50, |Rh| = 5; (b) varying |I| (50, 100, 150) with ATL = 15, |Rh| = 5; (c) varying |Rh| (5, 10, 20) with ATL = 15, |I| = 100; time required vs. number of transactions.]

B.4 Analysis and Validation of Algorithm 2.b

Complexity. Algorithm 2.b performs in O(|D| · |LH|) time.

Proof. To hide a large itemset Z, algorithm 2.b executes the inner for loop (|TZ| - min_supp · |D|)
times. The innermost for loop executes |LH| times at each iteration of the outer for loop. For each itemset Z ∈ LH, TZ is constructed by reading the database, which takes |D| steps, assuming that the transaction length is bounded by a constant. In addition, TZ is sorted in ascending order of transaction length. Since the transaction length is bounded, we can use a sorting algorithm of order O(|TZ|). So hiding a single itemset is of order O(|D|), and hiding a set of large itemsets LH is O(|D| · |LH|), similarly to algorithm 2.c.

To validate the time complexity, we ran experiments on different data sets with different characteristics. The results are shown in Figures 17a, 17b, and 17c. Figure 17a depicts the effect of different numbers of items on the CPU time for different database sizes. Figures 17b and 17c show the effects of the average transaction length and of the number of rules on the CPU time for different database sizes. In all the figures, the CPU time increases linearly with the database size. Small deviations exist because we choose the itemsets to be hidden randomly and because the supports of the itemsets vary.

[Figure 17: The validation results of algorithm 2.b - (a) varying |I| (50, 100, 150) with ATL = 15, |Rh| = 5; (b) varying ATL (5, 10, 15) with |I| = 50, |Rh| = 5; (c) varying |Rh| (5, 10, 20) with ATL = 15, |I| = 100; time required vs. number of transactions.]

B.5 Analysis and Validation of Algorithm 2.c

Complexity. Algorithm 2.c performs in O(|D| · |LH|) time.

Proof. Given a large itemset Z to hide, algorithm 2.c executes the inner for loop (|TZ| - min_supp · |D|) times. Before the loop, the transactions that support Z are read from the
database, which takes |D| executions, assuming that the transaction length is bounded by a constant. Inside the for loop, updating the supports of the other itemsets in LH takes |LH| executions of the innermost for loop. So the overall number of executions for hiding an itemset Z using algorithm 2.c is |D| + [|TZ| - min_supp · |D|] · |LH|. We know that |TZ| - min_supp · |D| ≤ |D|. So hiding a single itemset is O(|D|), and hiding a set of itemsets LH is O(|D| · |LH|).

To validate the time complexity of algorithm 2.c, we ran experiments on different data sets with different characteristics, as for algorithm 2.b. The results are shown in Figures 18a, 18b, and 18c. Figure 18a shows the effect of different numbers of items on the CPU time for different database sizes. Figures 18b and 18c show the effects of the average transaction length and of the number of rules on the CPU time for different database sizes. In all three figures, the CPU time increases linearly with the database size. Small deviations in the trend exist because we choose the itemsets to be hidden randomly and because the supports of the itemsets vary, but the overall trend is linear in the database size.

[Figure 18: The validation results of algorithm 2.c - (a) varying |I| (50, 100, 150) with ATL = 15, |Rh| = 5; (b) varying ATL (5, 10, 15) with |I| = 50, |Rh| = 5; (c) varying |Rh| (5, 10, 20) with ATL = 15, |I| = 100; time required vs. number of transactions.]


Implementation Issues

In this section, we present some of the choices that we made regarding the implementation of the proposed algorithms.

Data Structures. In Section 4.4 we introduced the following data structures, which are manipulated by the proposed algorithms: sets of transactions (D and T), sets of rules (R and RH), sets of large itemsets (L and LH), and the list of items I appearing in the database D. We now give a brief description of how each data structure has been implemented. For the set-of-elements data structure we adopted the following implementation: each set is represented as an array of references to the elements in the set. The elements of a set are transactions for D and T, rules for RH and large itemsets for L. The elements themselves are represented as arrays. According to this convention, each transaction t is an array with one field for the transaction ID (TID), one field storing the list of items contained in the transaction, and one field storing the number of these items. The list of items is implemented as a hash structure having the literals supported by the transaction as keys and integers as values. The same implementation has been adopted to represent the list of items contained in a large itemset; each large itemset X is implemented as an array with one field for the list of items and one field storing the support. Each rule r is implemented as an array with 4 fields, storing the confidence of the rule, a reference to the large itemset representing the rule antecedent, a reference to the large itemset representing the rule consequent, and a reference to the large itemset Ir corresponding to the rule. Finally, we used an array of strings to represent the list I.

Generation of the set T. The set TX of transactions partially supporting an itemset X is generated by analyzing each transaction in the database and selecting those transactions containing a proper subset of X.

The set TX of transactions fully supporting an itemset X is generated by selecting the transactions in D containing all the items in X. Each set of literals is implemented as a hash structure with a key for each item in the set and no entries for the items of I not in X. Therefore, the number of items of X contained in a transaction t is given by the number of literals of X having a non-zero value in t.

Selection of the transaction. To hide a rule r, in algorithm 1.a we selected the transaction containing the highest number of items appearing in the left-hand side of r, while in algorithm 1.b and algorithm 2.a we chose the transaction with the smallest size. To hide a large itemset X, in algorithm 2.b we selected the transaction with the smallest size, while in algorithm 2.c we simply chose the j-th transaction in TX.

Selection of the item to delete. In algorithm 1.b, for each rule r we generated the set of (|rr| - 1)-subsets of rr before executing the inner loop. At each call of the choose_item function, we sorted the list of (|rr| - 1)-itemsets in decreasing order of support and selected the first (|rr| - 1)-itemset in the list. The function returned the first item appearing in the selected (|rr| - 1)-itemset. Once this item had been deleted from t, we updated the support of all the (|rr| - 1)-itemsets containing that literal. We did the same for algorithm 2.a, with the only difference that in this case we selected the item with the highest support to delete in the (|r| - 1)-subset of r. For each itemset X, in algorithm 2.b we selected the literal with the highest support, while in algorithm 2.c we simply selected the i-th literal of X.
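
A minimal sketch of the transaction representation and of the item-counting step described above (illustrative only; class and field names are assumptions):

# Each transaction stores its TID, a hash of the items it contains, and the item
# count; counting how many items of an itemset X appear in t is a key lookup.
class Transaction:
    def __init__(self, tid, items):
        self.tid = tid
        self.items = {item: 1 for item in items}   # hash structure: item -> flag
        self.num_items = len(self.items)

    def items_of(self, X):
        # Number of items of itemset X present in this transaction.
        return sum(1 for item in X if item in self.items)

t = Transaction(7, ["bread", "milk", "beer"])
X = {"bread", "milk", "diapers"}
print(t.items_of(X))                   # 2: t partially supports X
print(t.items_of(X) == len(X))         # False: t does not fully support X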

