
The 3rd International Conference on Grid and Pervasive Computing - Workshops

Hiding Fuzzy Association Rules in Quantitative Data


Tolga BERBEROGLU and Mehmet KAYA
Data Mining and Bioinformatics Laboratory, Department of Computer Engineering, Fırat University, 23119, Elazığ, Turkey
kaya@firat.edu.tr

Abstract

Data mining and knowledge discovery in databases are research areas in which previously unknown associations are automatically discovered from large amounts of data. Advances in data collection, data distribution and related technologies have led researchers to examine current data mining algorithms from a new point of view: personal privacy. As research on data mining grows and its results are shared with many people through the Internet and the media, personal privacy problems are taken more seriously. Many techniques have recently been developed against malicious data mining, and they fall into different categories. In the first category, called input privacy, the data is manipulated so that the mining result is not affected or is only minimally affected. The second category is called output privacy, where the data is altered so that the mining result preserves certain privacy. In output privacy, the specific rules to be hidden are given in advance. Under this constraint, many data-altering techniques for hiding association, classification and clustering rules have been proposed in the literature. However, almost all of them work on binary items, whereas real-world data mostly consists of quantitative values. In this paper, we propose a novel method to hide critical fuzzy association rules in quantitative data. For this purpose, we increase the support value of the LHS of the rule to be hidden. Experimental results demonstrate the performance and the output effects of the proposed algorithm.

1. Introduction

Privacy preserving in data mining means hiding the output knowledge of data mining by various methods when this output is valuable and private. Two types of privacy have been considered in data mining. The first type, called input privacy, is that the data is manipulated so that the mining result is not affected or is only minimally affected [1]. In the second type, called output privacy, given critical rules or patterns to be hidden, the data is minimally altered so that the mining result preserves certain privacy [2]. For this purpose, Wang et al. [2] recently proposed the ISL (Increase Support of LHS) and DSR (Decrease Support of RHS) algorithms for binary transactions. In the ISL method, in order to increase the support value of the LHS, transactions are modified so that LHS items having the value 0 are converted to the value 1; the confidence value is therefore decreased. In the DSR method, in order to decrease the support value of the RHS, RHS items having the value 1 are converted to the value 0, so that the confidence value drops below the minimum confidence value. There are other studies similar to the ones stated above that aim to prevent malicious data mining [3-5]. All of them were done on binary values; however, real-world data mostly consists of quantitative values. Some work has recently been done on the use of fuzzy sets in discovering association rules for quantitative attributes [6, 7]. Unlike classical set theory, where membership is binary, the fuzzy set theory introduced by Zadeh [8] provides excellent means to model the fuzzy boundaries of linguistic terms, such as poor, young, rich and excellent, by introducing gradual membership. In this paper, we present a novel method for preventing the extraction of critical association rules from quantitative data by combining two previous lines of work and taking them one step further: the ISL method that prevents the extraction of critical association rules [2] and the fuzzy data mining algorithm for quantitative data [6].

The rest of this paper is organized as follows. The privacy preserving rule problem in quantitative data is defined in Section 2. Our approach to hiding critical fuzzy association rules is described in Section 3, and the stages of the algorithm, together with an illustrative example, are presented in Section 4. Experimental results are given in Section 5. Finally, Section 6 summarizes the paper and draws conclusions.


2. Problem Definition
As mentioned earlier, almost all of the studies proposed for hiding association rules were done on binary valued data. However, real-world data usually contains quantitative values. For example, a patient's blood test contains many attributes, and it is not the mere presence of an attribute but its quantity in the blood that matters for the diagnosis of an illness. Everyone has some amount of cholesterol; this alone does not mean that a person is sick. The criterion for the diagnosis is an excess in the cholesterol level. With previously presented data mining methods, whether they use binary or quantitative values, it is possible to extract association rules that threaten personal privacy. Especially in medical institutions there are databases containing very extensive information about patients, and possible malicious use of those databases threatens the personal privacy of patients. There are recent examples of this kind of misuse. For instance, Kaiser, one of the most important medical institutions in the United States, sent 858 e-mail messages by mistake; those messages contained the IDs of users as well as their questions and answers about their illnesses, and all of them were sent to the wrong recipients (Washington Post, 10 August 2000). In another example, Global Health Trax, an online firm selling health products, exposed names, home phone numbers, bank account numbers and credit card data through its website by mistake (MSNBC, 19 January 2000).

An association rule is defined as an implication X→Y, where both X and Y are defined as sets of attributes (interchangeably called items) [9]. Here X is called the body (LHS) of the rule and Y is called the head (RHS) of the rule. It is interpreted as follows: for a specified fraction of the existing transactions, a particular value of an attribute set X determines the value of attribute set Y as another particular value under a certain confidence. For instance, an association rule in supermarket basket data may be stated as "in 20% of the transactions, 75% of the people buying butter also buy milk in the same transaction"; 20% and 75% represent the support and the confidence, respectively. The significance of an association rule is measured by its support and confidence. Simply, support is the percentage of transactions that contain both X and Y, while confidence is the ratio of the support of X ∪ Y to the support of X. So, the problem can be stated as: find all association rules that satisfy user-specified minimum support and confidence values.
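To make the support and confidence definitions above concrete, here is a minimal Python sketch that computes both measures for a rule X→Y over binary transactions; the baskets and item names below are invented for illustration and do not come from the paper.

# Minimal sketch of the support/confidence definitions above.
# The transactions and item names are invented for illustration.

def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Confidence of the rule lhs -> rhs: support(lhs U rhs) / support(lhs)."""
    return support(transactions, lhs | rhs) / support(transactions, lhs)

baskets = [
    {"butter", "milk", "bread"},
    {"butter", "milk"},
    {"butter", "jam"},
    {"milk", "bread"},
    {"butter", "milk", "jam"},
]
print(support(baskets, {"butter", "milk"}))       # 0.6
print(confidence(baskets, {"butter"}, {"milk"}))  # about 0.75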

3. The Proposed Algorithm

In a quantitative database, if a critical rule X→Y needs to be hidden, its confidence value is decreased to a value smaller than the minimum confidence value. One way of decreasing the confidence is to increase the support value of an item X on the LHS of the rule; the other is to decrease the support value of an item Y on the RHS to a value lower than the minimum support value. In our method, in order to decrease the confidence value of a rule, we increase the support value of an item on the LHS. For this purpose, the value of the LHS item is subtracted from 1 whenever it is lower than 0.5 and lower than the value of the RHS item. The abbreviations used in the proposed algorithm are: D, the initial database with n transactions; F, the fuzzified database; X, a set of predicting items; TL, the transaction values belonging to an LHS item; TR, the transaction values belonging to an RHS item; U, a rule.

Input: (1) a source database D, (2) a min_support, (3) a min_confidence, (4) a set of predicting items X
Output: a transformed database in which the critical rules derived from X are hidden

1. Fuzzify the database: D → F.
2. In the fuzzified database F, calculate the support value of every item f ∈ F.
3. If f(support) < min_support for every f, EXIT. // there is no rule
4. Find the large 2-itemsets in F.
5. For each large 2-itemset containing an item of X:
6.   Calculate the support value of the rule U as min(UL, UR).
7.   If U(support) > min_support:
8.     Calculate U(confidence).
9.     If U(confidence) > min_confidence:
10.      For each TL of the rule:
11.        If TL < 0.5 and TL < TR:
12.          TL = 1 − TL
13.        end // if
14.      end // for each
15.    end // if
16.  Re-calculate the confidence value of rule U.
17.  If U(confidence) > min_confidence:
18.    Restore all modified TL to their initial values.
19.  end // if
20. end // for each
21. Transform the fuzzified database back to quantitative form: F → D.
22. end.
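The following Python sketch mirrors the core of the pseudocode above (steps 10-19), assuming the database has already been fuzzified so that every transaction maps each fuzzy item (e.g. 'Ao', 'Bz') to a membership value in [0, 1]. The function and variable names are ours, not the paper's, and the sketch handles a single rule rather than the full loop over large 2-itemsets.

# Sketch of the LHS-support-increasing step, assuming a fuzzified database:
# one dictionary of membership values per transaction. Names are illustrative.

def itemset_support(fuzzy_db, items):
    # Fuzzy support of an itemset: sum over transactions of the minimum membership.
    return sum(min(t[i] for i in items) for t in fuzzy_db)

def rule_confidence(fuzzy_db, lhs, rhs):
    return itemset_support(fuzzy_db, (lhs, rhs)) / itemset_support(fuzzy_db, (lhs,))

def hide_rule(fuzzy_db, lhs, rhs, min_conf):
    """Try to push confidence(lhs -> rhs) below min_conf by raising LHS memberships."""
    original = [t[lhs] for t in fuzzy_db]
    for t in fuzzy_db:
        if t[lhs] < 0.5 and t[lhs] < t[rhs]:
            t[lhs] = 1 - t[lhs]                  # increase the LHS membership
    if rule_confidence(fuzzy_db, lhs, rhs) >= min_conf:
        for t, v in zip(fuzzy_db, original):     # rule still minable: restore (step 18)
            t[lhs] = v
    return fuzzy_db

In the full algorithm this routine would be applied to every critical rule found among the large 2-itemsets, and the database would finally be transformed back to quantitative form.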

4. Stages of the Algorithm




STEP 1: Transform the quantitative value v_j of each item i_j in every transaction into a fuzzy set, represented as (f_j1/R_j1 + f_j2/R_j2 + ... + f_jl/R_jl), by using the given membership functions for item quantities, where l is the number of fuzzy regions of i_j.

STEP 2: Calculate the count of each attribute region (linguistic term) R_jk in the transaction data as:

count_jk = Σ_{i=1}^{n} f_jk^(i),

where f_jk^(i) is the membership value of region R_jk in the i-th transaction.
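A possible reading of STEPs 1-2 in Python is sketched below. The membership functions of the paper are given graphically in Fig. 1 and are not reproduced here, so the three regions ('z', 'o', 'b') and their breakpoints below are placeholders of our own choosing.

# Sketch of STEP 1 (fuzzification) and STEP 2 (region counts).
# The region shapes and breakpoints are placeholders, not the paper's Fig. 1.

REGIONS = {
    "z": lambda x: max(0.0, min(1.0, (10 - x) / 5)),  # high for small quantities
    "o": lambda x: max(0.0, 1 - abs(x - 10) / 5),     # peaks around 10
    "b": lambda x: max(0.0, min(1.0, (x - 10) / 5)),  # high for large quantities
}

def fuzzify(value):
    """STEP 1: transform a quantitative value into memberships of every region."""
    return {region: f(value) for region, f in REGIONS.items()}

def region_counts(quantitative_db):
    """STEP 2: count_jk = sum over all transactions of the membership of R_jk."""
    counts = {}
    for row in quantitative_db:                       # e.g. {"A": 10, "B": 5, ...}
        for item, value in row.items():
            for region, mu in fuzzify(value).items():
                counts[item + region] = counts.get(item + region, 0.0) + mu
    return counts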

STEP 3: Check whether the count_jk of each region R_jk over the n transactions is larger than or equal to the predefined minimum support value. If R_jk satisfies this condition, put it in the set of large 1-itemsets (L1).

STEP 4: Join the large itemsets in L1 to generate the candidate set C2. Two regions belonging to the same item cannot simultaneously exist in an itemset in C2.

STEP 5: Calculate the fuzzy value of each itemset I = (I_1, I_2) in every transaction as f_I = min_{j=1,2} f_Ij. Then, calculate the fuzzy count of I in the transactions as:

count_I = Σ_{i=1}^{n} f_I^(i).
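STEPs 3-5 can be sketched as follows; the helper names are ours, and the region keys are assumed to start with a one-letter item name (as in 'Az', 'Ao') so that regions of the same item are never joined.

# Sketch of STEPs 3-5: large 1-itemsets, candidate 2-itemsets and their fuzzy counts.
from itertools import combinations

def large_1_itemsets(counts, min_support):
    """STEP 3: keep the regions whose fuzzy count reaches the minimum support."""
    return [region for region, c in counts.items() if c >= min_support]

def candidate_2_itemsets(l1):
    """STEP 4: join L1 with itself; two regions of the same item are never combined."""
    return [(a, b) for a, b in combinations(l1, 2) if a[0] != b[0]]

def itemset_count(fuzzy_db, itemset):
    """STEP 5: fuzzy count of an itemset: sum over transactions of the minimum membership."""
    return sum(min(t[region] for region in itemset) for t in fuzzy_db)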

STEP 6: According to the user-specified minimum confidence value, rules are extracted. The confidence value of a rule Ao→Bo is computed as:

Confidence(Ao→Bo) = Support(Ao ∪ Bo) / Support(Ao)

STEP 7: The critical rules are determined. Then, in order to hide these rules, their confidence values are decreased. This is achieved by one of two strategies: the first is to increase the support count of Ao, i.e. the LHS of the rule, without increasing the support count of Ao ∪ Bo; the second is to decrease the support count of the itemset Ao ∪ Bo.
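The effect of the two strategies of STEP 7 on the confidence ratio can be seen with a few lines of arithmetic; the numbers below are illustrative and are not taken from the paper's example.

# STEP 7 in numbers: confidence = support(Ao U Bo) / support(Ao).
sup_lhs, sup_both = 3.0, 2.4
print(sup_both / sup_lhs)          # about 0.8 -> original confidence

# Strategy 1: increase support(Ao) without increasing support(Ao U Bo).
print(sup_both / (sup_lhs + 1.0))  # 0.6

# Strategy 2: decrease support(Ao U Bo).
print((sup_both - 1.0) / sup_lhs)  # about 0.47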

An Example:

In this section, an example is given to demonstrate the proposed algorithm. It is a simple example showing how the proposed algorithm can be used to hide critical fuzzy association rules in a set of quantitative transaction data. The set of five quantitative transactions is given in Table 1.

Table 1. The set of 5 quantitative transaction data

TID   A    B    C    D
T1    10   5    8    3
T2    3    11   6    14
T3    6    3    9    13
T4    7    5    8    12
T5    11   4    7    10

Fig. 1. The membership functions used in this example

In this example, each item has three fuzzy regions: z, o and b. Thus, three fuzzy membership values are produced for each item according to the predefined membership functions; note that, for simplicity, the same set of membership functions is used for all the items. The minimum support and confidence values are set at 2.2 and 75%, respectively. The fuzzification of the transaction data is given in Table 2. If we take Ao→Bz (Support = 2.4 and Confidence = 100%) as a critical rule, our aim is to decrease the value of Confidence(Ao→Bz), either by decreasing the value of Support(Ao ∪ Bz) or by increasing the value of Support(Ao). In order to increase the value of Support(Ao), we subtract Ao from 1 whenever Ao is lower than Bz.

Table 2. Fuzzification of transaction data

Transaction   Az    Ao    Ab    Bz    Bo    Bb    Cz    Co    Cb    Dz    Do    Db
T1            0     1     0     1     0     0     0.4   0.6   0     0.6   0     0
T2            0.6   0     0     0     0.8   0.2   0.8   0.2   0     0     0.2   0.8
T3            0.8   0.2   0     0.6   0     0     0.2   0.8   0     0     0.4   0.6
T4            0.6   0.4   0     1     0     0     0.4   0.6   0     0     0.6   0.4
T5            0     0.8   0.2   0.8   0     0     0.6   0.4   0     0     1     0
Count         2.0   2.4   0.2   3.4   0.8   0.2   2.4   2.6   0     0.6   2.2   1.8
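The example's starting figures can be re-derived directly from the Ao and Bz columns of Table 2 (a small check in Python):

# Re-deriving Support(Ao), Support(Ao U Bz) and the confidence of Ao -> Bz from Table 2.
ao = [1, 0, 0.2, 0.4, 0.8]                              # transactions T1..T5
bz = [1, 0, 0.6, 1, 0.8]

support_ao = sum(ao)                                    # 2.4
support_ao_bz = sum(min(a, b) for a, b in zip(ao, bz))  # 2.4
print(support_ao_bz / support_ao)                       # 1.0, i.e. confidence = 100%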


Confidence(Ao→Bz) = Support(Ao ∪ Bz) / Support(Ao) = 2.4 / 2.4 = 100%

The fuzzy values of the items Ao and Bz are listed in Table 3. To hide the critical fuzzy association rule Ao→Bz, transaction T3 in item A is modified from 0.2 to 1 − 0.2 = 0.8. The new database is shown in Table 4.

Table 3. Fuzzy values of the items Ao and Bz

       Ao    Bz    Support
T1     1     1     1
T2     0     0     0
T3     0.2   0.6   0.2
T4     0.4   1     0.4
T5     0.8   0.8   0.8
Count  2.4   3.4   2.4

Table 4. Database after modifying T3

       Ao    Bz    Support
T1     1     1     1
T2     0     0     0
T3     0.8   0.6   0.6
T4     0.4   1     0.4
T5     0.8   0.8   0.8
Count  3.0   3.4   2.8

In this case, the confidence value of the Ao→Bz rule is calculated as:

Confidence(Ao→Bz) = Support(Ao ∪ Bz) / Support(Ao) = 2.8 / 3.0 ≈ 93%

If this is not enough to hide the relevant rule, then in the transactions having Ao equal to 0 we also subtract the value of Ao from 1. The procedure is shown in Table 5 and Table 6.

Table 5. Database before modifying T2

       Ao    Bz    Support
T1     1     1     1
T2     0     0     0
T3     0.8   0.6   0.6
T4     0.4   1     0.4
T5     0.8   0.8   0.8
Count  3.0   3.4   2.8

Table 6. Database after modifying T2

       Ao    Bz    Support
T1     1     1     1
T2     1     0     0
T3     0.8   0.6   0.6
T4     0.4   1     0.4
T5     0.8   0.8   0.8
Count  4.0   3.4   2.8

Thus, we obtain a smaller confidence value:

Confidence(Ao→Bz) = Support(Ao ∪ Bz) / Support(Ao) = 2.8 / 4.0 = 70%

Since this value is lower than the minimum confidence value, the rule is hidden from the user. After modifying the values in the transactions, the new fuzzy database is obtained. This is shown in Table 7.

As can easily be seen in Table 7, the sum of the fuzzy values of the A regions in transactions T2 and T3 now exceeds 1. In order to handle this problem, the fuzzy value of the modified region is subtracted from 1 and the result is written as the value of the other fuzzy region. For example, since Az + Ao = 0.8 + 0.8 = 1.6 in T3, Ao is subtracted from 1 and the result 1 − 0.8 = 0.2 is written as the value of Az. The adjusted values are given in Table 8, and, finally, the transformed database obtained from them is shown in Table 9.

Table 8. Adjusted new fuzzy values

Transaction   Az    Ao    Ab    Bz    Bo    Bb    Cz    Co    Cb    Dz    Do    Db
T1            0     1     0     1     0     0     0.4   0.6   0     0.6   0     0
T2            0     1     0     0     0.8   0.2   0.8   0.2   0     0     0.2   0.8
T3            0.2   0.8   0     0.6   0     0     0.2   0.8   0     0     0.4   0.6
T4            0.6   0.4   0     1     0     0     0.4   0.6   0     0     0.6   0.4
T5            0     0.8   0.2   0.8   0     0     0.6   0.4   0     0     1     0
Count         2.0   4.0   0.2   3.4   0.8   0.2   2.4   2.6   0     0.6   2.2   1.8
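The adjustment that produces Table 8 can be expressed as a short sketch; the function name and the dictionary layout are ours.

# Sketch of the adjustment step: when the memberships of two regions of the same
# item sum to more than 1 after a modification, the unmodified region is rewritten
# as 1 minus the modified value.

def adjust(row, modified_region, other_region):
    """row maps region names (e.g. 'Ao', 'Az') to membership values."""
    if row[modified_region] + row[other_region] > 1:
        row[other_region] = 1 - row[modified_region]
    return row

t3 = {"Az": 0.8, "Ao": 0.8}    # item A in transaction T3 after the modification
print(adjust(t3, "Ao", "Az"))  # {'Az': 0.2, 'Ao': 0.8}, the values shown in Table 8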


Table 9. Transformed database

TID   A    B    C    D
T1    10   5    8    3
T2    10   11   6    14
T3    9    3    9    13
T4    7    5    8    12
T5    11   4    7    10

5. Experimental Results
In order to understand the characteristics of the proposed algorithm numerically, we performed several experiments and observed the output effects. For the experiments, a computer with an Intel Centrino 1.6 GHz processor and 512 MB of RAM running the Windows XP operating system was used. We applied the proposed algorithm to the Wisconsin Breast Cancer database from the UCI Machine Learning Repository [10]. The database consists of 10 attributes, one of which is categorical; we ignored this categorical attribute. We conducted three different experiments. The first experiment shows the relationship between the number of total and hidden rules and the number of transactions. In this experiment, the minimum support values are taken as 24, 48, 62 and 74, respectively, and the minimum confidence value is set at 70%. The results are depicted in Fig. 2. The second experiment finds the number of total and hidden rules for different values of minimum support at 200 transactions. As can easily be seen from Fig. 3, the number of rules decreases as the minimum support value increases. The final experiment finds the number of total and hidden rules for different values of minimum confidence. The results are reported in Fig. 4, which shows that the number of hidden rules rises quickly with increasing minimum confidence, while the number of total rules decreases slowly.
Fig. 2. Number of total and hidden rules (x-axis: number of transactions; y-axis: number of rules; series: total rules, hidden rules)

Fig. 3. The number of rules under different minimum support values (x-axis: minimum support; y-axis: number of rules; series: total rules, hidden rules)


Fig. 4. The number of rules under different minimum confidence values (x-axis: minimum confidence; y-axis: number of rules; series: total rules, hidden rules)

6. Discussion and Conclusions


In this paper, we proposed a privacy preserving data mining method that hides fuzzy association rules. Unlike classical approaches, our method handles hiding association rules in quantitative datasets; for this purpose, it employs the fuzzy set concept. Experiments conducted on the Wisconsin Breast Cancer dataset illustrated that the proposed approach produces meaningful results and has reasonable efficiency. The results of the proposed algorithm are consistent and hence encouraging. We are currently investigating the possibility of combining the approach presented in this paper with multi-objective genetic algorithms.

Acknowledgements

This study was supported by Fırat University, Scientific Research Projects Office, under Grant No. FUBAP1476.


7. References
1. V. S. Verykios, E. Bertino, I. N. Fovino, L. P. Provenza, Y. Saygın and Y. Theodoridis, State-of-the-art in Privacy Preserving Data Mining, SIGMOD Record, Vol. 33, No. 1, pp. 50-57, March 2004.
2. S. L. Wang, B. Parikh and A. Jafari, Hiding Informative Association Rule Sets, Expert Systems with Applications, Vol. 33, No. 2, pp. 316-323, August 2007.
3. S. L. Wang and A. Jafari, Using Unknowns for Hiding Sensitive Predictive Association Rules, IEEE International Conference on Information Reuse and Integration, pp. 223-228, 15-17 August 2005.
4. Y. Saygın, V. S. Verykios and C. Clifton, Using Unknowns to Prevent Discovery of Association Rules, SIGMOD Record, Vol. 30, No. 4, pp. 45-54, December 2001.
5. S. Oliveira and O. Zaiane, Privacy Preserving Frequent Itemset Mining, IEEE International Conference on Data Mining, pp. 43-54, November 2002.
6. T. P. Hong, C. S. Kuo and S. C. Chi, Mining Association Rules from Quantitative Data, Intelligent Data Analysis, Vol. 3, No. 5, pp. 363-376, 1999.
7. M. Kaya, R. Alhajj, F. Polat and A. Arslan, Efficient Automated Mining of Fuzzy Association Rules, Proc. of DEXA, 2002.
8. L. A. Zadeh, Fuzzy Sets, Information and Control, Vol. 8, pp. 338-353, 1965.
9. R. Agrawal, T. Imielinski and A. Swami, Mining Association Rules between Sets of Items in Large Databases, Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington DC, May 1993.
10. http://mlearn.ics.uci.edu/databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data
