Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions:

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Examples of association rules:
{Diaper} -> {Beer}
{Milk, Bread} -> {Eggs, Coke}
{Beer, Bread} -> {Milk}
Implication means co-occurrence, not causality!

Definition: Frequent Itemset

Itemset: a collection of one or more items, e.g. {Milk, Bread, Diaper}.
k-itemset: an itemset that contains k items.
Support count (σ): frequency of occurrence of an itemset, e.g. σ({Milk, Bread, Diaper}) = 2.
Support (s): fraction of transactions that contain an itemset, e.g. s({Milk, Bread, Diaper}) = 2/5.
Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold.
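A minimal sketch of these definitions in Python (the dataset is the table above; the helper names are ours, not from the slides):

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """sigma(X): number of transactions containing itemset X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    """s(X): fraction of transactions containing itemset X."""
    return support_count(itemset) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}))  # 2
print(support({"Milk", "Bread", "Diaper"}))        # 0.4 (= 2/5)
```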
Mining Association Rules

Brute-force approach:
- List all possible association rules.
- Compute the support and confidence for each rule.
- Prune rules that fail the minsup and minconf thresholds.
Computationally prohibitive!

Example of rules from the transactions above:
{Milk, Diaper} -> {Beer} (s=0.4, c=0.67)
{Milk, Beer} -> {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} -> {Milk} (s=0.4, c=0.67)
{Beer} -> {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} -> {Milk, Beer} (s=0.4, c=0.5)
{Milk} -> {Diaper, Beer} (s=0.4, c=0.5)

Observations:
- All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}.
- Rules originating from the same itemset have identical support but can have different confidence.
- Thus, we may decouple the support and confidence requirements.

Two-step approach:
1. Frequent itemset generation: generate all itemsets whose support >= minsup.
2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of the frequent itemset.
Frequent itemset generation is still computationally expensive.
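A minimal level-wise sketch of step 1 (Apriori-style; the function and variable names are ours, and the subset-based candidate pruning step is omitted for brevity):

```python
def frequent_itemsets(transactions, minsup):
    """Level-wise generation: candidate (k+1)-itemsets are built from
    frequent k-itemsets, then counted against the database."""
    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    frequent = {}
    current = {frozenset([i]) for t in transactions for i in t}
    k = 1
    while current:
        # Count candidates and keep those meeting minsup.
        level = {c: sup(c) for c in current}
        level = {c: s for c, s in level.items() if s >= minsup}
        frequent.update(level)
        # Join step: merge frequent k-itemsets that differ in one item.
        current = {a | b for a in level for b in level
                   if len(a | b) == k + 1}
        k += 1
    return frequent

# With minsup = 0.6 on the market-basket data, this yields the items
# Bread, Milk, Diaper, Beer plus the pairs {Bread,Milk}, {Bread,Diaper},
# {Milk,Diaper}, {Diaper,Beer}, each pair with support 0.6.
```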
Frequent Itemset Generation

[Figure: the itemset lattice for items A-E, from the null set through the 1-itemsets up to ABCDE.] Given d items, there are 2^d possible candidate itemsets.
Count the support of each candidate by scanning the DB, and eliminate candidates that are infrequent, leaving only those that are frequent.

Reducing Number of Comparisons

Candidate counting: scan the database of transactions to determine the support of each candidate itemset. To reduce the number of comparisons, store the candidates in a hash structure: instead of matching each transaction against every candidate, match it only against the candidates contained in the hashed buckets. [Figure: N transactions matched against a hash structure with k buckets.]

Generate Hash Tree

Suppose you have 15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}. You need:
- a hash function (in the figure, items 1,4,7 hash to the left branch, 2,5,8 to the middle, and 3,6,9 to the right), and
- a max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node).
Association Rule Discovery: Hash Tree

[Figures: the candidate hash tree for the 15 itemsets, and how a transaction is matched against it by hashing on 1,4,7 / 2,5,8 / 3,6,9 at each level.]
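A small sketch of such a candidate hash tree (class and constant names are ours; we assume the slide's bucket assignment, i.e. h(item) = (item - 1) mod 3, so 1,4,7 / 2,5,8 / 3,6,9 map to branches 0/1/2):

```python
MAX_LEAF_SIZE = 3

def h(item):
    return (item - 1) % 3

class Node:
    def __init__(self, depth=0):
        self.depth = depth
        self.children = {}   # branch index -> Node (internal node)
        self.itemsets = []   # candidates stored here (leaf node)

    def insert(self, itemset):
        if self.children:
            # Internal node: hash on the item at the current depth.
            branch = h(itemset[self.depth])
            self.children.setdefault(branch, Node(self.depth + 1)).insert(itemset)
            return
        self.itemsets.append(itemset)
        # Split an overflowing leaf, as long as we can still hash deeper.
        if len(self.itemsets) > MAX_LEAF_SIZE and self.depth < len(itemset):
            stored, self.itemsets = self.itemsets, []
            for s in stored:
                branch = h(s[self.depth])
                self.children.setdefault(branch, Node(self.depth + 1)).insert(s)

candidates = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9),
              (1,3,6), (2,3,4), (5,6,7), (3,4,5), (3,5,6), (3,5,7),
              (6,8,9), (3,6,7), (3,6,8)]
root = Node()
for c in candidates:
    root.insert(c)  # builds the tree, splitting leaves as they overflow
```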
Factors Affecting Complexity

- Number of frequent items: if it increases, both computation and I/O costs may also increase.
- Size of database: since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions.
- Average transaction width: transaction width increases with denser data sets. This may increase the max length of frequent itemsets and the number of hash tree traversals (the number of subsets in a transaction increases with its width).

Compact Representation of Frequent Itemsets
Some itemsets are redundant because they have the same support as their supersets. Consider a database over 30 items in which transactions 1-5 each contain all of A1..A10, transactions 6-10 each contain all of B1..B10, and transactions 11-15 each contain all of C1..C10. The number of frequent itemsets is

3 × Σ_{k=1}^{10} (10 choose k)

so we need a compact representation.

Maximal Frequent Itemset
[Figure: the itemset lattice with the border separating frequent from infrequent itemsets; the maximal itemsets sit on the frequent side of the border.]

An itemset is maximal frequent if none of its immediate supersets is frequent.

Closed Itemset

An itemset is closed if none of its immediate supersets has the same support as the itemset.

TID | Items
1   | {A,B}
2   | {B,C,D}
3   | {A,B,C,D}
4   | {A,B,D}
5   | {A,B,C,D}

Itemset | Support
{A}     | 4
{B}     | 5
{C}     | 3
{D}     | 4
{A,B}   | 4
{A,C}   | 2
{A,D}   | 3
{B,C}   | 3
{B,D}   | 4
{C,D}   | 3
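A brute-force sketch (helper names ours) that checks the two definitions on this small database, using a minimum support count of 3:

```python
from itertools import combinations

transactions = [{"A","B"}, {"B","C","D"}, {"A","B","C","D"},
                {"A","B","D"}, {"A","B","C","D"}]
items = {"A", "B", "C", "D"}

def sup(x):
    """Support count: number of transactions containing itemset x."""
    return sum(1 for t in transactions if x <= t)

minsup = 3
frequent = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(sorted(items), k)
            if sup(frozenset(c)) >= minsup]

def supersets(x):
    """Immediate supersets: x extended by one extra item."""
    return [x | {i} for i in items - x]

maximal = [x for x in frequent
           if all(sup(s) < minsup for s in supersets(x))]
closed = [x for x in frequent
          if all(sup(s) < sup(x) for s in supersets(x))]
# maximal: {A,B,D} and {B,C,D}; closed frequent itemsets are those whose
# support drops when any item is added, e.g. {B} with support 5.
```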
FP-Tree Construction

Transaction database:

TID | Items
1   | {A,B}
2   | {B,C,D}
3   | {A,C,D,E}
4   | {A,D,E}
5   | {A,B,C}
6   | {A,B,C,D}
7   | {B,C}
8   | {A,B,C}
9   | {A,B,D}
10  | {B,C,E}

[Figure: the FP-tree built from this database, with a header table over the items A-E; pointers from the header table link the nodes for each item and are used to assist frequent itemset generation.]

FP-growth

[Figure: the FP-tree, rooted at null, with node counts such as A:7 and B:5.]

Conditional pattern base for D: P = {(A:1,B:1,C:1), (A:1,B:1), (A:1,C:1), (A:1), (B:1,C:1)}. Recursively apply FP-growth on P. Frequent itemsets found (with sup > 1): AD, BD, CD, ACD, BCD.
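A small sketch (helper names ours) that derives D's conditional pattern base directly from the transactions, assuming the tree's lexicographic item order, and then checks the supports of the itemsets reported above:

```python
from collections import Counter

# Transaction database from the FP-tree slide.
transactions = [
    {"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"},
    {"A","B","C"}, {"A","B","C","D"}, {"B","C"}, {"A","B","C"},
    {"A","B","D"}, {"B","C","E"},
]

# Conditional pattern base for D: for each transaction containing D,
# take the items that precede D in the tree's item order (A, B, C, ...).
pattern_base = Counter()
for t in transactions:
    if "D" in t:
        prefix = tuple(sorted(i for i in t if i < "D"))
        pattern_base[prefix] += 1
print(dict(pattern_base))
# {('A','B','C'):1, ('A','B'):1, ('A','C'):1, ('A',):1, ('B','C'):1}

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

for s in [{"A","D"}, {"B","D"}, {"C","D"}, {"A","C","D"}, {"B","C","D"}]:
    print(sorted(s), support_count(s))
# AD=4, BD=3, CD=3, ACD=2, BCD=2 -- all above the sup > 1 threshold
```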
Tree Projection

Set enumeration tree: [Figure: the itemsets over A-E arranged as a set enumeration tree, from the null set down to ABCDE.]
Rule Generation

Candidate rules split a frequent itemset L into an antecedent and a consequent, e.g. for L = {A,B,C,D}: AB -> CD, AC -> BD, AD -> BC, BC -> AD, BD -> AC, CD -> AB, and so on. If |L| = k, then there are 2^k - 2 candidate association rules (ignoring L -> {} and {} -> L).

How to efficiently generate rules from frequent itemsets? In general, confidence does not have an anti-monotone property: c(ABC -> D) can be larger or smaller than c(AB -> D). But the confidence of rules generated from the same itemset does have an anti-monotone property, e.g., for L = {A,B,C,D}:

c(ABC -> D) >= c(AB -> CD) >= c(A -> BCD)

Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule.
Rule Generation for Apriori Algorithm

[Figure: the lattice of rules for {A,B,C,D}, from ABCD=>{} at the top down to A=>BCD; marking one rule as a low-confidence rule prunes the entire sub-lattice of rules beneath it.]

A candidate rule is generated by merging two rules that share the same prefix in the rule consequent: join(CD=>AB, BD=>AC) would produce the candidate rule D=>ABC. Prune rule D=>ABC if its subset AD=>BC does not have high confidence.
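A sketch of this consequent-merging scheme for a single frequent itemset (function names ours; sup is a support-count function such as the support_count helper defined earlier):

```python
def gen_rules(L, sup, minconf):
    """Generate rules from frequent itemset L (a frozenset), growing
    the consequent level by level; consequents whose rule falls below
    minconf are never merged into larger consequents."""
    rules = []
    consequents = {frozenset([i]) for i in L}
    while consequents and len(next(iter(consequents))) < len(L):
        kept = set()
        for c in consequents:
            conf = sup(L) / sup(L - c)   # confidence of (L - c) -> c
            if conf >= minconf:
                rules.append((L - c, c, conf))
                kept.add(c)
        # Merge surviving consequents that overlap in all but one item
        # (the join of CD=>AB and BD=>AC producing D=>ABC).
        consequents = {a | b for a in kept for b in kept
                       if len(a | b) == len(a) + 1}
    return rules
```

On the market-basket data with L = {Milk, Diaper, Beer} and minconf = 0.6, this returns exactly the four rules with confidence 0.67 or 1.0 from the earlier list and never expands the two 0.5-confidence consequents.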
Effect of Support Distribution

Many real data sets have a skewed support distribution. [Figure: support distribution of a retail data set.]

How to set the appropriate minsup threshold?
- If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products).
- If minsup is set too low, it is computationally expensive and the number of itemsets is very large.
Using a single minimum support threshold may not be effective.

Multiple Minimum Support
How to apply multiple minimum supports? Let MS(i) denote the minimum support for item i, e.g. MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%. Then

MS({Milk, Broccoli}) = min(MS(Milk), MS(Broccoli)) = 0.1%

Challenge: support is no longer anti-monotone. Suppose Support(Milk, Coke) = 1.5% and Support(Milk, Coke, Broccoli) = 0.5%; then {Milk, Coke} is infrequent but {Milk, Coke, Broccoli} is frequent (a sketch of this check follows below).

Example thresholds and supports:

Item | MS(I) | Sup(I)
A    | 0.10% | 0.25%
B    | 0.20% | 0.26%
C    | 0.30% | 0.29%
D    | 0.50% | 0.05%
E    | 3%    | 4.20%

[Figures: the itemset lattice over A-E, marking which itemsets survive under these multiple minimum supports.]

Multiple Minimum Support (Liu 1999)

Order the items according to their minimum support, in ascending order. E.g., with MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%, the ordering is: Broccoli, Salmon, Coke, Milk. Need to modify Apriori such that:
L1: set of frequent items
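A sketch of the multiple-minimum-support rule (values from the slide; helper names ours), including the failure of anti-monotonicity and the Liu (1999) item ordering:

```python
MS = {"Milk": 0.05, "Coke": 0.03, "Broccoli": 0.001, "Salmon": 0.005}

def ms(itemset):
    """Minimum support of an itemset = min over its items' MS(i)."""
    return min(MS[i] for i in itemset)

print(ms({"Milk", "Broccoli"}))  # 0.001 (= 0.1%)

# Supports as given on the slide:
support = {frozenset({"Milk", "Coke"}): 0.015,
           frozenset({"Milk", "Coke", "Broccoli"}): 0.005}
for s, val in support.items():
    print(sorted(s), val >= ms(s))
# {Coke, Milk}: 0.015 < 0.03 -> infrequent
# {Broccoli, Coke, Milk}: 0.005 >= 0.001 -> frequent

# Liu (1999): order items by MS in ascending order.
print(sorted(MS, key=MS.get))  # ['Broccoli', 'Salmon', 'Coke', 'Milk']
```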
[Figure fragment from a preceding slide on row/column scaling: contingency tables in which the relative number of male and female students in the samples is scaled 2x and 10x.]

Property under Inversion Operation

[Figure: binary item vectors, panels (a) and (b), for items A-D before and after inversion (0s and 1s exchanged).]

Effect of Support-based Pruning

[Figures: histograms of the correlation of the itemsets that survive support-based pruning at thresholds support < 0.01 and support < 0.03.]

Support-based pruning eliminates mostly negatively correlated itemsets.
Investigate how support-based pruning affects other measures. Steps (a sketch follows below):
1. Generate 10000 contingency tables.
2. Rank each table according to the different measures.
3. Compute the pair-wise correlation between the measures.
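A minimal sketch of this experiment (names and the random-table scheme are ours) for two of the 21 measures, the phi coefficient (correlation) and Jaccard:

```python
import random
from math import sqrt

def phi(f11, f10, f01, f00):
    """Correlation (phi coefficient) of a 2x2 contingency table."""
    num = f11 * f00 - f10 * f01
    den = sqrt((f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00))
    return num / den if den else 0.0

def jaccard(f11, f10, f01, f00):
    return f11 / (f11 + f10 + f01)

random.seed(0)
tables = [[random.randint(1, 1000) for _ in range(4)] for _ in range(10000)]

def ranks(xs):
    """Rank positions of each value (ties broken arbitrarily)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sqrt(sum((a - mx) ** 2 for a in x))
    vy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (vx * vy)

# Spearman-style: correlate the rankings the two measures induce.
px = ranks([phi(*t) for t in tables])
jx = ranks([jaccard(*t) for t in tables])
print(pearson(px, jx))
```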
Without support pruning (all pairs): 40.14% of measure pairs have correlation > 0.85. [Figure: correlation heatmap over the 21 measures (Conviction, Odds ratio, Col Strength, Correlation, Interest, PS, CF, Yule Y, Reliability, Kappa, Klosgen, Yule Q, Confidence, Laplace, IS, Support, Jaccard, Lambda, Gini, J-measure, Mutual Info); red cells indicate pairs with correlation > 0.85. Scatter plot between the Correlation and Jaccard measures.]

With 0.5% <= support <= 50%: 61.45% of measure pairs have correlation > 0.85. [Figure: the corresponding heatmap and Correlation-Jaccard scatter plot.]

With 0.5% <= support <= 30%: 76.42% of measure pairs have correlation > 0.85. [Figure: the corresponding heatmap and Correlation-Jaccard scatter plot.]
Subjective Interestingness Measure

Objective measure: rank patterns based on statistics computed from data, e.g., the 21 measures of association (support, confidence, Laplace, Gini, mutual information, Jaccard, etc.).
Subjective measure: rank patterns according to user interpretation. A pattern is subjectively interesting if it contradicts the expectation of a user (Silberschatz & Tuzhilin). A pattern is subjectively interesting if it is actionable (Silberschatz & Tuzhilin).

Interestingness via Unexpectedness

We need to model the expectation of users (domain knowledge) and combine that expectation with the evidence from the data (i.e., the extracted patterns). [Figure legend: '+' marks a pattern expected to be frequent, '-' a pattern expected to be infrequent; patterns found to be frequent or infrequent in agreement with expectation are expected patterns, while patterns whose observed frequency contradicts the expectation are unexpected patterns.]
1 , 2 , k } i
Web Data (Cooley et al 2001) Domain knowledge in the form of site structure Give
n an itemset F = {X X , X (X
: Web pages) L: number of links connecting the pages lfactor = L / (k k-1) cfact
or = 1 (if graph is connected), 0 (disconnected graph) Structure evidence = cfac
tor lfactor Usage evidence Use Dempster-Shafer theory to combine domain knowledg
e and evidence from data ) ... ( ) ... ( 2 1 2 1 k k X X X P X X X P