Chapter 5
Association Analysis: Basic Concepts
Market-basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of association rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
02/14/2018 Introduction to Data Mining, 2nd Edition 5
Computational Complexity
Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:

  R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^{d+1} + 1
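The double sum and its closed form can be checked numerically; a small sketch (the function name is mine):

```python
from math import comb

def num_rules(d):
    """Total association rules over d items: choose a k-item antecedent,
    then a non-empty consequent from the remaining d - k items."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

# The double sum collapses to the closed form 3^d - 2^(d+1) + 1.
for d in range(2, 11):
    assert num_rules(d) == 3**d - 2**(d + 1) + 1

print(num_rules(6))  # 602 possible rules for just 6 items
```

Even d = 6 items already yields 602 candidate rules, which is why the brute-force approach is computationally prohibitive.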
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset
(Figure: itemset lattice over items {A, B, C, D, E} — all 1-itemsets, 2-itemsets, 3-itemsets, and so on.)
Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database

(Figure: each of the N transactions is matched against the list of M candidate itemsets; with maximum transaction width w, the complexity is O(NMw).)
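A minimal sketch of this brute-force counting over the five market-basket transactions above (helper names are mine):

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """Scan the whole database once per candidate (the O(N*M*w) approach)."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= t)

items = sorted(set().union(*transactions))
# Every non-empty subset of the item universe is a candidate (2^d - 1 of them).
candidates = [c for r in range(1, len(items) + 1)
              for c in combinations(items, r)]

frequent = {}
for c in candidates:
    cnt = support_count(c, transactions)
    if cnt >= 3:  # minsup = 3 (by count)
        frequent[c] = cnt

print(frequent[("Bread",)])         # 4
print(frequent[("Diaper", "Milk")]) # 3
```

Note that all 2^d − 1 candidates are enumerated here; the Apriori principle below is what makes this tractable in practice.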
Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent
  ∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)
– Support of an itemset never exceeds the support of its
subsets
– This is known as the anti-monotone property of support
(Figure: itemset lattice from the null set down to ABCDE; once an itemset is found to be infrequent, all of its supersets are pruned.)
Illustrating Apriori Principle

TID  Items
1    Bread, Milk
2    Beer, Bread, Diaper, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Bread, Coke, Diaper, Milk

Items (1-itemsets):

Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Minimum Support = 3
Candidate generation (F_{k−1} × F_{k−1} method):

Let F3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE} be the set of frequent 3-itemsets.

Merge pairs of frequent 3-itemsets that share their first two items:
– Merge(ABC, ABD) = ABCD
– Merge(ABC, ABE) = ABCE
– Merge(ABD, ABE) = ABDE

Candidate pruning:
– Prune ABCE because ACE and BCE are infrequent
– Prune ABDE because ADE is infrequent

Alternate merging: join a 3-itemset whose last two items match the first two items of another:
– Merge(ABC, BCD) = ABCD
– Merge(ABD, BDE) = ABDE
– Merge(ACD, CDE) = ACDE
– Merge(BCD, CDE) = BCDE
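The F_{k−1} × F_{k−1} merge-and-prune step above can be sketched as follows (itemsets are modeled as sorted tuples of item labels; function names are mine):

```python
from itertools import combinations

def generate_candidates(Fk):
    """F_(k-1) x F_(k-1) method: merge two frequent k-itemsets that share
    their first k-1 items, then prune candidates with an infrequent subset."""
    Fk = sorted(tuple(sorted(s)) for s in Fk)
    freq = set(Fk)
    k = len(Fk[0])
    merged = []
    for i in range(len(Fk)):
        for j in range(i + 1, len(Fk)):
            if Fk[i][:k - 1] == Fk[j][:k - 1]:
                merged.append(Fk[i] + (Fk[j][-1],))
    # Candidate pruning: every k-subset of a candidate must itself be frequent.
    return [c for c in merged
            if all(sub in freq for sub in combinations(c, k))]

F3 = ["ABC", "ABD", "ABE", "ACD", "BCD", "BDE", "CDE"]
C4 = generate_candidates(F3)
print(["".join(c) for c in C4])  # ['ABCD'] -- ABCE and ABDE are pruned
```

This reproduces the worked example: ABCD, ABCE, and ABDE are merged, and only ABCD survives pruning.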
TID  Items
1    Bread, Milk
2    Beer, Bread, Diaper, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Bread, Coke, Diaper, Milk

Candidate itemsets:
{Beer, Diaper, Milk}
{Beer, Bread, Diaper}
{Bread, Diaper, Milk}
{Beer, Bread, Milk}
Enumerating the 3-subsets of transaction t = {1, 2, 3, 5, 6}:
– Level 1: fix the first item (1, 2, or 3)
– Level 2: fix the second item
– The leaves are the ten 3-subsets:
  123, 125, 126, 135, 136, 156, 235, 236, 256, 356
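The same enumeration falls out directly from ordered combinations:

```python
from itertools import combinations

# 3-subsets of the transaction {1, 2, 3, 5, 6}, in lexicographic order.
t = [1, 2, 3, 5, 6]
subsets = ["".join(map(str, c)) for c in combinations(t, 3)]
print(subsets)
# ['123', '125', '126', '135', '136', '156', '235', '236', '256', '356']
```

These ten subsets are exactly the leaves reached when the transaction is pushed down the hash tree below.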
(Figure: hash tree storing 15 candidate 3-itemsets — 124, 125, 136, 145, 159, 234, 345, 356, 357, 367, 368, 457, 458, 567, 689 — using a hash function that sends items 1, 4, 7 to the left branch, items 2, 5, 8 to the middle branch, and items 3, 6, 9 to the right branch. The tree hashes on the first item at the root, the second item at the next level, and the third item below that.)
Subset operation using the hash tree:

(Figure: matching transaction {1, 2, 3, 5, 6} against the hash tree. At the root, hash on each of the first possible items: 1+ 2356, 2+ 356, 3+ 56. At the next level, expand 1+ into 12+ 356, 13+ 56, 15+ 6, and so on, visiting only those buckets that a 3-subset of the transaction can hash to.)

Match transaction against 11 out of 15 candidates.
Rule Generation

(Figure: lattice of rules generated from the frequent itemset ABCD, from ABCD ⇒ { } at the top down to single-item antecedents such as BCD ⇒ A, ACD ⇒ B, ABD ⇒ C, and ABC ⇒ D. If a rule such as BCD ⇒ A has low confidence, then every rule with a smaller antecedent and a superset of its consequent can be pruned.)
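Rule generation from a single frequent itemset can be sketched as a sweep over its binary partitions, reusing the five market-basket transactions from earlier (function names are mine):

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sup(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

def rules(itemset, minconf):
    """Each rule is a binary partition of the itemset; the confidence of
    X -> Y is s(X u Y) / s(X)."""
    out = []
    items = sorted(itemset)
    for r in range(1, len(items)):
        for ante in combinations(items, r):
            cons = tuple(i for i in items if i not in ante)
            conf = sup(items) / sup(ante)
            if conf >= minconf:
                out.append((ante, cons, round(conf, 2)))
    return out

for a, c, conf in rules({"Beer", "Diaper", "Milk"}, 0.6):
    print(a, "->", c, conf)
```

With minconf = 0.6, four of the six candidate rules over {Beer, Diaper, Milk} survive; {Beer, Milk} → {Diaper} has confidence 1.0.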
Number of frequent itemsets = 3 × Σ_{k=1}^{10} C(10, k)
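Evaluating the sum (three groups of ten items, with every non-empty subset inside a group frequent) takes one line:

```python
from math import comb

# Three groups of 10 items; within each group every non-empty
# subset is frequent: 3 * (2^10 - 1) itemsets in total.
n = 3 * sum(comb(10, k) for k in range(1, 11))
print(n)  # 3069
```

So even this compact-looking example already produces 3069 frequent itemsets, which motivates the maximal and closed representations below.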
(Figure: itemset lattice over {A, B, C, D, E} with the border separating frequent from infrequent itemsets. The maximal frequent itemsets are the frequent itemsets lying immediately below the border.)
(Figure: a binary transaction matrix, transactions 1–10 × items A–J.)

Support threshold (by count): 5 → Frequent itemsets: {F}
Support threshold (by count): 4 → Frequent itemsets: ?
(Figure: itemset lattice over {A, B, C, D, E}; each node is annotated with the IDs of the transactions that support it, and itemsets not supported by any transaction are crossed out.)

# Closed = 9
# Maximal = 4
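Given the support of every frequent itemset, closed and maximal itemsets can be identified directly from the definitions; a small sketch on hypothetical supports (the counts below are illustrative, not the figure's data):

```python
def maximal_and_closed(support):
    """support: dict mapping frozenset -> support count of each frequent itemset.
    Closed: no proper superset has the same support.
    Maximal: no proper superset is frequent at all."""
    closed, maximal = [], []
    for s, cnt in support.items():
        supers = [t for t in support if s < t]  # proper supersets
        if all(support[t] < cnt for t in supers):
            closed.append(s)
        if not supers:
            maximal.append(s)
    return closed, maximal

# Hypothetical, internally consistent supports for illustration.
support = {
    frozenset("A"): 4, frozenset("B"): 5, frozenset("C"): 3,
    frozenset("AB"): 4, frozenset("AC"): 2, frozenset("BC"): 3,
    frozenset("ABC"): 2,
}
closed, maximal = maximal_and_closed(support)
# {A} is not closed (s({A,B}) = s({A}) = 4); {A,B,C} is both closed and maximal.
```

Every maximal frequent itemset is closed, but not conversely, consistent with the counts in the figure (9 closed vs. 4 maximal).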
(Figure: transaction data with itemset supports s({C,D}) = 2, s({C,E}) = 2, s({D,E}) = 2, and s({C,D,E}) = 2; since {C,D,E} has the same support as each of these subsets, none of {C,D}, {C,E}, {D,E} is closed.)
(Figure: Maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets.)
a. What is the number of frequent itemsets for each dataset? Which dataset will produce the largest number of frequent itemsets?
b. Which dataset will produce the longest frequent itemset?
c. Which dataset will produce frequent itemsets with the highest maximum support?
d. Which dataset will produce frequent itemsets containing items with widely varying support levels (i.e., itemsets containing items with mixed support, ranging from 20% to more than 70%)?
e. What is the number of maximal frequent itemsets for each dataset? Which dataset will produce the largest number of maximal frequent itemsets?
f. What is the number of closed frequent itemsets for each dataset? Which dataset will produce the largest number of closed frequent itemsets?
         Coffee   ¬Coffee
Tea        15        5       20
¬Tea       75        5       80
           90       10      100
The criterion
  confidence(X → Y) = support(Y)
is equivalent to:
– P(Y|X) = P(Y)
– P(X, Y) = P(X) P(Y)

Lift = P(Y|X) / P(Y)
  (lift is used for rules while interest is used for itemsets)

Interest = P(X, Y) / ( P(X) P(Y) )

PS = P(X, Y) − P(X) P(Y)

φ-coefficient = ( P(X, Y) − P(X) P(Y) ) / √( P(X)[1 − P(X)] · P(Y)[1 − P(Y)] )
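Applied to the Tea/Coffee table above (P(Tea, Coffee) = 0.15, P(Tea) = 0.20, P(Coffee) = 0.90), these measures all flag a negative association despite the high confidence of Tea → Coffee:

```python
from math import sqrt

# Tea/Coffee contingency table from the slide.
p_xy, p_x, p_y = 0.15, 0.20, 0.90

lift = (p_xy / p_x) / p_y          # P(Y|X) / P(Y)
interest = p_xy / (p_x * p_y)      # same value, stated for itemsets
ps = p_xy - p_x * p_y
phi = (p_xy - p_x * p_y) / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))

print(round(lift, 4))  # 0.8333: below 1, so Tea and Coffee are negatively associated
print(round(phi, 2))   # -0.25
```

Confidence(Tea → Coffee) = 0.75 looks high in isolation, but it is below P(Coffee) = 0.9, which is exactly what lift < 1 captures.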
        Y   ¬Y                  Y   ¬Y
 X     10    0    10     X     90    0    90
¬X      0   90    90    ¬X      0   10    10
       10   90   100           90   10   100

Lift = 0.1 / ((0.1)(0.1)) = 10        Lift = 0.9 / ((0.9)(0.9)) = 1.11
Statistical independence:
If P(X,Y)=P(X)P(Y) => Lift = 1
        B   ¬B               A   ¬A
 A      p    q         B     p    r
¬A      r    s        ¬B     q    s
Symmetric measures:
support, lift, collective strength, cosine, Jaccard, etc
Asymmetric measures:
confidence, conviction, Laplace, J-measure, etc
Property under Row/Column Scaling

(Example: a gender contingency table whose columns are scaled by 2× and 10×.)

Mosteller:
Underlying association should be independent of the relative number of male and female students in the samples.
(Figure: an N × 6 binary transaction matrix over items A–F; e.g. Transaction 1 = [1 0 0 1 0 0] and Transaction N = [1 0 0 1 0 0].)
        B   ¬B                B   ¬B
 A      p    q          A     p    q
¬A      r    s         ¬A     r   s + k
Invariant measures:
support, cosine, Jaccard, etc
Non-invariant measures:
correlation, Gini, mutual information, odds ratio, etc
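The null-addition property can be checked numerically; a small sketch with hypothetical counts p = 20, q = r = 5, s = 20, adding k transactions that contain neither item:

```python
from math import sqrt

def cosine(p, q, r, s):
    """cos(A,B) = P(A,B) / sqrt(P(A) P(B)) -- the s cell never appears."""
    n = p + q + r + s
    return (p / n) / sqrt(((p + q) / n) * ((p + r) / n))

def phi(p, q, r, s):
    """phi-coefficient; depends on all four cells, including s."""
    n = p + q + r + s
    pxy, px, py = p / n, (p + q) / n, (p + r) / n
    return (pxy - px * py) / sqrt(px * (1 - px) * py * (1 - py))

p, q, r, s = 20, 5, 5, 20          # hypothetical contingency counts
for k in (0, 1000):                # add k transactions with neither A nor B
    print(round(cosine(p, q, r, s + k), 4), round(phi(p, q, r, s + k), 4))
```

Cosine stays at 0.8 regardless of k, while the φ-coefficient drifts as null transactions are added, illustrating the invariant vs. non-invariant split above.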
⇒ Customers who buy HDTVs are more likely to buy exercise machines

College students:
– c({HDTV = Yes} → {Exercise Machine = Yes}) = 1/10 = 10%
– c({HDTV = No} → {Exercise Machine = Yes}) = 4/34 = 11.8%

Working adults:
– c({HDTV = Yes} → {Exercise Machine = Yes}) = 98/170 = 57.7%
– c({HDTV = No} → {Exercise Machine = Yes}) = 50/86 = 58.1%
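Pooling the two groups reverses the direction of the association — Simpson's paradox — which the counts above demonstrate directly:

```python
# (hits, total): exercise-machine buyers among HDTV buyers / non-buyers.
students = {"hdtv": (1, 10), "no_hdtv": (4, 34)}
adults   = {"hdtv": (98, 170), "no_hdtv": (50, 86)}

def conf(hits, total):
    return hits / total

# Within each group, HDTV buyers are *less* likely to buy an exercise machine...
assert conf(*students["hdtv"]) < conf(*students["no_hdtv"])
assert conf(*adults["hdtv"]) < conf(*adults["no_hdtv"])

# ...but in the pooled data the direction reverses.
pooled_hdtv = conf(1 + 98, 10 + 170)   # 99/180 = 0.55
pooled_no   = conf(4 + 50, 34 + 86)    # 54/120 = 0.45
print(pooled_hdtv > pooled_no)  # True
```

The aggregate rule "HDTV buyers are more likely to buy exercise machines" is an artifact of mixing two groups with very different base rates.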
(Figure: support distribution of a retail data set — many items with low support, such as caviar, and a few with very high support, such as milk.)

Observation:
conf(caviar → milk) is very high, but conf(milk → caviar) is very low.
Therefore, min( conf(caviar → milk), conf(milk → caviar) ) is also very low.
hconf(X) = min( s(X)/s(x1), s(X)/s(x2), …, s(X)/s(xd) )
         = s(X) / max( s(x1), s(x2), …, s(xd) )
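A tiny numeric sketch of h-confidence for a skewed pair (the supports s(milk) = 0.70, s(caviar) = 0.004, s({milk, caviar}) = 0.003 are made-up illustrative numbers):

```python
def hconf(itemset_support, item_supports):
    """h-confidence = s(X) / max over the supports of X's individual items."""
    return itemset_support / max(item_supports)

conf_c_to_m = 0.003 / 0.004   # conf(caviar -> milk): very high (0.75)
conf_m_to_c = 0.003 / 0.70    # conf(milk -> caviar): very low

# For a pair, hconf equals the minimum of the two rule confidences.
print(round(hconf(0.003, [0.70, 0.004]), 4))  # 0.0043
```

The low h-confidence flags {milk, caviar} as a cross-support pattern even though one of its rules has high confidence.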
Cross Support and H-confidence

The cross-support ratio of an itemset X = {x1, …, xd} is
  r(X) = min( s(x1), …, s(xd) ) / max( s(x1), …, s(xd) )
Since s(X) ≤ min( s(x1), …, s(xd) ), h-confidence is bounded by it:
  0 ≤ hconf(X) ≤ r(X) ≤ 1
H-confidence is anti-monotone.