Lecture outline
TID  Items
1    Bread, Peanuts, Milk, Fruit, Jam
2    Jam, Soda, Chips, Milk, Bread
3    Bread, Jam, Soda, Chips, Milk, Fruit
4    Fruit, Soda, Chips, Milk
5    Steak, Jam, Soda, Chips, Bread
6    Fruit, Soda, Peanuts, Milk
7    Jam, Soda, Peanuts, Milk, Fruit
8    Fruit, Peanuts, Cheese, Yogurt
Example association rules: {bread} → {soda}, {bread} → {milk}, {chips} → {jam}
Itemset
A collection of one or more items,
e.g., {milk, bread, jam}
k-itemset, an itemset that
contains k items
Support count (σ)
Frequency of occurrence of an itemset
σ({Milk, Bread}) = 3
σ({Soda, Chips}) = 4
Support (s)
Fraction of transactions
that contain an itemset
s({Milk, Bread}) = 3/8
s({Soda, Chips}) = 4/8
Frequent Itemset
An itemset whose support is greater
than or equal to a minsup threshold
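The definitions above can be sketched in a few lines of Python, assuming the transaction table shown earlier (the eight baskets as reconstructed; function names are illustrative, not from the lecture):

```python
# Minimal sketch of support count and support over the lecture's
# 8-transaction data set (as reconstructed above).
transactions = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Steak", "Jam", "Soda", "Chips", "Bread"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread"}, transactions))  # 3
print(support({"Soda", "Chips"}, transactions))        # 0.5 (= 4/8)
```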
Association Rule
An implication expression of the form X → Y,
where X and Y are itemsets.
Example: {bread} → {milk}
Support (s)
Fraction of transactions
that contain both X and Y
Confidence (c)
Measures how often items in Y
appear in transactions that
contain X
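As a worked example of these two measures, consider the rule {Bread, Jam} → {Milk}; the sigma values below are the support counts implied by the lecture's tables:

```python
# Support and confidence of the rule {Bread, Jam} -> {Milk}.
N = 8                   # number of transactions
sigma_X = 4             # sigma({Bread, Jam})
sigma_XY = 3            # sigma({Bread, Jam, Milk})

s = sigma_XY / N        # s = 3/8 (the rule list below reports this as 0.4)
c = sigma_XY / sigma_X  # c = 3/4 = 0.75
print(s, c)
```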
[Figure: Venn diagram of customers who buy Milk vs. customers who buy Bread]
{Bread, Jam} → {Milk}    s=0.4, c=0.75
{Milk, Jam} → {Bread}    s=0.4, c=0.75
{Bread} → {Milk, Jam}    s=0.4, c=0.75
{Jam} → {Bread, Milk}    s=0.4, c=0.6
{Milk} → {Bread, Jam}    s=0.4, c=0.5
All the above rules are binary partitions of the same itemset:
{Milk, Bread, Jam}
Rules originating from the same itemset have identical
support but can have different confidence
We can decouple the support and confidence requirements!
Mining association rules is a two-step approach:
1. Frequent Itemset Generation: generate all itemsets whose support ≥ minsup
2. Rule Generation: generate high-confidence rules from each frequent itemset,
   where each rule is a binary partitioning of a frequent itemset
Frequent itemset generation is still computationally expensive.
[Figure: lattice of all itemsets over five items A–E, from the empty set (null) up to ABCDE; given d items there are 2^d possible candidate itemsets]
Brute-force approach:
- Each itemset in the lattice is a candidate frequent itemset
- Count the support of each candidate by scanning the database,
  matching each transaction against every candidate
- Complexity ~ O(NMw), where N is the number of transactions, M the
  number of candidates, and w the maximum transaction width
- Expensive, since M = 2^d for d items
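The brute-force enumeration can be sketched on a toy data set (hypothetical items A–D, not the lecture's baskets):

```python
# Brute-force sketch: enumerate every candidate itemset and scan the
# whole database once per candidate -- the O(N*M*w) approach.
from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "C"}, {"B", "C", "D"}]
items = sorted(set().union(*transactions))  # d unique items

support = {}
for k in range(1, len(items) + 1):          # M = 2^d - 1 non-empty candidates
    for cand in combinations(items, k):
        # one full scan of the database for each candidate
        support[cand] = sum(1 for t in transactions if set(cand) <= t)

print(len(support))         # 15 candidates for d = 4
print(support[("A", "C")])  # 2
```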
Computational Complexity
Apriori principle
If an itemset is frequent, then all of
its subsets must also be frequent
The Apriori principle holds due to the following anti-monotone property of the
support measure: for any itemsets X and Y, X ⊆ Y implies s(X) ≥ s(Y).
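A quick numeric check of this property on toy data (hypothetical transactions, not the lecture's):

```python
# Anti-monotonicity check: for every X ⊆ Y, s(X) >= s(Y), so no
# superset of an infrequent itemset can be frequent.
from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]

def support(X):
    return sum(1 for t in transactions if X <= t) / len(transactions)

Y = frozenset({"A", "B", "C"})
for r in range(1, len(Y) + 1):
    for X in combinations(Y, r):
        assert support(frozenset(X)) >= support(Y)
print("s(X) >= s(Y) holds for every subset X of", set(Y))
```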
[Figure: itemset lattice in which AB is found to be infrequent, so all of its supersets (ABC, ABD, ABE, ABCD, ABCE, ABDE, ABCDE) are pruned]
Prof. Pier Luca Lanzi
Generating frequent itemsets with Minimum Support = 4:

Items (1-itemsets)
  Item      Count
  Bread     4
  Peanuts   4
  Milk      6
  Fruit     6
  Jam       5
  Soda      6
  Chips     4
  Steak     1
  Cheese    1
  Yogurt    1

2-itemsets
  2-Itemset       Count
  Bread, Jam      4
  Peanuts, Fruit  4
  Milk, Fruit     5
  Milk, Jam       4
  Milk, Soda      5
  Fruit, Soda     4
  Jam, Soda       4
  Soda, Chips     4

3-itemsets
  3-Itemset          Count
  Milk, Fruit, Soda  4
Apriori Algorithm
Let k = 1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified:
- Generate length (k+1) candidate itemsets from length-k frequent itemsets
- Prune candidate itemsets containing subsets of length k that are infrequent
- Count the support of each candidate by scanning the DB
- Eliminate candidates that are infrequent, leaving only those that are frequent
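The loop above can be sketched compactly in Python (this is not the lecturer's code; the transaction data is the reconstructed table from earlier):

```python
# Compact sketch of the Apriori algorithm: level-wise candidate
# generation, subset-based pruning, and support counting.
from itertools import combinations

def apriori(transactions, minsup_count):
    transactions = [set(t) for t in transactions]
    # Frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s for s, n in counts.items() if n >= minsup_count}
    result = {s: n for s, n in counts.items() if n >= minsup_count}
    k = 1
    while frequent:
        # Join: merge length-k frequent itemsets into (k+1)-candidates
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}
        # Prune: a candidate with an infrequent k-subset cannot be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k))}
        # Count support with one scan of the database
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        frequent = {c for c, n in counts.items() if n >= minsup_count}
        result.update({c: n for c, n in counts.items() if n >= minsup_count})
        k += 1
    return result

transactions = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Steak", "Jam", "Soda", "Chips", "Bread"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]
freq = apriori(transactions, minsup_count=4)
print(freq[frozenset({"Milk", "Fruit", "Soda"})])  # 4
```

With minsup = 4 this recovers exactly the 1-, 2-, and 3-itemset tables shown above.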
Join Step: Ck is generated by joining Lk-1 with itself.
Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset
of a frequent k-itemset.
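A toy illustration of the two steps (hypothetical L2, not the lecture's data):

```python
# Join and prune steps: build C3 from L2 joined with itself.
from itertools import combinations

L2 = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]}

# Join: unions of two frequent 2-itemsets that form a 3-itemset
C3 = {a | b for a in L2 for b in L2 if len(a | b) == 3}
# Prune: drop any candidate with a 2-subset that is not in L2
C3 = {c for c in C3 if all(frozenset(s) in L2 for s in combinations(c, 2))}

print(C3)  # only {A, B, C} survives; ABD and BCD are pruned
```

ABD is pruned because AD is not in L2, and BCD because CD is not in L2.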
How to Improve Apriori's Efficiency
Rule Generation
If {A, B, C, D} is a frequent itemset, the candidate rules are:
A → BCD, B → ACD, C → ABD, D → ABC, AB → CD, AC → BD, AD → BC,
BC → AD, BD → AC, CD → AB, ABC → D, ABD → C, ACD → B, BCD → A
If |L| = k, there are 2^k − 2 candidate association rules
(ignoring L → ∅ and ∅ → L).
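Enumerating these candidates is just listing every binary partition of the itemset; a minimal sketch:

```python
# All 2^k - 2 candidate rules from one frequent k-itemset: every binary
# partition except those with an empty antecedent or consequent.
from itertools import combinations

def candidate_rules(itemset):
    items = sorted(itemset)
    rules = []
    for r in range(1, len(items)):           # antecedent sizes 1..k-1
        for lhs in combinations(items, r):
            rhs = tuple(i for i in items if i not in lhs)
            rules.append((lhs, rhs))
    return rules

rules = candidate_rules({"A", "B", "C", "D"})
print(len(rules))  # 14 = 2**4 - 2
```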
Confidence of rules generated from the same itemset is anti-monotone with
respect to the number of items on the right-hand side of the rule:
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
[Figure: lattice of the rules generated from {A, B, C, D}, from ABCD → {} down to A → BCD; CD → AB is found to be a low-confidence rule, so the rules below it (D → ABC, C → ABD) are pruned]
A candidate rule is generated by merging two rules that share the same prefix
in the rule consequent: merging CD → AB and BD → AC produces the candidate
rule D → ABC, which is pruned if its subset rule AD → BC does not have
high confidence.
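Rule generation can be sketched on the reconstructed lecture data, using the frequent itemset {Milk, Fruit, Soda} found earlier (the minconf value here is an assumption for illustration):

```python
# Sketch of rule generation: keep the binary partitions of a frequent
# itemset whose confidence reaches minconf, on the reconstructed baskets.
from itertools import combinations

transactions = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Steak", "Jam", "Soda", "Chips", "Bread"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]

def sigma(X):
    return sum(1 for t in transactions if X <= t)

L = frozenset({"Milk", "Fruit", "Soda"})
minconf = 0.8
kept = []
for r in range(1, len(L)):
    for lhs in map(frozenset, combinations(sorted(L), r)):
        conf = sigma(L) / sigma(lhs)       # c(lhs -> L - lhs)
        if conf >= minconf:
            kept.append((set(lhs), set(L - lhs), conf))

for lhs, rhs, conf in kept:
    print(lhs, "->", rhs, "c=%.2f" % conf)
# e.g. {Fruit, Soda} -> {Milk} has confidence 4/4 = 1.0
```

Three of the six candidate rules reach confidence 0.8 on this data; the three rules with a single-item antecedent fall below it.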
[Figure: support distribution of a retail data set]