
COMP9318: Data Warehousing and Data Mining

— L6: Association Rule Mining —


- Problem definition and preliminaries


What Is Association Mining?

- Association rule mining:
  - Finding frequent patterns, associations, correlations, or causal
    structures among sets of items or objects in transaction databases,
    relational databases, and other information repositories.
  - Frequent pattern: a pattern (set of items, sequence, etc.) that occurs
    frequently in a database [AIS93]
- Motivation: finding regularities in data
  - What products were often purchased together? — Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?


Why Is Frequent Pattern or Association Mining an Essential Task in Data Mining?

- Foundation for many essential data mining tasks
  - Association, correlation, causality
  - Sequential patterns; temporal or cyclic association; partial periodicity;
    spatial and multimedia association
  - Associative classification, cluster analysis, iceberg cube, fascicles
    (semantic data compression)
- Broad applications
  - Basket data analysis, cross-marketing, catalog design, sales campaign
    analysis
  - Web log (click stream) analysis, DNA sequence analysis, etc.;
    cf. Google's spelling suggestion
Basic Concepts: Frequent Patterns and Association Rules

  Transaction-id | Items bought
  10             | {A, B, C}
  20             | {A, C}
  30             | {A, D}
  40             | {B, E, F}

- Itemset X = {x1, …, xk}
  - Shorthand: x1 x2 … xk
- Find all the rules X → Y with min support and confidence
  - support, s: the probability that a transaction contains X ∪ Y
  - confidence, c: the conditional probability that a transaction
    having X also contains Y

[Venn diagram: customers buying diapers, customers buying beer, and
customers buying both]

Let min_support = 50%, min_conf = 70%:
- frequent itemset: sup(AC) = 2
- association rules: A → C (50%, 66.7%), C → A (50%, 100%)
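To make the two measures concrete, here is a minimal Python sketch computing support and confidence over the toy database above (the helper names are mine, not from the lecture):

```python
# Toy transaction database from the table above.
db = [
    {"A", "B", "C"},   # tid 10
    {"A", "C"},        # tid 20
    {"A", "D"},        # tid 30
    {"B", "E", "F"},   # tid 40
]

def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(X, Y, db):
    """Conditional probability: support(X ∪ Y) / support(X)."""
    return support(X | Y, db) / support(X, db)

print(support({"A", "C"}, db))       # 0.5    -> sup(AC) = 50% (2 of 4)
print(confidence({"A"}, {"C"}, db))  # 0.666… -> A → C has conf 66.7%
print(confidence({"C"}, {"A"}, db))  # 1.0    -> C → A has conf 100%
```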
Mining Association Rules—an Example

  Transaction-id | Items bought        Min. support: 50%
  10             | A, B, C             Min. confidence: 50%
  20             | A, C
  30             | A, D
  40             | B, E, F

  Frequent pattern | Support
  {A}              | 75%
  {B}              | 50%
  {C}              | 50%
  {A, C}           | 50%

For rule A → C:
  support = support({A} ∪ {C}) = 50%
  confidence = support({A} ∪ {C}) / support({A}) = 66.7%

The major computational challenge is calculating the support of itemsets
⇐ the frequent itemset mining problem
- Algorithms for scalable mining of (single-dimensional Boolean) association
  rules in transactional databases


Association Rule Mining Algorithms

- Naïve algorithm (candidate generation & verification):
  - Enumerate all possible itemsets and check their support against min_sup
  - Generate all association rules and check their confidence against
    min_conf
- The Apriori property
- Apriori Algorithm
- FP-growth Algorithm


All Candidate Itemsets for {A, B, C, D, E}
null

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE



Apriori Property

- A frequent (historically called "large") itemset is an itemset whose
  support is ≥ min_sup.
- Apriori property (downward closure): every subset of a frequent itemset is
  also a frequent itemset
  - Aka the anti-monotone property of support
  - Contrapositive: "every superset of an infrequent itemset is also an
    infrequent itemset"

[Lattice fragment: A, B, C, D; AB, AC, AD, BC, BD, CD; ABC, ABD, ACD, BCD]
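This contrapositive is exactly what Apriori exploits for pruning: before counting a candidate, verify that all of its (k-1)-subsets are frequent. A small Python sketch (the function name is mine):

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """True if some (k-1)-subset of `candidate` is not in L(k-1).
    By the Apriori property, such a candidate cannot be frequent."""
    k = len(candidate)
    return any(frozenset(s) not in frequent_prev
               for s in combinations(candidate, k - 1))

# Toy L2: suppose AB, AC, BC are frequent, but AD and BD are not.
L2 = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C")]}
print(has_infrequent_subset(frozenset("ABC"), L2))  # False -> keep ABC
print(has_infrequent_subset(frozenset("ABD"), L2))  # True  -> prune ABD
```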


Illustrating Apriori Principle

Q: How to design an algorithm that improves on the naïve algorithm?

[Itemset lattice for {A, B, C, D, E}: once an itemset (e.g., AB) is found to
be infrequent, all of its supersets are pruned from the search]


Apriori: A Candidate Generation-and-test Approach

- Apriori pruning principle: if there is any itemset that is infrequent, its
  supersets should not be generated/tested!
- Algorithm [Agrawal & Srikant 1994]
  1. Ck ← perform level-wise candidate generation (from singleton itemsets)
  2. Lk ← verify Ck against the DB (keep the candidates meeting min_sup)
  3. Ck+1 ← generate from Lk
  4. Go to 2 if Ck+1 is not empty


The Apriori Algorithm

- Pseudo-code:
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in the database do
            increment the count of all candidates in Ck+1
            that are contained in t
        end
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
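A compact, runnable Python sketch of this loop (a teaching toy assuming in-memory transactions, not the original Agrawal & Srikant implementation):

```python
from itertools import combinations

def apriori(db, min_sup):
    """Return {itemset: support count} for all frequent itemsets.
    `db` is a list of sets; `min_sup` is an absolute count."""
    def count(cands):
        counts = {c: 0 for c in cands}
        for t in db:
            for c in cands:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_sup}

    items = {i for t in db for i in t}
    L = count({frozenset([i]) for i in items})   # L1
    result, k = dict(L), 1
    while L:
        # self-join Lk with itself, then prune via the Apriori property
        cands = {a | b for a in L for b in L if len(a | b) == k + 1}
        cands = {c for c in cands
                 if all(frozenset(s) in L for s in combinations(c, k))}
        L = count(cands)                         # one DB scan per level
        result.update(L)
        k += 1
    return result

# The TDB from the example on the next slide, with min_sup = 2 (i.e., 50%):
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(db, 2))  # includes frozenset({'B', 'C', 'E'}): 2
```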
The Apriori Algorithm—An Example (minsup = 50%)

  Database TDB          1st scan → C1            L1
  Tid | Items           Itemset | sup            Itemset | sup
  10  | A, C, D         {A}     | 2              {A}     | 2
  20  | B, C, E         {B}     | 3              {B}     | 3
  30  | A, B, C, E      {C}     | 3              {C}     | 3
  40  | B, E            {D}     | 1              {E}     | 3
                        {E}     | 3

  C2 (from L1)    2nd scan → C2 counts    L2
  {A, B}          {A, B} | 1              Itemset | sup
  {A, C}          {A, C} | 2              {A, C}  | 2
  {A, E}          {A, E} | 1              {B, C}  | 2
  {B, C}          {B, C} | 2              {B, E}  | 3
  {B, E}          {B, E} | 3              {C, E}  | 2
  {C, E}          {C, E} | 2

  C3 (from L2): {B, C, E}
  3rd scan → L3: {B, C, E} with sup 2
Important Details of Apriori

1. How to generate candidates?
   - Step 1: self-joining Lk (what's the join condition? why?)
   - Step 2: pruning
2. How to count the supports of candidates?

Example of candidate generation:
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
  - abcd from abc and abd
  - acde from acd and ace
- Pruning:
  - acde is removed because ade is not in L3
- C4 = {abcd}
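The same self-join-and-prune step as a Python sketch (the helper name is mine; it mirrors the SQL formulation on the next slide):

```python
from itertools import combinations

def gen_candidates(Lk):
    """Self-join Lk on the first k-1 items, then Apriori-prune.
    `Lk` is a set of tuples whose items are sorted in a fixed order."""
    cands = set()
    for p in Lk:
        for q in Lk:
            # join condition: equal (k-1)-prefixes, p's last item < q's last
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                cands.add(p + (q[-1],))
    # prune every candidate that has an infrequent (k-1)-subset
    return {c for c in cands
            if all(s in Lk for s in combinations(c, len(c) - 1))}

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")}
print(gen_candidates(L3))  # {('a','b','c','d')}; acde pruned (ade not in L3)
```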
Generating Candidates in SQL

- Suppose the items in Lk-1 are listed in a fixed order
- Step 1: self-joining Lk-1

    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1 and … and p.itemk-2 = q.itemk-2
      and p.itemk-1 < q.itemk-1

- Step 2: pruning

    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck


Derive rules from frequent itemsets

- Frequent itemsets != association rules
- One more step is required to find association rules:
  For each frequent itemset X,
      for each proper nonempty subset A of X:
          let B = X - A;
          A → B is an association rule if confidence(A → B) ≥ min_conf,
          where support(A → B) = support(AB), and
          confidence(A → B) = support(AB) / support(A)
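As a Python sketch (assuming the support counts of all frequent itemsets are already available from the previous phase; the names are mine):

```python
from itertools import combinations

def derive_rules(freq, min_conf):
    """Yield (A, B, support, confidence) for every rule A → B = X - A.
    `freq` maps frozenset -> support; every subset of a frequent X is
    also a key of `freq`, by the Apriori property."""
    for X, sup_X in freq.items():
        if len(X) < 2:
            continue
        for r in range(1, len(X)):                 # proper nonempty subsets
            for A in map(frozenset, combinations(X, r)):
                conf = sup_X / freq[A]
                if conf >= min_conf:
                    yield A, X - A, sup_X, conf

freq = {frozenset("A"): 3, frozenset("C"): 2, frozenset("AC"): 2}
for A, B, s, c in derive_rules(freq, 0.5):
    print(sorted(A), "->", sorted(B), f"sup={s}, conf={c:.0%}")
# ['A'] -> ['C'] sup=2, conf=67%
# ['C'] -> ['A'] sup=2, conf=100%
```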


Example – deriving rules from frequent itemsets

- Suppose {2, 3, 4} is frequent, with supp = 50%
  - Its proper nonempty subsets 23, 24, 34, 2, 3, 4 have supp = 50%, 50%,
    75%, 75%, 75%, 75% respectively
- These generate the following association rules:
  - 23 → 4, confidence = 100%
  - 24 → 3, confidence = 100%
  - 34 → 2, confidence = 67% = (N * 50%) / (N * 75%)
  - 2 → 34, confidence = 67%
  - 3 → 24, confidence = 67%
  - 4 → 23, confidence = 67%
  - All rules have support = 50%

Q: Is there any optimization (e.g., pruning) for this step?
Deriving rules

- To recap: to obtain A → B, we only need support(AB) and support(A)
- This step is not as time-consuming as frequent itemset generation
  - Why?
- It is also easy to speed up using techniques such as parallel processing
  - How?
- Do we really need candidate generation for deriving association rules?
  - Frequent-Pattern Growth (FP-tree)


Bottleneck of Frequent-pattern Mining

- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of
  candidates
  - To find the frequent itemset i1 i2 … i100:
    - # of scans: 100
    - # of candidates: C(100, 1) + C(100, 2) + … + C(100, 100) = 2^100 - 1
      ≈ 1.27 × 10^30
- Bottleneck: candidate generation-and-test
- Can we avoid candidate generation altogether?


- FP-growth


No Pain, No Gain
Java Lisp Scheme Python Ruby

Alice X X

Bob X X

Charlie X X X

Dora X X
minsup = 1

- Apriori:
  - L1 = {J, L, S, P, R}
  - C2 = all the C(5, 2) = 10 combinations
  - Most of C2 do not contribute to the result
  - There is no way to tell in advance, because Apriori keeps only support
    counts, not the supporting transactions
No Pain, No Gain
Java Lisp Scheme Python Ruby

Alice X X

Bob X X

Charlie X X X

Dora X X
minsup = 1
Ideas:
- Keep the support set for each frequent itemset
- DFS

[Search tree: ∅ → J, with support set {A, C}. Should we next try J → JL?
J → …? We only need to look at the support set for J.]
No Pain, No Gain
Java Lisp Scheme Python Ruby

Alice X X

Bob X X

Charlie X X X

Dora X X
minsup = 1
Ideas:
- Keep the support set for each frequent itemset
- DFS

[Search tree: ∅ → J {A, C} → JP {C} → JPR {C}; and J → JR {A, C}]
Notations and Invariants

- ConditionalDB:
  - DB|p = {t ∈ DB | t contains itemset p}
  - DB = DB|∅ (i.e., conditioned on nothing)
  - Shorthand: DB|px = DB|(p ∪ x)
- Invariant: SupportSet(p ∪ x, DB) = SupportSet(x, DB|p)
  - Analogy: {x | x mod 6 = 0 ∧ x ∈ [100]} =
    {x | x mod 3 = 0 ∧ x ∈ even([100])}
- An FP-tree is equivalent to a DB|p
  - Each one can be converted to the other
- Next, we illustrate the algorithm using the conditionalDB representation
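A tiny Python sketch of the conditional-DB operation and the invariant above (the toy DB is made up for illustration, loosely modeled on the languages example):

```python
def conditional_db(db, p):
    """DB|p: the transactions of DB that contain itemset p."""
    return [t for t in db if p <= t]

support = lambda x, d: sum(x <= t for t in d)

db = [{"J", "R"}, {"L", "S"}, {"J", "P", "R"}, {"S", "P"}]
db_J = conditional_db(db, {"J"})            # the two transactions with J

# Invariant: SupportSet(p ∪ x, DB) = SupportSet(x, DB|p)
assert support({"J", "R"}, db) == support({"R"}, db_J) == 2
```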
FP-tree Essential Idea /1

- A recursive algorithm again!
- FreqItemsets(DB|p):
  - X = FindLocallyFrequentItems(DB|p)
    // an easy task, as only items (not itemsets) are needed
  - output { xp | x ∈ X }
  - foreach x in X:
    - DB*|px = GetConditionalDB+(DB*|p, x)
    - FreqItemsets(DB*|px)
- Why this is complete: every frequent itemset in DB|p belongs to one of the
  following categories: a pattern xi·p for some xi ∈ X (output directly), or
  a pattern ★·p·xi extending some xi (obtained via the recursion on DB*|pxi)
No Pain, No Gain

  DB|J (minsup = 1):
            Java  Lisp  Scheme  Python  Ruby
  Alice      X                            X
  Charlie    X                    X       X

- FreqItemsets(DB|J):
  - {P, R} ← FindLocallyFrequentItems(DB|J)
  - Output {JP, JR}
  - Get DB*|JP; FreqItemsets(DB*|JP)
  - Get DB*|JR; FreqItemsets(DB*|JR)
  - // Guaranteed: no other frequent itemset in DB|J
FP-tree Essential Idea /2

- FreqItemsets(DB|p):
  - if the boundary condition holds, then return
  - X = FindLocallyFrequentItems(DB|p)
  - [optional] DB*|p = PruneDB(DB|p, X)
    // removes items not in X; potentially reduces the # of transactions
    // (empty or duplicate ones), improving efficiency
  - output { xp | x ∈ X }
    // i.e., each item in X, appended with the conditional pattern p
  - foreach x in X:
    - DB*|px = GetConditionalDB+(DB*|p, x)
      // also gets rid of items already processed before x → avoids
      // duplicates
    - [optional] if DB*|px is degenerate (e.g., a single path), then output
      powerset(DB*|px)
    - FreqItemsets(DB*|px)
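The recursion above can be run directly on conditional DBs represented as plain lists of sets. A hedged Python sketch of this scheme (not the FP-tree-based implementation, which the following slides develop):

```python
from collections import Counter

def freq_itemsets(db, min_sup, p=frozenset(), out=None):
    """FreqItemsets(DB|p): record every frequent itemset extending p.
    `db` is the conditional DB for pattern p, as a list of frozensets."""
    if out is None:
        out = {}
    counts = Counter(i for t in db for i in t)
    X = {i: n for i, n in counts.items() if n >= min_sup}
    for x, n in X.items():
        out[p | {x}] = n                     # output pattern x·p
    order = sorted(X)                        # any fixed item order works
    for k, x in enumerate(order):
        keep = set(order[k + 1:])            # drop items processed before x
        cond = [t & keep for t in db if x in t]
        freq_itemsets(cond, min_sup, p | {x}, out)
    return out

db = [frozenset(t) for t in ("fcamp", "fcabm", "fb", "cbp", "fcamp")]
for iset, sup in sorted(freq_itemsets(db, 3).items(), key=lambda e: len(e[0])):
    print("".join(sorted(iset)), sup)        # f 4, c 4, …, acfm 3
```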

Lv 1 Recursion (minsup = 3)

  DB             DB* (pruned & ordered)
  FCADGIMP   →   FCAMP
  ABCFLMO    →   FCABM
  BFHJOW     →   FB
  BCKSP      →   CBP
  AFCELPMN   →   FCAMP

  X = {F, C, A, B, M, P};  Output: F, C, A, B, M, P

  Recurse on the conditional DBs:
  DB*|P; DB*|M (sans P); DB*|B (sans M, P);
  DB*|A (sans B, M, P), e.g., DB*|A = {FCA, FCA, FCA};
  DB*|C (sans A, B, M, P); DB*|F (sans C, A, B, M, P)
Lv 2 Recursion on DB*|P (minsup = 3)

  DB (= DB|P)     DB* (pruned)
  FCAMP       →   C
  CBP         →   C
  FCAMP       →   C

  X = {C};  Output: CP

  DB*|C in this context is actually FullDB*|CP. The Lv 3 recursion on
  DB*|CP sees a DB with only empty transactions, so X = {} → it returns
  immediately.
Lv 2 Recursion on DB*|A (sans B, M, P) (minsup = 3)

  DB (= {FCA, FCA, FCA})     DB* (pruned)
  FCA                    →   FC
  FCA                    →   FC
  FCA                    →   FC

  X = {F, C};  Output: FA, CA

  DB*|C = {FC, FC, FC}, which is actually FullDB*|CA; further recursion on
  it outputs FCA. DB*|F = {F, F, F} is a boundary case → STOP.
A Different Example: Lv 2 Recursion on DB*|P (minsup = 2)

  DB (= DB|P)     DB* (pruned)
  FCAMP       →   FCA
  FCBP        →   FC
  FAP         →   FA

  X = {F, C, A};  Output: FP, CP, AP

  DB*|A = {FC, F}, which is actually FullDB*|AP; recursing on it gives
  X = {F} and Output: FAP. Similarly for DB*|C and DB*|F.
I will give you back the FP-tree

- An FP-tree of DB consists of:
  - a fixed order among the items in DB
  - a prefix tree, with threads (node-links), of the sorted transactions
    in DB
  - a header table of (item, freq, ptr) entries
- When used in the algorithm, the input DB is always pruned
  (cf. PruneDB()):
  - remove infrequent items
  - remove infrequent items from every transaction
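A minimal FP-tree construction sketch in Python (the class layout is mine; the header table's node-links are approximated by a plain list of nodes per item, and frequency ties are broken alphabetically here, so c may precede f, unlike the slides' f-first order):

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                       # item -> FPNode

def build_fptree(db, min_sup):
    """Prune infrequent items, sort each transaction in descending
    frequency order, and insert it along a shared-prefix path."""
    freq = Counter(i for t in db for i in t)
    freq = {i: n for i, n in freq.items() if n >= min_sup}
    root = FPNode(None, None)
    header = {i: [] for i in freq}               # item -> list of its nodes
    for t in db:
        items = sorted((i for i in t if i in freq),
                       key=lambda i: (-freq[i], i))  # ties broken by name
        node = root
        for i in items:
            if i not in node.children:
                node.children[i] = FPNode(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += 1
    return root, header

db = [set("facdgimp"), set("abcflmo"), set("bfhjow"),
      set("bcksp"), set("afcelpmn")]
root, header = build_fptree(db, 3)
print({i: sum(n.count for n in header[i]) for i in sorted(header)})
# {'a': 3, 'b': 3, 'c': 4, 'f': 4, 'm': 3, 'p': 3}
```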


FP-tree Example (minsup = 3)

  TID | Items bought             | (ordered) frequent items
  100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
  200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
  300 | {b, f, h, j, o, w}       | {f, b}
  400 | {b, c, k, s, p}          | {c, b, p}
  500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

  Header table:
  item | freq | head
  f    | 4    | …
  c    | 4    | …
  a    | 3    | …
  b    | 3    | …
  m    | 3    | …
  p    | 3    | …

  [Figure: the tree after inserting t1, after inserting t2, and after
  inserting all ti]

  Final FP-tree:
  {}
  ├── f:4
  │   ├── c:3
  │   │   └── a:3
  │   │       ├── m:2
  │   │       │   └── p:2
  │   │       └── b:1
  │   │           └── m:1
  │   └── b:1
  └── c:1
      └── b:1
          └── p:1

  Output: f, c, a, b, m, p
Mining the FP-tree: p's conditional pattern base

  Following p's node-links in the FP-tree:
    f c a m : 2
    c b     : 1
  Item counts: f:2, c:3, a:2, m:2, b:1 → only c is locally frequent
  (minsup = 3)

  Output: pc

  Cleaned p's conditional pattern base: {c : 2, c : 1}
  → p's conditional FP-tree: {} → c:3
  The recursion on it then STOPs.
m's conditional pattern base:
    f c a   : 2
    f c a b : 1
  Item counts: f:3, c:3, a:3, b:1 → locally frequent: f, c, a

  Output: mf, mc, ma

  m's conditional FP-tree is a single path: {} → f:3 → c:3 → a:3
  → gen_powerset → Output: mac, maf, mcf, macf
b's conditional pattern base:
    f c a : 1
    f     : 1
    c     : 1
  Item counts: f:2, c:2, a:1 → nothing reaches minsup → STOP
a's conditional pattern base:
    f c : 3
  Item counts: f:3, c:3 → locally frequent: f, c

  Output: af, ac

  a's conditional FP-tree is a single path: {} → f:3 → c:3
  → gen_powerset → Output: acf
c's conditional pattern base:
    f : 3
  Item count: f:3 → locally frequent: f

  Output: cf

  c's conditional FP-tree: {} → f:3 → STOP

f's conditional pattern base is empty → STOP
FP-Growth vs. Apriori: Scalability with the Support Threshold

[Figure: run time (sec.) vs. support threshold (%) on data set T25I20D10K,
comparing D1 FP-growth runtime and D1 Apriori runtime. As the support
threshold decreases toward 0%, Apriori's runtime grows sharply (toward
~100 sec.) while FP-growth's stays low.]


Why Is FP-Growth the Winner?

- Divide-and-conquer:
  - decomposes both the mining task and the DB according to the frequent
    patterns obtained so far
  - leads to focused searches of smaller databases
- Other factors:
  - no candidate generation, no candidate test
  - compressed database: the FP-tree structure
  - no repeated scan of the entire database
  - basic ops are counting local frequent items and building sub-FP-trees;
    no pattern search and matching
