Fast Algorithms For Mining Association Rules: Milan Garg Rohit Das Sarthak Mittal

Fast Algorithms for Mining Association Rules
Rakesh Agarwal Ramakrishnan Srikant
Milan Garg
Rohit Das
Sarthak Mittal
Introduction
Bar Code technology has brought in a lot of scope for retail organizations to
store large amounts of data which is referred to as basket data. We use
association mining rules over basket data to find most frequently occurring
rules over the dataset. Eg People who buy tires also get car services done.
Finding such rules helps a lot for attached mail and cross market services. The
databases involved are very large and hence we need fast algorithms for this
task.
Problem Statement
Given a database of transactions , find rules that will predict the occurrence of an item based on the
occurrences of other items in the transaction.
TID Items
{Diaper} -> {Beer}
1 Bread,Milk
2 Bread,Diaper,Beer,Eggs {Milk,Bread} -> {Eggs,Coke}
3 Milk,Diaper,Beer,Coke {Beer,Bread} -> {Milk}
4 Bread,Milk,Diaper,Beer
5 Bread,Milk,Diaper,Coke
Motivation
Walmart has lots of data about point of sales
Data on each user
Basket of items that has been bought during each visit.
Similarly, phone companies have information available about phone

calls made at a particular time to a location
How can the above information be leveraged for better logistics,

better placement of items on shelves etc.
L1 = {large 1-itemsets};
for( k=2; Lk-1; k++) do begin
Ck = apriori-gen(Lk-1); // New candidates
Algorithm for forall transactions t D do // Counting support
Apriori Ct = subset(Ck, t); // Candidates in t
forall candidates c Ct do
c.count++;
end
Lk = {c Ck | c.count minsup}
End
Answer = k Lk;
Flowchart for Apriori Algorithm
Get Frequent Generate Candidate Get Frequent Item

Start
Items Item set sets
No
Yes Generate
End Generate Strong
set = NULL
Rules
Apriori Candidate Generation
Takes in Lk-1 and returns Ck. Candidate Generation (Join)
Two steps:
Join large itemsets Lk-1 with Lk-1.
Prune out all itemsets in joined result

which contain a (k-1)subset not found in
Lk-1.
Candidate Generation (Example)
L3 C4 C4
{Beer, Butter, Coke} Join {Beer, Butter, Coke, Cheese} Prune {Beer, Butter, Coke, Cheese}
{Beer, Butter, Cheese} {Beer, Coke, Cheese, Maggi}
{Beer, Coke, Cheese}
{Beer, Coke, Maggi}
{Butter, Coke, Cheese}

Hash-tree Example
Depth
Beer Butter 1
L3
{Beer, Butter, Coke}
{Beer, Butter, Cheese} Butter Coke {Butter, Coke,

2
Cheese}
{Beer, Coke, Cheese}
{Beer, Coke, Maggi} {Beer, Butter, Coke} {Beer, Coke, Cheese}

{Beer, Butter, Cheese} {Beer, Coke, Maggi} 3
{Butter, Coke, Cheese}
AprioriTid
Does not use the transactions in the database for counting itemset
support.
Instead stores transactions as sets of possible large itemsets, Ck.
Each member of Ck is of the form:
< TID, {Xk}> , Xk is a possible large itemset

AprioriTid Example
TID Items TID Set of Itemsets Itemset Suport
100 134 100 {{1} ,{3} ,{4}} {1} 2
200 235 200 {{2} ,{3} ,{5}} {2} 3
300 1235 300 {{1} ,{2} ,{3} ,{5}} {3} 3
400 25 400 {{2} ,{5}} {5} 3

AprioriTid Example
Itemsets TID Set of Itemsets
Itemset Suport
{12} 100 {{1 3}}
{1 3} 2
{13} 200 {{2 3} ,{2 5} ,{3 5}}
{2 3} 2
{15} 300 {{1 2} ,{1 3} ,{1 5} {2
{2 5} 3
3} ,{2 5} ,{3 5}}
{23}
{3 5} 2
400 {{2 5}}
{25}
{35}
TID Set-of-Items ItemSet Support
Itemset 200 {{2 3 5}} {2 3 5} 2
{2 3 5} 300 {{2 3 5}}
AprioriHybrid
It is not necessary to use the same algorithm for all

the passes.
Combine the two algorithms!
Start with Apriori
When Ck can fit in main memory switch to

AprioriTID
Apriori Advantages/Disadv
Advantages: Disadvantages:
This algorithm has less It requires many scans of
memory consumption. database.
Easy implementation. It allows only a single
It uses Apriori property for minimum support threshold.
pruning therefore, itemsets It is favourable only for
left for further support small database.
checking remain less. It explains only the presence
or absence of an item in the
database.
Conclusion
The performance gap increased with the problem size

and ranged from a factor of three for small problems
to more than an order of magnitude for large
problems
The algorithms presented in the paper have been

implemented on several data repositories and were
found to give consistent results.
Thanks!

Fast Algorithms For Mining Association Rules: Milan Garg Rohit Das Sarthak Mittal

Transféré par

Informations du document

Titre original

Copyright

Formats disponibles

Partager ce document

Partager ou intégrer le document

Options de partage

Avez-vous trouvé ce document utile ?

Ce contenu est-il inapproprié ?

Droits d'auteur :

Formats disponibles

Fast Algorithms For Mining Association Rules: Milan Garg Rohit Das Sarthak Mittal

Transféré par

Droits d'auteur :

Formats disponibles

Fast Algorithms for Mining Association Rules

Rakesh Agarwal Ramakrishnan Srikant

2 Bread,Diaper,Beer,Eggs {Milk,Bread} -> {Eggs,Coke}

3 Milk,Diaper,Beer,Coke {Beer,Bread} -> {Milk}

Walmart has lots of data about point of sales

Data on each user

Basket of items that has been bought during each visit.

Similarly, phone companies have information available about phone

How can the above information be leveraged for better logistics,

for( k=2; Lk-1; k++) do begin

Ck = apriori-gen(Lk-1); // New candidates

Algorithm for forall transactions t D do // Counting support

Apriori Ct = subset(Ck, t); // Candidates in t

Get Frequent Generate Candidate Get Frequent Item

Takes in Lk-1 and returns Ck. Candidate Generation (Join)

Join large itemsets Lk-1 with Lk-1.

Prune out all itemsets in joined result

{Beer, Butter, Cheese} {Beer, Coke, Cheese, Maggi}

{Beer, Coke, Cheese}

{Beer, Coke, Maggi}

{Butter, Coke, Cheese}

{Beer, Butter, Coke}

{Beer, Butter, Cheese} Butter Coke {Butter, Coke,

{Beer, Coke, Maggi} {Beer, Butter, Coke} {Beer, Coke, Cheese}

Instead stores transactions as sets of possible large itemsets, Ck.

Each member of Ck is of the form:

< TID, {Xk}> , Xk is a possible large itemset

TID Items TID Set of Itemsets Itemset Suport

100 134 100 {{1} ,{3} ,{4}} {1} 2

200 235 200 {{2} ,{3} ,{5}} {2} 3

300 1235 300 {{1} ,{2} ,{3} ,{5}} {3} 3

400 25 400 {{2} ,{5}} {5} 3

It is not necessary to use the same algorithm for all

Combine the two algorithms!

Start with Apriori

When Ck can fit in main memory switch to

The performance gap increased with the problem size

The algorithms presented in the paper have been

Vous aimerez peut-être aussi