
Association Rule Mining using Apriori algorithm

For food dataset


Project done by K Raja (13MCMB25) & T Shiva Prasad (13MCMB16)
Under the guidance of
Dr. V.Ravi, Associate Professor, IDRBT

Abstract
Market basket analysis is a process that analyses the habits of buyers to find
relationships between the different items in their market baskets. The discovery of these
relationships can help a merchant develop a sales strategy based on the items that
customers frequently purchase together. In this work, data mining with the market
basket analysis method is implemented to analyse the buying habits of customers.
The testing is conducted in Minimarket X. The search for frequent item sets is
performed by the Apriori algorithm, which finds the items, and the pairs of items within
one transaction, that appear often in the database. Pairs of items that exceed the minimum
support are included in the frequent item sets. Frequent item sets that exceed the minimum
support generate association rules after decoding. Each frequent item set can generate
association rules, for which the confidence is then computed. The test results show that the
application can report which products are frequently bought at the same time by customers,
according to Hybrid-dimension Association Rules criteria. The results of the mining process
show correlations in the data (association rules), together with their support and confidence,
that can be analysed. This information gives the owners of Minimarket X additional input
for further decisions.

1. Introduction
Minimarket X wants to analyse the shopping habits of its customers to find associations
and correlations among the items in their shopping baskets. Specifically, this market basket
analysis aims to determine which items are frequently purchased together by customers. The
application is designed to perform multi-dimensional data mining, where each variable
represents one particular dimension. The dimensions involved are the items purchased and
the time of purchase. The information generated consists of the rules and the correlations
between the items, which support decision making. The goal of this research is to develop an
application that finds out which products are often purchased together by customers, along
with the attributes that influence this, such as the product and the time of purchase, using
Minimarket X as a case study and Hybrid-Dimension Association Rules criteria.

2. Problem Statement
The aim of this project is to find the relations between the items in the baskets of a
particular market X and to determine the items that are frequently purchased together by
customers. We use the Food dataset from IBM SPSS Modeler.

3. Data Mining
3.1. Market Basket Analysis
This is a process that analyses the habits of buyers to find relationships between the
different items in their shopping cart (market basket). The discovery of these relationships
can help the seller develop a sales strategy based on the items frequently purchased together
by customers. For example, if a buyer buys flour, how likely are they to buy sugar in the same
transaction [1]?

3.2. Association Rules


Association rule mining is a procedure that looks for relationships between an item and
other items. An association rule is usually written in "if ... then" form, such as "if A then B
and C", meaning that a transaction containing A tends to contain B and C as well. To determine
association rules, a minimum support and a minimum confidence must be specified to decide
whether a rule is interesting or not.
Support: A measure of how dominant an item or item set is across all transactions.
Confidence: A measure of the conditional relationship between items
(e.g. how frequently item B is purchased when a person buys item A).
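
As a small illustration of these two measures, the sketch below counts support and confidence for the rule "if flour then sugar" on a toy list of transactions. The transactions and the helper names `support` and `confidence` are hypothetical and chosen only for this example; they are not taken from the Food dataset.

```python
# Toy transactions (hypothetical data, not from the Food dataset).
transactions = [
    {"flour", "sugar", "eggs"},
    {"flour", "sugar"},
    {"flour", "milk"},
    {"sugar", "milk"},
    {"flour", "sugar", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by the
    support of the antecedent alone."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


# Rule: if flour then sugar.
sup_rule = support({"flour", "sugar"}, transactions)        # 3 of 5 transactions
conf_rule = confidence({"flour"}, {"sugar"}, transactions)  # 3 of the 4 flour baskets
print(f"support={sup_rule:.2f}, confidence={conf_rule:.2f}")
```

Here flour and sugar appear together in 3 of 5 baskets, and 3 of the 4 baskets that contain flour also contain sugar.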

3.3. Apriori Algorithm


Apriori is an iterative approach known as level-wise search, where k-item sets are used
to explore (k+1)-item sets. First, the collection of frequent 1-item sets is found by scanning
the database to accumulate the count of each item and keeping the items that satisfy the
minimum support. The result is denoted by L1. Next, L1 is used to find L2, the collection of
frequent 2-item sets, which is used to find L3, and so on until no more frequent k-item sets
can be found. Finding each Lk requires one full scan of the database [1]. To make the search
for frequent item sets more efficient, the Apriori property is used, which says "All non-empty
subsets of a frequent item set must also be frequent". When written in the form of pseudo
code, the Apriori algorithm is as follows [2]

The pseudo code for the join step that forms the candidate item sets is given below
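
The level-wise search and the candidate formation described above can be sketched in Python as follows. This is a minimal illustration, not the pseudo code from [2]: the function names `generate_candidates` and `apriori`, the toy transactions, and the choice to treat "support greater than or equal to the threshold" as frequent are all assumptions made for this example.

```python
from itertools import combinations


def generate_candidates(prev_frequent, k):
    """Join and prune: build k-item candidates from frequent (k-1)-item sets."""
    prev = set(prev_frequent)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) != k:
                continue  # join only sets that differ in exactly one item
            # Prune: by the Apriori property, every (k-1)-subset must be frequent.
            if all(frozenset(s) in prev for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates


def apriori(transactions, min_support):
    """Return {item set: count} for item sets with relative support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    # L1: count single items in one scan of the database.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c / n >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        candidates = generate_candidates(current.keys(), k)
        counts = {c: 0 for c in candidates}
        for t in transactions:  # one full database scan per level
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c / n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy data for illustration only.
baskets = [
    {"flour", "sugar", "eggs"},
    {"flour", "sugar"},
    {"flour", "milk"},
    {"sugar", "milk"},
    {"flour", "sugar", "milk"},
]
frequent_sets = apriori(baskets, min_support=0.4)
for itemset, count in sorted(frequent_sets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

On this toy data the 3-item candidate {flour, milk, sugar} survives the prune step, because all of its pairs are frequent, but it is rejected after counting because it occurs in only one basket.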

Once the frequent item sets have been obtained, strong association rules are generated
from them. The strength of a rule can be assessed with the following measures [3].
a) Support
The rule X ⇒ Y holds with support s if s% of the transactions in D contain X ∪ Y. Rules
that have an s greater than a user-specified support are said to have minimum support.
\[ \mathrm{Support}(X \Rightarrow Y) = \frac{\text{number of transactions that contain both } X \text{ and } Y}{\text{total number of transactions}} \]
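b) Confidence
The rule X ⇒ Y holds with confidence c if c% of the transactions in D that contain X
also contain Y. Rules that have a c greater than a user-specified confidence are said to
have minimum confidence.
\[ \mathrm{Confidence}(X \Rightarrow Y) = \frac{\mathrm{SUP}(X \cup Y)}{\mathrm{SUP}(X)} \]
where SUP(·) denotes the support of an item set.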

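c) Interestingness
Identifies rare rules: a rule adds interestingness even though its items have a low
individual support count.
\[ \mathrm{Interestingness}(X \Rightarrow Y) = \frac{\mathrm{SUP}(X \cup Y)}{\mathrm{SUP}(X)} \times \frac{\mathrm{SUP}(X \cup Y)}{\mathrm{SUP}(Y)} \times \left(1 - \frac{\mathrm{SUP}(X \cup Y)}{|D|}\right) \]
where |D| is the total number of records and SUP(·) is taken here as the number of
records that contain the item set.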

d) Comprehensibility
The comprehensibility measure is needed to make the discovered rules easy to
understand. It tries to quantify the understandability of a rule.
\[ \mathrm{Comprehensibility}(X \Rightarrow Y) = \frac{\log(1 + |Y|)}{\log(1 + |X \cup Y|)} \]
Here |Y| and |X ∪ Y| are the number of attributes involved in the consequent body and
in the total rule, respectively.

e) Lift
The lift value is a measure of the importance of a rule. The lift value of an association
rule is the ratio of the confidence of the rule to the expected confidence of the rule.
The expected confidence of a rule is defined as the product of the support values of
the rule body and the rule head divided by the support of the rule body.
\[ \mathrm{Lift}(X \Rightarrow Y) = \frac{\mathrm{SUP}(X \cup Y)}{\mathrm{SUP}(X)\,\mathrm{SUP}(Y)} \]
with supports expressed as fractions of all transactions.

f) Leverage
Leverage measures the difference between how often X and Y appear together in the data
set and what would be expected if X and Y were statistically independent. The rationale in a
sales setting is to find out how many more units (items X and Y together) are sold
than expected from the independent sales.
\[ \mathrm{Leverage}(X \Rightarrow Y) = \mathrm{SUP}(X \cup Y) - \mathrm{SUP}(X)\,\mathrm{SUP}(Y) \]
with supports expressed as fractions of all transactions.

g) Conviction
The ratio of the expected frequency that X occurs without Y (that is to say, the
frequency that the rule makes an incorrect prediction) if X and Y were independent,
divided by the observed frequency of incorrect predictions.
\[ \mathrm{Conviction}(X \Rightarrow Y) = \frac{1 - \mathrm{SUP}(Y)}{1 - \mathrm{Confidence}(X \Rightarrow Y)} \]

h) Coverage
Sometimes called antecedent support. It measures how often a rule is
applicable in a database.
\[ \mathrm{Coverage}(X \Rightarrow Y) = \mathrm{SUP}(X) \]
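
The measures a) to h) can all be derived from four counts: the number of transactions containing X, containing Y, containing both, and the total number of transactions. The sketch below is one way to compute them under the usual textbook definitions; the function name `rule_metrics` and the example counts are hypothetical. The interestingness and comprehensibility terms follow our reading of the definitions attributed to [3], and the sketch assumes X and Y are disjoint, so that |X ∪ Y| = |X| + |Y|.

```python
import math


def rule_metrics(n_x, n_y, n_xy, n_total, size_x, size_y):
    """Rule-strength measures for X => Y computed from transaction counts.

    n_x, n_y, n_xy -- transactions containing X, Y, and both X and Y
    n_total        -- total number of transactions |D|
    size_x, size_y -- number of items in X and in Y (assumed disjoint)
    """
    # Relative supports (fractions of all transactions).
    sup_x, sup_y, sup_xy = n_x / n_total, n_y / n_total, n_xy / n_total
    confidence = n_xy / n_x
    return {
        "support": sup_xy,
        "confidence": confidence,
        # Uses raw counts for the first two ratios and |D| for the third.
        "interestingness": (n_xy / n_x) * (n_xy / n_y) * (1 - n_xy / n_total),
        "comprehensibility": math.log(1 + size_y) / math.log(1 + size_x + size_y),
        "lift": sup_xy / (sup_x * sup_y),
        "leverage": sup_xy - sup_x * sup_y,
        # Conviction is unbounded when the rule is never wrong.
        "conviction": math.inf if confidence == 1 else (1 - sup_y) / (1 - confidence),
        "coverage": sup_x,
    }


# Hypothetical counts: X in 4 baskets, Y in 4, both in 3, out of 5 baskets.
metrics = rule_metrics(n_x=4, n_y=4, n_xy=3, n_total=5, size_x=1, size_y=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this example the lift is below 1 and the leverage is negative, so X and Y occur together slightly less often than independence would predict, even though the confidence is 0.75.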

4. Basic architecture

a) Input data: Supply the existing data set.
b) Training the data: The algorithm learns about the data.
c) Building the model: Based on that knowledge, the model is built.
d) Knowledge: Obtain rules from the model.
e) Decision making: Take decisions based on the rules.

5. Models other than Apriori algorithm


1) FpGrowth
FP-Growth stands for frequent pattern growth. It is a scalable technique for mining
frequent patterns in a database. FP-Growth improves on Apriori to a large extent, because
frequent item set mining is possible without candidate generation.
It is a simple two-step procedure:
a) Build a compact data structure called the FP-tree
   Pass 1:
   - Scan the data and find the support of each item.
   - Discard infrequent items.
   - Sort the frequent items in decreasing order of support.
   - Use this order when building the FP-tree, so that common prefixes can be shared.
   Pass 2:
   - FP-Growth reads one transaction at a time and maps it to a path.
   - A fixed order is used, so paths can overlap when transactions share items
     (when they have the same prefix). In that case the counters along the shared
     path are incremented.
   - Pointers are maintained between nodes containing the same item, creating
     singly linked lists (dotted lines).
   - The more paths overlap, the higher the compression, so the FP-tree may fit in
     memory.
b) Extract frequent item sets directly from the FP-tree
   - The extraction is a bottom-up algorithm that works from the leaves towards the root.
   - Divide and conquer: first look for frequent item sets ending in e, then de,
     and so on; then d, then cd, and so on.
   - First, extract the prefix path sub-trees ending in an item (set).
   - Each prefix path sub-tree is processed recursively to extract the frequent
     item sets. The solutions are then merged.
2) Predictive Apriori
The predictive Apriori algorithm mines association rules by searching, with an
increasing support threshold, for the best n rules with respect to a support-based
corrected confidence value.
In the predictive Apriori algorithm, support and confidence are combined into a single
measure called predictive accuracy.
{Support, Confidence} => Accuracy
This predictive accuracy is then used to generate the association rules. [4]
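
The two-pass FP-tree construction described under FpGrowth above can be sketched as follows. This is a minimal illustration of the tree-building step only (it does not perform the recursive extraction of frequent item sets); the class `FPNode`, the functions `build_fp_tree` and `linked_count`, the alphabetical tie-break between equally frequent items, and the toy transactions are all assumptions made for this example.

```python
from collections import Counter


class FPNode:
    """One node of the FP-tree: an item, its counter, and its links."""

    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> child FPNode
        self.next = None    # next node holding the same item (singly linked list)


def build_fp_tree(transactions, min_count):
    """Build an FP-tree in two passes; return (root, header table)."""
    # Pass 1: find the support of each item and discard infrequent items.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    # Fixed order: decreasing support, ties broken alphabetically (an assumption).
    ordered = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(ordered)}

    root = FPNode(None, None)
    header = {}  # item -> first node in that item's linked list
    tails = {}   # item -> last node in that item's linked list

    # Pass 2: map each transaction to a path, sharing common prefixes.
    for t in transactions:
        path = sorted((i for i in set(t) if i in frequent), key=rank.get)
        node = root
        for item in path:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                if item in tails:
                    tails[item].next = child
                else:
                    header[item] = child
                tails[item] = child
            child.count += 1  # increment the counter along the path
            node = child
    return root, header


def linked_count(header, item):
    """Total count of `item`, obtained by following its node links."""
    total, node = 0, header.get(item)
    while node is not None:
        total += node.count
        node = node.next
    return total


# Hypothetical toy data for illustration only.
baskets = [
    {"flour", "sugar", "eggs"},
    {"flour", "sugar"},
    {"flour", "milk"},
    {"sugar", "milk"},
    {"flour", "sugar", "milk"},
]
root, header = build_fp_tree(baskets, min_count=2)
print({item: linked_count(header, item) for item in header})
```

Four of the five toy baskets share the prefix "flour", so they are stored under a single child of the root, which is the compression effect described above.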

6. Results
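Parameter settings

| Parameter       | Value |
|-----------------|-------|
| Support min     | 0.05  |
| Confidence min  | 0.5   |
| Max rule length | 4     |
| Lift filtering  | 1.1   |

[Figure: Apriori: Rules]

[Figure: Apriori: Metrics]

[Figure: FPgrowth: Rules]

[Figure: FPgrowth: Metrics]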

7. Comparing Models
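Comparison table

| Parameter          | Apriori                                                                            | FpGrowth                                                                                                                      |
|--------------------|------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|
| Technique          | Uses the Apriori property and the join and prune properties                        | Constructs a conditional frequent pattern tree and a conditional pattern base from the database that satisfy minimum support  |
| Memory Utilization | A large number of candidates is generated, so it requires large memory space       | Due to its compact structure and no candidate generation, it requires less memory                                             |
| Number of Scans    | Multiple scans for generating candidate sets                                       | Scans the DB only twice                                                                                                       |
| Time               | Execution time is higher, as time is spent producing candidates at every iteration | Execution time is smaller than for the Apriori algorithm                                                                      |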

Based on Metrics

[Figure: Support × confidence]

[Figure: Support]

[Figure: Confidence]

8. Conclusion
Based on this comparison, we conclude that FpGrowth is a good model for association rule
mining in Market Basket Analysis.

9. References
[1] D. Olson and Y. Shi, Introduction to Business Data Mining. New York: McGraw-Hill, 2007.
[2] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan
Kaufmann, 2001.
[3] M. K. Gupta and G. Sikka, Association Rules Extraction using Multi-objective Feature
of Genetic Algorithm.
[4] International Journal of Advanced Research in Computer Science and Software
Engineering, Volume 3, Issue 6, June 2013.