
Data Structures

Notes for Lecture 14


Techniques of Data Mining
By
Samaher Hussein Ali
2007-2008

Association Rules: Basic Concepts and Application


1. Association rules:
Given a set of transactions, find rules that will predict the occurrence of an item based on the
occurrences of other items in the transaction.

2. Definition: Frequent Itemset


• Itemset
- A collection of one or more items. Example: {Milk, Bread, Diaper}.
- k-itemset: an itemset that contains k items. Example: {Milk, Bread, Diaper} is a 3-itemset.
• Support count (σ)
- Frequency of occurrence of an itemset. Example: σ({Milk, Bread, Diaper}) = 2
• Support
- Fraction of transactions that contain an itemset. Example: s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
- An itemset whose support is greater than or equal to a minsup threshold
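
A minimal sketch, in Python, of how these quantities can be computed. The five transactions below are an illustrative assumption (they are not given in these notes), chosen so that σ({Milk, Bread, Diaper}) = 2 and s = 2/5 as in the examples above:

# Support count and support for an itemset (minimal sketch).
# The transactions are assumed sample data, not taken from the notes.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4 = 2/5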

3. Definition: Association Rule


Association rules are one of the most promising aspects of data mining as a knowledge discovery
tool, and have been widely explored to date. They make it possible to capture all possible rules that
explain the presence of some attributes according to the presence of other attributes. An association
rule is a statement of the form "for a specified fraction of transactions, a particular value of attribute
set X determines the value of attribute set Y as another particular value under a certain confidence".
Thus, association rules aim at discovering the patterns of co-occurrence of attributes in a database.
Association rules are useful in many applications, such as supermarket transaction analysis, store
layout and item promotions, telecommunications alarm correlation, university course enrollment
analysis, customer behavior analysis in retailing, catalog design, word occurrence in text documents,
users' visits to WWW pages, military mobilization, stock transactions, etc.
4. Mining Association Rules: Two-step approach
A. Frequent Itemset Generation: generate all itemsets whose support ≥ minsup.
B. Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a
binary partitioning of a frequent itemset.
As a result:
Association Rule
• An implication expression of the form X → Y, where X and Y are itemsets
Example: {Milk, Diaper} → {Beer}
Rule Evaluation Metrics
• Support (s)
Fraction of transactions that contain both X and Y: s(X → Y) = σ(X ∪ Y) / N, where N is the total
number of transactions
• Confidence (c)
Measures how often items in Y appear in transactions that contain X: c(X → Y) = σ(X ∪ Y) / σ(X)
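
Building on the sketch above, both rule metrics for X → Y follow from the same support counts (rule_metrics is an illustrative helper name, not a library function):

# Support and confidence of a rule X -> Y, reusing support_count()
# and the assumed sample transactions from the previous sketch.
def rule_metrics(X, Y, transactions):
    s = support_count(X | Y, transactions) / len(transactions)               # support
    c = support_count(X | Y, transactions) / support_count(X, transactions)  # confidence
    return s, c

s, c = rule_metrics({"Milk", "Diaper"}, {"Beer"}, transactions)
print(s, c)  # 0.4 and 0.666...: {Milk, Diaper} -> {Beer} holds in 2 of the
             # 3 transactions that contain {Milk, Diaper}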

5. The Apriori Algorithm
Apriori is an efficient association rule mining algorithm. It employs breadth-first search and uses a
hash-tree structure to count candidate itemsets efficiently.
• The algorithm generates candidate itemsets (patterns) of length k from itemsets of length k-1.
• The patterns that have an infrequent sub-pattern are then pruned, according to the minimum
support.
• The whole transaction database is scanned to determine the frequent itemsets among the candidates.
• To determine the frequent itemsets quickly, the algorithm uses a hash tree to store candidate
itemsets.
• Note: a hash tree has itemsets at the leaves and hash tables at internal nodes. The following figure
explains the hash tree structure; a sketch of the join and prune steps is given after Figure 2.

Figure 1: Illustrating Hash Tree Structure

Figure 2: Illustrating Apriori Principle
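
The join and prune steps described in the bullets above can be sketched as follows. This is a simplified illustration under stated assumptions, not the full algorithm: the union-based join generates a superset of the classic lexicographic join (which merges itemsets sharing their first k-2 items), the prune step removes the extras, and the hash tree used for fast counting is omitted:

from itertools import combinations

def apriori_gen(frequent_k_minus_1, k):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets
    (given as frozensets), then prune by the Apriori property."""
    prev = set(frequent_k_minus_1)
    candidates = set()
    # Join step (simplified): any union of two frequent (k-1)-itemsets
    # that has exactly k items becomes a candidate.
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) == k:
                candidates.add(union)
    # Prune step: drop candidates having any infrequent (k-1)-subset.
    return {c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))}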

5.1. Example
Let us look at an example based on the transaction database, D, of Table 1. There are nine
transactions in this database; that is, |D| = 9. We use Figure 3 to illustrate the Apriori algorithm for
finding frequent itemsets in D.


Table 1: Transactional Data for a Computer Branch

A. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1.
The algorithm simply scans all of the transactions in order to count the number of occurrences of each
item.
B. Suppose that the minimum support count required is 2, that is, min sup = 2. The set of frequent 1-
itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying minimum support.
In our example, all of the candidates in C1 satisfy minimum support.
C. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to generate a
candidate set of 2-itemsets, C2. C2 consists of (|L1| choose 2) 2-itemsets. Note that no candidates are
removed from C2 during the prune step because each subset of the candidates is also frequent.
D. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is
accumulated, as shown in the middle table of the second row in Figure 3.
E. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2
having minimum support.
F. The generation of the set of candidate 3-itemsets, C3, is detailed in Figure 4. From the join step, we
first get C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can
determine that the four latter candidates cannot possibly be frequent. We therefore remove them from
C3, thereby saving the effort of unnecessarily obtaining their counts during the subsequent scan of D to
determine L3. Note that when given a candidate k-itemset, we only need to check if its (k-1)-subsets are
frequent since the Apriori algorithm uses a level-wise search strategy. The resulting pruned version of
C3 is shown in the first table of the bottom row of Figure 3.
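
As a concrete check of this prune step, the subset test can be run on two of the joined candidates. The set L2 below is an assumption reconstructed from the join result quoted in step F (the figure itself is not reproduced here):

from itertools import combinations

# L2 as implied by the join result in step F (reconstructed, assumed).
L2 = {frozenset(p) for p in [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
                             ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]}

def has_infrequent_subset(candidate, frequent_prev):
    """True if any (k-1)-subset of the candidate is not frequent."""
    k = len(candidate)
    return any(frozenset(s) not in frequent_prev
               for s in combinations(candidate, k - 1))

print(has_infrequent_subset(frozenset({"I1", "I2", "I3"}), L2))  # False: kept in C3
print(has_infrequent_subset(frozenset({"I2", "I3", "I5"}), L2))  # True: {I3, I5} is
                                                                 # infrequent, so pruned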

G. The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets
in C3 having minimum support (Figure 3).
H. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4. Although the join results
in {{I1, I2, I3, I5}}, this itemset is pruned because its subset {I2, I3, I5} is not frequent. Thus, C4 = ∅,
and the algorithm terminates, having found all of the frequent itemsets.
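
Putting steps A through H together, the level-wise loop can be sketched as below, reusing apriori_gen from the earlier sketch. The nine transactions are an assumption, reconstructed to be consistent with the support counts quoted in this example rather than copied from Table 1:

def apriori(transactions, min_sup):
    """Return {frequent itemset: support count} using a level-wise search."""
    # C1 / L1: count single items, keep those meeting min_sup.
    counts = {}
    for t in transactions:
        for item in t:
            fs = frozenset([item])
            counts[fs] = counts.get(fs, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_sup}
    all_frequent, k = dict(frequent), 2
    while frequent:
        candidates = apriori_gen(frequent, k)  # join + prune, as sketched above
        # Scan D to accumulate the support count of each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_sup}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# Assumed reconstruction of the nine transactions (|D| = 9), min_sup = 2.
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
L = apriori(D, min_sup=2)
print(L[frozenset({"I1", "I2", "I5"})])  # 2, as in the rule example below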

Figure 3: Generation of candidate itemsets and frequent itemsets, where the minimum support count is 2

6
Figure 4: Generation and pruning of candidate 3-itemsets, C3, from L2 using the Apriori property

Generating association rules: Suppose the data contain the frequent itemset l = {I1, I2, I5}. What are
the association rules that can be generated from l? The nonempty subsets of l are {I1, I2}, {I1, I5},
{I2, I5}, {I1}, {I2}, and {I5}. The resulting association rules are as shown below, each listed with its
confidence:
I1 ∧ I2 ⇒ I5, confidence = 2/4 = 50%
I1 ∧ I5 ⇒ I2, confidence = 2/2 = 100%
I2 ∧ I5 ⇒ I1, confidence = 2/2 = 100%
I1 ⇒ I2 ∧ I5, confidence = 2/6 = 33%
I2 ⇒ I1 ∧ I5, confidence = 2/7 = 29%
I5 ⇒ I1 ∧ I2, confidence = 2/2 = 100%
If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules above are
output, because these are the only ones generated that are strong. Note that, unlike conventional
classification rules, association rules can contain more than one conjunct in the right-hand side of the
rule.
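
This rule-generation step can be sketched as follows, reusing the counts L from the Apriori sketch above (rules_from_itemset is an illustrative helper name): every nonempty proper subset X of l yields a candidate rule X ⇒ l - X, kept only if its confidence meets the threshold.

from itertools import combinations

def rules_from_itemset(l, counts, min_conf):
    """Print the strong rules derivable from frequent itemset l;
    counts maps frozensets to support counts (e.g. the result of apriori())."""
    l = frozenset(l)
    for r in range(1, len(l)):                       # all nonempty proper subsets
        for X in map(frozenset, combinations(l, r)):
            conf = counts[l] / counts[X]             # sigma(l) / sigma(X)
            if conf >= min_conf:
                print(sorted(X), "=>", sorted(l - X), f"confidence = {conf:.0%}")

rules_from_itemset({"I1", "I2", "I5"}, L, min_conf=0.7)
# Prints the three strong rules identified above (I1 ^ I5 => I2,
# I2 ^ I5 => I1, and I5 => I1 ^ I2), each with confidence 100%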
