
Data Mining

More data is generated:
- bank, telecom, and other business transactions ...
- scientific data: astronomy, biology, etc.
- Web, text, and e-commerce data

More data is captured:
- storage technology is faster and cheaper
- DBMSs are capable of handling bigger databases

- We have large data stored in one or more databases.
- We strive to find new information within those data (for research use, competitive edge, etc.).
- We want to identify patterns or rules (trends and relationships) in those data.
- We know that certain data exist inside a database, but what are the consequences of that data's existence?

There is often information hidden in the data that is not readily evident:
- Human analysts may take weeks to discover useful information.
- Much of the data is never analyzed at all.
[Figure: The Data Gap - total new disk (TB) since 1995 (scale up to 4,000,000 TB) grows rapidly from 1995 to 1999, while the number of analysts stays nearly flat.]
From: R. Grossman, C. Kamath, V. Kumar, Data Mining for Scientific and Engineering Applications

Data mining is "a process of extracting previously unknown, valid and actionable information from large databases and then using the information to make crucial business decisions" (Cabena et al., 1998).

Data mining: discovering interesting patterns from large amounts of data (Han and Kamber, 2001).

Definition from [Connolly, 2005]:
"The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions."

The thin red line of data mining: it is all about finding patterns or rules by extracting data from large databases in order to find new information that could lead to new knowledge.

What is not Data Mining?
- Looking up a phone number in a phone directory.
- Querying a Web search engine for information about "Amazon".

What is Data Mining?
- Discovering that certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in the Boston area).
- Grouping together similar documents returned by a search engine according to their context (e.g. "Amazon rainforest", "Amazon.com", ...).

Database queries:
- Find all credit applicants with last name of Smith.
- Identify customers who have purchased more than $10,000 in the last month.
- Find all customers who have purchased milk.

Data mining:
- Find all credit applicants who are poor credit risks. (classification)
- Identify customers with similar buying habits. (clustering)
- Find all items which are frequently purchased with milk. (association rules)
(Prentice Hall)

Data Mining in the Knowledge Discovery Process

[Figure: The knowledge discovery process - raw data is integrated into a data warehouse; selection and cleaning produce target data; transformation produces transformed data; data mining extracts patterns and rules; interpretation and evaluation (understanding) turn them into knowledge.]

Typical applications:
- Marketing: customer profiling and retention, identifying potential customers, market segmentation.
- Fraud detection: identifying credit card fraud, intrusion detection.
- Scientific data analysis.
- Text and web mining.
- Any application that involves a large amount of data.

Example: extracting structured data from a Web page.

[Figure: A product listing page segment with two data regions marked; each data region consists of data records such as the following.]

    image 1 | EN7410 17-inch LCD Monitor, Black/Dark charcoal | $299.99 | Add to Cart (Delivery/Pick-Up) | Penny Shopping | Compare
    image 2 | 17-inch LCD Monitor | $249.99 | Add to Cart (Delivery/Pick-Up) | Penny Shopping | Compare
    image 3 | AL1714 17-inch LCD Monitor, Black | $269.99 | Add to Cart (Delivery/Pick-Up) | Penny Shopping | Compare
    image 4 | SyncMaster 712n 17-inch LCD Monitor, Black | Was: $369.99, now $299.99 (save $70 after $70 mail-in rebate(s)) | Add to Cart (Delivery/Pick-Up) | Penny Shopping | Compare

Word-of-mouth on the Web
- The Web has dramatically changed the way that consumers express their opinions.
- One can post reviews of products at merchant sites, Web forums, discussion groups, and blogs.
- Techniques are being developed to exploit these sources.

Benefits of review analysis:
- Potential customer: no need to read many reviews.
- Product manufacturer: market intelligence, product benchmarking.

Feature-based opinion summarization involves three steps:
1. Extracting the product features (called opinion features) that have been commented on by customers.
2. Identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative.
3. Summarizing and comparing the results (a toy sketch of this step follows below).
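For illustration only, here is a toy Python sketch of the summarization step (step 3); the tagged sentences are assumed to be the output of steps 1 and 2, the battery sentence is invented for the example, and all names are ours rather than from the slides:

    from collections import defaultdict

    # Hypothetical output of steps 1 and 2: each opinion sentence is tagged with
    # the feature it talks about and a polarity label.
    opinion_sentences = [
        ("picture", "positive", "The pictures coming out of this camera are amazing."),
        ("picture", "negative", "The pictures come out hazy if your hands shake."),
        ("battery life", "positive", "The battery easily lasts a whole day."),  # invented example
    ]

    # Step 3: aggregate the sentences per feature to build the feature-based summary.
    summary = defaultdict(lambda: {"positive": [], "negative": []})
    for feature, polarity, sentence in opinion_sentences:
        summary[feature][polarity].append(sentence)

    for feature, groups in summary.items():
        print(f"Feature: {feature}")
        for polarity in ("positive", "negative"):
            print(f"  {polarity}: {len(groups[polarity])}")
            for s in groups[polarity]:
                print(f"    - {s}")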

Example review: "GREAT Camera.", Jun 3, 2004
Reviewer: jprice174 from Atlanta, Ga.

    "I did a lot of research last year before I bought this camera... It kinda hurt to leave behind my beloved nikon 35mm SLR, but I was going to Italy, and I needed something smaller, and digital. The pictures coming out of this camera are amazing. The 'auto' feature takes great pictures most of the time. And with digital, you're not wasting film if the picture doesn't come out."

Feature-based summary:

Feature 1: picture
    Positive: 12
      - "The pictures coming out of this camera are amazing."
      - "Overall this is a good camera with a really good picture clarity."
    Negative: 2
      - "The pictures come out hazy if your hands shake even for a moment during the entire process of taking a picture."
      - "Focusing on a display rack about 20 feet away in a brightly lit room during day time, pictures produced by this camera were blurry and in a shade of orange."
Feature 2: battery life
...

[Figure: A visual feature-based summary of the reviews of Digital camera 1 - positive/negative bars for the features Picture, Battery, Zoom, Size, and Weight - and a side-by-side comparison of the reviews of Digital camera 1 and Digital camera 2.]

Data mining methods:
- Clustering
- Classification
- Association rules
- Other methods:
  - Outlier detection
  - Sequential patterns
  - Prediction
  - Trends and analysis of changes
  - Methods for special data types, e.g., spatial data mining, web mining

Association rules try to find associations between items in a set of transactions.
For example, for associations between items bought by customers in a supermarket:

    90% of transactions that purchase bread and butter also purchase milk

Output:
    Antecedent: bread and butter
    Consequent: milk
    Confidence factor: 90%

A transaction is a set of items: T = {i_a, i_b, ..., i_t}.
- T ⊆ I, where I is the set of all possible items {i_1, i_2, ..., i_n}.
- D, the task-relevant data, is a set of transactions (a database of transactions).

Example:
- Items sold by a supermarket (the itemset I): {sugar, parsley, onion, tomato, salt, bread, olives, cheese, butter}
- A transaction by a customer (T): T1 = {sugar, onion, salt}
- A database (D): {T1 = {salt, bread, olives}, T2 = {sugar, onion, salt}, T3 = {bread}, T4 = {cheese, butter}, T5 = {tomato}, ...}

An association rule has the form:

    P ⇒ Q, where P ⊂ I, Q ⊂ I, and P ∩ Q = ∅

Examples:
    {bread} ⇒ {butter, cheese}
    {onion, tomato} ⇒ {salt}

Support of a rule P ⇒ Q is the support of (P ∪ Q) in D:
    s_D(P ⇒ Q) = s_D(P ∪ Q)
    = the percentage of transactions in D containing both P and Q
    = (# transactions containing P and Q) / (cardinality of D)

Confidence of a rule P ⇒ Q:
    c_D(P ⇒ Q) = s_D(P ∪ Q) / s_D(P)
    = the percentage of transactions that contain both P and Q among the transactions that already contain P
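To make these definitions concrete, here is a minimal Python sketch (the toy transactions and helper names are ours, not from the slides) that computes support and confidence by direct counting:

    # A toy transaction database D; each transaction is a set of items.
    D = [
        {"salt", "bread", "olives"},
        {"sugar", "onion", "salt"},
        {"bread"},
        {"cheese", "butter"},
        {"tomato"},
    ]

    def support(itemset, transactions):
        """s_D(X): fraction of transactions that contain every item of `itemset`."""
        itemset = set(itemset)
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def confidence(antecedent, consequent, transactions):
        """c_D(P => Q) = s_D(P ∪ Q) / s_D(P)."""
        return (support(set(antecedent) | set(consequent), transactions)
                / support(antecedent, transactions))

    print(support({"salt"}, D))                # 2 of 5 transactions -> 0.4
    print(confidence({"salt"}, {"bread"}, D))  # 0.2 / 0.4 -> 0.5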

Thresholds:
- minimum support: minsup
- minimum confidence: minconf

Frequent itemset P: the support of P is larger than the minimum support.

Strong rule P ⇒ Q (c%): (P ∪ Q) is frequent and c is larger than the minimum confidence.

Example (min. support 50%, min. confidence 50%):

    Transaction ID   Items Bought
    2000             A, B, C
    1000             A, C
    4000             A, D
    5000             B, E, F

Frequent itemsets and their support:

    {A}      75%
    {B}      50%
    {C}      50%
    {A, C}   50%

For rule {A} ⇒ {C}:
    support = support({A, C}) = 50%
    confidence = support({A, C}) / support({A}) = 66.6%

For rule {C} ⇒ {A}:
    support = support({A, C}) = 50%
    confidence = support({A, C}) / support({C}) = 100.0%
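As a quick check, the numbers above can be reproduced with a few lines of Python (the helper name `sup` is ours):

    # Reproducing the example above.
    D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

    def sup(items):
        """Support of an itemset as a fraction of the transactions in D."""
        return sum(set(items) <= t for t in D) / len(D)

    print(sup({"A"}), sup({"B"}), sup({"C"}), sup({"A", "C"}))  # 0.75 0.5 0.5 0.5
    print(sup({"A", "C"}) / sup({"A"}))  # 0.666... = confidence of {A} => {C}
    print(sup({"A", "C"}) / sup({"C"}))  # 1.0      = confidence of {C} => {A}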

Association rule mining:
- Input: a database of transactions, where each transaction is a list of items (e.g., items purchased by a customer in a visit).
- Goal: find all strong rules that associate the presence of one set of items with that of another set of items.
  - Example: 98% of people who purchase tires and auto accessories also get automotive services done.
  - There are no restrictions on the number of items in the head or body of the rule.
- The most famous algorithm is Apriori.

The Apriori approach:
- Find the frequent itemsets: the sets of items that have minimum support.
  - A subset of a frequent itemset must also be a frequent itemset; i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets (the Apriori property).
  - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
- Use the frequent itemsets to generate association rules (a code sketch of the whole procedure follows below).
Source: [Sunysb, 2009]
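Below is a minimal Python sketch of this procedure (the function and variable names are ours, and min_sup is taken as an absolute count); it is meant to make the join-and-prune loop concrete rather than to be an efficient implementation:

    from itertools import combinations

    def apriori(transactions, min_sup):
        """Return {frozenset(itemset): support_count} for all frequent itemsets.

        `transactions` is a list of sets; `min_sup` is an absolute support count.
        """
        # C1/L1: count the single items and keep those meeting min_sup
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        current = {s: c for s, c in counts.items() if c >= min_sup}
        frequent = dict(current)

        k = 2
        while current:
            # Join step: combine frequent (k-1)-itemsets into k-item candidates,
            # pruning any candidate that has an infrequent (k-1)-subset.
            prev = list(current)
            candidates = set()
            for i in range(len(prev)):
                for j in range(i + 1, len(prev)):
                    union = prev[i] | prev[j]
                    if len(union) == k and all(
                            frozenset(sub) in current
                            for sub in combinations(union, k - 1)):
                        candidates.add(union)
            # Scan the transactions once to count the support of each candidate
            counts = {c: 0 for c in candidates}
            for t in transactions:
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            current = {s: c for s, c in counts.items() if c >= min_sup}
            frequent.update(current)
            k += 1
        return frequent

Calling `apriori` with min_sup = 2 on the nine-transaction database of the next slide reproduces the L1, L2 and L3 itemsets that are derived step by step below.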

TID    List of Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

- Consider a database, D, consisting of these 9 transactions.
- Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 = 22%).
- Let the minimum confidence required be 70%.
- We first find the frequent itemsets using the Apriori algorithm.
- Then, association rules are generated using the minimum support and minimum confidence.

Step 1: Generating 1-itemset frequent patterns

Scan D for the count of each candidate 1-itemset (C1), then compare each candidate's support count with the minimum support count to obtain L1:

    Itemset   Sup. Count
    {I1}      6
    {I2}      7
    {I3}      6
    {I4}      2
    {I5}      2

- The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support; here every candidate qualifies, so L1 = C1.
- In the first iteration of the algorithm, each item is a member of the set of candidates, C1.

Step 2: Generating 2-itemset frequent patterns

Generate candidate 2-itemsets C2 from L1 (L1 join L1), scan D for the count of each candidate, and compare the candidate support counts with the minimum support count:

    C2 (candidates with support counts after scanning D):
      {I1, I2}: 4   {I1, I3}: 4   {I1, I4}: 1   {I1, I5}: 2   {I2, I3}: 4
      {I2, I4}: 2   {I2, I5}: 2   {I3, I4}: 0   {I3, I5}: 1   {I4, I5}: 0

    L2 (candidates meeting the minimum support count of 2):
      {I1, I2}: 4   {I1, I3}: 4   {I1, I5}: 2   {I2, I3}: 4   {I2, I4}: 2   {I2, I5}: 2

- To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 join L1 to generate a candidate set of 2-itemsets, C2.
- Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is accumulated (as shown above).
- The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
- Note: we haven't used the Apriori property yet.


Step 3: Generating 3-itemset frequent patterns

Generate candidate 3-itemsets C3 from L2 (L2 join L2), prune them using the Apriori property, scan D for the count of each remaining candidate, and compare with the minimum support count:

    C3 (after pruning): {I1, I2, I3}: 2   {I1, I2, I5}: 2
    L3:                 {I1, I2, I3}: 2   {I1, I2, I5}: 2

- The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori property.
- In order to find C3, we compute L2 join L2:
  C3 = L2 join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.

- Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the latter four candidates cannot possibly be frequent.
- For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} and {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we keep {I1, I2, I3} in C3.
- Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. Its 2-item subsets are {I2, I3}, {I2, I5} and {I3, I5}. But {I3, I5} is not a member of L2, so {I2, I3, I5} cannot be frequent; it violates the Apriori property and we have to remove it from C3.
- Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the join result for pruning.
- Now the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
A possible pruning-check helper is sketched in code below.
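The sketch (the function name and the literal L2 are ours, with the itemsets taken from the worked example):

    from itertools import combinations

    def has_infrequent_subset(candidate, prev_frequent):
        """Apriori pruning test: a k-itemset can only be frequent if every
        (k-1)-subset of it is already among the frequent (k-1)-itemsets."""
        k = len(candidate)
        return any(frozenset(sub) not in prev_frequent
                   for sub in combinations(candidate, k - 1))

    L2 = {frozenset(s) for s in ({"I1", "I2"}, {"I1", "I3"}, {"I1", "I5"},
                                 {"I2", "I3"}, {"I2", "I4"}, {"I2", "I5"})}
    print(has_infrequent_subset({"I1", "I2", "I3"}, L2))  # False -> kept in C3
    print(has_infrequent_subset({"I2", "I3", "I5"}, L2))  # True  -> pruned, {I3, I5} not in L2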

Step 4: Generating 4-itemset frequent patterns
- The algorithm uses L3 join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {I2, I3, I5} is not frequent.
- Thus C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets. This completes the Apriori algorithm.
- What's next? These frequent itemsets will be used to generate strong association rules (where strong association rules satisfy both minimum support and minimum confidence).

Step 5: Generating association rules from frequent itemsets
Procedure:
- For each frequent itemset I, generate all nonempty subsets of I.
- For every nonempty subset s of I, output the rule s ⇒ (I − s) if support_count(I) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.
(A code sketch of this step appears after the worked example below.)

In our example:
- We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
- Let's take I = {I1, I2, I5}.
- Its nonempty subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
- Let the minimum confidence threshold be, say, 70%.
- The resulting association rules are shown below, each listed with its confidence.

R1: {I1, I2} ⇒ {I5}
    Confidence = sc{I1,I2,I5} / sc{I1,I2} = 2/4 = 50%. R1 is rejected.
R2: {I1, I5} ⇒ {I2}
    Confidence = sc{I1,I2,I5} / sc{I1,I5} = 2/2 = 100%. R2 is selected.
R3: {I2, I5} ⇒ {I1}
    Confidence = sc{I1,I2,I5} / sc{I2,I5} = 2/2 = 100%. R3 is selected.
R4: {I1} ⇒ {I2, I5}
    Confidence = sc{I1,I2,I5} / sc{I1} = 2/6 = 33%. R4 is rejected.
R5: {I2} ⇒ {I1, I5}
    Confidence = sc{I1,I2,I5} / sc{I2} = 2/7 = 29%. R5 is rejected.
R6: {I5} ⇒ {I1, I2}
    Confidence = sc{I1,I2,I5} / sc{I5} = 2/2 = 100%. R6 is selected.

In this way, we have found three strong association rules.
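For illustration, the rule-generation step can be sketched in Python as follows, reusing the support counts found in the worked example above (the function and variable names are ours):

    from itertools import chain, combinations

    # Support counts of the frequent itemsets found above.
    support_count = {
        frozenset(s): c for s, c in [
            ({"I1"}, 6), ({"I2"}, 7), ({"I3"}, 6), ({"I4"}, 2), ({"I5"}, 2),
            ({"I1", "I2"}, 4), ({"I1", "I3"}, 4), ({"I1", "I5"}, 2),
            ({"I2", "I3"}, 4), ({"I2", "I4"}, 2), ({"I2", "I5"}, 2),
            ({"I1", "I2", "I3"}, 2), ({"I1", "I2", "I5"}, 2),
        ]
    }

    def rules_from(itemset, min_conf):
        """Yield (antecedent, consequent, confidence) for the rules s => itemset - s."""
        itemset = frozenset(itemset)
        # All nonempty proper subsets of the itemset
        subsets = chain.from_iterable(
            combinations(itemset, r) for r in range(1, len(itemset)))
        for s in subsets:
            s = frozenset(s)
            conf = support_count[itemset] / support_count[s]
            if conf >= min_conf:
                yield set(s), set(itemset - s), conf

    for ante, cons, conf in rules_from({"I1", "I2", "I5"}, min_conf=0.7):
        print(ante, "=>", cons, f"confidence = {conf:.0%}")
    # Prints the three selected rules R2, R3 and R6.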

Classification

Learn a method for predicting the instance class from pre-labeled (classified) instances.
Many approaches: statistics, decision trees, neural networks, ...

Workflow:
- Prepare a collection of records (the training set); each record contains a set of attributes, one of which is the class.
- Find a model for the class attribute as a function of the values of the other attributes (decision tree, neural network, etc.).
- Prepare a test set to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
- Once happy with the accuracy, use the model to classify new instances (see the code sketch below).

From: http://www-users.cs.umn.edu/~kumar/dmbook/index.php
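A minimal sketch of this train/test workflow, assuming scikit-learn is available (the numeric encoding of the attributes and all parameter choices are our own, not part of the slides); it uses the 10-record "cheat" data set shown on the next slide:

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Attributes: Refund (1=Yes, 0=No), Marital Status (0=Single, 1=Married, 2=Divorced),
    # Taxable Income (in thousands). Class: Cheat (1=Yes, 0=No).
    X = [
        [1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
        [0, 1, 60],  [1, 2, 220], [0, 0, 85], [0, 1, 75],  [0, 0, 90],
    ]
    y = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]

    # Split into training and test sets, fit a decision tree, measure accuracy.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

    # Classify a new, unseen instance: Refund=No, Married, Taxable Income=80K.
    print("prediction:", model.predict([[0, 1, 80]]))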

[Figure: Illustrating the classification task - a labeled training set is fed to a learning algorithm to induce a model ("learn classifier"); the model is then applied to a test set of unlabeled records.]

Training set (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class):

    Tid   Refund   Marital Status   Taxable Income   Cheat
    1     Yes      Single           125K             No
    2     No       Married          100K             No
    3     No       Single           70K              No
    4     Yes      Married          120K             No
    5     No       Divorced         95K              Yes
    6     No       Married          60K              No
    7     Yes      Divorced         220K             No
    8     No       Single           85K              Yes
    9     No       Married          75K              No
    10    No       Single           90K              Yes

Test set (class unknown): records such as (No, Single, 75K), (Yes, Married, 50K), (No, Married, 150K), (Yes, Divorced, 90K), (No, Single, 40K), (No, Married, 80K).

From: http://www-users.cs.umn.edu/~kumar/dmbook/index.php

Training data: the same 10-record training set shown above (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class).

Model: Decision Tree (splitting attributes Refund, MarSt, TaxInc)

    Refund?
      Yes -> NO
      No  -> MarSt?
               Married          -> NO
               Single, Divorced -> TaxInc?
                                     < 80K -> NO
                                     > 80K -> YES

The same training data also fits a different decision tree:

    MarSt?
      Married          -> NO
      Single, Divorced -> Refund?
                            Yes -> NO
                            No  -> TaxInc?
                                     < 80K -> NO
                                     > 80K -> YES

There could be more than one tree that fits the same data!

Applying the model to test data: start from the root of the tree and follow the branch that matches the record's attribute values.

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Following Refund = No and then MarSt = Married leads to the leaf NO, so the record is assigned Cheat = No.

Direct Marketing
- Goal: reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
- Approach:
  - Use the data for a similar product introduced before. We know which customers decided to buy and which decided otherwise; this {buy, don't buy} decision forms the class attribute.
  - Collect various demographic, lifestyle, and company-interaction related information about all such customers: type of business, where they stay, how much they earn, etc.
  - Use this information as input attributes to learn a classifier model.

From [Berry & Linoff] Data Mining Techniques, 1997

Fraud Detection
- Goal: predict fraudulent cases in credit card transactions.
- Approach:
  - Use credit card transactions and the information on the account-holder as attributes: when does a customer buy, what does he buy, how often he pays on time, etc.
  - Label past transactions as fraud or fair transactions; this forms the class attribute.
  - Learn a model for the class of the transactions.
  - Use this model to detect fraud by observing credit card transactions on an account.

From: http://www-users.cs.umn.edu/~kumar/dmbook/index.php

Customer Attrition/Churn
- Goal: predict whether a customer is likely to be lost to a competitor.
- Approach:
  - Use detailed records of transactions with each of the past and present customers to find attributes: how often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.
  - Label the customers as loyal or disloyal.
  - Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997

Clustering
- Clustering is a process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.
- It helps users understand the natural grouping or structure in a data set.
- Cluster: a collection of data objects that are similar to one another and can thus be treated collectively as one group.
- Clustering is unsupervised classification: there are no predefined classes.

The goal is to find a natural grouping of instances given un-labeled data.

A good clustering method will produce high-quality clusters in which:
- the intra-class similarity (that is, within a cluster) is high;
- the inter-class similarity (that is, between clusters) is low.

- The quality of a clustering result also depends on both the similarity measure used by the method and its implementation.
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
- The quality of a clustering result also depends on the chosen definition and representation of a cluster.

Major clustering approaches:
- Partitioning algorithms: construct various partitions and then evaluate them by some criterion.
- Hierarchy algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion; there is an agglomerative approach and a divisive approach.

Partitioning method: given a number k, partition a database D of n objects into a set of k clusters so that a chosen objective function is minimized (e.g., the sum of distances to the centers of the clusters).
- Finding the global optimum by exhaustively enumerating all partitions is too expensive!
- Heuristic methods are based on iterative refinement of an initial partition (see the k-means-style sketch below).
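One such heuristic is k-means-style iterative refinement; the following is a minimal pure-Python sketch for 2-D points (all names and the toy data are ours):

    def kmeans(points, k, iterations=20):
        """Iteratively refine a partition of 2-D points into k clusters (k-means style)."""
        centers = list(points[:k])  # naive initialization: the first k points
        for _ in range(iterations):
            # Assignment step: each point joins the cluster of its nearest center
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2
                                                + (p[1] - centers[c][1]) ** 2)
                clusters[j].append(p)
            # Update step: move each center to the mean of its cluster
            for j, cl in enumerate(clusters):
                if cl:
                    centers[j] = (sum(p[0] for p in cl) / len(cl),
                                  sum(p[1] for p in cl) / len(cl))
        return centers, clusters

    pts = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 8.5), (8.5, 9)]
    centers, clusters = kmeans(pts, k=2)
    print(centers)  # roughly [(1.17, 1.17), (8.5, 8.5)] for this toy data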

Hierarchical clustering: hierarchical decomposition of the data set (with respect to a given similarity measure) into a set of nested clusters.
- The result is represented by a so-called dendrogram.
- Nodes in the dendrogram represent possible clusters.
- The dendrogram can be constructed bottom-up (agglomerative approach) or top-down (divisive approach).
- A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

Single link: cluster similarity = similarity of the two most similar members.
    + fast
    - potentially long and skinny clusters

Initial distance matrix:

         1    2    3    4    5
    1    0
    2    2    0
    3    6    3    0
    4    10   9    7    0
    5    9    8    5    4    0

Step 1: merge {1, 2} (distance 2) and update the distances with the minimum:
    d_(1,2),3 = min{d_1,3, d_2,3} = min{6, 3} = 3
    d_(1,2),4 = min{d_1,4, d_2,4} = min{10, 9} = 9
    d_(1,2),5 = min{d_1,5, d_2,5} = min{9, 8} = 8

           (1,2)  3    4    5
    (1,2)  0
    3      3      0
    4      9      7    0
    5      8      5    4    0

Step 2: merge {(1,2), 3} (distance 3):
    d_(1,2,3),4 = min{d_(1,2),4, d_3,4} = min{9, 7} = 7
    d_(1,2,3),5 = min{d_(1,2),5, d_3,5} = min{8, 5} = 5

             (1,2,3)  4    5
    (1,2,3)  0
    4        7        0
    5        5        4    0

Step 3: merge {4, 5} (distance 4), then merge the two remaining clusters:
    d_(1,2,3),(4,5) = min{d_(1,2,3),4, d_(1,2,3),5} = 5

[Dendrogram: 1 and 2 join at height 2, 3 joins them at height 3, 4 and 5 join at height 4, and the two clusters join at height 5.]

Complete link: cluster similarity = similarity of the two least similar members.
    + tight clusters
    - slow

Starting from the same initial distance matrix:

Step 1: merge {1, 2} (distance 2) and update the distances with the maximum:
    d_(1,2),3 = max{d_1,3, d_2,3} = max{6, 3} = 6
    d_(1,2),4 = max{d_1,4, d_2,4} = max{10, 9} = 10
    d_(1,2),5 = max{d_1,5, d_2,5} = max{9, 8} = 9

           (1,2)  3    4    5
    (1,2)  0
    3      6      0
    4      10     7    0
    5      9      5    4    0

Step 2: merge {4, 5} (distance 4):
    d_(1,2),(4,5) = max{d_(1,2),4, d_(1,2),5} = max{10, 9} = 10
    d_3,(4,5) = max{d_3,4, d_3,5} = max{7, 5} = 7

           (1,2)  3    (4,5)
    (1,2)  0
    3      6      0
    (4,5)  10     7    0

Step 3: merge {(1,2), 3} (distance 6), then merge the two remaining clusters:
    d_(1,2,3),(4,5) = max{d_(1,2),(4,5), d_3,(4,5)} = 10

[Dendrogram: 1 and 2 join at height 2, 4 and 5 join at height 4, 3 joins (1,2) at height 6, and the two clusters join at height 10.]

Dendrogram: Hierarchical Clustering
- A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
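Assuming SciPy is available, the single-link and complete-link examples above can be reproduced as follows (variable names are ours; `dist` is the condensed form of the 5x5 distance matrix used in the example):

    from scipy.cluster.hierarchy import linkage, fcluster

    # Pairwise distances d(1,2)=2, d(1,3)=6, d(1,4)=10, d(1,5)=9, d(2,3)=3,
    # d(2,4)=9, d(2,5)=8, d(3,4)=7, d(3,5)=5, d(4,5)=4 in condensed order.
    dist = [2, 6, 10, 9, 3, 9, 8, 7, 5, 4]

    single = linkage(dist, method="single")      # merge heights 2, 3, 4, 5
    complete = linkage(dist, method="complete")  # merge heights 2, 4, 6, 10
    print(single[:, 2], complete[:, 2])          # distances at which clusters merge

    # Cutting the single-link dendrogram at height 4.5 yields two clusters,
    # {1, 2, 3} and {4, 5}.
    print(fcluster(single, t=4.5, criterion="distance"))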
