Academic documents
Professional documents
Cultural documents
The Data Gap: more and more data is generated (bank, telecom, and other sources), while the number of analysts grows far more slowly.
[Chart: volume of data vs. number of analysts (0 to 3,000,000), 1995-1999.]
From: R. Grossman, C. Kamath, V. Kumar, Data Mining for Scientific and Engineering Applications
Database queries vs. data mining:
Database / search: look up a phone number in a phone directory; query a Web search engine for information about Amazon.
Data mining: group customers by their credit risks (classification); identify customers with similar buying habits (clustering); find all items which are frequently purchased with milk (association rules).
[Diagram: the KDD process.]
Understanding → Raw Data → Selection & Cleaning (via the Data Warehouse) → Target Data → Transformation → Transformed Data → Mining → Patterns and Rules → Interpretation & Evaluation → Knowledge
Example application: fraud detection over a large amount of data.
[Diagram: a Web page containing two data regions; Data Region 1 and Data Region 2 each consist of a list of data records.]
Example page segment (product listings):
image 1: EN7410 17-inch LCD Monitor, Black/Dark charcoal; $299.99; Add to Cart (Delivery / Pick-Up), Penny Shopping, Compare
image 2: 17-inch LCD Monitor; $249.99; Add to Cart (Delivery / Pick-Up), Penny Shopping, Compare
image 3: AL1714 17-inch LCD Monitor, Black; $269.99; Add to Cart (Delivery / Pick-Up), Penny Shopping, Compare
image 4: SyncMaster 712n 17-inch LCD Monitor, Black; Was: $369.99, now $299.99 (Save $70 after $70 mail-in rebate(s)); Add to Cart (Delivery / Pick-Up), Penny Shopping, Compare
Reviews are valuable to the product manufacturer: market intelligence, product benchmarking.
Extracting a feature-based summary from reviews:
CS 583
Summary:
Feature 1: picture
Positive: 12
  "The pictures coming out of this camera are amazing."
  "Overall this is a good camera with really good picture clarity."
Negative: 2
  "The pictures come out hazy if your hands shake even for a moment during the entire process of taking a picture."
  "Focusing on a display rack about 20 feet away in a brightly lit room during the daytime, pictures produced by this camera were blurry and in a shade of orange."
Feature 2: battery life
[Chart: visual summary of the reviews of Digital camera 1, positive (+) vs. negative (-), over the features Picture, Battery, Zoom, Size, Weight; and a comparison of the reviews of Digital camera 1 and Digital camera 2.]
Main data mining methods:
Clustering
Classification
Association Rules
Other methods: outlier detection, sequential patterns, prediction, trends and analysis of changes, methods for special data types (e.g., spatial data mining, web mining)
Output: association rules, e.g.:
Antecedent: bread and butter
Consequent: milk
Confidence factor: 90%

Example rules:
{bread} → {butter, cheese}
{onion, tomato} → {salt}
Support of an itemset P: s_D(P) = the fraction of transactions in D that contain P, where |D| is the cardinality of D.
Confidence of a rule P → Q:
c_D(P → Q) = s_D(P ∪ Q) / s_D(P)
i.e., among the transactions that contain P, the percentage that contain both P and Q.
Thresholds:
Frequent itemset P: the support of P is no less than the minimum support.
Strong rule P → Q (c%): (P ∪ Q) is frequent, and c is larger than the minimum confidence.
Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

Input:
A database of transactions. Each transaction is a list of items (e.g., items purchased by a customer in a visit).
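The support and confidence definitions above can be sketched in a few lines of Python. This is a minimal illustration using the toy database from the table; the function names are my own, not from any library.

```python
# Minimal sketch of itemset support and rule confidence.
# The toy database mirrors the four-transaction table above.

def support(db, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(db, antecedent, consequent):
    """For the rule P -> Q: support(P union Q) / support(P)."""
    return support(db, set(antecedent) | set(consequent)) / support(db, antecedent)

db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

print(support(db, {"A"}))            # 0.75
print(support(db, {"A", "C"}))       # 0.5
print(confidence(db, {"A"}, {"C"}))  # 2/3: of the 3 transactions with A, 2 have C
```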
The Apriori property: any subset of a frequent itemset must itself be a frequent itemset; i.e., if {A, B} is a frequent itemset, both {A} and {B} should be frequent itemsets.
TID    List of Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Consider a database, D, consisting of 9 transactions.
Suppose the minimum support count required is 2 (i.e., min_sup = 2/9 ≈ 22%).
Let the minimum confidence required be 70%.
We first have to find the frequent itemsets using the Apriori algorithm.
Then, association rules will be generated using min. support and min. confidence.
In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. Scan D for the count of each candidate:

C1:
Itemset   Sup. Count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

Compare each candidate support count with the minimum support count. The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support; here all five candidates qualify, so L1 = C1.
Generate the C2 candidates from L1, then scan D for the count of each candidate:

C2:
Itemset     Sup. Count
{I1, I2}    4
{I1, I3}    4
{I1, I4}    1
{I1, I5}    2
{I2, I3}    4
{I2, I4}    2
{I2, I5}    2
{I3, I4}    0
{I3, I5}    1
{I4, I5}    0

Compare the candidate support counts with the minimum support count:

L2:
Itemset     Sup. Count
{I1, I2}    4
{I1, I3}    4
{I1, I5}    2
{I2, I3}    4
{I2, I4}    2
{I2, I5}    2
Generate the C3 candidates from L2. The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori property.
To find C3, we compute L2 Join L2:
C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
Pruning with the Apriori property (a candidate survives only if every 2-subset of it is in L2) leaves two candidates. Scan D for their counts and compare with the minimum support count:

C3 (after pruning) and L3:
Itemset         Sup. Count
{I1, I2, I3}    2
{I1, I2, I5}    2

Both meet the minimum support count, so L3 = {{I1, I2, I3}, {I1, I2, I5}}.
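The level-wise search just traced can be sketched as a short Python program. This is a compact illustration, not the textbook pseudocode: it runs on the 9-transaction database D from the example with a minimum support count of 2, and the variable names are my own.

```python
from itertools import combinations

# Compact sketch of the Apriori level-wise search on the example database D.
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUP = 2  # minimum support count

def count(itemset):
    """Number of transactions containing the itemset."""
    return sum(itemset <= t for t in D)

# L1: frequent 1-itemsets.
items = sorted({i for t in D for i in t})
L = [{frozenset([i]) for i in items if count({i}) >= MIN_SUP}]

# Level-wise: join L(k-1) with itself, prune by the Apriori property, count.
k = 2
while L[-1]:
    candidates = {a | b for a in L[-1] for b in L[-1] if len(a | b) == k}
    # Prune: every (k-1)-subset of a candidate must already be frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in L[-1] for s in combinations(c, k - 1))}
    L.append({c for c in candidates if count(c) >= MIN_SUP})
    k += 1

L3 = L[2]  # frequent 3-itemsets
print(sorted(sorted(s) for s in L3))  # [['I1', 'I2', 'I3'], ['I1', 'I2', 'I5']]
```

The result matches the hand computation above: L3 contains exactly {I1, I2, I3} and {I1, I2, I5}.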
Generating association rules: for each frequent itemset I, generate all nonempty subsets of I.
For every nonempty subset s of I, output the rule s → (I - s) if support_count(I) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.
In our example:
We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
Let us take I = {I1, I2, I5}.
All its nonempty proper subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
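The rule-generation step for I = {I1, I2, I5} can be sketched as follows. The support counts are the ones found in the worked example, and min_conf = 70%; the variable names are illustrative.

```python
from itertools import combinations

# Sketch: generate strong rules from the frequent itemset {I1, I2, I5}.
sup = {  # support counts taken from the worked example
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2,
}
I = frozenset({"I1", "I2", "I5"})
MIN_CONF = 0.7

strong = []
for r in range(1, len(I)):                 # all nonempty proper subsets of I
    for s in combinations(sorted(I), r):
        s = frozenset(s)
        conf = sup[I] / sup[s]             # support_count(I) / support_count(s)
        if conf >= MIN_CONF:
            strong.append((sorted(s), sorted(I - s), conf))

for ante, cons, conf in strong:
    print(f"{ante} -> {cons}  (conf = {conf:.0%})")
```

Only three rules survive the 70% threshold, each with 100% confidence: {I5} → {I1, I2}, {I1, I5} → {I2}, and {I2, I5} → {I1}; rules such as {I1} → {I2, I5} (2/6 ≈ 33%) are rejected.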
From: http://www-users.cs.umn.edu/~kumar/dmbook/index.php
[Figure: the classification task.]
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class label).

Training Set:
Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Test Set:
Refund   Marital Status   Taxable Income   Cheat
No       Single           75K              ?
Yes      Married          50K              ?
No       Married          150K             ?
Yes      Divorced         90K              ?
No       Single           40K              ?
No       Married          80K              ?

The Training Set is fed to a learning algorithm (Learn Classifier) to induce a Model; the Model is then applied to the Test Set.
From: http://www-users.cs.umn.edu/~kumar/dmbook/index.php
A decision tree for the training data above (splitting attributes: Refund, MarSt, TaxInc):

Refund?
  Yes → NO
  No  → MarSt?
          Married → NO
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES

(The training data table is the same as above.)
Another decision tree that fits the same training data (the tree is not unique):

MarSt?
  Married → NO
  Single, Divorced → Refund?
                       Yes → NO
                       No  → TaxInc?
                               < 80K → NO
                               > 80K → YES
Applying the model to Test Data:

Refund   Marital Status   Taxable Income   Cheat
No       Married          80K              ?

Start from the root of the tree: Refund = No → follow the MarSt branch; MarSt = Married → assign Cheat = No.
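The traversal above can be sketched with a hard-coded version of the first tree. The dictionary keys are illustrative, and treating an income of exactly 80K as the "< 80K" branch is my assumption (the slide splits strictly at 80K).

```python
# Sketch: applying the decision tree (Refund -> MarSt -> TaxInc) to a record.

def classify(record):
    """Walk the tree top-down and return the predicted Cheat label."""
    if record["Refund"] == "Yes":
        return "No"                     # leaf: NO
    if record["MarSt"] == "Married":
        return "No"                     # leaf: NO
    # Single or Divorced: test Taxable Income (in thousands).
    # Assumption: exactly 80K falls in the "< 80K -> NO" branch.
    return "Yes" if record["TaxInc"] > 80 else "No"

test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(test_record))  # No
```

The test record never reaches the TaxInc test: the Married branch already yields NO, exactly as in the walkthrough above.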
Direct Marketing
Goal: Reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
Collect various demographic, lifestyle, and company-interaction related information about all such customers: type of business, where they stay, how much they earn, etc.
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and the information about the account holder as attributes: when the customer buys, what he buys, how often he pays on time, etc.
Customer Attrition/Churn
Goal: Predict whether a customer is likely to be lost to a competitor.
Approach:
Use detailed records of transactions with each of the past and present customers to find attributes: how often the customer calls, where he calls, what time of day he calls most, his financial status, marital status, etc.
A good clustering keeps intra-cluster similarity high and inter-cluster similarity low.
Partitioning algorithms: construct various partitions and then evaluate them by some criterion.
Hierarchy algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion. There is an agglomerative approach and a divisive approach.
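The partitioning approach can be illustrated with a minimal 1-D k-means sketch. k-means is just one example of a partitioning algorithm, and the point values, seed, and function name here are illustrative assumptions, not from the slides.

```python
import random

# Minimal 1-D k-means sketch: a partitioning method that alternates an
# assignment step and a centroid-update step to reduce within-cluster
# squared error (the evaluation criterion for this partition).

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[i].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.7]
print(kmeans(points, k=2))  # two centroids, near 1.0 and 8.0
```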
Agglomerative clustering with the single-link (minimum) distance. Start with five points and the distance matrix:

     1    2    3    4    5
1    0
2    2    0
3    6    3    0
4   10    9    7    0
5    9    8    5    4    0

Merge the closest pair (1,2) at distance 2; recompute distances as the minimum over cluster members:

        (1,2)   3    4    5
(1,2)     0
3         3     0
4         9     7    0
5         8     5    4    0

Merge (1,2) with 3 at distance 3:

          (1,2,3)   4    5
(1,2,3)      0
4            7      0
5            5      4    0

Then merge (4,5) at distance 4, and finally all points at distance 5.
[Dendrogram: merges at heights 2, 3, 4, 5.]
Complete-link (maximum) distance: + tight clusters, - slow. Same starting matrix:

     1    2    3    4    5
1    0
2    2    0
3    6    3    0
4   10    9    7    0
5    9    8    5    4    0

Merge (1,2) at distance 2; recompute distances as the maximum over cluster members:

        (1,2)   3    4    5
(1,2)     0
3         6     0
4        10     7    0
5         9     5    4    0

Merge (4,5) at distance 4:

        (1,2)   3   (4,5)
(1,2)     0
3         6     0
(4,5)    10     7    0

Then merge (1,2) with 3 at distance 6, and finally all points at distance 10.
[Dendrogram: merges at heights 2, 4, 6, 10.]
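The single-link walkthrough above can be sketched in Python. The cluster representation and variable names are illustrative; the distance matrix is the one from the example.

```python
# Sketch of single-link agglomerative clustering on the 5-point example,
# recording each merge and the distance at which it happens.

dist = {  # the distance matrix from the example, upper triangle
    (1, 2): 2, (1, 3): 6, (1, 4): 10, (1, 5): 9,
    (2, 3): 3, (2, 4): 9, (2, 5): 8,
    (3, 4): 7, (3, 5): 5,
    (4, 5): 4,
}

def d(a, b):
    """Single-link distance between clusters: minimum over point pairs."""
    return min(dist[tuple(sorted((p, q)))] for p in a for q in b)

clusters = [frozenset([i]) for i in range(1, 6)]
merges = []
while len(clusters) > 1:
    # Find the closest pair of clusters and merge them.
    a, b = min(((a, b) for i, a in enumerate(clusters)
                for b in clusters[i + 1:]), key=lambda ab: d(*ab))
    merges.append((sorted(a | b), d(a, b)))
    clusters = [c for c in clusters if c not in (a, b)] + [a | b]

for members, h in merges:
    print(members, "merged at distance", h)
```

Running this reproduces the merge sequence traced by hand: (1,2) at 2, then 3 joins at 3, then (4,5) at 4, then everything at 5.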