
Data Mining

More data is generated:
- bank, telecom, and other business transactions ...
- scientific data: astronomy, biology, etc.
- Web, text, and e-commerce data

More data is captured:
- storage technology is faster and cheaper
- DBMSs are capable of handling bigger databases

- We have large data stored in one or more databases.
- We strive to find new information within those data (for research use, competitive edge, etc.).
- We want to identify patterns or rules (trends and relationships) in those data.
- We know that certain data exist inside a database, but what are the consequences of that data's existence?

There is often information hidden in the data that is not readily evident:
- Human analysts may take weeks to discover useful information.
- Much of the data is never analyzed at all.
[Figure: The Data Gap - total new disk (TB) since 1995 (scale up to 4,000,000 TB) grows rapidly from 1995 to 1999, while the number of analysts stays nearly flat.]
From: R. Grossman, C. Kamath, V. Kumar, Data Mining for Scientific and Engineering Applications

Data mining is "a process of extracting previously unknown, valid and actionable information from large databases and then using the information to make crucial business decisions" (Cabena et al., 1998).

Data mining: discovering interesting patterns from large amounts of data (Han and Kamber, 2001).

Definition from [Connolly, 2005]:
"The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions."

The thin red line of data mining: it is all about finding patterns or rules by extracting data from large databases in order to find new information that could lead to new knowledge.

What is not Data Mining?
- Looking up a phone number in a phone directory.
- Querying a Web search engine for information about "Amazon".

What is Data Mining?
- Discovering that certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in the Boston area).
- Grouping together similar documents returned by a search engine according to their context (e.g. "Amazon rainforest", "Amazon.com", ...).

Database queries:
- Find all credit applicants with last name of Smith.
- Identify customers who have purchased more than $10,000 in the last month.
- Find all customers who have purchased milk.

Data mining:
- Find all credit applicants who are poor credit risks. (classification)
- Identify customers with similar buying habits. (clustering)
- Find all items which are frequently purchased with milk. (association rules)
(Prentice Hall)

Data Mining in the Knowledge Discovery Process

[Figure: The knowledge discovery process - raw data is integrated into a data warehouse; selection and cleaning produce target data; transformation produces transformed data; data mining extracts patterns and rules; interpretation and evaluation (understanding) turn them into knowledge.]

Typical applications:
- Marketing: customer profiling and retention, identifying potential customers, market segmentation.
- Fraud detection: identifying credit card fraud, intrusion detection.
- Scientific data analysis.
- Text and web mining.
- Any application that involves a large amount of data.

Example: extracting structured data from a Web page.

[Figure: A product listing page segment with two data regions marked; each data region consists of data records such as the following.]

    image 1 | EN7410 17-inch LCD Monitor, Black/Dark charcoal | $299.99 | Add to Cart (Delivery/Pick-Up) | Penny Shopping | Compare
    image 2 | 17-inch LCD Monitor | $249.99 | Add to Cart (Delivery/Pick-Up) | Penny Shopping | Compare
    image 3 | AL1714 17-inch LCD Monitor, Black | $269.99 | Add to Cart (Delivery/Pick-Up) | Penny Shopping | Compare
    image 4 | SyncMaster 712n 17-inch LCD Monitor, Black | Was: $369.99, now $299.99 (save $70 after $70 mail-in rebate(s)) | Add to Cart (Delivery/Pick-Up) | Penny Shopping | Compare

Word-of-mouth on the Web
- The Web has dramatically changed the way that consumers express their opinions.
- One can post reviews of products at merchant sites, Web forums, discussion groups, and blogs.
- Techniques are being developed to exploit these sources.

Benefits of review analysis:
- Potential customer: no need to read many reviews.
- Product manufacturer: market intelligence, product benchmarking.

Feature-based opinion summarization involves three steps:
1. Extracting the product features (called opinion features) that have been commented on by customers.
2. Identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative.
3. Summarizing and comparing the results (a toy sketch of this step follows below).
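For illustration only, here is a toy Python sketch of the summarization step (step 3); the tagged sentences are assumed to be the output of steps 1 and 2, the battery sentence is invented for the example, and all names are ours rather than from the slides:

    from collections import defaultdict

    # Hypothetical output of steps 1 and 2: each opinion sentence is tagged with
    # the feature it talks about and a polarity label.
    opinion_sentences = [
        ("picture", "positive", "The pictures coming out of this camera are amazing."),
        ("picture", "negative", "The pictures come out hazy if your hands shake."),
        ("battery life", "positive", "The battery easily lasts a whole day."),  # invented example
    ]

    # Step 3: aggregate the sentences per feature to build the feature-based summary.
    summary = defaultdict(lambda: {"positive": [], "negative": []})
    for feature, polarity, sentence in opinion_sentences:
        summary[feature][polarity].append(sentence)

    for feature, groups in summary.items():
        print(f"Feature: {feature}")
        for polarity in ("positive", "negative"):
            print(f"  {polarity}: {len(groups[polarity])}")
            for s in groups[polarity]:
                print(f"    - {s}")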

Example review: "GREAT Camera.", Jun 3, 2004
Reviewer: jprice174 from Atlanta, Ga.

    "I did a lot of research last year before I bought this camera... It kinda hurt to leave behind my beloved nikon 35mm SLR, but I was going to Italy, and I needed something smaller, and digital. The pictures coming out of this camera are amazing. The 'auto' feature takes great pictures most of the time. And with digital, you're not wasting film if the picture doesn't come out."

Feature-based summary:

Feature 1: picture
    Positive: 12
      - "The pictures coming out of this camera are amazing."
      - "Overall this is a good camera with a really good picture clarity."
    Negative: 2
      - "The pictures come out hazy if your hands shake even for a moment during the entire process of taking a picture."
      - "Focusing on a display rack about 20 feet away in a brightly lit room during day time, pictures produced by this camera were blurry and in a shade of orange."
Feature 2: battery life
...

[Figure: A visual feature-based summary of the reviews of Digital camera 1 - positive/negative bars for the features Picture, Battery, Zoom, Size, and Weight - and a side-by-side comparison of the reviews of Digital camera 1 and Digital camera 2.]

Data mining methods:
- Clustering
- Classification
- Association rules
- Other methods:
  - Outlier detection
  - Sequential patterns
  - Prediction
  - Trends and analysis of changes
  - Methods for special data types, e.g., spatial data mining, web mining

Association rules try to find associations between items in a set of transactions.
For example, for associations between items bought by customers in a supermarket:

    90% of transactions that purchase bread and butter also purchase milk

Output:
    Antecedent: bread and butter
    Consequent: milk
    Confidence factor: 90%

A transaction is a set of items: T = {i_a, i_b, ..., i_t}.
- T ⊆ I, where I is the set of all possible items {i_1, i_2, ..., i_n}.
- D, the task-relevant data, is a set of transactions (a database of transactions).

Example:
- Items sold by a supermarket (the itemset I): {sugar, parsley, onion, tomato, salt, bread, olives, cheese, butter}
- A transaction by a customer (T): T1 = {sugar, onion, salt}
- A database (D): {T1 = {salt, bread, olives}, T2 = {sugar, onion, salt}, T3 = {bread}, T4 = {cheese, butter}, T5 = {tomato}, ...}

An association rule has the form:

    P ⇒ Q, where P ⊂ I, Q ⊂ I, and P ∩ Q = ∅

Examples:
    {bread} ⇒ {butter, cheese}
    {onion, tomato} ⇒ {salt}

Support of a rule P ⇒ Q is the support of (P ∪ Q) in D:
    s_D(P ⇒ Q) = s_D(P ∪ Q)
    = the percentage of transactions in D containing both P and Q
    = (# transactions containing P and Q) / (cardinality of D)

Confidence of a rule P ⇒ Q:
    c_D(P ⇒ Q) = s_D(P ∪ Q) / s_D(P)
    = the percentage of transactions that contain both P and Q among the transactions that already contain P
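To make these definitions concrete, here is a minimal Python sketch (the toy transactions and helper names are ours, not from the slides) that computes support and confidence by direct counting:

    # A toy transaction database D; each transaction is a set of items.
    D = [
        {"salt", "bread", "olives"},
        {"sugar", "onion", "salt"},
        {"bread"},
        {"cheese", "butter"},
        {"tomato"},
    ]

    def support(itemset, transactions):
        """s_D(X): fraction of transactions that contain every item of `itemset`."""
        itemset = set(itemset)
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def confidence(antecedent, consequent, transactions):
        """c_D(P => Q) = s_D(P ∪ Q) / s_D(P)."""
        return (support(set(antecedent) | set(consequent), transactions)
                / support(antecedent, transactions))

    print(support({"salt"}, D))                # 2 of 5 transactions -> 0.4
    print(confidence({"salt"}, {"bread"}, D))  # 0.2 / 0.4 -> 0.5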

Thresholds:
- minimum support: minsup
- minimum confidence: minconf

Frequent itemset P: the support of P is larger than the minimum support.

Strong rule P ⇒ Q (c%): (P ∪ Q) is frequent and c is larger than the minimum confidence.

Example (min. support 50%, min. confidence 50%):

    Transaction ID   Items Bought
    2000             A, B, C
    1000             A, C
    4000             A, D
    5000             B, E, F

Frequent itemsets and their support:

    {A}      75%
    {B}      50%
    {C}      50%
    {A, C}   50%

For rule {A} ⇒ {C}:
    support = support({A, C}) = 50%
    confidence = support({A, C}) / support({A}) = 66.6%

For rule {C} ⇒ {A}:
    support = support({A, C}) = 50%
    confidence = support({A, C}) / support({C}) = 100.0%
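As a quick check, the numbers above can be reproduced with a few lines of Python (the helper name `sup` is ours):

    # Reproducing the example above.
    D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

    def sup(items):
        """Support of an itemset as a fraction of the transactions in D."""
        return sum(set(items) <= t for t in D) / len(D)

    print(sup({"A"}), sup({"B"}), sup({"C"}), sup({"A", "C"}))  # 0.75 0.5 0.5 0.5
    print(sup({"A", "C"}) / sup({"A"}))  # 0.666... = confidence of {A} => {C}
    print(sup({"A", "C"}) / sup({"C"}))  # 1.0      = confidence of {C} => {A}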

Association rule mining:
- Input: a database of transactions, where each transaction is a list of items (e.g., items purchased by a customer in a visit).
- Goal: find all strong rules that associate the presence of one set of items with that of another set of items.
  - Example: 98% of people who purchase tires and auto accessories also get automotive services done.
  - There are no restrictions on the number of items in the head or body of the rule.
- The most famous algorithm is Apriori.

The Apriori approach:
- Find the frequent itemsets: the sets of items that have minimum support.
  - A subset of a frequent itemset must also be a frequent itemset; i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets (the Apriori property).
  - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
- Use the frequent itemsets to generate association rules (a code sketch of the whole procedure follows below).
Source: [Sunysb, 2009]
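Below is a minimal Python sketch of this procedure (the function and variable names are ours, and min_sup is taken as an absolute count); it is meant to make the join-and-prune loop concrete rather than to be an efficient implementation:

    from itertools import combinations

    def apriori(transactions, min_sup):
        """Return {frozenset(itemset): support_count} for all frequent itemsets.

        `transactions` is a list of sets; `min_sup` is an absolute support count.
        """
        # C1/L1: count the single items and keep those meeting min_sup
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        current = {s: c for s, c in counts.items() if c >= min_sup}
        frequent = dict(current)

        k = 2
        while current:
            # Join step: combine frequent (k-1)-itemsets into k-item candidates,
            # pruning any candidate that has an infrequent (k-1)-subset.
            prev = list(current)
            candidates = set()
            for i in range(len(prev)):
                for j in range(i + 1, len(prev)):
                    union = prev[i] | prev[j]
                    if len(union) == k and all(
                            frozenset(sub) in current
                            for sub in combinations(union, k - 1)):
                        candidates.add(union)
            # Scan the transactions once to count the support of each candidate
            counts = {c: 0 for c in candidates}
            for t in transactions:
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            current = {s: c for s, c in counts.items() if c >= min_sup}
            frequent.update(current)
            k += 1
        return frequent

Calling `apriori` with min_sup = 2 on the nine-transaction database of the next slide reproduces the L1, L2 and L3 itemsets that are derived step by step below.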

TID    List of Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

- Consider a database, D, consisting of these 9 transactions.
- Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 = 22%).
- Let the minimum confidence required be 70%.
- We first find the frequent itemsets using the Apriori algorithm.
- Then, association rules are generated using the minimum support and minimum confidence.

Step 1: Generating 1-itemset frequent patterns

Scan D for the count of each candidate 1-itemset (C1), then compare each candidate's support count with the minimum support count to obtain L1:

    Itemset   Sup. Count
    {I1}      6
    {I2}      7
    {I3}      6
    {I4}      2
    {I5}      2

- The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support; here every candidate qualifies, so L1 = C1.
- In the first iteration of the algorithm, each item is a member of the set of candidates, C1.

Step 2: Generating 2-itemset frequent patterns

Generate candidate 2-itemsets C2 from L1 (L1 join L1), scan D for the count of each candidate, and compare the candidate support counts with the minimum support count:

    C2 (candidates with support counts after scanning D):
      {I1, I2}: 4   {I1, I3}: 4   {I1, I4}: 1   {I1, I5}: 2   {I2, I3}: 4
      {I2, I4}: 2   {I2, I5}: 2   {I3, I4}: 0   {I3, I5}: 1   {I4, I5}: 0

    L2 (candidates meeting the minimum support count of 2):
      {I1, I2}: 4   {I1, I3}: 4   {I1, I5}: 2   {I2, I3}: 4   {I2, I4}: 2   {I2, I5}: 2

- To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 join L1 to generate a candidate set of 2-itemsets, C2.
- Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is accumulated (as shown above).
- The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
- Note: we haven't used the Apriori property yet.


Step 3: Generating 3-itemset frequent patterns

Generate candidate 3-itemsets C3 from L2 (L2 join L2), prune them using the Apriori property, scan D for the count of each remaining candidate, and compare with the minimum support count:

    C3 (after pruning): {I1, I2, I3}: 2   {I1, I2, I5}: 2
    L3:                 {I1, I2, I3}: 2   {I1, I2, I5}: 2

- The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori property.
- In order to find C3, we compute L2 join L2:
  C3 = L2 join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.

- Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the latter four candidates cannot possibly be frequent.
- For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} and {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we keep {I1, I2, I3} in C3.
- Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. Its 2-item subsets are {I2, I3}, {I2, I5} and {I3, I5}. But {I3, I5} is not a member of L2, so {I2, I3, I5} cannot be frequent; it violates the Apriori property and we have to remove it from C3.
- Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the join result for pruning.
- Now the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
A possible pruning-check helper is sketched in code below.
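The sketch (the function name and the literal L2 are ours, with the itemsets taken from the worked example):

    from itertools import combinations

    def has_infrequent_subset(candidate, prev_frequent):
        """Apriori pruning test: a k-itemset can only be frequent if every
        (k-1)-subset of it is already among the frequent (k-1)-itemsets."""
        k = len(candidate)
        return any(frozenset(sub) not in prev_frequent
                   for sub in combinations(candidate, k - 1))

    L2 = {frozenset(s) for s in ({"I1", "I2"}, {"I1", "I3"}, {"I1", "I5"},
                                 {"I2", "I3"}, {"I2", "I4"}, {"I2", "I5"})}
    print(has_infrequent_subset({"I1", "I2", "I3"}, L2))  # False -> kept in C3
    print(has_infrequent_subset({"I2", "I3", "I5"}, L2))  # True  -> pruned, {I3, I5} not in L2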

Step 4: Generating 4-itemset frequent patterns
- The algorithm uses L3 join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {I2, I3, I5} is not frequent.
- Thus C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets. This completes the Apriori algorithm.
- What's next? These frequent itemsets will be used to generate strong association rules (where strong association rules satisfy both minimum support and minimum confidence).

Step 5: Generating association rules from frequent itemsets
Procedure:
- For each frequent itemset I, generate all nonempty subsets of I.
- For every nonempty subset s of I, output the rule s ⇒ (I − s) if support_count(I) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.
(A code sketch of this step appears after the worked example below.)

In our example:
- We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
- Let's take I = {I1, I2, I5}.
- Its nonempty subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
- Let the minimum confidence threshold be, say, 70%.
- The resulting association rules are shown below, each listed with its confidence.

R1: {I1, I2} ⇒ {I5}
    Confidence = sc{I1,I2,I5} / sc{I1,I2} = 2/4 = 50%. R1 is rejected.
R2: {I1, I5} ⇒ {I2}
    Confidence = sc{I1,I2,I5} / sc{I1,I5} = 2/2 = 100%. R2 is selected.
R3: {I2, I5} ⇒ {I1}
    Confidence = sc{I1,I2,I5} / sc{I2,I5} = 2/2 = 100%. R3 is selected.
R4: {I1} ⇒ {I2, I5}
    Confidence = sc{I1,I2,I5} / sc{I1} = 2/6 = 33%. R4 is rejected.
R5: {I2} ⇒ {I1, I5}
    Confidence = sc{I1,I2,I5} / sc{I2} = 2/7 = 29%. R5 is rejected.
R6: {I5} ⇒ {I1, I2}
    Confidence = sc{I1,I2,I5} / sc{I5} = 2/2 = 100%. R6 is selected.

In this way, we have found three strong association rules.
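For illustration, the rule-generation step can be sketched in Python as follows, reusing the support counts found in the worked example above (the function and variable names are ours):

    from itertools import chain, combinations

    # Support counts of the frequent itemsets found above.
    support_count = {
        frozenset(s): c for s, c in [
            ({"I1"}, 6), ({"I2"}, 7), ({"I3"}, 6), ({"I4"}, 2), ({"I5"}, 2),
            ({"I1", "I2"}, 4), ({"I1", "I3"}, 4), ({"I1", "I5"}, 2),
            ({"I2", "I3"}, 4), ({"I2", "I4"}, 2), ({"I2", "I5"}, 2),
            ({"I1", "I2", "I3"}, 2), ({"I1", "I2", "I5"}, 2),
        ]
    }

    def rules_from(itemset, min_conf):
        """Yield (antecedent, consequent, confidence) for the rules s => itemset - s."""
        itemset = frozenset(itemset)
        # All nonempty proper subsets of the itemset
        subsets = chain.from_iterable(
            combinations(itemset, r) for r in range(1, len(itemset)))
        for s in subsets:
            s = frozenset(s)
            conf = support_count[itemset] / support_count[s]
            if conf >= min_conf:
                yield set(s), set(itemset - s), conf

    for ante, cons, conf in rules_from({"I1", "I2", "I5"}, min_conf=0.7):
        print(ante, "=>", cons, f"confidence = {conf:.0%}")
    # Prints the three selected rules R2, R3 and R6.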

Classification

Learn a method for predicting the instance class from pre-labeled (classified) instances.
Many approaches: statistics, decision trees, neural networks, ...

Workflow:
- Prepare a collection of records (the training set); each record contains a set of attributes, one of which is the class.
- Find a model for the class attribute as a function of the values of the other attributes (decision tree, neural network, etc.).
- Prepare a test set to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
- Once happy with the accuracy, use the model to classify new instances (see the code sketch below).

From: http://www-users.cs.umn.edu/~kumar/dmbook/index.php
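A minimal sketch of this train/test workflow, assuming scikit-learn is available (the numeric encoding of the attributes and all parameter choices are our own, not part of the slides); it uses the 10-record "cheat" data set shown on the next slide:

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Attributes: Refund (1=Yes, 0=No), Marital Status (0=Single, 1=Married, 2=Divorced),
    # Taxable Income (in thousands). Class: Cheat (1=Yes, 0=No).
    X = [
        [1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
        [0, 1, 60],  [1, 2, 220], [0, 0, 85], [0, 1, 75],  [0, 0, 90],
    ]
    y = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]

    # Split into training and test sets, fit a decision tree, measure accuracy.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

    # Classify a new, unseen instance: Refund=No, Married, Taxable Income=80K.
    print("prediction:", model.predict([[0, 1, 80]]))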

[Figure: Illustrating the classification task - a labeled training set is fed to a learning algorithm to induce a model ("learn classifier"); the model is then applied to a test set of unlabeled records.]

Training set (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class):

    Tid   Refund   Marital Status   Taxable Income   Cheat
    1     Yes      Single           125K             No
    2     No       Married          100K             No
    3     No       Single           70K              No
    4     Yes      Married          120K             No
    5     No       Divorced         95K              Yes
    6     No       Married          60K              No
    7     Yes      Divorced         220K             No
    8     No       Single           85K              Yes
    9     No       Married          75K              No
    10    No       Single           90K              Yes

Test set (class unknown): records such as (No, Single, 75K), (Yes, Married, 50K), (No, Married, 150K), (Yes, Divorced, 90K), (No, Single, 40K), (No, Married, 80K).

From: http://www-users.cs.umn.edu/~kumar/dmbook/index.php

Training data: the same 10-record training set shown above (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class).

Model: Decision Tree (splitting attributes Refund, MarSt, TaxInc)

    Refund?
      Yes -> NO
      No  -> MarSt?
               Married          -> NO
               Single, Divorced -> TaxInc?
                                     < 80K -> NO
                                     > 80K -> YES

The same training data also fits a different decision tree:

    MarSt?
      Married          -> NO
      Single, Divorced -> Refund?
                            Yes -> NO
                            No  -> TaxInc?
                                     < 80K -> NO
                                     > 80K -> YES

There could be more than one tree that fits the same data!

Applying the model to test data: start from the root of the tree and follow the branch that matches the record's attribute values.

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Following Refund = No and then MarSt = Married leads to the leaf NO, so the record is assigned Cheat = No.

Direct Marketing
- Goal: reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
- Approach:
  - Use the data for a similar product introduced before. We know which customers decided to buy and which decided otherwise; this {buy, don't buy} decision forms the class attribute.
  - Collect various demographic, lifestyle, and company-interaction related information about all such customers: type of business, where they stay, how much they earn, etc.
  - Use this information as input attributes to learn a classifier model.

From [Berry & Linoff] Data Mining Techniques, 1997

Fraud Detection
- Goal: predict fraudulent cases in credit card transactions.
- Approach:
  - Use credit card transactions and the information on the account-holder as attributes: when does a customer buy, what does he buy, how often he pays on time, etc.
  - Label past transactions as fraud or fair transactions; this forms the class attribute.
  - Learn a model for the class of the transactions.
  - Use this model to detect fraud by observing credit card transactions on an account.

From: http://www-users.cs.umn.edu/~kumar/dmbook/index.php

Customer Attrition/Churn
- Goal: predict whether a customer is likely to be lost to a competitor.
- Approach:
  - Use detailed records of transactions with each of the past and present customers to find attributes: how often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.
  - Label the customers as loyal or disloyal.
  - Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997

Clustering
- Clustering is a process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.
- It helps users understand the natural grouping or structure in a data set.
- Cluster: a collection of data objects that are similar to one another and can thus be treated collectively as one group.
- Clustering is unsupervised classification: there are no predefined classes.

The goal is to find a natural grouping of instances given un-labeled data.

A good clustering method will produce high-quality clusters in which:
- the intra-class similarity (that is, within a cluster) is high;
- the inter-class similarity (that is, between clusters) is low.

- The quality of a clustering result also depends on both the similarity measure used by the method and its implementation.
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
- The quality of a clustering result also depends on the chosen definition and representation of a cluster.

Major clustering approaches:
- Partitioning algorithms: construct various partitions and then evaluate them by some criterion.
- Hierarchy algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion; there is an agglomerative approach and a divisive approach.

Partitioning method: given a number k, partition a database D of n objects into a set of k clusters so that a chosen objective function is minimized (e.g., the sum of distances to the centers of the clusters).
- Finding the global optimum by exhaustively enumerating all partitions is too expensive!
- Heuristic methods are based on iterative refinement of an initial partition (see the k-means-style sketch below).
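One such heuristic is k-means-style iterative refinement; the following is a minimal pure-Python sketch for 2-D points (all names and the toy data are ours):

    def kmeans(points, k, iterations=20):
        """Iteratively refine a partition of 2-D points into k clusters (k-means style)."""
        centers = list(points[:k])  # naive initialization: the first k points
        for _ in range(iterations):
            # Assignment step: each point joins the cluster of its nearest center
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2
                                                + (p[1] - centers[c][1]) ** 2)
                clusters[j].append(p)
            # Update step: move each center to the mean of its cluster
            for j, cl in enumerate(clusters):
                if cl:
                    centers[j] = (sum(p[0] for p in cl) / len(cl),
                                  sum(p[1] for p in cl) / len(cl))
        return centers, clusters

    pts = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 8.5), (8.5, 9)]
    centers, clusters = kmeans(pts, k=2)
    print(centers)  # roughly [(1.17, 1.17), (8.5, 8.5)] for this toy data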

Hierarchical clustering: hierarchical decomposition of the data set (with respect to a given similarity measure) into a set of nested clusters.
- The result is represented by a so-called dendrogram.
- Nodes in the dendrogram represent possible clusters.
- The dendrogram can be constructed bottom-up (agglomerative approach) or top-down (divisive approach).
- A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

Single link: cluster similarity = similarity of the two most similar members.
    + fast
    - potentially long and skinny clusters

Initial distance matrix:

         1    2    3    4    5
    1    0
    2    2    0
    3    6    3    0
    4    10   9    7    0
    5    9    8    5    4    0

Step 1: merge {1, 2} (distance 2) and update the distances with the minimum:
    d_(1,2),3 = min{d_1,3, d_2,3} = min{6, 3} = 3
    d_(1,2),4 = min{d_1,4, d_2,4} = min{10, 9} = 9
    d_(1,2),5 = min{d_1,5, d_2,5} = min{9, 8} = 8

           (1,2)  3    4    5
    (1,2)  0
    3      3      0
    4      9      7    0
    5      8      5    4    0

Step 2: merge {(1,2), 3} (distance 3):
    d_(1,2,3),4 = min{d_(1,2),4, d_3,4} = min{9, 7} = 7
    d_(1,2,3),5 = min{d_(1,2),5, d_3,5} = min{8, 5} = 5

             (1,2,3)  4    5
    (1,2,3)  0
    4        7        0
    5        5        4    0

Step 3: merge {4, 5} (distance 4), then merge the two remaining clusters:
    d_(1,2,3),(4,5) = min{d_(1,2,3),4, d_(1,2,3),5} = 5

[Dendrogram: 1 and 2 join at height 2, 3 joins them at height 3, 4 and 5 join at height 4, and the two clusters join at height 5.]

Complete link: cluster similarity = similarity of the two least similar members.
    + tight clusters
    - slow

Starting from the same initial distance matrix:

Step 1: merge {1, 2} (distance 2) and update the distances with the maximum:
    d_(1,2),3 = max{d_1,3, d_2,3} = max{6, 3} = 6
    d_(1,2),4 = max{d_1,4, d_2,4} = max{10, 9} = 10
    d_(1,2),5 = max{d_1,5, d_2,5} = max{9, 8} = 9

           (1,2)  3    4    5
    (1,2)  0
    3      6      0
    4      10     7    0
    5      9      5    4    0

Step 2: merge {4, 5} (distance 4):
    d_(1,2),(4,5) = max{d_(1,2),4, d_(1,2),5} = max{10, 9} = 10
    d_3,(4,5) = max{d_3,4, d_3,5} = max{7, 5} = 7

           (1,2)  3    (4,5)
    (1,2)  0
    3      6      0
    (4,5)  10     7    0

Step 3: merge {(1,2), 3} (distance 6), then merge the two remaining clusters:
    d_(1,2,3),(4,5) = max{d_(1,2),(4,5), d_3,(4,5)} = 10

[Dendrogram: 1 and 2 join at height 2, 4 and 5 join at height 4, 3 joins (1,2) at height 6, and the two clusters join at height 10.]

Dendrogram: Hierarchical Clustering
- A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
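Assuming SciPy is available, the single-link and complete-link examples above can be reproduced as follows (variable names are ours; `dist` is the condensed form of the 5x5 distance matrix used in the example):

    from scipy.cluster.hierarchy import linkage, fcluster

    # Pairwise distances d(1,2)=2, d(1,3)=6, d(1,4)=10, d(1,5)=9, d(2,3)=3,
    # d(2,4)=9, d(2,5)=8, d(3,4)=7, d(3,5)=5, d(4,5)=4 in condensed order.
    dist = [2, 6, 10, 9, 3, 9, 8, 7, 5, 4]

    single = linkage(dist, method="single")      # merge heights 2, 3, 4, 5
    complete = linkage(dist, method="complete")  # merge heights 2, 4, 6, 10
    print(single[:, 2], complete[:, 2])          # distances at which clusters merge

    # Cutting the single-link dendrogram at height 4.5 yields two clusters,
    # {1, 2, 3} and {4, 5}.
    print(fcluster(single, t=4.5, criterion="distance"))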
