
Clustering

Gilad Lerman
Math Department, UMN
Slides/figures stolen from M.-A. Dillies, E. Keogh, A. Moore
What is Clustering?
Partitioning data into classes with
high intra-class similarity
low inter-class similarity
Is it well-defined?

What is Similarity?
Clearly a subjective, problem-dependent measure
How Similar Are Clusters?
Ex. 1: Two clusters or one cluster?
How Similar Are Clusters?
Ex. 2: Cluster or outliers?
Sum-Squares Intra-class Similarity
Given cluster $S_1 = \{x_1, \ldots, x_{N_1}\}$

Mean: $c_1 = \frac{1}{N_1} \sum_{x_i \in S_1} x_i$

Within Cluster Sum of Squares:
$\mathrm{WCSS}(S_1) = \sum_{x_i \in S_1} \|x_i - c_1\|_2^2$, where $\|y\|_2^2 = \sum_{j=1}^{D} (y_j)^2$

Note that $c_1 = \operatorname*{argmin}_{c} \sum_{x_i \in S_1} \|x_i - c\|_2^2$
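To make the definition concrete, here is a minimal NumPy sketch of WCSS for a single cluster (the function name and array layout are our own, not from the slides):

import numpy as np

def wcss(X):
    # X: (N, D) array whose rows are the points x_i of one cluster.
    c = X.mean(axis=0)                   # cluster mean c
    return float(np.sum((X - c) ** 2))   # sum of squared l2 distances to c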
Within Cluster Sum of Squares
For a set of clusters $S = \{S_1, \ldots, S_K\}$:

$\mathrm{WCSS}(S) = \sum_{j=1}^{K} \sum_{x_i \in S_j} \|x_i - c_j\|_2^2$

Can use $\|y\|_1 = \sum_{j=1}^{D} |y_j|$ instead of $\|y\|_2^2 = \sum_{j=1}^{D} (y_j)^2$

So get the Within Clusters Manhattan Distance:
$\mathrm{WCMD}(S) = \sum_{j=1}^{K} \sum_{x_i \in S_j} \|x_i - m_j\|_1$, where $m_j = \operatorname*{argmin}_{c} \sum_{x_i \in S_j} \|x_i - c\|_1$

Question: how to compute/estimate the centers $m_j$?
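Since the entrywise $\ell_1$ sum decouples across coordinates and the 1-D median minimizes the sum of absolute deviations, the minimizer $m_j$ is the coordinate-wise median. A minimal NumPy sketch (names are our own):

import numpy as np

def wcmd(clusters):
    # clusters: list of (N_j, D) arrays, one array per cluster S_j.
    total = 0.0
    for X in clusters:
        m = np.median(X, axis=0)         # coordinate-wise median m_j
        total += np.abs(X - m).sum()     # sum of l1 distances to m_j
    return total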
Minimizing WCSS
Precise minimization is NP-hard
Approximate minimization for WCSS by
K-means
Approximate minimization for WCMD by
K-medians
The K-means Algorithm
Input: data & the number of clusters (K)
Randomly guess locations of the K cluster centers
Assign each point to its nearest center
Move each center to the mean of its assigned points
Repeat till convergence.
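A minimal NumPy sketch of these steps (Lloyd's algorithm; the initialization and stopping rule below are one common choice, not the only one):

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    # X: (N, D) data array; K: number of clusters.
    rng = np.random.default_rng(seed)
    # Randomly guess initial centers by sampling K distinct data points.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center moves to the mean of its points.
        new_centers = np.array(
            [X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
             for k in range(K)])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return centers, labels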



Demonstration: K-means/medians
Applet

K-means: Pros and Cons
Pros
Often fast
Often terminates at a local minimum
Cons
May not obtain the global minimum
Depends on initialization
Need to specify K
Sensitive to outliers
Sensitive to variations in sizes and densities of clusters
Not suitable for non-convex shapes
Does not apply directly to categorical data
Spectral Clustering
Idea: embed data for easy clustering
Construct weights based on proximity:
$W_{ij} = e^{-\|x_i - x_j\|^2/\sigma^2}$ if $i \neq j$, and $W_{ij} = 0$ otherwise
(Normalize W)
Embed using eigenvectors of W
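A minimal NumPy sketch of this pipeline; the symmetric normalization $D^{-1/2} W D^{-1/2}$ is one common choice, since the slide does not pin down which normalization to use:

import numpy as np

def spectral_embedding(X, sigma=1.0, n_vec=2):
    # Gaussian affinity W_ij = exp(-||x_i - x_j||^2 / sigma^2), W_ii = 0.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq / sigma**2)
    np.fill_diagonal(W, 0.0)
    # One common normalization: D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    W_norm = W / np.sqrt(np.outer(d, d))
    # Rows built from the top eigenvectors are the embedded points;
    # run K-means on these rows to recover the clusters.
    vals, vecs = np.linalg.eigh(W_norm)      # eigenvalues in ascending order
    return vecs[:, -n_vec:]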
Clustering vs. Classification
Clustering: finds classes in an unsupervised way (often K is given, though)
Classification: labels of clusters are given for some data points (supervised learning)
Data 1: Face images
Facial images (e.g., of persons 5,8,10) live on different
planes in the image space
They are often well-separated so that simple clustering
can apply to them (but not always)
Question: What is the high-dimensional image space?
Question: How can we present high-dim. data in 3D?
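One standard answer to the second question, sketched with NumPy, is to project onto the top three principal components (here assuming each image is flattened into a row of X, so a $p$-pixel image is a point in $\mathbb{R}^p$):

import numpy as np

def project_3d(X):
    # Project rows of X (high-dimensional points) onto the best-fit
    # 3D subspace found by PCA.
    Xc = X - X.mean(axis=0)                       # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:3].T                          # (N, 3) coordinates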



Data 2: Iris Data Set
50 samples from each of 3 species
4 features per sample:
length & width of sepal and petal
Setosa Versicolor Virginica
Data 2: Iris Data Set
Setosa is clearly separated from 2 others
Can't separate Virginica and Versicolor
(need a training set, as done by Fisher in 1936)
Question: What are other ways to visualize?
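A quick way to reproduce this observation, assuming scikit-learn is available:

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris()
X, y = iris.data, iris.target                  # 150 samples, 4 features

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# Setosa typically gets a pure cluster, while Versicolor and
# Virginica end up partially mixed, matching the slide's point.
print(list(zip(km.labels_[:10], y[:10])))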
Data 3: Color-based Compression of Images
Applet
Question: What are the actual data points?
Question: What does the error mean?
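One reading of the two questions: the data points are the pixels viewed as RGB triples in $\mathbb{R}^3$, and the error is the WCSS, i.e. the total squared color distortion. A sketch assuming scikit-learn and Pillow (the file name is hypothetical):

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open("photo.jpg"))      # hypothetical RGB image
pixels = img.reshape(-1, 3).astype(float)      # pixels as points in R^3

K = 16                                         # palette size
km = KMeans(n_clusters=K, n_init=4, random_state=0).fit(pixels)
# Replace every pixel by its cluster center: a K-color image.
quantized = km.cluster_centers_[km.labels_].reshape(img.shape).astype(np.uint8)
# The reported "error" is the WCSS, i.e. total squared color distortion.
print("distortion:", km.inertia_)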
Some methods for # of Clusters
(with online codes)
Gap statistics (see the sketch after this list)
Model-based clustering
G-means
X-means
Data-spectroscopic clustering
Self-tuning clustering
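As one example from this list, here is a simplified sketch of the gap statistic (Tibshirani, Walther & Hastie, 2001): it compares $\log \mathrm{WCSS}_k$ on the data against its average over uniform reference draws from the data's bounding box, omitting the original standard-error rule; assumes scikit-learn:

import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k_max=10, n_ref=10, seed=0):
    # Gap(k) = E*[log WCSS_k] - log WCSS_k, expectation taken over
    # uniform reference data drawn from the bounding box of X.
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        wk = KMeans(n_clusters=k, n_init=5, random_state=seed).fit(X).inertia_
        ref = [KMeans(n_clusters=k, n_init=5, random_state=seed)
               .fit(rng.uniform(lo, hi, size=X.shape)).inertia_
               for _ in range(n_ref)]
        gaps.append(np.mean(np.log(ref)) - np.log(wk))
    return 1 + int(np.argmax(gaps))   # crude rule: k with the largest gap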
Your mission
Learn about clustering (theoretical results,
algorithms, codes)
Focus: methods for determining # of clusters
Understand details
Compare using artificial and real data
Conclude good/bad scenarios for each (prove?)
Come up with new/improved methods
Summarize info: literature survey and possibly
new/improved demos/applets
We can suggest additional questions tailored to
your interest
