
Clustering

Basic ideas

QRM: Prasenjit Chakrabarti
Clustering

• Clustering is an ill-defined problem

• Loosely: there is no unique solution to a clustering problem

Why is it so?
Desirable properties of a clustering algorithm:
1. Scale invariance: f(D) = f(αD), where α > 0 is a scale factor. If we multiply the
distances between every pair of points by a constant α, the clustering solution
should remain the same (a small numerical check follows this list).

2. Richness: any possible partitioning should be a possible outcome. Loosely, the
algorithm should be able to produce any grouping of the same set of data.

3. Consistency: suppose we initially obtained n clusters. Loosely, if the distances between
points within each cluster are reduced and the distances between points in different
clusters are enlarged, then the number of clusters should remain n.

• We CANNOT have all three properties satisfied at the same time (this is Kleinberg's impossibility result).
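As an illustration (not part of the original slides), here is a minimal numerical sketch of property 1, using SciPy's single-linkage hierarchical clustering as the example algorithm and an arbitrary scale factor α: scaling all pairwise distances leaves the resulting partition unchanged.

```python
# A minimal sketch (not from the slides) of scale invariance, assuming SciPy's
# single-linkage hierarchical clustering as the example algorithm.
# Multiplying every pairwise distance by a constant alpha > 0 should leave
# the resulting partition unchanged.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))            # 30 points in 2 dimensions
D = pdist(X)                            # condensed vector of pairwise distances d_ij

alpha = 7.5                             # arbitrary positive scale factor
labels_orig = fcluster(linkage(D, method="single"), t=3, criterion="maxclust")
labels_scaled = fcluster(linkage(alpha * D, method="single"), t=3, criterion="maxclust")

print(np.array_equal(labels_orig, labels_scaled))   # True: the partition is unchanged
```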

K-means clustering
• Suppose we have n data points indexed by 1, 2, …, n

• Suppose we want K clusters, k ∈ {1, 2, …, K}

• We need to assign each point to exactly one cluster: C(i) = k, where C is the
assignment function (illustrated below)

• The goal is to find a grouping of the data such that the distances between points
within a cluster tend to be small and the distances between points in different
clusters tend to be large.
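As a tiny illustration (not from the slides), the assignment function C can be stored as an array in which C[i] holds the cluster label of point i:

```python
# Hypothetical example of an assignment function C for n = 10 points and K = 3 clusters.
import numpy as np

C = np.array([0, 2, 1, 1, 0, 2, 2, 0, 1, 0])   # C[i] = k: point i belongs to cluster k
points_in_cluster_2 = np.where(C == 2)[0]       # all indices i with C(i) = 2
print(points_in_cluster_2)                      # [1 5 6]
```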

K-means clustering
Consider the sum of the distances between all pairs of points, and call this total
distance T. Now consider two points i and j: they can be in the same cluster, or they
can belong to two different clusters. The total distance can therefore be written as

T = Σ_{k=1}^{K} Σ_{C(i)=k} ( Σ_{C(j)=k} d_ij + Σ_{C(j)≠k} d_ij ) = W(C) + B(C)

where d_ij = distance between points i and j
W(C) = within-cluster distance
B(C) = between-cluster distance
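The decomposition can be checked numerically. The following is a minimal sketch (not from the slides), using random data and an arbitrary assignment C; the data and distance measure are purely illustrative.

```python
# A minimal sketch verifying the decomposition T = W(C) + B(C) on random data
# with an arbitrary cluster assignment C.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))                 # 50 data points, 2 features
C = rng.integers(0, 3, size=50)              # arbitrary assignment to K = 3 clusters

# Pairwise distances d_ij (Euclidean here; any dissimilarity would do).
diff = X[:, None, :] - X[None, :, :]
d = np.sqrt((diff ** 2).sum(axis=-1))

T = d.sum()                                  # total distance over all pairs
same_cluster = (C[:, None] == C[None, :])
W = d[same_cluster].sum()                    # within-cluster distance W(C)
B = d[~same_cluster].sum()                   # between-cluster distance B(C)

print(np.isclose(T, W + B))                  # True: minimizing W(C) maximizes B(C)
```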

K-means clustering
Thus, clustering becomes an optimization problem.
Loosely:

Since T is fixed, our objective is either to minimize W(C) or, equivalently, to maximize B(C).

This is how the clustering problem is transformed into an optimization problem.

• Generally, for large data sets we use K-means clustering, where we exogenously
define the number of clusters (a usage sketch follows).
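A usage sketch, assuming scikit-learn is available (the library and parameter choices are illustrative, not part of the slides):

```python
# K-means with an exogenously chosen number of clusters K.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 2))               # some (large-ish) data set

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

print(km.labels_[:10])                       # C(i) for the first ten points
print(km.inertia_)                           # within-cluster sum of squared distances
                                             # to the centroids: what K-means minimizes
```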

Some Data

This could easily be modeled by a Gaussian mixture (with 5 components).

But let's look at a satisfying, friendly and infinitely popular alternative…

K-means
1. Ask user how many clusters they'd like. (e.g. k = 5)
2. Randomly guess k cluster center locations.
3. Each datapoint finds out which center it's closest to. (Thus each center "owns" a set of datapoints.)
4. Each center finds the centroid of the points it owns…
5. …and jumps there.
6. …Repeat until terminated! (See the sketch below.)
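Below is a from-scratch sketch of steps 1-6 (a minimal Lloyd's-algorithm implementation on synthetic data; it is not the code behind the slides' example):

```python
# A minimal K-means (Lloyd's algorithm) sketch following steps 1-6.
import numpy as np

def kmeans(X, k=5, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly guess k cluster-center locations (here: k random data points).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: each datapoint finds out which center it is closest to.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        owner = dists.argmin(axis=1)
        # Steps 4-5: each center finds the centroid of the points it owns and jumps there.
        new_centers = np.array([
            X[owner == j].mean(axis=0) if np.any(owner == j) else centers[j]
            for j in range(k)
        ])
        # Step 6: repeat until the centers stop moving.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, owner

# Step 1: the user chooses the number of clusters, e.g. k = 5.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))
centers, labels = kmeans(X, k=5)
print(centers)
```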
K-means
Start

Advance apologies: in black and white this example will deteriorate.

Example generated by Dan Pelleg's super-duper fast K-means system:
Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases 1999 (KDD99) (available on www.autonlab.org/pap.html)

K-means
continues… (the successive iterations are shown as figures in the original slides)

K-means
terminates
