Basic ideas
QRM: Prasenjit Chakrabarti
Clustering
Why is it so?
Desirable properties of a clustering algorithm:
1. Scale invariance: $f(D) = f(\alpha D)$, where $\alpha > 0$ is a scale factor. That is, if we multiply the distances between every pair of points by a constant $\alpha$, the clustering solution remains the same.
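A quick numerical illustration of scale invariance: nearest-center assignments do not change when all points (and hence all pairwise distances) are multiplied by a constant. The data and the factor alpha = 3.0 below are invented for this sketch.

```python
# Sketch: distance-based assignments are invariant to a positive rescaling.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))           # 20 points in 2-D (made-up data)
centers = X[:3]                        # pick 3 points as cluster centers

def nearest_center(points, centers):
    # index of the closest center for every point
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1)

alpha = 3.0
labels = nearest_center(X, centers)
labels_scaled = nearest_center(alpha * X, alpha * centers)  # every distance is scaled by alpha

print(np.array_equal(labels, labels_scaled))  # → True
```

Since every distance is multiplied by the same positive constant, the argmin over centers is unchanged for each point.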
K-means clustering
• Suppose we have n data points, indexed 1, 2, …, n.
• The goal is to find a grouping of the data such that distances between points within a cluster tend to be small, and distances between points in different clusters tend to be large.
K-means clustering
Consider the sum of the distances between all pairs of points, and call this total T. Any two points i and j either belong to the same cluster or to different clusters, so T can be decomposed as

$$T = \frac{1}{2}\sum_{k=1}^{K} \sum_{C(i)=k} \left( \sum_{C(j)=k} d_{ij} + \sum_{C(j)\neq k} d_{ij} \right) = W(C) + B(C)$$

where
$d_{ij}$ = distance between points $i$ and $j$, and $C(i)$ denotes the cluster to which point $i$ is assigned,
$W(C)$ = within-cluster distance,
$B(C)$ = between-cluster distance.
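The decomposition T = W(C) + B(C) can be verified numerically. The six points and the two-cluster labelling below are invented for this check; the factor 1/2 counts each unordered pair once.

```python
# Numerical check of the decomposition T = W(C) + B(C) on a toy data set.
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0],
              [5.0, 5.0], [6.0, 5.0], [5.5, 6.0]])
C = np.array([0, 0, 0, 1, 1, 1])       # cluster assignment of each point

D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise d_ij

same = C[:, None] == C[None, :]        # True where i and j share a cluster
T = D.sum() / 2                        # total distance, each pair counted once
W = D[same].sum() / 2                  # within-cluster distance
B = D[~same].sum() / 2                 # between-cluster distance

print(np.isclose(T, W + B))  # → True
```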
K-means clustering
Thus clustering becomes an optimization problem: since T is fixed for a given data set, minimizing the within-cluster distance W(C) is equivalent to maximizing the between-cluster distance B(C).
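Loosely, the objective can be written as follows (a standard formulation consistent with the decomposition above; K is the number of clusters):

```latex
\min_{C}\; W(C)
  \;=\; \min_{C}\; \frac{1}{2}\sum_{k=1}^{K}\,\sum_{C(i)=k}\,\sum_{C(j)=k} d_{ij}
  \qquad\Longleftrightarrow\qquad
  \max_{C}\; B(C),
```

since $T = W(C) + B(C)$ is constant for a given data set.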
Some data
K-means
1. Ask the user how many clusters they'd like (e.g. k = 5).
2. Randomly guess k cluster-center locations.
3. Each datapoint finds out which center it is closest to (thus each center "owns" a set of datapoints).
4. Each center finds the centroid of the points it owns…
5. …and jumps there.
6. Repeat from step 3 until terminated!
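The steps above can be sketched in NumPy (this is Lloyd's algorithm; the toy two-blob data, k = 2, and the seeds are invented for this illustration, not part of the original slides):

```python
# Minimal K-means sketch following steps 1-6 above.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly guess k cluster-center locations (here: k random datapoints)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: each datapoint finds the center it is closest to
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Steps 4-5: each center jumps to the centroid of the points it owns
        # (a center that owns no points simply stays put)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 6: repeat until terminated (here: centers stop moving)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (30, 2)),   # blob near (0, 0)
               rng.normal(5, 0.5, (30, 2))])  # blob near (5, 5)
labels, centers = kmeans(X, k=2)
print(labels.shape, centers.shape)  # → (60,) (2, 2)
```

Step 1 corresponds to the user-supplied argument k; the termination test in step 6 is one common choice (others stop when assignments no longer change, or after a fixed number of iterations).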
K-means
Start
(Apologies in advance: in black and white this example will deteriorate.)
Example generated by Dan Pelleg's super-duper fast K-means system: Dan Pelleg and Andrew Moore, "Accelerating Exact k-means Algorithms with Geometric Reasoning", Proc. Conference on Knowledge Discovery in Databases, 1999 (KDD99), available at www.autonlab.org/pap.html.
K-means continues… (the intermediate iterations are shown as figures in the original slides)
K-means terminates
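Why does the algorithm terminate? Each assignment step and each centroid step can only lower (never raise) the sum of squared distances from points to their owned centers, so the procedure reaches a fixed point. A small check, on data invented for this sketch:

```python
# The K-means cost (sum of squared distances to owned centers) never increases.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                      # made-up data
centers = X[rng.choice(50, size=3, replace=False)]

costs = []
for _ in range(20):
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)                     # assignment step
    costs.append((d.min(axis=1) ** 2).sum())      # cost after assignment
    # centroid step: the centroid minimizes the sum of squared distances
    centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(3)])

print(all(b <= a + 1e-9 for a, b in zip(costs, costs[1:])))  # → True
```

Note that this monotonicity argument uses squared Euclidean distances, for which the centroid is the optimal center of a fixed set of points.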