
Chapter 4 Clustering

4.1 Clustering
Introduction
Examples

4.2 A Clustering Objective


- Specifying The Cluster Assignments
- Group Representatives
- A Clustering Objective
- Optimal And Suboptimal Clustering
- Partitioning The Vectors With Representative Fixed
- Optimizing the Group Representative With Assignment Fixed

4.3 K-Means Algorithm


- Algorithm 4.1
- Comments And Clarifications
- Convergence
- Interpreting The Representatives
- Choosing k
- Complexity

4.4 Examples
4.4.1 Image Clustering
4.4.2 Document Topic Discovery

4.5 Applications
- Classification
- Recommendations
- Guessing Missing Entries

Exercises

Notes:

In this chapter we consider the problem of partitioning a collection of vectors
into groups or clusters, where the vectors in each cluster are 'close' to each
other. We describe a famous clustering algorithm called the k-means algorithm and
give some applications.

4.1 Clustering
Introduction

Suppose we have N vectors (Note: here N is the number of vectors, not their
size). Our goal is to partition them (if possible) into k groups or clusters, with
the vectors in each cluster being 'close' or 'similar' to each other.

Clustering is widely used in many application areas, especially when the vectors
are feature vectors.
Normally we have k much smaller than N, i.e. a small number of groups compared to
the number of vectors.

In typical applications, k ranges from a handful up to a few hundred.


Part of the problem is determining the optimal value of k.

Example given of 300 2-vectors divided into 3 clusters and graphically displayed.
This example is not typical, though. Problems:
(1) Clustering 2-vectors is easy: we can just scatter-plot them and examine them
visually. When the dimension n is larger than 2 (and it usually is much larger),
this method cannot be used.
(2) This specific example has 3 clear clusters with no borderline cases, which are
difficult to classify. In typical applications, many points lie 'between' clusters
and are difficult to classify.
(3) In this example it is very clear (visually) that the value of k = number of
clusters is 3. In many domains/datasets the value of k is not clear.

Examples
Before we delve deeper, some examples.
- Topic Discovery
The vectors x_1 through x_N are word histograms of N documents. A clustering
algorithm classifies the documents into k groups, which can typically be
interpreted as documents with the same genre, author, or topic. Since the
clustering algorithm has no concept of 'topic' built in (it just does 'blind'
clustering), this is sometimes called topic discovery.

- Patient Clustering
If vectors x_1 through x_N are feature vectors associated with N patients in a
hospital, then a clustering algorithm can partition the patients into groups that
are similar in terms of the features encoded in the feature vector.

- Customer Market Segmentation


If vector x_i represents the quantities of n distinct items that customer i
bought (or not) over a specific period of time, then the clustering algorithm
segments the customers into groups that have similar buying patterns.

- Zipcode Clustering
Vector x_i represents some kind of data about zipcode i, such as the number of
residents in various age groups, household size, education statistics, income
statistics, etc. We can then cluster the zipcodes into groups with similar
characteristics.

- Student Clustering
Each vector x_i is the detailed grading record of student i in a course. A
clustering algorithm can partition the students into groups that perform at
similar levels.

- Survey Response Clustering


Each vector x_i is one person's response to a survey with n questions. The
clusters are groups of people who responded 'similarly'.

- Weather Zones
Each vector x_i represents country i's temperature and rainfall, with (say) the
first twelve entries giving the average monthly temperatures for a year, and the
second twelve entries giving the average monthly rainfall. (These can be
standardized, so each entry varies between -1 and 1.) The vector thus summarizes
the weather pattern in country i. The clustering algorithm can then partition all
the countries into clusters with similar weather patterns, 'weather zones', and we
can plot these on a map.

- Daily Energy Use Patterns


Each vector x_i has 24 entries, recording the hourly average energy consumption
of individual i over (say) a month. A clustering algorithm partitions the
individuals into groups with similar energy-use patterns. We can expect the
algorithm to 'discover' which individuals have a heated swimming pool, solar
panels, or an electric water heater.
- Financial Sectors
Each vector x_i represents company i. Each entry is a financial or business
attribute: total capitalization, financial returns, risk, trading volume, profit
and loss, or dividends paid. (These quantities would need scaling to be in similar
ranges.) A clustering algorithm would group the companies into 'sectors' with
similar financial attributes.

(_ assuming we have some way to tell good clusterings from bad ones) It is of
great use to know that the data is well clustered at a specific value of k (say 5
or 37). We can then understand the data better and assign labels to each of the k
clusters (e.g. 'well capitalized' and so on).

4.2 A Clustering Objective


In this section we
(1) formalize the idea of clustering, and
(2) introduce a natural measure of the quality of a given clustering.

- Specifying The Cluster Assignments

The vectors are x_1 through x_N (so N vectors total. Note: here N is the count of
vectors, *not* the dimension of each vector. In the example with 300 2-vectors, N
would be 300). The index that ranges from 1 to N is i.
There are k clusters, represented by sets G_1 through G_k, each of which contains
the indices (numbers between 1 and N) of the vectors in that cluster. The index
that ranges from 1 to k is j.
e.g. consider vectors x_1 through x_5, divided into three clusters: G_1 = {2,3,4},
G_2 = {5}, G_3 = {1}.

The clustering as a whole can be represented by an N-vector whose i-th entry gives
the group to which the i-th vector has been assigned.
e.g. in the above case this would be the 5-vector (3,1,1,1,2). The group
assignment vector is denoted c, so in this case
c = (3,1,1,1,2)

c_i is the group to which vector x_i belongs.

- Group Representatives
(KEY) With each group we associate a *group representative* n-vector; these are
denoted z_1 through z_k. (_ j is the index on groups, so j is also the index for z.)
(IMP) These representative vectors do *NOT* need to be members of their group. They
can be any n-vectors (_ so this opens the possibility of selecting 'representatives'
(say a specific type of customer) and actually *creating* the k clusters we define.
Nice)
(KEY) We want each group's representative to be close to the vectors in its group:
if z_j is the representative of group j, we want the quantity || x_i - z_j || to
be small for all i in G_j.

- A Clustering Objective
We now define a measure to judge a specific clustering.

Using the book's notation:

|| x_i - z_{c_i} ||: i points to a specific vector; c_i gives the group that
vector belongs to (since c is the 'cluster assignment vector'), and that group has
representative z_{c_i}. The distance between the i-th vector and its
representative should be small.
Extend this to all vectors: the *total* distance of the vectors from their group
representatives should be small. Use squares, which weight large distances heavily
(same logic as in least squares), and divide by N to get an average, which gives us:

For each vector we calculate its squared distance from *its* group representative,
then take the mean (this is the mean squared distance of the vectors from their
representatives):

J^clust = ( || x_1 - z_{c_1} ||^2 + ... + || x_N - z_{c_N} ||^2 ) / N

Note that J^clust depends on two things:

- the assignment of vectors to specific clusters (changing this changes J^clust),
as represented by the assignment vector c
- the choice of group representatives z_1 through z_k (changing this changes
J^clust independently of the above)

(KEY) The smaller J^clust is, the better the clustering.


The extreme case is J^clust = 0. This happens only when every vector equals its
group representative, i.e. the vectors take at most k distinct values (_ so
duplicate copies are fine, but scalar multiples are not: a multiple of the
representative generally has nonzero distance to it).

Choosing J^clust as the clustering measure makes sense because it encourages each
vector in a group to be close to its representative. But other measures may be
useful, such as those which encourage a balanced clustering.
In this book we stick with J^clust.
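
(_ a minimal numpy sketch of computing J^clust; the function and array names, and
the 0-based group indices in c, are mine, not the book's)

    import numpy as np

    def j_clust(x, c, z):
        # x: N x n data array; c: length-N array of group indices (0-based);
        # z: k x n array of group representatives.
        # Mean squared distance of each x[i] to its representative z[c[i]].
        return np.mean(np.sum((x - z[c])**2, axis=1))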

- Optimal And Sub-optimal Clustering

We seek an assignment c and representative vectors z_1 through z_k that minimize
the 'objective' function J^clust. Such a clustering (one that minimizes J^clust)
is 'optimal'. For all but the smallest problems it is computationally impractical
to find an optimal clustering (_ order of computation not given).

The good news: the k-means algorithm (next section) needs far less computation,
can easily handle billions of vectors, and often finds a good, if not the best,
clustering. The clusterings found by k-means are suboptimal, i.e. not necessarily
the best possible clustering (but often good enough).

It also turns out that we can
1. find the best clustering if the representatives are fixed
2. find the best representatives if the clustering is fixed.

- Partitioning The Vectors With Representative Fixed

Suppose that the representatives z_1 through z_k are fixed.


We seek c_1 through c_N (i.e. we seek an assignment of vectors to clusters) that
achieves the smallest value of J^clust:

J^clust = ( || x_1 - z_{c_1} ||^2 + ... + || x_N - z_{c_N} ||^2 ) / N

As seen, J^clust is a sum of N terms.


If a vector x_i is assigned to one group vs. another, the only change in the
calculation of J^clust is in the i-th term, where the representative z_{c_i}
changes.

The i-th term is || x_i - z_{c_i} ||^2.

We can choose c_i (the group to which the vector x_i is assigned, which in turn
determines the representative z_{c_i} that is subtracted from x_i) so as to
minimize this term. In other words, we assign x_i to its nearest neighbor among
the representatives.

In notation, this means choosing c_i so that

|| x_i - z_{c_i} || = min_{j = 1,...,k} || x_i - z_j ||

So the expression for J^clust is now

J^clust = ( min_{j = 1,...,k} || x_1 - z_j ||^2 + ... + min_{j = 1,...,k} || x_N -
z_j ||^2 ) / N

In words, this is the mean squared distance from the data vectors to their nearest
representative vector.
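
(_ a minimal numpy sketch of this assignment step; names are mine, with 0-based
group indices rather than the book's 1-based ones)

    import numpy as np

    def assign_to_nearest(x, z):
        # x: N x n array of data vectors, z: k x n array of representatives
        # dists[i, j] = || x_i - z_j ||
        dists = np.linalg.norm(x[:, None, :] - z[None, :, :], axis=2)
        return dists.argmin(axis=1)   # index of the nearest representative, per vector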

- Optimizing the Group Representative With Assignment Fixed

Suppose we fix the assignment of vectors to groups.


Needed: representatives for the groups.

We start with the equation for J^clust:

J^clust = ( || x_1 - z_{c_1} ||^2 + ... + || x_N - z_{c_N} ||^2 ) / N

(_ we know the x_i s and the c_i s. we need to find the z_j s)

we rephrase this as a sum of k terms (the N terms of the original are partitioned
into k sums, one for each group)

J^clust = J_1 + J_2 + ... + J_k, where

J_j = (1/N) * sigma_{i in G_j} || x_i - z_j ||^2

i.e. each J_j is the sum of squared distances between the vectors in group j and
that group's representative, divided by N.

So (to achieve the minimal J^clust) we choose each representative vector z_j such
that J_j is minimized.

(Key conceptual step.) The choice of z_j, the representative of a particular group
j, affects only the term J_j.
i.e. we choose z_j to minimize its distances to the vectors x_i that belong to
group j.

To do this we choose z_j to be the centroid of the group:

z_j = (1 / #G_j) * sigma_{i in G_j} x_i, where # represents the cardinality (of a
set) operator.
(of a set) operator.

So if we fix the group assignments, we choose each group's representative vector
to be the centroid of the vectors in that group.
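
(_ the corresponding numpy sketch for the update step; again the names are mine,
and it assumes no group is empty)

    import numpy as np

    def update_representatives(x, c, k):
        # z_j = centroid (mean) of the vectors currently assigned to group j
        return np.array([x[c == j].mean(axis=0) for j in range(k)])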

4.3 K-Means Algorithm

At first glance it seems as if we can now solve the problem of choosing group
assignments and group representatives to minimize J^clust since we know how to
choose the optimal representatives given assignments and vice versa.

But this is an illusion since the two choices are circular. In the beginning we
know neither.

So we use iteration. We alternate between optimizing the representatives and
optimizing the assignments until the J^clust value stops decreasing.

- Algorithm 4.1

Given vectors x_1 through x_N and an initial list of group representative vectors
z_1 through z_k,
repeat until convergence:
1. Partition the N vectors into k groups. For each vector x_1 to x_N,
assign x_i to the group associated with the nearest representative vector.
2. Update representatives. For each group j = 1 to k, set z_j to be the
mean of the vectors in the group.
(A sketch of the whole loop in code follows.)
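
(_ a bare-bones numpy sketch of Algorithm 4.1, stitching the two steps together;
all names are mine, and an empty group simply keeps its old representative here,
a mild deviation from the 'drop the group' convention in the comments below)

    import numpy as np

    def kmeans(x, z, max_iter=100):
        # x: N x n data array; z: initial k x n representatives
        z = np.array(z, dtype=float)
        c = None
        for _ in range(max_iter):
            # step 1: assign each vector to its nearest representative
            d = np.linalg.norm(x[:, None, :] - z[None, :, :], axis=2)
            c_new = d.argmin(axis=1)
            if c is not None and np.array_equal(c, c_new):
                break        # assignments repeated: converged
            c = c_new
            # step 2: move each representative to its group's centroid
            for j in range(len(z)):
                if np.any(c == j):
                    z[j] = x[c == j].mean(axis=0)
        return c, z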

- Comments And Clarifications

1. Ties in step 1 can be broken by assigning x_i arbitrarily among the tied
groups, say by assigning it to the group with the lowest index j.

2. It is possible in step 1 for one or more groups to be empty, in which case we
drop the group from further consideration.

3. 'Until convergence' can be interpreted in several ways:

    1. If the group assignments found in step 1 are the same in two consecutive
iterations, then the representative calculations in the subsequent steps will be
the same too, and nothing further changes. This is a good point to declare
'converged'.
    2. Or: when the improvement (decrease) in successive J^clust values becomes
very small, we stop.

4. How to choose the initial representatives? There are many sophisticated
methods, but they are beyond the scope of this book.
Two simple possibilities (sketched in code below):
    1. pick the representatives randomly from among the original vectors
    2. start with a random assignment of the N vectors to k groups, and take each
group's mean vector as its representative.
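
(_ both possibilities in numpy; a fragment assuming x is the N x n data array and
k is given. Option 2 can leave a group empty when N is small)

    import numpy as np
    rng = np.random.default_rng(0)

    # option 1: k of the original vectors, chosen at random
    z = x[rng.choice(len(x), size=k, replace=False)]

    # option 2: random assignment first, then each group's mean
    c0 = rng.integers(0, k, size=len(x))
    z = np.array([x[c0 == j].mean(axis=0) for j in range(k)])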

- Convergence

J^clust decreases at each step (until convergence), and there are only finitely
many possible partitions, so the k-means algorithm converges in a finite number of
steps.
However, depending on the initial choice of representatives, the algorithm can and
does converge to different final values of J^clust, with different partitions and
representatives (so in practice, k-means is run many times and the best of the
results is taken).

k-means is a heuristic, and does not guarantee an optimal partition. Still, it is
very useful and widely used.

- Interpreting The Representatives

The representatives z_1 thru z_k associated with a clustering are quite
interpretable.
- Choosing k

Basically, try various values of k and compare the results. Note that the best
J^clust found typically keeps decreasing as k grows, so rather than simply taking
the k with the minimal J^clust, we look for a value beyond which J^clust stops
improving much (_ and that matches domain knowledge?).
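
(_ a sketch of this comparison, reusing the kmeans() and j_clust() sketches from
earlier; the particular k values and restart count are arbitrary)

    import numpy as np
    rng = np.random.default_rng(0)

    for k in (2, 3, 5, 10, 20):
        best = np.inf
        for run in range(10):                  # a few random restarts per k
            z0 = x[rng.choice(len(x), size=k, replace=False)]
            c, z = kmeans(x, z0)
            best = min(best, j_clust(x, c, z))
        print(k, best)   # look for the 'knee' where improvement levels off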

- Complexity

In step 1 of the k-means algorithm, we find the nearest neighbor (among the list
of the k representatives) of each of the N n-vectors.
This takes 3 N k n flops.

In step 2 we average the vectors in each of the k groups. For a cluster with p
vectors, this requires n (p - 1) flops, which we approximate as n p flops.
Since the group sizes sum to N, averaging all clusters requires about N n flops.
This is less than the cost of the first step.

The total cost of the two steps in one iteration is (3k + 1) N n flops; its order
is k N n flops.

Each run of k-means typically takes fewer than a few tens of iterations, and
usually k-means is run a modest number of times, on the order of 10.

So a very rough estimate of the number of flops required to run k-means 10 times
is about 1000 k N n flops.
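
(_ rough worked example with made-up numbers: for N = 100,000 vectors of dimension
n = 100 and k = 10 clusters, this is 1000 * 10 * 10^5 * 10^2 = 10^11 flops; at,
say, 10 Gflops/s that is on the order of ten seconds.)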

4.4 Examples
4.4.1 Image Clustering
4.4.2 Document Topic Discovery

4.5 Applications

Clustering in general, and k-means in particular, have many uses:
- exploratory data analysis on a collection of vectors
- interpreting the (final) group representatives / clusters in domain terms.
Clustering can also be used for more specific tasks:

- Classification
    - Cluster a large number of n-vectors, then label the representatives by
hand. This lets us classify *new* vectors: choose the nearest representative and
assign its label.
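
(_ a minimal sketch of this nearest-representative classifier; names are mine)

    import numpy as np

    def classify(x_new, z, labels):
        # z: k x n representatives; labels[j] is cluster j's hand-assigned label
        j = np.argmin(np.linalg.norm(z - x_new, axis=1))
        return labels[j]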

- Recommendations
    - Suppose each vector gives the number of times a user has listened to each of
n songs. These vectors are usually very sparse: music collections are extremely
large, and a particular user listens to only a very few songs. Clustering reveals
groups of users with similar musical taste.
The i-th entry of a group representative, (z_j)_i, is interpreted as the average
number of times users in group j listened to song i.
This allows us to suggest songs to a user. For each user, find the cluster j her
'listening habits vector' x falls in. Then find songs that she hasn't listened to
but that others in her group listen to most often. Concretely, to recommend five
songs we find the song indices l with x_l = 0 (the user hasn't listened to these
songs) that have the 5 largest values of (z_j)_l (the group's 5 most-listened
songs among those the user has not heard).
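
(_ a sketch of this top-5 rule; function and variable names are mine)

    import numpy as np

    def recommend(x_user, z, n_songs=5):
        # x_user: n-vector of listen counts; z: k x n representatives
        j = np.argmin(np.linalg.norm(z - x_user, axis=1))   # the user's cluster
        unheard = np.flatnonzero(x_user == 0)               # indices l with x_l = 0
        order = np.argsort(z[j, unheard])[::-1]             # group's favorites first
        return unheard[order[:n_songs]]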

- Guessing Missing Entries


    - Suppose we have a collection of vectors, some of which have entries that are
missing or not given.
Suppose these vectors record attributes of people, like age, sex, years of
education, income, number of children, etc.

We first run k-means clustering on the vectors that are complete.

Now consider a vector x with missing entries. We can't find its nearest neighbor,
since some entries are blank.
Instead, we find the group representative closest to x using only the known
entries.

Let K be the set of indices for which x has known entries.
Then find the j that minimizes sigma_{i in K} (x_i - (z_j)_i)^2.
This gives us the closest group representative using only the known attributes.
We then fill in the unknown attributes of x from this nearest group representative.
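
(_ a numpy sketch of this fill-in rule; names are mine, and `known` plays the role
of the index set K)

    import numpy as np

    def fill_missing(x, known, z):
        # x: length-n vector, trustworthy only at the indices in `known`
        # z: k x n representatives from clustering the complete vectors
        d = np.linalg.norm(x[known] - z[:, known], axis=1)  # distance on known entries
        zj = z[np.argmin(d)]                                # closest representative
        filled = zj.copy()
        filled[known] = x[known]                            # keep the entries we have
        return filled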

Exercises
