
Lecture Notes on

Clustering
Laurenz Wiskott
Institut für Neuroinformatik
Ruhr-Universität Bochum, Germany, EU

14 December 2016

Contents

1 Introduction

2 Hard partitional clustering
  2.1 K-means algorithm
  2.2 Davies-Bouldin index

3 Soft partitional clustering
  3.1 Gaussian mixture model
      3.1.1 Introduction
      3.1.2 Isotropic Gaussians
      3.1.3 Maximum likelihood estimation
      3.1.4 Conditions for a local optimum
      3.1.5 EM algorithm
      3.1.6 Practical problems
      3.1.7 Anisotropic Gaussians +
  3.2 Partition coefficient index

4 Agglomerative hierarchical clustering
  4.1 Dendrograms
  4.2 The hierarchical clustering algorithm
  4.3 Validating hierarchical clustering

5 Applications

These lecture notes depend on my lecture notes on vector quantization.

© 2009, 2011, 2014 Laurenz Wiskott (homepage https://www.ini.rub.de/PEOPLE/wiskott/). This work (except for all
figures from other sources, if present) is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.
To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/. Figures from other sources have their
own copyright, which is generally indicated. Do not distribute parts of these lecture notes showing figures with non-free
copyrights (here usually figures I have the rights to publish but you don't, like my own published figures). Figures I do not
have the rights to publish are grayed out, but the word Figure, Image, or the like in the reference is often linked to a pdf.
More teaching material is available at https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/.

1 Introduction
Data^1 are often given as points (or vectors) x_n in a Euclidean vector space and often form groups of points that
are close to each other, so-called clusters (D: Cluster). In data analysis one is, of course, interested in
discovering such structure, a process called clustering.
Clustering algorithms can be classified into hard or crisp clustering, where each point is assigned to
exactly one cluster, and soft or fuzzy clustering, where each point can be assigned to several clusters with
certain probabilities that add up to 1. Another distinction can be made between partitional clustering,
where all clusters are on the same level, and hierarchical clustering, where the clustering is done from fine
to coarse by successively merging points into larger and larger clusters (agglomerative hierarchical clustering),
or from coarse to fine, where the points are successively split into smaller and smaller clusters (divisive
hierarchical clustering). I will discuss clustering algorithms of these different types in turn.

2 Hard partitional clustering


2.1 K-means algorithm
A particularly simple method for clustering is K-means, which is identical to the LBG or generalized
Lloyd algorithm we know from vector quantization, just applied to clustered data. The idea is to represent
each cluster k by a center point ck and assign each data point xn to one of the clusters k, which
can be written in terms of index sets Ck . The center points and the assignment are then chosen such that
the mean squared distance between data points and center points
E := \sum_{k=1}^{K} \sum_{n \in C_k} \| x_n - c_k \|^2     (1)

is minimized. This can be interpreted, for instance, in terms of a reconstruction error. Imagine we replace
each data point by its associated center point. This will lead to an error, which could be quantified by (1). The
task is to minimize this error. There is actually a close link to vector quantization (D: Vektorquantisierung)
here.
To achieve the minimization in practice we split the problem into two phases. First we keep the assignment
fixed and optimize the position of the center points; then we keep the center points fixed and optimize the
assignment.
If the assignment is fixed, it is easy to show that the optimal choice of the center positions is given
by
c_k = \frac{1}{N_k} \sum_{n \in C_k} x_n ,     (2)

which is simply the center of gravity of the points assigned to this cluster.
If the center points are fixed, it is obvious that each point should be assigned to the nearest center
position. Thus, a Voronoi tessellation (D: Dirichlet-Zerlegung) is optimal.
The K-means algorithm now consists of applying these two optimizations in turn until conver-
gence. The initial center locations could be chosen randomly from the data points. A drawback of this and
many other clustering algorithms is that the number of clusters is not determined. One has to decide
on a proper K in advance, or one simply runs the algorithm with several different K-values and picks the
best according to some criterion.
Also note that the result of the algorithm is not necessarily a global optimum of the error func-
tion (1). For instance, imagine two distinct clusters of equal size and K = 4. If in such a situation three
center points are initialized to lie in one cluster and only one lies in the other, the algorithm will optimize this
only locally, with three center points in one cluster and one in the other, and it will not find the better solution
where two center points lie in each cluster. It is therefore advisable to run the algorithm several times
with different initial center locations and pick the best result.
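
As an illustration, here is a minimal Python/NumPy sketch of the two alternating optimization steps; the initialization from randomly chosen data points and the convergence test are simple choices made for this sketch, not prescribed by the algorithm.

import numpy as np

def kmeans(X, K, n_iter=100, rng=None):
    """Minimal K-means: alternate assignment and center update, cf. (1) and (2)."""
    rng = np.random.default_rng(rng)
    centers = X[rng.choice(len(X), size=K, replace=False)]  # initial centers = random data points
    for _ in range(n_iter):
        # assignment step: each point goes to the nearest center (Voronoi tessellation)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center becomes the center of gravity of its points, eq. (2)
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                                for k in range(K)])
        if np.allclose(new_centers, centers):  # no change, converged
            break
        centers = new_centers
    error = np.sum((X - centers[labels])**2)   # error function (1)
    return centers, labels, error

Since the result depends on the initialization, one would typically call this several times and keep the run with the smallest error.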
^1 Important text (but not inline formulas) is set in bold face; one symbol marks important formulas worth remembering; another
marks less important formulas, which I also discuss in the lecture; + marks sections that I typically skip during my lectures.

Figure 1: Examples of a converged K-means algorithm, once with 5 (yellow) center points (left), and
two different runs with 10 center points (middle and right). The data points are drawn in black and the
Voronoi tessellation in red. (Created with DemoGNG 1.5 written by Hartmut Loos and Bernd Fritzke, see
http://www.demogng.de/js/demogng.html for a more recent version, with kind transfer of copyrights in
this figure by the authors.)


CC BY-SA 4.0

2.2 Davies-Bouldin index


(This section is based on (Gan et al., 2007, 17.2.2).)
To evaluate the quality of a clustering a plethora of validity indices have been proposed. One of them
is the Davies-Bouldin index or, for short, the DB index. First define cluster dispersion as
\sigma_k := \sqrt{ \frac{1}{N_k} \sum_{n \in C_k} \| x_n - c_k \|^2 } ,     (3)

which can be interpreted as a generalized standard deviation. Then define cluster similarity of
two clusters as
S_{kl} := \frac{\sigma_k + \sigma_l}{\| c_k - c_l \|} .     (4)

Thus, two clusters are considered similar if they have large dispersion relative to their distance.
A good clustering should be characterized by clusters being as dissimilar as possible. This should apply in
particular to neighboring clusters, because it is clear that distant clusters are dissimilar in any case. Thus,
an overall validation of the clustering can be done by the DB index
V_{DB} := \frac{1}{K} \sum_{k=1}^{K} \max_{l \neq k} S_{kl} .     (5)

The DB index does not systematically depend on K and is therefore suitable for finding the
optimal number of clusters, e.g. by plotting V_{DB} over K and picking a pronounced minimum.
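
The following is a minimal Python sketch of how (3)-(5) could be computed for a given hard clustering; the function and variable names are my own for this example.

import numpy as np

def davies_bouldin(X, labels, centers):
    """Davies-Bouldin index V_DB of a hard clustering, cf. (3)-(5)."""
    K = len(centers)
    # dispersion sigma_k of each cluster, eq. (3)
    sigma = np.array([np.sqrt(np.mean(np.sum((X[labels == k] - centers[k])**2, axis=1)))
                      for k in range(K)])
    V = 0.0
    for k in range(K):
        # similarity S_kl to every other cluster, eq. (4); take the worst (largest) one
        S = [(sigma[k] + sigma[l]) / np.linalg.norm(centers[k] - centers[l])
             for l in range(K) if l != k]
        V += max(S)
    return V / K  # eq. (5)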

3 Soft partitional clustering


3.1 Gaussian mixture model
(This section is based on (Bishop, 1995, 2.6), a book I can highly recommend.)

3.1.1 Introduction

The K-means algorithm is a very simple method with sharp boundaries between the clusters and no particular
characterization of the shape of individual clusters. In a more refined algorithm, one might want to
model each cluster with a Gaussian, capturing the shape of the clusters. This leads naturally to
a probabilistic interpretation of the data as a superposition of Gaussian probability distributions. For
simplicity we first assume that the Gaussians are isotropic, i.e. spherical.

3.1.2 Isotropic Gaussians

If we assume that the Gaussians are isotropic, the probability density function (pdf) of cluster k can
be written as
p(x|k) := \frac{1}{(2\pi\sigma_k^2)^{d/2}} \exp\left( -\frac{\|x - c_k\|^2}{2\sigma_k^2} \right) ,     (6)

where \sigma_k controls the width of the Gaussian. There is also a prior probability P(k) that a data
point belongs to a particular cluster k. The overall pdf for the data is then given by the total probability

p(x) = \sum_{k=1}^{K} p(x|k) P(k) ,     (7)

and the probability density of the data given the model (7) is simply

p(\{x_n\}) = \prod_n p(x_n) .     (8)

3.1.3 Maximum likelihood estimation

The problem now is that we do not know the parameters of the model, i.e. the values of the centers c_k
and the widths \sigma_k of the Gaussians and the probabilities P(k) of the clusters. How could we estimate or
optimize them?
The simple idea is to choose the parameters such that the probability density of the data is
maximized. In other words, we want to choose the model such that the data becomes most probable. This
is referred to as maximum likelihood estimation (D: Maximum-Likelihood-Schätzung), and p(\{x_n\}) as
a function of the model parameters is referred to as the likelihood function.
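
As a concrete illustration, here is a small Python sketch of evaluating the log of the likelihood (8) for given parameters of the isotropic model (6)-(7); the function signature is my own choice for this example.

import numpy as np

def log_likelihood(X, centers, sigmas, priors):
    """log p({x_n}) of an isotropic Gaussian mixture, cf. (6)-(8)."""
    N, d = X.shape
    K = len(centers)
    # p(x_n | k) for all n and k, eq. (6)
    sq_dists = np.array([np.sum((X - centers[k])**2, axis=1) for k in range(K)]).T  # shape (N, K)
    p_x_given_k = np.exp(-sq_dists / (2 * sigmas**2)) / (2 * np.pi * sigmas**2)**(d / 2)
    # total probability p(x_n), eq. (7); sum of logs instead of the product in (8)
    p_x = p_x_given_k @ priors
    return np.sum(np.log(p_x))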

3.1.4 Conditions for a local optimum

A standard method of optimizing a function analytically is to calculate the gradient and set it to zero. I do
not want to work this out here but only state that at a (local) optimum the following equations hold.
c_k = \frac{\sum_n P(k|x_n)\, x_n}{\sum_m P(k|x_m)} ,     (9)

\sigma_k^2 = \frac{1}{d} \, \frac{\sum_n P(k|x_n)\, \|x_n - c_k\|^2}{\sum_m P(k|x_m)} ,     (10)

P(k) = \frac{1}{N} \sum_n P(k|x_n) ,     (11)

where all sums go over all N data points. These equations are perfectly reasonable, as one can see if one
realizes that P(k|x_n) / \sum_m P(k|x_m) can be interpreted as a weighting factor for how much data point x_n
contributes to cluster k. The key function in these equations is P(k|x_n), which according to Bayes' theorem is

P(k|x) = \frac{p(x|k) P(k)}{p(x)}     (12)
       = \frac{p(x|k) P(k)}{\sum_l p(x|l) P(l)} .     (13)

3.1.5 EM algorithm

The problem with equations (9)-(11) is that the parameters on the left-hand side also occur im-
plicitly on the right-hand side. Thus we cannot use these equations directly to calculate the
parameters. However, one can start with some initial parameter values and then iterate through these
equations to improve the estimate. One can actually show that the likelihood increases with each iteration,
if a change occurs. This iterative scheme is referred to as the expectation-maximization algorithm, or
simply EM algorithm. Notice that this is completely different from a gradient ascent method.
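
Below is a minimal Python sketch of the resulting EM iteration for isotropic Gaussians; the initialization and the fixed number of iterations are simple choices of mine, not part of the derivation.

import numpy as np

def em_isotropic_gmm(X, K, n_iter=100, rng=None):
    """EM for an isotropic Gaussian mixture: iterate (9)-(11) using the posteriors (12)-(13)."""
    N, d = X.shape
    rng = np.random.default_rng(rng)
    centers = X[rng.choice(N, size=K, replace=False)]   # initialize centers at random data points
    sigmas = np.full(K, X.std())
    priors = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: posteriors P(k|x_n) via Bayes' theorem, eqs. (12)-(13)
        sq = np.array([np.sum((X - centers[k])**2, axis=1) for k in range(K)]).T
        p_xk = np.exp(-sq / (2 * sigmas**2)) / (2 * np.pi * sigmas**2)**(d / 2)
        post = p_xk * priors
        post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters, eqs. (9)-(11)
        Nk = post.sum(axis=0)
        centers = (post.T @ X) / Nk[:, None]
        sq = np.array([np.sum((X - centers[k])**2, axis=1) for k in range(K)]).T
        sigmas = np.sqrt((post * sq).sum(axis=0) / (d * Nk))
        priors = Nk / N
    return centers, sigmas, priors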

3.1.6 Practical problems

Two problems might occur during optimization. Firstly, one of the Gaussians might focus on just one
data point and become infinitely narrow and infinitely high, leading to a divergence of the likelihood.
Secondly, the method can get stuck in a local optimum and miss the globally optimal solution. In
either case it helps to run the algorithm several times and discard inappropriate solutions.
Another general problem is again that the number of clusters is not determined by the algorithm but
must be chosen in advance. Again, running the algorithm several times with different values of K helps.

3.1.7 Anisotropic Gaussians +

The Gaussian mixture model can be generalized to anisotropic Gaussians, which may be elongated
or compressed in certain directions in space. One can think of a cigar-shaped or a UFO-shaped Gaussian.
In that case one would generalize (6) to

p(x|k) := \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (x - c_k)^T \Sigma_k^{-1} (x - c_k) \right) ,     (14)

with the covariance matrix \Sigma_k playing the role of the width parameter \sigma_k^2 in (6). Note that \Sigma_k is symmetric
and positive semi-definite.
Equations (9) and (11) would stay the same, only (10) would change. Taken together we get the equations
c_k = \frac{\sum_n P(k|x_n)\, x_n}{\sum_m P(k|x_m)} ,     (15)

\Sigma_k = \frac{\sum_n P(k|x_n)\, (x_n - c_k)(x_n - c_k)^T}{\sum_m P(k|x_m)} ,     (16)

P(k) = \frac{1}{N} \sum_n P(k|x_n) .     (17)

Otherwise, the EM algorithm would work just the same.
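
Only the covariance update (16) differs from the isotropic M-step sketched above; a possible Python implementation, assuming `post` is the (N, K) array of posteriors P(k|x_n) from the E-step, could look like this.

import numpy as np

def covariance_update(X, post, centers):
    """Re-estimate the full covariance matrices Sigma_k, eq. (16)."""
    N, d = X.shape
    K = centers.shape[0]
    Nk = post.sum(axis=0)
    Sigmas = np.empty((K, d, d))
    for k in range(K):
        diff = X - centers[k]                              # shape (N, d)
        Sigmas[k] = (post[:, k, None] * diff).T @ diff / Nk[k]
    return Sigmas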

3.2 Partition coefficient index


(This section is based on (Gan et al., 2007, 17.4.1).)
To evaluate a soft clustering result one could try to generalize the DB index (5), e.g. by introducing a
weighting factor P (k|xn ) into (3) and summing over all data points. Another approach is to use only the
cluster membership information contained in P (k|xn ). It is clear that

P(k|x_n) \in [0, 1]   (since it is a probability),     (18)

\sum_k P(k|x_n) = 1   (since x_n has to belong to some cluster)     (19)

\Rightarrow \sum_{k,n} P(k|x_n) = N ,     (20)

\sum_n P(k|x_n) > 0   (since each cluster should contain at least one point).     (21)
n

The partition coefficient index is defined as
V_{PC} := \frac{1}{N} \sum_{k,n} P(k|x_n)^2 .     (22)

While (20) equals N in any case, irrespective of how the data points are assigned to the clusters, the
partition coefficient index, due to the square, lies between 1/K, if all points are assigned with equal
probability to all clusters, and 1, if each point is assigned to exactly one cluster. Thus, V_{PC} = 1
would be optimal and indicate clearly separated clusters.
Notice that the spatial information is taken into account only implicitly, which works only for clustering
models, such as the Gaussian mixture model, that have soft tails. For K-means, this index would always
be one by construction, regardless of whether the clustering is good or not.
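
A one-line Python sketch of (22), assuming `post` is again an (N, K) array of membership probabilities P(k|x_n):

import numpy as np

def partition_coefficient(post):
    """Partition coefficient index V_PC, eq. (22); post has shape (N, K)."""
    return np.mean(np.sum(post**2, axis=1))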

4 Agglomerative hierarchical clustering


(This section is based on (Gan et al., 2007, 7.2).)

4.1 Dendrograms
In agglomerative hierarchical clustering one starts by considering each single data point as a separate
cluster. Then one merges points that are near to each other into clusters, and finally merges
clusters that are near to each other into larger clusters. In the end all points form one big cluster.
Documenting the hierarchical merging process results in a tree-like structure that represents the cluster
structure of the data on all levels from fine (cluster distance slightly greater than 0) to coarse (cluster distance
→ ∞). It can be visualized with a dendrogram. Many algorithms can be viewed in this scheme and differ
only in the definition of what "near to each other" means for clusters.
Let d(x_n, x_m) be the distance between two points and let C_k indicate a cluster of (possibly only one) points x_n.
If we define the distance D(C_k, C_l) between two clusters C_k and C_l as

D_s(C_k, C_l) := \min_{n \in C_k,\, m \in C_l} d(x_n, x_m) ,     (23)

then it depends on the distance between the nearest two points of the two clusters. If we define

D_c(C_k, C_l) := \max_{n \in C_k,\, m \in C_l} d(x_n, x_m) ,     (24)

then the distance depends on the farthest two points of the two clusters, see figure 2. The former distance
measure gives rise to the single-link method, the latter to the complete-link method. These names
come from the idea that you introduce links between all the points in the order of their distance. In the
single-link method, two clusters become merged as soon as they are connected by a single link, which then
naturally has length D_s(C_k, C_l). In the complete-link method, two clusters become merged only if all points
in one cluster have a link to all points in the other cluster. The last link added before the clusters are merged
then naturally has length D_c(C_k, C_l).
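
As a small Python sketch, the two cluster-distance measures (23) and (24) can be computed from all pairwise distances between the points of two clusters; the helper name is my own.

import numpy as np

def cluster_distances(A, B):
    """Single-link D_s (23) and complete-link D_c (24) between point sets A and B."""
    # all pairwise Euclidean distances between points of the two clusters
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return d.min(), d.max()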
Figure 3 illustrates agglomerative hierarchical clustering with the single- and the complete-link method.
Notice that the resulting dendrograms are qualitatively different and that the distances are naturally larger
in the complete-link method.

4.2 The hierarchical clustering algorithm


Producing a dendrogram by agglomerative hierarchical clustering works as follows:

1. Define each data point as a cluster, Ck := {xk }. Represent each one-point cluster as a point on
the abscissa of a graph, the ordinate of which represents cluster distance.


Figure 2: Two different measures of cluster distance. Ds (Ck , Cl ) measures the distance between the two
nearest points of the two clusters and Dc (Ck , Cl ) measures the distance between the two farthest points of
the two clusters. Notice that with Ds cluster C2 would first be merged with C1 , while with Dc it would first

be merged with C3 .
CC BY-SA 4.0

[Figure 3: two panels, single-link method (left) and complete-link method (right); each shows the data points a-e, the order in which links are introduced, and the resulting dendrogram with cluster distance on the vertical axis.]

Figure 3: An example of the single-link and the complete-link method on a data distribution of 5 data
points a-e. The numbers at the links indicate the order in which the clusters are linked. On the left they are
linked by the minimal smallest distance between points of two clusters; on the right they are linked by the
minimal largest distance between points of two clusters. In this example the dendrograms are qualitatively

different. Also the distances are generally larger in the complete-link method.
CC BY-SA 4.0

2. Find the two clusters C_{k'} and C_{l'} that are closest to each other, i.e.

   (k', l') := \arg\min_{k,l} D(C_k, C_l) .     (25)

   Draw vertical lines in the graph on top of each cluster up to the distance of these two closest
   clusters, i.e. up to D(C_{k'}, C_{l'}).
3. Merge the two closest clusters into one, i.e. define a new cluster C_{q'} := C_{k'} ∪ C_{l'} and discard C_{k'}
   and C_{l'}. Rearrange the clusters on the abscissa such that the two new closest ones become neighbors
   (and already connected clusters remain neighbors). Draw a horizontal line connecting the two merged
   clusters.

4. Go to step 2 unless there is only one cluster left, then stop.

Depending on how the distance measure D(C_k, C_l) is defined, this algorithm results in different dendro-
grams and has different intuitive interpretations.
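
In practice one rarely implements this loop by hand; the following Python sketch uses SciPy's standard routines, which implement the single-link and complete-link rules above (the data array X here is just a placeholder).

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(20, 2)                    # placeholder data: 20 points in 2D

# the linkage matrix records which two clusters are merged at which distance
Z_single = linkage(X, method='single')       # single-link, D_s of eq. (23)
Z_complete = linkage(X, method='complete')   # complete-link, D_c of eq. (24)

# draw the two dendrograms side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
dendrogram(Z_single, ax=axes[0])
axes[0].set_title('single-link')
dendrogram(Z_complete, ax=axes[1])
axes[1].set_title('complete-link')
plt.show()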

4.3 Validating hierarchical clustering


(This section is based on (Gan et al., 2007, 17.2.2).)
One way to validate a hierarchical clustering as obtained by the single-link or the complete-link method is to
rerun the clustering with data where noise has been added to each data point. If the clustering tree
remains stable against this perturbation, one can assume that the clustering is robust and meaningful;
if it is not, then one should not trust the clustering result.
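
One simple way to quantify such stability (my own suggestion, not taken from the text) is to compare the cophenetic distances of the original and a noise-perturbed clustering:

import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet

def stability(X, method='single', noise=0.05, rng=None):
    """Correlate cophenetic distances of the original and a noise-perturbed clustering."""
    rng = np.random.default_rng(rng)
    coph_orig = cophenet(linkage(X, method=method))
    X_noisy = X + noise * X.std() * rng.standard_normal(X.shape)
    coph_noisy = cophenet(linkage(X_noisy, method=method))
    # a correlation close to 1 suggests the clustering tree is robust against the perturbation
    return np.corrcoef(coph_orig, coph_noisy)[0, 1]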

5 Applications

Genome-wide expression patterns. In this analysis, the expression of a large number of genes was tested in a
time series in response to a particular protocol. The time series were then clustered and ordered with the help
of a dendrogram. Five groups of genes emerged that are known to be involved in (A) cholesterol biosynthesis,
(B) the cell cycle, (C) the immediate-early response, (D) signaling and angiogenesis, and (E) wound healing
and tissue remodeling. These clusters also contain named genes not involved in these processes and numerous
uncharacterized genes.

Figure: (Eisen et al., 1998, Fig. 1)^1, non-free.

Semantics of words. Words can be clustered based on a large text corpus by defining similarity between
words depending on common context, i.e. if two words co-occur with the same words they are considered
similar, otherwise they are not. Clustering can then reveal semantic similarities.

Figure: (Gries and Stefanowitsch, 2010, Fig. 3)^2, non-free.

Northern Californian shoppers. The habits of Northern Californian shoppers were characterized by a number
of factors, then a clustering was applied and a number of prototypes was identified. Factors with absolute
values greater than 0.25 and greater than 0.5 are set in boldface and highlighted in green, respectively.

Table: (Mokhtarian and Ory, 2007, Tab. 3)^3, copyright status unclear.

References
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press, Oxford, UK.

Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998). Cluster analysis and display of
genome-wide expression patterns. Proc. Natl. Acad. Sci. U.S.A., 95:14863-14868.

Gan, G., Ma, C., and Wu, J. (2007). Data Clustering: Theory, Algorithms, and Applications. ASA-SIAM
Series on Statistics and Applied Probability. SIAM, Philadelphia, PA, USA.

Gries, S. T. and Stefanowitsch, A. (2010). Cluster analysis and the identification of collexeme classes. In
Rice, S. and Newman, J., editors, Empirical and Experimental Methods in Cognitive/Functional Research.
CSLI Publications.

Mokhtarian, P. L. and Ory, D. T. (2007). Shopping-related attitudes: A factor and cluster analysis of
Northern California shoppers. Manuscript downloaded 2016-12-14 from http://www.wctrs-society.com/wp/wp-content/uploads/abstracts/berkeley/D5/149/shoppingAttitudes.bergenfinalwctrsubmit.070417.doc.

Notes
^1 Eisen et al., 1998, Proc. Natl. Acad. Sci. USA 95:14863-14868, Fig. 1, non-free, http://gene-quantification.org/eisen-et-al-cluster-1998.pdf

^2 Gries & Stefanowitsch, 2010, Fig. 3, non-free, http://www.linguistics.ucsb.edu/faculty/stgries/research/2010_STG-AS_ClusteringCollexemes_EmpExpMeth.pdf

^3 Mokhtarian & Ory, 2007, Tab. 3, unclear, http://www.wctrs-society.com/wp/wp-content/uploads/abstracts/berkeley/D5/149/shoppingAttitudes.bergenfinalwctrsubmit.070417.doc

