
Clustering: hierarchical and k-means

Clustering analysis
Need to define:
a measure of similarity
an algorithm that uses the measure of similarity to discover natural groups in the data

The number of ways to divide n items into k clusters is approximately k^n / k!
Example: n = 500, k = 10 gives 10^500 / 10! = 2.756 × 10^493
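The count above can be checked with exact integer arithmetic. The following is a minimal sketch, not part of the original slides: the helper names (approx_partitions, exact_partitions, sci) are made up for illustration, and the exact count uses the standard inclusion-exclusion formula for Stirling numbers of the second kind, which the slide's k^n / k! approximates.

```python
from math import comb, factorial

def approx_partitions(n, k):
    """Slide approximation k^n / k!, computed with exact integers."""
    return k**n // factorial(k)

def exact_partitions(n, k):
    """Exact number of ways to split n items into k non-empty clusters
    (Stirling number of the second kind, by inclusion-exclusion)."""
    total = sum((-1)**j * comb(k, j) * (k - j)**n for j in range(k + 1))
    return total // factorial(k)

def sci(x, digits=4):
    """Format a large positive integer in scientific notation (truncated, not rounded)."""
    s = str(x)
    return f"{s[0]}.{s[1:digits]}e{len(s) - 1}"

# 10^500 is too large for a float, so the values stay integers throughout.
print(sci(approx_partitions(500, 10)))   # leading digits 2.755..., exponent 493
print(sci(exact_partitions(500, 10)))
# For small inputs the approximation is visibly rough: 7 exact vs. 8 approximate.
print(exact_partitions(4, 2), approx_partitions(4, 2))
```

Either way the number is far too large to enumerate, which is why clustering relies on heuristic algorithms.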
T.R.Hvidsten and J. Komorowski

Measure of similarity
What is similar? Euclidean distance

[Figure: data points in a plane with axes E1 and E2]
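As a small illustration of Euclidean distance between two profiles measured in experiments E1 and E2, here is a sketch with made-up values; the function name euclidean and the sample genes are assumptions, not data from the slides.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equally long numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical expression values of two genes in experiments (E1, E2).
gene_a = (1.0, 2.0)
gene_b = (4.0, 6.0)
print(euclidean(gene_a, gene_b))  # 5.0
```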

Hierarchical clustering
INPUT: n genes/experiments
Consider each gene/experiment as an individual cluster and initialize an n × n distance matrix d
Repeat until only one cluster remains
    identify the two most similar clusters in d (i.e. the smallest number in d)
    merge the two most similar clusters and update the matrix (i.e. substitute the two clusters with the new cluster)

OUTPUT: A tree of merged genes/experiments (called a dendrogram)
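The loop above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it uses Euclidean distance with single linkage, and for brevity it recomputes cluster distances from the points on every pass instead of maintaining the matrix d. The function name agglomerate and the sample points are made up.

```python
import math
from itertools import combinations

def agglomerate(points):
    """Merge the two closest clusters until one remains.

    Clusters are tuples of point indices. Returns the merges in order as
    (cluster_a, cluster_b, distance) tuples, i.e. the dendrogram in list form.
    """
    clusters = [(i,) for i in range(len(points))]

    def dist(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        merges.append((a, b, dist(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 2.0)]
for a, b, d in agglomerate(points):
    print(a, b, d)
```

The merge distances are the heights at which branches join in the dendrogram.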



Hierarchical clustering
Intercluster similarity measures: (a) single linkage, (b) complete linkage and (c) average linkage
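The three measures differ only in how the pairwise distances between members of two clusters are summarized. A minimal sketch using the usual definitions (closest pair, farthest pair, mean over all pairs); the function names and the sample clusters are assumptions made for illustration.

```python
import math
from statistics import mean

def pairwise(a, b):
    """All distances between a member of cluster a and a member of cluster b."""
    return [math.dist(p, q) for p in a for q in b]

def single_linkage(a, b):
    return min(pairwise(a, b))    # (a) closest pair

def complete_linkage(a, b):
    return max(pairwise(a, b))    # (b) farthest pair

def average_linkage(a, b):
    return mean(pairwise(a, b))   # (c) mean over all pairs

left = [(0.0, 0.0), (1.0, 0.0)]
right = [(3.0, 0.0), (5.0, 0.0)]
print(single_linkage(left, right))    # 2.0
print(complete_linkage(left, right))  # 5.0
print(average_linkage(left, right))   # 3.5
```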


Example of hierarchical clustering: languages of Europe

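Distance: the number of numerals whose first letter differs between two languages, e.g. d(E, N) = 2, d(E, Du) = 7, d(Sp, I) = 1
Languages: E = English, N = Norwegian, Da = Danish, Du = Dutch, G = German, Fr = French, Sp = Spanish, I = Italian, P = Polish, H = Hungarian, Fi = Finnish
Intercluster strategy: SINGLE LINKAGE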

Iteration 1
Distance matrix (lower triangle):

       E   N  Da  Du   G  Fr  Sp   I   P   H  Fi
E      0
N      2   0
Da     2   1   0
Du     7   5   6   0
G      6   4   5   5   0
Fr     6   6   6   9   7   0
Sp     6   6   5   9   7   2   0
I      6   6   5   9   7   1   1   0
P      7   7   6  10   8   5   3   4   0
H      9   8   8   8   9  10  10  10  10   0
Fi     9   9   9   9   9   9   9   9   9   8   0

[Dendrogram, height axis 1 to 8: I and Fr are joined at height 1]


Iteration 2
Distance matrix (lower triangle; columns follow the same order as the rows):

(I,Fr)     0
E          6   0
N          6   2   0
Da         5   2   1   0
Du         9   7   5   6   0
G          7   6   4   5   5   0
Sp         1   6   6   5   9   7   0
P          4   7   7   6  10   8   3   0
H         10   9   8   8   8   9  10  10   0
Fi         9   9   9   9   9   9   9   9   8   0

[Dendrogram, height axis 1 to 8: Da and N are joined at height 1]


Iteration 3
Distance matrix (lower triangle; columns follow the same order as the rows):

(Da,N)     0
(I,Fr)     5   0
E          2   6   0
Du         5   9   7   0
G          4   7   6   5   0
Sp         5   1   6   9   7   0
P          6   4   7  10   8   3   0
H          8  10   9   8   9  10  10   0
Fi         9   9   9   9   9   9   9   8   0

[Dendrogram, height axis 1 to 8: Sp is joined to (I,Fr) at height 1]


Iteration 4
Distance matrix (lower triangle; columns follow the same order as the rows):

(Sp,I,Fr)     0
(Da,N)        5   0
E             6   2   0
Du            9   5   7   0
G             7   4   6   5   0
P             3   6   7  10   8   0
H            10   8   9   8   9  10   0
Fi            9   9   9   9   9   9   8   0

[Dendrogram, height axis 1 to 8: E is joined to (Da,N) at height 2]


Iteration 5
Distance matrix (lower triangle; columns follow the same order as the rows):

(E,Da,N)      0
(Sp,I,Fr)     5   0
Du            5   9   0
G             4   7   5   0
P             6   3  10   8   0
H             8  10   8   9  10   0
Fi            9   9   9   9   9   8   0

[Dendrogram, height axis 1 to 8: P is joined to (Sp,I,Fr) at height 3]


Iteration 6
Distance matrix (lower triangle; columns follow the same order as the rows):

(P,Sp,I,Fr)     0
(E,Da,N)        5   0
Du              9   5   0
G               7   4   5   0
H              10   8   8   9   0
Fi              9   9   9   9   8   0

[Dendrogram, height axis 1 to 8: G is joined to (E,Da,N) at height 4]


Iteration 7
Distance matrix (lower triangle; columns follow the same order as the rows):

(G,E,Da,N)      0
(P,Sp,I,Fr)     5   0
Du              5   9   0
H               8  10   8   0
Fi              9   9   9   8   0

[Dendrogram, height axis 1 to 8: Du is joined to (G,E,Da,N) at height 5]


Iteration 8
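Distance matrix (lower triangle; columns follow the same order as the rows):

(Du,G,E,Da,N)     0
(P,Sp,I,Fr)       5   0
H                 8  10   0
Fi                9   9   8   0

[Dendrogram, height axis 1 to 8: (Du,G,E,Da,N) and (P,Sp,I,Fr) are joined at height 5]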


Iteration 9
Distance matrix (lower triangle; columns follow the same order as the rows):

(P,Sp,I,Fr,Du,G,E,Da,N)     0
H                           8   0
Fi                          9   8   0

[Dendrogram, height axis 1 to 8: H and Fi are joined at height 8]


Iteration 10
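Distance matrix (lower triangle; columns follow the same order as the rows):

(P,Sp,I,Fr,Du,G,E,Da,N)     0
(Fi,H)                      8   0

[Dendrogram, height axis 1 to 8: the two remaining clusters are joined at height 8, giving the leaf order I, Fr, Sp, P, Da, N, E, G, Du, H, Fi]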
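The ten iterations can be replayed programmatically. The sketch below runs single linkage on the distance matrix from Iteration 1, updating the matrix after each merge as in the slides. The function and variable names are made up for illustration, and ties between equally close pairs are broken by list order, so the order of merges at the same height may differ from the slides while the merge heights stay the same.

```python
LABELS = ["E", "N", "Da", "Du", "G", "Fr", "Sp", "I", "P", "H", "Fi"]
LOWER = [  # lower triangle of the Iteration 1 distance matrix
    [0],
    [2, 0],
    [2, 1, 0],
    [7, 5, 6, 0],
    [6, 4, 5, 5, 0],
    [6, 6, 6, 9, 7, 0],
    [6, 6, 5, 9, 7, 2, 0],
    [6, 6, 5, 9, 7, 1, 1, 0],
    [7, 7, 6, 10, 8, 5, 3, 4, 0],
    [9, 8, 8, 8, 9, 10, 10, 10, 10, 0],
    [9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 0],
]

def single_linkage(labels, lower):
    """Return the merges as (merged_cluster, height) pairs."""
    clusters = [(name,) for name in labels]
    d = {c: {} for c in clusters}  # symmetric distance lookup d[a][b]
    for i, row in enumerate(lower):
        for j in range(i):
            d[clusters[i]][clusters[j]] = row[j]
            d[clusters[j]][clusters[i]] = row[j]
    merges = []
    while len(clusters) > 1:
        # Smallest off-diagonal entry; ties go to the earliest pair in the list.
        height, i, j = min(
            (d[a][b], i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters)
            if i < j
        )
        a, b = clusters[i], clusters[j]
        merged = a + b
        rest = [c for c in clusters if c not in (a, b)]
        # Single linkage update: distance to the merged cluster is the minimum.
        d[merged] = {c: min(d[a][c], d[b][c]) for c in rest}
        for c in rest:
            d[c][merged] = d[merged][c]
        clusters = rest + [merged]
        merges.append((merged, height))
    return merges

for cluster, height in single_linkage(LABELS, LOWER):
    print(height, cluster)
```

The printed heights match the dendrogram: three merges at height 1, then 2, 3, 4, two at 5 and two at 8.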


Any data mining result needs to be consistent BOTH with the data and current knowledge!


Evaluation of clusters
Clusters may be evaluated according to how well they describe current knowledge
[Figure: final dendrogram (height axis 1 to 8; leaves I, Fr, Sp, P, Da, N, E, G, Du, H, Fi) annotated with the language families Romance, Slavic, Germanic and Finno-Ugric]

Hierarchical clustering: properties


Huge memory requirements: stores the n × n matrix
Running time: O(n^3)
Deterministic: produces the same clustering each time
Nice visualization: dendrogram
Number of clusters can be selected using the dendrogram

K-means clustering
Split the data into k random clusters
Repeat
    calculate the centroid of each cluster
    (re-)assign each gene/experiment to the closest centroid
    stop if no new assignments are made
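The steps above translate almost directly into code. This is a minimal sketch, not a reference implementation: the names kmeans and centroid, the seed and max_iter parameters, and the rule for re-seeding an empty cluster with a random data point are all assumptions added for illustration, and the sample points are made up.

```python
import math
import random

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Split the data into k random clusters.
    labels = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        # Calculate the centroid of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random data point.
            centroids.append(centroid(members) if members else rng.choice(points))
        # (Re-)assign each point to the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Stop if no new assignments are made.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Changing seed changes the initial random split, which is the source of the non-determinism of k-means.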


Example of K-means: two dimensions


Initial clusters K=2


Iteration 1
Calculate centroids


Iteration 1
(Re-)assign


Iteration 2
Calculate centroids

Iteration 2
(Re-)assign


Iteration 3
Calculate centroids


Iteration 3
(Re-)assign
No new assignments! STOP


K-means: properties
Low memory usage
Running time: O(n) per iteration for a fixed number of clusters k
Improves iteratively: not trapped in previous mistakes
Non-deterministic: will in general produce different clusters with different initializations
Number of clusters must be decided in advance

Hierarchical vs. k-means


Hierarchical clustering:
computationally expensive -> relatively small data sets
nice visualization, number of clusters can be selected
deterministic
cannot correct early mistakes

K-means:
computationally efficient -> large data sets
predefined number of clusters
non-deterministic -> should be run several times
iterative improvement

Hierarchical k-means: top-down hierarchical clustering using k-means iteratively with k=2 -> best of both worlds!
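One way to read this idea is as a recursive split: run k-means with k=2, then repeat on each half. The slides give no details, so everything specific in the sketch below is an assumption: the function names, starting each split from two randomly sampled points as centroids, and stopping once a cluster has at most min_size members. Internal nodes are returned as pairs (tuples) and leaves as lists of points.

```python
import math
import random

def two_means(points, rng, max_iter=100):
    """Split points into two groups with k-means (k=2)."""
    centroids = rng.sample(points, 2)  # assumption: start from two sampled points
    groups = ([], [])
    for _ in range(max_iter):
        groups = ([], [])
        for p in points:
            closer_to_first = math.dist(p, centroids[0]) <= math.dist(p, centroids[1])
            groups[0 if closer_to_first else 1].append(p)
        if not groups[0] or not groups[1]:
            break  # degenerate split, e.g. all points identical
        new_centroids = [
            tuple(sum(coords) / len(g) for coords in zip(*g)) for g in groups
        ]
        if new_centroids == centroids:
            break  # centroids unchanged, so assignments are stable
        centroids = new_centroids
    return groups

def hierarchical_kmeans(points, min_size=2, seed=0, rng=None):
    """Top-down tree: internal nodes are (left, right) tuples, leaves are lists."""
    if rng is None:
        rng = random.Random(seed)
    if len(points) <= min_size:
        return points
    left, right = two_means(points, rng)
    if not left or not right:
        return points
    return (
        hierarchical_kmeans(left, min_size, rng=rng),
        hierarchical_kmeans(right, min_size, rng=rng),
    )

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(hierarchical_kmeans(points))
```

The result is a tree like a dendrogram, but each split costs only a k-means run on the points in that branch, so no n × n matrix is needed.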


Example 1
96 normal and malignant lymphocyte samples
Almost 20 000 cDNA clones
Two sub-clusters of diffuse large B-cell lymphoma (DLBCL) were shown to include patients with significantly different expected survival time!

Alizadeh et al., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature, 403:503-511, 2000.


Example 2
Iyer et al., The transcriptional program in the response of human fibroblasts to serum, Science, 283(5398): 83-87, 1999.

The mRNA levels of 8613 human genes were measured in fibroblasts at 12 time points from 0 minutes to 24 hours. The 517 genes whose expression changed substantially in response to serum were selected.

[Figure panels: Expression clusters; Functional clusters]


Example 3
Transcriptional profiling of the cell cycle in human fibroblasts using 6,800 genes, measured every other hour from 0 to 24 hours.
[Figure: biological processes with a significantly higher representation in certain clusters than would be expected by chance]
Cho et al., Transcriptional regulation and function during the human cell cycle, Nature Genetics, 27: 48-54, 2001.
