Clustering

Clustering: hierarchical and k-means
Clustering analysis
Need to define:
    a measure of similarity
    an algorithm for using the measure of similarity to discover natural groups in the data
The number of ways to divide n items into k clusters is approximately k^n/k!
Example (n = 500, k = 10): 10^500/10! ≈ 2.756 × 10^493
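The count quoted above can be checked numerically. The sketch below compares the k^n/k! estimate with the exact number of ways to split n labelled items into k non-empty clusters (a Stirling number of the second kind). The function names are illustrative and not part of the original slides.

```python
from math import comb, factorial

def stirling2(n: int, k: int) -> int:
    """Exact number of ways to split n labelled items into k non-empty clusters."""
    # Inclusion-exclusion: k! * S(n, k) = sum_j (-1)^j * C(k, j) * (k - j)^n
    total = sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))
    return total // factorial(k)

def approx_partitions(n: int, k: int) -> int:
    """The k^n / k! estimate, rounded down to an integer."""
    return k ** n // factorial(k)

exact = stirling2(500, 10)
approx = approx_partitions(500, 10)

digits = str(approx)
print("leading digits of the estimate:", digits[:4])
print("number of digits in the estimate:", len(digits))
print("exact / estimate:", exact / approx)
```

For n much larger than k the two values agree to many significant digits, which is why the simple estimate is good enough to show that exhaustive search over all clusterings is infeasible.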
T.R.Hvidsten and J. Komorowski
Measure of similarity
What is similar? Euclidean distance
[Figure: data points plotted against the axes E1 and E2]
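As a small illustration of the Euclidean distance, the sketch below treats each gene as a vector of measurements (here two values, standing in for the axes E1 and E2). The gene values are made up for the example.

```python
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two equally long measurement vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical expression values of two genes in experiments E1 and E2
gene_a = [1.0, 2.0]
gene_b = [4.0, 6.0]
print(euclidean(gene_a, gene_b))  # sqrt(3^2 + 4^2) = 5.0
```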
Hierarchical clustering
INPUT: n genes/experiments
Consider each gene/experiment as an individual cluster and initiate an n × n distance matrix d
Repeat until a single cluster remains:
    identify the two most similar clusters in d (i.e. the smallest number in d)
    merge the two most similar clusters and update the matrix (i.e. substitute the two clusters with the new cluster)
OUTPUT: A tree of merged genes/experiments (called a dendrogram)
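The steps above can be sketched in a few lines of Python. This is a naive illustration, not an efficient implementation: the function name, the dictionary-based bookkeeping and the toy matrix are assumptions made for the example. Passing `min` as the linkage gives single linkage and `max` gives complete linkage.

```python
def agglomerative(dist, linkage=min):
    """Naive agglomerative clustering on a symmetric n x n distance matrix.

    Returns the merges in order as (members_a, members_b, height).
    """
    clusters = {i: (i,) for i in range(len(dist))}  # cluster id -> member indices
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(dist)
    while len(clusters) > 1:
        # Identify the two most similar clusters (smallest entry in d)
        (a, b), height = min(d.items(), key=lambda kv: kv[1])
        merges.append((clusters[a], clusters[b], height))
        merged = clusters.pop(a) + clusters.pop(b)
        # Update the matrix: drop rows/columns a and b, add the new cluster
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        for c, members in clusters.items():
            d[(c, next_id)] = linkage(dist[i][j] for i in members for j in merged)
        clusters[next_id] = merged
        next_id += 1
    return merges

# Toy distance matrix for four items (made-up values)
toy = [
    [0, 1, 4, 5],
    [1, 0, 3, 6],
    [4, 3, 0, 2],
    [5, 6, 2, 0],
]
for members_a, members_b, height in agglomerative(toy):
    print(members_a, "+", members_b, "at height", height)
```

The list of merges with their heights is exactly the information a dendrogram draws.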

Hierarchical clustering
Intercluster similarity measures: (a) single linkage, (b) complete linkage and (c) average linkage
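The three intercluster measures differ only in how the pairwise distances between the members of two clusters are summarized. A minimal sketch, assuming Euclidean distance between points and using made-up coordinates:

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def single_linkage(A, B):
    """Distance between the two closest members."""
    return min(dist(a, b) for a in A for b in B)

def complete_linkage(A, B):
    """Distance between the two most distant members."""
    return max(dist(a, b) for a in A for b in B)

def average_linkage(A, B):
    """Mean of all pairwise distances between members."""
    return sum(dist(a, b) for a in A for b in B) / (len(A) * len(B))

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(3.0, 0.0), (5.0, 0.0)]
print(single_linkage(A, B))    # pairwise distances are 3, 5, 2, 4 -> 2.0
print(complete_linkage(A, B))  # 5.0
print(average_linkage(A, B))   # 3.5
```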
Example of hierarchical clustering: languages of Europe
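Languages: E = English, N = Norwegian, Da = Danish, Du = Dutch, G = German, Fr = French, Sp = Spanish, I = Italian, P = Polish, H = Hungarian, Fi = Finnish
Distance: the number of the numerals 1 to 10 whose words start with a different letter in the two languages, e.g. d(E, N) = 2, d(E, Du) = 7, d(Sp, I) = 1
Intercluster strategy: SINGLE LINKAGE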
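Iteration 1
       E   N  Da  Du   G  Fr  Sp   I   P   H  Fi
E      0
N      2   0
Da     2   1   0
Du     7   5   6   0
G      6   4   5   5   0
Fr     6   6   6   9   7   0
Sp     6   6   5   9   7   2   0
I      6   6   5   9   7   1   1   0
P      7   7   6  10   8   5   3   4   0
H      9   8   8   8   9  10  10  10  10   0
Fi     9   9   9   9   9   9   9   9   9   8   0
Smallest distance: 1 (I and Fr; tied with N-Da and Sp-I). Merge I and Fr.
[Dendrogram, height axis 1 to 8: I and Fr joined at height 1]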
Iteration 2
        I+Fr     E     N    Da    Du     G    Sp     P     H    Fi
I+Fr       0
E          6     0
N          6     2     0
Da         5     2     1     0
Du         9     7     5     6     0
G          7     6     4     5     5     0
Sp         1     6     6     5     9     7     0
P          4     7     7     6    10     8     3     0
H         10     9     8     8     8     9    10    10     0
Fi         9     9     9     9     9     9     9     9     8     0
Smallest distance: 1 (Da and N; tied with I+Fr and Sp). Merge Da and N.
[Dendrogram: I-Fr and Da-N each joined at height 1]
Iteration 3
        Da+N  I+Fr     E    Du     G    Sp     P     H    Fi
Da+N       0
I+Fr       5     0
E          2     6     0
Du         5     9     7     0
G          4     7     6     5     0
Sp         5     1     6     9     7     0
P          6     4     7    10     8     3     0
H          8    10     9     8     9    10    10     0
Fi         9     9     9     9     9     9     9     8     0
Smallest distance: 1 (I+Fr and Sp). Merge Sp into I+Fr.
[Dendrogram: Sp joins I-Fr at height 1]
Iteration 4
           Sp+I+Fr     Da+N        E       Du        G        P        H       Fi
Sp+I+Fr          0
Da+N             5        0
E                6        2        0
Du               9        5        7        0
G                7        4        6        5        0
P                3        6        7       10        8        0
H               10        8        9        8        9       10        0
Fi               9        9        9        9        9        9        8        0
Smallest distance: 2 (Da+N and E). Merge E into Da+N.
[Dendrogram: E joins Da-N at height 2]
Iteration 5
            E+Da+N  Sp+I+Fr       Du        G        P        H       Fi
E+Da+N           0
Sp+I+Fr          5        0
Du               5        9        0
G                4        7        5        0
P                6        3       10        8        0
H                8       10        8        9       10        0
Fi               9        9        9        9        9        8        0
Smallest distance: 3 (Sp+I+Fr and P). Merge P into Sp+I+Fr.
[Dendrogram: P joins I-Fr-Sp at height 3]
Iteration 6
             P+Sp+I+Fr     E+Da+N         Du          G          H         Fi
P+Sp+I+Fr            0
E+Da+N               5          0
Du                   9          5          0
G                    7          4          5          0
H                   10          8          8          9          0
Fi                   9          9          9          9          8          0
Smallest distance: 4 (E+Da+N and G). Merge G into E+Da+N.
[Dendrogram: G joins Da-N-E at height 4]
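Iteration 7
              G+E+Da+N  P+Sp+I+Fr         Du          H         Fi
G+E+Da+N             0
P+Sp+I+Fr            5          0
Du                   5          9          0
H                    8         10          8          0
Fi                   9          9          9          8          0
Smallest distance: 5 (G+E+Da+N and Du; tied with G+E+Da+N and P+Sp+I+Fr). Merge Du into G+E+Da+N.
[Dendrogram: Du joins Da-N-E-G at height 5]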
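Iteration 8
               Du+G+E+Da+N    P+Sp+I+Fr            H           Fi
Du+G+E+Da+N              0
P+Sp+I+Fr                5            0
H                        8           10            0
Fi                       9            9            8            0
Smallest distance: 5 (Du+G+E+Da+N and P+Sp+I+Fr). Merge the two clusters.
[Dendrogram: the clusters I-Fr-Sp-P and Da-N-E-G-Du joined at height 5]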
Iteration 9
A = P+Sp+I+Fr+Du+G+E+Da+N
       A   H  Fi
A      0
H      8   0
Fi     9   8   0
Smallest distance: 8 (A and H; tied with H-Fi). Merge H into A.
[Dendrogram: H joins the cluster of the other nine languages at height 8]
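Iteration 10
Fi joins at distance 8 (the single-linkage distance to the cluster containing H). All 11 languages are now in one cluster and the distance matrix is reduced to a single entry, 0.
[Final dendrogram, height axis 1 to 8, leaves in order: I, Fr, Sp, P, Da, N, E, G, Du, H, Fi]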
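The example can be replayed with a short single-linkage sketch in plain Python. The lower-triangular matrix below is the iteration 1 distance matrix; the function and variable names are illustrative. Ties are broken by scan order here, so merges of equal height may occur in a different order than on the slides, but the merge heights are the same.

```python
LANGS = ["E", "N", "Da", "Du", "G", "Fr", "Sp", "I", "P", "H", "Fi"]
LOWER = [  # LOWER[i][j] = distance between LANGS[i] and LANGS[j], for j <= i
    [0],
    [2, 0],
    [2, 1, 0],
    [7, 5, 6, 0],
    [6, 4, 5, 5, 0],
    [6, 6, 6, 9, 7, 0],
    [6, 6, 5, 9, 7, 2, 0],
    [6, 6, 5, 9, 7, 1, 1, 0],
    [7, 7, 6, 10, 8, 5, 3, 4, 0],
    [9, 8, 8, 8, 9, 10, 10, 10, 10, 0],
    [9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 0],
]

def d(i, j):
    """Distance between languages i and j, read from the lower triangle."""
    return LOWER[max(i, j)][min(i, j)]

def single_linkage_history():
    """Merge the two closest clusters until one remains; return the merges."""
    clusters = [[i] for i in range(len(LANGS))]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the two closest members
                h = min(d(i, j) for i in clusters[a] for j in clusters[b])
                if best is None or h < best[0]:
                    best = (h, a, b)
        h, a, b = best
        history.append((h, [LANGS[i] for i in clusters[a]],
                        [LANGS[i] for i in clusters[b]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return history

history = single_linkage_history()
for height, left, right in history:
    print(height, "+".join(left), "<->", "+".join(right))
```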
Any data mining result needs to be consistent BOTH with the data and current knowledge!
Evaluation of clusters
Clusters may be evaluated according to how well they describe current knowledge
[Figure: the final dendrogram (height axis 1 to 8; leaves I, Fr, Sp, P, Da, N, E, G, Du, H, Fi) annotated with the language families Roman, Slavic, Germanic and Ugro-Finnish]
Hierarchical clustering: properties

Huge memory requirements: stores the n × n matrix
Running time: O(n^3)
Deterministic: produces the same clustering each time
Nice visualization: dendrogram
Number of clusters can be selected using the dendrogram
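Selecting the number of clusters from a dendrogram amounts to cutting the tree at a chosen height: only merges at or below that height are kept. A minimal sketch, assuming the dendrogram is given as a list of merges `(leaf_in_a, leaf_in_b, height)`; this representation and the toy values are made up for the example.

```python
def cut_dendrogram(leaves, merges, height):
    """Return the clusters obtained by keeping only merges at or below `height`."""
    parent = {leaf: leaf for leaf in leaves}

    def find(x):
        # Follow parent links to the representative of x's cluster
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, h in merges:
        if h <= height:
            parent[find(a)] = find(b)

    groups = {}
    for leaf in leaves:
        groups.setdefault(find(leaf), []).append(leaf)
    return sorted(groups.values())

leaves = ["a", "b", "c", "d", "e"]
merges = [("a", "b", 1), ("c", "d", 2), ("a", "c", 4), ("a", "e", 7)]
print(cut_dendrogram(leaves, merges, 3))  # three clusters
print(cut_dendrogram(leaves, merges, 5))  # two clusters
```

A low cut gives many small clusters and a high cut gives few large ones, so the same tree supports any choice of cluster count.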
K-means clustering
Split the data into k random clusters
Repeat
    calculate the centroid of each cluster
    (re-)assign each gene/experiment to the closest centroid
    stop if no new assignments are made
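The loop above can be sketched directly in Python. This is an illustration under a few assumptions that the slides do not specify: Euclidean distance, a fixed random seed for reproducibility, a cap on the number of iterations, and a random data point as centroid for any cluster that happens to start empty.

```python
import random
from math import dist

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Split the data into k random clusters
    labels = [rng.randrange(k) for _ in points]
    centroids = [None] * k
    for _ in range(max_iter):
        # Calculate the centroid of each cluster
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = centroid(members)
            elif centroids[c] is None:
                centroids[c] = rng.choice(points)  # assumption: seed an empty cluster
        # (Re-)assign each point to the closest centroid
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        # Stop if no new assignments are made
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2, seed=1)
print(labels)
print(centroids)
```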
Example of K-means: two dimensions

[Figure sequence, K = 2, with centroids marked x]
Initial clusters
Iteration 1: calculate centroids, then (re-)assign
Iteration 2: calculate centroids, then (re-)assign
Iteration 3: calculate centroids, then (re-)assign. No new assignments! STOP
K-means: properties
Low memory usage
Running time: O(n) per iteration (for fixed k)
Improves iteratively: not trapped in previous mistakes
Non-deterministic: will in general produce different clusters with different initializations
Number of clusters must be decided in advance
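Because the result depends on the random initialization, a common remedy is to run k-means several times and keep the run with the smallest within-cluster sum of squares. The sketch below is one way to do that; the scoring criterion, the function names and the data are assumptions for illustration, not taken from the slides.

```python
import random
from math import dist

def kmeans(points, k, rng, max_iter=100):
    """One k-means run from a random initial partition; returns (labels, centroids)."""
    labels = [rng.randrange(k) for _ in points]
    centroids = [rng.choice(points) for _ in range(k)]  # fallback for empty clusters
    for _ in range(max_iter):
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

def wcss(points, labels, centroids):
    """Within-cluster sum of squared distances (lower is tighter)."""
    return sum(dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

def best_of_restarts(points, k, runs=10, seed=0):
    """Run k-means `runs` times; return the best run and the score of every run."""
    rng = random.Random(seed)
    scored = []
    for _ in range(runs):
        labels, centroids = kmeans(points, k, rng)
        scored.append((wcss(points, labels, centroids), labels, centroids))
    return min(scored, key=lambda t: t[0]), [score for score, _, _ in scored]

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
(best_score, best_labels, best_centroids), scores = best_of_restarts(points, k=2)
print("scores of all runs:", scores)
print("best labels:", best_labels)
```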
Hierarchical vs. k-means

Hierarchical clustering:
    computationally expensive -> relatively small data sets
    nice visualization, no. of clusters can be selected
    deterministic
    cannot correct early mistakes
K-means:
    computationally efficient -> large data sets
    predefined no. of clusters
    non-deterministic -> should be run several times
    iterative improvement
Hierarchical k-means: top-down hierarchical clustering using k-means iteratively with k=2 -> best of both worlds!
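A minimal sketch of the top-down idea: split the data in two with k-means (k = 2), then split each half again, and so on. To keep the example short and reproducible it makes assumptions that are not in the slides: the data are one-dimensional, each 2-means run is seeded at the smallest and largest value instead of randomly, and splitting stops at clusters of at most `min_size` items.

```python
def two_means(values, max_iter=100):
    """Split 1-D values into two groups with k-means (k = 2), seeded at min and max."""
    c_lo, c_hi = min(values), max(values)
    for _ in range(max_iter):
        # (Re-)assign each value to the closer of the two centroids
        lo = [v for v in values if abs(v - c_lo) <= abs(v - c_hi)]
        hi = [v for v in values if abs(v - c_lo) > abs(v - c_hi)]
        # Recalculate the centroids; stop when they no longer move
        new_lo, new_hi = sum(lo) / len(lo), sum(hi) / len(hi)
        if (new_lo, new_hi) == (c_lo, c_hi):
            break
        c_lo, c_hi = new_lo, new_hi
    return lo, hi

def bisect(values, min_size=2):
    """Recursively split values; returns a nested tuple (a tree) of sorted leaf lists."""
    if len(values) <= min_size or len(set(values)) < 2:
        return sorted(values)
    lo, hi = two_means(values)
    return (bisect(lo, min_size), bisect(hi, min_size))

values = [1.0, 2.0, 10.0, 11.0, 20.0, 21.0, 22.0]
print(bisect(values))
```

The nested result plays the role of the dendrogram, while each split only needs the cheap k-means step.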
Example 1
96 normal and malignant lymphocyte samples
Almost 20 000 cDNA clones
Two sub-clusters of DLBCL (diffuse large B-cell lymphoma) were shown to include patients with significantly different expected survival time!
Alizadeh et al., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature, 403:503-511, 2000.
Example 2
Expression clusters
Iyer et al., The transcriptional program in the response of human fibroblasts to serum, Science, 283(5398): 83-87, 1999.
The mRNA levels of 8613 human genes were measured in fibroblasts at 12 time points from 0 minutes to 24 hours. The 517 genes whose expression changed substantially in response to serum were selected.
Functional clusters
Example 3
Transcriptional profiling of the cell cycle in human fibroblasts using 6,800 genes, sampled every other hour from 0 to 24 hours. Biological processes with a significantly higher representation in certain clusters than would be expected by chance.
Cho et al., Transcriptional regulation and function during the human cell cycle, Nature Genetics, 27: 48-54, 2001.