Clustering analysis
Need to define:
   a measure of similarity
   an algorithm for using the measure of similarity to discover natural groups in the data
The number of ways to divide n items into k clusters: approximately k^n/k!
Example (n = 500, k = 10): 10^500/10! ≈ 2.756 × 10^493
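The count above can be checked with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and the exact count uses the standard inclusion-exclusion formula for Stirling numbers of the second kind, which the slide does not mention.

```python
from math import comb, factorial

def exact_partitions(n: int, k: int) -> int:
    """Exact number of ways to split n items into k non-empty clusters
    (Stirling number of the second kind, inclusion-exclusion formula)."""
    total = sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))
    return total // factorial(k)

def approx_partitions(n: int, k: int) -> int:
    """The slide's approximation k^n / k!, in integer arithmetic
    so that very large values do not overflow a float."""
    return k ** n // factorial(k)

# Small case: the approximation is already close.
print(exact_partitions(10, 3), approx_partitions(10, 3))

# The slide's example: 500 items, 10 clusters.
digits = str(approx_partitions(500, 10))
print(f"{digits[0]}.{digits[1:5]} x 10^{len(digits) - 1}")
```

The point of the example is that exhaustive search over all groupings is out of the question, which is why heuristic algorithms are used instead.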
T.R.Hvidsten and J. Komorowski
Measure of similarity
What is similar? Euclidean distance
[Figure: axes E1 and E2]
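As a small illustration, Euclidean distance between two expression profiles can be computed as below. The function name and the sample values are hypothetical; each profile is assumed to be a sequence of measurements, one per experiment (E1, E2, ...).

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long numeric sequences."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two made-up genes measured in two experiments (E1, E2).
gene_a = (0.0, 0.0)
gene_b = (3.0, 4.0)
print(euclidean(gene_a, gene_b))
```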
Hierarchical clustering
INPUT: n genes/experiments
Consider each gene/experiment as an individual cluster and initiate an n × n distance matrix d
Repeat until a single cluster remains:
   identify the two most similar clusters in d (i.e. the smallest number in d)
   merge the two most similar clusters and update the matrix (i.e. substitute the two clusters with the new cluster)
Hierarchical clustering
Intercluster similarity measures: (a) single linkage, (b) complete linkage and (c) average linkage
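The three intercluster measures can be written down directly from their usual definitions. This is a sketch, not taken from the slides: the function names are illustrative, clusters are assumed to be plain lists of items, and `dist` is any pairwise distance function supplied by the caller.

```python
from itertools import product

def single_linkage(a, b, dist):
    """Distance between the two closest members of clusters a and b."""
    return min(dist(x, y) for x, y in product(a, b))

def complete_linkage(a, b, dist):
    """Distance between the two most distant members of clusters a and b."""
    return max(dist(x, y) for x, y in product(a, b))

def average_linkage(a, b, dist):
    """Mean of all pairwise distances between members of a and b."""
    return sum(dist(x, y) for x, y in product(a, b)) / (len(a) * len(b))

# Made-up one-dimensional example.
abs_diff = lambda x, y: abs(x - y)
left, right = [0, 1], [3, 5]
print(single_linkage(left, right, abs_diff))
print(complete_linkage(left, right, abs_diff))
print(average_linkage(left, right, abs_diff))
```

Single linkage tends to chain clusters together, complete linkage favours compact clusters, and average linkage sits between the two.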
Distance: number of numerals whose first letter differs between two languages, e.g. d(E, N) = 2, d(E, Du) = 7, d(Sp, I) = 1
Intercluster strategy: SINGLE LINKAGE
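The procedure can be sketched in Python for this example. The distance values are the ones shown in the Iteration 1 matrix; the function name and data layout are assumptions made for illustration. Several distances are tied, so the order in which tied pairs are merged may differ from the slides, but the sequence of merge heights does not depend on how ties are broken.

```python
NAMES = ["E", "N", "Da", "Du", "G", "Fr", "Sp", "I", "P", "H", "Fi"]
LOWER = [  # lower triangle of the distance matrix, rows in NAMES order
    [0],
    [2, 0],
    [2, 1, 0],
    [7, 5, 6, 0],
    [6, 4, 5, 5, 0],
    [6, 6, 6, 9, 7, 0],
    [6, 6, 5, 9, 7, 2, 0],
    [6, 6, 5, 9, 7, 1, 1, 0],
    [7, 7, 6, 10, 8, 5, 3, 4, 0],
    [9, 8, 8, 8, 9, 10, 10, 10, 10, 0],
    [9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 0],
]

def agglomerate_single_linkage(names, lower):
    """Return the list of merges as (cluster_a, cluster_b, height)."""
    n = len(names)
    # Expand the lower triangle into a full symmetric matrix.
    d = [[lower[max(i, j)][min(i, j)] for j in range(n)] for i in range(n)]
    clusters = [[name] for name in names]
    merges = []
    while len(clusters) > 1:
        size = len(clusters)
        # Identify the two most similar clusters (smallest off-diagonal entry).
        i, j = min(
            ((i, j) for i in range(size) for j in range(i)),
            key=lambda ij: d[ij[0]][ij[1]],
        )
        merges.append((clusters[j], clusters[i], d[i][j]))
        keep = [k for k in range(size) if k not in (i, j)]
        # Single linkage: distance to the new cluster is the smaller of the
        # distances to the two clusters it replaces.
        new_row = [min(d[i][k], d[j][k]) for k in keep]
        # Substitute the two clusters with the new cluster in the matrix.
        d = [[d[r][c] for c in keep] for r in keep]
        for row, value in zip(d, new_row):
            row.append(value)
        d.append(new_row + [0])
        clusters = [clusters[k] for k in keep] + [clusters[j] + clusters[i]]
    return merges

for a, b, height in agglomerate_single_linkage(NAMES, LOWER):
    print(height, a, "+", b)
```

Replacing `min` in the update step with `max` would give complete linkage under the same bookkeeping.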
Iteration 1
Distance matrix (lower triangle; columns in the same order as the rows):
E     0
N     2   0
Da    2   1   0
Du    7   5   6   0
G     6   4   5   5   0
Fr    6   6   6   9   7   0
Sp    6   6   5   9   7   2   0
I     6   6   5   9   7   1   1   0
P     7   7   6  10   8   5   3   4   0
H     9   8   8   8   9  10  10  10  10   0
Fi    9   9   9   9   9   9   9   9   9   8   0
Merge: I and Fr (distance 1).
[Dendrogram: I and Fr joined at height 1]
Iteration 2
Distance matrix (lower triangle; columns in the same order as the rows):
I-Fr    0
E       6   0
N       6   2   0
Da      5   2   1   0
Du      9   7   5   6   0
G       7   6   4   5   5   0
Sp      1   6   6   5   9   7   0
P       4   7   7   6  10   8   3   0
H      10   9   8   8   8   9  10  10   0
Fi      9   9   9   9   9   9   9   9   8   0
Merge: Da and N (distance 1).
[Dendrogram: I-Fr and Da-N each joined at height 1]
Iteration 3
Distance matrix (lower triangle; columns in the same order as the rows):
Da-N    0
I-Fr    5   0
E       2   6   0
Du      5   9   7   0
G       4   7   6   5   0
Sp      5   1   6   9   7   0
P       6   4   7  10   8   3   0
H       8  10   9   8   9  10  10   0
Fi      9   9   9   9   9   9   9   8   0
Merge: Sp and I-Fr (distance 1).
[Dendrogram: Sp joins I-Fr at height 1]
Iteration 4
Distance matrix (lower triangle; columns in the same order as the rows):
Sp-I-Fr    0
Da-N       5   0
E          6   2   0
Du         9   5   7   0
G          7   4   6   5   0
P          3   6   7  10   8   0
H         10   8   9   8   9  10   0
Fi         9   9   9   9   9   9   8   0
Merge: E and Da-N (distance 2).
[Dendrogram: E joins Da-N at height 2]
Iteration 5
Distance matrix (lower triangle; columns in the same order as the rows):
E-Da-N     0
Sp-I-Fr    5   0
Du         5   9   0
G          4   7   5   0
P          6   3  10   8   0
H          8  10   8   9  10   0
Fi         9   9   9   9   9   8   0
Merge: P and Sp-I-Fr (distance 3).
[Dendrogram: P joins Sp-I-Fr at height 3]
Iteration 6
Distance matrix (lower triangle; columns in the same order as the rows):
P-Sp-I-Fr    0
E-Da-N       5   0
Du           9   5   0
G            7   4   5   0
H           10   8   8   9   0
Fi           9   9   9   9   8   0
Merge: G and E-Da-N (distance 4).
[Dendrogram: G joins E-Da-N at height 4]
Iteration 7
Distance matrix (lower triangle; columns in the same order as the rows):
G-E-Da-N     0
P-Sp-I-Fr    5   0
Du           5   9   0
H            8  10   8   0
Fi           9   9   9   8   0
Merge: Du and G-E-Da-N (distance 5).
[Dendrogram: Du joins G-E-Da-N at height 5]
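Iteration 8
Distance matrix (lower triangle; columns in the same order as the rows):
Du-G-E-Da-N    0
P-Sp-I-Fr      5   0
H              8  10   0
Fi             9   9   8   0
Merge: P-Sp-I-Fr and Du-G-E-Da-N (distance 5).
[Dendrogram: the two large clusters joined at height 5]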
Iteration 9
Distance matrix (lower triangle; columns in the same order as the rows):
P-Sp-I-Fr-Du-G-E-Da-N    0
H                        8   0
Fi                       9   8   0
Merge at distance 8, the smallest remaining distance.
[Dendrogram: merge at height 8]
Iteration 10
The two remaining clusters are at distance 8 and are merged, so all 11 items form a single cluster.
[Dendrogram, complete: leaf order I, Fr, Sp, P, Da, N, E, G, Du, H, Fi; height axis 1 to 8]
Any data mining result needs to be consistent BOTH with the data and current knowledge!
Evaluation of clusters
Clusters may be evaluated according to how well they describe current knowledge
[Dendrogram, leaf order I, Fr, Sp, P, Da, N, E, G, Du, H, Fi, annotated from left to right with the language families Romance, Slavic, Germanic and Finno-Ugric]
K-means clustering
Split the data into k random clusters
Repeat
   calculate the centroid of each cluster
   (re-)assign each gene/experiment to the closest centroid
   stop if no new assignments are made
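The steps above can be sketched in plain Python. The function names, the `seed` and `max_iter` parameters, and the handling of empty clusters are assumptions added for illustration; the slides do not specify them.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance (enough for finding the closest centroid)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=0, max_iter=100):
    """Return (labels, centroids) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Split the data into k random clusters (every cluster starts non-empty
    # as long as there are at least k points).
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    centroids = []
    for _ in range(max_iter):  # safety cap, an assumption of this sketch
        # Calculate the centroid of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                # Assumption: an empty cluster is re-seeded with a random point.
                members = [rng.choice(points)]
            centroids.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
        # (Re-)assign each point to the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Stop if no new assignments are made.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Made-up data: two well-separated groups.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(data, k=2)
print(labels)
print(centroids)
```

Running it with different `seed` values on less clearly separated data is a simple way to see that the result depends on the random initial split.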
Iteration 1: calculate centroids, then (re-)assign
Iteration 2: calculate centroids, then (re-)assign
Iteration 3: calculate centroids, then (re-)assign. No new assignments! STOP
[Figures: one panel per step]
K-means: properties
Low memory usage
Running time: O(n) per iteration, for a fixed number of clusters and dimensions
Improves iteratively: not trapped in previous mistakes
Non-deterministic: will in general produce different clusters with different initializations
Number of clusters must be decided in advance
K-means:
Hierarchical k-means: top-down hierarchical clustering using k-means iteratively with k=2 -> best of both worlds!
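A top-down scheme of this kind might look as follows. This is only a sketch under stated assumptions: the function names, the `min_size` stopping rule and the tree representation (a list of points for a leaf, a pair for a split) are invented for illustration and are not taken from the slides.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_means(points, rng, max_iter=100):
    """Split points into two groups with k-means, k = 2."""
    labels = [i % 2 for i in range(len(points))]
    rng.shuffle(labels)
    for _ in range(max_iter):
        groups = [[p for p, lab in zip(points, labels) if lab == c] for c in (0, 1)]
        if not groups[0] or not groups[1]:
            break  # degenerate split, e.g. all points identical
        cents = [tuple(sum(xs) / len(g) for xs in zip(*g)) for g in groups]
        new = [0 if sq_dist(p, cents[0]) <= sq_dist(p, cents[1]) else 1 for p in points]
        if new == labels:
            break
        labels = new
    return [[p for p, lab in zip(points, labels) if lab == c] for c in (0, 1)]

def hierarchical_kmeans(points, rng, min_size=2):
    """Recursively bisect the data; returns a nested (left, right) tree.

    Assumption: recursion stops when a cluster has at most min_size points
    or cannot be split any further.
    """
    if len(points) <= min_size:
        return points
    left, right = two_means(points, rng)
    if not left or not right:
        return points
    return (
        hierarchical_kmeans(left, rng, min_size),
        hierarchical_kmeans(right, rng, min_size),
    )

# Made-up data: two well-separated groups.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(hierarchical_kmeans(data, random.Random(0), min_size=3))
```

The result is a tree like the one produced by hierarchical clustering, while each individual split keeps the low cost of k-means.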
Example 1
96 normal and malignant lymphocyte samples
Almost 20 000 cDNA clones
Two sub-clusters of DLBCL were shown to include patients with significantly different expected survival time!
Alizadeh et al., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature, 403:503-511, 2000.
Example 2
Expression clusters
Iyer et al., The transcriptional program in the response of human fibroblasts to serum, Science, 283(5398): 83-87, 1999.
The mRNA levels of 8613 human genes were measured in fibroblasts at 12 time points from 0 minutes to 24 hours. 517 genes whose expression changed substantially in response to serum were selected.
Functional clusters
Example 3
Transcriptional profiling of the cell cycle in human fibroblasts using 6,800 genes, measured every other hour from 0 to 24 hours. Biological processes with a significantly higher representation in certain clusters than would be expected by chance.
Cho et al., Transcriptional regulation and function during the human cell cycle, Nature Genetics, 27: 48-54, 2001.
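One common way to ask whether a biological process is over-represented in a cluster is a hypergeometric tail probability. The slides do not say which test was used in this study, so the sketch below is an assumption for illustration only; the function name and all numbers in the usage example are made up.

```python
from math import comb

def enrichment_p_value(total_genes, annotated_total, cluster_size, annotated_in_cluster):
    """Probability of seeing at least `annotated_in_cluster` annotated genes
    in a cluster of `cluster_size` genes drawn at random, without replacement,
    from `total_genes` genes of which `annotated_total` carry the annotation."""
    upper = min(annotated_total, cluster_size)
    tail = sum(
        comb(annotated_total, i) * comb(total_genes - annotated_total, cluster_size - i)
        for i in range(annotated_in_cluster, upper + 1)
    )
    return tail / comb(total_genes, cluster_size)

# Hypothetical numbers: 6800 genes in total, 100 annotated with some process,
# a cluster of 50 genes containing 10 of the annotated ones.
print(enrichment_p_value(6800, 100, 50, 10))
```

A small value suggests that the process is represented in the cluster more often than chance alone would explain. When many processes and clusters are tested, a multiple-testing correction would normally be applied as well.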