
Clustering

What's clustering?

Clustering is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.

What's the relation between the data?

Similarity (Distance) Matrix

Paralogs: a family of proteins from the same species

The E-value of every pair of protein sequences, obtained from BLASTp

How to do the job?

Graph-based clustering algorithms


Partitioning Algorithms

Graph-Based Clustering algorithms

What's the basic idea?

Transform the similarity matrix into a graph by applying a threshold

If the similarity of a protein pair is greater than the threshold, there is a high probability that they belong to the same paralog family

Enumerate the subgraphs of a chosen type to obtain the clusters
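The thresholding step can be sketched in a few lines of Python. This is only an illustration: the function name `similarity_to_graph`, the 4x4 matrix, and the threshold value are made up for the example, and the inclusive comparison (`>=`) is an assumption about how ties at the threshold are treated.

```python
def similarity_to_graph(sim, threshold):
    """Return an adjacency dict: i -> set of j whose similarity to i meets the threshold."""
    n = len(sim)
    graph = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i):  # the matrix is symmetric, so the lower triangle is enough
            if sim[i][j] >= threshold:  # assumption: ties at the threshold count as edges
                graph[i].add(j)
                graph[j].add(i)
    return graph


# Hypothetical symmetric similarity matrix for 4 proteins (values are invented).
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.6, 0.1],
    [0.2, 0.6, 1.0, 0.3],
    [0.1, 0.1, 0.3, 1.0],
]
graph = similarity_to_graph(sim, threshold=0.5)
print(graph)
```

Protein 3 ends up with no edges, so any graph-based method applied afterwards would leave it as a singleton.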

Which subgraphs do we use?

Clique method (complete link)

A clique is a completely connected subgraph; each maximal clique in the graph becomes a cluster
All items in a cluster must be within the similarity threshold of each other
Clusters may overlap
Generally produces small but very tight clusters
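One common way to enumerate maximal cliques is the basic Bron-Kerbosch recursion, sketched below. The slides do not say which enumeration procedure was used, so this is a stand-in, and the small graph is hypothetical.

```python
def maximal_cliques(graph):
    """Enumerate all maximal cliques of an undirected graph (basic Bron-Kerbosch)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates that can extend it, x: already-explored vertices
        if not p and not x:
            cliques.append(sorted(r))  # nothing can extend r, so it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return sorted(cliques)


# Hypothetical thresholded graph: a triangle 0-1-2 plus a chain 2-3-4.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
cliques = maximal_cliques(graph)
print(cliques)
```

Items 2 and 3 each appear in two of the resulting clusters, which illustrates why clique clusters may overlap.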

Single-link method

A maximal connected subgraph becomes a cluster
Any item in a cluster must be within the similarity threshold of at least one other item in that cluster
Produces larger but weaker clusters
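Single-link clusters are the connected components of the thresholded graph, which a depth-first traversal finds directly. A minimal sketch, using a hypothetical graph:

```python
def connected_components(graph):
    """Return the connected components of an undirected graph as sorted lists."""
    seen = set()
    clusters = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack = [start]
        seen.add(start)
        component = []
        while stack:  # iterative depth-first search from `start`
            v = stack.pop()
            component.append(v)
            for w in graph[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        clusters.append(sorted(component))
    return clusters


# Hypothetical thresholded graph: a chain 0-1-2, a pair 3-4, and an isolated item 5.
graph = {0: {1}, 1: {0, 2}, 2: {1}, 3: {4}, 4: {3}, 5: set()}
clusters = connected_components(graph)
print(clusters)
```

Items 0 and 2 share a cluster even though they are not directly linked, which is the sense in which single-link clusters are weaker than cliques.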

Examples of clique and single-link methods

Partitioning Algorithms

What's the basic idea?

Perform multidimensional scaling (MDS) to embed the protein sequences into a vector space
Use partitioning algorithms (classic machine learning / data mining algorithms) to generate clusters

Construct a partition of a database D of n objects into a set of k clusters

We use k-means to do the job!

Given k, the number of clusters you expect
Start with an initial assignment of items to clusters, then move items between clusters to obtain an improved partitioning
Each cluster is represented by the center of the cluster
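The loop described above can be sketched as a plain Lloyd-style k-means. The initialization (sampling k input points as starting centers), the fixed seed, and the toy 2-D points are assumptions made for this illustration; the slides do not specify them.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster points (tuples of floats) into k groups; return (assignment, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k randomly chosen points
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest center (squared Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no item moved, so the partition is stable
        assignment = new_assignment
        # Move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return assignment, centers


# Hypothetical 2-D points forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assignment, centers = kmeans(points, k=2)
print(assignment)
```

The cluster labels themselves are arbitrary; only which points share a label is meaningful.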

One small problem of 14 proteins

Similarity matrix (lower triangle; row i lists the similarity of protein i to proteins 0 .. i-1):

 1  .86
 2  .42 .50
 3  .42 .44 .81
 4  .18 .22 .47 .54
 5  .06 .09 .17 .25 .61
 6  .07 .07 .10 .10 .31 .62
 7  .04 .07 .08 .09 .26 .45 .73
 8  .02 .02 .02 .02 .07 .14 .22 .33
 9  .07 .04 .01 .01 .02 .08 .14 .19 .58
10  .09 .07 .02 .00 .02 .02 .05 .04 .37 .74
11  .12 .11 .01 .01 .01 .02 .02 .03 .27 .50 .76
12  .13 .13 .05 .02 .02 .02 .02 .02 .20 .41 .62 .85
13  .16 .14 .03 .04 .00 .01 .00 .02 .23 .28 .55 .68 .76

2-dimensional data points

[Figure: 2-dimensional plot of the 14 proteins; axis ticks run from -1.5 to 1.5]

k=7: {0,1}, {2,3}, {4,5}, {6,7}, {8,9}, {10,11}, {12,13}
k=6: {0,1}, {2,3}, {4,5}, {6,7}, {8,9}, {10,11,12,13}
k=5: {0,1}, {2,3}, {4,5}, {6,7}, {8,9,10,11,12,13}
k=2: {0,1,2,3,4,5,6,7}, {8,9,10,11,12,13}

Clique (.10):
{8,9,10,11,12,13}
{6,7,8,9}
{5,6,7,8}
{0,1,2,3,4}
{0,1,11,12,13}
{2,3,4,5,6}
{4,5,6,7}
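The Clique (.10) result can be checked against the similarity matrix with a short script. It rests on three assumptions: that row i of the triangle lists the similarities of protein i to proteins 0 .. i-1, that the digit runs in the Clique listing are sets of protein indices, and that a pair is linked when its similarity is at least .10. The helper names are invented for this sketch.

```python
from itertools import combinations

# Lower triangle of the 14-protein similarity matrix (row i: proteins 0 .. i-1).
LOWER = [
    [],
    [.86],
    [.42, .50],
    [.42, .44, .81],
    [.18, .22, .47, .54],
    [.06, .09, .17, .25, .61],
    [.07, .07, .10, .10, .31, .62],
    [.04, .07, .08, .09, .26, .45, .73],
    [.02, .02, .02, .02, .07, .14, .22, .33],
    [.07, .04, .01, .01, .02, .08, .14, .19, .58],
    [.09, .07, .02, .00, .02, .02, .05, .04, .37, .74],
    [.12, .11, .01, .01, .01, .02, .02, .03, .27, .50, .76],
    [.13, .13, .05, .02, .02, .02, .02, .02, .20, .41, .62, .85],
    [.16, .14, .03, .04, .00, .01, .00, .02, .23, .28, .55, .68, .76],
]
THRESHOLD = 0.10  # assumption: inclusive threshold

# Groups read from the Clique (.10) listing (interpretation of the digit runs).
REPORTED = [
    {8, 9, 10, 11, 12, 13},
    {6, 7, 8, 9},
    {5, 6, 7, 8},
    {0, 1, 2, 3, 4},
    {0, 1, 11, 12, 13},
    {2, 3, 4, 5, 6},
    {4, 5, 6, 7},
]


def sim(i, j):
    """Look up the similarity of two distinct proteins in the lower triangle."""
    return LOWER[max(i, j)][min(i, j)]


def is_clique(group):
    """Every pair in the group meets the threshold."""
    return all(sim(i, j) >= THRESHOLD for i, j in combinations(group, 2))


def is_maximal(group):
    """No protein outside the group is linked to every member."""
    others = set(range(len(LOWER))) - set(group)
    return not any(all(sim(v, m) >= THRESHOLD for m in group) for v in others)


results = [(sorted(g), is_clique(g), is_maximal(g)) for g in REPORTED]
for group, clique_ok, maximal_ok in results:
    print(group, clique_ok, maximal_ok)
```

Under these assumptions every listed group is a clique that cannot be extended. The script checks the listed groups only; it does not search for cliques missing from the listing.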

Discussion of the clustering algorithms

Comparison

Number of Clusters
Performance

Preciseness? Will biologists be satisfied with the results?

Essential problems

Domain/problem dependent; could we test the results with some mathematical model?
Similarity cannot precisely express homology
Algorithm complexity (enumerating all maximal cliques, as CLIQUE does, is NP-hard)

Some Ideas

Apply probability models to the similarity matrix


More kinds of subgraphs and empirical algorithms
Parallel computing

Results

CLIQUE returned 68 clusters of > 10 members

SINGLE-LINK returned 10 clusters of > 10 members

Results

Sample output from CLIQUE

CLUSTER #55
members: 11
 1: U77095    Bos taurus guanylate cyclase precursor (GC-E) mRNA, complete
 2: S57131    155 kda myosin light chain kinase homolog [cattle, stomach,
 3: Z14089    B.taurus mRNA for extracellular signal-regulated kinase (ERK
 4: AF263826  Bos taurus breed Hereford C-KIT protein mRNA, partial cds
 5: M73836    Bos taurus rhodopsin kinase, complete cds
 6: X71423    B.taurus Tie 1 mRNA
 7: AV611045  AV611045 Bos taurus cDNA, 5' end
 8: X82440    B.taurus mRNA for tau protein kinase II
 9: X54980    B.taurus mRNA for insulin-like growth factor-1 receptor, bet
10: U96722    Bos taurus Cdc42-associated tyrosine kinase ACK-2 mRNA, comp
11: X16086    Bovine mRNA for cGMP-dependent protein kinase (isoform I alp

Results

Sample output from SINGLE-LINK

CLUSTER #1
members: 268
 1: X54962    B.taurus pCK1 mRNA for casein kinase II alpha subunit
 2: L26547    Bos taurus cyclin-dependent kinase 1 (cdk
 3: X02870    Bovine gene for cytokeratin VIb
 4: X16086    Bovine mRNA for cGMP-dependent protein kinase (isoform I alp
 5: X62882    B.taurus mRNA for LECAM-1
 6: X75911    B.taurus mRNA for lung surfactant protein D
 7: X97645    B.taurus MHC class I gene, exon 2 (and joined CDS)
 8: Y09205    B.taurus MHC class 1 protein molecule D18.1
 9: AJ242522  Bos taurus partial stat5A gene, exons 1-4 and joined CDS
10: U97485    Bos taurus transforming growth factor-beta receptor type I (
11: U02292    Bos taurus link protein mRNA, complete cds

Future Work

K-Means clustering.

Human Unigene.

Overlapping clustering:
a. iterative scan algorithm
b. rank removal algorithm

Thank You!
