Project 4

Clustering
What is clustering?
Clustering is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.
What is the relation between the data?
Similarity (Distance) Matrix
Paralogs: a family of proteins from the same species
The E-value of every pair of protein sequences, from BLASTp
How to do the job?
Graph-based clustering algorithms
Partitioning algorithms
Graph-Based Clustering algorithms
What is the basic idea?
Convert the similarity matrix into a graph by applying a threshold
If the similarity of a protein pair is greater than the threshold, there is a high probability that they belong to the same paralog family
Enumerate the chosen subgraphs to obtain the clusters
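The thresholding step can be sketched as follows. This is a minimal illustration, not the project's code: the function name `threshold_graph`, the 4-protein matrix, and the threshold 0.5 are all made up, and the strict ">" comparison follows the wording on the slide.

```python
from typing import Dict, List, Set


def threshold_graph(sim: List[List[float]], threshold: float) -> Dict[int, Set[int]]:
    """Return an adjacency map with an edge i-j whenever sim[i][j] > threshold."""
    n = len(sim)
    graph: Dict[int, Set[int]] = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):  # the matrix is symmetric, so visit each pair once
            if sim[i][j] > threshold:
                graph[i].add(j)
                graph[j].add(i)
    return graph


# Hypothetical symmetric similarity matrix for 4 proteins (values are invented).
sim = [
    [1.00, 0.80, 0.30, 0.05],
    [0.80, 1.00, 0.60, 0.10],
    [0.30, 0.60, 1.00, 0.70],
    [0.05, 0.10, 0.70, 1.00],
]
graph = threshold_graph(sim, threshold=0.5)
print(graph)
```

Raising the threshold removes edges and yields more, smaller clusters; lowering it merges them.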
Which subgraphs do we use?
Clique method (complete link)
A clique is a completely connected subgraph; each maximal clique in the graph becomes a cluster
Every pair of items in a cluster must be within the similarity threshold
Clusters may overlap
Generally produces small but very tight clusters
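One common way to enumerate maximal cliques is the Bron-Kerbosch algorithm; the slides do not say which enumeration method the project used, so the sketch below is only an assumption. The graph is a made-up 5-node example chosen so that two clusters overlap.

```python
from typing import Dict, List, Set


def maximal_cliques(graph: Dict[int, Set[int]]) -> List[Set[int]]:
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques: List[Set[int]] = []

    def expand(r: Set[int], p: Set[int], x: Set[int]) -> None:
        # r: current clique, p: candidates that extend it, x: already explored
        if not p and not x:
            cliques.append(set(r))  # nothing can extend r, so r is maximal
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p.remove(v)
            x.add(v)

    expand(set(), set(graph), set())
    return cliques


# Hypothetical thresholded graph: two triangles sharing node 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3, 4}, 3: {2, 4}, 4: {2, 3}}
cliques = maximal_cliques(graph)
print(sorted(sorted(c) for c in cliques))
```

Node 2 ends up in both clusters, which is the overlap the clique method allows.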
Single-link method
A maximal connected subgraph becomes a cluster
Any item in a cluster must be within the similarity threshold of at least one other item in that cluster
Produces larger but weaker clusters
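Single-link clusters are the connected components of the thresholded graph, which a breadth-first search can find. The sketch below assumes an adjacency-map representation; the 6-node graph is invented for illustration.

```python
from collections import deque
from typing import Dict, List, Set


def connected_components(graph: Dict[int, Set[int]]) -> List[Set[int]]:
    """Return the connected components of an undirected graph via BFS."""
    seen: Set[int] = set()
    components: List[Set[int]] = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in seen:
                    seen.add(v)
                    component.add(v)
                    queue.append(v)
        components.append(component)
    return components


# Hypothetical thresholded graph: a chain 0-1-2-3 and a separate pair 4-5.
graph = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}}
clusters = connected_components(graph)
print(sorted(sorted(c) for c in clusters))
```

Items 0 and 3 land in the same cluster although they are not directly similar, which is why single-link clusters are larger but weaker.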
Examples of clique and single-link methods
Partitioning Algorithms
What is the basic idea?
Perform multidimensional scaling (MDS) to embed the protein sequences into a multidimensional vector space
Use partitioning algorithms (classic machine learning / data mining algorithms) to generate clusters
Construct a partition of a database D of n objects into a set of k clusters
We use k-means to do the job!
Given k, the number of clusters you expect
Start with an initial assignment of items to clusters, then move items between clusters to obtain an improved partitioning
Each cluster is represented by the center of the cluster
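The k-means loop described above can be sketched in plain Python. This is an illustrative version of the standard Lloyd iteration, not the project's implementation; the 2-D points and the choice of initial centers are hypothetical, standing in for coordinates that MDS would produce.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def kmeans(points: Sequence[Point], centers: Sequence[Point],
           max_iter: int = 100) -> Tuple[List[int], List[Point]]:
    """Alternate between assigning points to the nearest center and recomputing centers."""
    centers = list(centers)
    assignment: List[int] = []
    for _ in range(max_iter):
        # Assignment step: index of the closest center (squared Euclidean distance).
        new_assignment = [
            min(range(len(centers)),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        if new_assignment == assignment:
            break  # no item moved, so the partition is stable
        assignment = new_assignment
        # Update step: each center becomes the mean of its members.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return assignment, centers


# Hypothetical 2-D points forming two well-separated groups; k = 2.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
assignment, centers = kmeans(points, centers=[points[0], points[3]])
print(assignment)
```

The result depends on the initial centers, so in practice the loop is usually run from several starting points.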
One small problem of 14 proteins
Similarity matrix (lower triangle):
.86
.42 .50
.42 .44 .81
.18 .22 .47 .54
.06 .09 .17 .25 .61
.07 .07 .10 .10 .31 .62
.04 .07 .08 .09 .26 .45 .73
.02 .02 .02 .02 .07 .14 .22 .33
.07 .04 .01 .01 .02 .08 .14 .19 .58
.09 .07 .02 .00 .02 .02 .05 .04 .37 .74
.12 .11 .01 .01 .01 .02 .02 .03 .27 .50 .76
.13 .13 .05 .02 .02 .02 .02 .02 .20 .41 .62 .85
.16 .14 .03 .04 .00 .01 .00 .02 .23 .28 .55 .68 .76
2-dimensional data points
[Figure: scatter plot of the 14 proteins as 2-dimensional data points]
k=7: {0,1}, {2,3}, {4,5}, {6,7}, {8,9}, {10,11}, {12,13}
k=6: {0,1}, {2,3}, {4,5}, {6,7}, {8,9}, {10,11,12,13}
k=5: {0,1}, {2,3}, {4,5}, {6,7}, {8,9,10,11,12,13}
k=2: {0,1,2,3,4,5,6,7}, {8,9,10,11,12,13}
Clique (.10): {8,9,10,11,12,13}, {6,7,8,9}, {5,6,7,8}, {0,1,2,3,4}, {0,1,11,12,13}, {2,3,4,5,6}, {4,5,6,7}
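The groups given for Clique (.10) can be checked against the similarity matrix. The sketch below reads those groups as sets of protein indices 0-13 and assumes that row i of the lower triangle holds the similarities of protein i to proteins 0..i-1; both readings are interpretations of the slide. It also assumes an edge means similarity >= 0.10, because with a strict ">" the pairs (2,6) and (3,6), both at .10, would drop out.

```python
from itertools import combinations

# Lower triangle of the similarity matrix, copied from the slide.
LOWER = [
    [.86],
    [.42, .50],
    [.42, .44, .81],
    [.18, .22, .47, .54],
    [.06, .09, .17, .25, .61],
    [.07, .07, .10, .10, .31, .62],
    [.04, .07, .08, .09, .26, .45, .73],
    [.02, .02, .02, .02, .07, .14, .22, .33],
    [.07, .04, .01, .01, .02, .08, .14, .19, .58],
    [.09, .07, .02, .00, .02, .02, .05, .04, .37, .74],
    [.12, .11, .01, .01, .01, .02, .02, .03, .27, .50, .76],
    [.13, .13, .05, .02, .02, .02, .02, .02, .20, .41, .62, .85],
    [.16, .14, .03, .04, .00, .01, .00, .02, .23, .28, .55, .68, .76],
]
N = len(LOWER) + 1  # 14 proteins, indexed 0..13 (assumed indexing)


def sim(i: int, j: int) -> float:
    """Similarity of two distinct proteins, looked up in the lower triangle."""
    if i < j:
        i, j = j, i
    return LOWER[i - 1][j]


def is_clique(members, t: float) -> bool:
    """True if every pair of members has similarity >= t."""
    return all(sim(i, j) >= t for i, j in combinations(members, 2))


def is_maximal_clique(members, t: float) -> bool:
    """True if members form a clique and no outside protein can be added."""
    if not is_clique(members, t):
        return False
    outside = set(range(N)) - set(members)
    return not any(all(sim(v, m) >= t for m in members) for v in outside)


# Groups read from the slide (interpretation).
CLIQUES = [
    {8, 9, 10, 11, 12, 13}, {6, 7, 8, 9}, {5, 6, 7, 8}, {0, 1, 2, 3, 4},
    {0, 1, 11, 12, 13}, {2, 3, 4, 5, 6}, {4, 5, 6, 7},
]
results = [is_maximal_clique(c, 0.10) for c in CLIQUES]
print(results)
```

Under these assumptions every listed group is a maximal clique, and most proteins belong to more than one group, in contrast to the disjoint k-means partitions.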
Discussion of the clustering algorithms
Comparison
Number of Clusters
Performance
Preciseness? Will the biologists be satisfied with the results?
Essential problems
Domain/problem dependent; could we test the results with some mathematical model?
Similarity cannot precisely express homology
Algorithm complexity (finding maximal cliques is NP-hard, and the number of maximal cliques can grow exponentially)
Some Ideas
Apply probability models to the similarity matrix
More kinds of subgraphs and empirical algorithms
Parallel computing
Results
CLIQUE returned 68 clusters of > 10 members
SINGLE-LINK returned 10 clusters of > 10 members
Results
Sample output from CLIQUE

CLUSTER #55
members: 11
 1: U77095    Bos taurus guanylate cyclase precursor (GC-E) mRNA, complete
 2: S57131    155 kda myosin light chain kinase homolog [cattle, stomach,
 3: Z14089    B.taurus mRNA for extracellular signal-regulated kinase (ERK
 4: AF263826  Bos taurus breed Hereford C-KIT protein mRNA, partial cds
 5: M73836    Bos taurus rhodopsin kinase, complete cds
 6: X71423    B.taurus Tie 1 mRNA
 7: AV611045  AV611045 Bos taurus cDNA, 5' end
 8: X82440    B.taurus mRNA for tau protein kinase II
 9: X54980    B.taurus mRNA for insulin-like growth factor-1 receptor, bet
10: U96722    Bos taurus Cdc42-associated tyrosine kinase ACK-2 mRNA, comp
11: X16086    Bovine mRNA for cGMP-dependent protein kinase (isoform I alp
Results
Sample output from SINGLE-LINK

CLUSTER #1
members: 268
 1: X54962    B.taurus pCK1 mRNA for casein kinase II alpha subunit
 2: L26547    Bos taurus cyclin-dependent kinase 1 (cdk
 3: X02870    Bovine gene for cytokeratin VIb
 4: X16086    Bovine mRNA for cGMP-dependent protein kinase (isoform I alp
 5: X62882    B.taurus mRNA for LECAM-1
 6: X75911    B.taurus mRNA for lung surfactant protein D
 7: X97645    B.taurus MHC class I gene, exon 2 (and joined CDS)
 8: Y09205    B.taurus MHC class 1 protein molecule D18.1
 9: AJ242522  Bos taurus partial stat5A gene, exons 1-4 and joined CDS
10: U97485    Bos taurus transforming growth factor-beta receptor type I (
11: U02292    Bos taurus link protein mRNA, complete cds
Future Work
K-Means clustering.
Human Unigene.
Overlapping clustering:
a. iterative scan algorithm
b. rank removal algorithm
Thank You!
