
ProClust:
Improved clustering of protein sequences with an extended graph-based approach

Ying Jin, Jonathan Michael Nowacki


Nov. 21, 2003
What is in this presentation
• Papers:
– SCOP: a Structural Classification of Proteins database
– Clustering protein sequences – structure prediction by transitive homology
– Improved clustering of protein sequences with an extended graph-based approach
Part I

SCOP: a Structural Classification of Proteins database
The main idea
• A database that provides a detailed and
comprehensive description of all known
protein structures
What is novel
• The distinction between evolutionary
relationships and those that arise from the
physics and chemistry of proteins
• The classification of proteins in SCOP has
been constructed by visual inspection and
comparison of structures. This is believed to be
better than purely automatic methods.
The organizational basics
• Proteins are organized at three main levels
– Family: close evolutionary relationships
• Based on one of two criteria that imply a common
evolutionary origin: significant sequence similarity, or
functional/structural similarity.
– Super-family: distant evolutionary relationships
• Families with low sequence identity, but whose structures and, in many cases,
functional features suggest that a common evolutionary origin is
probable, e.g. the variable and constant domains of immunoglobulins.
– Fold: geometrical relationships
• Proteins share a fold if they have the same major secondary structures in the
same arrangement and with the same topological connections.
– Others: classes, domains, PDB entries, literature references
More on folds: the classes they are grouped into
All-alpha: essentially all alpha
All-beta: essentially all beta
Alpha/beta: mix of alpha and beta
Alpha + beta: helices and strands are
segregated
Multi-domain: no known homologues
PDB at a Glance
The PDB structure entries, consisting of a collection of files
having nondescript names, cannot be easily grasped in a
biochemically meaningful context. Manually organizing the
structures based on the descriptive information in the files is
becoming less and less practical as the database expands. A
chemically or biologically meaningful context can be provided by
the user in the form of a search keyword (e.g. hemoglobin), but
the range of available contexts cannot be predetermined from the
database itself--users must know, in general, what they are
looking for. Although searching is an extremely useful approach
for locating specific PDB entries, the scope of the database is
best ascertained by browsing a set of predetermined contexts.
Useful contexts include molecular classes (e.g. "cytochrome"),
secondary/tertiary structural classes (e.g. "globin fold"), functional
classes (e.g. "binding protein"), species of origin, and
experimental determination method. The descriptive information
in the PDB files is distributed between a set of fields (e.g.
"HEADER").
Other advantages of PDB
• The PDB entry viewer links PDB entries to
various graphical views, external databases
and SCOP itself.
• Links to
– images of structure
– Interactive molecular views
– Atomic co-ordinates
– Data on functional conformational changes
– Sequence data
– Homologues
– MEDLINE abstracts
Access Methods
Main url:
http://scop.mrc-lmb.cam.ac.uk/scop/index.html
Numerous mirrors:
Europe
East Coast USA
Japan
Israel
Taiwan
China
Australia
The Root Down Method:
Example Pic
Chime
Search Engine
3d Search
In Conclusion
• SCOP is an easy way to access data and
images.
• SCOP has a powerful general-purpose
interface to the PDB
• Excellent overview of the diversity of
protein structures which can aid
researchers and students alike.
Part II

Clustering protein sequences – structure prediction by transitive homology
Main Idea of the Paper
A graph-based clustering approach using
transitivity; handling multi-domain proteins and
cluster comparison algorithms.
- All pair-wise similarities for the sequences in the
SwissProt database were determined using the
Smith-Waterman local alignment algorithm
- The data were transformed into a directed graph
vertices – protein sequences
directed edges – from sequence A to B if
score(A, B) / score(A, A) > T
- The clustering process uses transitivity;
SCOP was used as an evaluation data set
Motivation
• Finding the three-dimensional structure of proteins
is one of the fundamental problems in molecular
biology.
• X-ray diffraction analysis can’t keep up with the
ever-increasing speed at which proteins are
sequenced.
• Desirable method: predict structure from the
sequence data. The main idea:
sequence similarity => homology
=> similar structure
=> similar function
(Note: the same structure or function does not imply a common
ancestor)
Motivation (cont.)
• The sequence-similarity relation is obtained
by pair-wise alignment.
• Rule of thumb for the threshold T: 30% identity over
the aligned regions
• A widely accepted approach:
– Score(A, B) > T implies structural similarity of
sequences A and B
– This is a sufficient, but not a necessary, condition
Example:
Histogram of pair-wise alignment scores for all pairs
from the same super-family in the SCOP1 data set
• Goal: detect those distant homologues,
bringing light into the so-called twilight zone
of low similarity.

? What other criteria can be used to identify remote homologues
Graph-based Approach
• A graph-based clustering approach using
the transitivity concept.

Transitivity
• In mathematics: if A=B and B=C then A=C
• In biology: for given three sequences A, B and C, if A
and B as well as B and C have a common ancestor,
then A and C have a common ancestor
Use of Transitivity
• The concept of
transitivity can be
used to detect remote
homologues.

However
- It is not fully understood if transitivity
always holds and whether transitivity
can be extended ad infinitum.
- Multi-Domain Problem
Multi-Domain Problem

If an undirected graph is used, the solid black edges provide a path from #1 to #4.
In the directed case, the grey edges avoid this possible problem.
Algorithm (1)
• Computing pair-wise similarities
– A complete undirected graph G
– Given an edge between sequences P and Q,
the weight of the edge = raw(P, Q), where
raw(P, Q) is the raw Smith-Waterman local
alignment score

As mentioned above, this approach suffers from the multi-domain
problem: the unwanted ‘bridges’ connecting clearly unrelated proteins
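A minimal sketch of how a raw local alignment score such as raw(P, Q) could be computed. The scoring here (a fixed match/mismatch value and a linear gap penalty) is an assumption chosen to keep the example short; the work described in the slides uses a substitution matrix and separate gap opening and extension penalties, which this sketch does not reproduce.

```python
def smith_waterman_score(p, q, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings p and q.

    Toy scoring (assumption): +match for identical residues, mismatch
    otherwise, and a linear gap penalty. Only the score is computed,
    not the alignment itself.
    """
    prev = [0] * (len(q) + 1)
    best = 0
    for i in range(1, len(p) + 1):
        curr = [0] * (len(q) + 1)
        for j in range(1, len(q) + 1):
            sub = match if p[i - 1] == q[j - 1] else mismatch
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0,
                          prev[j - 1] + sub,   # align p[i-1] with q[j-1]
                          prev[j] + gap,       # gap in q
                          curr[j - 1] + gap)   # gap in p
            best = max(best, curr[j])
        prev = curr
    return best


# The score is symmetric, and the self-score grows with sequence length.
print(smith_waterman_score("ACDEFG", "XXCDEXX"))
print(smith_waterman_score("ACDE", "ACDE"))
```

Because only the previous row is kept, memory use is linear in the length of q, while the running time remains proportional to len(p) * len(q).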
Algorithm (2)
• Directing the edges
– Aims to solve the multi-domain problem:
there has to be a difference in length between sequences if
multi-domain proteins cause a problem.

G → Gd

Note: the raw self-similarity score raw(P, P) is approximately
proportional to the length of P
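A small sketch of the directing step, assuming the edge weight from P to Q is raw(P, Q) / raw(P, P), in line with the score(A, B) / score(A, A) > T criterion given earlier. The nested-dictionary layout, the function name `direct_edges`, and the scores are hypothetical and chosen for illustration only.

```python
def direct_edges(raw):
    """Turn symmetric raw scores into directed, length-normalized weights.

    raw[p][q] holds the raw score between p and q, and raw[p][p] the
    self-score of p (layout is an assumption for this example).
    """
    weights = {}
    for p, row in raw.items():
        self_score = row[p]
        for q, score in row.items():
            if q != p:
                weights[(p, q)] = score / self_score
    return weights


# Hypothetical scores: S is a short single-domain protein, M a long
# multi-domain protein that contains a domain similar to S.
raw = {
    "S": {"S": 100, "M": 90},
    "M": {"M": 300, "S": 90},
}
weights = direct_edges(raw)
print(weights[("S", "M")], weights[("M", "S")])
```

The same raw score gives a high weight from the short protein to the long one and a low weight in the opposite direction, which is what lets a threshold remove one direction of a potential bridge.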
Algorithm (3)
• Clustering in a threshold graph
– Remove all the edges from Gd with w(P, Q) < T,
resulting in the graph Gd(T)
– Use SCCs as clusters
Definition 1 of SCC: In a directed graph G, a
Strongly Connected Component (SCC) is a
maximal set C of nodes of G, such that for every
pair of nodes p and q in C there is a directed
path in G from p to q and one from q to p.
– Complexity: O(n + e), where n is the number of
nodes and e the number of edges
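The thresholding and SCC steps can be sketched as follows. Kosaraju's two-pass algorithm is used here because it is easy to write iteratively and runs in O(n + e); the slides do not say which SCC algorithm the authors used, so this choice, the function names, and the example weights are all assumptions.

```python
from collections import defaultdict


def threshold_graph(weights, t):
    """Keep only the directed edges whose weight is at least t."""
    return [(p, q) for (p, q), w in weights.items() if w >= t]


def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm (iterative); returns a list of sorted node lists."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for p, q in edges:
        fwd[p].append(q)
        rev[q].append(p)

    # Pass 1: record nodes by DFS finishing time on the forward graph.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(fwd[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing finishing time.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = [], [start]
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in rev[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack.append(nxt)
        components.append(sorted(component))
    return components


# Hypothetical weights: A and B point at each other strongly, while the
# link between B and C survives in one direction only.
weights = {("A", "B"): 0.8, ("B", "A"): 0.7, ("B", "C"): 0.9, ("C", "B"): 0.2}
edges = threshold_graph(weights, 0.5)
clusters = sorted(strongly_connected_components(["A", "B", "C"], edges))
print(clusters)
```

In this example C stays in its own cluster because the edge from C back to B falls below the threshold, so there is no directed path from C to B.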
An example of a SCC in SwissProt
The grey nodes are not
part of the SCC, but
are clearly related.

There is no edge between nodes
P03480 and P03475; they are
linked through transitivity.

Threshold = 32%
Implementation and Evaluation
• The algorithm is implemented in C++
• Own implementation of the Smith-Waterman
local alignment algorithm for
computing sequence similarity.
• The substitution matrix: BLOSUM80
• Gap opening penalty (gop) = 90
Gap extension penalty (gep) = 9
Data (1)
• SwissProt (SP): all sequences with fewer
than 40 amino acids (a.a.) were excluded,
resulting in a set of 86494 protein
sequences
– The total running time for the pair-wise Smith-
Waterman alignment was on the order of 14000
cpu-days
• The evaluation data: SCOP database
– Three levels are used: family, super-family and fold
Data (2)
• SCOP1: set of 2692 sequences
– Contains all non-identical sequences from SCOP
– No sequences shorter than 40 a.a.
– No sequences from classes 8, 9, 10
– 65464 pairs of homologue sequences; i.e. pairs where both
sequences are in the same super-family and 3556622 pairs
where the sequences are in distinct super-families.
• SCOP1 + SP: 85961 sequences
– All sequences are from SCOP1 and SwissProt
• SCOP2: 609 randomly chosen sequences from
SCOP
– Including sequences shorter than 40 a.a.
– no sequences from classes 8, 9, 10.
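The pair counts quoted for SCOP1 can be checked with a few lines of arithmetic: 2692 sequences give a fixed number of unordered pairs, and the homologue and non-homologue counts should add up to it. The variable names are illustrative.

```python
import math

n_sequences = 2692
same_superfamily_pairs = 65464
distinct_superfamily_pairs = 3556622

# Number of unordered pairs among n sequences: n * (n - 1) / 2.
total_pairs = math.comb(n_sequences, 2)

print(total_pairs)
print(same_superfamily_pairs + distinct_superfamily_pairs == total_pairs)
```

The two counts add up exactly to the number of pairs, so only a small fraction of all pairs (under 2%) are homologue pairs in this data set.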
Performance measure
• Sensitivity: the
proportion of homologue pairs
that are identified

• Specificity: the proportion of
correct pairs among the pairs
predicted to be homologues
(i.e. one minus the proportion of errors)

Sens = spec = 1 is the most desirable performance
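A sketch of how both measures could be computed from a clustering, assuming the pair-based definitions sens = TP / (TP + FN) and spec = TP / (TP + FP). The slides do not show the formulas, so these definitions are an assumption, chosen because they make sens = spec = 1 the ideal outcome. The function name, the labels, and the cluster assignments are hypothetical.

```python
from itertools import combinations


def pair_metrics(cluster_of, superfamily_of):
    """Return (sensitivity, specificity) over all unordered sequence pairs.

    A pair is predicted homologous when both sequences share a cluster,
    and truly homologous when both share a super-family.
    """
    tp = fp = fn = 0
    for a, b in combinations(sorted(cluster_of), 2):
        predicted = cluster_of[a] == cluster_of[b]
        actual = superfamily_of[a] == superfamily_of[b]
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tp / (tp + fp) if tp + fp else 1.0
    return sens, spec


# Hypothetical data: a, b, c belong to super-family X and d to Y,
# but the clustering groups c with d.
superfamily_of = {"a": "X", "b": "X", "c": "X", "d": "Y"}
cluster_of = {"a": 0, "b": 0, "c": 1, "d": 1}
sens, spec = pair_metrics(cluster_of, superfamily_of)
print(sens, spec)
```

Here one of three homologue pairs is recovered and one of two predicted pairs is wrong, giving a sensitivity of 1/3 and a specificity of 1/2.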
Discussion (1)
Threshold = 32
sens = 55.6%
spec = 100%
TP due to intermediate linking = 8%
Noise floor lifting off at threshold 23

Figure: sensitivity, specificity and the percentage of indirectly linked true positives versus clustering threshold for the SCOP1 data set
Discussion (2)
Threshold = 32
sens = 57.9%
spec = 99.8%
Indirect TP = 11.6%

Compared with SCOP1 alone:
absolute increase in sens = 2.3 percentage points
relative increase in sens = 4.1%
absolute increase in indirect TP = 3.6 percentage points

The noise floor is higher

Figure: sensitivity, specificity and the percentage of indirectly linked true positives versus clustering threshold for the SCOP1 + SP data set
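The increases quoted for SCOP1 + SP follow from the SCOP1 figures at the same threshold (sens = 55.6%, indirect TP = 8%). A short worked check, with illustrative variable names:

```python
sens_scop1, sens_scop1_sp = 55.6, 57.9
indirect_scop1, indirect_scop1_sp = 8.0, 11.6

# Absolute change is a difference of percentages (percentage points).
abs_sens = sens_scop1_sp - sens_scop1
abs_indirect = indirect_scop1_sp - indirect_scop1

# Relative change is the absolute change divided by the starting value.
rel_sens = 100 * abs_sens / sens_scop1

print(round(abs_sens, 1), round(rel_sens, 1), round(abs_indirect, 1))
```

Keeping the two kinds of increase apart matters when comparing methods: a gain of 2.3 percentage points is a relative gain of roughly 4%.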
Discussion (3)
Figure panels: SCOP1; SCOP1 + SP. Label: # of SCOP super-families.
Total number of SCC clusters, and SCC clusters of sizes
1, 2-5, 6-10 etc. for varying thresholds from 25 to 50%.
Discussion (4)
Comparison with the algorithm by Arvestad: it employs only pair-wise sequence comparisons; their approach uses a more involved scoring method, optimized substitution matrices, and gap penalties to achieve a substantial improvement over straightforward pair-wise sequence comparisons.

24% better sensitivity at virtually equal specificity.

Figure: sensitivity versus specificity for the SCOP2 data set
on the fold, super-family and family level
Links
• http://promoter.mi.uni-koeln.de/~proclust/
Part III

Improved clustering of protein sequences with an extended graph-based approach
The goal
• To detect structural homology through
sequence similarity, by increasing
sensitivity through transitive homology
heuristics.
Some Alternatives
Alternative approaches using the concept of transitivity for large-scale
analysis of protein sequences

• Iterated BLAST or FASTA searches for computing clusters,
which are subsequently merged and processed further. Does
not explicitly deal with multi-domain problems.
• ProtoMap: a graph-based approach that uses a combination of
BLAST, FASTA and Smith-Waterman E-values to create a
hierarchy of clusters. Has problems with multi-domain
proteins, which cause cluster splitting.
• All-against-all BLAST search, ignoring all hits below a
specified threshold, yielding a (0,1) similarity matrix. Extensive
post-processing is required to symmetrize the matrix and to
deal with multi-domain proteins.
• Clusters of orthologous groups (COGs), built starting with
proteins from seven different species. Tries to compensate for
multi-domain proteins with an iterative merging process.
A new solution
• The extended graph-based approach is
designed to provide clustering as an
aid in finding remote homologues;
the multi-domain problem is directly
addressed, although it is not fully
solved. Sensitivity is increased
without a significant loss of
specificity.
Different Symmetries
• Symmetric similarity
– Does not distinguish between two proteins being
globally similar and one protein being similar to an
individual domain of a multi-domain protein. Can
lead to incorrect links.
• Asymmetric similarity
– Can be employed to distinguish between global and
non-global similarity.
Limiting factors
• Large random similarities can cause super-
clusters, which will connect large parts of
the sequence space. This can be
countered by using more stringent criteria.
• Multi-domain proteins
– Domains are the compact semi-independent
structural units of proteins, which often appear
highly conserved in a number of multi-domain
proteins.
An example run
• The dataset
– SCOP v1.53
• All sequences with fewer than 40 amino acids were removed
• Filtered for low complexity regions using seg with the parameters
of 12, 1.8, 2.0 –x
• Sequences containing masked amino acids as well as duplicate
sequences were removed.
– SPROT
• Release 39
• Processed analogously to SCOP
The Filtering by Significance
• Extreme value distribution
– The maximum scores of a large number of
alignments between random sequences of equal
length tend to follow an extreme value distribution.
This is used to estimate the maximal scores observable with
the Smith-Waterman algorithm for random
sequences of given lengths.
• Pruning consists of removing an edge (P,Q)
from the graph if the significance of the score
w(P,Q) is below the chosen significance
threshold.
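A hedged sketch of significance-based pruning, assuming a Gumbel (type I extreme value) distribution with location `mu` and scale `beta`, so that P(S >= x) = 1 - exp(-exp(-(x - mu) / beta)). The parameter values, the function names, and the use of a single parameter pair for all sequence lengths are assumptions for illustration; the slides indicate that the estimate depends on sequence length, which would mean fitting parameters per length.

```python
import math


def gumbel_p_value(score, mu, beta):
    """Probability of a random score at least this large under a Gumbel model."""
    # -expm1(-y) computes 1 - exp(-y) accurately for small y.
    return -math.expm1(-math.exp(-(score - mu) / beta))


def prune_edges(edges, mu, beta, p_threshold):
    """Keep edges (p, q, score) whose p-value does not exceed p_threshold."""
    return [(p, q, s) for p, q, s in edges
            if gumbel_p_value(s, mu, beta) <= p_threshold]


# Hypothetical edges and parameters.
edges = [("A", "B", 80.0), ("A", "C", 32.0)]
kept = prune_edges(edges, mu=30.0, beta=5.0, p_threshold=1e-3)
print(kept)
```

With these made-up parameters the score of 80 is far in the tail and survives, while the score of 32 is typical of random sequences and is removed.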
The Algorithm
• Compute a complete undirected graph
• Replace each undirected edge with two directed
edges
• Proceed to the threshold graph by removing all edges of
weight less than the threshold
• Compute all strongly connected components (SCCs)
Post Processing: Merging Clusters
• Clusters with at least 20 sequences were selected
• A multiple alignment was built for each set of
sequences with ClustalW.
• Profiles were built and calibrated with the HMMER
package using default parameters.
• For each such cluster profile, all sequences not
contained in the cluster were scored using the
profile and the E-value was recorded.
• If the profile of one cluster resulted in an E-value
below the threshold against another cluster, those
clusters were merged.
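The merging rule in the last step can be sketched with a union-find structure. The input format (a dictionary of the best E-value of one cluster's profile against the sequences of another cluster), the function name, and the threshold value are assumptions; building the alignments and profiles themselves is outside this sketch, and the size filter of 20 sequences is ignored in the toy data.

```python
def merge_clusters(clusters, hits, e_threshold):
    """Merge clusters linked by a profile hit with E-value below e_threshold.

    clusters: dict mapping cluster id -> set of sequence ids
    hits: dict mapping (profile_cluster, target_cluster) -> E-value
    """
    parent = {c: c for c in clusters}

    def find(c):
        # Follow parent links to the representative, halving paths on the way.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for (a, b), e_value in hits.items():
        if e_value < e_threshold:
            parent[find(a)] = find(b)

    merged = {}
    for c, members in clusters.items():
        merged.setdefault(find(c), set()).update(members)
    return list(merged.values())


# Hypothetical clusters and E-values.
clusters = {1: {"a", "b"}, 2: {"c"}, 3: {"d"}}
hits = {(1, 2): 1e-10, (2, 3): 0.5}
result = sorted(sorted(m) for m in merge_clusters(clusters, hits, 1e-3))
print(result)
```

Using union-find means chains of significant hits are merged into one group regardless of the order in which the hits are processed.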
Complexity
• Using C++ software on a Compaq ES40
running Tru64 Unix V5.1
– Smith-Waterman computations needed 70 CPU
days.
– Clustering needed 30 seconds.
– Cluster merging using HMMs needed 21 CPU
days.
Psi-Blast Flexibility
Multi-Domain Problem
Path Length FP. vs. TP.
Abundance of multi-domain proteins
Multi-domain Problem

Present at a 13.1% threshold but disappears if the threshold
is raised above 15.4%, since d1m1da1 vanishes
Larger Multi-Domain Problem

Threshold of 21.3% and no edges between P12715 and P33497


More Laddering of Proteins

– False positives caused by just the right increase in length
of proteins. None of the edges are removed when going
over to the threshold graph.
Extended Graph vs. PSI-BLAST
Conclusion
• Sensitivity 63.5% @ 99.0% specificity
• Improvement of 34% upon PSI-BLAST’s
performance of 47.5% sensitivity at
99.0% specificity.
• Performance is gained at the expense of a
much larger computational effort.
• Performance can be further improved by
taking length and position of conserved
regions into account.
