Also, this discussion is nice in the sense that it ties together a lot of different
bioinformatic concepts into one unified effort. Some of these concepts are
structural; however, many are not.
Manual assignment | Mixed (manual + automated) | Fully automated
Uses both structure- and sequence-based information in assignments.
(2.) Sequence profiles and structure comparison protocols are used to detect more
distant homologies.
(3.) Structures unclassified at this stage are then examined using both automatic
and manual procedures to determine domain boundaries.
(4.) Unclassified domain structures are recomputed using the methods employed in
steps 2 and 3.
Class (C-level): Class is determined according to the secondary structure composition and packing within the
structure. Three major classes are recognized: mainly-alpha, mainly-beta and alpha-beta. This last class
(alpha-beta) includes both alternating alpha/beta structures and alpha+beta structures, as originally defined
by Levitt and Chothia (1976). A fourth class contains protein domains with low
secondary structure content.
Architecture (A-level): This describes the overall shape of the domain structure as determined by the orientations of
the secondary structures, ignoring the connectivity between them. It is currently
assigned manually using a simple description of the secondary structure arrangement, e.g. barrel or 3-layer
sandwich. Reference is made to the literature for well-known architectures (e.g. the beta-propeller or alpha
four-helix bundle).
Topology (Fold family, T-level): Structures are grouped into fold groups at this level depending on both the overall
shape and connectivity of the secondary structures. This is done using the structure comparison algorithms
SSAP (Taylor & Orengo, 1989) and CATHEDRAL (Harrison et al. 2002, 2003). Parameters for clustering
domains into the same fold family have been determined by empirical trials throughout the databank
(Orengo et al. 1992; Orengo et al. 1993; Harrison et al. 2002, 2003). Structures which have an SSAP score of
at least 70 and where at least 60% of the larger protein matches the smaller protein are assigned to the same T
level or fold group.
Some fold groups are very highly populated (Orengo et al. 1994; Orengo & Thornton, 2005), particularly
within the mainly-[...] architectures and the [...] architectures.
Caution: Due to how secondary structures are interconnected, varying topologies
can still result in the same overall architecture.
Domain 1 of β-lactamase
-- Notice how different the topology is
Homologous superfamily (H-level): This level groups together protein domains which are thought to share a
common ancestor and can therefore be described as homologous. Similarities are identified either by high
sequence identity or structure comparison using SSAP. Structures are clustered into the same homologous
superfamily if they satisfy one of the following criteria:
Sequence identity >= 35%, overlap >= 60% of larger structure equivalent to smaller.
SSAP score >= 80.0, sequence identity >= 20%, 60% of larger structure equivalent to smaller.
SSAP score >= 70.0, 60% of larger structure equivalent to smaller, and domains which have related
functions, as informed by the literature and the Pfam protein family database.
Significant(?!) similarity from HMM-sequence searches and HMM-HMM comparisons using SAM,
HMMER and PRC.
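The criteria listed above can be read as a simple decision rule. The following is a minimal sketch, not CATH's actual code: the function name and arguments are hypothetical, and the two boolean flags merely stand in for the literature/Pfam evidence and the HMM-based evidence, which are not computed here.

```python
def same_superfamily(seq_id, overlap, ssap=None,
                     related_function=False, hmm_hit=False):
    """Return True if a domain pair meets any of the listed H-level criteria.

    seq_id and overlap are percentages (0-100); ssap is an SSAP score or None.
    related_function and hmm_hit are placeholders for evidence gathered
    elsewhere (literature/Pfam and HMM searches); they are assumptions here.
    """
    # Criterion 1: high sequence identity with sufficient overlap.
    if seq_id >= 35 and overlap >= 60:
        return True
    if ssap is not None and overlap >= 60:
        # Criterion 2: strong structural match plus moderate sequence identity.
        if ssap >= 80 and seq_id >= 20:
            return True
        # Criterion 3: weaker structural match backed by related function.
        if ssap >= 70 and related_function:
            return True
    # Criterion 4: significant HMM-based similarity.
    return hmm_hit
```

For instance, `same_superfamily(25, 65, ssap=82)` passes on the second criterion, while the same pair with `ssap=75` and no functional evidence does not.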
Sequence families (this is actually subdivided into S, O, L, I, D): Domains within each H-level are
sub-clustered into sequence families using multi-linkage clustering at a series of sequence identity levels. Note that D is
just a counter for different PDB files of the same protein.
Dynamic programming (DP) does not actually refer to the way in which
particularly charismatic computer programmers write code.
DP methods are a general class of algorithms that are often seen both in
sequence alignment and other computational problems.
DP algorithms find the best solution by first breaking the original problem into
smaller sub-problems and then solving them.
The pieces of the larger problem have sequential dependency; that is, the fourth
piece can only be solved with the answer to the third piece, the third can only be
solved with the answer to the second, and so on.
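To make the sequential dependency concrete, here is a minimal sketch of global alignment scoring in the style of Needleman-Wunsch. The scoring values (match, mismatch, gap) are arbitrary assumptions chosen for illustration; each cell is filled only from cells that have already been solved.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score the best global alignment of strings a and b by DP."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Border cells: aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each inner cell depends only on its three already-solved neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(diag, up, left)
    return score[-1][-1]
```

With these parameters, `global_alignment_score("AC", "AGC")` is 0: two matches (+2) and one gap (-2).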
The coefficients in a Position Weight Matrix (PWM) are therefore often computed directly as
log-likelihood values. The background probability of a nucleotide accounts for its
frequency across all of the sequences used to derive the matrix.
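As an illustration, a log-likelihood (log-odds) matrix could be derived from a set of aligned sites roughly as follows. This is a sketch under stated assumptions: the function name is hypothetical, the background defaults to uniform, base-2 logarithms are used, and a uniform pseudocount of 1 is added only to avoid taking the log of zero.

```python
import math

def log_odds_matrix(sites, background=None, pseudocount=1.0):
    """Build a list of per-position {nucleotide: log2 odds} dictionaries."""
    alphabet = "ACGT"
    if background is None:
        # Assumption: equal background probabilities unless supplied.
        background = {nt: 0.25 for nt in alphabet}
    matrix = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        total = len(column) + pseudocount * len(alphabet)
        row = {}
        for nt in alphabet:
            freq = (column.count(nt) + pseudocount) / total
            # Log-likelihood ratio of observed frequency to background.
            row[nt] = math.log2(freq / background[nt])
        matrix.append(row)
    return matrix
```

For the four sites `["ACG", "ACG", "ATG", "ACG"]`, position 0 gives A a frequency of 5/8 and a score of log2(2.5), while the unobserved C gets 1/8 and a score of -1.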
Once a Profile has been derived from a set of functionally related sites, the Profile
can be used to scan a query sequence for the presence of potential sites.
Usually you run a window the length of the matrix along the sequence, and sum the
coefficients from the matrix corresponding to each nucleotide in each position on
the window sequence.
You can use any form of the previous matrix to search for occurrences of the motif
in a given sequence, but if you use the log-likelihood matrix, the scores that you will
obtain are log-likelihood ratios (which can be summed!).
You can use the next sequence or your own sequence, and see how the scores
along each position in the sequence are calculated.
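The sliding-window scan described above can be sketched as follows. The matrix values here are made up purely for illustration, and the function name is hypothetical.

```python
def scan(sequence, matrix):
    """Score every window of len(matrix) along sequence by summing coefficients."""
    width = len(matrix)
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        scores.append(sum(matrix[i][nt] for i, nt in enumerate(window)))
    return scores

# Hypothetical 3-position log-likelihood matrix (illustrative values only).
toy_matrix = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": 0.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
]
scores = scan("TTACGA", toy_matrix)
# Index of the highest-scoring window start.
best = max(range(len(scores)), key=scores.__getitem__)
```

Here the window starting at index 2 ("ACG") scores highest, since each of its nucleotides is the preferred one at its position.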
But adding an integer is too large (and too uniform if you're not
assuming equal probabilities)... Therefore, instead of adding 1, add
fractional counts that are dependent on the probability of each residue.
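One common way to realize such fractional counts is to spread a fixed total pseudocount weight across residues in proportion to their background probabilities. The sketch below assumes that scheme; the function name and the `weight` parameter are hypothetical, not taken from any particular program.

```python
def column_probabilities(counts, background, weight=1.0):
    """Estimate residue probabilities for one column with fractional pseudocounts.

    counts: observed residue counts in the column (keys must appear in background).
    background: background probability of each residue (should sum to 1).
    weight: total pseudocount mass shared out by background probability (assumed).
    """
    total = sum(counts.values()) + weight
    return {
        residue: (counts.get(residue, 0) + weight * background[residue]) / total
        for residue in background
    }
```

With counts `{"A": 4}` and a background of A=0.4, C=0.1, G=0.1, T=0.4, the unobserved T ends up more probable than the unobserved C, which a flat "+1" would not distinguish.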
http://bioinformatics.weizmann.ac.il/blocks/help/about_impala.html
While CATH uses IMPALA now (see Figure 13.3 in your text), much can be
learned by considering its more common functional cousin, PSI-BLAST.
[Figure: BLAST word search. Look in the table for all similar words that score well, then search the database for matches. Repeat with 4 characters, then with 5 characters.]
BLAST 2 Sequences (bl2seq) - Allows one to heuristically create the local pairwise alignment of any two
pre-determined sequences.
  - Either any two protein or any two nucleic acid sequences
BLASTP - Uses a protein sequence to search for homologous sequences within a protein sequence
database.
BLASTN - Uses a nucleic acid sequence to search for homologous sequences within a nucleic acid
database.
The power of profile methods can be further enhanced through iteration of the
search procedure.
After a profile is run against a database, new similar sequences can be detected.
A new multiple alignment, which includes these sequences, can be constructed, a
new profile abstracted, and a new database search performed.
The procedure can be iterated as often as desired or until convergence, when no
new statistically significant sequences are detected.
Iterated profile search methods have led to biologically important observations
but, for many years, were quite slow and generally did not provide precise means
for evaluating the significance of their results.
This limited their utility for systematic mining of the protein databases. The
principal design goals in developing the Position-Specific Iterated BLAST (PSI-BLAST)
program were speed, simplicity and automatic operation.
2. The program constructs an MSA (multiple sequence alignment), and then a profile, from any
significant local alignments found. The original query sequence serves as a template for the
MSA and profile.
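The iterate-until-convergence idea can be sketched independently of any particular search engine. In the sketch below, `search` is a hypothetical callable standing in for "build an MSA and profile from the current members, search the database, and return the statistically significant hits"; it is not a real PSI-BLAST API.

```python
def iterate_profile_search(query, search, max_rounds=10):
    """Repeat a profile search until no new significant sequences appear.

    search(members) -> iterable of significant hit identifiers (hypothetical).
    Returns the final member set and the number of rounds run.
    """
    members = {query}
    for round_number in range(1, max_rounds + 1):
        hits = set(search(members))
        new = hits - members
        if not new:
            # Convergence: this round detected nothing new.
            return members, round_number
        members |= new
    return members, max_rounds

# Toy stand-in for a database: each sequence "detects" its listed neighbours.
neighbours = {"q": {"a"}, "a": {"b"}, "b": set()}

def toy_search(members):
    return set().union(*(neighbours.get(m, set()) for m in members))

found, rounds = iterate_profile_search("q", toy_search)
```

In the toy example the query reaches "a" in round 1 and "b" in round 2, and round 3 confirms convergence.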
The first DP loop does an all-to-all comparison to identify putative residue pairs.
The second DP loop uses the putative pairs calculated in the first to find the ideal superimposition.
SSAP scores are scaled 0-100; homologous proteins regularly score >80.
A score of >70 is required to assign a protein to an existing fold group. However, even though high SSAP scores support
homology, corroborating evidence (e.g. a PSI-BLAST hit) is required for assignment at the superfamily level.
Domains are important: proteins are only classified into existing fold groups if the detected structural similarity extends
for more than 60% of the protein.
We won't spend too much time on the details of SSAP. Later, we will learn the
gory details of two other common structural comparison algorithms (DALI and
Combinatorial Extension).
Any structures left unclassified by the sequence-based methods are divided into their
constituent domains (when appropriate). The domains are then resubmitted to the
sequence and structure comparison protocols discussed previously.
While there are many automatic domain identification algorithms, most result in
significant numbers of incorrect assessments (20-30% incorrect).
This is mainly due to the fact that there is no unique answer to the question,
"What is a domain?" For example, one could easily envision various domain
classification schemes based on sequence, phylogeny and/or structure.
Approximately one-quarter of all domains contained within CATH are
discontiguous.
While theoretically unsatisfactory, such consensus-based methods are common in bioinformatics. The picture above shows a
consensus method used in the prediction of protein secondary structure.
Three methods are employed; where they agree within a tolerance of ten
residues, domains are assigned completely automatically.
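A consensus check of this kind might look like the sketch below. It assumes each method reports its domain boundaries as a sorted list of residue numbers and that boundaries are paired by order; how CATH actually pairs and compares boundaries is not stated here, so treat this as illustrative only.

```python
def boundaries_agree(predictions, tolerance=10):
    """Return True if all methods place every boundary within `tolerance` residues.

    predictions: one sorted list of boundary residue numbers per method.
    """
    # All methods must predict the same number of boundaries.
    if len({len(p) for p in predictions}) != 1:
        return False
    # Compare the k-th boundary from each method (pairing by order is assumed).
    for group in zip(*predictions):
        if max(group) - min(group) > tolerance:
            return False
    return True
```

For example, `boundaries_agree([[120, 245], [125, 240], [118, 250]])` is true, whereas a single method placing its boundary 20 residues away would send the chain to manual inspection.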
The underlying principle used is that residue-residue contacts are denser within a
domain than between domains.
To deal with networks with non-unique minimum cuts, the algorithm finds all cuts
that achieve the minimum cross-edge capacity.
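The contact-density principle can be illustrated with a deliberately simplified sketch: it only considers splitting a chain into two contiguous pieces and counts each contact as capacity 1, whereas the method described above works on a general network. The function name and input format are assumptions.

```python
def minimum_cuts(n_residues, contacts):
    """Find every split point that minimizes the contacts crossing the split.

    n_residues: chain length (must be at least 2).
    contacts: list of (i, j) residue index pairs, 0-based, each with capacity 1.
    A split point k separates residues 0..k-1 from residues k..n_residues-1.
    """
    crossing = {}
    for k in range(1, n_residues):
        # A contact crosses the split if its two residues fall on opposite sides.
        crossing[k] = sum(1 for i, j in contacts if min(i, j) < k <= max(i, j))
    best = min(crossing.values())
    # Report all ties, mirroring the "find all minimum cuts" idea.
    return best, [k for k, c in crossing.items() if c == best]
```

Two tightly connected triplets joined by one contact, `[(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]`, give a single best split at residue 3; a plain chain `[(0, 1), (1, 2), (2, 3)]` gives three equally good cuts.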
A recent analysis of CATH revealed that ~70% of the domains within multidomain
proteins reoccurred in other multidomain proteins and/or occurred as a single
domain protein.
GRATH is used to compare the secondary structure graph for each putative
multidomain protein against CATH.
For example, 17% of the domain assignments in SCOP and CATH disagree.