‡ While a hierarchical description of protein structure is conceptually straightforward, automating it is not, as you will see. Moreover, the domain boundary problem is actually quite difficult.

‡ This discussion also ties together many different bioinformatics concepts into one unified effort. Some of these concepts are structural; many are not.
Assignment approaches: manual assignment; mixed (manual + automated; uses both structure and sequence-based info in assignments); fully automated.

‡ SCOP: Structural Classification of Proteins
  ‡ Class (αβ)
  ‡ Fold (TIM beta/alpha-barrel)
  ‡ Superfamily (Triosephosphate isomerase)
  ‡ Family (Triosephosphate isomerase)

‡ CATH: Class, Architecture, Topology, Homologous Superfamily
  ‡ Class (αβ)
  ‡ Architecture (αβ-barrel)
  ‡ Topology (TIM barrel)
  ‡ Homologous Superfamily (Aldolase class 1)
  ‡ Sequence Family (Isomerase)

‡ Dali Domain Dictionary (we won't discuss this here)
  ‡ Classes, 2nd cousins, cousins, siblings, domains
The CATH protocol

(1.) Close relatives are identified via sequence comparisons.

(2.) Sequence profiles and structure comparison protocols are used to detect more distant homologies.

(3.) Structures unclassified at this stage are then examined using both automatic and manual procedures to determine domain boundaries.

(4.) Unclassified domain structures are recomputed using the methods employed in steps 2 and 3.

(5.) Finally, any structure(s) remaining unclassified are manually assigned to existing or new architectures within CATH.
Class: Class is determined according to the secondary structure composition and packing within the structure. Three major classes are recognized: mainly-alpha, mainly-beta and alpha-beta. This last class (alpha-beta) includes both alternating alpha/beta structures and alpha+beta structures, as originally defined by Levitt and Chothia (1976). A fourth class is also identified, which contains protein domains that have low secondary structure content.

Architecture: This describes the overall shape of the domain structure as determined by the orientations of the secondary structures, but ignores the connectivity between them. It is currently assigned manually using a simple description of the secondary structure arrangement, e.g. barrel or 3-layer sandwich. Reference is made to the literature for well-known architectures (e.g. the beta-propeller or alpha four-helix bundle).

Topology (Fold family): Structures are grouped into fold groups at this level depending on both the overall shape and the connectivity of the secondary structures. This is done using the structure comparison algorithms SSAP (Taylor & Orengo, 1989) and CATHEDRAL (Harrison et al. 2002, 2003). Parameters for clustering domains into the same fold family have been determined by empirical trials throughout the databank (Orengo et al. 1992; Orengo et al. 1993; Harrison et al. 2002, 2003). Structures which have a SSAP score of at least 70, and where at least 60% of the larger protein matches the smaller protein, are assigned to the same T level or fold group.

Some fold groups are very highly populated (Orengo et al. 1994; Orengo & Thornton, 2005), particularly within the mainly-beta 2-layer sandwich architectures and the alpha-beta 3-layer sandwich architectures.
Caution: Because of how secondary structures are interconnected, different topologies can still result in the same overall architecture.

Flavodoxin (topology = Rossmann fold)
-- Notice the βαβ supersecondary structure

Domain 1 of β-lactamase
-- Notice how different the topology is
Homologous Superfamily: This level groups together protein domains which are thought to share a common ancestor and can therefore be described as homologous. Similarities are identified either by high sequence identity or by structure comparison using SSAP. Structures are clustered into the same homologous superfamily if they satisfy one of the following criteria:

‡ Sequence identity >= 35%, overlap >= 60% of larger structure equivalent to smaller.
‡ SSAP score >= 80.0, sequence identity >= 20%, 60% of larger structure equivalent to smaller.
‡ SSAP score >= 70.0, 60% of larger structure equivalent to smaller, and domains which have related functions, which is informed by the literature and the Pfam protein family database.
‡ Significant(?!) similarity from HMM-sequence searches and HMM-HMM comparisons using SAM, HMMER and PRC.
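The criteria above can be read as a simple "any one of" test. The sketch below is only an illustration of that logic, not CATH's actual code: the function name, the argument names, and the two boolean flags (standing in for the literature/Pfam check and the HMM result) are assumptions made for this example.

```python
def same_superfamily(seq_id, overlap, ssap=None,
                     related_function=False, hmm_significant=False):
    """Return True if any one of the listed H-level criteria is met.

    seq_id and overlap are percentages (0-100); ssap is a SSAP score
    (0-100) or None if no structure comparison was run. The two flags
    are hypothetical stand-ins for manual/HMM evidence.
    """
    # Criterion 1: high sequence identity with sufficient overlap.
    if seq_id >= 35 and overlap >= 60:
        return True
    if ssap is not None and overlap >= 60:
        # Criterion 2: strong structural match plus moderate identity.
        if ssap >= 80 and seq_id >= 20:
            return True
        # Criterion 3: weaker structural match plus related function.
        if ssap >= 70 and related_function:
            return True
    # Criterion 4: significant HMM-based similarity.
    return hmm_significant
```

For example, `same_superfamily(25, 70, ssap=85)` passes on the second criterion, while `same_superfamily(10, 70, ssap=72)` fails unless related function is also established.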

Sequence Family Levels (this is actually subdivided into S, O, L, I, D): Domains within each H-level are sub-clustered into sequence families using multi-linkage clustering at the levels indicated below. Note that D is just a counter for different PDB files of the same protein.
CATH protocol, step (1.): Close relatives are identified via sequence comparisons.
  

‡ Dynamic programming (DP) does not actually refer to the way in which particularly charismatic computer programmers write code.

‡ DP methods are a general class of algorithms that are often seen both in sequence alignment and in other computational problems.

‡ First described in the 1950s by Richard Bellman of Princeton University as a general optimization technique.

‡ DP seems to have been introduced to biological sequence comparison by Saul Needleman and Christian Wunsch, who were apparently unaware of the similarity between their method and Bellman's.
  

‡ DP algorithms solve optimization problems: problems in which there are a large number of possible solutions, but only one (or a small number of) best solution(s).

‡ DP algorithms find the best solution by first breaking the original problem into smaller sub-problems and then solving them.

‡ The pieces of the larger problem have sequential dependency; that is, the fourth piece can only be solved with the answer to the third piece, the third can only be solved with the answer to the second, and so on.

‡ DP works by first solving all these sub-problems, storing each intermediate solution in a table along with a score, and finally choosing the sequence of solutions that yields the highest score.

‡ In sequence alignment, the goal of DP is to maximize the total score for the alignment.

‡ In order to do this, the number of high-scoring residue pairs must be maximized and the number of gaps and low-scoring pairs must be minimized.

Global sequence alignment uses the ubiquitous Needleman-Wunsch dynamic programming algorithm.
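To make the table-filling idea concrete, here is a minimal sketch of Needleman-Wunsch global alignment. The scoring values (match = 1, mismatch = -1, linear gap = -2) are assumptions chosen for illustration; real protein alignments would use a substitution matrix and usually affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table; each cell depends only on previously solved cells.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))
```

With these illustrative scores, aligning "ACGT" with "AGT" places a single gap opposite the C.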
CATH protocol, step (2.): Sequence profiles and structure comparison protocols are used to detect more distant homologies.
Regular expressions

A regular expression represents a generalization about the range of variability that occurs in corresponding positions across a family of protein sequences.

That is, it represents variability by specifying the group of amino acids permitted at each position.
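As a rough illustration of "a group of amino acids permitted at each position", the sketch below converts a small subset of PROSITE-style pattern syntax into a Python regular expression. The helper name `prosite_to_regex` is made up for this example, only a few syntax elements are handled (`x`, `[..]`, `{..}`, `(n)`, `(n,m)`), and the P-loop-like pattern `[AG]-x(4)-G-K-[ST]` and the test sequence are used purely for illustration.

```python
import re

def prosite_to_regex(pattern):
    """Convert a small subset of PROSITE pattern syntax to a Python regex."""
    parts = []
    for element in pattern.split("-"):
        element = element.replace("x", ".")                      # x = any residue
        element = element.replace("{", "[^").replace("}", "]")   # {P} = anything but P
        element = element.replace("(", "{").replace(")", "}")    # (n) or (n,m) = repeat
        parts.append(element)
    return "".join(parts)

regex = prosite_to_regex("[AG]-x(4)-G-K-[ST]")   # becomes "[AG].{4}GK[ST]"
hit = re.search(regex, "MTGPPGSGKSTL")           # hypothetical sequence
```

Here `hit.group()` is the matched stretch of the sequence and `hit.start()` its zero-based position.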
 
  
  

     
  
 

        

Sequence patterns using regular expressions (such as PROSITE) have a limitation: as more sequences are added, the probability that there will be even a few constant or even strongly conserved sites diminishes.
  

  

Thus, the coefficients of a position weight matrix (PWM) are often computed directly as log-likelihood values. The background probability of a nucleotide accounts for the frequency of that nucleotide across all the sequences used to derive the matrix.
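A minimal sketch of that computation follows. It assumes a set of equal-length, ungapped sites and, unless told otherwise, a uniform background; the function name and the example sites are invented for illustration. Note that a nucleotide never seen at a position gets a score of negative infinity.

```python
import math

def log_odds_matrix(sites, background=None, alphabet="ACGT"):
    """Build a position weight matrix of ln(p_observed / p_background)."""
    if background is None:
        # Assumption: equal background probability for every nucleotide.
        background = {nt: 1.0 / len(alphabet) for nt in alphabet}
    matrix = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        row = {}
        for nt in alphabet:
            p = column.count(nt) / len(sites)
            # An unobserved nucleotide has probability 0, i.e. ln(0) = -infinity.
            row[nt] = math.log(p / background[nt]) if p > 0 else float("-inf")
        matrix.append(row)
    return matrix

example_sites = ["ACGT", "ACGA", "ACTT", "AGGT"]   # hypothetical aligned sites
pwm = log_odds_matrix(example_sites)
```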


Because we're assuming equal nucleotide probability, the background probability of each nucleotide is 0.25. The probabilities of each nucleotide in each position are trivial to determine.

Therefore, the log-odds score of A in position 1 is calculated by:

ln { P1A / PA }

where P1A is the observed probability of A in position 1 and PA = 0.25 is its background probability.
Once a profile has been derived from a set of functionally related sites, it can be used to scan a query sequence for the presence of potential sites.

Usually you run a window the length of the matrix along the sequence, and sum the coefficients from the matrix corresponding to each nucleotide in each position of the window.

You can use any form of the previous matrix to search for occurrences of the motif in a given sequence, but if you use the log-likelihood matrix, the scores that you obtain are log-likelihood ratios (which can be summed!).

You can use the next sequence or your own sequence, and see how the scores along each position in the sequence are calculated.
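The window scan described above can be sketched as follows. The two-column matrix and the query sequence are hypothetical values chosen so that the arithmetic is easy to follow; the matrix is represented as a list of per-position dictionaries, which is an assumption of this example.

```python
def scan(sequence, pwm):
    """Slide a window the width of the matrix along the sequence.

    Returns a list of (start, score) pairs, where score is the sum of the
    matrix coefficients for the nucleotides in that window.
    """
    width = len(pwm)
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        scores.append((start, sum(pwm[i][nt] for i, nt in enumerate(window))))
    return scores

# Hypothetical log-odds matrix favouring the motif "AC".
toy_pwm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
]
toy_scores = scan("GACA", toy_pwm)
best_start, best_score = max(toy_scores, key=lambda pair: pair[1])
```

The highest-scoring window marks the most likely occurrence of the motif; in practice a score threshold decides which windows are reported.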

A residue that never appears at a position in the alignment gets zero probability (actually, a log-odds score of negative infinity!), even when a sequence containing it definitely appears related.

Adding an integer count to every cell is too large a correction (and too uniform if you're not assuming equal probabilities). Therefore, instead of adding 1, add fractional counts that depend on the probability of each residue.

Dirichlet mixtures are a "fancier" way of doing pseudo-counts, in which each iteration depends on prior observations. This is beyond our scope and need not concern us.
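A minimal sketch of background-weighted pseudo-counts for a single alignment column is shown below. The function name and the `weight` parameter (the total pseudo-count mass spread over the residues in proportion to their background probabilities) are assumptions of this example; this is not the Dirichlet-mixture approach.

```python
import math

def log_odds_with_pseudocounts(column, background, weight=1.0):
    """Log-odds scores for one column, adding weight * background[x] to each count."""
    total = len(column) + weight   # assumes the background probabilities sum to 1
    scores = {}
    for residue, bg in background.items():
        p = (column.count(residue) + weight * bg) / total
        scores[residue] = math.log(p / bg)
    return scores

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
column_scores = log_odds_with_pseudocounts("AAAA", uniform)
```

Even though C, G and T are never observed in this column, their scores are now finite (and negative) rather than negative infinity.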
    

IMPALA

‡ IMPALA provides a profile-based approach to searching the BLOCKS database.

‡ More information about the method can be found at:

http://bioinformatics.weizmann.ac.il/blocks/help/about_impala.html

‡ While CATH now uses IMPALA (see Figure 13.3 in your text), much can be learned by considering its more common functional cousin, PSI-BLAST.
BLAST

[Figure, built up over several slides: look in the table for all similar words that score well, then search the DB for matches; repeat with 4 characters; repeat with 5 characters.]
BLAST

‡ BLAST 2 Sequences - Allows one to heuristically create the local pairwise alignment of any two predetermined sequences.
  - Either any two protein or any two nucleic acid sequences

‡ BLASTP - Uses a protein sequence to search for homologous sequences within a protein sequence database.

‡ BLASTN - Uses a nucleic acid sequence to search for homologous sequences within a nucleic acid database.

‡ The power of profile methods can be further enhanced through iteration of the search procedure.
‡ After a profile is run against a database, new similar sequences can be detected.
‡ A new multiple alignment, which includes these sequences, can be constructed, a new profile abstracted, and a new database search performed.
‡ The procedure can be iterated as often as desired or until convergence, when no new statistically significant sequences are detected.
‡ Iterated profile search methods have led to biologically important observations but, for many years, were quite slow and generally did not provide precise means for evaluating the significance of their results.
‡ This limited their utility for systematic mining of the protein databases. The principal design goals in developing the Position-Specific Iterated BLAST (PSI-BLAST) program were speed, simplicity and automatic operation.
PSI-BLAST

1. PSI-BLAST takes as input a single protein sequence and compares it to a protein database, using the standard BLAST program.

2. The program constructs an MSA, and then a profile, from any significant local alignments found. The original query sequence serves as a template for the MSA and profile.

3. The profile is compared to the protein database, again seeking local alignments. After a few minor modifications, the BLAST algorithm can be used for this directly.

4. PSI-BLAST estimates the statistical significance of the local alignments found.

5. Finally, PSI-BLAST iterates, by returning to step (2), an arbitrary number of times or until convergence.
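The iterate-until-convergence loop in the steps above can be illustrated with a deliberately simplified toy. This is not PSI-BLAST: it assumes equal-length, ungapped DNA-like strings, a uniform background, and a fixed score threshold in place of real alignment statistics. All names, sequences and the threshold are invented for the example.

```python
import math

ALPHABET = "ACGT"

def build_profile(seqs, pseudo=1.0):
    """Log-odds profile from equal-length sequences (uniform background)."""
    bg = 1.0 / len(ALPHABET)
    profile = []
    for pos in range(len(seqs[0])):
        col = [s[pos] for s in seqs]
        profile.append({
            x: math.log(((col.count(x) + pseudo * bg) / (len(col) + pseudo)) / bg)
            for x in ALPHABET
        })
    return profile

def score(seq, profile):
    """Sum of profile coefficients for an ungapped, full-length comparison."""
    return sum(profile[i][x] for i, x in enumerate(seq))

def iterate_search(query, database, threshold, max_rounds=10):
    """Rebuild the profile from the current hits until the hit set stops changing."""
    hits = [query]
    for _ in range(max_rounds):
        profile = build_profile(hits)
        new_hits = [query] + [s for s in database
                              if s != query and score(s, profile) >= threshold]
        if set(new_hits) == set(hits):
            break   # convergence: no new sequences detected
        hits = new_hits
    return hits

toy_db = ["ACGAAA", "TTTTTT", "ACGTAA"]                      # hypothetical database
one_pass = iterate_search("ACGTAC", toy_db, 3.0, max_rounds=1)
converged = iterate_search("ACGTAC", toy_db, 3.0)
```

In this toy, the single pass finds only the closest relative, while the iterated search also picks up a more distant sequence once the profile has been rebuilt from the first round of hits.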
Profile-alignment statistics allow PSI-BLAST to proceed as a natural extension of BLAST; the results produced in iterative search steps are comparable to those produced from the first pass.

Unlike most profile-based search methods, PSI-BLAST runs as one program, starting with a single protein sequence, and the intermediate steps of multiple alignment and profile construction are invisible to the user (this will make more sense once you learn about things like Blocks/Impala, MEME/MAST, etc.).

Sequential Structure Alignment Program (SSAP)

‡ SSAP uses a "double dynamic programming" approach to compare two protein structures at an atomic level.
  Instead of comparing residue identities, the method compares the structural environments of residues between two proteins.
  The first DP loop does an all-to-all comparison to identify putative residue pairs.
  The second DP loop uses the putative pairs calculated in the first to find the ideal superimposition.

‡ SSAP scores are scaled 0-100; homologous proteins regularly score >80.
  A score of >70 is required to assign a protein to an existing fold class. However, even though high SSAP scores support homology, corroborating evidence (e.g. a PSI-BLAST hit) is required for assignment at the superfamily level.
  Domains are important: proteins are only classified into existing fold groups if the detected structural similarity extends for more than 60% of the protein.

‡ Very computationally expensive; this is generally true of most DP algorithms.
  For example, a 300aa protein can take several days to compare against the entire database using the most powerful machines currently available.

‡ We won't spend too much time on the details of SSAP. Later, we will learn the gory details of two other common structural comparison algorithms (DALI and Combinatorial Extension).
[Figure: "Side" and "Top" views.]
GRATH

‡ GRATH was designed as a pre-filter for CATH in order to speed up the structural comparisons. It is more than 1000x faster than SSAP.

‡ GRATH represents proteins at a very coarse-grained level: secondary structures are represented as vertices of a graph (edges describe their orientation and the distances between their midpoints).

‡ At least one of the top ten GRATH hits is correct 98% of the time. As such, these ten can each be compared individually using SSAP.
CORA

‡ CORA extends the idea of a sequence profile to a multiple structural alignment.

‡ The multiple structure alignment is calculated using a progressive alignment strategy.

‡ After each structure is added, a consensus structure is generated, which is simply the average (with information on variability) of the structures considered so far.
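The "average with information on variability" idea can be sketched as below. This is only an illustration under strong assumptions, not CORA's actual procedure: the structures are taken to be already superposed, gap-free, and given as lists of (x, y, z) coordinates for equivalent positions. The function name and the toy coordinates are invented.

```python
import math

def consensus_structure(aligned_coords):
    """Average equivalent positions across superposed structures.

    aligned_coords: list of structures, each a list of (x, y, z) tuples,
    one per aligned position. Returns a list of (mean_xyz, rms_deviation).
    """
    n = len(aligned_coords)
    consensus = []
    for position in zip(*aligned_coords):
        # Mean coordinate of this position over all structures.
        mean = tuple(sum(p[k] for p in position) / n for k in range(3))
        # Root-mean-square distance from the mean: a simple variability measure.
        rms = math.sqrt(sum(math.dist(p, mean) ** 2 for p in position) / n)
        consensus.append((mean, rms))
    return consensus

toy = consensus_structure([
    [(0, 0, 0), (1, 0, 0)],   # hypothetical structure 1
    [(2, 0, 0), (1, 0, 0)],   # hypothetical structure 2
])
```

Positions with a small deviation are structurally conserved across the set; positions with a large deviation are variable.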
CATH protocol, step (3.): Structures unclassified at this stage are then examined using both automatic and manual procedures to determine domain boundaries.

  


 

‡ Any structures left unclassified by the sequence-based methods are divided into their constituent domains (when appropriate). The domains are then resubmitted to the sequence and structure comparison protocols discussed previously.

‡ While there are many automatic domain identification algorithms, most result in significant numbers of incorrect assessments (20-30% incorrect).

‡ This is mainly due to the fact that there is no unique answer to the question, "What is a domain?" For example, one could easily envision various domain classification schemes based on sequence, phylogeny and/or structure.

‡ A common structure-based approach is based on straightforward structural concepts: namely, that (globular) proteins have hydrophobic cores, and that these cores should constitute a (semi)independent folding nucleus.

‡ Thus the automated methods attempt to (maximize, minimize) (intra, inter)-domain contacts.

‡ What about non-globular (i.e. intrinsically disordered or integral) proteins?
Most automated domain identification methods are primarily based on this premise. However, as you might expect, there are myriad ways to implement such an idea.

 

‡ As of 10/2001, nearly 50% of the PDB entries were multidomain proteins.

‡ Two-thirds of those contained two domains.

‡ Approximately one-quarter of all domains contained within CATH are discontiguous.

[Figure: a consensus method used in the prediction of protein secondary structure.]

While theoretically unsatisfactory, such consensus-based methods are common in bioinformatics.
 

 


 

‡ Three methods are employed; where they agree within a tolerance of ten residues, domains are assigned completely automatically.

‡ If consensus is not found, the domains are manually assigned.
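One possible reading of "agree within a tolerance of ten residues" is sketched below. The exact agreement rule is not specified here, so this is an assumption: every method must predict the same number of domain boundaries, and corresponding boundaries must lie within the tolerance of one another. The function name and the boundary values are hypothetical.

```python
def boundaries_agree(predictions, tolerance=10):
    """Check whether several methods agree on domain boundaries.

    predictions: one sorted list of boundary residue numbers per method.
    Returns True if all methods predict the same number of boundaries and
    each corresponding boundary differs by at most `tolerance` residues.
    """
    if len({len(p) for p in predictions}) != 1:
        return False   # methods disagree on the number of domains
    return all(max(group) - min(group) <= tolerance
               for group in zip(*predictions))

automatic = boundaries_agree([[120], [125], [118]])   # boundaries within 7 residues
manual = not boundaries_agree([[120], [140], [118]])  # spread of 22 residues
```

When the function returns False, the protein would fall through to manual assignment.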


 

‡ DomainParser (Xu et al, Bioinformatics 2000) uses a graph-theoretic algorithm for the decomposition of a multi-domain protein into individual structural domains.

‡ The underlying principle used is that residue-residue contacts are denser within a domain than between domains.

‡ The decomposition problem is recast as a network flow problem, in which each residue is represented as a node of a network and each residue-residue contact is represented as an edge with a particular capacity, depending on the type of the contact.

‡ A two-domain decomposition problem is solved by finding a cut of the network that minimizes the total cross-edge capacity (minimum cut).

‡ To deal with networks with non-unique minimum cuts, the algorithm finds all cuts that achieve the minimum cross-edge capacity.

‡ A recent analysis of four automatic methods put DomainParser (marginally) at the top (Holland et al, JMB, 2006). In fact, three of the four were nearly equal, depending on the evaluation criterion.
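To illustrate the minimum-cut idea, the sketch below finds a minimum cut between two chosen nodes using the Edmonds-Karp maximum-flow algorithm. This is a generic textbook technique swapped in for illustration; it is not claimed to be DomainParser's implementation, and the choice of source and sink nodes, the capacities, and the toy contact graph are all assumptions of this example.

```python
from collections import deque

def min_cut(n, edges, source, sink):
    """Edmonds-Karp max-flow; returns (cut capacity, nodes on the source side).

    edges: (u, v, capacity) tuples for undirected contacts between nodes 0..n-1.
    """
    cap = [[0] * n for _ in range(n)]
    for u, v, c in edges:
        cap[u][v] += c
        cap[v][u] += c
    total = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = [-1] * n
        parent[source] = source
        queue = deque([source])
        while queue and parent[sink] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and cap[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[sink] == -1:
            break   # no more augmenting paths: the flow is maximal
        # Find the bottleneck capacity along the path, then push flow.
        bottleneck = float("inf")
        v = sink
        while v != source:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = sink
        while v != source:
            cap[parent[v]][v] -= bottleneck
            cap[v][parent[v]] += bottleneck
            v = parent[v]
        total += bottleneck
    # Nodes still reachable from the source form one side of the minimum cut.
    return total, {v for v in range(n) if parent[v] != -1}

# Hypothetical contact graph: two tightly packed triplets joined by one weak contact.
contacts = [(0, 1, 3), (1, 2, 3), (0, 2, 3),
            (3, 4, 3), (4, 5, 3), (3, 5, 3),
            (2, 3, 1)]
cut_value, domain_one = min_cut(6, contacts, source=0, sink=5)
```

The single weak contact is the bottleneck, so the cut separates the two densely connected groups of nodes, which is the intuition behind treating them as separate domains.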
 

Domain identification is recast as a network flow problem. That is, the method attempts to divide the network into two interconnected parts in such a way that the total edge capacity across the division is minimized. (Note: each edge can carry a different weight, or capacity.)

Intuitively, this translates into finding the bottleneck within the network.

The algorithm works by systematically removing nodes until domain separation is maximized.

There is a second (post-processing) step that checks the validity of the domain boundaries using commonsense metrics like compactness, radius of gyration, number of non-contiguous segments per domain, and distribution of domain sizes.

Because the method is based on topology, it is very fast. It also scales well: O(nm^2), where n = # of nodes and m = # of edges.
 

     

  
  
  

[Figure: a sampling of DomainParser predictions vs. manual assignments.]

      

 

‡ A recent analysis of CATH revealed that ~70% of the domains within multidomain proteins reoccurred in other multidomain proteins and/or occurred as a single-domain protein.

‡ Therefore, a simple domain-detection protocol that searches for known domains within new multidomain proteins can be envisioned.

‡ GRATH is used to compare the secondary structure graph for each putative multidomain protein against CATH.

‡ This approach has led to significant improvements in domain identification.


   

 

‡ Manual resolution of domain assignments is a highly subjective process.

‡ For example, 17% of the domain assignments in SCOP and CATH disagree.

‡ Manual assessment of domains is one of the most time-consuming steps in protein classification. It is also one of the most basic and most important.

‡ While domain assignment from structure is hard, domain assignment from sequence is even harder. As such, many sequence family databases now use CATH or SCOP boundaries within their assignments.
CATH protocol, step (5.): Finally, any structure(s) remaining unclassified are manually assigned to existing or new architectures within CATH.
