
Scoring Matrices, Database Searching and Heuristic Alignment Algorithms

ISSP 2081 / BIOINF 2051


Fall 2002
Lecture #6
3/14/2011 1
Handouts on Weight matrices
- Weight matrices for sequence similarity scoring, by David Wheeler
- Supplement to the above, by David Wheeler

2
PAM Matrices
- The first widely used substitution matrices
- Based on the point-accepted-mutation (PAM) model of evolution (Dayhoff et al., 1978)
- PAMs are relative measures of evolutionary distance
- 1 PAM = 1 accepted mutation per 100 AAs
- Does this mean that after 100 PAMs every AA will be different? Why or why not?

3
PAM Matrices
- If changes were purely random:
  - The frequency of each possible substitution would be proportional to the background frequencies
- In related proteins:
  - Observed substitution frequencies, called the target (replacement) frequencies, are biased toward those that do not seriously disrupt the protein's function
  - These point mutations are "accepted" during evolution
- Log-odds approach:
  - Scores are proportional to the natural log of the ratio of target frequencies to background frequencies
4
The Math
- Score matrix entry for time t is given by:

$$ s(a,b \mid t) = \log \frac{P(b \mid a, t)}{q_b} $$

where P(b | a, t) is the conditional probability that a is substituted by b in time t, and q_b is the frequency of amino acid b.
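
As a small illustration of the log-odds entry above, the sketch below evaluates the score for two made-up cases. The function name and the probabilities are assumptions chosen for illustration, not values from any published matrix.

```python
import math

def log_odds_score(p_b_given_a, q_b):
    """Return s(a,b|t) = ln( P(b|a,t) / q_b )."""
    if p_b_given_a <= 0 or q_b <= 0:
        raise ValueError("probabilities must be positive")
    return math.log(p_b_given_a / q_b)

# Made-up numbers, for illustration only.
# Substitution seen more often than the background frequency predicts:
favoured = log_odds_score(p_b_given_a=0.10, q_b=0.05)
# Substitution seen less often than the background frequency predicts:
disfavoured = log_odds_score(p_b_given_a=0.01, q_b=0.05)

print(favoured > 0, disfavoured < 0)
```

A positive score marks a substitution that occurs more often in related proteins than chance alone would give; a negative score marks one that occurs less often.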

5
PAM Matrices Construction
- Pairs of very closely related sequences are used to collect mutation frequencies corresponding to 1 PAM
- Explicit model
- Two families studied: immunoglobulin, cytochrome C
- Extrapolation of the data to a distance of 250 PAMs
- PAM250 was the original Dayhoff matrix
- Family of matrices: PAM10 ... PAM200
- Matrix multiplication using PAM-1
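
The extrapolation by matrix multiplication can be sketched on a toy three-letter alphabet. The PAM-1 matrix below is invented for illustration (uniform, 1% change per step) and is not Dayhoff's data; the helper names are likewise assumptions.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result

# Toy PAM-1 transition matrix: each residue changes with probability 0.01.
# Row i, column j holds P(j | i, t = 1 PAM). Values are made up.
PAM1_TOY = [[0.990, 0.005, 0.005],
            [0.005, 0.990, 0.005],
            [0.005, 0.005, 0.990]]

pam100 = mat_pow(PAM1_TOY, 100)

# Average probability that a residue is the same after 100 PAMs.
unchanged = sum(pam100[i][i] for i in range(3)) / 3
print(round(unchanged, 2))
```

In this toy model a large fraction of positions is still unchanged after 100 PAMs, because some sites mutate repeatedly or mutate back while others never change. This bears on the question of whether 100 PAMs means every amino acid is different.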
6
PAM Matrices: salient points
- Derived from global alignments of closely related sequences.
- Matrices for greater evolutionary distances are extrapolated from those for lesser ones.
- The number with the matrix (PAM40, PAM100) refers to the evolutionary distance; greater numbers are greater distances.
- Does not take into account different evolutionary rates between conserved and non-conserved regions.
7
BLOSUM Matrices
- Henikoff, S. & Henikoff, J.G. (1992)
- Use blocks of protein sequence fragments from different families (the BLOCKS database)
- Amino acid pair frequencies are calculated by summing over all possible pairs in a block
- Different evolutionary distances are incorporated into this scheme with a clustering procedure (identity over a particular threshold = same cluster)

8
BLOSUM Matrices
- Similar idea to PAM matrices
- Probabilities estimated from blocks of sequence fragments
- Blocks represent structurally conserved regions

9
BLOSUM Matrices
- Target frequencies are identified directly instead of by extrapolation.
- Sequences more than x% identical within the block where substitutions are being counted are grouped together and treated as a single sequence
  - BLOSUM50: >= 50% identity
  - BLOSUM62: >= 62% identity
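
The clustering and pair-counting steps described above can be sketched as follows. This is a minimal illustration, assuming single-linkage clustering and a made-up four-sequence block; the function names are hypothetical and this is not the Henikoffs' code.

```python
from collections import defaultdict
from itertools import combinations

def percent_identity(s1, s2):
    """Percent of aligned positions with identical residues (ungapped)."""
    matches = sum(1 for a, b in zip(s1, s2) if a == b)
    return 100.0 * matches / len(s1)

def cluster(block, threshold):
    """Single-linkage clustering; returns one cluster label per sequence."""
    labels = list(range(len(block)))
    for i, j in combinations(range(len(block)), 2):
        if percent_identity(block[i], block[j]) >= threshold:
            old, new = labels[j], labels[i]
            labels = [new if lab == old else lab for lab in labels]
    return labels

def weighted_pair_counts(block, threshold):
    """Count residue pairs per column, treating each cluster as one sequence."""
    labels = cluster(block, threshold)
    size = defaultdict(int)
    for lab in labels:
        size[lab] += 1
    counts = defaultdict(float)
    for i, j in combinations(range(len(block)), 2):
        if labels[i] == labels[j]:
            continue  # pairs inside a cluster are not counted
        weight = 1.0 / (size[labels[i]] * size[labels[j]])
        for a, b in zip(block[i], block[j]):
            counts[tuple(sorted((a, b)))] += weight
    return dict(counts)

# Made-up block: the first two sequences are identical and form one cluster.
TOY_BLOCK = ["AAVL", "AAVL", "AGVI", "SGLI"]
counts = weighted_pair_counts(TOY_BLOCK, threshold=62)
print(sorted(counts.items()))
```

With the threshold at 62, the two identical sequences contribute as much as one sequence, so the totals match those of a three-sequence block. The resulting pair counts would then be normalized into target frequencies and turned into log-odds scores.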

10
BLOSUM Matrices: Salient points
- Derived from local, ungapped alignments of distantly related sequences
- All matrices are directly calculated; no extrapolations are used, and there is no explicit model
- The number after the matrix (BLOSUM62) refers to the minimum percent identity of the blocks used to construct the matrix; greater numbers are lesser distances.
- The BLOSUM series of matrices generally performs better than PAM matrices for local similarity searches (Proteins 17:49).
11
BLOSUM Example
- PSC Tutorial, BLOSUM example:
  http://www.psc.edu/biomed/training/tutorials/sequence/db/index.html

12
Heuristic Alignment Algorithms
- Database searching vs. sequence alignment
- What is a heuristic? Why use heuristics?
- Approximations to Smith-Waterman
  - FASTA [Pearson & Lipman, 1988]
  - BLAST [Altschul et al., 1990]
- What are the tradeoffs in terms of search?
  - Sensitivity vs. selectivity

13
BLAST Overview
- BLAST heuristically finds the maximal segment pair (MSP): the highest-scoring pair of identical-length segments from 2 sequences
- SP = ungapped, local alignment
- MSP: a segment pair (SP) with maximum score over all segment pairs in S1 and S2

14
BLAST Overview
- Given: query sequence Q, word length w, word score threshold T, segment score threshold S
- Compile a list of "words" that score at least T when compared to words from Q
- Scan database for matches to words in list
- Extend all matches to seek high-scoring segment pairs
- Return: segment pairs scoring at least S

15
Determining Query Words
- Given:
  Query sequence: QLNFSAGW
  Word length w = 2
  Word score threshold T = 8

Step 1: Determine all words of length w in the query sequence
QL LN NF FS SA AG GW

16
Determining Query Words
Step 2: Determine all words that score at least T when compared to a word in the query sequence

QL: QL=11, QM=9, HL=8, ZL=9
LN: LN=9, LB=8
...
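
The two steps can be sketched in a few lines. The scoring function below is a made-up stand-in (identical = 5, same invented similarity group = 3, otherwise -2), not PAM or BLOSUM, so the scores it produces differ from those on the slide; all names here are assumptions.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical similarity groups, invented for illustration only.
TOY_GROUPS = ["ILVM", "DENQ", "KRH", "FWY", "ST", "AG", "C", "P"]
GROUP_OF = {aa: idx for idx, group in enumerate(TOY_GROUPS) for aa in group}

def toy_score(a, b):
    """Toy substitution score; a real search would look up a PAM or BLOSUM entry."""
    if a == b:
        return 5
    if GROUP_OF[a] == GROUP_OF[b]:
        return 3
    return -2

def query_words(query, w):
    """Step 1: all overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

def neighborhood(word, threshold):
    """Step 2: all words scoring at least threshold against the given word."""
    hits = []
    for candidate in product(ALPHABET, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            hits.append(("".join(candidate), score))
    return hits

words = query_words("QLNFSAGW", 2)
word_list = {word: neighborhood(word, 8) for word in words}
print(words)
print(word_list["QL"])
```

Enumerating all 20^w candidates is fine for w = 2 but grows quickly with w, which is one reason the choice of w matters.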

17
Scanning the database
- Search database for all occurrences of query words
- Approach:
  - Build a DFA (deterministic finite automaton) that recognizes all query words
  - Run DB sequences through DFA
  - Remember hits
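
A minimal sketch of the scanning step follows. It swaps the DFA for a plain dictionary lookup on each database word, which finds the same hits for fixed-length words but is not the automaton itself. For brevity the table holds only the exact query words; in the full method it would also hold the neighborhood words. The function names and the database sequence are made up.

```python
def index_query(query, w):
    """Map each word of length w to the query positions where it starts."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    return index

def scan(db_seq, word_to_query_pos, w):
    """Slide a window of length w over db_seq and record (query_pos, db_pos) hits."""
    hits = []
    for db_pos in range(len(db_seq) - w + 1):
        window = db_seq[db_pos:db_pos + w]
        for q_pos in word_to_query_pos.get(window, ()):
            hits.append((q_pos, db_pos))
    return hits

table = index_query("QLNFSAGW", 2)
hits = scan("MKQLNAAGW", table, 2)
print(hits)
```

Each hit pairs a start position in the query with a start position in the database sequence, which is what the extension step needs.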

18
Finding MSPs
- Extend hits in both directions (without allowing gaps) as long as the score of the segment pair increases
- Return segment pairs scoring at least S
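
The extension step can be sketched as below. The stopping rule is an assumption: extension in each direction stops once the running score falls more than `x_drop` below the best score seen, and with `x_drop=0` it stops at the first residue pair that lowers the score. The match/mismatch scores and function names are made up for illustration.

```python
def match_mismatch(a, b):
    """Toy score: +5 for identical residues, -4 otherwise (invented values)."""
    return 5 if a == b else -4

def extend_one_way(query, db_seq, q_start, db_start, step, score, x_drop):
    """Walk away from the seed in one direction; return (best gain, residues kept)."""
    best = running = 0
    best_len = length = 0
    q, d = q_start, db_start
    while 0 <= q < len(query) and 0 <= d < len(db_seq):
        running += score(query[q], db_seq[d])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break
        q += step
        d += step
    return best, best_len

def extend_hit(query, db_seq, q_pos, db_pos, w, score, x_drop=0):
    """Return (score, query start, db start, length) of the extended segment pair."""
    seed = sum(score(query[q_pos + k], db_seq[db_pos + k]) for k in range(w))
    right, r_len = extend_one_way(query, db_seq, q_pos + w, db_pos + w, 1, score, x_drop)
    left, l_len = extend_one_way(query, db_seq, q_pos - 1, db_pos - 1, -1, score, x_drop)
    return seed + left + right, q_pos - l_len, db_pos - l_len, w + l_len + r_len

QUERY, DB_SEQ, S = "QLNFSAGW", "MKQLNAAGW", 12
segment_pairs = [extend_hit(QUERY, DB_SEQ, q, d, 2, match_mismatch)
                 for q, d in [(0, 2), (5, 6)]]
reported = [sp for sp in segment_pairs if sp[0] >= S]
print(reported)
```

Because each direction is extended greedily and independently, the result is not guaranteed to be the optimal segment pair.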

19
Choosing Values for w and T
- Trade-off: sensitivity vs. running time
- Choosing a value for w
  - Small w: many matches to expand
  - Big w: many words to be generated
  - w = 4 is a good compromise
- Choosing a value for T
  - Small T: greater sensitivity, more matches to expand

20
BLAST Notes
- May fail to find optimal MSPs
  - May miss seeds if T is too stringent
  - Extension is greedy
- Empirically, 10 to 50 times faster than Smith-Waterman
- Large impact: NCBI's BLAST server handles more than 50,000 queries a day

21
Statistics of alignment scores
(or how to choose a value for S)
- [Karlin & Altschul, 1990]
- A model of random sequences
- Ungapped alignments
- All residues drawn independently
- Expected score for a pair of randomly chosen residues required to be negative: why?
- See text for math
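
As a rough companion to the text, the sketch below uses two standard results of the Karlin-Altschul model: the scale parameter lambda is the positive solution of sum over a, b of p_a p_b exp(lambda * s(a,b)) = 1, and the expected number of segment pairs scoring at least S between sequences of lengths m and n is about K * m * n * exp(-lambda * S). The four-letter alphabet, the scores, and the value of K are made up; K is simply passed in rather than computed.

```python
import math

def solve_lambda(freqs, score, tol=1e-12):
    """Find the positive lambda with sum p_a p_b exp(lambda * s(a,b)) = 1 by bisection."""
    letters = list(freqs)
    pairs = [(freqs[a] * freqs[b], score(a, b)) for a in letters for b in letters]
    expected = sum(p * s for p, s in pairs)
    if expected >= 0 or not any(s > 0 for _, s in pairs):
        raise ValueError("need a negative expected score and at least one positive score")

    def f(lam):
        return sum(p * math.exp(lam * s) for p, s in pairs) - 1.0

    hi = 1.0
    while f(hi) <= 0:      # grow until the function is positive
        hi *= 2
    lo = hi
    while f(lo) >= 0:      # shrink until the function is negative
        lo /= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def expected_hits(S, m, n, lam, K):
    """Expected number of segment pairs scoring at least S by chance."""
    return K * m * n * math.exp(-lam * S)

# Toy model: four equally likely letters, match = +1, mismatch = -1.
FREQS = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
lam = solve_lambda(FREQS, lambda a, b: 1 if a == b else -1)
print(round(lam, 4))
# K = 0.1 is a placeholder value, not a computed constant.
print(expected_hits(S=20, m=100, n=10**6, lam=lam, K=0.1))
```

The check at the top of `solve_lambda` hints at the answer to the slide's question: if the expected score were not negative, no positive lambda would exist, and long random alignments would tend to accumulate high scores.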

22
FASTA
- Heuristic, exclusion method
- http://gcg.nhri.org.tw/fasta.html
- See PSC tutorial for examples:
  - www.cbmi.upmc.edu/~vanathi/syllabus.html

23
Readings for next class
- FASTA
- Summary for FASTA paper due

24
