BLAST & FASTA

Database Searching
Why do database searches?

It often seems strange that similarity searching is so central to bioinformatics
HelixInfoSystems
Predominant applications of database searches
To discover or verify the identity of a newly sequenced gene
To find other members of a multigene family
To classify groups of genes
Database searching
One approach: use the Smith-Waterman algorithm to find local alignments between the query sequence and each sequence in the database.
Problem?
Databases are huge (GenBank: ~30 million sequences; Swiss-Prot: well over 100,000 sequences)
S-W is slow: $O(Nn^2)$, where n is the sequence length and N is the number of sequences in the database
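To make the cost concrete, here is a minimal sketch of an exhaustive Smith-Waterman search. The scoring values (match=2, mismatch=-1, linear gap=-2) and the function names are illustrative assumptions, not parameters from the slides. The two nested loops are the n^2 term, and the loop over the database supplies the factor N.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between two sequences."""
    # prev and curr hold two consecutive rows of the dynamic-programming matrix.
    prev = [0] * (len(target) + 1)
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            diag = prev[j - 1] + (match if query[i - 1] == target[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search_database(query, database):
    """Score the query against every database sequence, best first."""
    scores = [(smith_waterman_score(query, seq), name) for name, seq in database.items()]
    return sorted(scores, reverse=True)
```

Every database sequence gets a full dynamic-programming pass, which is exactly the work the heuristic methods try to avoid.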
Solution?
Use faster heuristic approaches
FASTA: Fast Alignment ("FAST-All")
BLAST: Basic Local Alignment Search Tool
What is a heuristic?
A heuristic is a method of solving a problem that ignores whether the solution can be proven to be correct, but which usually produces a good solution
It solves a simpler problem whose solution contains or intersects with the solution of the more complex problem
Heuristics are intended to gain computational performance or conceptual simplicity, potentially at the cost of accuracy or precision
FASTA
Developed ~1985 by Lipman and Pearson (now many variants/updates/improvements)
Goal: perform fast, approximate local alignments to find sequences in the database that are related to the query sequence
Steps in FASTA
1. Choose a value for the ktup parameter: FASTA will look for exact matches of this length between the query and target sequences
Typically ktup=6 for DNA (range 4-6), ktup=1 for protein (range 1-2) (why the difference?)
2. Find hot spots (locations of matching ktup-length substrings) in a dot plot
Hot spots with ktup=1, ktup=2, and ktup=3 (dot-plot figures)
Hashing technique
The computational trick that makes FASTA fast is how it locates the hot spots
It uses a hashing technique: map a string of 1, 2, or more characters to an integer, e.g.,
AAA -> 0
AAC -> 1
...
TTT -> 63 (oversimplified)
Hashing technique (contd.)
Can preprocess the database and create a table that stores the locations (offsets) of each possible k-tuple
$20^k$ entries for amino acids (400 if k=2)
$4^k$ entries for DNA (4096 if k=6)
Then use the hash code computed from each query-sequence k-tuple to look up these entries quickly
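One simple way to realise this mapping for DNA is to read a k-tuple as a base-4 number, which reproduces the AAA -> 0, AAC -> 1, ..., TTT -> 63 example. The sketch below assumes that encoding and an uppercase or lowercase ACGT alphabet. The function names are hypothetical, and real FASTA implementations differ in detail.

```python
NUCLEOTIDE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def ktuple_code(ktuple):
    """Read a DNA k-tuple as a base-4 number: AAA -> 0, AAC -> 1, ..., TTT -> 63."""
    code = 0
    for base in ktuple.upper():
        code = code * 4 + NUCLEOTIDE_CODE[base]
    return code


def build_offset_table(sequence, k):
    """Return a table with 4**k slots; slot c lists the offsets of the k-tuple with code c."""
    table = [[] for _ in range(4 ** k)]
    for offset in range(len(sequence) - k + 1):
        table[ktuple_code(sequence[offset:offset + k])].append(offset)
    return table
```

Because the table is indexed directly by the integer code, looking up all occurrences of a k-tuple takes constant time regardless of the sequence length.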
Example
E.g., in the previous example with ktup=2 and top sequence from the database: gctggaaggcat
Can now scan the query sequence by sliding a window along it, looking up each ktup substring in the hash table to retrieve the location(s) in the database sequence
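The scan can be sketched as follows, using the database sequence from the slide and ktup=2. The query string is a hypothetical one chosen for illustration, since the slide's query did not survive. A plain dictionary keyed by the k-tuple stands in for the integer hash table.

```python
from collections import defaultdict


def index_ktuples(sequence, k):
    """Map each k-tuple to the list of offsets where it occurs in the sequence."""
    table = defaultdict(list)
    for offset in range(len(sequence) - k + 1):
        table[sequence[offset:offset + k]].append(offset)
    return table


def find_hot_spots(query, table, k):
    """Slide a window along the query and collect (query_pos, database_pos) matches."""
    hot_spots = []
    for q in range(len(query) - k + 1):
        for d in table.get(query[q:q + k], []):
            hot_spots.append((q, d))
    return hot_spots


database_sequence = "gctggaaggcat"   # from the slide
query = "tggaat"                     # hypothetical query, not from the slide
table = index_ktuples(database_sequence, 2)
hot_spots = find_hot_spots(query, table, 2)
```

Each pair in `hot_spots` is one dot in the dot plot, and pairs with the same value of `database_pos - query_pos` lie on the same diagonal.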
Steps in FASTA (contd.)
3. Find the 10 best diagonal runs (sequences of nearby hot spots on the same diagonal)
FASTA gives each hot spot a positive score, and each space between consecutive hot spots a negative score that becomes more negative as the distance grows
Each diagonal run is composed of matches (the hot spots themselves) and mismatches (interspot regions), but does not contain indels, because they are all on the same diagonal
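A rough sketch of this step groups hot spots by diagonal and keeps the best-scoring run on each one. The scores used here (+5 per hot spot, -1 per unmatched position between hot spots) and the function name are assumptions for illustration. The slides do not give FASTA's actual values.

```python
from collections import defaultdict


def best_diagonal_runs(hot_spots, k, hot_score=5, space_penalty=1, top=10):
    """Return up to `top` (score, diagonal, first_q, last_q) runs, best first."""
    by_diagonal = defaultdict(list)
    for q, d in hot_spots:
        by_diagonal[d - q].append(q)   # hot spots on one diagonal share d - q

    runs = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        best = None
        score, start, previous = 0, None, None
        for q in positions:
            if previous is not None:
                # Unmatched positions between the end of the previous hot spot and this one.
                space = max(0, q - (previous + k))
                score -= space_penalty * space
            if previous is None or score <= 0:
                # Start a new run when there is none yet or the old one has lost its value.
                score, start = 0, q
            score += hot_score
            previous = q
            if best is None or score > best[0]:
                best = (score, diagonal, start, q)
        runs.append(best)
    return sorted(runs, reverse=True)[:top]
```

Because all hot spots in a run share one diagonal, a run never needs an insertion or deletion. Only the spacing between hot spots lowers its score.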
4. Evaluate each diagonal run using an appropriate scoring matrix (PAM-n, BLOSUM-n, etc.) and find the best-scoring run = init1
Runs with low scores are discarded (filtration)
5. Try to combine good diagonal runs from close diagonals, now allowing indels
"Good" means having a score that exceeds a chosen threshold
Finding best path?
Find a maximum-weight path in this graph; it corresponds to a single local alignment between the two sequences compared
The score of this path (initn) is the sum of the scores of the aligned individual regions minus a gap penalty for each gap inserted between regions
6. If the initn score reaches a threshold value, compute the opt score using a banded Smith-Waterman alignment (don't waste time on this otherwise)
7. Rank database sequences according to their opt scores; use the full Smith-Waterman method (no band) to align the query sequence against each of the highest-ranking sequences from the database
8. Perform a statistical analysis of the probability that the given level of matching would be obtained by chance if the sequences were unrelated
FASTA Results
When initn = init1 = opt: 100% identity over the matched stretch.
When initn > init1: more than one matching region in the database sequence, with poorly matching regions separating them.
When opt > initn: the match is greatly improved by adding gaps in one or both of the sequences.
Basic Local Alignment Search Tool
Most widely used and referenced computational biology/bioinformatics resource
BLAST
Improves on the search speed of FASTA
Retains the sensitivity of searches
BLAST Algorithm
Step 1
Step 1 - Example
Split the query sequence into words of size W using a moving window.
Example: for a human RBP query
FSGTWYA (query word: GTW)
W=3
The moving window of words:
FSG SGT GTW TWY WYA
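The moving window is easy to sketch in code. The function name below is hypothetical.

```python
def query_words(query, w):
    """Return every overlapping word of length w, in query order."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


words = query_words("FSGTWYA", 3)   # the RBP fragment from the slide
```

A query of length m yields m - W + 1 words, here 7 - 3 + 1 = 5.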
Step 1: compile a list of words scoring at least T with the query word GTW
Threshold (T) = 11

Word   Scores per position   Total
Word hits > T
GTW    6, 5, 11              22
GSW    6, 1, 11              18
ATW    0, 5, 11              16
NTW    0, 5, 11              16
GTY    6, 5, 2               13
Word hits < T
GNW                          10
GAW                          9
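The word list can be sketched as a scoring-and-filtering step. The pair scores below are only the ones that can be read off the per-position scores on the slide (for example 6, 5, 11 for GTW against itself). A real search would use a complete substitution matrix and enumerate all 20^3 candidate words. The names `PAIR_SCORES`, `word_score` and `neighborhood` are hypothetical.

```python
# (query residue, candidate residue) -> score, taken from the slide's per-position scores.
PAIR_SCORES = {
    ("G", "G"): 6, ("T", "T"): 5, ("W", "W"): 11,
    ("T", "S"): 1, ("G", "A"): 0, ("G", "N"): 0, ("W", "Y"): 2,
}


def word_score(query_word, candidate):
    """Sum the substitution scores position by position."""
    return sum(PAIR_SCORES[(q, c)] for q, c in zip(query_word, candidate))


def neighborhood(query_word, candidates, threshold):
    """Keep candidates scoring at least `threshold`, highest score first."""
    scored = [(word_score(query_word, c), c) for c in candidates]
    kept = [item for item in scored if item[0] >= threshold]
    return sorted(kept, key=lambda item: -item[0])


hits = neighborhood("GTW", ["GTW", "GSW", "ATW", "NTW", "GTY"], threshold=11)
```

Raising the threshold shrinks the list and speeds up the scan, at the price of missing weaker matches.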
2. Scan the database for entries that contain any word from the compiled hit list.
Look for exact matches between words from the word list and the database sequences
3. Extend: when you find a hit, extend the hit in either direction.
Keep track of the score (use a scoring matrix)
Stop when the score drops below some cutoff.
KENFDKARFSGTWYAMAKKDPEG 50 RBP (query)
MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit)
<- extend | Hit! (GTW) | extend ->
Step 3
For each exact word match, the alignment is extended in both directions to find high-scoring segments
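A minimal sketch of ungapped extension follows. It assumes a simple match/mismatch score (+5/-4) in place of a real substitution matrix, and it stops once the running score falls more than `x_drop` below the best score seen so far. These values and the function name are illustrative assumptions, not BLAST's actual defaults.

```python
def extend_hit(query, subject, q_start, s_start, word_size,
               match=5, mismatch=-4, x_drop=10):
    """Extend a word hit without gaps; return (score, q_begin, q_end), q_end exclusive."""

    def pair_score(a, b):
        return match if a == b else mismatch

    # Score of the seed word itself.
    seed = sum(pair_score(query[q_start + i], subject[s_start + i])
               for i in range(word_size))

    def extend(q, s, step):
        """Walk from (q, s) in direction `step`; return (best gain, residues kept)."""
        running = best = 0
        kept = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            running += pair_score(query[q], subject[s])
            length += 1
            if running > best:
                best, kept = running, length
            elif best - running > x_drop:
                break   # the score has dropped too far below the best seen
            q += step
            s += step
        return best, kept

    right_gain, right_len = extend(q_start + word_size, s_start + word_size, 1)
    left_gain, left_len = extend(q_start - 1, s_start - 1, -1)
    return (seed + left_gain + right_gain,
            q_start - left_len,
            q_start + word_size + right_len)
```

With the RBP and lactoglobulin fragments from the slides and the GTW hit at position 10 in both, this toy scoring extends the hit one residue to the right, giving GTWY, and then gives up in both directions.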
BLAST Word Size
For nucleotides, a minimum word size of 9 is needed to detect similarity (default is 11)
For proteins, the word size is 3 and the match need not be exact, so word size is less of an issue

Interpreting BLAST Results
Bit score
E values and p values
From raw scores to bit scores
There are two kinds of scores:
Raw scores (calculated from a substitution matrix)
Bit scores (normalized scores)
$S' = \text{bit score} = (\lambda S - \ln K) / \ln 2$, where $S$ is the raw score and $\lambda$ and $K$ are parameters of the scoring system
E values
The expect value (E) is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search.
E-Value
$E = K m n \, e^{-\lambda S}$
E is the number of hits you would expect from your search with scores of at least S, where:
K is a constant
m is the size of the query
n is the size of the database being searched
$\lambda$ scales for the specific scoring matrix used
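The two formulas are linked. Substituting the bit score $S' = (\lambda S - \ln K)/\ln 2$ into $E = Kmn\,e^{-\lambda S}$ gives $E = mn\,2^{-S'}$. The sketch below checks this numerically. The function names are hypothetical, and any values passed for lam and k are placeholders chosen by the caller, not values from the slides.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw score S to a bit score S' = (lam * S - ln k) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, m, n, lam, k):
    """Expected number of chance hits scoring at least S: E = k * m * n * exp(-lam * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def e_value_from_bits(bits, m, n):
    """Same quantity from a bit score: E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)
```

Bit scores fold lam and k into the score, so the same E-value can be recovered from a bit score and the search-space size alone.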
Interpreting scores
% identity is not the best indicator of homology
Statistical theory in next talk
An E-value < 0.001 is typically used to infer homology
Hits with E-values > 0.001 may still be homologous; to check, use:
Analysis of conservation of functional motifs
More sensitive techniques
BLAST family of programs
blastp - amino acid query sequence against a protein sequence database
blastn - nucleotide query sequence against a nucleotide sequence database
blastx - nucleotide query sequence translated in all reading frames against a protein database
tblastn - protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
tblastx - six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database
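"All reading frames" means three codon offsets on each strand. A small sketch of how the six frames are laid out is given here. It stops at codons and leaves out the codon-to-amino-acid translation table, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Return the codon lists of the six reading frames, keyed +1..+3 and -1..-3."""
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            # Only complete codons are kept; a trailing partial codon is dropped.
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames[f"{sign}{offset + 1}"] = codons
    return frames
```

Programs such as blastx, tblastn and tblastx translate each of these frames to protein before comparing, which is why they cost more than blastn or blastp on the same input.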
Gapped BLAST
The Gapped BLAST algorithm allows gaps to be introduced into the alignments.
That means that similar regions are not broken into several segments.
This method reflects biological relationships much more closely.
PSI-BLAST
Position-Specific Iterated BLAST
The search can be improved if the important parts of the query are known.
The important parts of the query quite often correspond to conserved regions, regions with fewer mutations, or regions that define structure and function within a family of proteins.
Position-Specific Iterated BLAST
Instead of a 20 x 20 matrix, use an m x 20 matrix, where m is the length of the query
First, run a normal BLAST search
Using the results, construct the position-specific score matrix
Subsequent iterations search again, using the position-specific score matrix in place of the standard substitution matrix
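The matrix construction can be sketched as column-wise log-odds scores. This toy version assumes ungapped hits of equal length, uniform background frequencies, and a flat pseudocount. PSI-BLAST itself uses sequence weighting and more refined pseudocounts, so treat the names and numbers below as illustrative only.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_hits, pseudocount=1.0):
    """Return an m x 20 matrix of log-odds scores, one row (dict) per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)   # assumption: uniform background frequencies
    length = len(aligned_hits[0])
    total = len(aligned_hits) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in range(length):
        counts = Counter(seq[column] for seq in aligned_hits)
        row = {}
        for residue in AMINO_ACIDS:
            frequency = (counts[residue] + pseudocount) / total
            row[residue] = math.log2(frequency / background)
        pssm.append(row)
    return pssm


def pssm_score(pssm, segment):
    """Score a segment of length m against the matrix, position by position."""
    return sum(pssm[i][residue] for i, residue in enumerate(segment))
```

Unlike a 20 x 20 matrix, the score for a residue now depends on where it occurs, so strongly conserved columns reward the conserved residue and penalise everything else.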
Position Specific Matrix - Example
[Figure: position-specific matrix for the example sequence B D C A A C D F G H N N H D C C, with positions 1-16 on one axis and residue types (A, B, C, D, F, G, H) on the other]
PSI-BLAST
More sequences are found that can then be added to the multiple alignment
Caution should be used with PSI-BLAST:
a greedy algorithm is used
the most recently added sequences will influence the next round of sequences
PHI-BLAST
Pattern-Hit Initiated BLAST
Functions in the same manner as PSI-BLAST, except that the query sequence is first searched for a regular expression
The search for similar sequences is focused on regions containing the pattern
PHI-BLAST
One example of a regular expression:
[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]
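Patterns in this PROSITE-style notation translate almost directly into ordinary regular expressions. The sketch below handles only the elements used in the example: single residues, bracketed residue sets, `x` for any residue, and repeat counts in parentheses. The function name is hypothetical, and the sketch assumes the pattern has a hyphen between every pair of elements.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a compiled Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        # A residue, residue set or x, with an optional repeat such as (5,11) or (3).
        match = re.fullmatch(r"(\[[A-Z]+\]|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        token = "." if residue == "x" else residue
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return re.compile("".join(parts))


regex = prosite_to_regex(
    "[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]")
```

Calling `regex.search(sequence)` then locates the pattern occurrence around which the similarity search would be focused.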
Comparisons of BLAST and FASTA

BLAST                                      FASTA
Can produce more than one HSP per          Produces only one best alignment
database entry
Better for proteins than for nucleotides   Better for nucleotides than for proteins
Faster than FASTA                          Slower than BLAST
Less sensitive when using default          More sensitive; misses fewer homologs
settings
Less separation between true homologs      More separation between true homologs
and random hits                            and random hits
Calculates probabilities (sometimes        Calculates significance from the given
fails entirely if some assumptions are     dataset (problems if the dataset is
invalid)                                   small)