
Database Searching

Why do database searches?
It often seems strange that similarity searching is so central to bioinformatics

HelixInfoSystems

Predominant applications of database searches
To discover or verify the identity of a newly sequenced gene
To find other members of a multigene family
To classify groups of genes


Database searching
One approach: use the Smith-Waterman algorithm to find local alignments, comparing the query sequence to each sequence in the database.
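As a rough illustration of this exhaustive approach, the sketch below scores a query against every database entry with a minimal Smith-Waterman recurrence. The match, mismatch, and linear gap values are assumptions chosen for readability, and the function names are hypothetical; a real search would use a substitution matrix and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search_database(query, database):
    """Score the query against every database sequence, best score first."""
    scored = [(smith_waterman_score(query, seq), name) for name, seq in database.items()]
    return sorted(scored, reverse=True)
```

Every database entry costs a full dynamic-programming table, which is why this approach does not scale to large databases.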

Problem?
Databases are huge (GenBank ~30 million sequences, Swiss-Prot >> 100,000 sequences)
S-W is slow, O(N n^2), where n is the sequence length and N is the number of sequences in the database

Solution?
Use faster heuristic approaches
FastA: Fast Alignment
BLAST: Basic Local Alignment Search Tool

What are heuristics?
A heuristic solves a problem without regard to whether the solution can be proven correct, but usually produces a good solution.
It solves a simpler problem whose solution contains or intersects with the solution of the more complex problem.
Heuristics are intended to gain computational performance or conceptual simplicity, potentially at the cost of accuracy or precision.

FASTA


FASTA
Developed ~1985 by Lipman and Pearson (now many variants/updates/improvements)
Goal: perform fast, approximate local alignments to find sequences in the database that are related to the query sequence


Steps in FastA
1. Choose a value for the ktup parameter: FASTA will look for exact matches of this length between the query and target sequences
   Typically ktup=6 for DNA (range 4-6) and ktup=1 for protein (range 1-2) (why the difference?)
2. Find hot spots (locations of matching ktup-length substrings) in a dot plot

Hot spots with ktup=1


Hot spots with ktup=2


Hot spots with ktup=3


Hashing technique
The computational trick that makes FASTA fast is how it locates the hot spots
It uses a hashing technique: map a string of 1, 2, or more characters to an integer, e.g.
AAA -> 0
AAC -> 1
...
TTT -> 63 (oversimplified)
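One way to read this mapping is as a base-4 number, with each nucleotide acting as a digit. The sketch below assumes the digit order A=0, C=1, G=2, T=3, which reproduces the three values listed; the function name is hypothetical.

```python
# Assumed digit assignment; it reproduces AAA -> 0, AAC -> 1, TTT -> 63.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def ktup_code(word):
    """Encode a DNA k-tuple as a base-4 integer."""
    code = 0
    for base in word.upper():
        code = code * 4 + BASE_CODE[base]
    return code
```

With this encoding, a k-tuple of length k always falls in the range 0 to 4^k - 1, so it can index directly into a table.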

Hashing technique
Can preprocess the database and create a table that stores the locations (offsets) of each possible k-tuple
20^k entries for amino acids (400 if k=2)
4^k entries for DNA (4096 if k=6)
Then use the hash code computed from the query sequence k-tuples to look up these entries quickly

Example

E.g., in the previous example with ktup=2 and top sequence from database: gctggaaggcat
Can now scan the query sequence by sliding a window along it, looking up each ktup substring in the hash table to retrieve the location(s) in the database sequence
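The lookup can be sketched as follows, using the database sequence gctggaaggcat with ktup=2. The query sequence is not given in the text, so the usage note uses a made-up one. The sketch keys the table on the k-tuple string itself and lets Python's dictionary do the hashing, which is an assumption made for brevity; the function names are hypothetical.

```python
from collections import defaultdict


def build_ktup_table(seq, ktup):
    """Map each k-tuple in seq to the list of offsets where it occurs."""
    table = defaultdict(list)
    for pos in range(len(seq) - ktup + 1):
        table[seq[pos:pos + ktup]].append(pos)
    return table


def find_hot_spots(query, table, ktup):
    """Slide a window along the query and collect (query_pos, db_pos) matches."""
    hot_spots = []
    for qpos in range(len(query) - ktup + 1):
        for dpos in table.get(query[qpos:qpos + ktup], []):
            hot_spots.append((qpos, dpos))
    return hot_spots
```

For example, `build_ktup_table("gctggaaggcat", 2)["gc"]` gives the offsets 0 and 8, and scanning the made-up query `"ggcat"` against that table yields one hot spot per shared 2-tuple occurrence.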


Contd. Steps in FastA
3. Find the 10 best diagonal runs (sequences of nearby hot spots on the same diagonal)
FASTA gives each hot spot a positive score, and each space between consecutive hot spots a negative score that decreases with distance
Each diagonal run is composed of matches (the hot spots themselves) and mismatches (interspot regions), but contains no indels because all of its hot spots lie on the same diagonal
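A simplified sketch of this step is shown below. Hot spots are grouped by diagonal (database offset minus query offset), and the best-scoring run on each diagonal is found with a running sum that restarts when it drops to zero or below. The values `hit_score=5` and `gap_cost=1` are assumptions for illustration, not FASTA's actual parameters, and the function name is hypothetical.

```python
from collections import defaultdict


def best_diagonal_runs(hot_spots, ktup, hit_score=5, gap_cost=1, keep=10):
    """Return up to `keep` runs as (score, diagonal, q_start, q_end), best first."""
    by_diagonal = defaultdict(list)
    for qpos, dpos in hot_spots:
        by_diagonal[dpos - qpos].append(qpos)

    runs = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        best = None
        score, start, prev = 0, None, None
        for qpos in positions:
            if prev is not None:
                # Unmatched residues between consecutive hot spots cost points.
                score -= gap_cost * max(0, qpos - (prev + ktup))
            if start is None or score <= 0:
                # Restart the run here: the earlier stretch no longer helps.
                score, start = 0, qpos
            score += hit_score
            if best is None or score > best[0]:
                best = (score, diagonal, start, qpos + ktup - 1)
            prev = qpos
        runs.append(best)
    return sorted(runs, reverse=True)[:keep]
```

Because every hot spot in a run shares one diagonal, a run describes an ungapped stretch of the alignment.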

Contd. Steps in FastA
4. Evaluate each diagonal run using an appropriate scoring matrix (PAM-n, BLOSUM-n, etc.) and find the best-scoring run = init1
Runs with low scores are discarded (filtration)

Contd. Steps in FastA
5. Try to combine good diagonal runs from close diagonals, now allowing indels
"Good" means having a score exceeding a chosen threshold

Finding best path?
Find a maximum weight path in this graph; it corresponds to a single local alignment between the two sequences compared
The score of this path (initn) is the sum of the scores of the aligned individual regions minus a gap penalty for each gap inserted between regions
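The maximum weight path can be sketched as a small dynamic program over the retained regions. The sketch assumes each region is a tuple (q_start, q_end, d_start, d_end, score), that one region may follow another only if it starts after the other ends in both sequences, and that each join costs a fixed `join_penalty`. These are simplifying assumptions, and the function name is hypothetical.

```python
def initn_score(regions, join_penalty=20):
    """Best chain score: sum of region scores minus a penalty per join."""
    ordered = sorted(regions)  # by query start, so predecessors come first
    best_ending_at = []
    for i, (qs, _qe, ds, _de, score) in enumerate(ordered):
        best = score  # the chain may consist of this region alone
        for j in range(i):
            _, prev_qe, _, prev_de, _ = ordered[j]
            if prev_qe < qs and prev_de < ds:
                best = max(best, best_ending_at[j] + score - join_penalty)
        best_ending_at.append(best)
    return max(best_ending_at, default=0)
```

With a small penalty, two compatible regions are worth joining; with a large one, the best single region wins, which is the case where initn equals init1.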

Contd. Steps in FastA
6. If the initn score reaches a threshold value, get the opt score using banded Smith-Waterman alignment (don't waste time on this otherwise)
7. Rank database sequences according to opt scores; use the full Smith-Waterman method (no band) to align the query sequence against each of the highest-ranking sequences from the database

Contd. Steps in FastA
8. Perform statistical analysis of the probability that the given level of matching would be obtained by chance if the sequences were unrelated

FASTA Results
When init1 = initn = opt:
100% identity over the matched stretch.
When initn > init1:
More than one matching region in the database sequence, with poorly matching separating regions.
When opt > initn:
The matching regions are greatly improved by adding gaps in one or both of the sequences.

Basic Local Alignment Search Tool
Most widely used and referenced computational biology/bioinformatics resource


BLAST
Improves search speed of FASTA
Retains sensitivity of searches


BLAST Algorithm
Step:1


Step 1 - Example
Split the query sequence into words of size w using a moving window.
Example: for a human RBP query ...FSGTWYA... with w = 3
The moving window of words:
FSG SGT GTW TWY WYA
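The moving window is a one-line slice over the query. The function name below is hypothetical.

```python
def query_words(query, w=3):
    """Return every overlapping word of length w, in query order."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]
```

Calling `query_words("FSGTWYA", 3)` reproduces the five words listed above.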

Step 1: compile a list of words scoring at least T with the query word GTW
Threshold T = 11

Word   Per-position scores   Total
Word hits ≥ T:
GTW    6, 5, 11              22
GSW    6, 1, 11              18
ATW    0, 5, 11              16
NTW    0, 5, 11              16
GTY    6, 5, 2               13
Word hits < T:
GNW                          10
GAW                          9
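The threshold test can be sketched with only the substitution scores that appear in the per-position columns above. The word scoring 6, 1, 11 is taken here to be GSW, an assumption based on 6 being the G-G score. GNW and GAW are left out because their per-position scores are not given. A real search would use a full substitution matrix; the function names are hypothetical.

```python
# Partial score table: only the pairs shown in the example above.
PAIR_SCORES = {
    ("G", "G"): 6, ("T", "T"): 5, ("W", "W"): 11,
    ("T", "S"): 1, ("G", "A"): 0, ("G", "N"): 0, ("W", "Y"): 2,
}


def pair_score(a, b):
    """Look up a symmetric substitution score."""
    return PAIR_SCORES[(a, b)] if (a, b) in PAIR_SCORES else PAIR_SCORES[(b, a)]


def word_score(query_word, candidate):
    """Sum the per-position scores of two equal-length words."""
    return sum(pair_score(q, c) for q, c in zip(query_word, candidate))


def hits_at_least(query_word, candidates, threshold):
    """Keep the candidate words scoring at least `threshold`."""
    scored = [(word, word_score(query_word, word)) for word in candidates]
    return [(word, score) for word, score in scored if score >= threshold]
```

Raising the threshold shortens the word list, which speeds up the scan at the cost of sensitivity.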

2. Scan the database for entries that contain any word from the compiled hit list.
Look for exact matches of words from the word list to the database sequences

3. Extend: when you find a hit, extend the hit in either direction.
Keep track of the score (use a scoring matrix)
Stop when the score drops below some cutoff.

KENFDKARFSGTWYAMAKKDPEG 50 RBP (query)
MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit)

The hit (GTW) is extended to the left and to the right.

Step 3
For each exact word match, the alignment is extended in both directions to find high-scoring segments
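A minimal sketch of ungapped extension is given below. It assumes a simple match/mismatch score in place of a substitution matrix and stops extending once the running score falls more than `x_drop` below the best score seen so far, which is one common way to implement a cutoff. The parameter values and the function name are assumptions for illustration.

```python
def extend_hit(query, subject, q_pos, s_pos, w, match=5, mismatch=-4, x_drop=10):
    """Extend a word hit without gaps; return (score, q_start, q_end), inclusive."""
    def score(a, b):
        return match if a == b else mismatch

    seed = sum(score(query[q_pos + k], subject[s_pos + k]) for k in range(w))

    def extend(step, q, s):
        best = running = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            running += score(query[q], subject[s])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break  # the score has dropped too far below the best seen
            q += step
            s += step
        return best, best_len

    right, right_len = extend(+1, q_pos + w, s_pos + w)
    left, left_len = extend(-1, q_pos - 1, s_pos - 1)
    return seed + left + right, q_pos - left_len, q_pos + w - 1 + right_len
```

The function keeps only the best-scoring prefix of each extension, so trailing mismatches that were explored before the cutoff are not included in the reported segment.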

Blast Word Size
For nucleotides, a minimum word size of 9 is needed to detect similarity (default is 11)
For proteins the word size is 3 and the match need not be exact, so this is less of an issue



Interpreting BLAST Results
Bit score
E values and p values

From raw scores to bit scores
There are two kinds of scores:
Raw scores (calculated from a substitution matrix)
Bit scores (normalized scores)
$S' = \text{bit score} = (\lambda S - \ln K) / \ln 2$

E values
Expect value (E) is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search.


E-Value
$E = K m n e^{-\lambda S}$
E is the number of hits you would expect from your search with scores greater than S, where:
K is a constant
m is the size of the query
n is the size of the database being searched
$\lambda$ scales for the specific scoring matrix used
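The two formulas can be sketched in a few lines, writing the scaling parameter as `lam` (lambda) and the raw score as `raw_score`. The numeric values of lambda and K in the usage note are illustrative placeholders, not values taken from these slides; the function names are hypothetical.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, m, n, lam, k):
    """Expected number of chance hits: E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)
```

For example, with placeholder values `lam=0.267` and `k=0.041`, a query of length 250, and a database of 50 million residues, `e_value(80, 250, 50_000_000, 0.267, 0.041)` and `250 * 50_000_000 * 2 ** -bit_score(80, 0.267, 0.041)` agree. This is why bit scores can be compared across searches without knowing lambda and K.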

Interpreting scores
% identity is not the best indicator of homology
Statistical theory in next talk
E-value < 0.001 typically used to infer homology
Hits with E-values > 0.001 may still be homologous
   Analysis of conservation of functional motifs
   More sensitive techniques

BLAST family of programs
blastp - amino acid query sequence against a protein sequence database
blastn - nucleotide query sequence against a nucleotide sequence database
blastx - nucleotide query sequence translated in all reading frames against a protein database

BLAST family of programs
tblastn - protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
tblastx - six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
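To make "translated in all reading frames" concrete, the sketch below produces the six translations of a DNA sequence: three offsets on the given strand and three on its reverse complement. It assumes the standard genetic code, marks stop codons with `*` and codons containing other characters with `X`; the function names and frame labels are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons become X."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return translations keyed by frame label: +1, +2, +3, -1, -2, -3."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    return {
        f"{sign}{offset + 1}": translate(strand[offset:])
        for sign, strand in (("+", forward), ("-", reverse))
        for offset in range(3)
    }
```

A program such as blastx would then compare each of the six protein strings against the protein database, while tblastx does this on both the query and the database side.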

Gapped BLAST
The Gapped BLAST algorithm allows gaps to be introduced into the alignments.
That means that similar regions are not broken into several segments.
This method reflects biological relationships much better.

PSI-BLAST
Position Specific Iterated BLAST
The search can be improved if the important parts of the query are known.
The important parts of the query quite often correspond to conserved regions, regions with fewer mutations, or regions that define structure and functionality within a family of proteins.

Position-Specific Iterated BLAST
Instead of a 20 x 20 matrix, use an m x 20 matrix, where m is the length of the query
First, run normal BLAST
Using the results, construct the position
specific score matrix
Next iterations use global alignment and
the position specific score matrix
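A toy construction of such an m x 20 matrix is sketched below. It assumes the hits are already aligned to the query without gaps, uses a fixed pseudocount, and assumes uniform background frequencies. PSI-BLAST itself uses sequence weighting and more careful pseudocounts, so this is only meant to show the shape of the data; the function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return m columns, each a dict mapping residue -> log-odds score."""
    m = len(aligned[0])
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(m):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in aligned:
            counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append({aa: math.log2(counts[aa] / total / background) for aa in AMINO_ACIDS})
    return pssm


def score_segment(pssm, segment):
    """Score a segment of length m against the position-specific matrix."""
    return sum(column[aa] for column, aa in zip(pssm, segment))
```

Unlike a 20 x 20 substitution matrix, the score for a residue here depends on where it falls in the query, so conserved columns reward their preferred residues more strongly.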

Position Specific Matrix - Example
[Figure: position-specific matrix with one column per position (1-16) of the sequence B D C A A C D F G H N N H D C C and one row per residue (A, B, C, D, F, G, H)]


PSI-BLAST
More sequences are found that can then be added onto the multiple alignment
Caution should be used with PSI-BLAST:
a greedy algorithm is used
the most recently added sequences will influence the next round of sequences

PHI-BLAST
Pattern Hit Initiated BLAST
Functions in the same manner as PSI-BLAST, except that the query sequence is first searched for a regular expression
The search for similar sequences is focused on regions containing the pattern

PHI-BLAST
One example of a regular expression:
[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]
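A pattern in this notation can be turned into an ordinary regular expression by dropping the hyphens, reading `x` as "any residue", and turning `(n,m)` into a repeat count. The sketch below handles only the constructs used in the example pattern, written with a hyphen between R and [STAQ] since the notation separates elements with hyphens. Other features of the notation, such as exclusions or anchors, are not covered, and the function name is hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Translate a hyphen-separated residue pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        # x means any residue; (n) or (n,m) becomes a regex repeat count.
        parts.append(element.replace("x", ".").replace("(", "{").replace(")", "}"))
    return "".join(parts)


PATTERN = "[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]"
PATTERN_RE = re.compile(prosite_to_regex(PATTERN))
```

Calling `PATTERN_RE.search(sequence)` on a query then returns the span containing the pattern, which is the region the search would focus on.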


Comparisons of BLAST and FASTA

BLAST | FASTA
Can produce more than one HSP per database entry | Produces only one best alignment
Better for proteins than nucleotides | Better for nucleotides than proteins
Faster than FASTA | Slower than BLAST
Less sensitive when using default settings | More sensitive; misses fewer homologs
Less separation between true and random hits | More separation between true homologs and random hits
Calculates probabilities (sometimes fails entirely if some assumptions are invalid) | Calculates significance from the given dataset (problems if the dataset is small)