Académique Documents
Professionnel Documents
Culture Documents
Dr Clare Sansom
works part time at Birkbeck
College, London, and part
time as a freelance computer
consultant and science writer.
At Birkbeck she coordinates
an innovative graduate-level
Advanced Certificate course,
Principles of Protein
Structure, which is taught
entirely using the Internet.
Abstract
Keywords: database
searching, DNA sequences,
protein sequences, scoring
matrices, BLAST, FASTA
This review of sequence database searching aims to set out current practice in the area, in
order to give practical guidelines to the experimental biologist. It describes the basic principles
behind the programs and enumerates the range of databases available in the public domain.
Of these, the most important are the equivalent DNA databases European Molecular Biology
Laboratory (EMBL), GenBank and DNA Databank of Japan (DDBJ), and the protein databases
Swiss-Prot and TrEMBL. The commonly used BLAST and FASTA algorithms are described in
detail and alternative approaches mentioned briefly. Scoring matrices used to compare amino
acid types during protein database searches are compared, with an emphasis on the PAM and
BLOSUM series of observed substitution matrices.
Introduction
Sequence databases,
Internet
Dr Clare Sansom
Department of Crystallography,
Birkbeck College,
London, WC1E 7HX UK
22
04-sansom.p65
HENRY STEWART PUBLIC ATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 1. NO 1. 22-32. FEBRUARY 2000
22
2/7/00, 10:38 AM
Table 1: Comparison of some of the most popular programs for searching sequence
databases
Program
Sensitivity
Speed
Sequence types
BLAST
FASTA
Blitz
SSEARCH
PSI-BLAST
Fairly sensitive
Sensitive
Very sensitive
Very sensitive
Extremely sensitive
Very fast
Fast
Fairly fast
Slower
Slower
DNA, protein
DNA, protein
DNA, protein
DNA, protein
Protein
DNA sequence
databases, EMBL,
GenBank, DDBJ
Choice of databases
Choosing an algorithm is only one
aspect of the design of a database
mining strategy. The importance of the
choice of an appropriate database is
self-evident.
Protein sequence
databases, Swiss-Prot,
TrEMBL
HENRY STEWART PUBLIC ATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 1. NO 1. 22-32. FEBRUARY 2000
04-sansom.p65
23
2/7/00, 10:38 AM
23
Sansom
METHODOLOGY
Statistical significance
It is important to realise that any
database search will extract close
matches based on calculated similarity
between strings of letters. Biologists are
likely to be interested in extracting
those sequences that can be assumed
from similarity to be evolutionarily
related to their test sequence. Such
sequences, which will have derived
from a common ancestor, are defined to
be homologous. Extracting this
information from a purely numerical
measure of similarity is difficult. The
most practicable simple guide to the
likelihood of a hit in a database search
being evolutionarily related to a test
sequence is a statistical measure of how
likely the match is to have occurred
purely by chance. Such values are
calculated and quoted in the results
generated by database searching
algorithms.
The most common measure of
probability is the so-called Expect
value, or E-value. This is the number of
alignments with a given score that
would be expected to occur at random
in the database that was searched. Thus,
an E-value of 1.00 for a match between
a database sequence and a test sequence
would indicate that exactly one random
sequence in a database of that size
would be likely to match the test
sequence as well as the current one.
E-values are independent of the lengths
of the sequences. Values as low as 1050
are not uncommon in well-conserved
families. With large databases, values
between about 0.01 and 10 can be said
to represent a grey area; it may be
useful to analyse sequences matching at
this level in more detail.
Almost all sequence alignment
programs of which programs for
database searching are a subset use a
scoring approach. Each position of
each alignment is given a score, which
is positive for a good match and
negative for either a poor match or a
Homology
Expect value
Gap penalties
Scoring matrices
24
04-sansom.p65
Gap penalties
Clearly, aligning a residue or group of
residues in one sequence with a null
character (a gap) in another should be
penalised. Since a single point mutation
may introduce many more than one
residue into a sequence, long gaps are
usually penalised only slightly more
than short ones. This is achieved by
using two separate negative scores: a
large penalty for introducing a gap and
a much smaller one for extending an
existing one.
Scoring matrices
For comparisons between DNA
sequences, the choice of scoring matrix
is generally trivial. A high score is given
for a match between bases and zero, or
a negative score, to any mismatch. A
match between A and not-A is scored
as a mismatch under all circumstances.
Comparisons at the protein level are
much more complex. All algorithms
comparing protein sequences give
matches between amino acids thought
of as similar such as leucine and
isoleucine, or phenylalanine and
tyrosine intermediate scores between
those of identical amino acids and those
of amino acids with no similarity.
Researchers have used different criteria
to assign scores to each of the 210
possible pairs of amino acids.
Genetic code schemes
HENRY STEWART PUBLIC ATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 1. NO 1. 22-32. FEBRUARY 2000
24
2/7/00, 10:38 AM
HENRY STEWART PUBLIC ATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 1. NO 1. 22-32. FEBRUARY 2000
04-sansom.p65
25
2/7/00, 10:38 AM
25
Sansom
C 2 4 12 5 5 4 3 3 2 5 6 5 4 3 5 4
4 3
2 7 4
3 2
2 7 4
4 5 4 6 5
5 2
F
G
H 1
2 2
1 2 2 2 2
K 1
1 1 1 2 1
1
3 2
2 3 2 4 3
6
3 2
5 2
2 2
6 3
2 5 3
2 2 8
4 5 5 4 3 3 1
0
2
1 1 3
0
0
0
7 5
1 7 5 1
1 1 2 3
2 2 2 2 1
1
0 5
5 1 2
2 3 4
2 3 6 4 3
4 2
3 3 2 3 3 2
2 1 3
M 1 2 5 3 2
3 2
2 2 1
4 2 2
2 4 2
1 3 1 1 5 1
2 1 3 2 1
1 6 5
5 1
2 1
1 1 2 5 4
R 2 1 4 1 1 4 3
3 2
2 1
1 2
1 2 3
1 1
3 2
2 1
1 1
5 3 1
2 2 2 2 1 1 2
2 1 2 2 1
6 2 2
W 6 5 8 7 7
7 3 5 3 2 4 4 6 5
Y 3 3
1 4 1 2 2 5 4 4 3 3 2
5 1
0
5
4 4
3
3 2
2 5 6 17
0 6
10 4
1 2 6 4
Low complexity
sequence
Filtering
The statistics used in database searching
assume that the arrangement of bases or
amino acids in unrelated sequences is
essentially random. This is not the case
in practice. In particular, some regions
of DNA and protein sequences consist
of long repeated runs of a single
Filtering
26
04-sansom.p65
HENRY STEWART PUBLIC ATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 1. NO 1. 22-32. FEBRUARY 2000
26
2/7/00, 10:38 AM
Algorithms
ALGORITHMS
BLAST
BLAST24 Basic Local Alignment
Search Tool is a simple, but extremely
fast, algorithm for finding the highest
scoring locally optimal matches
BLAST
BLAST Algorithm
(1) For the query find the list of high scoring words of length w.
Query sequence of length L
Maximum of Lw+1 words
(typically w = 3 for proteins)
(2) Compare the word list to the database and identify exact matches.
Database
Sequences
04-sansom.p65
27
2/7/00, 10:38 AM
27
Sansom
FASTA
PSI-BLAST
The
Each
FASTA
28
04-sansom.p65
HENRY STEWART PUBLIC ATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 1. NO 1. 22-32. FEBRUARY 2000
28
2/7/00, 10:38 AM
FASTA Algorithm
(a)
(b)
Sequence B
Sequence A
Sequence A
Sequence B
(c)
(d)
Sequence B
Sequence A
Sequence A
Sequence B
Table 3: Names of programs in the BLAST family. Similar options are available with
FASTA
Program name
Function
Blastp
Blastn
Blastx
Translates a nucleic acid query sequence in all six reading frames and compares
each conceptual translation against a protein database
Tblastn
Compares a protein query sequence against a nucleic acid database where each
sequence has been dynamically translated in all six reading frames
Tblastx
HENRY STEWART PUBLIC ATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 1. NO 1. 22-32. FEBRUARY 2000
04-sansom.p65
29
2/7/00, 10:38 AM
29
Sansom
PSI-BLAST
Profile
Other programs
Some other programs for database
searching, which are used less
frequently, can be very useful in some
circumstances. Most of these are
procedures for searching a database
with a single sequence that are more
rigorous, but slower, than either BLAST
or FASTA. They include Blitz and
SSEARCH,20 which implements the
SmithWaterman algorithm often used
in pairwise sequence alignment
calculations to perform a rigorous
comparison of the test sequence with
each sequence in the database.
Some of the most interesting recent
developments in the field of sequence
analysis algorithms involve the use of a
class of statistical methods termed
hidden Markov models (HMMs) to
describe alignments of related
sequences. Hidden Markov models have
been used to model other linear
systems, and applied to problems such as
speech recognition, for many years; a full
description of HMM theory is far
beyond the scope of this review.
Methods such as PSI-BLAST4 use
profiles derived from many aligned
members of sequence families to
http://www.ebi.ac.uk/ebi_docs/embl_db/ebi/topembl.html
http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html
http://www.expasy.ch/
http://www.rcsb.org/
http://www.tigr.org/tdb/mdb/mdb.html
Search algorithms
BLAST website
http://www.ncbi.nlm.nih.gov/BLAST/
FASTA website
http://www2.ebi.ac.uk/fasta3/
PSI-Blast server
http://www.ncbi.nlm.nih.gov/cgibin/BLAST/nph-psi_blast/
http://www.hgmp.mrc.ac.uk/
* Formerly http://www.pdb.bnl.gov.
General bioinformatics service; registration is necessary (free for academics).
EBI, European Bioinformatics Institute; NCBI, National Center for Biotechnology Information.
30
04-sansom.p65
HENRY STEWART PUBLIC ATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 1. NO 1. 22-32. FEBRUARY 2000
30
2/7/00, 10:38 AM
References
1. Doolittle, R. F. (1997), Some reflections on
the early days of sequence searching,
J. Mol. Med., Vol. 75, pp. 239241.
2. Altschul, S. F., Gish, W., Miller, W., Myers,
E. W., and Lipman, D. J. (1990), Basic local
alignment search tool, J. Mol. Biol., Vol.
215, pp. 403410.
3. Altschul, S. F. and Gish, W. (1996), Local
alignment statistics, Methods Enzymol., Vol.
266, pp. 460480.
4. Altschul, S. F., Madden, T. L., Schaffer, A. A.,
Zhang, J., Zhang, Z., Miller, W. and
Lipman, D. J. (1997), Gapped BLAST and
PSI-BLAST: a new generation of protein
database search programs, Nucleic Acids
Res., Vol. 25, pp. 33893402.
5. Pearson, W. R. and Lipman, D. J. (1988),
Improved tools for biological sequence
comparison, Proc. Nat. Acad. Sci. USA, Vol.
85, pp. 24442448.
6. Lipman, D. J. and Pearson, W. R. (1985),
Rapid and sensitive protein similarity
searches, Science, Vol. 227, pp. 14351441.
HENRY STEWART PUBLIC ATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 1. NO 1. 22-32. FEBRUARY 2000
04-sansom.p65
31
2/7/00, 10:38 AM
31
Sansom
APPENDIX 1
Practical Guidelines for Database Searching: A Short Summary
32
04-sansom.p65
HENRY STEWART PUBLIC ATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 1. NO 1. 22-32. FEBRUARY 2000
32
2/7/00, 10:38 AM