Académique Documents
Professionnel Documents
Culture Documents
net/publication/230503487
BLAST Algorithm
CITATIONS READS
4 1,930
1 author:
Stephen F Altschul
National Institutes of Health
106 PUBLICATIONS 188,952 CITATIONS
SEE PROFILE
Some of the authors of this publication are also working on these related projects:
All content following this page was uploaded by Stephen F Altschul on 14 April 2014.
Stephen F Altschul, National Center for Biotechnology Information, Bethesda, Article contents
Maryland, USA
BLAST Statistics
Basic Algorithm
BLAST is an acronym for basic local alignment search tool; the BLAST family of database
BLAST Variants
search programs takes as input a query DNA or protein sequence, and search DNA or
Implementations and Servers
protein sequence databases for similarities that may indicate homology. The programs
implement variations of the BLAST algorithm, which is a heuristic method for rapidly doi: 10.1038/npg.els.0005253
finding local alignments with scores sufficiently high to be statistically significant.
Deoxyribonucleic acid (DNA) or protein sequences nucleotides or amino acids chosen independently, with
are aligned with one another for a wide variety of fixed background residue frequencies, the number of
reasons, for example, to infer homology, to predict distinct local alignments with score at least S expected
protein structure and function, to reconstruct phylog- to arise from the comparison of two sequences of
eny and to locate exons and introns. The type of lengths m and n is well approximated by the formula
sequence alignment that makes the fewest prior
E ¼ Kmn elS
assumptions is local alignment, which seeks similar
segments of unspecified length from the two sequences where K and l are calculable parameters (Karlin and
being compared. Local sequence similarity is generally Altschul, 1990). No such theory has been established
measured with the aid of a nucleotide or amino acid for gapped local alignments, but computational
substitution matrix and associated gap costs. With experiments strongly suggest that the formula still
sequence similarity so defined, a rigorous method for applies in this more general case (Altschul and Gish,
finding optimal local alignments is provided by the 1996). However, in the gapped case, the statistical
Smith–Waterman algorithm, which requires time parameters l and K can no longer be derived
proportional to the product of the lengths of the analytically but must rather be estimated by random
sequences it compares. However, using this algorithm, simulation.
a similarity search of a large nucleic acid sequence The BLAST programs convert raw local alignment
database or protein database with a single query scores to ‘E-values’ using the above equation, m being
sequence of moderate length may require an hour on the length in residues of the query sequence, and n the
a modern work station. Accordingly, rapid heuristic length of the database to which it is compared. The
algorithms such as FASTA and basic local alignment original BLAST programs (Altschul et al., 1990)
search tool (BLAST) have been developed that can sought only ungapped local alignments and therefore
perform these searches up to two orders of magnitude could calculate l and K analytically. The gapped
faster than the Smith–Waterman algorithm, but at the BLAST programs (Altschul and Gish, 1996; Altschul
cost of possibly missing an occasional similarity, that et al., 1997) allow users to choose among a set of
is, local alignment, of interest. substitution and gap costs for which the statistical
parameters have been preestimated. Only local
alignments with an E-value lower than a set threshold,
BLAST Statistics usually 10 by default, are reported.
ENCYCLOPEDIA OF LIFE SCIENCES & 2005, John Wiley & Sons, Ltd. www.els.net 1
BLAST Algorithm
techniques for identifying perfectly matching Thirdly, consider each hit a nascent alignment
substrings of two input sequences. The heuristic and extend it in each direction by aligning additional
BLAST algorithm uses these techniques to locate, pairs of amino acids from the query and database
with strong probability, optimal local alignments, even sequences. Terminate an extension if the alignment
when they represent relatively diffuse similarities. score drops more than a fixed amount X below the best
Let a ‘word’ be a set-length string of adjacent letters score yet found seeded by this hit. Report the optimal
within a sequence. Given two sequences and a alignment for this hit if its score has a sufficiently low
threshold score T, let a ‘hit’ be a pair of words, one E-value.
from each sequence, whose aligned score is at least T. BLAST achieves its speed in large measure
The central idea behind the BLAST algorithm is that by scanning the database only for exact matches
for an appropriately chosen T, any local alignment to words in the table created in the first step
between the query and a database sequence with an above. However, because it generates this table by
E-value worth reporting is highly likely to contain an expanding each query word into a set of similar words,
aligned pair of words constituting a hit (Altschul et al., BLAST is able to detect fairly diffuse similarities. For
1990). Therefore, a rapid search for hits involving example, six of the 11 hits generated by ‘MVD’ do not
query and database sequences can be used to locate contain two adjacent pairs of matching amino acids
candidate local alignments for reporting. (Table 1).
For the purpose of illustration, consider the original A key feature of the BLAST algorithm is its tunable
ungapped BLAST algorithm (Altschul et al., 1990) trade-off between speed and the probability of missing
applied to protein sequence comparison using word relevant alignments. If the hit threshold score is raised
length 3, and the threshold T ¼ 11 for hits scored using from 11 to 12, the number of words generated by
the BLOSUM-62 amino acid substitution matrix ‘MVD’, for example, falls from 11 to 5 (Table 1). This
(Henikoff and Henikoff, 1992). Omitting most details, results in many fewer alignment extensions and a
BLAST follows the following steps when supplied with corresponding decrease in execution time, but it also
a query sequence and database. renders it more likely that an alignment of interest will
Firstly, construct a table of all possible words that contain no hit and therefore be overlooked. Gapped
would form a hit when aligned with some word in the versions of BLAST have additional conditions on and
query sequence, and tag each word in the table with the procedures for generating gapped alignments
locations of all query sequence words that generated it. (Altschul and Gish, 1996; Altschul et al., 1997),
For example, if the word ‘MVD’ appeared somewhere which will not be discussed here.
in the query, it would generate the 11 words shown in
Table 1, each with a record of the location of ‘MVD’ in
the query. BLAST Variants
Secondly, scan the database for any words that
appear in the table. Each such database word then Variations on the BLAST algorithm have been
constitutes a hit in conjunction with one or more implemented for use in a variety of contexts.
known query sequence words.
Table 1 High-scoring words that align
to ‘MVD’
Protein sequence comparison: BLASTP
BLASTP is used to compare a protein query sequence
Word Score
to a database of protein sequences; recent versions
MVD 15 allow gapped alignments (Altschul and Gish, 1996;
MID 14 Altschul et al., 1997).
LVD 12
MLD 12
MMD 12
IVD 11 DNA sequence comparison: BLASTN and
VVD 11 MEGABLAST
MAD 11
MTD 11 BLASTN is used to compare a DNA query sequence
MVE 11 to a database of DNA sequences. In lieu of the score-
LID 11 based hit definition used for protein comparisons, hits
are defined as runs of matching pairs of nucleic acids of
Alignment scores are computed using the a specified minimum length. MEGABLAST (Zhang
BLOSUM-62 amino acid substitution matrix
(Henikoff and Henikoff, 1992). Amino acids et al., 2000) is a faster variant on BLASTN best used
matching those to which they are aligned in for very long query sequences and for finding
‘MVD’ are shown in bold. alignments of very closely related sequences.
2
BLAST Algorithm
3
BLAST Algorithm