
Introduction to Pairwise Alignment

Sequence alignment is a way of comparing two primary sequences of DNA,
RNA, or protein.

In principle:
two sequences are written out, one on top of the other

gacctaatcgtgaccatttgcgcgcttaaaatccgtta
attgacctaaatcgtgaccatgcgcgcttaaaatccgttaaaaa

then one sequence is moved with respect to the other, and gaps are inserted in each
sequence to maximize the number of identical pairs of bases (or similar pairs of amino
acids) lined up one above the other.

   gaccta-atcgtgaccatttgcgcgcttaaaatccgtta
attgacctaaatcgtgaccat--gcgcgcttaaaatccgttaaaaa

(each "-" marks a gap)

Alignment can be done by hand using any word processor or text editor to
move lines of text and insert spaces.
It is a relatively straightforward computational problem.

Examples:
identical sequences: ACACACTA
                     ACACACTA

[Figure: dot matrix with ACACACTA along the top and ACACACTA down the side]
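
A dot matrix like the one in the figure can be rebuilt in a few lines. The sketch below is only an illustration: the function name dot_matrix and the use of "*" for a match and "." for a non-match are assumptions, not part of the original notes.

```python
def dot_matrix(top: str, side: str) -> list[str]:
    """Return the text rows of a dot matrix.

    The first row is a header with the letters of `top`; each following row
    starts with one letter of `side` and holds '*' where that letter equals
    the letter of `top` in the same column, '.' elsewhere.
    """
    rows = ["  " + " ".join(top)]
    for s in side:
        cells = ["*" if s == t else "." for t in top]
        rows.append(s + " " + " ".join(cells))
    return rows


# The identical-sequence example from the text.
for line in dot_matrix("ACACACTA", "ACACACTA"):
    print(line)
```

For identical sequences every cell on the main diagonal is a match. The same function can be called with ACACACTA and AGCACACA to see where the diagonal breaks and a gap is needed.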

different sequences: ACACACTA
                     AGCACACA

[Figure: dot matrix with ACACACTA along the top and AGCACACA down the side;
annotation: "A gap has to be inserted"]

alignments:

A-CACACTA        ACACACT-A
AGCACAC-A        AG-CACACA

7 identities,    5 identities,
2 gaps           2 gaps,
                 2 mismatches

more different sequences: ATGCGTCGTT (longer)
                          ATCCGCGAT (shorter)

alignments:

X                Y
ATGC-GTCGTT      ATGCGTCGTT
AT-CCG-CGAT      ATCCG-CGAT

7 identities,    7 identities,
3 gaps,          1 gap,
1 mismatch       2 mismatches
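
The counts listed under each alignment can be checked mechanically. The following is a minimal sketch, assuming the two aligned rows have equal length and that "-" marks a gap position; the function name alignment_counts is hypothetical. A run of consecutive "-" characters in one row is counted as a single gap.

```python
def alignment_counts(top: str, bottom: str) -> tuple[int, int, int]:
    """Return (identities, mismatches, gaps) for two aligned rows.

    A gap is a run of consecutive '-' characters in one row and is counted
    once, irrespective of its length.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    identities = mismatches = gaps = 0
    prev_gap_row = None  # row holding the gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            if gap_row != prev_gap_row:  # a new gap starts in this column
                gaps += 1
            prev_gap_row = gap_row
        else:
            prev_gap_row = None
            if a == b:
                identities += 1
            else:
                mismatches += 1
    return identities, mismatches, gaps


print(alignment_counts("ATGC-GTCGTT", "AT-CCG-CGAT"))  # alignment X -> (7, 1, 3)
print(alignment_counts("ATGCGTCGTT", "ATCCG-CGAT"))    # alignment Y -> (7, 2, 1)
```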

"Evaluation" of alignments
The overall quality of the alignment is evaluated based on formulas that count the number
of identical (or similar) pairs, mismatches and gaps.

In the formulas the mismatches and gaps are penalized.

- gaps are penalized significantly more than mismatches, since an indiscriminate use of
gaps can force an alignment between virtually any pair of sequences, leading to a
meaningless result.

- the penalty assigned to a gap is (usually) not proportional to the size of the gap
because, from a biological perspective, larger insertions/deletions (indels) are almost as
common as single-base indels.

To "evaluate" the alignments, two approaches can be used:


- distance is measured → distance index
- similarity is measured → similarity index

The two methods give the same results in most cases.

Let us consider the distance method in more detail:


D: the distance between the two sequences

D = Min(w1β + w2γ)

Min: minimum value of (w1β + w2γ) among all possible alignments


(α: the number of pairs of matched elements)
β: the number of pairs of mismatched elements
γ: the number of gaps, irrespective of gap length
w1: an arbitrary penalty for a mismatch (e.g. 1)
w2: an arbitrary penalty for a gap (e.g. 4)
(w1 is usually smaller than w2 because deletions and insertions occur less
frequently than substitutions)

Applying this equation to alignments X and Y of the example above, with w1 = 1 and w2 = 4:

For X: w1β + w2γ = 1×1 + 4×3 = 13
For Y: w1β + w2γ = 1×2 + 4×1 = 6

6 is smaller than 13, and thus alignment Y is considered better than X.
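
The comparison above can be written as a short sketch. It takes the minimum over the two candidate alignments X and Y only, whereas D is defined over all possible alignments; the function name alignment_cost and the dictionary layout are assumptions made for illustration.

```python
def alignment_cost(mismatches: int, gaps: int, w1: int = 1, w2: int = 4) -> int:
    """Weighted cost w1*beta + w2*gamma of a single alignment."""
    return w1 * mismatches + w2 * gaps


# (beta, gamma) = (mismatches, gaps) as counted in the text for X and Y.
candidates = {"X": (1, 3), "Y": (2, 1)}
costs = {name: alignment_cost(beta, gamma) for name, (beta, gamma) in candidates.items()}

best = min(costs, key=costs.get)  # candidate with the smallest cost
print(costs)  # {'X': 13, 'Y': 6}
print(best)   # Y
```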

More generally, when the gap penalty depends on the gap length, the following formula can
be used:

D = \min\left( n_d + \sum_{k=1}^{k_{\max}} w_k n_k \right)

n_d: the number of different (mismatched) elements
w_k: the penalty for a gap of k nucleotides
n_k: the number of gaps of length k
k_max: the maximum gap length allowed
Min: the minimum is taken over all possible alignments

The distance method is a protocol for finding the alignment with the smallest D.
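
As an illustration of the length-dependent penalty, the sketch below evaluates the expression inside the minimum for one given alignment. The penalty function w(k) = 3 + k is purely hypothetical (it is chosen so that a single-position gap costs 4, as in the earlier example), the maximum gap length is not enforced, and the function names are assumptions.

```python
from collections import Counter


def gap_length_counts(top: str, bottom: str) -> Counter:
    """Return a Counter mapping gap length k to the number of gaps n_k."""
    counts = Counter()
    for row in (top, bottom):
        run = 0
        for ch in row + "x":  # the sentinel flushes a gap at the end of the row
            if ch == "-":
                run += 1
            elif run:
                counts[run] += 1
                run = 0
    return counts


def length_weighted_cost(top: str, bottom: str, w) -> int:
    """Compute n_d + sum over k of w(k) * n_k for one alignment."""
    n_d = sum(1 for a, b in zip(top, bottom) if "-" not in (a, b) and a != b)
    n_k = gap_length_counts(top, bottom)
    return n_d + sum(w(k) * n for k, n in n_k.items())


def w(k: int) -> int:
    """Hypothetical penalty: grows with gap length, but not proportionally."""
    return 3 + k


# Inner part of the first alignment in these notes: one gap of length 1,
# one gap of length 2, no mismatches.
print(length_weighted_cost("gaccta-atcgtgaccatttgcgcgc",
                           "gacctaaatcgtgaccat--gcgcgc", w))  # 0 + 4 + 5 = 9
```

Finding D itself would still require searching over all possible alignments; this sketch only scores one of them.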

For the similarity method a similar formula can be used:

S = \max\left( n_m - \sum_{k=1}^{k_{\max}} w_k n_k \right)

n_m: the number of identical (matched) elements

The similarity method is a protocol for finding the alignment with the maximum S.
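
As a worked example, the expression inside the maximum can be evaluated for alignments X and Y from the earlier example. The value w_1 = 4 is an assumption, taken over from the gap penalty used in the distance example; all gaps in X and Y have length 1, so only n_1 contributes.

```latex
\begin{align*}
\text{X: } & n_m - w_1 n_1 = 7 - 4 \times 3 = -5 \\
\text{Y: } & n_m - w_1 n_1 = 7 - 4 \times 1 = 3
\end{align*}
```

Under this assumption Y again scores better than X, in agreement with the distance method.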

Introduction to Bioinformatics
Matthias Sipiczki, Department of Genetics, University of Debrecen
Comments to: lipovy@tigris.unideb.hu
