Dr Avril Coghlan
alc@sanger.ac.uk
Substitution matrix for DNA alignments (rows: letter a, columns: letter b):
         A    C    G    T
    A   +1   -1   -1   -1
    C   -1   +1   -1   -1
    G   -1   -1   +1   -1
    T   -1   -1   -1   +1
Substitution matrix for protein alignments (rows: letter a, columns: letter b):
       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
    A  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
    R -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
    N -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
    D -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
    C -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
    Q -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
    E -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
    G -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
    H -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
    I -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
    L -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1
    K -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1
    M -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1
    F -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1
    P -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1
    S -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1
    T -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1
    W -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1
    Y -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1
    V -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1
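Both matrices above follow the same rule (one score when the two letters are equal, another when they differ), so they can be generated rather than typed out. A minimal Python sketch follows; the helper name make_substitution_matrix and the dictionary layout are assumptions made for illustration, not something taken from the slides or from a library.

```python
# Illustrative sketch: make_substitution_matrix is a hypothetical helper,
# not part of any library mentioned in these slides.
def make_substitution_matrix(alphabet, match=1, mismatch=-1):
    """Map every ordered letter pair (a, b) to its alignment score."""
    return {
        (a, b): match if a == b else mismatch
        for a in alphabet
        for b in alphabet
    }


dna = make_substitution_matrix("ACGT")
protein = make_substitution_matrix("ARNDCQEGHILKMFPSTWYV")

print(dna[("G", "G")], dna[("G", "T")])  # score for a match, then a mismatch
print(len(dna), len(protein))            # number of entries: 4*4 and 20*20
```

Real protein alignments usually use matrices with graded scores; this sketch only reproduces the simple +1/-1 scheme shown here.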
N-W: Initialising table T
To align 2 sequences S1 & S2 of lengths m & n, N-W starts by building a
table T with m+1 columns & n+1 rows:
eg. for S1=TGGTG & S2=ATCGT, m=5 and n=5:
Table T:
           T    G    G    T    G

A
T
C
G
T
We number the columns i=0,1,2,...,m
We number the rows j=0,1,2,...,n
         i=0  i=1  i=2  i=3  i=4  i=5
                T    G    G    T    G
j=0
j=1  A
j=2  T
j=3  C
j=4  G
j=5  T
T(i, j) is the cell at the intersection of column i and row j
eg. for S1=TGGTG & S2=ATCGT, m=5 and n=5:
         i=0  i=1  i=2  i=3  i=4  i=5
                T    G    G    T    G
j=0
j=1  A
j=2  T                    *
j=3  C
j=4  G
j=5  T
Table T: * marks T(3,2), the cell in column i=3 and row j=2
The N-W algorithm starts by initialising (setting the initial value of) T(0,0)
to zero:
           T    G    G    T    G
      0
A
T
C
G
T
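The initialisation step can be sketched in Python as follows. The function name new_table and the use of None for "not calculated yet" are assumptions made for illustration. Note one convention choice: the table is stored as a list of rows, so the slide's T(i, j) (column i, row j) is written T[j][i] in the code.

```python
def new_table(s1, s2):
    """Build an empty N-W table for s1 (across the top) and s2 (down the side)."""
    m, n = len(s1), len(s2)
    # n+1 rows (index j), each holding m+1 columns (index i).
    # None marks a cell that has not been calculated yet.
    table = [[None] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0  # initialise T(0, 0) to zero
    return table


T = new_table("TGGTG", "ATCGT")
print(len(T[0]), len(T))  # number of columns, number of rows
print(T[0][0])            # the only cell with a value so far
```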
The table T is then filled in using the recurrence relation:
\[
T(i, j) = \max
\begin{cases}
T(i-1, j-1) + \sigma(S_1(i), S_2(j)) \\
T(i-1, j) + \text{gap penalty} \\
T(i, j-1) + \text{gap penalty}
\end{cases}
\]
(The term \sigma(S_1(i), S_2(j)) will be explained in a minute...)
The table is filled in from left to right, and from top to bottom
The value of T(0,0) is set to zero at the start (initialised to 0)
We first calculate the values of T(i, j) for row 0 of the table, from left to
right
We then calculate the values of T(i, j) for row 1 of the table from left to
right, then rows 2, 3, 4 .... row n of the table
           T    G    G    T    G
      0    x    x    x    x    x
A     x    x    x    x    x    x
T     x    x    x    x    x    x
C     x    x    x    x    x    x
G     x    x    x    x    x    x
T     x    x    x    x    x    x
The table T is then filled in using the recurrence relation:
\[
T(i, j) = \max
\begin{cases}
T(i-1, j-1) + \sigma(S_1(i), S_2(j)) & (1) \\
T(i-1, j) + \text{gap penalty} & (2) \\
T(i, j-1) + \text{gap penalty} & (3)
\end{cases}
\]
This means that the value in cell T(i, j) is set to be the maximum of the
three possibilities (1), (2), (3), where:
T(i-1, j-1) is the value in the previous column & row
T(i-1, j) is the value in the previous column & same row
T(i, j-1) is the value in the same column & previous row
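σ(S1(i), S2(j)) is the score for aligning letter i of S1 with letter j of S2,
taken from a substitution matrix
Substitution matrix for DNA alignments:
         A    C    G    T
    A   +1   -1   -1   -1
    C   -1   +1   -1   -1
    G   -1   -1   +1   -1
    T   -1   -1   -1   +1
For example, say we decide to use +1 for matches, -1 for mismatches, and
-2 for an insertion/deletion (gap)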
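With this scoring scheme, the rule for a single cell can be sketched in Python. The function name cell_score and its return format are assumptions made for illustration. The table is stored as a list of rows, so the slide's T(i, j) is T[j][i] in the code, and letter i of S1 is s1[i - 1] because Python strings start at 0.

```python
def cell_score(T, i, j, s1, s2, match=1, mismatch=-1, gap=-2):
    """Return (value, sources) for T(i, j), where T is stored as T[j][i].

    sources lists every neighbouring cell (i, j) that achieves the maximum,
    so ties (two arrows into the same cell) are kept.
    """
    candidates = []
    if i > 0 and j > 0:
        # Possibility 1: previous column & row, plus the substitution score.
        sigma = match if s1[i - 1] == s2[j - 1] else mismatch
        candidates.append((T[j - 1][i - 1] + sigma, (i - 1, j - 1)))
    if i > 0:
        # Possibility 2: previous column & same row, plus the gap penalty.
        candidates.append((T[j][i - 1] + gap, (i - 1, j)))
    if j > 0:
        # Possibility 3: same column & previous row, plus the gap penalty.
        candidates.append((T[j - 1][i] + gap, (i, j - 1)))
    if not candidates:
        return 0, []  # T(0, 0) has no neighbours and is initialised to 0
    best = max(value for value, _ in candidates)
    return best, [cell for value, cell in candidates if value == best]


# Suppose the three neighbours of T(1, 1) already hold 0, -2 and -2.
T = [[0, -2],
     [-2, None]]
value, sources = cell_score(T, 1, 1, "TGGTG", "ATCGT")
print(value, sources)
```

Possibilities that fall outside the table (on the first row or first column) are simply skipped, which corresponds to "not defined here".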
The N-W algorithm starts by initialising (setting the initial value of) T(0,0)
to zero
We next calculate the value of T(1, 0)
The value of T(1,0) is set to the maximum of 3 possibilities:
\[
T(i, j) = \max
\begin{cases}
T(i-1, j-1) + \sigma(S_1(i), S_2(j)) & \text{not defined here} \\
T(i-1, j) + \text{gap penalty} & = 0 - 2 = -2 \\
T(i, j-1) + \text{gap penalty} & \text{not defined here}
\end{cases}
\]
We calculate this to be -2, so set T(1, 0) to -2
We record which previous cell was used to set the value of T(1, 0):
           T    G    G    T    G
      0   -2
A
T
C
G
T
Arrow: T(1,0) points back to T(0,0)
We next calculate the value of T(2, 0)
The value of T(2,0) is set to the maximum of 3 possibilities:
\[
T(i, j) = \max
\begin{cases}
T(i-1, j-1) + \sigma(S_1(i), S_2(j)) & \text{not defined here} \\
T(i-1, j) + \text{gap penalty} & = -2 - 2 = -4 \\
T(i, j-1) + \text{gap penalty} & \text{not defined here}
\end{cases}
\]
We calculate this to be -4, so set T(2, 0) to -4
We record which previous cell was used to set the value of T(2, 0):
           T    G    G    T    G
      0   -2   -4
A
T
C
G
T
Arrow: T(2,0) points back to T(1,0)
We next calculate the value of T(3, 0)
The value of T(3,0) is set to the maximum of 3 possibilities:
\[
T(i, j) = \max
\begin{cases}
T(i-1, j-1) + \sigma(S_1(i), S_2(j)) & \text{not defined here} \\
T(i-1, j) + \text{gap penalty} & = -4 - 2 = -6 \\
T(i, j-1) + \text{gap penalty} & \text{not defined here}
\end{cases}
\]
We calculate this to be -6, so set T(3, 0) to -6
We record which previous cell was used to set the value of T(3, 0):
           T    G    G    T    G
      0   -2   -4   -6
A
T
C
G
T
Arrow: T(3,0) points back to T(2,0)
We next calculate the value of T(4, 0)
The value of T(4,0) is set to the maximum of 3 possibilities:
\[
T(i, j) = \max
\begin{cases}
T(i-1, j-1) + \sigma(S_1(i), S_2(j)) & \text{not defined here} \\
T(i-1, j) + \text{gap penalty} & = -6 - 2 = -8 \\
T(i, j-1) + \text{gap penalty} & \text{not defined here}
\end{cases}
\]
We calculate this to be -8, so set T(4, 0) to -8
We record which previous cell was used to set the value of T(4, 0):
           T    G    G    T    G
      0   -2   -4   -6   -8
A
T
C
G
T
Arrow: T(4,0) points back to T(3,0)
We next calculate the value of T(5, 0)
The value of T(5,0) is set to the maximum of 3 possibilities:
\[
T(i, j) = \max
\begin{cases}
T(i-1, j-1) + \sigma(S_1(i), S_2(j)) & \text{not defined here} \\
T(i-1, j) + \text{gap penalty} & = -8 - 2 = -10 \\
T(i, j-1) + \text{gap penalty} & \text{not defined here}
\end{cases}
\]
We calculate this to be -10, so set T(5, 0) to -10
We record which previous cell was used to set the value of T(5, 0):
           T    G    G    T    G
      0   -2   -4   -6   -8  -10
A
T
C
G
T
Arrow: T(5,0) points back to T(4,0)
We next calculate the value of T(0, 1)
The value of T(0,1) is set to the maximum of 3 possibilities:
\[
T(i, j) = \max
\begin{cases}
T(i-1, j-1) + \sigma(S_1(i), S_2(j)) & \text{not defined here} \\
T(i-1, j) + \text{gap penalty} & \text{not defined here} \\
T(i, j-1) + \text{gap penalty} & = 0 - 2 = -2
\end{cases}
\]
We calculate this to be -2, so set T(0, 1) to -2
We record which previous cell was used to set the value of T(0, 1):
           T    G    G    T    G
      0   -2   -4   -6   -8  -10
A    -2
T
C
G
T
Arrow: T(0,1) points back to T(0,0)
We next calculate the value of T(1, 1)
The value of T(1,1) is set to the maximum of 3 possibilities:
\[
T(i, j) = \max
\begin{cases}
T(i-1, j-1) + \sigma(S_1(i), S_2(j)) & = 0 - 1 = -1 \\
T(i-1, j) + \text{gap penalty} & = -2 - 2 = -4 \\
T(i, j-1) + \text{gap penalty} & = -2 - 2 = -4
\end{cases}
\]
We calculate this to be -1, so set T(1, 1) to -1
We record which previous cell was used to set the value of T(1, 1):
           T    G    G    T    G
      0   -2   -4   -6   -8  -10
A    -2   -1
T
C
G
T
Arrow: T(1,1) points back to T(0,0)
We next calculate the value of T(2, 1)
The value of T(2,1) is set to the maximum of 3 possibilities:
\[
T(i, j) = \max
\begin{cases}
T(i-1, j-1) + \sigma(S_1(i), S_2(j)) & = -2 - 1 = -3 \\
T(i-1, j) + \text{gap penalty} & = -1 - 2 = -3 \\
T(i, j-1) + \text{gap penalty} & = -4 - 2 = -6
\end{cases}
\]
We calculate this to be -3, so set T(2, 1) to -3
We record which previous cells were used to set the value of T(2, 1) (two
different cells here):
           T    G    G    T    G
      0   -2   -4   -6   -8  -10
A    -2   -1   -3
T
C
G
T
Arrows: T(2,1) points back to T(1,0) and to T(1,1)
We next calculate the value of T(3, 1)
The value of T(3,1) is set to the maximum of 3 possibilities:
\[
T(i, j) = \max
\begin{cases}
T(i-1, j-1) + \sigma(S_1(i), S_2(j)) & = -4 - 1 = -5 \\
T(i-1, j) + \text{gap penalty} & = -3 - 2 = -5 \\
T(i, j-1) + \text{gap penalty} & = -6 - 2 = -8
\end{cases}
\]
We calculate this to be -5, so set T(3, 1) to -5
We record which previous cells were used to set the value of T(3, 1) (two
different cells here):
           T    G    G    T    G
      0   -2   -4   -6   -8  -10
A    -2   -1   -3   -5
T
C
G
T
Arrows: T(3,1) points back to T(2,0) and to T(2,1)
We next calculate the value of T(4, 1)
The value of T(4,1) is set to the maximum of 3 possibilities:
\[
T(i, j) = \max
\begin{cases}
T(i-1, j-1) + \sigma(S_1(i), S_2(j)) & = -6 - 1 = -7 \\
T(i-1, j) + \text{gap penalty} & = -5 - 2 = -7 \\
T(i, j-1) + \text{gap penalty} & = -8 - 2 = -10
\end{cases}
\]
We calculate this to be -7, so set T(4, 1) to -7
We record which previous cells were used to set the value of T(4, 1) (two
different cells here):
           T    G    G    T    G
      0   -2   -4   -6   -8  -10
A    -2   -1   -3   -5   -7
T
C
G
T
Arrows: T(4,1) points back to T(3,0) and to T(3,1)
We next calculate the value of T(5, 1)
The value of T(5,1) is set to the maximum of 3 possibilities:
\[
T(i, j) = \max
\begin{cases}
T(i-1, j-1) + \sigma(S_1(i), S_2(j)) & = -8 - 1 = -9 \\
T(i-1, j) + \text{gap penalty} & = -7 - 2 = -9 \\
T(i, j-1) + \text{gap penalty} & = -10 - 2 = -12
\end{cases}
\]
We calculate this to be -9, so set T(5, 1) to -9
We record which previous cells were used to set the value of T(5, 1) (two
different cells here):
           T    G    G    T    G
      0   -2   -4   -6   -8  -10
A    -2   -1   -3   -5   -7   -9
T
C
G
T
Arrows: T(5,1) points back to T(4,0) and to T(4,1)
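The cell-by-cell calculations for rows 0 and 1 can be repeated mechanically. Below is a minimal Python sketch, assuming the +1/-1/-2 scoring used in these slides; the function name fill_rows and its rows parameter are hypothetical, and the table is stored as a list of rows (the slide's T(i, j) is T[j][i]).

```python
def fill_rows(s1, s2, rows, match=1, mismatch=-1, gap=-2):
    """Fill the first `rows` rows of T, left to right and top to bottom."""
    m = len(s1)
    T = []
    for j in range(rows):
        row = []
        for i in range(m + 1):
            if i == 0 and j == 0:
                row.append(0)  # T(0, 0) is initialised to zero
                continue
            options = []
            if i > 0 and j > 0:
                sigma = match if s1[i - 1] == s2[j - 1] else mismatch
                options.append(T[j - 1][i - 1] + sigma)  # previous column & row
            if i > 0:
                options.append(row[i - 1] + gap)         # previous column, same row
            if j > 0:
                options.append(T[j - 1][i] + gap)        # same column, previous row
            row.append(max(options))
        T.append(row)
    return T


for row in fill_rows("TGGTG", "ATCGT", rows=2):
    print(row)
# [0, -2, -4, -6, -8, -10]
# [-2, -1, -3, -5, -7, -9]
```

Raising rows gives further rows of the table, which is one way to check a row filled in by hand.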
Problem
Fill in the next row of matrix T
           T    G    G    T    G
      0   -2   -4   -6   -8  -10
A    -2   -1   -3   -5   -7   -9
T     ?    ?    ?    ?    ?    ?
C
G
T
Answer
Fill in the next row of matrix T
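           T    G    G    T    G
      0   -2   -4   -6   -8  -10
A    -2   -1   -3   -5   -7   -9
T    -4   -1   -2   -4   -4   -6
C
G
T
Arrows: T(0,2) points back to T(0,1); T(1,2) to T(0,1); T(2,2) to T(1,1);
T(3,2) to T(2,1) and T(2,2); T(4,2) to T(3,1); T(5,2) to T(4,2)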
N-W: the traceback step
When we have filled in the whole of matrix T, it looks like:
           T    G    G    T    G
      0   -2   -4   -6   -8  -10
A    -2   -1   -3   -5   -7   -9
T    -4   -1   -2   -4   -4   -6
C    -6   -3   -2   -3   -5   -5
G    -8   -5   -2   -1   -3   -4
T   -10   -7   -4   -3    0   -2
In the traceback step we use the filled-in matrix T to work out the best
alignment between the two sequences S1 & S2
We start at the bottom right cell of matrix T
We then follow the arrow to the previous cell used to calculate the best
value for that cell
From there, follow the arrow to the previous cell... and so on..
The path through matrix T is the traceback (in pink here):
Sequence S1 runs across the top and sequence S2 down the side; the traceback
cells (pink in the original slide) are marked with [ ]:
           T    G    G    T    G
    [0]   -2   -4   -6   -8  -10
A  [-2]   -1   -3   -5   -7   -9
T    -4 [-1]   -2   -4   -4   -6
C    -6   -3 [-2]   -3   -5   -5
G    -8   -5   -2 [-1]   -3   -4
T   -10   -7   -4   -3  [0] [-2]
The best alignment is:
- T G G T G
  |   | |
A T C G T -
To work out the best alignment, follow the traceback from top left to
bottom right, & look at the letters aligned in each cell
Here the 1st cell doesn't correspond to any letter
The 2nd cell is A in sequence S2 but nothing in sequence S1
The 3rd cell is T in sequence S2 and T in sequence S1
The 4th cell is C in sequence S2 and G in sequence S1
The 5th cell is G in sequence S2 and G in sequence S1
The 6th cell is T in sequence S2 and T in sequence S1
The 7th cell is nothing in sequence S2 and G in sequence S1
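The fill and traceback steps can be put together in a short Python sketch. This is an illustration under stated assumptions rather than a reference implementation: the function name needleman_wunsch is hypothetical, the scoring is the +1/-1/-2 scheme from these slides, and instead of storing arrows the traceback re-checks which neighbour produced each value. When two arrows tie it prefers the diagonal, then the previous column, so it returns only one best alignment.

```python
def needleman_wunsch(s1, s2, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s1, aligned_s2) for one best global alignment."""
    m, n = len(s1), len(s2)
    # T is a list of rows: the slide's T(i, j) is T[j][i] here.
    T = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, m + 1):
        T[0][i] = T[0][i - 1] + gap
    for j in range(1, n + 1):
        T[j][0] = T[j - 1][0] + gap
        for i in range(1, m + 1):
            sigma = match if s1[i - 1] == s2[j - 1] else mismatch
            T[j][i] = max(T[j - 1][i - 1] + sigma,  # previous column & row
                          T[j][i - 1] + gap,        # previous column, same row
                          T[j - 1][i] + gap)        # same column, previous row

    # Traceback: start at the bottom right cell and walk back to T(0, 0).
    i, j = m, n
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sigma = match if s1[i - 1] == s2[j - 1] else mismatch
            if T[j][i] == T[j - 1][i - 1] + sigma:
                top.append(s1[i - 1])
                bottom.append(s2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and T[j][i] == T[j][i - 1] + gap:
            top.append(s1[i - 1])   # letter of S1 aligned to a gap
            bottom.append("-")
            i -= 1
        else:
            top.append("-")         # letter of S2 aligned to a gap
            bottom.append(s2[j - 1])
            j -= 1
    # The letters were collected from the end, so reverse them.
    return T[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("TGGTG", "ATCGT")
print(score)
print(top)
print(bottom)
```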
Problem
The traceback is shown in pink in the matrix T
below (marked [x] here). What is the best alignment?
           A    C    C    T
    [x]  [x]    x    x    x
C     x    x  [x]  [x]    x
T     x    x    x    x  [x]
G     x    x    x    x  [x]
Answer
The traceback is shown in pink in the matrix T
above. What is the best alignment?
It is:  A C C T -
          |   |
        - C - T G
The Needleman-Wunsch algorithm uses an approach
called dynamic programming (d.p.)
d.p. algorithms solve problems by breaking a large problem into smaller
easy problems of a similar type
The N-W algorithm works by progressively building optimal
alignments of longer and longer subsequences of S1 & S2
           T    G    G    T    G
      0   -2   -4   -6   -8  -10
A    -2   -1   -3   -5   -7   -9
T    -4   -1   -2   -4   -4   -6
C    -6   -3   -2   -3   -5   -5
G    -8   -5   -2   -1   -3   -4
T   -10   -7   -4   -3    0   -2
The best alignment has:
3 matches (score +3)
1 mismatch (score -1)
2 gaps (score -4)
- T G G T G
  |   | |
A T C G T -
Score = 3-1-4 = -2
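The score of a finished alignment can be checked by counting its columns. A small Python sketch, assuming gaps are written as "-" and using the +1/-1/-2 scoring from these slides; the function name score_alignment is hypothetical.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2):
    """Count matches, mismatches and gaps column by column, then total them."""
    matches = mismatches = gaps = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    total = matches * match + mismatches * mismatch + gaps * gap
    return matches, mismatches, gaps, total


print(score_alignment("-TGGTG", "ATCGT-"))  # (3, 1, 2, -2)
```

The total agrees with the value in the bottom right cell of the table.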
Software for making alignments
For Needleman-Wunsch pairwise alignment
pairwiseAlignment() in the Biostrings R library
the EMBOSS (emboss.sourceforge.net/) needle program
Problem
How many times faster is it to find the best
alignment for sequences RQQEPVRSTC &
QQESGPVRST using the Needleman-Wunsch
algorithm, compared to assessing each possible
alignment one-by-one?
Answer
How many times faster is it to find the best
alignment for sequences RQQEPVRSTC &
QQESGPVRST using the N-W algorithm, compared
to assessing each possible alignment one-by-one?
The sequence length, n, is 10 here
This means it will take time proportional to $n^2 = 100$ to find the best
alignment using N-W
It will take time proportional to $\binom{2n}{n} = \binom{20}{10} = 184{,}756$ to find the best
alignment by assessing each possible alignment one-by-one
We can find the best alignment about 1848 times (= 184756/100)
faster by using N-W
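The arithmetic in this answer can be reproduced with Python's standard library (math.comb needs Python 3.8 or later). The variable names are chosen here for illustration.

```python
import math

n = 10                                   # length of each sequence
nw_steps = n ** 2                        # N-W: proportional to n squared
one_by_one = math.comb(2 * n, n)         # assessing every alignment: (2n choose n)
speedup = one_by_one / nw_steps

print(nw_steps, one_by_one, round(speedup))  # 100 184756 1848
```

Changing n shows how quickly the gap widens as the sequences get longer.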
Problem
Find the best alignment between the sequences
WHAT and WHY, using the Needleman-Wunsch
algorithm, with +1 for a match, -1 for a mismatch,
and -2 for a gap.
Answer
Find the best alignment between WHAT & WHY
using N-W with match:+1, mismatch:-1, gap:-2
Matrix T looks like this, giving 2 possible tracebacks:
           W    H    A    T
      0   -2   -4   -6   -8
W    -2    1   -1   -3   -5
H    -4   -1    2    0   -2
Y    -6   -3    0    1   -1
Pink traceback: T(0,0), T(1,1), T(2,2), T(3,2), T(4,3)
Orange traceback: T(0,0), T(1,1), T(2,2), T(3,3), T(4,3)
The two possible tracebacks give two equally good best alignments:
W H A T              W H A T
| |                  | |
W H - Y              W H Y -
(Pink traceback)     (Orange traceback)
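When a cell has more than one arrow, each arrow leads to a different best alignment. The Python sketch below follows every arrow instead of picking one; it is an illustration assuming the +1/-1/-2 scoring from this problem, and the function name all_best_alignments is hypothetical. Because the number of tied alignments can grow quickly, this approach is only sensible for short sequences.

```python
def all_best_alignments(s1, s2, match=1, mismatch=-1, gap=-2):
    """Return (score, alignments), where alignments lists every best alignment."""
    m, n = len(s1), len(s2)
    # T is a list of rows: the slide's T(i, j) is T[j][i] here.
    T = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, m + 1):
        T[0][i] = i * gap
    for j in range(1, n + 1):
        T[j][0] = j * gap
        for i in range(1, m + 1):
            sigma = match if s1[i - 1] == s2[j - 1] else mismatch
            T[j][i] = max(T[j - 1][i - 1] + sigma,
                          T[j][i - 1] + gap,
                          T[j - 1][i] + gap)

    def walk(i, j):
        """All alignments of s1[:i] with s2[:j] that reach the value T(i, j)."""
        if i == 0 and j == 0:
            return [("", "")]
        found = []
        if i > 0 and j > 0:
            sigma = match if s1[i - 1] == s2[j - 1] else mismatch
            if T[j][i] == T[j - 1][i - 1] + sigma:   # diagonal arrow
                found += [(a + s1[i - 1], b + s2[j - 1])
                          for a, b in walk(i - 1, j - 1)]
        if i > 0 and T[j][i] == T[j][i - 1] + gap:   # arrow to previous column
            found += [(a + s1[i - 1], b + "-") for a, b in walk(i - 1, j)]
        if j > 0 and T[j][i] == T[j - 1][i] + gap:   # arrow to previous row
            found += [(a + "-", b + s2[j - 1]) for a, b in walk(i, j - 1)]
        return found

    return T[n][m], walk(m, n)


score, alignments = all_best_alignments("WHAT", "WHY")
print(score)
for top, bottom in alignments:
    print(top, bottom)
```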
Further Reading
Chapter 3 in Introduction to Computational Genomics, Cristianini & Hahn
Chapter 6 in Computational Genome Analysis, Deonier et al
Practical on pairwise alignment in R in the Little Book of R for
Bioinformatics:
https://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html