Burrows-Wheeler Transform
DNA sequencing
How we obtain the sequence of nucleotides of a species
ACGTGACTGAGGACCGTG
CGACTGAGACTGACTGGGT
CTAGCTAGACTACGTTTTA
TATATATATACGTCGTCGT
ACTGATGACTAGATTACAG
ACTGATTTAGATACCTGAC
TGATTTTAAAAAAATATT
DNA Sequencing
Goal:
Find the complete sequence of A, C, G, and T nucleotides in a DNA molecule
Challenge:
There is no machine that takes long DNA as an input, and gives the
complete sequence as output
Can only sequence ~150 letters at a time
Resequencing
Read Mapping
CATCGACCGAGCGCGATGCTAGCTAGGTGATCGT......
TGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATCGT...
GCATGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATCGT
GTGCATGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATC
......AGGTGCATGCCGCATCGATCGAGCGCGATGCTAGCTAGCTGATCGT......
[Figure: types of genomic variation. Labels: Novel Sequence, Inversion, Mobile Element or Pseudogene Insertion, Translocation, Tandem Duplication, Microdeletion, Large Deletion, Transposition, Novel Sequence at Breakpoint. Example sequences shown: TGCTGAGA, TGCCGAGA, TGC - - AGA, TGCTCGGAGA, TGC - - - GAGA.]
reversible permutation of S
that enables the search for a
Burrows-Wheeler Transform
ANA
X = BANANA
suffixes of BANANA:
BANANA
ANANA
NANA
ANA
NA
A
Burrows-Wheeler Transform
ANA
X = BANANA$
suffixes of BANANA$:
BANANA$
ANANA$
NANA$
ANA$
NA$
A$
$
Burrows-Wheeler Transform
ANA
X = BANANA$
rotations of BANANA$:
BANANA$
ANANA$B
NANA$BA
ANA$BAN
NA$BANA
A$BANAN
$BANANA
Burrows-Wheeler Transform
ANA
X = BANANA$
rotations of BANANA$:
BANANA$
ANANA$B
NANA$BA
ANA$BAN
NA$BANA
A$BANAN
$BANANA
sorted rotations:
$BANANA
A$BANAN
ANA$BAN
ANANA$B
BANANA$
NA$BANA
NANA$BA
Burrows-Wheeler Transform
ANA
X = BANANA$
BWT matrix of string BANANA (the sorted rotations of BANANA$):
$BANANA
A$BANAN
ANA$BAN
ANANA$B
BANANA$
NA$BANA
NANA$BA
BWT(BANANA) = ANNB$AA (the last column of the matrix, read top to bottom)
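The construction on these slides can be sketched directly in Python. This is a minimal illustration, not an efficient implementation: it builds every rotation explicitly, and it assumes the input already ends with a unique `$` that sorts before every other character (true for ASCII letters). The function name `bwt` is illustrative.

```python
def bwt(text: str) -> str:
    """Return the BWT of `text`, which must end with a unique '$' end marker."""
    assert text.endswith("$") and text.count("$") == 1
    # All cyclic rotations of the text, one per starting position.
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    # Sorting the rotations gives the rows of the BWT matrix.
    rotations.sort()
    # The transform is the last column of the sorted matrix.
    return "".join(row[-1] for row in rotations)


print(bwt("BANANA$"))  # ANNB$AA
```

Building all rotations costs quadratic memory, so this is only suitable for small examples like the one on the slide.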
Suffix Arrays
$BANANA
A$BANAN
ANA$BAN
ANANA$B
BANANA$
NA$BANA
NANA$BA
1
2
3
4
5
6
7
$BANANA
A$BANAN
ANA$BAN
ANANA$B
BANANA$
NA$BANA
NANA$BA
BWT(X)
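The slide shows only the sorted matrix with its row numbers. As a hedged sketch of the standard relation between the two structures, the code below computes the suffix array of X (the start position of the suffix at the front of each sorted row) and reads BWT(X) off it. Positions are 1-based to match the slides, and the function names are illustrative.

```python
def suffix_array(text: str) -> list[int]:
    """1-based start positions of the suffixes of `text`, in sorted order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def bwt_from_suffix_array(text: str, sa: list[int]) -> str:
    # The BWT character of a row is the character just before its suffix.
    # For the suffix starting at position 1 the index wraps around to the
    # final '$', which Python's negative indexing handles (text[-1]).
    return "".join(text[i - 2] for i in sa)


sa = suffix_array("BANANA$")
print(sa)                                    # [7, 6, 4, 2, 1, 5, 3]
print(bwt_from_suffix_array("BANANA$", sa))  # ANNB$AA
```

Sorting suffixes by slicing is quadratic in the worst case; it is used here only because it keeps the example short.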
Reconstructing BANANA
$BANANA
A$BANAN
ANA$BAN
ANANA$B
BANANA$
NA$BANA
NANA$BA
BWT matrix of
string BANANA
BWT: A N N B $ A A
sort: $ A A A B N N
append BWT (as a new first column): A$ NA NA BA $B AN AN
sort: $B A$ AN AN BA NA NA
append BWT: A$B NA$ NAN BAN $BA ANA ANA
sort: $BA A$B ANA ANA BAN NA$ NAN
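The sort-and-append procedure above can be written as a short loop. This is a minimal sketch of that naive inversion, assuming the BWT string contains exactly one `$`. It rebuilds the whole matrix, so it is illustrative rather than practical, and the function name is an assumption.

```python
def inverse_bwt_naive(bwt: str) -> str:
    """Rebuild the original text (including '$') by repeated append-and-sort."""
    rows = [""] * len(bwt)
    for _ in range(len(bwt)):
        # Put the BWT column in front of the current rows, then sort.
        rows = sorted(c + row for c, row in zip(bwt, rows))
    # After len(bwt) rounds `rows` is the full BWT matrix; the original text
    # is the row in which the end marker comes last.
    return next(row for row in rows if row.endswith("$"))


print(inverse_bwt_naive("ANNB$AA"))  # BANANA$
```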
$BANANA
A$BANAN
ANA$BAN
ANANA$B
BANANA$
NA$BANA
NANA$BA
BWT matrix of
string BANANA
A $BANAN
N A$BANA
N ANA$BA
B ANANA$
$ BANANA
A NA$BAN
A NANA$B
A$BANAN
ANA$BAN
ANANA$B
Same words,
same sorted order
LF[] = [2, 6, 7, 5, 1, 3, 4]
Computing LF() is easy:
Let C(a) = # of characters in X that are lexicographically smaller than a
Example: C($) = 0; C(A) = 1; C(B) = 4; C(N) = 5
Let row r end with the i-th occurrence of a in the last column
Then LF(r) = C(a) + i
(why?)
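As a small sketch of the rule LF(r) = C(a) + i, the code below computes C() and the whole LF[] array from the BWT string alone. Rows are 1-based to match the slides; the helper names `c_table` and `lf_mapping` are illustrative.

```python
from collections import Counter


def c_table(bwt: str) -> dict[str, int]:
    """C(a): number of characters in the text that are smaller than a."""
    counts = Counter(bwt)
    table, total = {}, 0
    for ch in sorted(counts):
        table[ch] = total
        total += counts[ch]
    return table


def lf_mapping(bwt: str) -> list[int]:
    """LF(r) for r = 1..n, returned as a list (entry 0 holds LF(1))."""
    c = c_table(bwt)
    seen = Counter()  # occurrences of each character seen so far
    lf = []
    for ch in bwt:
        seen[ch] += 1                 # this is the i-th occurrence of ch
        lf.append(c[ch] + seen[ch])   # LF(r) = C(a) + i
    return lf


print(c_table("ANNB$AA"))     # {'$': 0, 'A': 1, 'B': 4, 'N': 5}
print(lf_mapping("ANNB$AA"))  # [2, 6, 7, 5, 1, 3, 4]
```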
[Figure: BWT matrix of string BANANA, annotated with C() (copied for convenience), the index i indicating the i-th occurrence of c, and LF() = C() + i]
Reconstruct BANANA:
S := ""; r := 1; c := BWT[r];
UNTIL c = $ {
  S := cS;
  r := LF(r);
  c := BWT[r]; }
Credit: Ben Langmead thesis
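The pseudocode above translates almost line by line into Python. This sketch recomputes LF[] inline so that it is self-contained. Rows are 1-based as on the slides, and the function name `reconstruct` is illustrative.

```python
from collections import Counter


def reconstruct(bwt: str) -> str:
    """Walk the LF mapping from row 1 to recover the text without its '$'."""
    counts = Counter(bwt)
    c, total = {}, 0
    for ch in sorted(counts):
        c[ch], total = total, total + counts[ch]
    seen, lf = Counter(), []
    for ch in bwt:
        seen[ch] += 1
        lf.append(c[ch] + seen[ch])   # LF(r) = C(a) + i

    s, r = "", 1                      # row 1 is the rotation starting with '$'
    ch = bwt[r - 1]
    while ch != "$":
        s = ch + s                    # S := cS (the text is built back to front)
        r = lf[r - 1]
        ch = bwt[r - 1]
    return s


print(reconstruct("ANNB$AA"))  # BANANA
```

Unlike the sort-and-append procedure, this walk touches each row once and never materialises the matrix.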
Let L(W) and U(W) be the first and last rows of the BWT matrix that begin with W.
Example:
L(NA) = 6
U(NA) = 7
Lemma (prove as exercise)
L(aW) = C(a) + i + 1,
where i = # of a's up to row L(W) - 1 in BWT(X)
U(aW) = C(a) + j,
where j = # of a's up to row U(W) in BWT(X)
Example:
L(ANA) = C(A) + (# of A's up to row L(NA) - 1) + 1
= 1 + (# of A's up to row 5) + 1
= 1 + 1 + 1 = 3
U(ANA) = C(A) + (# of A's up to row U(NA)) = 1 + 3 = 4
ExactMatch(W[1..k]) {
  a := W[k];
  low := C(a) + 1;
  high := C(a+1); // a+1: lexicographically next char
  i := k - 1;
  while (low <= high && i >= 1) {
    a := W[i];
    // LFC(r, a) = C(a) + # of a's in BWT[1..r]
    low := LFC(low - 1, a) + 1;
    high := LFC(high, a);
    i := i - 1; }
  return (low, high);
}
[Figure: BWT matrix of string BANANA]
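A runnable sketch of ExactMatch follows. It makes a few choices the slide does not: the initial `high` is written as C(a) + (number of a's), which equals C(a+1) without needing a "next character"; patterns containing characters absent from the text return an empty interval; and LFC counts characters by scanning a prefix of the BWT, whereas practical indexes use precomputed occurrence tables. Rows are 1-based and the function name is illustrative.

```python
from collections import Counter


def exact_match(bwt: str, w: str) -> tuple[int, int]:
    """Return (low, high), the 1-based rows of the BWT matrix that begin
    with the non-empty pattern w. The interval is empty when low > high."""
    counts = Counter(bwt)
    c, total = {}, 0
    for ch in sorted(counts):
        c[ch], total = total, total + counts[ch]

    def lfc(r: int, a: str) -> int:
        # C(a) + number of a's in BWT[1..r]
        return c[a] + bwt[:r].count(a)

    a = w[-1]
    if a not in c:
        return (1, 0)                 # character never occurs: empty interval
    low, high = c[a] + 1, c[a] + counts[a]
    for a in reversed(w[:-1]):        # extend the match one character leftwards
        if low > high:
            break
        if a not in c:
            return (1, 0)
        low, high = lfc(low - 1, a) + 1, lfc(high, a)
    return (low, high)


print(exact_match("ANNB$AA", "NA"))   # (6, 7)
print(exact_match("ANNB$AA", "ANA"))  # (3, 4)
```

The two printed intervals agree with L(NA), U(NA) and L(ANA), U(ANA) in the lemma example.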
Short Read Alignment with Populations of Genomes (ISMB 2013)
Next-Generation Sequencing
Short reads (~125 bp)
Known reference genome
TTCGATCGTCGAAGGGCCCTTTAAGCTAGACTTTAGTG
AAATTCCG
TCGTTGAAGGACCCTTTAAGC
CGAAGGGCCCTTTAAGCTAGACTTTAGTG
TCGTTGAAGGACCCTTTAAGC
AAATTCCGTTCGATCGT
Reference-biased results
Large database of genomic data
Highly redundant data
Collection size
reference representation
2.
Reference Multi-Genome
[Figure: genome sequences drawn base by base, then collapsed into a single reference multi-genome in which SNPs and INDEL bubbles are marked. The individual bases are not recoverable from the extracted text.]
1. Encode SNPs with IUPAC codes
[Figure: the multi-genome sequence with SNP positions written as IUPAC codes; Y, M, and S appear in this example.]
IUPAC code   Bases
B            C|G|T
D            A|G|T
H            A|C|T
K            G|T
M            A|C
N            A|C|G|T
R            A|G
S            C|G
V            A|C|G
W            A|T
Y            C|T
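Step 1 can be illustrated with a lookup from the set of bases observed at a position to its IUPAC code. The table uses the standard IUPAC nucleotide codes. The function names and the way SNPs are passed in (a dictionary from 0-based position to alternative base) are assumptions made for this sketch, not something the slides specify.

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("ACGT"): "N",
    frozenset("AG"): "R", frozenset("CG"): "S", frozenset("ACG"): "V",
    frozenset("AT"): "W", frozenset("CT"): "Y",
}


def iupac_code(bases) -> str:
    """IUPAC code for an iterable of bases, e.g. {'C', 'T'} -> 'Y'."""
    return IUPAC[frozenset(bases)]


def encode_snps(reference: str, snps: dict[int, str]) -> str:
    """Replace each SNP position (0-based, hypothetical input format) with
    the IUPAC code covering the reference base and the alternative base."""
    seq = list(reference)
    for pos, alt in snps.items():
        seq[pos] = iupac_code({seq[pos], alt})
    return "".join(seq)


print(encode_snps("ACGT", {1: "T", 2: "C"}))  # AYST
```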
2. Append INDEL branches padded with surrounding context
[Figure: INDEL branches appended after the IUPAC-encoded sequence, with ## markers between the appended segments. The individual bases are not recoverable from the extracted text.]
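Step 2 can be sketched as follows, with several assumptions the slides do not confirm: each INDEL is given as a hypothetical `(start, end, alt)` triple meaning "replace reference[start:end] with alt" (0-based, end exclusive), the amount of padding is a free parameter `context`, and branches are joined with a single `#` separator. It illustrates the idea only and is not the authors' implementation.

```python
def append_indel_branches(reference: str,
                          indels: list[tuple[int, int, str]],
                          context: int,
                          sep: str = "#") -> str:
    """Append one padded branch per INDEL to the end of the reference.

    An insertion has start == end; a deletion has alt == "".
    """
    pieces = [reference]
    for start, end, alt in indels:
        left = reference[max(0, start - context):start]   # bases before the INDEL
        right = reference[end:end + context]              # bases after the INDEL
        pieces.append(left + alt + right)
    return sep.join(pieces)


# Insertion of T between positions 3 and 4, and deletion of positions 2..3.
print(append_indel_branches("AACCGGTT", [(4, 4, "T"), (2, 4, "")], context=2))
# AACCGGTT#CCTGG#AAGG
```

Padding each branch with its neighbouring bases lets a read that spans the INDEL align entirely within the appended branch.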