
Burrows-Wheeler Transform

DNA sequencing
How we obtain the sequence of nucleotides of a species

ACGTGACTGAGGACCGTG
CGACTGAGACTGACTGGGT
CTAGCTAGACTACGTTTTA
TATATATATACGTCGTCGT
ACTGATGACTAGATTACAG
ACTGATTTAGATACCTGAC
TGATTTTAAAAAAATATT

DNA Sequencing
Goal:
Find the complete sequence of A, C, G, Ts in DNA
Challenge:
There is no machine that takes long DNA as an input, and gives the
complete sequence as output
Can only sequence ~150 letters at a time

Method to sequence longer regions

Two main assembly problems


De Novo Assembly

Resequencing

Read Mapping

CATCGACCGAGCGCGATGCTAGCTAGGTGATCGT......
TGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATCGT...
GCATGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATCGT
GTGCATGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATC
......AGGTGCATGCCGCATCGATCGAGCGCGATGCTAGCTAGCTGATCGT......

Want ultra fast, highly similar alignment


Detection of genomic variation

Human Genome Variation


SNP
TGCTGAGA
TGCCGAGA

[Figure: other variation types shown are novel sequence, inversion, mobile element or pseudogene insertion, translocation, tandem duplication, microdeletion, large deletion, transposition, and novel sequence at breakpoint. Example alignments with gaps: TGC--AGA / TGCCGAGA and TGCTCGGAGA / TGC---GAGA.]

Read Mapping

Modern fast read aligners: BWA, Bowtie, SOAP
- Based on the Burrows-Wheeler transform

Burrows-Wheeler Transform (BWT)


The BWT of a string S is a reversible permutation of S that enables the search for a pattern P in S to take linear time with respect to the length of P (O(|P|) time, independent of the length of S).

Burrows-Wheeler Transform

X = BANANA

Suffixes of BANANA:
BANANA
ANANA
NANA
ANA
NA
A

1. Append the end marker $, which sorts before every letter: X = BANANA$

2. Extend each suffix of BANANA$ to a full rotation of the string:
BANANA$
ANANA$B
NANA$BA
ANA$BAN
NA$BANA
A$BANAN
$BANANA

3. Sort the rotations lexicographically:
$BANANA
A$BANAN
ANA$BAN
ANANA$B
BANANA$
NA$BANA
NANA$BA

This is the BWT matrix of string BANANA. Its last column is the transform:
BWT(BANANA) = ANNB$AA
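
The construction above can be sketched in a few lines of Python. This is a minimal illustration, not an efficient implementation: the function name `bwt_via_rotations` is hypothetical, and it assumes `$` does not occur in the text and sorts before every other character (true for uppercase letters in Python's string ordering).

```python
def bwt_via_rotations(text, end_marker="$"):
    """Return the BWT of text by sorting all rotations of text + end_marker."""
    s = text + end_marker
    # All cyclic rotations of s, one per starting position.
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    # Python compares strings by code point, so "$" sorts before "A".."Z".
    rotations.sort()
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(row[-1] for row in rotations)


print(bwt_via_rotations("BANANA"))  # ANNB$AA
```

Building every rotation takes quadratic space, which is why practical tools derive the BWT from a suffix array instead.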

Suffix Arrays
i  Row i of the BWT matrix
1  $BANANA
2  A$BANAN
3  ANA$BAN
4  ANANA$B
5  BANANA$
6  NA$BANA
7  NANA$BA

Suffixes are sorted in the BWT matrix.

S(i) = j, where X_j ... X_n is the i-th suffix lexicographically.

BWT(X) constructed from S: at each position, take the letter to the left of the one pointed to by S.
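
A small sketch of this relationship, assuming 1-based suffix positions as on the slide. The helper names `suffix_array` and `bwt_from_suffix_array` are hypothetical, and the suffix array is built by naive sorting rather than by a linear-time algorithm.

```python
def suffix_array(s):
    """1-based start positions of the suffixes of s, in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda j: s[j - 1:])


def bwt_from_suffix_array(s, sa):
    """Take the letter to the left of each sorted suffix.

    The suffix starting at position 1 has no left neighbour, so it wraps
    around to the last character of s (the end marker).
    """
    return "".join(s[j - 2] if j > 1 else s[-1] for j in sa)


x = "BANANA$"
sa = suffix_array(x)
print(sa)                            # [7, 6, 4, 2, 1, 5, 3]
print(bwt_from_suffix_array(x, sa))  # ANNB$AA
```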

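Reconstructing BANANA

BWT matrix of string BANANA:
$BANANA
A$BANAN
ANA$BAN
ANANA$B
BANANA$
NA$BANA
NANA$BA

Start from the BWT column alone. Repeatedly sort the rows, then prepend the BWT column to the sorted rows:

BWT   sort   prepend BWT   sort   prepend BWT   sort
A     $      A$            $B     A$B           $BA
N     A      NA            A$     NA$           A$B
N     A      NA            AN     NAN           ANA
B     A      BA            AN     BAN           ANA
$     B      $B            BA     $BA           BAN
A     N      AN            NA     ANA           NA$
A     N      AN            NA     ANA           NAN

Each round recovers one more column of the BWT matrix; after n rounds the full matrix is rebuilt.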
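
The sort-and-prepend procedure can be sketched as follows. This is an illustrative, deliberately slow version (it re-sorts n rows n times); the function name `inverse_bwt_naive` is hypothetical, and it assumes the end marker `$` occurs exactly once.

```python
def inverse_bwt_naive(bwt, end_marker="$"):
    """Rebuild the original text by repeatedly prepending the BWT column and sorting."""
    n = len(bwt)
    table = [""] * n
    for _ in range(n):
        # Prepend the BWT column to the current rows, then re-sort the rows.
        table = sorted(bwt[i] + table[i] for i in range(n))
    # After n rounds the table is the full BWT matrix; the row that ends
    # with the end marker is the original text followed by the marker.
    row = next(r for r in table if r.endswith(end_marker))
    return row[:-1]


print(inverse_bwt_naive("ANNB$AA"))  # BANANA
```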

Reconstructing BANANA - faster

BWT matrix of string BANANA:
$BANANA
A$BANAN
ANA$BAN
ANANA$B
BANANA$
NA$BANA
NANA$BA

Lemma. The i-th occurrence of character c in the last column is the same text character as the i-th occurrence of c in the first column.

Rotate every row one position to the right, so that its last character comes first:

A $BANAN
N A$BANA
N ANA$BA
B ANANA$
$ BANANA
A NA$BAN
A NANA$B

The rotated rows that begin with A (A$BANAN, ANA$BAN, ANANA$B) are exactly the rows of the matrix that begin with A: same words, same sorted order.
Reconstructing BANANA - faster

BWT matrix of string BANANA:
$BANANA
A$BANAN
ANA$BAN
ANANA$B
BANANA$
NA$BANA
NANA$BA

Lemma. The i-th occurrence of character a in the last column is the same text character as the i-th occurrence of a in the first column.
LF(): map the i-th occurrence of character a in the last column to the first column.
LF(r): let row r contain the i-th occurrence of a in the last column. Then LF(r) = r', where r' is the i-th row starting with a.

LF[] = [2, 6, 7, 5, 1, 3, 4]

Row LF(r) is obtained by rotating row r one position to the right.

Reconstructing BANANA - faster

BWT matrix of string BANANA:
$BANANA
A$BANAN
ANA$BAN
ANANA$B
BANANA$
NA$BANA
NANA$BA

LF[] = [2, 6, 7, 5, 1, 3, 4]

Computing LF() is easy:
Let C(a) = # of characters smaller than a
Example: C($) = 0; C(A) = 1; C(B) = 4; C(N) = 5
Let row r end with the i-th occurrence of a in the last column
Then, LF(r) = C(a) + i

(why?)
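
One way to see why: the rows starting with a occupy rows C(a) + 1 onward in the sorted matrix, and by the lemma the i-th a in the last column is the i-th of those rows. The sketch below computes C() and the LF array directly from the BWT string; the names `c_table` and `lf_mapping` are hypothetical, and rows are numbered from 1 as on the slides.

```python
from collections import Counter


def c_table(bwt):
    """C[a] = number of characters in the text that are smaller than a."""
    counts = Counter(bwt)
    c, total = {}, 0
    for ch in sorted(counts):
        c[ch] = total
        total += counts[ch]
    return c


def lf_mapping(bwt):
    """1-based LF array: entry r is C(a) + i for the i-th occurrence of a at row r."""
    c = c_table(bwt)
    seen = Counter()
    lf = []
    for ch in bwt:
        seen[ch] += 1  # this is now the i-th occurrence of ch
        lf.append(c[ch] + seen[ch])
    return lf


print(c_table("ANNB$AA"))     # {'$': 0, 'A': 1, 'B': 4, 'N': 5}
print(lf_mapping("ANNB$AA"))  # [2, 6, 7, 5, 1, 3, 4]
```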

Reconstructing BANANA - faster

BWT matrix of string BANANA, with C() copied for convenience and index i indicating that this is the i-th occurrence of c in the last column:

Row  Matrix row  Last char c  C(c)  i  LF() = C() + i
1    $BANANA     A            1     1  2
2    A$BANAN     N            5     1  6
3    ANA$BAN     N            5     2  7
4    ANANA$B     B            4     1  5
5    BANANA$     $            0     1  1
6    NA$BANA     A            1     2  3
7    NANA$BA     A            1     3  4

Reconstruct BANANA:
S := ""; r := 1; c := BWT[r];
UNTIL c = $ {
  S := cS;
  r := LF(r);
  c := BWT[r]; }
Credit: Ben Langmead thesis
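
The loop above can be written as runnable Python. This is a sketch under stated assumptions: the function name `inverse_bwt_lf` is hypothetical, rows are 0-based here (so the LF values are one less than on the slide), and the end marker `$` is assumed to be unique and smaller than every other character, which puts it at the start of row 0.

```python
from collections import Counter


def inverse_bwt_lf(bwt, end_marker="$"):
    """Recover the original text from its BWT by walking the LF mapping."""
    # C[a]: number of characters smaller than a.
    counts = Counter(bwt)
    c, total = {}, 0
    for ch in sorted(counts):
        c[ch] = total
        total += counts[ch]

    # 0-based LF: C(a) plus the number of earlier occurrences of a.
    seen = Counter()
    lf = []
    for ch in bwt:
        lf.append(c[ch] + seen[ch])
        seen[ch] += 1

    # Row 0 starts with the end marker, so its last character is the
    # last character of the text. Walk backwards until the marker is hit.
    collected = []
    r = 0
    ch = bwt[r]
    while ch != end_marker:
        collected.append(ch)
        r = lf[r]
        ch = bwt[r]
    return "".join(reversed(collected))


print(inverse_bwt_lf("ANNB$AA"))  # BANANA
```

Each step is a constant-time table lookup, so the walk takes O(n) time once C() and LF are built.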

Searching for ANA


L(W): lowest index in the BWT matrix where W is a prefix
U(W): highest index in the BWT matrix where W is a prefix

BWT matrix of string BANANA:
$BANANA
A$BANAN
ANA$BAN
ANANA$B
BANANA$
NA$BANA
NANA$BA

Example:
L(NA) = 6
U(NA) = 7

Lemma (prove as exercise)
L(aW) = C(a) + i + 1, where i = # of a's up to row L(W) - 1 in BWT(X)
U(aW) = C(a) + j, where j = # of a's up to row U(W) in BWT(X)

Example:
L(ANA) = C(A) + (# of A's up to row L(NA) - 1) + 1
       = 1 + (# of A's up to row 5) + 1
       = 1 + 1 + 1 = 3
U(ANA) = C(A) + (# of A's up to row U(NA)) = 1 + 3 = 4

Searching for ANA


Let LFC(r, a) = C(a) + i, where i = # of a's up to row r in BWT

BWT matrix of string BANANA:
$BANANA
A$BANAN
ANA$BAN
ANANA$B
BANANA$
NA$BANA
NANA$BA

ExactMatch(W[1..k]) {
  a := W[k];
  low := C(a) + 1;
  high := C(a+1);  // a+1: lexicographically next char; use the number of rows if a is the largest char
  i := k - 1;
  while (low <= high && i >= 1) {
    a := W[i];
    low := LFC(low - 1, a) + 1;
    high := LFC(high, a);
    i := i - 1; }
  return (low, high);
}

Credit: Ben Langmead thesis
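
A runnable sketch of the backward search. It differs from the pseudocode in one detail: it starts from the interval of the empty pattern (all rows) and applies the same update to every character, which yields the same result. The function name `exact_match` is hypothetical, and the occurrence count is done by rescanning the BWT prefix, so each step here costs O(n) rather than O(1); real indexes precompute or sample these counts.

```python
from collections import Counter


def exact_match(bwt, pattern):
    """Return the 1-based inclusive row interval (low, high) of BWT-matrix rows
    that start with pattern, or None if the pattern does not occur."""
    # C[a]: number of characters smaller than a.
    counts = Counter(bwt)
    c, total = {}, 0
    for ch in sorted(counts):
        c[ch] = total
        total += counts[ch]

    def occ(ch, r):
        # Number of occurrences of ch in rows 1..r of the BWT column.
        return bwt[:r].count(ch)

    low, high = 1, len(bwt)  # interval of the empty pattern: all rows
    for ch in reversed(pattern):
        if ch not in c:
            return None
        low = c[ch] + occ(ch, low - 1) + 1
        high = c[ch] + occ(ch, high)
        if low > high:
            return None
    return low, high


print(exact_match("ANNB$AA", "NA"))   # (6, 7)
print(exact_match("ANNB$AA", "ANA"))  # (3, 4)
print(exact_match("ANNB$AA", "NAB"))  # None
```

The interval for ANA covers rows 3 and 4 (ANA$BAN and ANANA$B), matching the worked example for L(ANA) and U(ANA).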

Summary of BWT algorithm


Suffix array of string X:
S(i) = j, where X_j ... X_n is the i-th suffix lexicographically
BWT follows immediately from the suffix array
- Suffix array construction is possible in O(n); many good O(n log n) algorithms exist

Reconstruct X from BWT(X) in time O(n)

Search for all exact occurrences of W in time O(|W|)
BWT(X) is easier to compress than X

BWT-based Aligners In Practice

Inexact matching: allow mismatches and gaps during alignment
=> keep track of a set of SA intervals
Heuristics: seeds, bounds on number of allowed differences, scoring (gap open/extend, mismatches)
Memory considerations: sampling the suffix array and the occurrence array, compression
Typical aligner phases
- stage 1: BWT index construction
- stage 2: short-read mapping
- stage 3: alignment results reporting/evaluation

Short Read Alignment with Populations of Genomes (ISMB 2013)

Next-Generation Sequencing

Sequencing Protocol (e.g. Illumina)

1000 Genomes Project

Many Other Sequencing Efforts

1000 Plant Genomes Project

Genome 10K (vertebrates)

100K Genome Project (infectious microorganisms)

i5K Project (arthropods)

1001 Genomes Project (A. thaliana)
...

Step 1: Short Read Alignment


New donor genome

Short reads (~125bp)

Known reference genome

TTCGATCGTCGAAGGGCCCTTTAAGCTAGACTTTAGTG

AAATTCCG

TCGTTGAAGGACCCTTTAAGC

Single Reference Alignment

CGAAGGGCCCTTTAAGCTAGACTTTAGTG
TCGTTGAAGGACCCTTTAAGC

AAATTCCGTTCGATCGT

Reference-Biased Results

Large Database of Genomic Data

Naïve Multiple Reference Alignment

[Figure: time/space plotted against collection size; highly redundant data]

Instead: create a compressed reference representation (reference multi-genome) that captures all the variations in the genome collection and can be used for short-read alignment.

Single Reference Aligners


Li H. and Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 2009 (BWA)
Li et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics, 2009
Langmead B. and Salzberg S. Fast gapped-read alignment with Bowtie 2. Nature Methods, 2012, 9:357-359

Use the Burrows-Wheeler Transform (BWT) to efficiently map reads to a single reference genome
Idea: adapt the BWT to operate on the multi-genome representation

Reference Multi-Genome

[Figure: several highly similar genome sequences are collapsed into a single graph; the remaining differences appear as SNPs and INDEL bubbles.]

Linear Reference Multi-Genome

[Figure: the collapsed graph is written out as a linear sequence.]

1. Encode SNPs with IUPAC codes

IUPAC code   Bases
B            C|G|T
D            A|G|T
H            A|C|T
K            G|T
M            A|C
N            A|C|G|T
R            A|G
S            C|G
V            A|C|G
W            A|T
Y            C|T
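
As an illustration of step 1, the sketch below replaces each SNP position in a reference string with the IUPAC code that covers all observed alleles. The function name `encode_snps` and the input format (a dictionary from 0-based position to alternative bases) are assumptions made for this example, not part of the original method description.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Reverse lookup: set of bases -> IUPAC code.
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}


def encode_snps(reference, snps):
    """Return reference with each SNP position replaced by the IUPAC code
    covering the reference base and all alternative alleles.

    snps maps a 0-based position to an iterable of alternative bases
    (a hypothetical input format chosen for this sketch).
    """
    out = list(reference)
    for pos, alts in snps.items():
        alleles = frozenset(alts) | {reference[pos]}
        out[pos] = CODE_FOR[alleles]
    return "".join(out)


# The SNP example from the variation slide: TGCTGAGA vs. TGCCGAGA.
print(encode_snps("TGCTGAGA", {3: "C"}))  # TGCYGAGA
```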

Linear Reference Multi-Genome

2. Append INDEL branches padded with surrounding context

[Figure: the linear sequence with IUPAC-coded SNPs, followed by the INDEL branches, each padded with its surrounding context and delimited by ## markers.]


Adapted Backwards Search


A read can now match multiple substrings in the reference:
e.g. read R = AT can match AT, RW, AY, etc.
Given the SA interval set of W, SA(W) = {[L1(W), U1(W)], [L2(W), U2(W)], ...},
we can compute SA(aW) as:
UNION of SA(sW) for s in S(a) (the set of IUPAC characters matching a)
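
A minimal sketch of this idea, assuming a reference that contains only IUPAC-coded SNPs (the INDEL branches and ## delimiters are left out). All names here (`matching_symbols`, `search_iupac`) are hypothetical, and the index is built by naive sorting with rescanned occurrence counts, so it only illustrates how the interval set grows, not how a real aligner stores it.

```python
from collections import Counter

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matching_symbols(base):
    """All IUPAC symbols that can stand for the given base."""
    return [s for s, bases in IUPAC.items() if base in bases]


def search_iupac(reference, read):
    """Return sorted 0-based positions where read matches reference, treating
    each IUPAC symbol in the reference as a set of possible bases."""
    text = reference + "$"
    sa = sorted(range(len(text)), key=lambda j: text[j:])
    bwt = "".join(text[j - 1] for j in sa)  # text[-1] is "$" when j == 0
    counts = Counter(bwt)
    c, total = {}, 0
    for ch in sorted(counts):
        c[ch] = total
        total += counts[ch]

    intervals = [(1, len(bwt))]  # SA interval set of the empty string
    for a in reversed(read):
        extended = []
        for low, high in intervals:
            # Union over every IUPAC symbol s that matches the read base a.
            for s in matching_symbols(a):
                if s not in c:
                    continue
                new_low = c[s] + bwt[:low - 1].count(s) + 1
                new_high = c[s] + bwt[:high].count(s)
                if new_low <= new_high:
                    extended.append((new_low, new_high))
        intervals = extended
    return sorted(sa[r - 1] for low, high in intervals
                  for r in range(low, high + 1))


# AT matches AY at position 1, RW at position 4, and WT at position 5.
print(search_iupac("CAYGRWT", "AT"))  # [1, 4, 5]
```

The number of intervals can grow with every read character, which is one reason practical aligners bound the search with seeds and limits on allowed differences.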
