Multiple Sequence Alignment (MSA)

Plan
Introduction to sequence alignments
Multiple alignment construction
Traditional approaches
Alignment parameters
Alternative approaches
Multiple alignment main applications

MACSIMS : Multiple Alignment of Complete Sequences
Information Management System
Ecole Phylogénomique, Carry le Rouet 2006
Local alignment / Global alignment
Sequence A
Sequence B
Global alignment
Alignment of the sequences over their whole length
GGCTGACCACCTT
|||||||
GATCACTTCCATG
Optimal global pairwise alignment :

Needleman and Wunsch, 1970
Local alignment
Alignment of the regions of high similarity
GACCACCTT
|||||||
GATCACTT
Optimal local pairwise alignment :

Smith and Waterman, 1981
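The difference between the two strategies can be sketched with a single dynamic programming routine. This is a minimal illustration, not the scoring used on the slide: the match, mismatch and gap values are assumptions chosen for readability, and only the optimal score is returned, not the alignment itself.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal pairwise alignment score by dynamic programming.

    local=False -> global alignment (Needleman-Wunsch style)
    local=True  -> local alignment (Smith-Waterman style)
    The scoring values are illustrative assumptions.
    """
    cols = len(b) + 1
    # First row: leading gaps cost nothing in a local alignment.
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, cur[j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            cur[j] = cell
        prev = cur
    return best if local else prev[-1]


if __name__ == "__main__":
    seq_a, seq_b = "GGCTGACCACCTT", "GATCACTTCCATG"
    print("global:", align_score(seq_a, seq_b))
    print("local: ", align_score(seq_a, seq_b, local=True))
```

With any scoring scheme the local score is at least as high as the global one, because the global alignment is itself one of the candidates a local search can pick.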
Pairwise alignment / Multiple alignment

Query: 177 EMGDTGPCGPCSEIHYDRIGGRDAAHLVNQDDPNVLEIWNLVFIQYNR---EADG----I 229
             G  G  GP  E+ Y                   LE+  LVF+QY +    AD     I
Sbjct: 193 AGG--GNAGPAFEVLYKG-----------------LEVATLVFMQYKKAPANADPSQVVI 233

Query: 230 LK-----PLPKKSIDTGMGLERLVSVLQNKMSNYDTDLFVPYFEAIQKGTGARPYTGKVG 284
           +K     P+  K +DTG GLERLV + Q   + YD  L     E +++  G      ++
Sbjct: 234 IKGEKYVPMETKVVDTGYGLERLVWMSQGTPTAYDAVLGY-VIEPLKRMAGVEKIDERIL 292

Query: 285 AEDA---------DGIDMAYR--------------------------VLADHARTITVAL 309
            E++         D  D+ Y                            +ADH + +T  L
Sbjct: 293 MENSRLAGMFDIEDMGDLRYLREQVAKRVGISVEELERLIRPYELIYAIADHTKALTFML 352
What is a multiple alignment?

A representation of a set of sequences, in which equivalent residues (e.g. functional
or structural) are aligned in columns
Conserved residues
Conservation profile
Secondary structure
MACS
Schematic overview of complete alignment
e.g. domain organisation (Interpro)
Key:
SH3
PI-PLC-X
CH
SH2
PI-PLC-Y
rhoGEF
PH
C2
DAG_PE-bind
Why multiple alignments?

Integration of a sequence in the context of the protein family
Applications :
phylogeny
domain organisation
functional residue identification
2D/3D structure prediction
transmembrane prediction
MSA Construction
Multiple alignment construction

Traditional approaches
Optimal multiple alignment
Progressive multiple alignment
Alignment parameters
Residue similarity matrices
Gap penalties
Alternative approaches
Iterative alignment methods
Combinatorial algorithms
PipeAlign : a protein family analysis tool
Traditional
Approaches
Optimal multiple alignment

Direct extension of pairwise dynamic programming to N dimensions
(Sankoff, 1975).
Examine all possible alignments to find the optimal alignment
Example : alignment of 3 sequences
Problem
The mathematically optimal alignment is not necessarily the biologically optimal alignment
CPU time and memory required are prohibitive for practical purposes (the required time is
proportional to N^k for k sequences of length N) : limited to <10 sequences
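The growth in cost can be made concrete with a few lines of arithmetic. The sketch below counts the cells of the full dynamic programming lattice, taken here as (N+1)^k, which grows like the N^k quoted above; the sequence length of 300 is an arbitrary assumption.

```python
def dp_cells(length, n_sequences):
    """Number of cells in the full DP lattice for n_sequences of equal length."""
    return (length + 1) ** n_sequences


if __name__ == "__main__":
    for k in range(2, 8):
        print(f"{k} sequences of 300 residues: {dp_cells(300, k):.3e} cells")
```

Each additional sequence multiplies the work by roughly the sequence length, which is why exact methods stop being practical after a handful of sequences.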

Progressive multiple alignment
Heuristic algorithm which avoids calculating all possible
alignments, but does not guarantee the optimal alignment
Principle :
Progressively align the sequences (or sequence groups) by pair
Problem :
Which sequences to begin with ? In which order ?
=> first align the closest sequences
How to estimate the distance between the sequences ?

- align all pairs of sequences
- calculate a distance matrix from the pairwise alignments
- construct a guide tree from this distance matrix
- progressive multiple alignment following the branching order in the tree

Example : Alignment of 7 globins (Hbb_human, Hbb_horse, Hba_human,
Hba_horse, Myg_phyca, Glb5_petma and Lgb2_lupla)
Step 1 : Pairwise alignment of all sequences
Ex : pairwise alignments of 2 globin sequences
(1 = Hbb_human, 2 = Hbb_horse, 3 = Hba_human)
1 VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST ...
| |. |||.|| ||| ||| :|||||||||||||||||||||:||||||
2 VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSN ...
1 LTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDLST ...
|.| :|. | | |||| . | | ||| |: . :| |. :| | |||
3 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLS. ...
3 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLSH ...
|| :| | | | ||
| | ||| |: . :| |. :| | |||.
2 LSGEEKAAVLALWDKVNEE..EVGGEALGRLLVVYPWTQRFFDSFGDLSN ...
The alignment can be obtained with :

- global or local method
- dynamic programming or heuristic methods
Example : in Clustalx
=> global alignments
=> choice between
- heuristic method (used in Fasta program) => faster
- dynamic programming (Smith & Waterman) => better

Step 2 : Distance matrix construction
In Clustalx :
distance between 2 sequences = 1 - (nb of identical residues / nb of compared residues)
Ex : Hbb_human vs Hbb_horse = 83% identity = 17% distance

               1    2    3    4    5    6
1 Hbb_human    -
2 Hbb_horse   .17   -
3 Hba_human   .59  .60   -
4 Hba_horse   .59  .59  .13   -
5 Myg_phyca   .77  .77  .75  .75   -
6 Glb5_petma  .81  .82  .73  .74  .80   -
7 Lgb2_lupla  .87  .86  .86  .88  .93  .90
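The distance formula above can be sketched as a small function. It is a minimal illustration that assumes the two sequences are already aligned rows of equal length, and that columns containing a gap (written `-` or `.`) are not counted as compared residues; the real Clustalx implementation may differ in such details.

```python
def identity_distance(row_a, row_b, gap_chars="-."):
    """Distance = 1 - identical / compared, over gap-free columns only."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = identical = 0
    for x, y in zip(row_a, row_b):
        if x in gap_chars or y in gap_chars:
            continue                 # gapped columns are not compared (assumption)
        compared += 1
        identical += (x == y)
    if compared == 0:
        raise ValueError("no residue pair to compare")
    return 1.0 - identical / compared


if __name__ == "__main__":
    hbb_human = "VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST"
    hbb_horse = "VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSN"
    print(round(identity_distance(hbb_human, hbb_horse), 2))
```

The example uses only the first 50 residues shown on the slide, so its value need not equal the 0.17 computed over the full-length sequences.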

Step 3 : Sequential branching / Guide tree construction
[Figure: sequential branching order and guide tree for the seven globin sequences]
- Join the 2 closest sequences
- Recalculate distances and join the 2 closest sequences or nodes
- Repeat until all sequences are joined
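The joining procedure can be sketched on the globin distance matrix of Step 2. The sketch below uses UPGMA (average linkage) as one simple way to "recalculate distances"; this is an assumption made for illustration, since the slides do not say which update rule produced the tree shown, and Clustalx itself builds its guide tree by Neighbor-Joining.

```python
from itertools import combinations

NAMES = ["Hbb_human", "Hbb_horse", "Hba_human", "Hba_horse",
         "Myg_phyca", "Glb5_petma", "Lgb2_lupla"]

# Lower triangle of the distance matrix: row i holds d(i, 0) .. d(i, i-1).
LOWER = [
    [],
    [.17],
    [.59, .60],
    [.59, .59, .13],
    [.77, .77, .75, .75],
    [.81, .82, .73, .74, .80],
    [.87, .86, .86, .88, .93, .90],
]


def upgma(names, lower):
    """Return the guide tree as nested tuples, joining the closest pair first."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}   # id -> (tree, size)
    dist = {frozenset((i, j)): lower[i][j]
            for i in range(len(names)) for j in range(i)}
    next_id = len(names)
    while len(clusters) > 1:
        # Join the two closest sequences or nodes.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: dist[frozenset(pair)])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        # Recalculate distances to the new node (size-weighted average).
        for c in clusters:
            dist[frozenset((next_id, c))] = (
                dist[frozenset((a, c))] * size_a + dist[frozenset((b, c))] * size_b
            ) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree


if __name__ == "__main__":
    print(upgma(NAMES, LOWER))
```

On this matrix the two alpha chains are joined first (distance .13), then the two beta chains (.17), then the two pairs with each other, followed by Myg_phyca, Glb5_petma and finally Lgb2_lupla.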

Step 4 : Progressive alignment
The progressive multiple alignment follows the branching order in tree

[Figure: progressive alignment of the seven globin sequences (Hbb_human, Hbb_horse, Hba_human, Hba_horse, Myg_phyca, Glb5_petma, Lgb2_lupla) following the guide tree; labels H1 to H7]
Progressive multiple alignment methods

[Figure: classification of progressive methods into global and local, by tree-building method. Programs shown: SBpima, multal, clustalx, multalign, pileup, MLpima]
SB - Sequential Branching
UPGMA - Unweighted Pair Grouping Method
ML - Maximum Likelihood
NJ - Neighbor-Joining
Alignment
Parameters

Dynamic programming methods score an alignment using residue similarity
matrices, containing a score for matching all pairs of residues
For proteins, a wide variety of matrices exist: Identity, PAM, Blosum,
Gonnet etc.
[Figure: the PAM 250 matrix]
Matrices are generally constructed by observing the mutations in large sets
of alignments, either sequence-based or structure-based
Matrices range from strict ones for comparing closely related sequences to
soft ones for very divergent sequences.
A single best matrix does not exist!!

ClustalW automatically selects a suitable matrix depending on the observed
pairwise % identity.
Gap penalties
A gap penalty is a cost for introducing gaps into the alignment,
corresponding to insertions or deletions in the sequences
SFGDLSNPGAVMG
HF-DLS-----HG
Fixed (per-residue) penalty :
P = a L, with L the length of the gap
Linear (or affine) penalty :
P = x + y L
x : gap opening penalty (gop)
y : gap extension penalty (gep)
Position specific and residue specific penalties :
ex : in ClustalW, gap penalties are :

- lowered at existing gaps
- increased close to (less than 8 residues from) existing gaps
- lowered in hydrophilic stretches (loops)
- otherwise : gap opening penalties are modified according to the observed relative
frequencies of residues adjacent to gaps (Pascarella & Argos, 1992)
Goal is to introduce gaps in sequence segments
corresponding to flexible regions of the protein structure
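The two penalty formulas can be compared on the example alignment above. This is a minimal sketch: the parameter values (a = 1, gop = 10, gep = 0.2) are assumptions for illustration, and the position-specific adjustments made by ClustalW are not modelled.

```python
import re


def gap_lengths(aligned_row):
    """Lengths of the consecutive gap runs in one aligned row."""
    return [len(run) for run in re.findall(r"-+", aligned_row)]


def fixed_penalty(length, a=1.0):
    """P = a * L: every gap position costs the same."""
    return a * length


def affine_penalty(length, gop=10.0, gep=0.2):
    """P = x + y * L: opening a gap (x) costs more than extending it (y)."""
    return gop + gep * length if length > 0 else 0.0


if __name__ == "__main__":
    row = "HF-DLS-----HG"
    for length in gap_lengths(row):
        print(length, fixed_penalty(length), affine_penalty(length))
```

With the affine form, one gap of five residues costs far less than five separate gaps of one residue, which favours a few long insertions or deletions over many scattered ones.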
Alternative
Approaches
Iterative alignment methods

Iterative Alignment e.g. PRRP (Gotoh, 1993)
- refine an initial progressive multiple alignment by iteratively dividing the alignment
into 2 profiles and realigning them.
Genetic Algorithms e.g. SAGA (Notredame et al, 1996)

- iteratively refine an alignment using genetic algorithms (evolves a population of
alignments in a quasi evolutionary manner)
Segment-to-segment alignment: DIALIGN (Morgenstern et al. 1999)

- search for locally conserved motifs in all sequences and compares segments of
sequences instead of single residues
Hidden Markov Models:

- iteratively refine an alignment using HMMs
e.g. HMMER (Eddy, 1998), SAM (Karplus et al, 2001)
Multiple alignment methods

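[Figure: classification of multiple alignment methods. Progressive methods (global or local; SB, NJ, ML, UPGMA): SBpima, multal, MLpima, multalign, pileup, clustalx. Iterative methods: prrp, dialign, saga (genetic algorithm), hmmt (HMM)]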
BAliBASE: objective evaluation of MACS programs

High quality alignments based on 3D structural superpositions and manually verified
Alignments compared only in reliable core blocks, excluding non superposable regions
Separate reference sets specifically designed to address distinct alignment problems
BAliBASE1 : Thompson et al. 1999 Bioinformatics
BAliBASE2 : Bahr et al, 2001 Nucl Acids Res.

reference set   description
1               small number of sequences: divergence, length
2               a family with one to 3 orphans
3               several sub-families
4               long N/C terminal extensions
5               long insertions
6               repeats
7               transmembrane regions
8               circular permutations
Comparison of multiple alignment methods

=> Need of reference alignments to evaluate the
alignment programs
BAliBASE (Thompson et al. Bioinformatics. 1999) benchmark database
Alignments based on 3D structure superposition
Alignments must be compared only in the superposable regions
Alignments take into account :
- the effect of the number of sequences
- the effect of the sequence length
- the effect of the sequence similarity
- alignment of an orphan sequence with a sequence family
- sub-family alignments
- alignments of sequences with different length (insertions,extensions)
Comparison of multiple alignment methods

> 35% Id : any method
Local / global methods
Colinear sequences => global methods
N/C-ter extensions or insertions => local methods
Progressive / iterative methods

Iterative algorithms usually improve alignment quality
Problems :
- Can give bad alignments in case of orphan sequences
- The iterative process can be very slow !
Example : alignment of 89 histone sequences (66-92 residues):
ClustalW   2 mins 41 secs
PRRP       3 hours 40 mins
Dialign    3 hours 48 mins
To increase the alignment quality, as many sequences as possible have to be integrated !
DbClustal: local and global algorithm coupling
[Figure: Blast database search with the query sequence -> database hits -> Ballast anchors (domains A, B, C) -> DbClustal alignment]
ClustalW / DbClustal comparison
[Figure: the same sequences aligned with ClustalW and with DbClustal]
Combinatorial algorithms
T-Coffee (Notredame et al. 2000) http://igs-server.cnrs-mrs.fr/Tcoffee/

performs local and global alignments for all pairs of sequences, then combines them in a progressive
multiple alignment, similar to ClustalW.
DbClustal (Thompson et al. 2000)
http://bips.u-strasbg.fr/PipeAlign/jump_to.cgi?DbClustal+noid
designed to align the sequences detected by a database search. Locally conserved motifs are detected
using the Ballast program (Plewniak et al. 1999) and are used in the global multiple alignment as
anchor points.
MAFFT (Katoh et al. 2002)
http://timpani.genome.ad.jp/%7Emafft/server
detects locally conserved segments using a Fast Fourier Transform, then uses a restricted global DP
and a progressive algorithm
MUSCLE (Edgar, 2004) http://www.drive5.com/muscle

kmer distances and log-expectation scores, progressive and iterative refinement
PROBCONS (Do et al, 2005) http://probcons.stanford.edu

pairwise consistency based on an objective function
Multiple Alignment Quality
Truncated Alignments

              Ref1 V1  Ref1 V2   Ref2     Ref3       Ref4        Ref5        Time
              (<20%)   (20-40%)  orphans  subgroups  extensions  insertions  (sec)
ClustalW1.83  0.42     0.78      0.42     0.52       0.41        0.38        902
Dialign2.2.1  0.31     0.71      0.37     0.39       0.45        0.43        5993
Mafft5.32     0.44     0.78      0.49     0.53       0.47        0.48        96
Maffti5.32    0.54     0.83      0.56     0.60       0.49        0.57        327
Muscle3.51    0.52     0.82      0.50     0.58       0.46        0.54        523
Muscle_fast   0.40     0.77      0.43     0.44       0.35        0.49        34
Muscle_med    0.45     0.80      0.50     0.59       0.44        0.51        219
Tcoffee2.66   0.47     0.84      0.50     0.64       0.54        0.58        216133
Probcons1.1   0.63     0.87      0.60     0.65       0.54        0.63        19035

1. Significant improvement in accuracy/efficiency since 2000
2. Twilight zone still exists
3. Probcons scores best in all tests, but is MUCH slower than MAFFT or MUSCLE
4. MAFFTI scores slightly better than MUSCLE in all tests, and is more efficient
muscle_fast : muscle maxiters=1 diags1 sv distance1 kbit20_

muscle_medium : muscle maxiters=2
Multiple Alignment Quality
Comparison: truncated versus full length (FL) sequences

              Ref1 V1 (<20%)  Ref1 V2 (20-40%)  Ref2: orphans  Ref3: subgroups  Time (sec) for all refs
              trunc   FL      trunc   FL        trunc   FL     trunc   FL       trunc    FL
ClustalW1.83  0.42    0.24    0.78    0.72      0.42    0.20   0.52    0.27     902      2227
Dialign2.2.1  0.31    0.26    0.71    0.70      0.37    0.29   0.39    0.31     5993     12595
Mafft5.32     0.44    0.25    0.78    0.75      0.49    0.35   0.53    0.38     96       312
Maffti5.32    0.54    0.35    0.83    0.80      0.56    0.40   0.60    0.50     327      1409
Muscle3.51    0.52    0.34    0.82    0.79      0.50    0.36   0.58    0.39     523      3608
Muscle_fast   0.40    0.28    0.77    0.72      0.43    0.29   0.44    0.33     34       132
Muscle_med    0.45    0.29    0.80    0.74      0.50    0.34   0.59    0.38     219      1601
Tcoffee2.66   0.47    0.35    0.84    0.82      0.50    0.40   0.64    0.49     216133   341578
Probcons1.1   0.63    0.43    0.87    0.86      0.60    0.41   0.65    0.54     19035    58488

1. Loss of accuracy is more important in the twilight zone (Ref1 V1, orphans, and subgroups)
2. Probcons still scores best in all tests
3. MAFFT still scores better than MUSCLE in all tests
Multiple alignment quality

Development of objective functions to estimate multiple alignment quality
Sum-of-pairs (Carrillo, Lipman, 1988):
sums the scores of all pairs of sequences (based on a similarity matrix and gap penalty)
Relative Entropy:
uses a normalized log-likelihood ratio to measure the degree of conservation for each column
(identical residues only).
MD (column scores used in ClustalX):
uses a comparison matrix (Gonnet) to take into account similar residues
norMD (Thompson et al, 2001):
- scores by column using a substitution matrix and gap penalties
- normalisation according to the sequences to align (their number, length and the similarity between
them)
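As an illustration of the simplest of these objective functions, the sum-of-pairs score can be sketched as follows. The sketch replaces the similarity matrix with a flat match/mismatch score and charges a constant per residue-gap pair; both simplifications are assumptions made to keep the example short.

```python
from itertools import combinations


def sum_of_pairs(rows, match=1, mismatch=0, gap=-1):
    """Sum, over all columns, of the scores of all pairs of rows."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*rows):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue             # two gaps facing each other are ignored
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


if __name__ == "__main__":
    print(sum_of_pairs(["AC-", "ACD", "A-D"]))
```

A real implementation would look up each residue pair in a matrix such as Gonnet or Blosum and apply affine gap penalties instead of the flat values used here.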
Evaluation of Objective Functions using BAliBase
Multiple sequence alignment editors

No automatic method is 100% reliable.
Manual verification and refinement is essential!
SeqLab GCG Wisconsin Package
SeaView (Galtier et al, 1996) http://pbil.univ-lyon1.fr/software/seaview.html
WEB servers :
GeneAlign (Kurukawa) http://www.gen-info.osaka-u.ac.jp/geneweb2/genealign/
Jalview (Clamp, 1998) http://www.ebi.ac.uk/~michele/jalview/
CINEMA (Lord et al, 2002) http://www.bioinf.man.ac.uk/dbbrowser/cinema-mx
FASTA format
>O88763 Phosphatidylinositol 3-kinase.
------MGEAEKFHYIYSCDLDINVQLKIGSLEGKREQKSYKAVLEDPMLKFSGLYQETC
SDLYVTCQVFAEGKPLALPVRTSYKPFSTRWN-WNEWLKLPVKYPDLPRNAQVALTIWD-
-----VYGPG-RAVPVGGTTVSLFGKYGMFRQGMHDLKVWPNVEADGSEPTRTPGRTSST
LSEDQMSRLAKLTKAHRQGHMVKVLDRLTFREIEMINESEKRSS--NFMYLMVEFRCVKC
DDKE-YGIVYYE----
>Q9W1M7 CG5373-PA (GH13170p).
-----MDQPDDHFRYIHSSSLHERVQIKVGTLEGKKRQPDYEKLLEDPILRFSGLYSEEH
PSFQVRLQVFNQGRPYCLPVTSSYKAFGKRWS-WNEWVTLPLQFSDLPRSAMLVLTILD-
-----CSGAG-QTTVIGGTSISMFGKDGMFRQGMYDLRVWLGVEGDGNFPSRTPGK-GKE
SSKSQMQRLGKLAKKHRNGQVQKVLDRLTFREIEVINEREKRMS--DYMFLMIEFPAIVV
DDMYNYAVVYFE----
>Q7PMF0 ENSANGP00000002906 (Fragment).
------------LRYIGSSSLLQKISIKIGTLEGENVGYSYEKLIEQPLLKFSGMYTEKT
PPLKVKLQIFDNGEPVGLPVCTSHKHFTTRWS-WNEWVTLPLRFTDISRTAVLGLTIYD-
-----CAGGREQLTVVGGTSISFFSTNGLFRQGLYDLKVWPQMEPDGACNSITPGK-AIT
TGVHQMQRLSKLAKKHRNGQMEKILDRLTFRELEVINEMEKRNS--QFLYLMVEFPQVYI
HEKL-YSVIHLE----
>Q9TXI7 Related to yeast vacuolar protein sorting factor protein 34
MIPGMRATPTESFSFVYSCDLQTNVQVKVAEFEG-----IFRDVLN-PVRRLNQLFAEIT
VYCNNQQIGYPVCTSFHTPPDSSQLARQKLIQKWNEWLTLPIRYSDLSRDAFLHITIWEH
EDDEIVNNSTFSRRLVAQSKLSMFSKRGILKSGVIDVQMNVSTTPDPFVKQPETWKYSDA
WG-DEIDLLFKQVTRQSRGLVEDVLDPFASRRIEMIRAKYKYSSPDRHVFLVLEMAAIRL
GPTF-YKVVYYEDETK
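An aligned FASTA file like the one above can be read with a few lines of code. The parser below is a minimal sketch written for this example rather than a replacement for an established library: it takes the first word of each header as the identifier, which is an assumption that suits accession-style headers.

```python
def parse_fasta(text):
    """Return {identifier: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""   # identifier = first word of header
            records[name] = []
        elif name is not None:
            records[name].append(line)           # sequence may span several lines
    return {key: "".join(parts) for key, parts in records.items()}


def is_alignment(records):
    """True when every sequence (gaps included) has the same length."""
    return len({len(seq) for seq in records.values()}) <= 1


if __name__ == "__main__":
    example = """>seq1 first protein
MGEA-KFHY
IYSC
>seq2 second protein
MDQPDDHFR
YIHS
"""
    parsed = parse_fasta(example)
    print(parsed)
    print(is_alignment(parsed))
```

Checking that all rows have the same length is a quick way to tell an alignment from a set of unaligned sequences before passing the file to another tool.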
MSF format
toto.msf  MSF: 256  Type: P  May 24, 2005 19:34  Check: 3415 ..

 Name: O88763  Len: 256  Check: 9443  Weight: 1.00
 Name: Q9W1M7  Len: 256  Check: 1161  Weight: 1.00
 Name: Q7PMF0  Len: 256  Check: 8095  Weight: 1.00
 Name: Q9TXI7  Len: 256  Check: 4716  Weight: 1.00

//

        1                                                   50
O88763  ......MGEA EKFHYIYSCD LDINVQLKIG SLEGKREQKS YKAVLEDPML
Q9W1M7  .....MDQPD DHFRYIHSSS LHERVQIKVG TLEGKKRQPD YEKLLEDPIL
Q7PMF0  .......... ..LRYIGSSS LLQKISIKIG TLEGENVGYS YEKLIEQPLL
Q9TXI7  MIPGMRATPT ESFSFVYSCD LQTNVQVKVA EFEG.....I FRDVLN.PVR

        51                                                 100
O88763  KFSGLYQETC SDLYVTCQVF AEGKPLALPV RTSYKPFSTR WN.WNEWLKL
Q9W1M7  RFSGLYSEEH PSFQVRLQVF NQGRPYCLPV TSSYKAFGKR WS.WNEWVTL
Q7PMF0  KFSGMYTEKT PPLKVKLQIF DNGEPVGLPV CTSHKHFTTR WS.WNEWVTL
Q9TXI7  RLNQLFAEIT VYCNNQQIGY PVCTSFHTPP DSSQLARQKL IQKWNEWLTL

        101                                                150
O88763  PVKYPDLPRN AQVALTIWD. .....VYGPG .RAVPVGGTT VSLFGKYGMF
Q9W1M7  PLQFSDLPRS AMLVLTILD. .....CSGAG .QTTVIGGTS ISMFGKDGMF
Q7PMF0  PLRFTDISRT AVLGLTIYD. .....CAGGR EQLTVVGGTS ISFFSTNGLF
Q9TXI7  PIRYSDLSRD AFLHITIWEH EDDEIVNNST FSRRLVAQSK LSMFSKRGIL

        151                                                200
O88763  RQGMHDLKVW PNVEADGSEP TRTPGRTSST LSEDQMSRLA KLTKAHRQGH
Q9W1M7  RQGMYDLRVW LGVEGDGNFP SRTPGK.GKE SSKSQMQRLG KLAKKHRNGQ
Q7PMF0  RQGLYDLKVW PQMEPDGACN SITPGK.AIT TGVHQMQRLS KLAKKHRNGQ
Q9TXI7  KSGVIDVQMN VSTTPDPFVK QPETWKYSDA WG.DEIDLLF KQVTRQSRGL

        201                                                250
O88763  MVKVLDRLTF REIEMINESE KRSS..NFMY LMVEFRCVKC DDKE.YGIVY
Q9W1M7  VQKVLDRLTF REIEVINERE KRMS..DYMF LMIEFPAIVV DDMYNYAVVY
Q7PMF0  MEKILDRLTF RELEVINEME KRNS..QFLY LMVEFPQVYI HEKL.YSVIH
Q9TXI7  VEDVLDPFAS RRIEMIRAKY KYSSPDRHVF LVLEMAAIRL GPTF.YKVVY

        251
O88763  YE....
Q9W1M7  FE....
Q7PMF0  LE....
Q9TXI7  YEDETK
Multiple Sequence File
With an editor
PipeAlign : protein family analysis tool

http://bips.u-strasbg.fr/PipeAlign/
Plewniak et al, 2003
PipeAlign
INPUT: single sequence OR set of unaligned sequences OR multiple alignment
1. BlastP search (input: single sequence) -> list of homologs
2. Identify motifs -> conservation profile
3. Build multiple alignment -> MACS of user-specified homologs
4. Refine alignment (correct alignment errors) -> refined MACS
5. Remove unrelated seq. -> MACS of validated homologs
6. Validate alignment -> validated MACS
7. Cluster sequences -> integrated family and sub-family analysis

Identification of key residues, domain organisation, mean predictions of cellular
location, transmembrane regions, 2D/3D structures, phylogeny studies, etc.
MSA Main
Applications
MSA : central role in biology

Comparative genomics
Phylogenetic studies
Hierarchical function annotation:

homologs, domains, motifs
Gene identification, validation
Structure comparison, modelling
MACS
Interaction networks
RNA sequence, structure, function
Human genetics, SNPs
Therapeutics, drug design

insertion domain
DBD
Therapeutics, drug discovery

LBD
binding sites / mutations
MACS : new landscape

High volume & heterogeneity of sequence data
Length: from tens of amino acids or nucleotides to thousands or millions (genomes)
Number: from tens up to thousands of sequences
Variability: from small percent identity to almost identical
Complexity: of the sequences to be aligned
- Family with linear or highly irregular repartition of sequence variability
- Heterogeneity of length, structure or composition (large insertions or
extensions, repeats, circular permutations, transmembrane regions)
Fidelity: from 15-30% errors (sequence, eucaryotic gene prediction, annotation)
MACS : new concepts

Distinct objectives imply distinct needs & strategies
Overview of one sequence family to quickly infer and integrate information from a limited
number of closely related, well annotated sequences (reliable and efficient)
Exhaustive analysis of one sequence family for (very high quality)
- homology modeling
- phylogenetic studies
- subfamily-specific features (differentially conserved domains, regions or residues)
Massive analysis of sets of sequences (reliable/high quality and efficient)
- phylogenetic distribution, co-presence and co-absence and structural complex
- genome annotation
- target characterisation for functional genomics studies (transcriptomics)
Residue conservation identification
residues conserved in all sequences in family

structural or functional importance: characteristic motifs
residues conserved within a sub-group of sequences
discriminant residues
Ordered Alignment analysis of TyrRS

[Figure: ordered alignment of TyrRS sequences grouped as Euc, Arc and Bac, showing Motif I, Motif II, a "10 aa" marker, the N-terminal extension, the EMAP domain, the S4 domain and the C-terminal extension]
Multiple alignments = basis for calculation of the levels of similarity between sequences
Multiple alignments = basis for calculation of evolutionary distances between sequences
Multiple alignments = basis for the computation of phylogenetic trees
Creating a high quality phylogenetic tree requires
working with high quality multiple sequence alignments
[Figure: two phylogenetic trees of the same species set, one computed from the whole alignment and one after N-terminus / global gap removal; the species group into Bacteria (+ mitochondria), Archaea and Eukarya; scale bar 0.1]
Schematic alignment of Aspartyl tRNA synthetases
[Figure: schematic alignment of Euc, Arc and Eub sequences (alignment positions 180 to 930) showing the anticodon binding domain, Motif I, the flipping loop, Motif II, catalytic core I, the insertion domain, Motif III and catalytic core II]
Protein sequence validation

Sequencing / frameshift error detection
Estimation: 44% of predicted proteins from genome sequencing projects and
31% of high throughput cDNA (HTC) contain errors in their intron/exon structure.
Bianchetti et al, 2005
Example: transcription TFIIH complex protein
Clustered MACS : Starter

Multiple alignment of complete sequences
Determination of sequence groups
Hierarchical clustering of positions

based on insertion/deletion
Definition of blocks
N-terminal region analysis :
Reference position
Proposed N-terminus : potential start codon
closest to the reference position
MXXXXXXXXXXXXXXX
MXXXXXXXXXXXXXXXXX
MXXXXXXMXXXMXXXXXXXXXXXXXXXXXX
MXXXXXXXXXXXXXXXXXXXXXX
MXXXXXXXXXXXXXXXXXXXXX
extension
Reference position
3000 proteins from B. subtilis with wrong, randomly generated N-termini : 82% predicted
For the 3828 proteins from the Vibrio cholerae proteome :
817 specific / 1722 valid start codons / 236 wrong (from 1 up to 56 aa)
Clustered MACS : vAlid

http://igbmc.u-strasbg.fr/vALId/
Bianchetti et al. (2005) JBCB
Clustered MACS : DbW

[Figure: DbW workflow. User sequence -> daily Blastp against the databases (proteins, structures; DBWatcher [Plewniak, IGBMC], automatic daily update) -> filter -> clustering and characterization of the specificity of the homologous sequences -> integration of the sub-family members]
Automatic update of more than 300 different protein families
=> 24 AaRS (aminoacyl-tRNA synthetases), nuclear receptors,
ribosomal proteins, transcription factors
Prigent et al. (2005) Bioinformatics
Clustered MACS : GOAnno

GOAnno : automatically find a pertinent level and propagate Gene Ontology terms to an
unannotated target protein according to clustered MACS
Chalmel et al. (2005) Bioinformatics
[Figure: Gene Ontology hierarchy for the subfamily of the query, from level 0 (Gene_Ontology, biological_process) to level 6 (regulation of transcription), with annotation counts at each level; intermediate terms shown: physiological processes, cellular process, metabolism, cell communication, signal transduction, transcription, "nucleobase, nucleoside, nucleotide and nucleic acid metabolism"; thresholds minV(Verti) = MaxBranch * P and minV(Horiz) = 21 * F]
989 target proteins from retinal transcriptome analysis :
795 proteins with GO terms (increase of 47 %)
3085 GO terms (increase of 92 %)
Protein 3D structure prediction

Proteins with similar sequences tend to fold into similar structures
Above 50% identity, a pairwise alignment is enough for an accurate model
Below 50% identity, a multiple alignment is better
Basic steps for comparative (homology) modelling :

1. Identify a template structure
2. Align the target sequence to the template sequence
3. Copy the backbone coordinates from the template to the matching residues in the
target sequence
4. Build the side-chains (copied for identical residues, predicted for non-identical)
5. Model the loop regions
6. Optimise (energy refinement)
Applicable to ~60% of proteins from fully sequenced genomes
Protein functional characterisation

By homology : similar sequences generally share similar structures
and often have similar functions
Propagation of information from a known sequence to an unknown one

e.g. domains, active sites, cellular localisation, post-translational modifications, ...
1. Database search for homologues e.g. BlastP, PSI-Blast
2. Domain databases : e.g. Interpro (EBI), CDD (NCBI)
3. Multiple alignment construction and analysis e.g. PipeAlign
MSA applications : Summary

[Figure: schematic multiple alignment of two families (1st family, 2nd family; taxa labelled Bacteria, Archaea and Eucarya), annotated with the features it reveals:]
- error in ORF definition
- transmembrane region
- additional domain
- phosphorylation site
- NLS
- universal conservation
- intra-group conservation
- differential conservation between the two families
Information obtained : domain organization, structural motifs, key functional residues, ORF definition,
localization signals, conservation pattern, ...
Applications : functional genomics, evolutionary studies, structure modeling, mutagenesis experiments, drug design
Lecompte et al Gene. 2001
MACSIMS
MAO: Multiple Alignment Ontology
http://www-igbmc.u-strasbg.fr/BioInfo/MAO/mao.html
MAO consortium:
- RNA analysis (Steve HOLBROOK, Berkeley)
- MACS algorithm (Kazutake KATOH, Kyoto)
- Protein 3D analysis (Patrice KOEHL, Davis)
- Protein 3D structure (Dino MORAS, Strasbourg)
- 3D RNA structure (Eric WESTHOF, Strasbourg)
Thompson et al. (2005) Nucleic Acids Res.
Also available from OBO web site: http://obo.sourceforge.net
MACSIMS
Multiple Alignment of Complete Sequences Information Management System
Thompson et al BMC Bioinformatics 2006
Structural and functional information is mined automatically from the public databases
Homologous regions are identified in the MACS
Mined data is evaluated and cross-validated
Mined data is propagated from known to unknown sequences with the homologous regions
MACSIMS provides a unique environment that facilitates knowledge extraction and
the presentation of the most pertinent information to the biologist
MACSIMS
http://bips.u-strasbg.fr/MACSIMS/
MACSIMS visualisation
JalView II, Coll. G. Barton
MACSIMS
BAliBASE reference 3: aldehyde dehydrogenase-like
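[Figure: alignment extract with conserved columns marked (*), showing the motif GSVPTG and its variants (GSTKVG, GETRTG, GSTEVG, GSVSAG, GSRDVG, GSTNVF, GSTAVF) together with conserved E and C residues, and the Uniprot annotations: NAD binding, active site]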
Summary
Choice of multiple alignment method
traditional progressive method (e.g. clustalw / clustalx)
combined local and global method (e.g. mafft, muscle, dbclustal)
knowledge-based method (e.g. PipeAlign)
Web Server versus Local Installation ?
WARNING: Automatic alignment methods can make mistakes.
Verify alignment quality by automatic methods (e.g. norMD) and visual inspection !
Multiple alignment applications

Traditional applications:
phylogeny
conserved residue / motif identification
Information in multiple alignments also improves accuracy in:
sequence error detection
structure prediction
functional annotation
Laboratory of Integrative Genomics and Bioinformatics

IGBMC, Strasbourg
alternative algorithms
Iterative Refinement
PRRP (Gotoh, 1993) refines an initial progressive multiple alignment
by iteratively dividing the alignment into 2 profiles and realigning
them.
initial alignment
- divide sequences into 2 groups (profile 1, profile 2)
- pairwise profile alignment
- refined alignment
- converged? (if no, divide again)
Genetic Algorithms
SAGA (Notredame, Higgins, 1996) evolves a population of alignments in a quasi evolutionary
manner, iteratively improving the fitness of the population
population n
- select a number of individuals to be parents
- modify the parents by shuffling gaps, merging 2 alignments etc.
population n+1
- evaluation of the fitness using the objective function (OF: sum of pairs or COFFEE)
END
HMM
Probabilistic model for sequence profiles, visualized as a finite state
machine
For each column of the alignment a match state models the distribution
of residues allowed
Insert and delete states at each column allow for insertion or deletion of
one or more residues
Original profile HMM (Krogh et al, 1994)
[Figure: profile HMM with match states, insert states and delete, begin and end states, built from the example alignment below]
AKY-L-D
--WVLED
Multiple Alignment using HMM
- generate initial alignment
- produce a model (Baum-Welch expectation maximization)
- generate new alignment (Viterbi algorithm or posterior decoding)
- evaluate alignment (expectation maximization)
END
Programs : HMMER (Eddy, unpublished), SAMT98 (Hughey, 1996)
Segment to segment Alignment
Dialign (Morgenstern et al. 1996) compares segments of sequences
instead of single residues
1. construct dot plots of all possible pairs of sequences (Sequence i against Sequence j)
2. find a maximal set of consistent diagonals in all the sequences
.......aeyVRALFDFngndeedlpfkKGDILRIrdkpeeq...............WWNAedsegkr.GMIPVPYVek..........
........nlFVALYDFvasgdntlsitKGEKLRVlgynhnge..............WCEAqtkngq..GWVPSNYItpvns.......
ieqvpqqptyVQALFDFdpqedgelgfrRGDFIHVmdnsdpn...............WWKGachgqt..GMFPRNYVtpvnrnv.....
gsmstselkkVVALYDYmpmnandlqlrKGDEYFIleesnlp...............WWRArdkngqe.GYIPSNYVteaeds......
.....tagkiFRAMYDYmaadadevsfkDGDAIINvqaideg...............WMYGtvqrtgrtGMLPANYVeai.........
..gsptfkcaVKALFDYkaqredeltfiKSAIIQNvekqegg...............WWRGdyggkkq.LWFPSNYVeemvnpegihrd
.......gyqYRALYDYkkereedidlhLGDILTVnkgslvalgfsdgqearpeeigWLNGynettgerGDFPGTYVeyigrkkisp..
Local alignment : residues between the diagonals are not aligned
