
Multiple Sequence Alignment (MSA)

Plan
Introduction to sequence alignments
Multiple alignment construction
Traditional approaches
Alignment parameters
Alternative approaches

Multiple alignment main applications


MACSIMS : Multiple Alignment of Complete Sequences
Information Management System

Ecole Phylogénomique, Carry le Rouet 2006

Local alignment / Global alignment

Sequence A
Sequence B

Global alignment

Alignment of the sequences over their whole length
GGCTGACCACCTT
|||||||
GATCACTTCCATG

Optimal global pairwise alignment: Needleman and Wunsch, 1970

Local alignment

Alignment of the high-similarity regions
GACCACCTT
|||||||
GATCACTT

Optimal local pairwise alignment: Smith and Waterman, 1981
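
The difference between the two optimal pairwise methods can be sketched in a few lines. This is a minimal illustration, not the original programs: it computes only the optimal score (no traceback), and the scoring values (match = 1, mismatch = -1, gap = -1 per position) are assumptions chosen for readability.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score aligning a and b over their whole length."""
    # prev[j] holds the best score for a[:i-1] against b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over all pairs of sub-sequences of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0: a local alignment can restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


seq_a, seq_b = "GGCTGACCACCTT", "GATCACTTCCATG"
print("global:", global_score(seq_a, seq_b))
print("local: ", local_score(seq_a, seq_b))
```

The only differences are the initialisation (gap penalties along the borders versus zeros) and the floor at 0, which is what lets the local method ignore poorly matching ends.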


Pairwise alignment / Multiple alignment


Query: 177 EMGDTGPCGPCSEIHYDRIGGRDAAHLVNQDDPNVLEIWNLVFIQYNR---EADG----I 229
             G  G  GP  E+ Y                   LE+  LVF+QY +    AD     I
Sbjct: 193 AGG--GNAGPAFEVLYKG-----------------LEVATLVFMQYKKAPANADPSQVVI 233

Query: 230 LK-----PLPKKSIDTGMGLERLVSVLQNKMSNYDTDLFVPYFEAIQKGTGARPYTGKVG 284
           +K     P+  K +DTG GLERLV + Q   + YD  L     E +++  G      ++
Sbjct: 234 IKGEKYVPMETKVVDTGYGLERLVWMSQGTPTAYDAVLGY-VIEPLKRMAGVEKIDERIL 292

Query: 285 AEDA---------DGIDMAYR--------------------------VLADHARTITVAL 309
            E++         D  D+ Y                            +ADH + +T  L
Sbjct: 293 MENSRLAGMFDIEDMGDLRYLREQVAKRVGISVEELERLIRPYELIYAIADHTKALTFML 352


What is a multiple alignment?


A representation of a set of sequences, in which equivalent residues (e.g. functional
or structural) are aligned in columns
[Figure: example multiple alignment annotated with conserved residues, conservation profile and secondary structure]

MACS
Schematic overview of complete alignment
e.g. domain organisation (Interpro)

Key: SH3, PI-PLC-X, CH, SH2, PI-PLC-Y, rhoGEF, PH, C2, DAG_PE-bind

Why multiple alignments?


Integration of a sequence in the context of the protein family
Applications :
phylogeny
domain organisation
functional residue identification
2D/3D structure prediction
transmembrane prediction


MSA Construction


Multiple alignment construction


Traditional approaches
Optimal multiple alignment
Progressive multiple alignment

Alignment parameters
Residue similarity matrices
Gap penalties

Alternative approaches
Iterative alignment methods
Combinatorial algorithms
PipeAlign : a protein family analysis tool


Traditional Approaches

Optimal multiple alignment


Direct extension of pairwise dynamic programming to N dimensions
(Sankoff, 1975).
Examines all possible alignments to find the optimal alignment
Example: alignment of 3 sequences

Problems
The mathematically optimal alignment is not necessarily the biologically optimal alignment
CPU time and memory requirements are prohibitive for practical purposes (the required time is
proportional to N^k for k sequences of length N): limited to <10 sequences
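
To give a feeling for the N^k growth, the short sketch below counts the cells of the dynamic programming lattice that an optimal method would have to fill. It is only an order-of-magnitude illustration: the function name is hypothetical, and the real cost per cell also grows with k.

```python
def dp_cells(seq_len, n_seqs):
    """Number of cells in the n_seqs-dimensional lattice for sequences of length seq_len."""
    # One axis per sequence, each with seq_len + 1 positions (including the empty prefix).
    return (seq_len + 1) ** n_seqs


for k in (2, 3, 5, 10):
    print(f"{k:2d} sequences of length 100: {dp_cells(100, k):.3e} cells")
```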


Progressive multiple alignment


Heuristic algorithm which avoids calculating all possible
alignments, but does not guarantee an optimal alignment

Principle:
Progressively align the sequences (or groups of sequences) in pairs
Problem:
Which sequences should be aligned first? In which order?
=> align the closest sequences first

How to estimate the distance between the sequences?
1. align all pairs of sequences
2. calculate a distance matrix from the pairwise alignments
3. construct a guide tree from this distance matrix
4. progressive multiple alignment following the branching order in the tree


Progressive multiple alignment


Example : Alignment of 7 globins (Hbb_human, Hbb_horse, Hba_human,
Hba_horse, Myg_phyca, Glb5_petma and Lgb2_lupla)
Step 1: Pairwise alignment of all sequences
Ex: pairwise alignments of globin sequences (1 = Hbb_human, 2 = Hbb_horse, 3 = Hba_human)

1 VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST ...
| |. |||.|| ||| ||| :|||||||||||||||||||||:||||||
2 VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSN ...
1 LTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDLST ...
|.| :|. | | |||| . | | ||| |: . :| |. :| | |||
3 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLS. ...
3 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLSH ...
|| :| | | | ||
| | ||| |: . :| |. :| | |||.
2 LSGEEKAAVLALWDKVNEE..EVGGEALGRLLVVYPWTQRFFDSFGDLSN ...

The alignment can be obtained with:
- global or local method
- dynamic programming or heuristic methods
Example: in Clustalx
=> global alignments
=> choice between
- heuristic method (used in Fasta program) => faster
- dynamic programming (Smith & Waterman) => better


Progressive multiple alignment


Step 2 : Distance matrix construction

In Clustalx:
distance between 2 sequences = 1 - (number of identical residues / number of compared residues)
Ex: Hbb_human vs Hbb_horse = 83% identity = 17% distance
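
The distance formula above can be sketched as a small function over two rows of a pairwise alignment. This is an illustration, not ClustalX code: the function name is hypothetical, and counting only columns where neither sequence has a gap as "compared residues" is an assumption.

```python
def identity_distance(row_a, row_b, gap_chars="-."):
    """Return 1 - identical / compared for two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = identical = 0
    for x, y in zip(row_a, row_b):
        if x in gap_chars or y in gap_chars:
            continue  # assumption: columns containing a gap are not compared
        compared += 1
        identical += x == y
    if compared == 0:
        return 1.0  # nothing comparable: treat as maximally distant
    return 1.0 - identical / compared


# Toy example: 3 compared columns, 2 identical.
print(identity_distance("AC-E", "ACDF"))
```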

Distance matrix (lower triangle):

                 1     2     3     4     5     6
1 Hbb_human      -
2 Hbb_horse     .17    -
3 Hba_human     .59   .60    -
4 Hba_horse     .59   .59   .13    -
5 Myg_phyca     .77   .77   .75   .75    -
6 Glb5_petma    .81   .82   .73   .74   .80    -
7 Lgb2_lupla    .87   .86   .86   .88   .93   .90

Progressive multiple alignment


Step 3 : Sequential branching / Guide tree construction
[Figure: sequential branching order and guide tree for the 7 globins: Hbb_human, Hbb_horse, Hba_human, Hba_horse, Myg_phyca, Glb5_petma, Lgb2_lupla]

- Join the 2 closest sequences
- Recalculate distances and join the 2 closest sequences or nodes
- This step is repeated until all sequences are joined
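
The join-and-recalculate loop can be sketched with the 7-globin distance matrix from Step 2. This is an illustrative sketch only: it recalculates distances by size-weighted averaging (UPGMA), which is an assumption made for simplicity and not necessarily what a given program does, and the function and variable names are hypothetical.

```python
from itertools import combinations

NAMES = ["Hbb_human", "Hbb_horse", "Hba_human", "Hba_horse",
         "Myg_phyca", "Glb5_petma", "Lgb2_lupla"]
# Lower triangle of the Step 2 distance matrix, one row per sequence 2..7.
LOWER = [
    [.17],
    [.59, .60],
    [.59, .59, .13],
    [.77, .77, .75, .75],
    [.81, .82, .73, .74, .80],
    [.87, .86, .86, .88, .93, .90],
]


def upgma(names, lower):
    """Return a guide tree as nested tuples, always joining the closest pair first."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (tree, size)
    dist = {frozenset((i, j)): lower[i - 1][j]
            for i in range(1, len(names)) for j in range(i)}
    next_id = len(names)
    while len(clusters) > 1:
        # Join the 2 closest sequences or nodes.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: dist[frozenset(pair)])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        # Recalculate distances from the new node to every remaining cluster.
        for c in clusters:
            d_ac, d_bc = dist[frozenset((a, c))], dist[frozenset((b, c))]
            dist[frozenset((next_id, c))] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


print(upgma(NAMES, LOWER))
```

With these distances the two alpha-globins are joined first (.13), then the two beta-globins (.17), then the two pairs together, followed by Myg_phyca, Glb5_petma and Lgb2_lupla in turn.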


Progressive multiple alignment


Step 4 : Progressive alignment

The progressive multiple alignment follows the branching order in tree


[Figure: the sequences (Hbb_human, Hbb_horse, Hba_human, Hba_horse, Myg_phyca, Glb5_petma, Lgb2_lupla) and groups of sequences are aligned step by step along the guide tree]


Progressive multiple alignment

[Figure: multiple alignment of the 7 globins, annotated H1 to H7]

Progressive multiple alignment methods


[Figure: classification of progressive methods as global or local and by guide-tree construction method. Programs shown: SBpima, multal, clustalx, multalign, pileup, MLpima]

Key:
SB - Sequential Branching
UPGMA - Unweighted Pair Grouping Method
ML - Maximum Likelihood
NJ - Neighbor-Joining

Alignment Parameters

Residue similarity matrices


Dynamic programming methods score an alignment using residue similarity
matrices, containing a score for matching all pairs of residues
For proteins, a wide variety of matrices exist: Identity, PAM, Blosum,
Gonnet etc.

[Figure: PAM 250 matrix]
Matrices are generally constructed by observing the mutations in large sets
of alignments, either sequence-based or structure-based
Matrices range from strict ones for comparing closely related sequences to
soft ones for very divergent sequences.

A single best matrix does not exist!!


ClustalW automatically selects a suitable matrix depending on the observed
pairwise % identity.


Gap penalties
A gap penalty is a cost for introducing gaps into the alignment,
corresponding to insertions or deletions in the sequences

SFGDLSNPGAVMG
HF-DLS-----HG

Fixed penalty:
P = a L, with L the length of the gap (a: cost per gap position)

Linear (or affine) penalty:
P = x + y L
x: gap opening penalty (gop)
y: gap extension penalty (gep)

Position-specific and residue-specific penalties:
ex: in ClustalW, gap penalties are:
- lowered at existing gaps
- increased close to (less than 8 residues from) existing gaps
- lowered in hydrophilic stretches (loops)
- otherwise, gap opening penalties are modified according to the residues present, using their
observed relative frequencies adjacent to gaps (Pascarella & Argos, 1992)

The goal is to introduce gaps in sequence segments
corresponding to flexible regions of the protein structure
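
As a worked example of P = x + yL, the sketch below sums the penalty over every gap in one aligned row, using the second row of the example alignment on this slide. The values x = 10 and y = 1 are illustrative assumptions, not the defaults of any program, and the position-specific adjustments listed above are not modelled.

```python
import re


def affine_gap_penalty(length, gop, gep):
    """P = x + y*L for one gap of length L (x = gop, y = gep)."""
    return gop + gep * length


def total_gap_penalty(aligned_row, gop, gep, gap_char="-"):
    """Sum the affine penalty over every run of gap characters in one aligned row."""
    runs = re.findall(re.escape(gap_char) + "+", aligned_row)
    return sum(affine_gap_penalty(len(run), gop, gep) for run in runs)


# Two gaps, of length 1 and 5: (10 + 1*1) + (10 + 1*5) = 26
print(total_gap_penalty("HF-DLS-----HG", gop=10, gep=1))
```

With an opening penalty much larger than the extension penalty, one long gap costs less than several short ones, which favours a few contiguous insertions or deletions.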

Alternative Approaches

Iterative alignment methods


Iterative alignment e.g. PRRP (Gotoh, 1993)
- refines an initial progressive multiple alignment by iteratively dividing the alignment
into 2 profiles and realigning them

Genetic algorithms e.g. SAGA (Notredame et al, 1996)
- iteratively refine an alignment using genetic algorithms (evolves a population of
alignments in a quasi-evolutionary manner)

Segment-to-segment alignment: DIALIGN (Morgenstern et al. 1999)
- searches for locally conserved motifs in all sequences and compares segments of
sequences instead of single residues

Hidden Markov Models e.g. HMMER (Eddy, 1998), SAM (Karplus et al, 2001)
- iteratively refine an alignment using HMMs

Multiple alignment methods


[Figure: classification of multiple alignment methods. Progressive (global or local; SB, UPGMA, ML, NJ guide trees): SBpima, multal, MLpima, multalign, pileup, clustalx. Iterative: prrp, dialign, saga (genetic algorithm), hmmt (HMM)]

BAliBASE: objective evaluation of MACS programs


High-quality alignments based on 3D structural superpositions and manually verified
Alignments compared only in reliable core blocks, excluding non-superposable regions
Separate reference sets specifically designed to address distinct alignment problems
BAliBASE 1: Thompson et al. 1999 Bioinformatics
BAliBASE 2: Bahr et al, 2001 Nucl Acids Res.

reference set   description
1               small number of sequences: divergence, length
2               a family with one to 3 orphans
3               several sub-families
4               long N/C terminal extensions
5               long insertions
6               repeats
7               transmembrane regions
8               circular permutations

Comparison of multiple alignment methods


=> Reference alignments are needed to evaluate the
alignment programs
BAliBASE (Thompson et al. Bioinformatics. 1999) benchmark database
Alignments based on 3D structure superposition
Alignments must be compared only in the superposable regions
Alignments take into account:
- the effect of the number of sequences
- the effect of the sequence length
- the effect of the sequence similarity
- alignment of an orphan sequence with a sequence family
- sub-family alignments
- alignments of sequences with different lengths (insertions, extensions)

Comparison of multiple alignment methods


> 35% Id: any method

Local / global methods
Colinear sequences => global methods
N/C-ter extensions or insertions => local methods

Progressive / iterative methods
Iterative algorithms usually improve alignment quality
Problems:
- Can give a bad alignment in the case of orphan sequences
- The iterative process can be very slow!
Example: alignment of 89 histone sequences (66-92 residues):
ClustalW   2 mins 41 secs
PRRP       3 hours 40 mins
Dialign    3 hours 48 mins

To increase the alignment quality, as many sequences as possible have to be integrated!

DbClustal: local and global algorithm coupling

[Figure: DbClustal workflow, with steps labelled: Blast database search with the query sequence, database hits, Ballast anchors (domains A, B, C), DbClustal alignment]

ClustalW / DbClustal comparison
[Figure: ClustalW alignment and DbClustal alignment]

Combinatorial algorithms

T-Coffee (Notredame et al. 2000) http://igs-server.cnrs-mrs.fr/Tcoffee/
performs local and global alignments for all pairs of sequences, then combines them in a progressive
multiple alignment, similar to ClustalW.

DbClustal (Thompson et al. 2000) http://bips.u-strasbg.fr/PipeAlign/jump_to.cgi?DbClustal+noid
designed to align the sequences detected by a database search. Locally conserved motifs are detected
using the Ballast program (Plewniak et al. 1999) and are used in the global multiple alignment as
anchor points.

MAFFT (Katoh et al. 2002) http://timpani.genome.ad.jp/%7Emafft/server
detects locally conserved segments using a Fast Fourier Transform, then uses a restricted global DP
and a progressive algorithm

MUSCLE (Edgar, 2004) http://www.drive5.com/muscle
k-mer distances and log-expectation scores, progressive alignment and iterative refinement

PROBCONS (Do et al, 2005) http://probcons.stanford.edu
pairwise consistency based on an objective function
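
The k-mer distances mentioned for MUSCLE avoid aligning every pair of sequences: two unaligned sequences are compared by the short words they share. The sketch below is a simplified illustration of that idea, not MUSCLE's implementation; the function names, the default k = 3 and the normalisation by the shorter sequence are assumptions.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=3):
    """1 - (number of shared k-mers) / (number of k-mers in the shorter sequence)."""
    n_shorter = min(len(a), len(b)) - k + 1
    if n_shorter <= 0:
        raise ValueError("both sequences must contain at least k residues")
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    # A k-mer occurring several times is shared as often as it occurs in both.
    shared = sum(min(n, counts_b[kmer]) for kmer, n in counts_a.items())
    return 1.0 - shared / n_shorter


print(kmer_distance("VHLTPEEKSAVTALWGKV", "VQLSGEEKAAVLALWDKV"))
```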


Multiple Alignment Quality
Truncated Alignments

               Ref1 V1   Ref1 V2    Ref2      Ref3        Ref4         Ref5         Time
               (<20%)    (20-40%)   orphans   subgroups   extensions   insertions   (sec)
ClustalW1.83   0.42      0.78       0.42      0.52        0.41         0.38         902
Dialign2.2.1   0.31      0.71       0.37      0.39        0.45         0.43         5993
Mafft5.32      0.44      0.78       0.49      0.53        0.47         0.48         96
Maffti5.32     0.54      0.83       0.56      0.60        0.49         0.57         327
Muscle3.51     0.52      0.82       0.50      0.58        0.46         0.54         523
Muscle_fast    0.40      0.77       0.43      0.44        0.35         0.49         34
Muscle_med     0.45      0.80       0.50      0.59        0.44         0.51         219
Tcoffee2.66    0.47      0.84       0.50      0.64        0.54         0.58         216133
Probcons1.1    0.63      0.87       0.60      0.65        0.54         0.63         19035

1. Significant improvement in accuracy/efficiency since 2000
2. Twilight zone still exists
3. Probcons scores best in all tests, but is MUCH slower than MAFFT or MUSCLE
4. MAFFTI scores slightly better than MUSCLE in all tests, and is more efficient

muscle_fast: muscle maxiters=1 diags1 sv distance1 kbit20_
muscle_medium: muscle maxiters=2

Multiple Alignment Quality
Comparison: truncated versus full-length (FL) sequences

               Ref1 V1 (<20%)   Ref1 V2 (20-40%)   Ref2: orphans    Ref3: subgroups   Time (sec) for all refs
               Trunc.   FL      Trunc.   FL        Trunc.   FL      Trunc.   FL       Trunc.    FL
ClustalW1.83   0.42     0.24    0.78     0.72      0.42     0.20    0.52     0.27     902       2227
Dialign2.2.1   0.31     0.26    0.71     0.70      0.37     0.29    0.39     0.31     5993      12595
Mafft5.32      0.44     0.25    0.78     0.75      0.49     0.35    0.53     0.38     96        312
Maffti5.32     0.54     0.35    0.83     0.80      0.56     0.40    0.60     0.50     327       1409
Muscle3.51     0.52     0.34    0.82     0.79      0.50     0.36    0.58     0.39     523       3608
Muscle_fast    0.40     0.28    0.77     0.72      0.43     0.29    0.44     0.33     34        132
Muscle_med     0.45     0.29    0.80     0.74      0.50     0.34    0.59     0.38     219       1601
Tcoffee2.66    0.47     0.35    0.84     0.82      0.50     0.40    0.64     0.49     216133    341578
Probcons1.1    0.63     0.43    0.87     0.86      0.60     0.41    0.65     0.54     19035     58488

1. Loss of accuracy is more important in the twilight zone (Ref1 V1, orphans, and subgroups)
2. Probcons still scores best in all tests
3. MAFFT still scores better than MUSCLE in all tests

Multiple alignment quality


Development of objective functions to estimate multiple alignment quality

Sum-of-pairs (Carrillo, Lipman, 1988)
- sums the scores of all pairs of sequences (based on a similarity matrix and gap penalty)

Relative Entropy
- uses a normalized log-likelihood ratio to measure the degree of conservation for each column
(identical residues only)

MD (column scores used in ClustalX)
- uses a comparison matrix (Gonnet) to take into account similar residues

norMD (Thompson et al, 2001)
- scores by column using a substitution matrix and gap penalties
- normalisation according to the sequences to align (their number, length and the similarity between
them)
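
Of the objective functions above, sum-of-pairs is the simplest to write down. The sketch below is a minimal illustration under stated assumptions: the scoring function is passed in by the caller (here a toy identity score rather than a real similarity matrix), and gaps are charged a flat per-position penalty instead of an affine one.

```python
from itertools import combinations


def sum_of_pairs(rows, score, gap_penalty=-1, gap_char="-"):
    """Sum, over every column, the scores of all pairs of aligned residues."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*rows):
        for x, y in combinations(column, 2):
            if x == gap_char and y == gap_char:
                continue  # two gaps facing each other are not scored
            if x == gap_char or y == gap_char:
                total += gap_penalty
            else:
                total += score(x, y)
    return total


def identity_score(x, y):
    """Toy substitute for a similarity matrix: 1 for identical residues, else 0."""
    return 1 if x == y else 0


print(sum_of_pairs(["AC", "AC", "AG"], identity_score))
```

In practice `score` would look up a residue similarity matrix such as those discussed under Alignment Parameters.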


Evaluation of Objective Functions using BAliBase


Multiple sequence alignment editors


No automatic method is 100% reliable.
Manual verification and refinement is essential!
SeqLab GCG Wisconsin Package
SeaView (Galtier et al, 1996) http://pbil.univ-lyon1.fr/software/seaview.html
WEB servers:
GeneAlign (Kurukawa) http://www.gen-info.osaka-u.ac.jp/geneweb2/genealign/
Jalview (Clamp, 1998) http://www.ebi.ac.uk/~michele/jalview/
CINEMA (Lord et al, 2002) http://www.bioinf.man.ac.uk/dbbrowser/cinema-mx

FASTA format
>O88763 Phosphatidylinositol 3-kinase.
------MGEAEKFHYIYSCDLDINVQLKIGSLEGKREQKSYKAVLEDPMLKFSGLYQETC
SDLYVTCQVFAEGKPLALPVRTSYKPFSTRWN-WNEWLKLPVKYPDLPRNAQVALTIWD-
-----VYGPG-RAVPVGGTTVSLFGKYGMFRQGMHDLKVWPNVEADGSEPTRTPGRTSST
LSEDQMSRLAKLTKAHRQGHMVKVLDRLTFREIEMINESEKRSS--NFMYLMVEFRCVKC
DDKE-YGIVYYE----
>Q9W1M7 CG5373-PA (GH13170p).
-----MDQPDDHFRYIHSSSLHERVQIKVGTLEGKKRQPDYEKLLEDPILRFSGLYSEEH
PSFQVRLQVFNQGRPYCLPVTSSYKAFGKRWS-WNEWVTLPLQFSDLPRSAMLVLTILD-
-----CSGAG-QTTVIGGTSISMFGKDGMFRQGMYDLRVWLGVEGDGNFPSRTPGK-GKE
SSKSQMQRLGKLAKKHRNGQVQKVLDRLTFREIEVINEREKRMS--DYMFLMIEFPAIVV
DDMYNYAVVYFE----
>Q7PMF0 ENSANGP00000002906 (Fragment).
------------LRYIGSSSLLQKISIKIGTLEGENVGYSYEKLIEQPLLKFSGMYTEKT
PPLKVKLQIFDNGEPVGLPVCTSHKHFTTRWS-WNEWVTLPLRFTDISRTAVLGLTIYD-
-----CAGGREQLTVVGGTSISFFSTNGLFRQGLYDLKVWPQMEPDGACNSITPGK-AIT
TGVHQMQRLSKLAKKHRNGQMEKILDRLTFRELEVINEMEKRNS--QFLYLMVEFPQVYI
HEKL-YSVIHLE----
>Q9TXI7 Related to yeast vacuolar protein sorting factor protein 34
MIPGMRATPTESFSFVYSCDLQTNVQVKVAEFEG-----IFRDVLN-PVRRLNQLFAEIT
VYCNNQQIGYPVCTSFHTPPDSSQLARQKLIQKWNEWLTLPIRYSDLSRDAFLHITIWEH
EDDEIVNNSTFSRRLVAQSKLSMFSKRGILKSGVIDVQMNVSTTPDPFVKQPETWKYSDA
WG-DEIDLLFKQVTRQSRGLVEDVLDPFASRRIEMIRAKYKYSSPDRHVFLVLEMAAIRL
GPTF-YKVVYYEDETK
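
Reading an aligned FASTA file such as the one above takes only a few lines. This is a minimal sketch with hypothetical function names: a record starts at each ">" line, the first word of that line is taken as the identifier, and the following lines are concatenated with gap characters kept.

```python
def _finish(header, chunks):
    """Split the header into identifier and description, and join the sequence lines."""
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(_finish(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_finish(header, chunks))
    return records


example = ">seqA toy record\n--MGEA\nEKF\n>seqB\nMIPGMR\nATP\n"
for identifier, description, sequence in parse_fasta(example):
    print(identifier, len(sequence), sequence)
```

In an aligned FASTA file every sequence should come out with the same length, which is a quick sanity check before further analysis.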


MSF format (Multiple Sequence File)

toto.msf  MSF: 256  Type: P  May 24, 2005 19:34  Check: 3415 ..

 Name: O88763   Len: 256  Check: 9443  Weight: 1.00
 Name: Q9W1M7   Len: 256  Check: 1161  Weight: 1.00
 Name: Q7PMF0   Len: 256  Check: 8095  Weight: 1.00
 Name: Q9TXI7   Len: 256  Check: 4716  Weight: 1.00

//

          1                                                   50
O88763    ......MGEA EKFHYIYSCD LDINVQLKIG SLEGKREQKS YKAVLEDPML
Q9W1M7    .....MDQPD DHFRYIHSSS LHERVQIKVG TLEGKKRQPD YEKLLEDPIL
Q7PMF0    .......... ..LRYIGSSS LLQKISIKIG TLEGENVGYS YEKLIEQPLL
Q9TXI7    MIPGMRATPT ESFSFVYSCD LQTNVQVKVA EFEG.....I FRDVLN.PVR

          51                                                 100
O88763    KFSGLYQETC SDLYVTCQVF AEGKPLALPV RTSYKPFSTR WN.WNEWLKL
Q9W1M7    RFSGLYSEEH PSFQVRLQVF NQGRPYCLPV TSSYKAFGKR WS.WNEWVTL
Q7PMF0    KFSGMYTEKT PPLKVKLQIF DNGEPVGLPV CTSHKHFTTR WS.WNEWVTL
Q9TXI7    RLNQLFAEIT VYCNNQQIGY PVCTSFHTPP DSSQLARQKL IQKWNEWLTL

          101                                                150
O88763    PVKYPDLPRN AQVALTIWD. .....VYGPG .RAVPVGGTT VSLFGKYGMF
Q9W1M7    PLQFSDLPRS AMLVLTILD. .....CSGAG .QTTVIGGTS ISMFGKDGMF
Q7PMF0    PLRFTDISRT AVLGLTIYD. .....CAGGR EQLTVVGGTS ISFFSTNGLF
Q9TXI7    PIRYSDLSRD AFLHITIWEH EDDEIVNNST FSRRLVAQSK LSMFSKRGIL

          151                                                200
O88763    RQGMHDLKVW PNVEADGSEP TRTPGRTSST LSEDQMSRLA KLTKAHRQGH
Q9W1M7    RQGMYDLRVW LGVEGDGNFP SRTPGK.GKE SSKSQMQRLG KLAKKHRNGQ
Q7PMF0    RQGLYDLKVW PQMEPDGACN SITPGK.AIT TGVHQMQRLS KLAKKHRNGQ
Q9TXI7    KSGVIDVQMN VSTTPDPFVK QPETWKYSDA WG.DEIDLLF KQVTRQSRGL

          201                                                250
O88763    MVKVLDRLTF REIEMINESE KRSS..NFMY LMVEFRCVKC DDKE.YGIVY
Q9W1M7    VQKVLDRLTF REIEVINERE KRMS..DYMF LMIEFPAIVV DDMYNYAVVY
Q7PMF0    MEKILDRLTF RELEVINEME KRNS..QFLY LMVEFPQVYI HEKL.YSVIH
Q9TXI7    VEDVLDPFAS RRIEMIRAKY KYSSPDRHVF LVLEMAAIRL GPTF.YKVVY

          251
O88763    YE....
Q9W1M7    FE....
Q7PMF0    LE....
Q9TXI7    YEDETK

With an editor

PipeAlign : protein family analysis tool


http://bips.u-strasbg.fr/PipeAlign/
Plewniak et al, 2003

PipeAlign
INPUT: single sequence OR set of unaligned sequences
1. BlastP search -> list of homologs
2. Identify motifs -> conservation profile
3. Build multiple alignment -> MACS of user-specified homologs
4. Refine alignment, correct alignment errors -> refined MACS
5. Remove unrelated seq. -> MACS of validated homologs
6. Validate alignment -> validated MACS
7. Cluster sequences -> integrated family and sub-family analysis

Identification of key residues, domain organisation, mean predictions of cellular
location, transmembrane regions, 2D/3D structures, phylogeny studies, etc.

MSA Main Applications

MSA : central role in biology


[Figure: MACS at the centre, linked to the following fields]
Comparative genomics
Phylogenetic studies
Hierarchical function annotation: homologs, domains, motifs
Gene identification, validation
Structure comparison, modelling
Interaction networks
RNA sequence, structure, function
Human genetics, SNPs
Therapeutics, drug design

Therapeutics, drug discovery
[Figure labels: insertion domain, DBD, LBD, binding sites / mutations]

MACS : new landscape


High volume & heterogeneity of sequence data
Length: from tens of amino acids or nucleotides to thousands or millions (genomes)
Number: from tens up to thousands of sequences
Variability: from low percent identity to almost identical
Complexity: of the sequences to be aligned
- Family with linear or highly irregular distribution of sequence variability
- Heterogeneity of length, structure or composition (large insertions or
extensions, repeats, circular permutations, transmembrane regions)
Fidelity: 15-30% errors (sequencing, eukaryotic gene prediction, annotation)

MACS : new concepts


Distinct objectives imply distinct needs & strategies
Overview of one sequence family to quickly infer and integrate information from a limited
number of closely related, well annotated sequences (reliable and efficient)
Exhaustive analysis of one sequence family (very high quality) for:
- homology modeling
- phylogenetic studies
- subfamily-specific features (differentially conserved domains, regions or residues)
Massive analysis of sets of sequences (reliable/high quality and efficient)
- phylogenetic distribution, co-presence and co-absence and structural complex
- genome annotation
- target characterisation for functional genomics studies (transcriptomics)

Residue conservation identification

residues conserved in all sequences in the family
=> structural or functional importance: characteristic motifs
residues conserved within a sub-group of sequences
=> discriminant residues
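
Both kinds of conservation can be read directly from the columns of a multiple alignment. The sketch below is an illustration with hypothetical function names and toy sequences: strict identity is used as the conservation criterion, whereas real analyses usually also accept similar residues.

```python
def conserved_columns(rows, gap_chars="-."):
    """Indices of columns where every sequence carries the same non-gap residue."""
    return [i for i, column in enumerate(zip(*rows))
            if len(set(column)) == 1 and column[0] not in gap_chars]


def discriminant_columns(group_a, group_b):
    """Columns conserved within each sub-group but with a different residue in each."""
    in_a = set(conserved_columns(group_a))
    in_b = set(conserved_columns(group_b))
    return sorted(i for i in in_a & in_b if group_a[0][i] != group_b[0][i])


# Toy alignment rows (hypothetical), split into two sub-groups.
group_a = ["GSTKVG", "GSTEVG"]
group_b = ["GSRDVG", "GSRDVG"]

print("conserved in the whole family:", conserved_columns(group_a + group_b))
print("discriminant between groups:  ", discriminant_columns(group_a, group_b))
```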


Ordered Alignment analysis of TyrRS
[Figure: ordered alignment of TyrRS sequences grouped as Euc, Arc and Bac, showing Motif I, Motif II, the N-terminal extension, the EMAP domain, the S4 domain and the C-terminal extension; scale bar: 10 aa]

Phylogenetic studies
Multiple alignments = basis for calculating the levels of similarity between sequences
Multiple alignments = basis for calculating evolutionary distances between sequences
Multiple alignments = basis for computing phylogenetic trees

Building a high-quality phylogenetic tree requires
a high-quality multiple sequence alignment

Phylogenetic studies
[Figure: phylogenetic tree computed from the whole alignment, with groups labelled Eucarya, Archaea, and Bacteria + Mitochondria]

Phylogenetic studies
[Figure: phylogenetic tree computed after N-terminus / global gap removal, with groups labelled Eukarya, Archaea, Bacteria and Mito.; scale bar: 0.1]

Schematic alignment of Aspartyl-tRNA synthetases
[Figure: schematic alignment of the Euc, Arc and Eub groups over alignment positions 180 to 930, showing the anticodon binding domain, Motif I, the flipping loop, Motif II, catalytic core I, the insertion domain, Motif III and catalytic core II]

Protein sequence validation


Sequencing / frameshift error detection
Estimation: 44% of predicted proteins from genome sequencing projects and
31% of high-throughput cDNA (HTC) contain errors in their intron/exon structure.
Bianchetti et al, 2005

Example: transcription TFIIH complex protein

Clustered MACS : Starter


Multiple alignment of complete sequences
Determination of sequence groups
Hierarchical clustering of positions based on insertion/deletion
Definition of blocks
N-terminal region analysis:
Reference position
Proposed N-terminus: potential start codon closest to the reference position

MXXXXXXXXXXXXXXX
MXXXXXXXXXXXXXXXXX
MXXXXXXMXXXMXXXXXXXXXXXXXXXXXX
MXXXXXXXXXXXXXXXXXXXXXX
MXXXXXXXXXXXXXXXXXXXXX
[Figure labels: extension, reference position]

3000 proteins from B. subtilis with wrong, randomly generated N-termini: 82% predicted
For the 3828 proteins from the Vibrio cholerae proteome:
817 specific / 1722 valid start codons / 236 wrong (from 1 up to 56 aas)

Clustered MACS : vAlid


http://igbmc.u-strasbg.fr/vALId/

Bianchetti et al. (2005) JBCB

Clustered MACS : DbW


[Figure: DbW workflow, with steps labelled: user sequence, daily Blastp, filter, clustering, characterization of the specificity of the homologous sequences, filter, integration of the sub-family members]

Databases: DBWatcher [Plewniak, IGBMC]
- Proteins
- Structures
Automatic daily update

Automatic update of more than 300 different protein families
=> 24 AaRS (aminoacyl-tRNA synthetases), nuclear receptors,
ribosomal proteins, transcription factors
Prigent et al. (2005) Bioinformatics

Clustered MACS : GOAnno


GOAnno: finds a pertinent level automatically and propagates Gene Ontology terms to an
unannotated target protein according to the clustered MACS
Chalmel et al. (2005) Bioinformatics

[Figure: Gene Ontology graph for the subfamily of the query, drawn over levels 0 to 6 from Gene_Ontology and biological_process, with terms including metabolism, physiological processes, cellular process, cell communication, signal transduction, nucleobase, nucleoside, nucleotide and nucleic acid metabolism, transcription and regulation of transcription, and annotation counts at each level; thresholds minV(Verti) = MaxBranch * P and minV(Horiz) = 21 * F]

989 target proteins from retinal transcriptome analysis
795 proteins with GO terms (increase of 47 %)
3085 GO terms (increase of 92 %)

Protein 3D structure prediction


Proteins with similar sequences tend to fold into similar structures
Above 50% identity, a pairwise alignment is enough for an accurate model
Below 50% identity, a multiple alignment is better

Basic steps for comparative (homology) modelling:
1. Identify a template structure
2. Align the target sequence to the template sequence
3. Copy the backbone coordinates from the template to the matching residues in the
target sequence
4. Build the side-chains (copied for identical residues, predicted for non-identical)
5. Model the loop regions
6. Optimise (energy refinement)

Applicable to ~60% of proteins from fully sequenced genomes

Protein functional characterisation


By homology: similar sequences generally share similar structures
and often have similar functions

Propagation of information from a known sequence to an unknown one,
e.g. domains, active sites, cellular localisation, post-translational modifications
1. Database search for homologues e.g. BlastP, PSI-Blast
2. Domain databases: e.g. Interpro (EBI), CDD (NCBI)
3. Multiple alignment construction and analysis e.g. PipeAlign

MSA applications : Summary


[Figure: schematic multiple alignment of two families (1st family, 2nd family; groups labelled Bacteria, Archaea, Eucarya) annotated with: error in ORF definition, transmembrane region, additional domain, phosphorylation site, NLS, universal conservation, intra-group conservation, and differential conservation between the two families]

Information extracted: domain organization, structural motifs, key functional residues, ORF definition, localization signals, conservation pattern, ...

Applications: functional genomics, evolutionary studies, structure modeling, mutagenesis experiments, drug design


Lecompte et al Gene. 2001

MACSIMS


MAO: Multiple Alignment Ontology
http://www-igbmc.u-strasbg.fr/BioInfo/MAO/mao.html

MAO consortium:
- RNA analysis (Steve HOLBROOK, Berkeley)
- MACS algorithm (Kazutake KATOH, Kyoto)
- Protein 3D analysis (Patrice KOEHL, Davis)
- Protein 3D structure (Dino MORAS, Strasbourg)
- 3D RNA structure (Eric WESTHOF, Strasbourg)

Thompson et al. (2005) Nucleic Acids Res.
Also available from OBO web site: http://obo.sourceforge.net

MACSIMS
Multiple Alignment of Complete Sequences Information Management System
Thompson et al BMC Bioinformatics 2006
Structural and functional information is mined automatically from the public databases
Homologous regions are identified in the MACS
Mined data is evaluated and cross-validated
Mined data is propagated from known to unknown sequences with the homologous regions

MACSIMS provides a unique environment that facilitates knowledge extraction and
the presentation of the most pertinent information to the biologist

MACSIMS
http://bips.u-strasbg.fr/MACSIMS/

MACSIMS
Schematic overview of complete alignment
e.g. domain organisation (Interpro)
Key: SH3, PI-PLC-X, CH, SH2, PI-PLC-Y, rhoGEF, PH, C2, DAG_PE-bind

MACSIMS visualisation

JalView II, Coll. G. Barton

MACSIMS
BAliBASE reference 3: aldehyde dehydrogenase-like
[Figure: alignment excerpt with conserved positions marked (*) and residues E and C highlighted; motif variants shown: GSVPTG, GSTKVG, GETRTG, GSTEVG, GSVSAG, GSRDVG, GSTNVF, GSTAVF; Uniprot annotation: NAD binding, active site, active site]

Summary
Choice of multiple alignment method
traditional progressive method (e.g. clustalw / clustalx)
combined local and global method (e.g. mafft, muscle, dbclustal)
knowledge-based method (e.g. PipeAlign)
Web Server versus Local Installation ?
WARNING: Automatic alignment methods can make mistakes.
Verify alignment quality by automatic methods (e.g. norMD) and visual inspection !

Multiple alignment applications


Traditional applications:
phylogeny
conserved residue / motif identification
Information in multiple alignments also improves accuracy in:
sequence error detection
structure prediction
functional annotation


Laboratory of Integrative Genomics and Bioinformatics


IGBMC, Strasbourg


alternative algorithms
Iterative Refinement
PRRP (Gotoh, 1993) refines an initial progressive multiple alignment
by iteratively dividing the alignment into 2 profiles and realigning
them.
[Figure: flowchart. Initial alignment -> divide sequences into 2 groups (profile 1, profile 2) -> pairwise profile alignment -> refined alignment -> converged? If no, divide again]

alternative algorithms
Genetic Algorithms
SAGA (Notredame, Higgins, 1996) evolves a population of alignments in a quasi-evolutionary
manner, iteratively improving the fitness of the population
[Figure: flowchart, with steps labelled: population n; select a number of individuals to be parents; modify the parents by shuffling gaps, merging 2 alignments etc.; population n+1; evaluation of the fitness using OF (sum of pairs or COFFEE); END]

alternative algorithms
HMM
Probabilistic model for sequence profiles, visualized as a finite state
machine
For each column of the alignment, a match state models the distribution
of residues allowed
Insert and delete states at each column allow for insertion or deletion of
one or more residues
Original profile HMM (Krogh et al, 1994)

[Figure: profile HMM with match states, insert states, and delete, begin and end states, built from the alignment below]
AKY-L-D
--WVLED

Multiple Alignment using HMM
HMMER (Eddy, unpublished), SAMT98 (Hughey, 1996)
[Figure: flowchart, with steps labelled: generate initial alignment; produce a model (Baum-Welch expectation maximization); generate new alignment (Viterbi algorithm or posterior decoding); evaluate alignment (expectation maximization); END]

alternative algorithms
Segment-to-segment Alignment
Dialign (Morgenstern et al. 1996) compares segments of sequences
instead of single residues
1. construct dot-plots of all possible pairs of sequences
[Figure: dot-plot of sequence i against sequence j]
2. find a maximal set of consistent diagonals in all the sequences
.......aeyVRALFDFngndeedlpfkKGDILRIrdkpeeq...............WWNAedsegkr.GMIPVPYVek..........
........nlFVALYDFvasgdntlsitKGEKLRVlgynhnge..............WCEAqtkngq..GWVPSNYItpvns.......
ieqvpqqptyVQALFDFdpqedgelgfrRGDFIHVmdnsdpn...............WWKGachgqt..GMFPRNYVtpvnrnv.....
gsmstselkkVVALYDYmpmnandlqlrKGDEYFIleesnlp...............WWRArdkngqe.GYIPSNYVteaeds......
.....tagkiFRAMYDYmaadadevsfkDGDAIINvqaideg...............WMYGtvqrtgrtGMLPANYVeai.........
..gsptfkcaVKALFDYkaqredeltfiKSAIIQNvekqegg...............WWRGdyggkkq.LWFPSNYVeemvnpegihrd
.......gyqYRALYDYkkereedidlhLGDILTVnkgslvalgfsdgqearpeeigWLNGynettgerGDFPGTYVeyigrkkisp..

Local alignment: residues between the diagonals are not aligned
