Académique Documents
Professionnel Documents
Culture Documents
structure prediction
Lorenzo Cerutti
Swiss Institute of Bioinformatics
EMBnet course, 2004
Outline
EMBnet 2004
Outline
Introduction
Ab initio methods
Principles: signal detection and coding statistics
Methods to integrate signal detection and coding statistics
Examples of software
Homology methods
Principles
An overview of the homology methods
Examples of software
Introduction
EMBnet 2004
Introduction
EMBnet 2004
Transcription
Replication
DNA
Translation
RNA
Protein
3
Introduction
EMBnet 2004
Introduction
EMBnet 2004
5
3
3
5
Introduction
EMBnet 2004
11
00
00
11
11
00
5
3
3
5
Introduction
EMBnet 2004
Introduction
EMBnet 2004
Ab initio methods
EMBnet 2004
Ab initio methods
EMBnet 2004
Coding region
probability
AAAAA
ATG
GT
{TAA,TGA,TAG}
AG
Promoter signal
AAAAA PolyA
signal
10
EMBnet 2004
11
EMBnet 2004
T [0100]
w1 weights
A [1000]
w2
C [0010]
w3
A [1000]
w4
G [0001]
w5
G [0001]
w6
C [0010]
w7
C [0010]
~ 1=> true
~0 => false
w8
12
EMBnet 2004
13
EMBnet 2004
14
EMBnet 2004
From Guig
o, Genetic Databases, Academic Press, 1999.
15
EMBnet 2004
(1)
(2)
(3)
The probability Pi of the ith reading frame for being the coding region is (i = 1, 2, 3):
Pi =
pi
p1 + p2 + p3
(4)
16
EMBnet 2004
17
EMBnet 2004
18
EMBnet 2004
Intron
Exon
Begin
End
19
EMBnet 2004
spacer
Py tract
intron
GT/GCPhase 0 central
AG
2bp
1bp
Py tract
central
GT/GC
AG
spacer
Py tract
central
GT/GC
spacer
AG
U
TR
exon
3
TR
U
l
na
po
ig
ly
s
er
ot
sig
na
l
om
pr
intragenic
region
20
EMBnet 2004
I0
E init
+
Prom
E term
-
PolyA
-
E single
-
Intragenic
E0
E single
Prom
-
E init
-
I2
-
E2
-
I1
-
E1
-
I0
-
E0
-
5 UTR
+
I1
E1
PolyA
+
E term
I2
E2
3 UTR
-
5 UTR
-
21
EMBnet 2004
GENSCAN output
WEB server: http://genes.mit.edu/GENSCAN.html
Vertebrate, Arabidopsis, Maize
Gn.Ex
----1.00
1.01
1.02
1.03
1.04
1.05
1.06
1.07
1.08
Type
---Prom
Init
Intr
Intr
Intr
Intr
Intr
Intr
Intr
>02:36:44|GENSCAN_predicted_peptide_1|448_aa
MCRAISLRRLLLLLLQLSQLLAVTQGKTLVLGKEGESAELPCESSQKKITVFTWKFSDQR
KILGQHGKGVLIRGGSPSQFDRFDSKKGAWEKGSFPLIINKLKMEDSQTYICELENRKEE
...
22
EMBnet 2004
GENSCAN output
Gn.Ex : gene number, exon number (for reference)
Type : Init = Initial exon (ATG to 5 splice site)
Intr = Internal exon (3 splice site to 5 splice site)
Term = Terminal exon (3 splice site to stop codon)
Sngl = Single-exon gene (ATG to stop)
Prom = Promoter (TATA box / initiation site)
PlyA = poly-A signal (consensus: AATAAA)
S
: DNA strand (+ = input strand; - = opposite strand)
Begin : beginning of exon or signal (numbered on input strand)
End
: end point of exon or signal (numbered on input strand)
Len
: length of exon or signal (bp)
Fr
: reading frame (a forward strand codon ending at x has frame x mod 3)
Ph
: net phase of exon (exon length modulo 3)
I/Ac : initiation signal or 3 splice site score (tenth bit units)
Do/T : 5 splice site or termination signal score (tenth bit units)
CodRg : coding region score (tenth bit units)
P
: probability of exon (sum over all parses containing exon)
Tscr : exon score (depends on length, I/Ac, Do/T and CodRg scores)
23
EMBnet 2004
GENSCAN output
GENSCAN predicted genes in sequence 02:36:44
kb
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
5.0
5.5
6.0
6.5
7.0
7.5
8.0
8.5
9.0
9.5
10.0
10.5
11.0
11.5
12.0
12.5
13.0
13.5
14.0
14.5
15.0
15.5
16.0
16.5
17.0
17.5
18.0
18.5
19.0
19.5
20.0
kb
5.0
kb
10.0
kbkb
15.0
Key:
Optimal exon
Initial
exon
Internal
exon
Terminal
exon
Single-exon
gene
Suboptimal exon
24
EMBnet 2004
25
EMBnet 2004
HMMgene output
# SEQ: Sequence 20000 (-) A:5406 C:4748 G:4754 T:5092
Sequence HMMgene1.1a firstex 17618 17828 0.578 - 1 bestparse:cds_1
Sequence HMMgene1.1a exon_1
17049 17101 0.560 - 0 bestparse:cds_1
Sequence HMMgene1.1a exon_2
14517 14607 0.659 - 1 bestparse:cds_1
Sequence HMMgene1.1a exon_3
13918 13973 0.718 - 0 bestparse:cds_1
Sequence HMMgene1.1a exon_4
12441 12508 0.751 - 2 bestparse:cds_1
Sequence HMMgene1.1a lastex
7045
7222
0.893 - 0 bestparse:cds_1
Sequence HMMgene1.1a CDS
7045
17828 0.180 - . bestparse:cds_1
Sequence HMMgene1.1a DON
19837 19838 0.001 - 1
Sequence HMMgene1.1a START
19732 19734 0.024 - .
Sequence HMMgene1.1a ACC
19712 19713 0.001 - 0
Sequence HMMgene1.1a DON
19688 19689 0.006 - 1
Sequence HMMgene1.1a DON
19686 19687 0.004 - 0
...
------------ ----- ---position
prob
strand and frame
Symbols: rstex = rst exon; exon n = internal exon; lastex = last exon; singleex = single exon gene; CDS = coding region
26
EMBnet 2004
27
EMBnet 2004
0
0
28
EMBnet 2004
29
EMBnet 2004
FGENES output
Length of sequence:
20000 GC content: 0.48 Zone: 2
Number of predicted genes:
2 In +chain:
2 In -chain:
0
Number of predicted exons: 12 In +chain: 12 In -chain:
0
Predicted genes and exons in var:
2 Max var=
15 GENE WEIGHT:
G Str Feature Start
End
Weight ORF-start ORF-end
1 +
1 +
1 +
2
2
2
2
2
2
2
2
2
2
+
+
+
+
+
+
+
+
+
+
1 CDSf
2 CDSl
PolA
1
2
3
4
5
6
7
8
9
10
CDSf
CDSi
CDSi
CDSi
CDSi
CDSi
CDSi
CDSi
CDSi
CDSl
1032
1835
1.84
0.89
4.64
5266
5562
11490
11899
12424
14623
17203
17859
18264
18630
5.25
3.08
0.76
3.28
2.48
3.26
2.79
1.62
2.53
0.87
990 1578 -
5215
5397
11466
11740
12190
14290
17005
17741
18196
18325
27.3
1031
1832
5265
5561
11489
11898
12423
14622
17202
17857
18264
18627
(CDSf = rst exon; CDSi = internal exon; CDSl = last exon; CDSo = only one exon; PolA = PolyA signal)
30
EMBnet 2004
FGENES output
Predicted proteins:
>FGENES-M 1.5 >MySeq
1 Multiexon gene
990 1835
MSSAFSDPFKEQNPVISLITRTNLNSSSLPVRIYCQPPNMFLYIAPCAVLVLSTSSTPRR
TENGPLRMALNSRFPASFYLLCRDYQYTPPQLGPLHGRCS
>FGENES-M 1.5 >MySeq
2 Multiexon gene
5215 18630
MCRAISLRRLLLLLLQLSQLLAVTQGKTLVLGKEGESAELPCESSQKKITVFTWKFSDQR
100 a Ch+
558 a Ch+
31
EMBnet 2004
http://www.cshl.org/genender
Human, Mouse, Arabidopsis, Fission yeast
32
EMBnet 2004
MZEF output
Internal coding exons predicted by MZEF
Sequence_length: 19920 G+C_content: 0.475
Coordinates P
Fr1
Fr2
Fr3
Orf 3ss
Cds
5ss
5315 - 5482 0.580 0.623 0.528 0.585 122 0.506 0.608 0.552
6475 - 6582
0.752 0.482 0.563 0.558 221 0.505 0.567 0.598
11658 - 11819 0.822 0.476 0.569 0.497 211 0.554 0.560 0.651
14208 - 14543 0.903 0.593 0.619 0.469 212 0.497 0.603 0.575
33
EMBnet 2004
34
EMBnet 2004
YES
d + a < 1.3?
d + a < 5.3?
YES
NO
(6,560)
YES
(18,160)
YES
NO
NO
(5,21)
(23,16)
YES
YES
NO
(9,49)
(142,13)
NO
(151,50)
YES
NO
(24,13)
(1,5)
35
EMBnet 2004
Exon score
Input layer
Hidden layer
Outout layer
36
EMBnet 2004
37
EMBnet 2004
GRAIL
[grail2exons -> Exons]
12345678910-
St
f
f
f
f
f
f
f
f
f
f
Fr
1
0
2
0
0
0
0
0
0
1
Start
479
5176
5395
7063
11827
12188
14288
17003
17751
18212
Score
52.000
82.000
99.000
53.000
74.000
88.000
94.000
100.000
50.000
61.000
Quality
good
excellent
excellent
good
good
excellent
excellent
excellent
good
good
38
Homology methods
EMBnet 2004
39
Homology methods
EMBnet 2004
Pairwise comparison
00000
11111
11111
00000
00
11
11
00
000
111
111
000
00000
11111
00000
11111
00
11
00
11
000
111
000
111
00000
11111
00000
11111
00
11
00
11
000
111
000
111
40
Homology methods
EMBnet 2004
41
Homology methods
EMBnet 2004
http://www.ebi.ac.uk/Wise2/.
42
Homology methods
EMBnet 2004
central
Phase 1 intron
central
Phase 0 intron
central
spacer
Py tract
Py tract
Py tract
match
1bp
spacer
2bp
spacer
insert
delete
43
Homology methods
EMBnet 2004
seq1
249 TDRRIGCLLS
GLDSSLVAATLLK
TDRRIGCLLS
GLDSSLVAATLLK
TDRRIGCLLS
G:G[ggg]
GLDSSLVAATLLK
12930 agaaagtcttGGTGAAGT Intron 4
TAGGGgtgtatgggacta
caggtggttc <1-----[12961:13408]-1> gtacgttccctta
acagtcctaa
cgcccgttctggg
...
Gene 2979 19554
Exon 2979 3227 phase 0
Exon 7315 7552 phase 0
Exon 12416 12601 phase
Exon 12859 12960 phase
Exon 13409 13536 phase
Exon 14999 15125 phase
Exon 16356 16462 phase
Exon 18601 18756 phase
Exon 19348 19554 phase
1
1
1
0
1
0
0
44
Homology methods
EMBnet 2004
seq1
249 TDRRIGCLLS
GLDSSLVAATLLK
TDRRIGCL S
GLDSSLVAATLLK
TDRRIGCL!S
G:G[ggg]
GLDSSLVAATLLK
12930 agaaagtc2tGGTGAAGT Intron 4
TAGGGgtgtatgggacta
caggtggt c <1-----[12960:13407]-1> gtacgttccctta
acagtcct a
cgcccgttctggg
...
Gene 1
Gene 2979 12953
Exon 2979 3227 phase 0
Exon 7315 7552 phase 0
Exon 12416 12601 phase
Exon 12859 12953 phase
Gene 2
Gene 12956 19553
Exon 12956 12959 phase
Exon 13408 13535 phase
Exon 14998 15124 phase
Exon 16355 16461 phase
Exon 18600 18755 phase
Exon 19347 19553 phase
1
1
0
1
0
1
0
0
45
Homology methods
EMBnet 2004
seq1
249 TDRR--CLLS
GLDSSLVAATLLK
TDRR CLLS
GLDSSLVAATLLK
TDRRIGCLLS
G:G[ggg]
GLDSSLVAATLLK
12930 agaaagtcttGGTGAAGT Intron 4
TAGGGgtgtatgggacta
caggtggttc <1-----[12961:13408]-1> gtacgttccctta
acagtcctaa
cgcccgttctggg
...
Gene 1
Gene 2979 19554
Exon 2979 3227 phase 0
Exon 7315 7552 phase 0
Exon 12416 12601 phase
Exon 12859 12960 phase
Exon 13409 13536 phase
Exon 14999 15125 phase
Exon 16356 16462 phase
Exon 18601 18756 phase
Exon 19348 19554 phase
1
1
1
0
1
0
0
46
Homology methods
EMBnet 2004
sim4 performs very well, but needs strong similarity between the sequences.
47
Homology methods
EMBnet 2004
sim4 output
...
1050
.
:
.
:
.
:
.
:
.
:
12123 ATTACAACAGTTCGTG...GTGGTGATCTTCTCTGGAGAAGGATCAGATG
|||||||||||||>>>...>>>||-|||||||||||||||||||||||||
1006 ATTACAACAGTTC
GT ATCTTCTCTGGAGAAGGATCAGATG
1100
.
:
.
:
.
:
.
:
.
:
13453 AACTTACGCAGGGTTACATATATTTTCACAAGGTA...CAGAATGGGATA
||||||||||||||||||||||||||||||||>>>...>>>|||||||||
1046 AACTTACGCAGGGTTACATATATTTTCACAAG
AATGGGATA
...
1-249 (1-249)
100% -> (GT/AG)
4337-4574 (250-487)
100% -> (GT/AG)
9438-9623 (488-673)
100% -> (GT/AG)
9881-9982 (674-775)
100% -> (GT/AG)
10431-10558 (776-903)
100% -> (GT/AG)
12021-12135 (904-1018)
100% -> (GT/AG)
13425-13484 (1019-1077)
98% -> (GT/AG)
15623-15778 (1078-1233)
100% -> (GT/AG)
16370-16576 (1234-1440)
100%
48
Homology methods
EMBnet 2004
Some post-processing is
BLAST can indicate the rough position of exons, but nothing about the gene
structure.
AG
GT
AG
GT
"ideal" BLAST
"real" BLAST
However, BLAST is fast! and can reduce the search space for others programs.
49
Homology methods
EMBnet 2004
cDNA sequence
BLAST vs genomic
111 1111
000
0000 1111
0000
000 1111
111
0000 1111
0000
111 1111
000
0000 1111
0000
000 1111
111
0000 1111
0000
1111111
000
00001111
0000
0001111
111
00001111
0000
GeneWise
11 000
00
111 111
000
00 111
11
000 000
111
1111111
000
00001111
0000
0001111
111
00001111
0000
sim4
11 000
00
111 111
000
00 111
11
000 000
111
50
Evaluation of performances
EMBnet 2004
Evaluation of performances
51
Evaluation of performances
EMBnet 2004
Evaluating performances
TP
FP
TN
FN
TP
FN TN
Real
Predicted
TP
T P +F N
TP
T P +F P
52
Evaluation of performances
EMBnet 2004
Evaluating performances
Measures (contd.):
Correlation coecient CC is a single measure that captures both specicity and
sensitivity:
CC =
(T P T N )(F N F P )
(T P +F N )(T N +F P )(T P +F P )(T N +F N )
ACP =
1
TP
4 T P +F N
TP
T P +F P
TN
T N +F P
TN
T N +F N
53
Evaluation of performances
EMBnet 2004
No. of sequences
195
195
195
195
119
Sn
0.86
0.87
0.95
0.93
0.70
Sp
0.88
0.89
0.90
0.93
0.73
AC
0.84 0.19
0.84 0.18
0.91 0.12
0.91 0.13
0.68 0.21
CC
0.83
0.83
0.91
0.91
0.66
54
Evaluation of performances
EMBnet 2004
55
Evaluation of performances
EMBnet 2004
a
Sn
97
86
96
89
Spa
91
95
83
92
77
b
Sn
93
90
Spb
93
93
91
86
90
88
56
Evaluation of performances
EMBnet 2004
57
Evaluation of performances
EMBnet 2004
The end
58