Assignment 03
Due Thursday February 13, 2014 at 2:30 pm
Gene Prediction
PART 1: Prokaryote
1.
On the NCBI site, go to the complete genome page of E. coli K12. Download the first 3000 letters
from the complete DNA sequence. There are two genes hidden in this sequence. See if you can
find them.
2
(a) How many start codons are there in the sequence above? Here, for the sake of simplicity,
assume that ATG is the only start codon.
There are 63 ATG start codons in the sequence.
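A count like this can be checked with a few lines of code. The sketch below is only an illustration: the function name `count_start_codons` and the toy sequence are hypothetical, and it assumes the sequence is pasted as plain letters and that only the forward strand is counted, as the question implies.

```python
def count_start_codons(seq: str, codon: str = "ATG") -> dict:
    """Count occurrences of `codon` on the forward strand, in total and per reading frame."""
    # Remove whitespace and line breaks left over from a pasted sequence.
    seq = "".join(seq.split()).upper()
    per_frame = {1: 0, 2: 0, 3: 0}
    pos = seq.find(codon)
    while pos != -1:
        # Frame 1 starts at the first letter, frame 2 at the second, frame 3 at the third.
        per_frame[pos % 3 + 1] += 1
        pos = seq.find(codon, pos + 1)
    return {"total": sum(per_frame.values()), "per_frame": per_frame}


# Toy sequence for illustration only (not the E. coli data).
example = "ATGAAATGCCATGA"
print(count_start_codons(example))
```

Splitting the count by frame is a design choice made here because the ORF search in part (b) works frame by frame; the total alone answers part (a).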
(b) Use the ORF finder tool of NCBI (http://www.ncbi.nlm.nih.gov/gorf/gorf.html). Record the
output below and list the candidate ORFs that you predict could be potential coding regions.
The first ORF, in frame +1 at the top of the list, is the longest open reading frame.
When run through BLAST, it gives the most information and the strongest similarity to
sequences from other organisms. The remaining ORFs are much shorter. To be considered
as candidates, they should be long enough (roughly 300 bp or more) and should show the
amino acid composition and codon usage typical of the organism. The shorter ORFs can
still be run through BLAST, but they return little information.
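The length criterion above can be sketched in code. This is a minimal illustration, not a reimplementation of the NCBI tool: it scans only the three forward frames, treats ATG as the only start codon (as in part (a)), and the name `find_orfs` and the 300 bp default are assumptions taken from the answer above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_len: int = 300):
    """Return (frame, start, end, length) tuples for forward-strand ORFs.

    Coordinates are 1-based and inclusive. An ORF runs from the first ATG
    after the previous stop to the next in-frame stop codon (stop included).
    """
    seq = "".join(seq.split()).upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # remember the first start codon in this open stretch
            elif start is not None and codon in STOP_CODONS:
                length = i + 3 - start
                if length >= min_len:
                    orfs.append((frame + 1, start + 1, i + 3, length))
                start = None  # look for the next ORF after this stop
    # Longest ORF first, matching how the candidates are discussed above.
    return sorted(orfs, key=lambda orf: orf[3], reverse=True)


# Toy sequence for illustration only; min_len is lowered so the short ORF is reported.
toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "G"
print(find_orfs(toy, min_len=9))
```

With 1-based inclusive coordinates, an ORF spanning 337-2799 has length 2799 - 337 + 1 = 2463 bp, that is 821 codons including the stop codon.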
(c) Go to the GeneMark /FgeneSB/ Glimmer software and submit your sequence and find the genes.
Compare this prediction with ORF-finder.
The GeneMark /FgeneSB/ Glimmer predictions agree with those of ORF finder: both give
337-2799 for the first gene and 2801-2999 for the second. GeneMark and FgeneSB each
predict 2 genes.
(d) Predict the gene(s) based on the results. What are the most probable proteins encoded
by the genes you predicted?
The most probable encoded proteins are homoserine dehydrogenase and bifunctional
aspartokinase.
PART 2: Eukaryote
The DNA originates from Caenorhabditis elegans. This is an invertebrate, more precisely a
nematode, or roundworm, which is a favored experimental organism because it only has around 1000
cells (also visible in the adult nematode) and 300 neurons. All of the cells and all of the neurons have
been mapped, as well as the complete cellular development from zygote to adult nematode. The
entire genome has been sequenced. If you want to read more about C. elegans you can visit the C.
elegans WWW server.
The DNA you will use for this exercise is available in the file: C_elegansDNA.txt
Remember to save all your results (you will soon need them).
From the results find the nucleotide positions of the starts and ends of the exons and write these into
your table.
Use the following methods and answer the questions:
GenScan: http://genes.mit.edu/GENSCAN.html
Predicted genes/exons:
Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------
 1.01 Intr +   5712   5870  159         62   98   140 0.978  11.66
 1.02 Intr +   6078   6893  816         87    ?   975 0.937  80.09
 1.03 Intr +   6937   7161  225        -10  115   333 0.995  23.46
 1.04 Intr +   7619   7782  164         63   97   230 0.971  19.25
 1.05 Intr +   7829   7940  112         37   85   131 0.999   7.16
 1.06 Intr +   7988   8182  195         13   77   199 0.699   9.99
 1.07 Intr +   8252   8442  191         14   42   194 0.938   4.86
 1.08 Intr +   8526   8688  163         67   55   223 0.999  15.96
 1.09 Intr +   8739   9071  333         52  115   299 0.999  23.84
 1.10 Intr +   9603   9734  132         51   81    58 0.643   1.32
 1.11 Intr +   9799   9971  173         73   87    63 0.771   2.52
 1.12 Intr +  10742  10893  152         -7   34   141 0.123  -1.91
 1.13 Intr +  11906  12012  107         14  113    81 0.147   2.21
 1.14 Intr +  12234  12335  102         74   75    81 0.964   4.85
 1.15 Intr +  12381  12590  210         39   92   363 0.999  29.99
 1.16 Term +  12843  12929   87         79   41   136 0.987   4.68
 1.17 PlyA +  12952  12957    6                                1.05

 2.00 Prom +  12973  13012   40                              -11.54
 2.01 Init +  13065  13315  251         35   16   296 0.866  14.08
 2.02 Term +  13366  13723  358        -23   48   252 0.384   3.10
 2.03 PlyA +  13892  13897    6                               -0.45
(The Fr and Ph values are missing from this copy. For exon 1.02 only one of the two
splice-site scores survives, and whether 87 belongs under I/Ac or Do/T is uncertain.)
Exnum Type S .Begin ...End .Len Fr Ph B/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------
>/tmp/02_13_14-15:09:21.fasta|GENSCAN_predicted_peptide_1|1106_aa
NKADRMGPGGSRRPRNSQHATASTPSASSCKEQQKDVEHEFDIIAYKTTFWRTFFFYALS
FGTCGIFRLFLHWFPKRLIQFRGKRCSVENADLVLVVDNHNRYDICNVYYRNKSGTDHTV
VANTDGNLAELDELRWFKYRKLQYTWIDGEWSTPSRAYSHVTPENLASSAPTTGLKADDV
ALRRTYFGPNVMPVKLSPFYELVYKEVLSPFYIFQAISVTVWYIDDYVWYAALIIVMSLY
SVIMTLRQTRSQQRRLQSMVVEHDEVQVIRENGRVLTLDSSEIVPGDVLVIPPQGCMMYC
DAVLLNGTCIVNESMLTGESIPITKSAISDDGHEKIFSIDKHGKNIIFNGTKVLQTKYYK
GQNVKALVIRTAYSTTKGQLIRAIMYPKPADFKFFRELMKVCFDKTGTLTEDGLDFYALR
VVNDAKIGDNIVQIAANDSCQNVVRAIATCHTLSKINNELHGDPLDVIMFEQTGYSLEED
DSESHESIESIQPILIRPPKDSSLPDCQIVKQFTFSSGLQRQSVIVTEEDSMKAYCKGSP
EMIMSLCRPETVPENFHDIVEEYSQHGYRLIAVAEKELVVGSEVQKTPRQSIECDLTLIG
LVALENRLKPVTTEVIQKLNEANIRSVMVTGDNLLTALSVARECGIIVPNKSAYLIEHEN
GVVDRRGRTVLTIREKEDHHTERQPKIVDLTKMTNKDCQFAISGSTFSVVTHEYPDLLDQ
LVLVCNVFARMAPEQKQLLVEHLQDVGQTVAMCGDGANDCAALKAAHAGISLSEAEASIA
APFTSKGTAIFYVSLFHYIVLYFVFAAGPPYRASIASNKAFLISMIGVTVTCIAIVVFYV
TPIQYFLGCLQMPQEFRFIILAVATVTAVISIIYDRCVDWISERLREKSLKYAVSFLPTP
KFERLPIYNRKAFNFHSSFYSFSIMRAIVFDEKRYFVVDSSSEGLSTMKVETCVYSGYKI
HPGHGKRLVRTDGKVQIFLSGKALKGAKLRRNPRDIRWTVLYRIKNKKGTHGQEQVTRKK
TKKSVQVVNRAVAGLSLDAILAKRNQTEDFRRQQREQAAKIAKDANKAVRAAKAAANKEK
KASQPKTQQKTAKNVKTAAPRVGGKR
>/tmp/02_13_14-15:09:21.fasta|GENSCAN_predicted_peptide_2|202_aa
MRTLRIAQYSVLTVGFAIYMYRLIEEIPIDIRNLNSDSLEGIINSDELCDVTVSNRNRGL
LVRNDSLDLDILKAKFTTFFSKRYLTRFLSEQVPFLHVIDEALLVKRFVMCACFMVFCLT
VIWFLVIRRMGNLIKRLSVLNQLEDAESVEWARCIREFTQEKLAVLCFCIVPPFAQTDKL
VSDKIKLFREHKILRIRSVQH
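The exon coordinates in the GENSCAN table can be cross-checked against the peptide lengths given in the FASTA headers (1106_aa and 202_aa). The sketch below copies the Begin and End values of the Intr, Init and Term rows from the output above. The helper names are hypothetical, and it assumes that each predicted coding sequence ends with a stop codon that is counted in the exon lengths but not in the peptide.

```python
# (Begin, End) pairs copied from the GENSCAN output; 1-based, inclusive.
GENE_1_EXONS = [
    (5712, 5870), (6078, 6893), (6937, 7161), (7619, 7782),
    (7829, 7940), (7988, 8182), (8252, 8442), (8526, 8688),
    (8739, 9071), (9603, 9734), (9799, 9971), (10742, 10893),
    (11906, 12012), (12234, 12335), (12381, 12590), (12843, 12929),
]
GENE_2_EXONS = [(13065, 13315), (13366, 13723)]


def coding_length(exons):
    """Total number of coding bases across all exons."""
    return sum(end - begin + 1 for begin, end in exons)


def expected_peptide_length(exons):
    """Codons in the spliced coding sequence, minus one for the stop codon (assumption)."""
    total = coding_length(exons)
    if total % 3 != 0:
        raise ValueError(f"coding length {total} is not a multiple of 3")
    return total // 3 - 1


for name, exons in (("gene 1", GENE_1_EXONS), ("gene 2", GENE_2_EXONS)):
    print(name, coding_length(exons), "bp ->", expected_peptide_length(exons), "aa")
```

For gene 1 the exons sum to 3321 bp, or 1107 codons, which gives 1106 amino acids once the stop codon is set aside. For gene 2 the sum is 609 bp, or 203 codons, giving 202 amino acids. Both agree with the lengths in the FASTA headers.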
FGENESH_C:
http://linux1.softberry.com/berry.phtml?topic=fgenes_c&group=programs&subgroup=gfs