Homework

BTEC 4300
Assignment 03
Due Thursday February 13, 2014 at 2:30 pm
Gene Prediction
PART 1: Prokaryote
1.
On the NCBI site, go to the complete genome page of E. coli K12. Download the first 3000 letters
from the complete DNA sequence. There are two genes hidden in this sequence. See if you can
find these.
2
(a) How many start codons are there in the sequence above? Here, for the sake of simplicity,
assume that ATG is the only start codon.
There are 63 ATG start codons in the sequence.
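The count in (a) can be reproduced with a short script. The following is a minimal sketch, assuming the downloaded sequence is available as a plain string and that only the forward strand is scanned; the function name and the toy sequence are illustrative and not part of the assignment.

```python
def count_start_codons(seq, codon="ATG"):
    """Count every occurrence of `codon` in `seq`, in all three forward frames."""
    seq = seq.upper()
    count = 0
    pos = seq.find(codon)
    while pos != -1:
        count += 1
        # Advance by one base so that hits in any frame are found.
        pos = seq.find(codon, pos + 1)
    return count


# Toy sequence (not the E. coli data): ATG occurs at three positions.
toy = "ccATGaaATGATGtt"
print(count_start_codons(toy))  # prints 3
```

To apply it to the real data, read the 3000-letter sequence from the downloaded file, strip the FASTA header and line breaks, and pass the resulting string to the function.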
(b) Use the ORF finder tool of NCBI (http://www.ncbi.nlm.nih.gov/gorf/gorf.html). Record the
output below and list the candidate ORFs that you predict could be potential coding regions.
The first ORF, in frame +1 at the top of the output, is the strongest candidate because it is
the longest open reading frame. A BLAST search with it should return the most information
and the greatest similarity to other organisms. The remaining ORFs are much shorter. To be
considered, an ORF should be roughly 300 bp or longer and should have an amino acid
composition and codon usage typical of the given organism. The shorter ORFs can still be
searched, but BLAST will return little information for them.
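The length criterion described above can be illustrated with a simple ORF scan. This is a minimal sketch, not the NCBI ORF finder algorithm: it assumes ATG is the only start codon, scans only the three forward frames, and reports 1-based inclusive coordinates that include the stop codon. The function name and the `min_len` default are assumptions chosen to match the 300 bp rule of thumb.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_len=300):
    """Return (frame, start, end, length) tuples for ATG..stop ORFs.

    Only the forward strand is scanned. start and end are 1-based and
    inclusive, and end includes the stop codon. Results are sorted with
    the longest ORF first.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG after the previous stop opens the ORF
            elif start is not None and codon in STOP_CODONS:
                length = i + 3 - start
                if length >= min_len:
                    orfs.append((frame + 1, start + 1, i + 3, length))
                start = None  # look for the next ORF in this frame
    return sorted(orfs, key=lambda orf: orf[3], reverse=True)


# Toy example with a lowered length cutoff: one 18 bp ORF in frame 2.
toy = "C" + "ATG" + "GCT" * 4 + "TAA"
print(find_orfs(toy, min_len=9))  # prints [(2, 2, 19, 18)]
```

A fuller version would also scan the reverse complement, since the ORF finder reports ORFs in all six frames.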
(c) Go to the GeneMark /FgeneSB/ Glimmer software and submit your sequence and find the genes.
Compare this prediction with ORF-finder.
The GeneMark /FgeneSB/ Glimmer predictions and the ORF finder output are similar. Both predict
the same first gene at 337-2799 and the same second gene at 2801-2999.
GeneMark /FgeneSB/ predicts 2 genes.
(d) Predict the gene/s based on the results. What are the most probable proteins that are encoded
by the genes you predicted?
The most probable encoded proteins are homoserine dehydrogenase and bifunctional
aspartokinase.
PART 2: Eukaryote
The DNA originates from Caenorhabditis elegans. This is an invertebrate, more precisely a
nematode, or roundworm, which is a favored experimental organism because it has only around 1000
cells (also visible in the adult nematode) and about 300 neurons. All of the cells and all of the neurons have
been mapped, as well as the complete cellular development from zygote to adult nematode. The
entire genome has been sequenced. If you want to read more about C. elegans you can visit the C.
elegans WWW server.
The DNA you will use for this exercise is available in file : C_elegansDNA.txt
Gene finding using ab initio methods

We will try to find genes in the piece of DNA using different methods.
To facilitate a comparison between the different results, and the elucidation of the correct gene
structure, store all the nucleotide positions for exon start and end sites. Use a table for this
purpose.
Remember to save all your results (you will soon need them).
From the results find the nucleotide positions of the starts and ends of the exons and write these into
your table.
Use the following methods and answer the questions:
GenScan: http://genes.mit.edu/GENSCAN.html
Sequence /tmp/02_13_14-15:09:21.fasta : 13990 bp : 35.93% C+G : Isochore 1 ( 0 - 43 C+G%)
Parameter matrix: HumanIso.smat
Predicted genes/exons:
Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------
1.01  Intr +   5712   5870  159         62   98   140 0.978  11.66
1.02  Intr +   6078   6893  816         87          975 0.937  80.09
1.03  Intr +   6937   7161  225        -10  115   333 0.995  23.46
1.04  Intr +   7619   7782  164         63   97   230 0.971  19.25
1.05  Intr +   7829   7940  112         37   85   131 0.999   7.16
1.06  Intr +   7988   8182  195         13   77   199 0.699   9.99
1.07  Intr +   8252   8442  191         14   42   194 0.938   4.86
1.08  Intr +   8526   8688  163         67   55   223 0.999  15.96
1.09  Intr +   8739   9071  333         52  115   299 0.999  23.84
1.10  Intr +   9603   9734  132         51   81    58 0.643   1.32
1.11  Intr +   9799   9971  173         73   87    63 0.771   2.52
1.12  Intr +  10742  10893  152         -7   34   141 0.123  -1.91
1.13  Intr +  11906  12012  107         14  113    81 0.147   2.21
1.14  Intr +  12234  12335  102         74   75    81 0.964   4.85
1.15  Intr +  12381  12590  210         39   92   363 0.999  29.99
1.16  Term +  12843  12929   87         79   41   136 0.987   4.68
1.17  PlyA +  12952  12957                                    1.05

2.00  Prom +  12973  13012   40                             -11.54
2.01  Init +  13065  13315  251         35   16   296 0.866  14.08
2.02  Term +  13366  13723  358        -23   48   252 0.384   3.10
2.03  PlyA +  13892  13897                                   -0.45
Suboptimal exons with probability > 1.000
Exnum Type S .Begin ...End .Len Fr Ph B/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------
NO EXONS FOUND AT GIVEN PROBABILITY CUTOFF
Predicted peptide sequence(s):
>/tmp/02_13_14-15:09:21.fasta|GENSCAN_predicted_peptide_1|1106_aa
NKADRMGPGGSRRPRNSQHATASTPSASSCKEQQKDVEHEFDIIAYKTTFWRTFFFYALS
FGTCGIFRLFLHWFPKRLIQFRGKRCSVENADLVLVVDNHNRYDICNVYYRNKSGTDHTV
VANTDGNLAELDELRWFKYRKLQYTWIDGEWSTPSRAYSHVTPENLASSAPTTGLKADDV
ALRRTYFGPNVMPVKLSPFYELVYKEVLSPFYIFQAISVTVWYIDDYVWYAALIIVMSLY
SVIMTLRQTRSQQRRLQSMVVEHDEVQVIRENGRVLTLDSSEIVPGDVLVIPPQGCMMYC
DAVLLNGTCIVNESMLTGESIPITKSAISDDGHEKIFSIDKHGKNIIFNGTKVLQTKYYK
GQNVKALVIRTAYSTTKGQLIRAIMYPKPADFKFFRELMKVCFDKTGTLTEDGLDFYALR
VVNDAKIGDNIVQIAANDSCQNVVRAIATCHTLSKINNELHGDPLDVIMFEQTGYSLEED
DSESHESIESIQPILIRPPKDSSLPDCQIVKQFTFSSGLQRQSVIVTEEDSMKAYCKGSP
EMIMSLCRPETVPENFHDIVEEYSQHGYRLIAVAEKELVVGSEVQKTPRQSIECDLTLIG
LVALENRLKPVTTEVIQKLNEANIRSVMVTGDNLLTALSVARECGIIVPNKSAYLIEHEN
GVVDRRGRTVLTIREKEDHHTERQPKIVDLTKMTNKDCQFAISGSTFSVVTHEYPDLLDQ
LVLVCNVFARMAPEQKQLLVEHLQDVGQTVAMCGDGANDCAALKAAHAGISLSEAEASIA
APFTSKGTAIFYVSLFHYIVLYFVFAAGPPYRASIASNKAFLISMIGVTVTCIAIVVFYV
TPIQYFLGCLQMPQEFRFIILAVATVTAVISIIYDRCVDWISERLREKSLKYAVSFLPTP
KFERLPIYNRKAFNFHSSFYSFSIMRAIVFDEKRYFVVDSSSEGLSTMKVETCVYSGYKI
HPGHGKRLVRTDGKVQIFLSGKALKGAKLRRNPRDIRWTVLYRIKNKKGTHGQEQVTRKK
TKKSVQVVNRAVAGLSLDAILAKRNQTEDFRRQQREQAAKIAKDANKAVRAAKAAANKEK
KASQPKTQQKTAKNVKTAAPRVGGKR
>/tmp/02_13_14-15:09:21.fasta|GENSCAN_predicted_peptide_2|202_aa
MRTLRIAQYSVLTVGFAIYMYRLIEEIPIDIRNLNSDSLEGIINSDELCDVTVSNRNRGL
LVRNDSLDLDILKAKFTTFFSKRYLTRFLSEQVPFLHVIDEALLVKRFVMCACFMVFCLT
VIWFLVIRRMGNLIKRLSVLNQLEDAESVEWARCIREFTQEKLAVLCFCIVPPFAQTDKL
VSDKIKLFREHKILRIRSVQH
Q1. How many exons are predicted?

The first gene has 16 exons and the second has 2 exons.
Q2. What are the begin and end positions?

The first gene's exons begin at 5712 and end at 12929. The second gene's exons begin at
13065 and end at 13723 (its predicted promoter lies at 12973-13012).
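The exon coordinates can be checked against the predicted peptide lengths. The sketch below copies the begin and end positions from the GENSCAN output above, sums the exon lengths, and converts the total to amino acids. It assumes that the terminal exon includes the stop codon; the function names are illustrative.

```python
# (begin, end) pairs copied from the GENSCAN output, 1-based and inclusive.
GENE1_EXONS = [
    (5712, 5870), (6078, 6893), (6937, 7161), (7619, 7782),
    (7829, 7940), (7988, 8182), (8252, 8442), (8526, 8688),
    (8739, 9071), (9603, 9734), (9799, 9971), (10742, 10893),
    (11906, 12012), (12234, 12335), (12381, 12590), (12843, 12929),
]
GENE2_EXONS = [(13065, 13315), (13366, 13723)]


def coding_length(exons):
    """Total number of nucleotides covered by the exons."""
    return sum(end - begin + 1 for begin, end in exons)


def peptide_length(exons):
    """Amino acid count, assuming the last codon is the stop codon."""
    nt = coding_length(exons)
    if nt % 3 != 0:
        raise ValueError("coding length is not a multiple of three")
    return nt // 3 - 1


print(coding_length(GENE1_EXONS), peptide_length(GENE1_EXONS))  # prints 3321 1106
print(coding_length(GENE2_EXONS), peptide_length(GENE2_EXONS))  # prints 609 202
```

Both results agree with the 1106_aa and 202_aa lengths in the predicted peptide headers, which is a quick way to confirm that the coordinates were copied into the table correctly.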
Q3. For the possible exons, note the probability of each
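The probability of each predicted exon is listed in the P column of the table above. The
values range from 0.123 (exon 1.12) to 0.999. No suboptimal exons were found at the
probability > 1.000 cutoff.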
Q4. On which strand (+ or -) is the gene located?
Both genes are on the forward strand, as indicated by the + in the S column.
Q5. Write down the first 6 amino acids and the total length of the predicted protein
The first predicted protein begins with NKADRM and is 1106 amino acids long. The second
begins with MRTLRI and is 202 amino acids long.
Pick any one of the other genefinding programs

a. GeneMark http://exon.gatech.edu/eukhmm.cgi
b. GeneID http://genome.crg.es/software/geneid/geneid.html
c. FGENESH http://linux1.softberry.com/berry.phtml?topic=fgenesh&group=programs&subgroup=gfind
Compare your results to those from GenScan

Use any one of the tools below to Find the SPLICE sites
http://spliceport.cbcb.umd.edu/
http://www.umd.be/HSF/
http://wangcomputing.com/assp/index.html
Do the predicted splice sites match the ones from your ab-initio predictions?
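One quick check of the ab initio exon boundaries is whether each implied intron starts with GT and ends with AG, the canonical splice site dinucleotides. The sketch below is an illustration under stated assumptions: exons are 1-based inclusive (begin, end) pairs on the forward strand, and the toy sequence is invented for the example rather than taken from C_elegansDNA.txt.

```python
def intron_boundaries(seq, exons):
    """Return (donor, acceptor) dinucleotides for each intron.

    `exons` is a list of 1-based inclusive (begin, end) pairs, sorted by
    position, on the forward strand of `seq`.
    """
    boundaries = []
    for (_, end1), (begin2, _) in zip(exons, exons[1:]):
        donor = seq[end1:end1 + 2]              # first two intron bases
        acceptor = seq[begin2 - 3:begin2 - 1]   # last two intron bases
        boundaries.append((donor.upper(), acceptor.upper()))
    return boundaries


# Toy sequence: exon 1 (1-6), intron (7-18), exon 2 (19-24).
toy = "ATGGCC" + "GTAAGTTTTCAG" + "GGCTAA"
print(intron_boundaries(toy, [(1, 6), (19, 24)]))  # prints [('GT', 'AG')]
```

Introns that do not show GT and AG at their ends are worth comparing against the splice site tool output before trusting the corresponding exon boundary.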
Gene finding using HOMOLOGY methods
A. Gene finding using EST searches Use blastX
Perform a BlastN at NCBI against C. elegans (or invertebrate) ESTs.
Then select ESTs covering as much as possible of your genomic DNA and try to reconstitute the
entire gene (if this is possible). Retrieve your ESTs from the Blast results. You simply do this by
selecting the ESTs with the click boxes.
Remember to save your ESTs in fasta format (the starting DNA was in the correct format).
Run FGENESH_C using the cDNA from the ESTs
FGENESH_C:
http://linux1.softberry.com/berry.phtml?topic=fgenes_c&group=programs&subgroup=gfs
Finding the correct CDS

Go to your table. Look at the different exon start sites and exon end sites.
Are the predictions identical?
Which do you trust the most? Why?
Did any of the gene finding methods arrive at the correct sequence?
From the results choose the exon starts and exon ends that you trust the most and write them in the
last column (My Gene) of the table.
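Comparing the columns of the table by eye is error-prone, so a small set comparison can help. In the sketch below the first list holds three exons copied from the GENSCAN output, while the second list is a hypothetical result from another program, invented purely to show the comparison; replace it with the coordinates you actually recorded.

```python
def compare_exons(a, b):
    """Split two exon lists into shared exons and exons unique to each list."""
    a_set, b_set = set(a), set(b)
    return {
        "shared": sorted(a_set & b_set),
        "only_a": sorted(a_set - b_set),
        "only_b": sorted(b_set - a_set),
    }


# First three GENSCAN exons from the output above.
genscan = [(5712, 5870), (6078, 6893), (6937, 7161)]
# Hypothetical second predictor (made-up values for illustration only).
other = [(5712, 5870), (6078, 6890), (6937, 7161)]

result = compare_exons(genscan, other)
print(result["shared"])  # prints [(5712, 5870), (6937, 7161)]
print(result["only_a"])  # prints [(6078, 6893)]
print(result["only_b"])  # prints [(6078, 6890)]
```

Exons that every method agrees on are the safest choices for the My Gene column; disagreements are the places where EST evidence is most useful.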
Analyzing the CDS

Perform a BlastP against non redundant protein databases. You can use the GenScan translated
peptides result directly for this.
What kind of protein(s) did you find?
