D i n e s
5/4/12
www.genomesonline.org
We can design software to scan the genome and identify these features. Some of these programs work quite well, especially in bacteria and simpler eukaryotes with smaller and more compact genomes.
It is a lot harder for the higher eukaryotes, where there are many long introns, genes can be found within introns of other genes, etc.
We tend to do OK finding protein-coding regions, but miss a lot of non-coding 5' exons and the like.
match to previously annotated cDNA
match to EST from same organism
similarity of nucleotide or conceptually translated protein sequence to sequences in GenBank (translation works better; why?)
protein structure prediction match to a PFAM domain
associated with recognized promoter sequences, i.e., TATA box, CpG island
known phenotype from mutation of the locus
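One possible answer to the question of why translated comparison works better is the degeneracy of the genetic code: synonymous codon changes alter the nucleotide sequence but not the protein. The sketch below illustrates this with two made-up sequences (`seq_a` and `seq_b` are hypothetical, chosen for illustration only) and the standard genetic code. The helper names `translate` and `identity` are assumptions of this example, not part of any tool mentioned in the slides.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Conceptually translate a DNA string in frame 0 (trailing partial codon ignored)."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical sequences: same protein, different synonymous codons.
seq_a = "ATGGCTCTGTCTCGTGGT"
seq_b = "ATGGCCTTAAGCAGAGGA"

nucleotide_identity = identity(seq_a, seq_b)
protein_identity = identity(translate(seq_a), translate(seq_b))
print(nucleotide_identity, protein_identity)
```

In this toy case the two sequences agree at only half of their nucleotide positions yet encode the same peptide, so a protein-level comparison detects a similarity that a nucleotide-level comparison would score poorly.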
often not poly-A tailed, so they don't end up in cDNA libraries
no ORF
constraint on sequence divergence is at the nucleotide level, not the protein level, so homology is harder to detect
[Figure: gene model drawn as boxes (states, e.g. single exon) and arrows (transitions), with emitted nucleotides A, T, G, C]
Each box and arrow has associated transition probabilities, and emission probabilities for emission of nucleotides (dotted arrow). These are learned from examples of known gene models and provide the probability that a stretch of sequence is a gene.
Adapted from Gibson and Muse, A Primer of Genome Science
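To make the idea of transition and emission probabilities concrete, here is a minimal sketch of a two-state model decoded with the Viterbi algorithm. Everything in it is an assumption for illustration: the state names (`intergenic`, `exon`), the probability values, and the function name `viterbi` are hypothetical, and real gene finders use many more states with probabilities learned from known gene models.

```python
import math

# Hypothetical two-state model; all numbers are made up for illustration.
STATES = ("intergenic", "exon")
START = {"intergenic": 0.9, "exon": 0.1}
TRANS = {
    "intergenic": {"intergenic": 0.9, "exon": 0.1},
    "exon": {"intergenic": 0.1, "exon": 0.9},
}
EMIT = {
    "intergenic": {"A": 0.35, "T": 0.35, "G": 0.15, "C": 0.15},  # AT-rich
    "exon": {"A": 0.15, "T": 0.15, "G": 0.35, "C": 0.35},        # GC-rich
}


def viterbi(seq):
    """Return the most probable state path for seq (log-space Viterbi)."""
    if not seq:
        return []
    # Initialise with start and emission probabilities for the first base.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in state s.
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][base])
            )
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


toy_sequence = "ATATAT" + "GCGCGCGCGC" + "TATATA"
toy_path = viterbi(toy_sequence)
print(toy_path)
```

With these made-up parameters, the GC-rich stretch in the middle of `toy_sequence` is long enough for the gain in emission probability to outweigh the cost of switching states twice, so it is labelled `exon` while the AT-rich flanks stay `intergenic`.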
Despite good progress in identifying both protein-coding and non-protein-coding genes, much work remains to be done before even the best-studied genomes are fully annotated. For the higher eukaryotes, only a tiny percentage of features such as TFBSs, CRMs, and other non-gene features have so far been identified.
Annotation
Characterizing genomic features using computational and experimental methods
Genes: Four levels of annotation
Gene Prediction: Where are genes? What do they look like?
Domains: What do the proteins do?
Role
Consortium: 35,000 genes?
Celera: 30,000 genes?
Affymetrix: 60,000 human genes on GeneChips?
Incyte and HGS: over 120,000 genes?
GenBank: 49,000 unique gene coding sequences?
UniGene: > 89,000 clusters of unique
protein-coding genes
known genes, another 2,188 DNA segments predicted to be protein coding genes.
to here,
RNA molecules
rRNA (ribosomal RNA)
snRNA (small nuclear RNA)
snoRNA (small nucleolar RNA)
tRNA (transfer RNA)
Consider
splicing expression
Haemophilus
Operons
One
No
One
Open
One
ORFs begin with start, end with stop codon (def.)
TIGR: http://www.tigr.org/tigr-scripts/CMR2/CMRGenomes.spl
NCBI: http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html
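The ORF definition above (begins with a start codon, ends with a stop codon) can be sketched as a simple scan. This is a minimal illustration, not a production gene finder: it assumes ATG as the only start codon, searches only the forward strand, and the function name `find_orfs` and the `min_codons` parameter are hypothetical choices for this example.

```python
START_CODON = "ATG"                    # assumption: ATG is the only start codon
STOP_CODONS = {"TAA", "TAG", "TGA"}    # standard stop codons


def find_orfs(dna, min_codons=2):
    """Find ORFs on the forward strand in all three reading frames.

    Returns a list of (start, end, frame) tuples, where dna[start:end]
    runs from the start codon through the stop codon (end is exclusive).
    min_codons counts the start and stop codons.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it and keep scanning
    return orfs


example = "CCATGAAATTTTAGGG"   # made-up sequence for illustration
example_orfs = find_orfs(example)
print(example_orfs)
print([example[s:e] for s, e, _ in example_orfs])
```

A fuller version would also scan the reverse complement and, for bacteria, allow alternative start codons; those are left out to keep the sketch short.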
Posttranscriptional modification
  5'-cap, poly-A tail, splicing
Open reading frames
  Mature mRNA contains ORF
  All internal exons contain open readthrough
  Pre-start and post-stop sequences are UTRs
Prokaryotes: Eukaryotes:
Exons
Remain after introns have been removed
Flanking parts contain non-coding sequence (5'- and 3'-UTRs)