Gen Ant

Genome Annotation
D i n e s
5/4/12
www.genomesonline.org

protein-coding genes, non-protein-coding genes

Genes are easier to find than other functional elements. Why?
Genes are transcribed, which means that we can identify them by looking at RNA.
Traditionally this has been done by cDNA or EST sequencing, and more recently by microarray, SAGE, MPSS, etc.

protein-coding genes, non-protein-coding genes

We can also find genes ab initio using computational methods.
This is most suited to protein-coding genes. Why?
Protein-coding genes have recognizable features:
open reading frames (ORFs)
codon bias
known transcription and translational start and stop motifs (promoters, 3' polyA sites)
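The ORF feature listed above can be illustrated with a minimal sketch. It assumes the standard start codon ATG and the stop codons TAA, TAG and TGA, and it scans only the three forward reading frames. The function name `find_orfs` and the `min_codons` threshold are hypothetical choices for illustration. Real gene finders also scan the reverse strand and combine ORFs with codon bias and signal motifs.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for each ORF on the forward strand.

    start is the 0-based index of the ATG, end is the index just past
    the stop codon, and min_codons counts the start and stop codons too.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and look for the next ATG
    return orfs

print(find_orfs("CCATGAAATTTTAGGG"))
```

An ORF that is still open when the sequence ends is not reported, which is a deliberate simplification.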

ab initio gene discovery

Protein-coding genes have recognizable features.
We can design software to scan the genome and identify these features.
Some of these programs work quite well, especially in bacteria and simpler eukaryotes with smaller and more compact genomes.
It's a lot harder for the higher eukaryotes, where there are a lot of long introns, genes can be found within introns of other genes, etc.

We tend to do OK finding protein-coding regions, but miss a lot of non-coding 5' exons and the like.
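Codon bias is one feature that such scanning software can score. The sketch below is an assumption-laden illustration, not the method of any particular program: it learns codon frequencies from a set of known coding sequences and scores a window by its log-odds against a uniform 1/64 background. The function names, the pseudocount and the toy training sequences are all hypothetical.

```python
import math
from collections import Counter

ALL_CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]

def codon_counts(seq):
    """Count codons reading the sequence in frame 0, ignoring a partial last codon."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))

def codon_frequencies(training_cds, pseudocount=1.0):
    """Estimate codon frequencies from known coding sequences (with smoothing)."""
    counts = Counter()
    for cds in training_cds:
        counts.update(codon_counts(cds))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}

def coding_score(window, freqs):
    """Log-odds of the window under the learned codon usage vs. a uniform background."""
    score = 0.0
    for codon, n in codon_counts(window).items():
        if codon in freqs:  # skip codons containing ambiguous bases
            score += n * math.log(freqs[codon] * len(ALL_CODONS))
    return score

# Toy training set, invented for illustration only.
freqs = codon_frequencies(["GCTGCTGCTGAA"] * 5)
print(coding_score("GCTGCTGCT", freqs) > coding_score("TTTTTTTTT", freqs))
```

A window that reuses the codons favored in the training set scores above zero, while a window of codons absent from the training set scores below zero.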

ab initio gene discovery: validating predictions and refining gene models

Standard types of evidence for validation of predictions include:

match to previously annotated cDNA
match to EST from same organism
similarity of nucleotide or conceptually translated protein sequence to sequences in GenBank (translation works better; why?)
protein structure prediction match to a PFAM domain
associated with recognized promoter sequences, e.g. TATA box, CpG island
known phenotype from mutation of the locus
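One commonly given reason why comparing conceptually translated sequences works better is that the genetic code is degenerate: synonymous changes alter the nucleotide sequence without changing the protein. The sketch below assumes the standard genetic code and uses two made-up, equal-length, in-frame sequences. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Conceptually translate an in-frame DNA sequence ('*' marks a stop)."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Two invented sequences that differ only by synonymous codon changes.
seq_a = "ATGGCTCTGTCTAAA"
seq_b = "ATGGCACTTAGCAAG"

print(identity(seq_a, seq_b))                        # nucleotide-level identity
print(identity(translate(seq_a), translate(seq_b)))  # protein-level identity
```

In this toy case the nucleotide sequences have diverged substantially while the translated proteins are identical, so the protein-level comparison retains the homology signal.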

Finding non-protein-coding genes

e.g., tRNA, rRNA, snoRNA, miRNA, various other ncRNAs

Harder to find than protein-coding genes. Why?
often not poly-A tailed, so they don't end up in cDNA libraries
no ORF
constraint on sequence divergence is at the nucleotide level, not the protein level, so homology is harder to detect

Finding non-protein-coding genes

So, how do we find these?
secondary structure
homology, especially alignment of related species
experimentally:
  isolation through non-polyA dependent cloning methods
  microarrays
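To give a feel for the secondary-structure approach, here is a minimal sketch of the Nussinov base-pair maximization algorithm. It is a classic textbook method chosen here for illustration, since the slides do not say which method is meant. The sketch assumes Watson-Crick and G-U wobble pairs and a minimum hairpin loop of 3 unpaired bases. The function name is hypothetical. Practical ncRNA finders use energy models and covariation across species rather than a simple pair count.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs, with at least min_loop bases in each hairpin loop."""
    rna = rna.upper().replace("T", "U")
    n = len(rna)
    if n == 0:
        return 0
    # dp[i][j] holds the best pair count for the subsequence rna[i..j].
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j is left unpaired
            # case 2: base j pairs with some k, leaving >= min_loop bases between them
            for k in range(i, j - min_loop):
                if (rna[k], rna[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]

print(max_base_pairs("GGGAAAUCCC"))
```

A sequence that can fold into a stem-loop, as many tRNA and miRNA precursors do, yields a high pair count relative to its length.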

ab initio gene discovery: approaches

Most gene-discovery programs make use of some form of machine learning algorithm. A machine learning algorithm requires a training set of input data that the computer uses to learn how to find a pattern. Two common machine learning approaches used in gene discovery (and many other bioinformatics applications) are artificial neural networks (ANNs) and hidden Markov models (HMMs).

ab initio gene discovery: HMMs

An example state diagram for an HMM for gene discovery is this simplified version of one used by GENSCAN:
[State diagram. States: initial, begin gene region, 5' UTR, start translation, exon, donor splice site, intron, acceptor splice site, single exon, stop translation, 3' UTR, end gene region, final. Emitted symbols: A, T, G, C.]
Each box and arrow has associated transition probabilities, and emission probabilities for emission of nucleotides (dotted arrow). These are learned from examples of known gene models and provide the probability that a stretch of sequence is a gene.
adapted from Gibson and Muse, A Primer of Genome Science
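How transition and emission probabilities are used can be sketched with a much smaller model than the one in the diagram. The toy HMM below has only two states, and every probability in it is invented for illustration. In a real gene finder these values are learned from known gene models, and the state set is far richer. The Viterbi algorithm shown here finds the single most probable state path for a sequence.

```python
import math

STATES = ("noncoding", "coding")
# All probabilities below are made up for this sketch.
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "coding": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}

def viterbi(seq):
    """Return the most probable state path for seq, computed in log space."""
    seq = seq.upper()
    if not seq:
        return []
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best previous state from which to enter state s at this position.
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[best_prev] + math.log(TRANS[best_prev][s])
                             + math.log(EMIT[s][base]))
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

print(viterbi("AAAAAAGCGCGCGCAAAAAA"))
```

With these invented parameters the GC-rich stretch in the middle is labeled "coding" and the AT-rich flanks "noncoding", because the gain in emission probability outweighs the cost of the two state transitions.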

Genome Annotation: much work remains

Despite good progress in identifying both protein-coding and non-protein-coding genes, much work remains to be done before even the best-studied genomes are fully annotated. For the higher eukaryotes, only a tiny percentage of features such as transcription factor binding sites (TFBSs), cis-regulatory modules (CRMs), and other non-gene features have so far been identified.

The value of genome sequences lies in their annotation

Annotation: characterizing genomic features using computational and experimental methods
Genes: four levels of annotation
Gene prediction: Where are genes? What do they look like?
Domains: What do the proteins do?
Role: What pathway(s) are they involved in?

How many genes?

Consortium: 35,000 genes?
Celera: 30,000 genes?
Affymetrix: 60,000 human genes on GeneChips?
Incyte and HGS: over 120,000 genes?
GenBank: 49,000 unique gene coding sequences?
UniGene: > 89,000 clusters of unique

Current consensus (in flux)

20,000-25,000 protein-coding genes
19,599 known genes, another 2,188 DNA segments predicted to be protein-coding genes.

How do we get from here

to here?

What are genes? - 1

Complete DNA segments responsible for making functional products
Products:
  Proteins
  Functional RNA molecules:
    RNAi (interfering RNA)
    rRNA (ribosomal RNA)
    snRNA (small nuclear)
    snoRNA (small nucleolar)
    tRNA (transfer RNA)

What are genes? - 2

Definition vs. dynamic concept
Consider:
  Prokaryotic vs. eukaryotic gene models
  Introns/exons
  Posttranscriptional modifications
  Alternative splicing
  Differential expression
  Genes-in-genes
  Genes-ad-genes

Prokaryotic gene model: ORF-genes

Small genomes, high gene density
  Haemophilus influenzae genome 85% genic
Operons
  One transcript, many genes
No introns.
  One gene, one protein
Open reading frames
  One ORF per gene
  ORFs begin with start, end with stop codon (def.)

TIGR: http://www.tigr.org/tigr-scripts/CMR2/CMRGenomes.spl
NCBI: http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html

Eukaryotic gene model: spliced genes

Posttranscriptional modification
  5'-CAP, polyA tail, splicing
Open reading frames
  Mature mRNA contains ORF
  All internal exons contain open read-through
  Pre-start and post-stop sequences are UTRs
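The spliced gene model can be sketched in a few lines. The example assumes exon coordinates are already known and uses a hand-made toy pre-mRNA. The sequences, coordinates and function names are all hypothetical. It joins the exons into a mature mRNA, then splits that mRNA into 5' UTR, ORF and 3' UTR using the first ATG and the first in-frame stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def splice(pre_mrna, exons):
    """Join exon intervals (0-based, end-exclusive) in order, dropping the introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)

def split_utrs_and_orf(mrna):
    """Return (5' UTR, ORF, 3' UTR), or None if no complete ORF is found."""
    start = mrna.find("ATG")
    if start == -1:
        return None
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return mrna[:start], mrna[start:i + 3], mrna[i + 3:]
    return None

# Toy gene: exon 1, one intron (GT...AG), exon 2. Invented for illustration.
exon1 = "GGCATGGCT"
intron = "GTAAGTTTTCAG"
exon2 = "AAATAACCC"
pre_mrna = exon1 + intron + exon2
exons = [(0, 9), (21, 30)]

mature = splice(pre_mrna, exons)
print(mature)
print(split_utrs_and_orf(mature))
```

Taking the first ATG as the translation start is a simplification. The sketch also ignores the cap and polyA tail, which are added rather than encoded in the genomic sequence.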

Expansions and Clarifications

ORFs
  Start, triplets, stop
  Prokaryotes: gene = ORF
  Eukaryotes: spliced genes or ORF genes
Exons
  Remain after introns have been removed
  Flanking parts contain non-coding sequence (5'- and 3'-UTRs)
