
Genome Annotation

D i n e s

5/4/12

www.genomesonline.org

5/4/12

protein-coding genes, non-protein-coding genes


Genes are easier to find than other functional elements. Why?
Genes are transcribed, which means that we can identify them by looking at RNA.
Traditionally this has been done by cDNA or EST sequencing; more recently by microarray, SAGE, MPSS, etc.


protein-coding genes, non-protein-coding genes


We can also find genes ab initio using computational methods.
This is most suited to protein-coding genes. Why?
Protein-coding genes have recognizable features:
open reading frames (ORFs)
codon bias
known transcription and translational start and stop motifs (promoters, 3′ polyA sites)
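
Codon bias, one of the features listed above, can be measured by counting how often each codon appears in a known coding sequence. The following is a minimal sketch, assuming an invented toy sequence; the function name `codon_usage` is hypothetical and not taken from any gene-finding program.

```python
from collections import Counter

def codon_usage(orf):
    """Return the relative frequency of each codon in an in-frame coding sequence."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3          # ignore a trailing partial codon
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

# Invented toy ORF: leucine appears four times, three of them as CTG.
toy_orf = "ATGCTGCTGCTGTTATAA"
usage = codon_usage(toy_orf)
print(usage["CTG"], usage["TTA"])
```

A gene finder can compare frequencies like these, learned from known genes, against a candidate region: a region that favors the organism's preferred codons is more likely to be coding.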


ab initio gene discovery


Protein-coding genes have recognizable features

We can design software to scan the genome and identify these features.
Some of these programs work quite well, especially in bacteria and simpler eukaryotes with smaller and more compact genomes.

It's a lot harder for the higher eukaryotes, where there are a lot of long introns, genes can be found within introns of other genes, etc.

We tend to do OK finding protein-coding regions, but miss a lot of non-coding 5′ exons and the like.

ab initio gene discovery: validating predictions and refining gene models


Standard types of evidence for validation of predictions include:

match to previously annotated cDNA
match to EST from same organism
similarity of nucleotide or conceptually translated protein sequence to sequences in GenBank (translation works better. Why?)
protein structure prediction match to a Pfam domain
associated with recognized promoter sequences, e.g., TATA box, CpG island
known phenotype from mutation of the locus
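
One reason comparing translated sequence can work better: synonymous codons let two genes diverge at the nucleotide level while still encoding the same protein. A minimal sketch, assuming two invented toy sequences and a deliberately partial codon table that covers only the codons used here; `translate` and `identity` are hypothetical helper names.

```python
# Partial codon table: only the codons that occur in the toy sequences below.
CODON_TABLE = {
    "ATG": "M", "CTG": "L", "TTA": "L", "AGC": "S", "TCT": "S",
    "CGT": "R", "AGA": "R", "TAA": "*",
}

def translate(seq):
    """Conceptually translate an in-frame DNA sequence, one codon at a time."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

def identity(a, b):
    """Fraction of aligned positions that match (sequences of equal length)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Invented sequences that differ only by synonymous codon choices.
gene_a = "ATGCTGAGCCGTTAA"
gene_b = "ATGTTATCTAGATAA"

nucleotide_identity = identity(gene_a, gene_b)
protein_identity = identity(translate(gene_a), translate(gene_b))
print(nucleotide_identity, protein_identity)
```

Here the nucleotide sequences match at only about half their positions, while the translated proteins are identical, so a protein-level search would still detect the relationship.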

Finding non-protein-coding genes


e.g., tRNA, rRNA, snoRNA, miRNA, various other ncRNAs

Harder to find than protein-coding genes. Why?

often not poly-A tailed, so they don't end up in cDNA libraries
no ORF
constraint on sequence divergence at nucleotide not protein level, so homology is harder to detect

Finding non-protein-coding genes

So, how do we find these?

secondary structure
homology, especially alignment of related species
experimentally:
isolation through non-polyA-dependent cloning methods
microarrays
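
To illustrate the secondary-structure idea, a candidate RNA can be scored by how many nested base pairs it is able to form. The sketch below uses the classic Nussinov base-pair maximization recurrence, which is a simplification chosen for illustration and not the method of any particular ncRNA finder; the minimum loop size of 3 and the toy hairpin sequence are assumptions.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs (Nussinov-style dynamic programming)."""
    n = len(rna)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                    # case: position j is unpaired
            for k in range(i, j - min_loop):          # case: position j pairs with k
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0

# Invented hairpin: a GGG/CCC stem closing a short loop.
pairs = max_base_pairs("GGGAAAUCCC")
print(pairs)
```

Real structure-based searches use energy models and covariation across species rather than a raw pair count, but the principle is the same: sequences that fold into stable, conserved structures are candidates for functional RNAs.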


ab initio gene discovery: approaches


Most gene-discovery programs make use of some form of machine learning algorithm. A machine learning algorithm requires a training set of input data that the computer uses to learn how to find a pattern. Two common machine learning approaches used in gene discovery (and many other bioinformatics applications) are artificial neural networks (ANNs) and hidden Markov models (HMMs).


ab initio gene discovery: HMMs


An example state diagram for an HMM for gene discovery is this simplified version of one used by Genscan:

[State diagram. States: initial, 5′ UTR, exon, intron, 3′ UTR, final, plus a single-exon path. Transitions: begin gene region, start translation, donor splice site, acceptor splice site, stop translation, end gene region. States emit nucleotides A, T, G, C.]

Each box and arrow has associated transition probabilities, and emission probabilities for emission of nucleotides (dotted arrow). These are learned from examples of known gene models and provide the probability that a stretch of sequence is a gene.

Adapted from Gibson and Muse, A Primer of Genome Science
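
To make the roles of transition and emission probabilities concrete, here is a minimal sketch of Viterbi decoding on a two-state toy HMM. It is far simpler than the gene model in the diagram: the two states, every probability value, and the test sequence are invented for illustration and are not learned from real gene models.

```python
import math

STATES = ("noncoding", "coding")
# All probabilities below are invented for illustration only.
START = {"noncoding": 0.9, "coding": 0.1}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "T": 0.35, "G": 0.15, "C": 0.15},
    "coding": {"A": 0.15, "T": 0.15, "G": 0.35, "C": 0.35},
}

def viterbi(seq):
    """Return the most probable state path for a non-empty seq under the toy HMM."""
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best previous state from which to enter state s.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][base]))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

path = viterbi("ATATATGCGCGCGCATATAT")
print(path)
```

With these made-up parameters the GC-rich middle of the sequence is labeled "coding" and the AT-rich flanks "noncoding". A real gene-finding HMM works the same way in principle but has many more states (UTRs, exons, introns, splice sites) and parameters trained on known genes.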

Genome Annotation: much work remains

Despite good progress in identifying both protein-coding and non-protein-coding genes, much work remains to be done before even the best-studied genomes are fully annotated. For the higher eukaryotes, only a tiny percentage of features such as transcription factor binding sites (TFBSs), cis-regulatory modules (CRMs), and other non-gene features have so far been identified.


The value of genome sequences lies in their annotation

Annotation: characterizing genomic features using computational and experimental methods

Genes: four levels of annotation

Gene prediction: Where are genes?
What do they look like?
Domains: What do the proteins do?
Role: What pathway(s) are they involved in?

How many genes?


Consortium: 35,000 genes?
Celera: 30,000 genes?
Affymetrix: 60,000 human genes on GeneChips?
Incyte and HGS: over 120,000 genes?
GenBank: 49,000 unique gene coding sequences?
UniGene: > 89,000 clusters of unique

Current consensus (in flux)

20,000-25,000 protein-coding genes

19,599 known genes, another 2,188 DNA segments predicted to be protein-coding genes.

How do we get from here

to here,

What are genes? - 1


Complete DNA segments responsible for making functional products

Products
Proteins
Functional RNA molecules:
RNAi (interfering RNA)
rRNA (ribosomal RNA)
snRNA (small nuclear)
snoRNA (small nucleolar)
tRNA (transfer RNA)

What are genes? - 2

Definition vs. dynamic concept

Consider
Prokaryotic vs. eukaryotic gene models
Introns/exons
Posttranscriptional modifications
Alternative splicing
Differential expression
Genes-in-genes
Genes-ad-genes

Prokaryotic gene model: ORF-genes


Small genomes, high gene density
Haemophilus influenzae genome 85% genic

Operons
One transcript, many genes

No introns.
One gene, one protein

Open reading frames
One ORF per gene
ORFs begin with start, end with stop codon (def.)

TIGR: http://www.tigr.org/tigr-scripts/CMR2/CMRGenomes.spl
NCBI: http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html
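
Because a prokaryotic gene is a single ORF that begins with a start codon and ends with a stop codon, a first-pass gene finder can simply scan each reading frame. A minimal sketch, assuming ATG as the only start codon and scanning the forward strand only; the function name `find_orfs`, the `min_codons` parameter, and the toy sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start is 0-based inclusive; end is exclusive and includes the stop codon.
    min_codons counts the start and stop codons as part of the ORF length.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open an ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close the ORF and keep scanning
    return orfs

# Invented toy sequence with one ORF (ATG AAA CCC TAG) in frame 2.
toy_genome = "GGATGAAACCCTAGGG"
orfs = find_orfs(toy_genome)
print(orfs)
```

A practical tool would also scan the reverse complement, allow alternative start codons, and apply a much larger minimum length, since short ORFs occur frequently by chance.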

Eukaryotic gene model: spliced genes

Posttranscriptional modification
5′-CAP, polyA tail, splicing

Open reading frames
Mature mRNA contains ORF
All internal exons contain open read-through
Pre-start and post-stop sequences are UTRs
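
The spliced-gene model can be sketched in a few lines: remove the intron, then locate the ORF and the flanking UTRs in the mature transcript. This is an illustration under stated assumptions: the exon and intron sequences are invented, DNA letters are used in place of RNA for simplicity, and `splice` and `coding_region` are hypothetical helper names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def splice(pre_mrna, exons):
    """Join exon intervals (0-based, end-exclusive) into the mature transcript."""
    return "".join(pre_mrna[start:end] for start, end in exons)

def coding_region(mrna):
    """Return (5' UTR, ORF, 3' UTR) using the first ATG and first in-frame stop."""
    start = mrna.find("ATG")
    if start == -1:
        return None
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return mrna[:start], mrna[start:i + 3], mrna[i + 3:]
    return None

# Invented two-exon gene; the intron begins with GT and ends with AG.
exon1, intron, exon2 = "GGATGAA", "GTAAGTTTTCAG", "ACCCTAGCC"
pre_mrna = exon1 + intron + exon2
exons = [(0, len(exon1)), (len(exon1) + len(intron), len(pre_mrna))]

mature = splice(pre_mrna, exons)
utr5, orf, utr3 = coding_region(mature)
print(mature, utr5, orf, utr3)
```

Running `coding_region` on the unspliced `pre_mrna` instead hits a stop codon inside the intron, which is why the ORF has to be read from the mature mRNA rather than directly from the genomic sequence.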

Expansions and Clarifications


ORFs
Start, triplets, stop
Prokaryotes: gene = ORF
Eukaryotes: spliced genes or ORF genes

Exons
Remain after introns have been removed
Flanking parts contain non-coding sequence (5′- and 3′-UTRs)
