Annotation
Richard Leyba
Abstract
Methods for sequencing and annotating genomes are constantly
improving, increasing the rate of raw DNA sequencing output while providing
an abundance of software tools that allow smaller groups of
researchers to annotate a wider variety of genomes. While genome
annotation is becoming more widespread, the rapidly growing field of
computational sequencing and annotation faces new research and
collaborative challenges.
Introduction
Second-generation sequencing techniques, the newest methods of
obtaining DNA sequence data from sample material, have largely replaced the
methods used to sequence and annotate the earliest genomes, such as
the human genome and that of Drosophila melanogaster [1,2]. In comparison to
previous Whole Genome Shotgun (WGS) methods of data collection, which
return fairly long and unbroken sequences of DNA to be aligned, current
sequencing technologies generally return a much larger number of samples
of an organism's DNA with a shorter sequence length per sample [3]. Long
sequences, in tandem with a lack of sequencing programs such as those
described in this paper, led to larger research groups compared to modern
Genome Assembly
Current computational methods rely heavily on bioinformatics software
and ontologies in order to assemble and annotate new genomes, as will be
outlined in this paper. Modern methods of gene annotation first require
assembly of the genome before the annotation of gene locations can begin.
Assembly involves taking the collected fragments of DNA sequence data and
running them through programs that rebuild the DNA into one linear
sequence, matching patterns of nucleotides with each other through
high-powered computing [4]. More specifically, sequences are overlapped
wherever their nucleotides match and are merged into contigs, each of which
is a collection of overlapping sequences that together form a longer and
more complete nucleotide sequence. Contigs are then ordered and joined with
other contigs to produce scaffolds which, when placed with other scaffolds,
display a complete sequence of DNA. While this seems straightforward, albeit
requiring large
amounts of computer memory, other considerations must be understood
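The overlap-and-merge step described above can be illustrated with a minimal
Python sketch. It assumes error-free reads from a single strand and uses a
simple greedy strategy (always merge the pair with the longest exact
suffix-prefix overlap). The function names and the min_overlap parameter are
hypothetical and chosen for illustration only; production assemblers use far
more elaborate graph-based methods and tolerate sequencing errors.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the two sequences with the longest exact overlap.

    Returns the list of contigs left when no overlaps remain.
    """
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(a, b, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        # Join the pair, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


reads = ["ATGGCGT", "GCGTACA", "ACAGGT"]
contigs = greedy_assemble(reads)
print(contigs)
```

Comparing every pair of sequences on every pass is quadratic in the number of
reads, which hints at why assembly of real data sets demands so much
computing power and memory.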
After a quality assembly has been created, annotation can begin. Gene
annotation in the context of computing requires programs capable of ab
initio gene model prediction, which finds genes by comparing patterns in
the DNA sequence against previously accumulated data [10]. As this can be an
incredibly complicated process, annotation relies on groups of programs
called pipelines, such as Ensembl [11], to perform the work in a series of
steps. Previously verified gene sequences (referred to as training sets or
training models) are used as a basis of comparison for finding genes within
a new genome. An example of a program that can perform such tasks is JIGSAW,
which uses a complex set of algorithms and training model statistics to
annotate a genome [15].
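To make the role of a training set more concrete, the following Python sketch
scores a candidate sequence by how closely its codon usage resembles that of
previously verified genes. This is a deliberately simplified stand-in and not
the method used by JIGSAW or Ensembl: the function names, the pseudocount, and
the uniform background model are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import product

ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codons(seq):
    """Split a sequence into in-frame triplets, ignoring a trailing remainder."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def train_codon_model(training_genes, pseudocount=1.0):
    """Estimate codon frequencies from verified genes (the training set).

    The pseudocount keeps codons unseen in training from getting probability 0.
    Assumes the training sequences contain only A, C, G and T.
    """
    counts = Counter()
    for gene in training_genes:
        counts.update(codons(gene))
    total = sum(counts.values()) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def coding_score(candidate, model):
    """Average log-odds per codon against a uniform background (1/64).

    Positive scores mean the candidate resembles the training genes.
    """
    triplets = codons(candidate)
    if not triplets:
        return 0.0
    background = 1.0 / len(ALL_CODONS)
    # Codons with ambiguous bases fall back to the background (log-odds 0).
    return sum(math.log(model.get(c, background) / background)
               for c in triplets) / len(triplets)


training_set = ["ATGGCTGCTGAAGCTTAA", "ATGGCTGAAGAAGCTTGA"]
model = train_codon_model(training_set)
gene_like = coding_score("ATGGCTGAAGCT", model)
unlike = coding_score("CCCCGGTTTCGA", model)
print(gene_like > unlike)
```

Real gene predictors combine many such signals (start and stop codons, splice
sites, sequence composition), but the principle is the same: statistics
learned from known genes are used to judge unannotated sequence.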
Evidence from research outside of gene model prediction can also
be added, such as RNA-seq data [12]. RNA-seq data are sequences read from an
organism's RNA, which is the copy of DNA that is used to make proteins [12].
Thus, aligning the RNA data against the genome can directly provide evidence
of where protein-coding genes lie along a genome, which would have been
unfeasible before high-powered computing [12]. This is another valuable asset
to modern research and gene annotation, in that it provides a direct
location map of genes to be discovered along the DNA.
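The idea of laying RNA-derived reads over the genome can be sketched in a few
lines of Python. The example assumes reads are given in the DNA alphabet and
match the genome exactly, and it ignores splicing. Real aligners handle
mismatches and reads that span introns, so this is only an illustration. The
function names and the min_depth parameter are hypothetical.

```python
def coverage_from_reads(genome, reads):
    """Count how many reads cover each genome position (exact matches only)."""
    coverage = [0] * len(genome)
    for read in reads:
        if not read:
            continue  # skip empty reads
        start = genome.find(read)
        while start != -1:
            for pos in range(start, start + len(read)):
                coverage[pos] += 1
            start = genome.find(read, start + 1)  # look for further matches
    return coverage


def covered_intervals(coverage, min_depth=1):
    """Return half-open (start, end) intervals with depth >= min_depth."""
    intervals = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth >= min_depth and start is None:
            start = pos
        elif depth < min_depth and start is not None:
            intervals.append((start, pos))
            start = None
    if start is not None:
        intervals.append((start, len(coverage)))
    return intervals


genome = "AAAACCCGGGTTTTACGTACGTAAAA"
reads = ["CCCGGG", "GGGTTT", "ACGTACGT"]
evidence = covered_intervals(coverage_from_reads(genome, reads))
print(evidence)
```

Each reported interval marks a stretch of the genome supported by at least
one read, which is the kind of location evidence an annotation pipeline can
weigh alongside its gene model predictions.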
Verification of Previous Annotation
Modern sequencing and annotation are also useful as a means of revising
previous genomic data. MAKER, an annotation pipeline, is capable of doing
this using new annotation data and RNA-seq data [12]. By reviewing previously
published data, such as the maize genome, MAKER and other programs like
it are able to revise incorrect genomic data, and thereby strengthen the
genomic database as a whole [13].
Problems and Potential Solutions
While modern annotation shows significant progress and expansion,
the field faces challenges regarding organization and data
interoperability. While the alignment of DNA seems straightforward, the
many species on Earth, including in certain cases closely related species,
display wide variation in genomic content and in how their genomes must be
sequenced and interpreted [14]. As such, a researcher must have a thorough
understanding of genomics, as well as of the pipelines used to annotate
information, in order to create an accurate model. A second problem within
the field of modern gene annotation is the proliferation of data produced by
many laboratories across the world. In response to this, the Sequence
Ontology Project was instituted as a means of creating large-scale
benchmarks of gene annotation quality, along with other standardization
practices, such as the consistent use of file formats like GenBank [14].
Such efforts promote interoperability in research data, ensuring that data
remain usable for further research.
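As a small illustration of why a shared vocabulary helps, the Python sketch
below maps the differing labels that laboratories might attach to the same
kind of feature onto a common set of terms. The target names (gene, mRNA,
exon, CDS) are terms used by the Sequence Ontology, but the synonym table,
the record layout, and the function name are invented for this example and
do not reflect any official tool or file format.

```python
# Hypothetical lab-specific labels mapped to shared feature terms.
SYNONYMS = {
    "gene": "gene",
    "locus": "gene",
    "mrna": "mRNA",
    "messenger_rna": "mRNA",
    "exon": "exon",
    "cds": "CDS",
    "coding_sequence": "CDS",
}


def normalize_features(records):
    """Rewrite each record's 'type' to a shared term.

    Returns (normalized, rejected); records with unknown labels are rejected
    rather than guessed at, so that they can be reviewed by a person.
    """
    normalized, rejected = [], []
    for record in records:
        term = SYNONYMS.get(record["type"].strip().lower())
        if term is None:
            rejected.append(record)
        else:
            normalized.append({**record, "type": term})
    return normalized, rejected


records = [
    {"id": "g1", "type": "Locus"},
    {"id": "t1", "type": "mRNA"},
    {"id": "x1", "type": "mystery_feature"},
]
normalized, rejected = normalize_features(records)
print([r["type"] for r in normalized], [r["id"] for r in rejected])
```

Once every data set uses the same terms, annotations from different groups
can be compared and merged without manual translation of labels.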
Conclusion
While genomics has taken large leaps with the introduction of
larger-scale computing, efficient programming, and new avenues for comparing
genetic information, it has also revealed challenges that are unique. In
response, researchers and programmers have adapted their techniques,
Bibliography
1.
2.
3.
4.
5. Ye, L. et al. A vertebrate case study of the quality of assemblies derived
from next-generation sequences. Genome Biol. 12, R31 (2011).
6.
7.
8.
9. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic
local alignment search tool. J. Mol. Biol. 215, 403-410 (1990).
10.
11. http://www.ensembl.org/info/docs/genebuild/genome_annotation.html
12.
13. Law, M. et al. Automated update, revision, and quality control of the
maize genome annotations using MAKER-P improves the B73 RefGen_v3 gene
models and identifies new genes. Plant Physiol. 167, 25-39 (2015).
14. Eilbeck, K. et al. The Sequence Ontology: a tool for the unification of
genome annotations. Genome Biol. 6, R44 (2005).
15.