The Current State of Genome Assembly and Annotation
Richard Leyba
Abstract
Methods of genome sequencing and annotation are constantly
improving, increasing the rate of raw DNA sequencing output while providing
an abundance of software tools that allow smaller groups of
researchers to annotate a larger variety of genome types. While genome
annotation is becoming more widespread, the rapidly growing field of
sequencing and annotation via programming and bioinformatics faces new
and unique research and collaborative challenges.
Introduction
Second Generation Sequencing techniques, the current methods of
obtaining DNA sequence from sample material, have largely replaced the
methods used to produce the earliest sequenced and annotated genomes, such
as the human genome and that of Drosophila melanogaster1,2. Earlier Whole
Genome Shotgun (WGS) methods of data collection returned fairly long and
unbroken sequences of DNA to be aligned, whereas current sequencing
technologies generally return a much larger number of reads of an
organism's DNA, each with a shorter sequence length3. Long sequences,
together with a lack of assembly and annotation programs such as those
described in this paper, meant that earlier projects needed larger research
groups than modern genome research labs and concentrated on specific gene
sets for annotation3. In contrast, current methods of DNA sequencing and
annotation as outlined in this paper are now widespread, with many
laboratories across the world researching a wide variety of genomes. This
review will explain those methods of sequencing and annotation, the
problems that occur in assembly and annotation, and the proposed solutions
to them.

Genome Assembly
Current computational methods rely heavily on bioinformatics software
and ontologies in order to characterize new genomes, as will be outlined in
this paper. Modern methods of gene annotation first require assembly of the
genome before the annotation of gene locations can begin.
Assembly involves taking the collected reads of DNA sequence and
running them through programs that rebuild the DNA into one linear
sequence by matching patterns of nucleotides with each other through high
powered computing4. In more detail, reads are overlapped wherever their
nucleotides are similar and are merged into contigs, which are collections
of reads that, when overlapped, make a longer and more complete nucleotide
sequence. Contigs are then ordered and linked with other contigs to produce
scaffolds which, when placed with other scaffolds, display a complete
sequence of DNA. While this seems straightforward, albeit requiring large
amounts of computer memory, other considerations must be understood
before assembly. Although a single sequence is produced, regions of that
sequence may remain unknown because little or no sequence data exists for
them, and other regions may have low confidence because too few reads
cover that portion of the DNA, a quantity known as its coverage5. To help
address this, assemblies are commonly evaluated with a summary statistic
called N50 to help judge whether an assembly is complete enough to begin
annotation5. N50 is calculated from the lengths of the contigs or scaffolds
in an assembly: it is the length such that sequences of that length or
longer contain at least half of the assembled bases. It gives an easy to
follow result that illustrates how contiguous the contig or scaffold
sequences are, which can be used as one indicator of the quality of the
assembly created5.
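
As a rough illustration of the overlap step and the N50 statistic described
above, the following Python sketch merges toy reads by their longest
suffix-prefix overlap and then summarizes the resulting contig lengths. The
function names (overlap_length, greedy_assemble, n50), the min_overlap
parameter, and the example reads are assumptions made for illustration only;
they are not taken from the assemblers cited in this paper, which use far
more efficient methods and tolerate sequencing errors.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences sharing the largest exact overlap.

    Toy assumption: reads are error-free and no read lies wholly inside another.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, -1, -1
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


def n50(lengths: list[int]) -> int:
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


reads = ["ATGGCGT", "GCGTACG", "TACGTTA", "CCCCAAAA"]
contigs = greedy_assemble(reads)
contig_n50 = n50([len(c) for c in contigs])
```

With these toy reads the first three merge into one 13-base contig and the
fourth remains separate, so the N50 is 13. Note that N50 describes contiguity
only; a long contig can still contain misassembled regions.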
During assembly, repeat sequences, patterns in the DNA that recur
due to gene duplication, are recorded through de novo repeat
finder programs, which find previously unidentified patterns based on
programmed parameters, or homology based programs, which use previous
research to find similar patterns6,7. This catalogue of repeat sequences is
interesting because of their formation during gene duplication, an
evolutionary event in which a gene is copied to another portion of the
DNA8. Due to a lack of evolutionary selective pressure, one copy of the gene
degrades over time, leaving behind a similar, but highly mutated, copy of
the gene8. This repeat identification, a useful tool for comparative
evolutionary biology, can be further used to research the evolution of
similar species via sequence comparison programs such as BLAST9.
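
The idea behind de novo repeat finding can be sketched in a few lines of
Python: count every substring of a fixed length (a k-mer) and report those
that occur more than once. This is a deliberately simplified assumption about
how such tools work, not the algorithm of the programs cited above, and the
function name repeated_kmers and its parameters k and min_copies are
hypothetical.

```python
from collections import defaultdict


def repeated_kmers(sequence: str, k: int, min_copies: int = 2) -> dict[str, list[int]]:
    """Map each k-mer seen at least `min_copies` times to its start positions."""
    positions: dict[str, list[int]] = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        positions[sequence[start:start + k]].append(start)
    # Keep only k-mers that recur; exact matches only, so diverged copies are missed.
    return {kmer: starts for kmer, starts in positions.items() if len(starts) >= min_copies}


repeats = repeated_kmers("GATTACACCGATTACA", k=7)
```

Here the only 7-mer reported is GATTACA, at positions 0 and 9. Because the
comparison is exact, a mutated copy of the kind left behind after gene
duplication would not be found, which is one reason real repeat finders rely
on approximate matching.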
Current Annotation Methods: Genome Annotation

After a quality assembly has been created, annotation can begin. Gene
annotation in the context of computing requires programs capable of ab
initio gene model prediction, which finds genes by comparing patterns in the
DNA sequence against statistical models built from previously accumulated
data10. As this can be an incredibly complicated process, annotation
requires groups of programs called pipelines, such as Ensembl, to perform
this work in a set of steps11. Previously verified gene sequences (referred
to as training sets or training models) are used as a comparison for finding
genes within a new genome. An example of a program that can perform such
tasks is JIGSAW, which uses a complicated set of algorithms and training
model statistics to annotate a genome15.
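
To give a sense of what finding genes from sequence patterns alone means, the
Python sketch below scans the forward strand for open reading frames (a start
codon followed in the same frame by a stop codon). This is a naive stand-in
chosen for brevity: it is not how GeneMark.hmm or JIGSAW work, since those
programs use trained statistical models, and the function name find_orfs and
the min_codons threshold are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, for forward-strand ORFs.

    `min_codons` counts the start and stop codons as part of the ORF length.
    """
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(sequence) - 2, 3):
            codon = sequence[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None  # look for the next start codon in this frame
    return orfs


orfs = find_orfs("CCATGAAATTTTAGGG")
```

For this toy sequence the scan reports one reading frame, from position 2 to
14, spelling ATGAAATTTTAG. Real eukaryotic genes are interrupted by introns,
so a scan like this would miss or fragment most of them, which is why
annotation pipelines rely on trained models and additional evidence.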
Evidence from research outside of the gene model prediction can also
be added, such as RNA-seq data12. RNA-seq data consist of sequence reads
derived from an organism's RNA, the copy of DNA that is used to make
proteins12. Thus, aligning the RNA data against the genome can directly
provide evidence of where expressed genes lie along a genome, which would
have been unfeasible before high powered computing12. This is another
valuable asset to modern research and gene annotation, in that it provides
a direct location map of genes to be discovered along the DNA.
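
The way aligned RNA-seq reads lend support to a predicted gene can be
sketched as a simple interval overlap count. The Python example below assumes
the reads have already been aligned to the genome and that both reads and
gene models are given as half-open (start, end) coordinates; the function
name count_supporting_reads and the example gene names and coordinates are
made up for illustration and do not reflect how MAKER stores or scores
evidence.

```python
def count_supporting_reads(
    gene_models: dict[str, tuple[int, int]],
    read_alignments: list[tuple[int, int]],
) -> dict[str, int]:
    """Count aligned reads that overlap each gene model (half-open coordinates)."""
    counts = {}
    for name, (gene_start, gene_end) in gene_models.items():
        counts[name] = sum(
            1
            for read_start, read_end in read_alignments
            # Two half-open intervals overlap when each starts before the other ends.
            if read_start < gene_end and gene_start < read_end
        )
    return counts


gene_models = {"geneA": (100, 500), "geneB": (800, 1200)}
read_alignments = [(120, 170), (450, 520), (600, 650)]
support = count_supporting_reads(gene_models, read_alignments)
```

In this toy data geneA is overlapped by two reads and geneB by none, so only
geneA has expression evidence. The read at 600 to 650 falls outside both
models and might hint at a gene the prediction step missed.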
Verification of Previous Annotation
Modern sequencing and annotation are also useful as a method of revising
previous genomic data. MAKER, an annotation pipeline, is capable of doing
this via new annotation data and RNA-seq data12. By reviewing previously
published data, such as the maize genome, MAKER and other programs like
it are able to revise incorrect genomic data, and thereby strengthen the
genomic database as a whole13.
Problems and Potential Solutions
While modern annotation shows significant progress and expansion,
the field is faced with challenges regarding organization and data
interoperability. While the alignment of DNA seems straightforward, the
many species on Earth, and in certain cases closely related species, display a
wide variation of genomic information, application, and sequencing14. As
such, a researcher must have a thorough understanding of genomics as well
as an understanding of the pipelines used to annotate information in order to
create an accurate model. A second problem within the field of modern
gene annotation is the proliferation of data produced by multiple
laboratories across the world. In response to this, the Sequence Ontology
Project was instituted as a means of standardizing the terms used to
describe annotations, supporting large scale benchmarks of gene annotation
quality alongside other standardization practices for annotation files such
as the GenBank format14. Such efforts promote interoperability in research
data, ensuring data is usable for further research.
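
One practical consequence of a shared vocabulary is that annotation records
from different laboratories can be checked automatically. The Python sketch
below flags feature types that fall outside a small allowed set. The terms
gene, mRNA, exon, and CDS are real Sequence Ontology term names, but the
four-term set, the record layout, and the function name
unknown_feature_types are simplifying assumptions for illustration; the
actual ontology is far larger and is not distributed in this form.

```python
# A tiny, hand-picked subset of Sequence Ontology term names (assumption:
# a real check would load the full ontology rather than hard-code terms).
ALLOWED_TYPES = {"gene", "mRNA", "exon", "CDS"}


def unknown_feature_types(records: list[dict[str, str]]) -> list[str]:
    """Return the sorted feature types that are not in the allowed vocabulary."""
    return sorted({r["type"] for r in records if r["type"] not in ALLOWED_TYPES})


records = [
    {"id": "f1", "type": "gene"},
    {"id": "f2", "type": "Exon"},  # wrong capitalization
    {"id": "f3", "type": "cds"},   # wrong capitalization
    {"id": "f4", "type": "mRNA"},
]
problems = unknown_feature_types(records)
```

Here the check reports Exon and cds, two labels that a human reader would
understand but that would silently break software expecting the standard
terms.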
Conclusion
While genomics has taken large leaps with the introduction of higher
scale computing, efficient programming, and new avenues of comparing
genetic information, it has also encountered challenges that are unique to
the field. In response, researchers and programmers have adapted their
techniques, ensured compatibility among a wide field of study, and continue
to perfect their approaches toward gene research. As the field continues to
produce new insight into the study of life, researchers will have to
continually update their techniques and approaches toward sequencing and
annotation.

Bibliography
1. Celniker, S. E. et al. Finishing a whole-genome shotgun: release 3 of the
Drosophila melanogaster euchromatic genome sequence. Genome Biol. 3,
research0079 (2002).

2. International Human Genome Sequencing Consortium. Finishing the
euchromatic sequence of the human genome. Nature 431, 931–945 (2004).

3. Denoeud, F. et al. Annotating genomes with massive-scale RNA sequencing.
Genome Biol. 9, R175 (2008).

4. Tsai, I. J., Otto, T. D. & Berriman, M. Improving draft assemblies by
iterative mapping and assembly of short reads to eliminate gaps. Genome
Biol. 11, R41 (2010).

5. Ye, L. et al. A vertebrate case study of the quality of assemblies derived
from next-generation sequences. Genome Biol. 12, R31 (2011).

6. Bao, Z. & Eddy, S. R. Automated de novo identification of repeat sequence
families in sequenced genomes. Genome Res. 12, 1269–1276 (2002).

7. Buisine, N., Quesneville, H. & Colot, V. Improved detection and annotation
of transposable elements in sequenced genomes using multiple reference
sequence sets. Genomics 91, 467–475 (2008).

8. Witherspoon, D. J. et al. Alu repeats increase local recombination rates.
BMC Genomics 10, 530 (2009).

9. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic
local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).

10. Lukashin, A. V. & Borodovsky, M. GeneMark.hmm: new solutions for gene
finding. Nucleic Acids Res. 26, 1107–1115 (1998).

11. http://www.ensembl.org/info/docs/genebuild/genome_annotation.html

12. Cantarel, B. L. et al. MAKER: an easy-to-use annotation pipeline designed
for emerging model organism genomes. Genome Res. 18, 188–196 (2008).

13. Law, M. et al. Automated update, revision, and quality control of the
maize genome annotations using MAKER-P improves the B73 RefGen_v3 gene
models and identifies new genes. Plant Physiol. 167, 25–39 (2015).

14. Eilbeck, K. et al. The Sequence Ontology: a tool for the unification of
genome annotations. Genome Biol. 6, R44 (2005).

15. Allen, J. E. & Salzberg, S. L. JIGSAW: integration of multiple sources of
evidence for gene prediction. Bioinformatics 21, 3596–3603 (2005).
