
PERSPECTIVES

INNOVATION

RNA-Seq: a revolutionary tool for transcriptomics

Zhong Wang, Mark Gerstein and Michael Snyder

Abstract | RNA-Seq is a recently developed approach to transcriptome profiling that uses deep-sequencing technologies. Studies using this method have already altered our view of the extent and complexity of eukaryotic transcriptomes. RNA-Seq also provides a far more precise measurement of levels of transcripts and their isoforms than other methods. This article describes the RNA-Seq approach, the challenges associated with its application, and the advances made so far in characterizing several eukaryote transcriptomes.

The transcriptome is the complete set of transcripts in a cell, and their quantity, for a specific developmental stage or physiological condition. Understanding the transcriptome is essential for interpreting the functional elements of the genome and revealing the molecular constituents of cells and tissues, and also for understanding development and disease. The key aims of transcriptomics are: to catalogue all species of transcript, including mRNAs, non-coding RNAs and small RNAs; to determine the transcriptional structure of genes, in terms of their start sites, 5′ and 3′ ends, splicing patterns and other post-transcriptional modifications; and to quantify the changing expression levels of each transcript during development and under different conditions.

Various technologies have been developed to deduce and quantify the transcriptome, including hybridization- or sequence-based approaches. Hybridization-based approaches typically involve incubating fluorescently labelled cDNA with custom-made microarrays or commercial high-density oligo microarrays. Specialized microarrays have also been designed; for example, arrays with probes spanning exon junctions can be used to detect and quantify distinct spliced isoforms1. Genomic tiling microarrays that represent the genome at high density have been constructed and allow the mapping of transcribed regions to a very high resolution, from several base pairs to ~100 bp2–5. Hybridization-based approaches are high throughput and relatively inexpensive, except for high-resolution tiling arrays that interrogate large genomes. However, these methods have several limitations, which include: reliance upon existing knowledge about genome sequence; high background levels owing to cross-hybridization6,7; and a limited dynamic range of detection owing to both background and saturation of signals. Moreover, comparing expression levels across different experiments is often difficult and can require complicated normalization methods.

In contrast to microarray methods, sequence-based approaches directly determine the cDNA sequence. Initially, Sanger sequencing of cDNA or EST libraries8,9 was used, but this approach is relatively low throughput, expensive and generally not quantitative. Tag-based methods were developed to overcome these limitations, including serial analysis of gene expression (SAGE)10,11, cap analysis of gene expression (CAGE)12–14 and massively parallel signature sequencing (MPSS)15–17. These tag-based sequencing approaches are high throughput and can provide precise, ‘digital’ gene expression levels. However, most are based on expensive Sanger sequencing technology, and a significant portion of the short tags cannot be uniquely mapped to the reference genome. Moreover, only a portion of the transcript is analysed and isoforms are generally indistinguishable from each other. These disadvantages limit the use of traditional sequencing technology in annotating the structure of transcriptomes.

Recently, the development of novel high-throughput DNA sequencing methods has provided a new method for both mapping and quantifying transcriptomes. This method, termed RNA-Seq (RNA sequencing), has clear advantages over existing approaches and is expected to revolutionize the manner in which eukaryotic transcriptomes are analysed. It has already been applied to Saccharomyces cerevisiae, Schizosaccharomyces pombe, Arabidopsis thaliana, mouse and human cells18–24. Here, we explain how RNA-Seq works, discuss its challenges and provide an overview of studies that have used this approach, which have already begun to change our view of eukaryotic transcriptomes.

RNA-Seq technology and benefits
RNA-Seq uses recently developed deep-sequencing technologies. In general, a population of RNA (total or fractionated, such as poly(A)+) is converted to a library of cDNA fragments with adaptors attached to one or both ends (FIG. 1). Each molecule, with or without amplification, is then sequenced in a high-throughput manner to obtain short sequences from one end (single-end sequencing) or both ends (paired-end sequencing). The reads are typically 30–400 bp, depending on the DNA-sequencing technology used. In principle, any high-throughput sequencing technology25 can be used for RNA-Seq, and the Illumina IG18–21,23,24, Applied Biosystems SOLiD22 and Roche 454 Life Science26–28

NATURE REVIEWS | GENETICS VOLUME 10 | JANUARY 2009 | 57

© 2009 Macmillan Publishers Limited. All rights reserved


systems have already been applied for this purpose. The Helicos Biosciences tSMS system has not yet been used for published RNA-Seq studies, but is also appropriate and has the added advantage of avoiding amplification of target cDNA. Following sequencing, the resulting reads are either aligned to a reference genome or reference transcripts, or assembled de novo without the genomic sequence to produce a genome-scale transcription map that consists of the transcriptional structure and/or level of expression for each gene.

Figure 1 | A typical RNA-Seq experiment. Briefly, long RNAs are first converted into a library of cDNA fragments through either RNA fragmentation or DNA fragmentation (see main text). Sequencing adaptors (blue) are subsequently added to each cDNA fragment and a short sequence is obtained from each cDNA using high-throughput sequencing technology. The resulting sequence reads are aligned with the reference genome or transcriptome, and classified as three types: exonic reads, junction reads and poly(A) end-reads. These three types are used to generate a base-resolution expression profile for each gene, as illustrated at the bottom; a yeast ORF with one intron is shown.
[Figure 1 panel labels: mRNA; RNA fragments or cDNA; EST library with adaptors; short sequence reads; mapped sequence reads (exonic reads, junction reads, poly(A) end reads) on an ORF coding sequence; base-resolution expression profile (RNA expression level versus nucleotide position).]

Although RNA-Seq is still a technology under active development, it offers several key advantages over existing technologies (Table 1). First, unlike hybridization-based approaches, RNA-Seq is not limited to detecting transcripts that correspond to existing genomic sequence. For example, 454-based RNA-Seq has been used to sequence the transcriptome of the Glanville fritillary butterfly27. This makes RNA-Seq particularly attractive for non-model organisms with genomic sequences that are yet to be determined. RNA-Seq can reveal the precise location of transcription boundaries, to a single-base resolution. Furthermore, 30-bp short reads from RNA-Seq give information about how two exons are connected, whereas longer reads or paired-end short reads should reveal connectivity between multiple exons. These factors make RNA-Seq useful for studying complex transcriptomes. In addition, RNA-Seq can also reveal sequence variations (for example, SNPs) in the transcribed regions22,24.

A second advantage of RNA-Seq relative to DNA microarrays is that RNA-Seq has very low, if any, background signal because DNA sequences can be unambiguously mapped to unique regions of the genome. RNA-Seq does not have an upper limit for quantification, which correlates with the number of sequences obtained. Consequently, it has a large dynamic range of expression levels over which transcripts can be detected: a greater than 9,000-fold range was estimated in a study that analysed 16 million mapped reads in Saccharomyces cerevisiae18, and a range spanning five orders of magnitude was estimated for 40 million mouse sequence reads20. By contrast, DNA microarrays lack sensitivity for genes expressed either at low or very high levels and therefore have a much smaller dynamic range (one-hundredfold to a few-hundredfold) (FIG. 2). RNA-Seq has also been shown to be highly accurate for quantifying expression levels, as determined using quantitative PCR (qPCR)18 and spike-in RNA controls of known concentration20. The results of RNA-Seq also show high levels of reproducibility, for both technical and biological replicates18,22. Finally, because there are no cloning steps, and with the Helicos technology there is no amplification step, RNA-Seq requires less RNA sample.

Taking all of these advantages into account, RNA-Seq is the first sequencing-based method that allows the entire transcriptome to be surveyed in a very high-throughput and quantitative manner. This method offers both single-base resolution for annotation and ‘digital’ gene expression levels at the genome scale, often at a much lower cost than either tiling arrays or large-scale Sanger EST sequencing.

Challenges for RNA-Seq
Library construction. The ideal method for transcriptomics should be able to directly identify and quantify all RNAs, small or large. Although there are only a few steps in RNA-Seq (FIG. 1), it does involve several manipulation stages during the production of cDNA libraries, which can complicate its use in profiling all types of transcript.

Unlike small RNAs (microRNAs (miRNAs), Piwi-interacting RNAs (piRNAs), short interfering RNAs (siRNAs) and many others), which can be directly sequenced after adaptor ligation, larger RNA molecules must be fragmented into smaller pieces (200–500 bp) to be compatible with most deep-sequencing technologies. Common fragmentation methods include


RNA fragmentation (RNA hydrolysis or nebulization) and cDNA fragmentation (DNase I treatment or sonication). Each of these methods creates a different bias in the outcome. For example, RNA fragmentation has little bias over the transcript body20, but is depleted for transcript ends compared with other methods (FIG. 3). Conversely, cDNA fragmentation is usually strongly biased towards the identification of sequences from the 3′ ends of transcripts, and thereby provides valuable information about the precise identity of these ends18 (FIG. 4).

Some manipulations during library construction also complicate the analysis of RNA-Seq results. For example, many short reads that are identical to each other can be obtained from cDNA libraries that have been amplified. These could be a genuine reflection of abundant RNA species, or they could be PCR artefacts. One way to discriminate between these possibilities is to determine whether the same sequences are observed in different biological replicates.

Another key consideration concerning library construction is whether or not to prepare strand-specific libraries, as has been done in two studies21,22. These libraries have the advantage of yielding information about the orientation of transcripts, which is valuable for transcriptome annotation, especially for regions with overlapping transcription from opposite directions2,19,29; however, strand-specific libraries are currently laborious to produce because they require many steps22 or direct RNA–RNA ligation21, which is inefficient. Moreover, it is essential to ensure that the antisense transcripts are not artefacts of reverse transcription30. Because of these complications, most studies thus far have analysed cDNAs without strand information.

Bioinformatic challenges. Like other high-throughput sequencing technologies, RNA-Seq faces several informatics challenges, including the development of efficient methods to store, retrieve and process large amounts of data, which must be overcome to reduce errors in image analysis and base-calling and remove low-quality reads.

Table 1 | Advantages of RNA-Seq compared with other transcriptomics methods

Technology | Tiling microarray | cDNA or EST sequencing | RNA-Seq
Technology specifications
Principle | Hybridization | Sanger sequencing | High-throughput sequencing
Resolution | From several to 100 bp | Single base | Single base
Throughput | High | Low | High
Reliance on genomic sequence | Yes | No | In some cases
Background noise | High | Low | Low
Application
Simultaneously map transcribed regions and gene expression | Yes | Limited for gene expression | Yes
Dynamic range to quantify gene expression level | Up to a few-hundredfold | Not practical | >8,000-fold
Ability to distinguish different isoforms | Limited | Yes | Yes
Ability to distinguish allelic expression | Limited | Yes | Yes
Practical issues
Required amount of RNA | High | High | Low
Cost for mapping transcriptomes of large genomes | High | High | Relatively low

Figure 2 | Quantifying expression levels: RNA-Seq and microarray compared. Expression levels are shown, as measured by RNA-Seq and tiling arrays, for Saccharomyces cerevisiae cells grown in nutrient-rich media. The two methods agree fairly well for genes with medium levels of expression (middle), but correlation is very low for genes with either low or high expression levels. The tiling array data used in this figure is taken from REF. 2, and the RNA-Seq data is taken from REF. 18.
[Figure 2 panels: Low (correlation = 0.099), Medium (correlation = 0.509), High (correlation = 0.177). x axis: expression levels by RNA-Seq (log2); y axis: expression levels by tiling array.]
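The per-panel correlations reported in Figure 2 can be reproduced in spirit with a few lines of code. The following is only a minimal sketch, not the analysis used for the figure: the function names are hypothetical, and the bin edges (below 4, 4 to 8, and 8 or above on the RNA-Seq log2 scale) are an assumption read off the panel axes.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if undefined)."""
    n = len(xs)
    if n < 2:
        return float("nan")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

# Assumed bin edges on the RNA-Seq log2 scale (hypothetical, taken from the axes).
BINS = {
    "low": (float("-inf"), 4.0),
    "medium": (4.0, 8.0),
    "high": (8.0, float("inf")),
}

def stratified_correlation(pairs, bins=BINS):
    """pairs: (rnaseq_log2, array_level) per gene. Returns Pearson r per bin."""
    pairs = list(pairs)
    result = {}
    for name, (lower, upper) in bins.items():
        subset = [(x, y) for x, y in pairs if lower <= x < upper]
        result[name] = pearson([x for x, _ in subset], [y for _, y in subset])
    return result
```

A flat array signal within a bin (for example, background-dominated probes) gives an undefined correlation, which the sketch reports as NaN rather than zero.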


Once high-quality reads have been obtained, the first task of data analysis is to map the short reads from RNA-Seq to the reference genome, or to assemble them into contigs before aligning them to the genomic sequence to reveal transcription structure. There are several programs for mapping reads to the genome, including ELAND, SOAP31, MAQ32 and RMAP33 (information about these can be found at the Illumina forum and at SEQanswers). However, short transcriptomic reads also contain reads that span exon junctions or that contain poly(A) ends; these cannot be analysed in the same way. For genomes in which splicing is rare (for example, S. cerevisiae) special attention only needs to be given to poly(A) tails and to a small number of exon–exon junctions. Poly(A) tails can be identified simply by the presence of multiple As or Ts at the end of some reads. Exon–exon junctions can be identified by the presence of a specific sequence context (the GT–AG dinucleotides that flank splice sites) and confirmed by the low expression of intronic sequences, which are removed during splicing. Transcriptome maps have been generated in this manner for S. cerevisiae18. For complex transcriptomes it is more difficult to map reads that span splice junctions, owing to the presence of extensive alternative splicing and trans-splicing. One partial solution is to compile a junction library that contains all the known and predicted junction sequences and map reads to this library19,20. A challenge for the future is to develop computationally simple methods to identify novel splicing events that take place between two distant sequences or between exons from two different genes.

For large transcriptomes, alignment is also complicated by the fact that a significant portion of sequence reads match multiple locations in the genome. One solution is to assign these multi-matched reads proportionally, based on the number of reads mapped to their neighbouring unique sequences20,22. This method has been successful for low-copy repetitive sequences20. Short reads that have high copy numbers (>100) and long stretches of repetitive regions present a greater challenge. Obtaining longer sequence reads, for example using 454 technology, should help alleviate the multi-matching problem. Alternatively, a paired-end sequencing strategy, in which short sequences are determined from both ends of a DNA fragment25,34,35, extends the mapped fragment length to 200–500 bp and is expected to be useful in the future.

Sequencing errors and polymorphisms can present mapping problems for all genomes, not just for repetitive DNA. Generally, single base differences are not problematic, because most mapping algorithms accommodate one or two base differences. However, resolving larger differences will require better reference genome annotation for polymorphisms and deeper sequencing coverage.

Figure 3 | DNA library preparation: RNA fragmentation and DNA fragmentation compared. a | Fragmentation of oligo-dT primed cDNA (blue line) is more biased towards the 3′ end of the transcript. RNA fragmentation (red line) provides more even coverage along the gene body, but is relatively depleted for both the 5′ and 3′ ends. Note that the ratio between the maximum and minimum expression level (or the dynamic range) for microarrays is 44, for RNA-Seq it is 9,560. The tag count is the average sequencing coverage for 5,000 yeast ORFs18. b | A specific yeast gene, SES1 (seryl-tRNA synthetase), is shown.
[Figure 3 panels: a, mean count for 5,099 genes; b, mean count for a single gene, SES1. y axis: tag count; x axis: transcript position from 5′ to 3′.]

Coverage versus cost. Another important issue is sequence coverage, or the percentage of transcripts surveyed, which has implications for cost. Greater coverage requires more sequencing depth. To detect a rare transcript or variant, considerable depth is needed. In simple transcriptomes, such as yeast (both S. pombe and S. cerevisiae) for which there is no evidence of alternative splicing, 30 million 35-nucleotide reads from poly(A) mRNA libraries are sufficient to observe transcription from most (>90%) genes for cells grown under a single condition (that is, in nutrient-rich medium)18. This depth is probably more than sufficient for most purposes, as the number of expressed genes detected by RNA-Seq reaches 80% coverage at 4 million uniquely mapped reads, after which doubling the depth merely increases the coverage by 10% (FIG. 5). The remaining


genes are presumably either not expressed under this condition (for example, sporulation genes18) or do not have poly(A) tails. Analysing many different conditions can further increase the coverage; in S. pombe 122 million reads from six different growth conditions detected transcription from >99% of annotated genes19.

In general, the larger the genome and the more complex the transcriptome, the more sequencing depth is required for adequate coverage. Unlike genome-sequencing coverage, it is less straightforward to calculate the coverage of the transcriptome; this is because the true number and level of different transcript isoforms is not usually known and because transcription activity varies greatly across the genome. One study used the number of unique transcription start sites as a measure of coverage in mouse embryonic cells, and demonstrated that at 80 million reads, the number of start sites reached a plateau22 (FIG. 5b). However, this approach does not address transcriptome complexity in alternative splicing and transcription termination sites; presumably further sequencing can reveal additional variants.

Figure 4 | Poly(A) tags from RNA-Seq. A region containing two overlapping transcripts (ACT1, from the actin gene, and YFL040W, an uncharacterized ORF) from the Saccharomyces cerevisiae genome is shown. Arrows point to transcription direction. The poly(A) tags from RNA-Seq experiments are shown below these transcripts, with arrows indicating transcription direction. The precise location of each locus identified by poly(A) tags reveals the heterogeneity in poly(A) sites; for example, ACT1 has two big clusters, both with a few bases of local heterogeneity. The transcription direction revealed by poly(A) tags also helps to resolve 3′-end overlapping transcribed regions18.
[Figure 4 labels: ACT1 ORF (5′ to 3′) and YFL040W ORF (3′ to 5′); poly(A) tags on the − strand and on the + strand; local heterogeneity; distinct poly(A) sites (Site 1, Site 2).]
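The idea of recognizing poly(A) tags by runs of As or Ts at a read end, and of grouping the resulting 3′-end positions into sites as in Figure 4, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the procedure of any cited study: the function names, the minimum run length of 8 and the maximum gap of 10 bases between clustered positions are all hypothetical parameters.

```python
def polya_trim(read, min_run=8):
    """Classify a read as a poly(A) tag and strip the homopolymer run.

    Returns (kind, trimmed_read). kind is "A-tail" when the read ends in at
    least min_run As, "T-head" when it starts with at least min_run Ts (the
    same tag sequenced from the opposite strand), or None otherwise.
    min_run is an assumed threshold, not a published value.
    """
    a_run = len(read) - len(read.rstrip("A"))
    if min_run <= a_run < len(read):
        return "A-tail", read[:-a_run]
    t_run = len(read) - len(read.lstrip("T"))
    if min_run <= t_run < len(read):
        return "T-head", read[t_run:]
    return None, read

def cluster_sites(positions, max_gap=10):
    """Group mapped poly(A) positions into sites.

    Sorted positions whose distance to the previous position is at most
    max_gap (an assumed value) fall into the same cluster, so one cluster
    captures local heterogeneity and separate clusters are distinct sites.
    """
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters
```

In practice the trimmed portion of each tag would be mapped to the genome first, and the mapped 3′ coordinate (together with the strand implied by the tag type) would be passed to the clustering step.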
New transcriptomic insights
Despite the challenges described above, the advantages of RNA-Seq have enabled us to generate an unprecedented global view of the transcriptome and its organization for a number of species and cell types. Before the advent of RNA-Seq, it was known that a much greater than expected fraction of the yeast, Drosophila melanogaster and human genomes are transcribed2,4,36, and for yeast and humans a number of distinct isoforms have been found for many genes2,4. However, the starts and ends of most transcripts and exons had not been precisely resolved and the extent of spliced heterogeneity remained poorly understood. RNA-Seq, with its high resolution and sensitivity, has revealed many novel transcribed regions and splicing isoforms of known genes, and has mapped 5′ and 3′ boundaries for many genes.

Mapping gene and exon boundaries. The single-base resolution of RNA-Seq has the potential to revise many aspects of the existing gene annotation, including gene boundaries and introns for known genes as well as the identification of novel transcribed regions. 5′ and 3′ boundaries can be mapped to within 10–50 bases by a precipitous drop in signal. 3′ boundaries can be precisely mapped by searching for poly(A) tags, and introns can be mapped by searching for tags that span GT–AG splicing consensus sites. Using these methods the 5′ and 3′ boundaries of 80% and 85% of all annotated genes, respectively, were mapped in S. cerevisiae18. Similarly, in S. pombe many boundaries were defined by RNA-Seq data in combination with tiling array data19.

These two studies led to the discovery of many 5′ and 3′ UTRs that had not been analysed previously. In S. cerevisiae, extensive 3′-end heterogeneity was discovered at two levels: first, local heterogeneity exists in which a cluster of sites are involved, typically within a 10 bp window; second, there are distinct regions of poly(A) addition for 540 genes (FIG. 4). It is plausible that these different 3′ ends confer distinct properties to the different mRNA isoforms, such as mRNA localization or degradation signals, which in turn might be responsible for unique biological functions18,19. In addition to 3′ heterogeneity, the list of upstream ORFs within the 5′ UTRs of mRNAs (uORFs) was also greatly expanded from 17 to 340 (6% of yeast genes)18; uORFs regulate mRNA translation37 or stability38, so these sequences might make a previously underappreciated contribution to the regulatory sophistication of eukaryotic genomes. Interestingly, many mRNAs with uORFs encode transcription factors, suggesting that these regulators are themselves heavily regulated.

The mapping of transcript boundaries revealed several novel features of eukaryotic gene organization. Many yeast genes were found to overlap at their 3′ ends18. Using relaxed criteria similar to those employed in a recent study18 we found that 808 pairs, approximately 25% of all yeast ORFs, overlap at their 3′ ends18. Likewise, antisense expression is enriched in the 3′ exons of mouse transcripts22. These features might confer interesting regulatory properties on the affected genes. For multicellular organisms, antisense transcription could modulate gene expression through the production of siRNAs or through dsRNA editing39,40. For yeast, which seems to lack siRNA and dsRNA-editing functions, transcription from one gene might interfere with that from an overlapping gene, or coordinate gene expression through other mechanisms.

Extensive transcript complexity. RNA-Seq can be used to quantitatively examine splicing diversity by searching for reads that span known splice junctions as well as potential new ones. In humans, 31,618 known splicing events were confirmed (11% of all known splicing events) and 379 novel splicing events were discovered24.
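Searching for reads that span splice junctions is often done against a precompiled junction library, as described earlier for complex transcriptomes. The sketch below illustrates that idea under simplifying assumptions: exons are given as 0-based half-open coordinates on a single forward-strand sequence, every pair of exons whose intervening sequence starts with GT and ends with AG is treated as a known or predicted junction, and reads are matched exactly. All names and the `flank` and `min_overhang` parameters are hypothetical.

```python
def has_gt_ag(genome, intron_start, intron_end):
    """True if genome[intron_start:intron_end] is flanked by GT ... AG."""
    intron = genome[intron_start:intron_end]
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")

def build_junction_library(genome, exons, flank):
    """Build junction sequences for all exon pairs (i, j) with i < j.

    exons: sorted list of (start, end), 0-based half-open.
    Returns {(i, j): (junction_sequence, left_length)} where the junction
    sequence joins up to `flank` bases from the end of exon i to up to
    `flank` bases from the start of exon j.
    """
    library = {}
    for i in range(len(exons)):
        for j in range(i + 1, len(exons)):
            donor_end, acceptor_start = exons[i][1], exons[j][0]
            if not has_gt_ag(genome, donor_end, acceptor_start):
                continue
            left = genome[max(exons[i][0], donor_end - flank):donor_end]
            right = genome[acceptor_start:min(exons[j][1], acceptor_start + flank)]
            library[(i, j)] = (left + right, len(left))
    return library

def junction_hits(read, library, min_overhang=4):
    """Return junctions that the read spans with an exact match.

    Only the first occurrence of the read in each junction sequence is
    checked, and the read must cover at least min_overhang bases on both
    sides of the splice point (an assumed threshold).
    """
    hits = []
    for key, (seq, left_len) in library.items():
        pos = seq.find(read)
        if pos == -1:
            continue
        if pos + min_overhang <= left_len and pos + len(read) >= left_len + min_overhang:
            hits.append(key)
    return hits
```

A real pipeline would tolerate mismatches and handle both strands; the exact-match lookup here is only meant to show why a junction library lets an ordinary short-read mapper recover spliced reads, including exon-skipping events such as a hit on the pair (0, 2).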


Another study of human cells found 94,241 junctions, among which 4,096 were novel, and further demonstrated that the prevalent form of alternative splicing is exon skipping41. In mice, extensive alternative splicing was observed for 3,462 genes20. In addition, 42 splicing events that join exons from multiple mouse genes were detected22.

Novel transcription. Previous studies using transposon tagging and tiling microarrays have suggested that in the genomes of yeast, D. melanogaster and humans, there are many novel transcribed regions represented in poly(A)+ RNA2,36,42,43. However, the accuracy of the tiling array results is uncertain owing to concerns about cross-hybridization (see below). RNA-Seq, which does not suffer from problems with background noise, has confirmed that at least 75% and perhaps greater than 90% of the S. cerevisiae and S. pombe genomes are expressed18,19. In addition, results from RNA-Seq suggest the existence of a large number of novel transcribed regions in every genome surveyed, including the A. thaliana21, mouse20,22, human24, S. cerevisiae18 and S. pombe19 genomes. 487 and 453 novel transcripts have been discovered in S. cerevisiae and S. pombe, respectively18,19; for S. cerevisiae half of these were not identified using microarrays. Many of these novel transcribed regions in yeast do not seem to encode any protein, and their functions remain to be determined. The current sequencing depth is not sufficient to define the boundaries of novel transcript units in mammals; however, 30–40% of reads map to unannotated regions20,22,24. These novel transcribed regions, combined with many undiscovered novel splicing variants, suggest that there is considerably more transcript complexity than previously appreciated.

Figure 5 | Coverage versus depth. a | 80% of yeast genes were detected at 4 million uniquely mapped RNA-Seq reads, and coverage reaches a plateau afterwards despite the increasing sequencing depth. Expressed genes are defined as having at least four independent reads from a 50-bp window at the 3′ end. Data is taken from REF. 18. b | The number of unique start sites detected starts to reach a plateau when the depth of sequencing reaches 80 million in two mouse transcriptomes. ES, embryonic stem cells; EB, embryonic body. Figure is modified, with permission, from REF. 22 © (2008) Macmillan Publishers Ltd. All rights reserved.
[Figure 5 axes: a, number of ORFs detected (0 to 6,000, with 50%, 80% and 90% coverage marked) versus number of uniquely mapped tags (million); b, unique start sites (million, log scale) versus number of mapped tags (million), for ES and EB.]

Glossary

Cap analysis of gene expression
(CAGE). Similar to SAGE, except that 5′-end information of the transcript is analysed instead of 3′-end information.

Contigs
A group of sequences representing overlapping regions from a genome or transcriptome.

dsRNA editing
Site-specific modification of a pre-mRNA by dsRNA-specific enzymes that leads to the production of variant mRNA from the same gene.

Genomic tiling microarray
A DNA microarray that uses a set of overlapping oligonucleotide probes that represent a subset of or the whole genome at very high resolution.

Massively parallel signature sequencing
(MPSS). A gene expression quantification method that determines 17–20-bp ‘signatures’ from the ends of a cDNA molecule using multiple cycles of enzymatic cleavage and ligation.

MicroRNA
(miRNA). Small RNA molecules that are processed from small hairpin RNA (shRNA) precursors that are produced from miRNA genes. miRNAs are 21–23 nucleotides in length and through the RNA-induced silencing complex they target and silence mRNAs containing imperfectly complementary sequence.

Piwi-interacting RNAs
(piRNA). Small RNA species that are processed from single-stranded precursor RNAs. They are 25–35 nucleotides in length and form complexes with the piwi protein. piRNAs are probably involved in transposon silencing and stem-cell function.

Quantitative PCR
(qPCR). An application of PCR to determine the quantity of DNA or RNA in a sample. The measurements are often made in real time and the method is also called real-time PCR.

Sequencing depth
The total number of all the sequence reads or base pairs represented in a single sequencing experiment or series of experiments.

Serial analysis of gene expression
(SAGE). A method that uses short ~14–20-bp sequence tags from the 3′ ends of transcripts to measure gene expression levels.

Short interfering RNA
(siRNA). RNA molecules that are 21–23 nucleotides long and that are processed from long double-stranded RNAs; they are functional components of the RNAi-induced silencing complex. siRNAs typically target and silence mRNAs by binding perfectly complementary sequences in the mRNA and causing their degradation and/or translation inhibition.

Spike-in RNA
A few species of RNA with known sequence and quantity that are added as internal controls in RNA-Seq experiments.


Defining transcription level
As RNA-Seq is quantitative, it can be used to determine RNA expression levels more accurately than microarrays. In principle, it is possible to determine the absolute quantity of every molecule in a cell population, and directly compare results between experiments. Several methods have been used for quantification. For RNA fragmentation followed by cDNA synthesis, which gives more uniform coverage of each exon, gene expression levels can be deduced from the total number of reads that fall into the exons of a gene, normalized by the length of exons that can be uniquely mapped24; for 3′-biased methods, read counts from a window near the 3′ end are used18. Gene expression levels determined by these methods closely correlate with qPCR and RNA spike-in controls.

One particularly powerful advantage of RNA-Seq is that it can capture transcriptome dynamics across different tissues or conditions without sophisticated normalization of data sets19,20,22. RNA-Seq has been used to accurately monitor gene expression during yeast vegetative growth18, yeast meiosis19 and mouse embryonic stem-cell differentiation22, to track gene expression changes during development, and to provide a ‘digital measurement’ of gene expression differences between different tissues20. Because of these advantages, RNA-Seq will undoubtedly be valuable for understanding transcriptomic dynamics during development and normal physiological changes, and in the analysis of biomedical samples, where it will allow robust comparison between diseased and normal tissues, as well as the subclassification of disease states.

Future directions
Although RNA-Seq is still in the early stages of use, it has clear advantages over previously developed transcriptomic methods. The next big challenge for RNA-Seq is to target more complex transcriptomes to identify and track the expression changes of rare RNA isoforms from all genes. Technologies that will advance achievement of this goal are paired-end sequencing, strand-specific sequencing and the use of longer reads to increase coverage and depth. As the cost of sequencing continues to fall, RNA-Seq is expected to replace microarrays for many applications that involve determining the structure and dynamics of the transcriptome.

Zhong Wang and Michael Snyder are at the Department of Molecular, Cellular and Developmental Biology, and Mark Gerstein is at the Department of Molecular Biophysics and Biochemistry, Yale University, 219 Prospect Street, New Haven, Connecticut 06520, USA.
Correspondence to M.S.
e-mail: michael.snyder@yale.edu
doi:10.1038/nrg2484
Published online 18 November 2008

1. Clark, T. A., Sugnet, C. W. & Ares, M. Jr. Genomewide analysis of mRNA processing in yeast using splicing-specific microarrays. Science 296, 907–910 (2002).
2. David, L. et al. A high-resolution map of transcription in the yeast genome. Proc. Natl Acad. Sci. USA 103, 5320–5325 (2006).
3. Yamada, K. et al. Empirical analysis of transcriptional activity in the Arabidopsis genome. Science 302, 842–846 (2003).
4. Bertone, P. et al. Global identification of human transcribed sequences with genome tiling arrays. Science 306, 2242–2246 (2004).
5. Cheng, J. et al. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science 308, 1149–1154 (2005).
6. Okoniewski, M. J. & Miller, C. J. Hybridization interactions between probesets in short oligo microarrays lead to spurious correlations. BMC Bioinformatics 7, 276 (2006).
7. Royce, T. E., Rozowsky, J. S. & Gerstein, M. B. Toward a universal microarray: prediction of gene expression through nearest-neighbor probe sequence identification. Nucleic Acids Res. 35, e99 (2007).
8. Boguski, M. S., Tolstoshev, C. M. & Bassett, D. E. Jr. Gene discovery in dbEST. Science 265, 1993–1994 (1994).
9. Gerhard, D. S. et al. The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC). Genome Res. 14, 2121–2127 (2004).
10. Velculescu, V. E., Zhang, L., Vogelstein, B. & Kinzler, K. W. Serial analysis of gene expression. Science 270, 484–487 (1995).
11. Harbers, M. & Carninci, P. Tag-based approaches for transcriptome research and genome annotation. Nature Methods 2, 495–502 (2005).
12. Kodzius, R. et al. CAGE: cap analysis of gene expression. Nature Methods 3, 211–222 (2006).
13. Nakamura, M. & Carninci, P. [Cap analysis gene expression: CAGE]. Tanpakushitsu Kakusan Koso 49, 2688–2693 (2004) (in Japanese).
14. Shiraki, T. et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc. Natl Acad. Sci. USA 100, 15776–15781 (2003).
15. Brenner, S. et al. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nature Biotechnol. 18, 630–634 (2000).
16. Peiffer, J. A. et al. A spatial dissection of the Arabidopsis floral transcriptome by MPSS. BMC Plant Biol. 8, 43 (2008).
17. Reinartz, J. et al. Massively parallel signature sequencing (MPSS) as a tool for in-depth quantitative gene expression profiling in all organisms. Brief. Funct. Genomic Proteomic 1, 95–104 (2002).
18. Nagalakshmi, U. et al. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320, 1344–1349 (2008).
19. Wilhelm, B. T. et al. Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature 453, 1239–1243 (2008).
20. Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods 5, 621–628 (2008).
21. Lister, R. et al. Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell 133, 523–536 (2008).
22. Cloonan, N. et al. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nature Methods 5, 613–619 (2008).
23. Marioni, J., Mason, C., Mane, S., Stephens, M. & Gilad, Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 11 Jun 2008 (doi:10.1101/gr.079558.108).
24. Morin, R. et al. Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. Biotechniques 45, 81–94 (2008).
25. Holt, R. A. & Jones, S. J. The new paradigm of flow cell sequencing. Genome Res. 18, 839–846 (2008).
26. Barbazuk, W. B., Emrich, S. J., Chen, H. D., Li, L. & Schnable, P. S. SNP discovery via 454 transcriptome sequencing. Plant J. 51, 910–918 (2007).
27. Vera, J. C. et al. Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Mol. Ecol. 17, 1636–1647 (2008).
28. Emrich, S. J., Barbazuk, W. B., Li, L. & Schnable, P. S. Gene discovery and annotation using LCM-454 transcriptome sequencing. Genome Res. 17, 69–73 (2007).
29. Dutrow, N. et al. Dynamic transcriptome of Schizosaccharomyces pombe shown by RNA–DNA hybrid mapping. Nature Genet. 40, 977–986 (2008).
30. Wu, J. Q. et al. Systematic analysis of transcribed loci in ENCODE regions using RACE sequencing reveals extensive transcription in the human genome. Genome Biol. 9, R3 (2008).
31. Li, R., Li, Y., Kristiansen, K. & Wang, J. SOAP: short oligonucleotide alignment program. Bioinformatics 24, 713–714 (2008).
32. Li, H., Ruan, J. & Durbin, R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 19 Aug 2008 (doi:10.1101/gr.078212.108).
33. Smith, A. D., Xuan, Z. & Zhang, M. Q. Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics 9, 128 (2008).
34. Hillier, L. W. et al. Whole-genome sequencing and variant discovery in C. elegans. Nature Methods 5, 183–188 (2008).
35. Campbell, P. J. et al. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nature Genet. 40, 722–729 (2008).
36. Manak, J. R. et al. Biological function of unannotated transcription during the early development of Drosophila melanogaster. Nature Genet. 38, 1151–1158 (2006).
37. Hinnebusch, A. G. Translational regulation of GCN4 and the general amino acid control of yeast. Annu. Rev. Microbiol. 59, 407–450 (2005).
38. Ruiz-Echevarria, M. J. & Peltz, S. W. The RNA binding protein Pub1 modulates the stability of transcripts containing upstream open reading frames. Cell 101, 741–751 (2000).
39. Tomari, Y. & Zamore, P. D. MicroRNA biogenesis: drosha can’t cut it without a partner. Curr. Biol. 15, R61–64 (2005).
40. Bass, B. L. How does RNA editing affect dsRNA-mediated gene silencing? Cold Spring Harb. Symp. Quant. Biol. 71, 285–292 (2006).
41. Sultan, M. et al. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 321, 956–960 (2008).
42. Ross-Macdonald, P. et al. Large-scale analysis of the yeast genome by transposon tagging and gene disruption. Nature 402, 413–418 (1999).
43. Kumar, A., des Etages, S. A., Coelho, P. S., Roeder, G. S. & Snyder, M. High-throughput methods for the large-scale analysis of gene function by transposon tagging. Methods Enzymol. 328, 550–574 (2000).

Acknowledgements
We thank D. Raha for many valuable comments.

FURTHER INFORMATION
Gerstein laboratory homepage: http://bioinfo.mbb.yale.edu
Snyder laboratory homepage: http://www.yale.edu/snyder
454 Life Sciences: http://www.454.com
Applied Biosystems: www.appliedbiosystems.com
Helicos Biosciences: http://www.helicosbio.com
Illumina: http://www.illumina.com
Illumina forum: http://www.illumina.com/pagesnrn.ilmn?iD=245
SEQanswers: http://seqanswers.com/forums/showthread.php?t=43
All links are active in the online PDF
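
The length-normalized quantification described under 'Defining transcription level' can be illustrated with a minimal Python sketch. It sums the reads that fall into the exons of a gene and divides by the total exon length that can be uniquely mapped. The `Exon` record, the function name `expression_level` and the example numbers are hypothetical and chosen only for illustration. The additional scaling per million mapped reads is an assumption that follows a common convention for comparing samples of different sequencing depth; it is not stated in the text above.

```python
from dataclasses import dataclass


@dataclass
class Exon:
    """One exon of a gene (hypothetical record, for illustration only)."""
    read_count: int        # reads that map inside this exon
    mappable_length: int   # bases of the exon that can be uniquely mapped


def expression_level(exons, total_mapped_reads):
    """Reads per kilobase of uniquely mappable exon per million mapped reads.

    The per-million scaling is an assumed convention, not part of the
    description in the article text.
    """
    reads = sum(e.read_count for e in exons)
    length = sum(e.mappable_length for e in exons)
    if length == 0 or total_mapped_reads == 0:
        return 0.0  # nothing mappable or nothing sequenced: no estimate
    return reads / (length / 1_000) / (total_mapped_reads / 1_000_000)


# Made-up example: a two-exon gene in a run with two million mapped reads.
gene = [Exon(read_count=120, mappable_length=400),
        Exon(read_count=80, mappable_length=600)]
level = expression_level(gene, total_mapped_reads=2_000_000)
print(level)
```

Dividing by the mappable length rather than the annotated length keeps genes with repetitive exon sequence from appearing under-expressed simply because fewer of their reads can be placed uniquely.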

NATURE REVIEWS | GENETICS VOLUME 10 | JANUARY 2009 | 63

© 2009 Macmillan Publishers Limited. All rights reserved