RNA fragmentation (RNA hydrolysis or nebulization) and cDNA fragmentation (DNase I treatment or sonication). Each of these methods creates a different bias in the outcome. For example, RNA fragmentation has little bias over the transcript body20, but is depleted for transcript ends compared with other methods (FIG. 3). Conversely, cDNA fragmentation is usually strongly biased towards the identification of sequences from the 3′ ends of transcripts, and thereby provides valuable information about the precise identity of these ends18 (FIG. 4).

Some manipulations during library construction also complicate the analysis of RNA-Seq results. For example, many short reads that are identical to each other can be obtained from cDNA libraries that have been amplified. These could be a genuine reflection of abundant RNA species, or they could be PCR artefacts. One way to discriminate between these possibilities is to determine whether the same sequences are observed in different biological replicates.

Another key consideration concerning library construction is whether or not to prepare strand-specific libraries, as has been done in two studies21,22. These libraries have the advantage of yielding information about the orientation of transcripts, which is valuable for transcriptome annotation, especially for regions with overlapping transcription from opposite directions2,19,29; however, strand-specific libraries are currently laborious to produce because they require many steps22 or direct RNA–RNA ligation21, which is inefficient. Moreover, it is essential to ensure that the antisense transcripts are not artefacts of reverse transcription30. Because of these complications, most studies thus far have analysed cDNAs without strand information.

Bioinformatic challenges. Like other high-throughput sequencing technologies, RNA-Seq faces several informatics challenges, including the development of efficient methods to store, retrieve and process large amounts of data, which must be overcome to reduce errors in image analysis and base-calling and remove low-quality reads.
[Figure 2 plot; x-axis: Expression levels by RNA-Seq (log2)]

Figure 2 | Quantifying expression levels: RNA-Seq and microarray compared. Expression levels are shown, as measured by RNA-Seq and tiling arrays, for Saccharomyces cerevisiae cells grown in nutrient-rich media. The two methods agree fairly well for genes with medium levels of expression (middle), but correlation is very low for genes with either low or high expression levels. The tiling array data used in this figure is taken from REF. 2, and the RNA-Seq data is taken from REF. 18.

Nature Reviews | Genetics
Once high-quality reads have been obtained, the first task of data analysis is to map the short reads from RNA-Seq to the reference genome, or to assemble them into contigs before aligning them to the genomic sequence to reveal transcription structure. There are several programs for mapping reads to the genome, including ELAND, SOAP31, MAQ32 and RMAP33 (information about these can be found at the Illumina forum and at SEQanswers). However, short transcriptomic reads also contain reads that span exon junctions or that contain poly(A) ends; these cannot be analysed in the same way. For genomes in which splicing is rare (for example, S. cerevisiae) special attention only needs to be given to poly(A) tails and to a small number of exon–exon junctions. Poly(A) tails can be identified simply by the presence of multiple As or Ts at the end of some reads. Exon–exon junctions can be identified by the presence of a specific sequence context (the GT–AG dinucleotides that flank splice sites) and confirmed by the low expression of intronic sequences, which are removed during splicing. Transcriptome maps have been generated in this manner for S. cerevisiae18. For complex transcriptomes it is more difficult to map reads that span splice junctions, owing to the presence of extensive alternative splicing and trans-splicing. One partial solution is to compile a junction library that contains all the known and predicted junction sequences and map reads to this library19,20. A challenge for the future is to develop computationally simple methods to identify novel splicing events that take place between two distant sequences or between exons from two different genes.

For large transcriptomes, alignment is also complicated by the fact that a significant portion of sequence reads match multiple locations in the genome. One solution is to assign these multi-matched reads by proportionally assigning them based on the number of reads mapped to their neighbouring unique sequences20,22. This method has been successful for low-copy repetitive sequences20. Short reads that have high copy numbers (>100) and long stretches of repetitive regions present a greater challenge. Obtaining longer sequence reads, for example using 454 technology, should help alleviate the multi-matching problem. Alternatively, a paired-end sequencing strategy, in which short sequences are determined from both ends of a DNA fragment25,34,35, extends the mapped fragment length to 200–500 bp

[Figure residue: panel a, RNA fragmentation; axis label: Tag count]
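The proportional assignment of multi-matched reads described above can be illustrated with a minimal sketch. It assumes that the number of uniquely mapped reads near each candidate locus is already known. The function name `assign_multireads`, the locus labels and the even split used when no candidate has any unique reads are illustrative assumptions, not details taken from the cited studies.

```python
from collections import defaultdict


def assign_multireads(unique_counts, multireads):
    """Distribute each multi-matched read across its candidate loci.

    unique_counts: dict mapping locus -> number of uniquely mapped reads nearby.
    multireads: list of lists; each inner list holds the candidate loci of one read.
    Returns a dict mapping locus -> estimated count (unique plus fractional shares).
    """
    totals = defaultdict(float)
    for locus, n in unique_counts.items():
        totals[locus] += n

    for candidates in multireads:
        weights = [unique_counts.get(locus, 0) for locus in candidates]
        weight_sum = sum(weights)
        for locus, w in zip(candidates, weights):
            if weight_sum > 0:
                # Share proportional to the unique reads at each candidate locus.
                share = w / weight_sum
            else:
                # No unique evidence anywhere: split evenly (an assumption).
                share = 1.0 / len(candidates)
            totals[locus] += share
    return dict(totals)


# Hypothetical example: ten reads match both loci equally well.
unique = {"geneA": 90, "geneB": 10}
multi = [["geneA", "geneB"]] * 10
estimates = assign_multireads(unique, multi)
```

In this toy example each shared read contributes 0.9 to geneA and 0.1 to geneB, so the weighting simply follows the unique-read evidence. Highly repetitive sequences with little or no unique flanking signal would fall into the even-split branch, which is one way to see why they remain difficult.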
genes are presumably either not expressed under this condition (for example, sporulation genes18) or do not have poly(A) tails. Analyzing many different conditions can further increase the coverage; in S. pombe, 122 million reads from six different growth conditions detected transcription from >99% of annotated genes19.

In general, the larger the genome, the more complex the transcriptome, the more sequencing depth is required for adequate coverage. Unlike genome-sequencing coverage, it is less straightforward to calculate the coverage of the transcriptome; this is because the true number and level of different transcript isoforms is not usually known and because transcription activity varies greatly across the genome. One study used the number of unique transcription start sites as a measure of coverage in mouse embryonic cells, and demonstrated that at 80 million reads, the number of start sites reached a plateau22 (FIG. 5b). However, this approach does not address transcriptome complexity in alternative splicing and transcription termination sites; presumably further sequencing can reveal additional variants.

[Figure 4 diagram labels: ACT1 ORF (5′ to 3′) and YFL040W ORF (3′ to 5′); Poly(A) tags on – strand; Poly(A) tags on + strand; Local heterogeneity; Distinct poly(A) sites: Site 1, Site 2]

Figure 4 | Poly(A) tags from RNA-Seq. A region containing two overlapping transcripts (ACT1, from the actin gene, and YFL040W, an uncharacterized ORF) from the Saccharomyces cerevisiae genome is shown. Arrows point to transcription direction. The poly(A) tags from RNA-Seq experiments are shown below these transcripts, with arrows indicating transcription direction. The precise location of each locus identified by poly(A) tags reveals the heterogeneity in poly(A) sites, for example, ACT1 has two big clusters, both with a few bases of local heterogeneity. The transcription direction revealed by poly(A) tags also helps to resolve 3′-end overlapping transcribed regions18.
New transcriptomic insights
Despite the challenges described above, the advantages of RNA-Seq have enabled us to generate an unprecedented global view of the transcriptome and its organization for a number of species and cell types. Before the advent of RNA-Seq, it was known that a much greater than expected fraction of the yeast, Drosophila melanogaster and human genomes are transcribed2,4,36, and for yeast and humans a number of distinct isoforms have been found for many genes2,4. However, the starts and ends of most transcripts and exons had not been precisely resolved and the extent of spliced heterogeneity remained poorly understood. RNA-Seq, with its high resolution and sensitivity, has revealed many novel transcribed regions and splicing isoforms of known genes, and has mapped 5′ and 3′ boundaries for many genes.

Mapping gene and exon boundaries. The single-base resolution of RNA-Seq has the potential to revise many aspects of the existing gene annotation, including gene boundaries and introns for known genes as well as the identification of novel transcribed regions. 5′ and 3′ boundaries can be mapped to within 10–50 bases by a precipitous drop in signal. 3′ boundaries can be precisely mapped by searching for poly(A) tags, and introns can be mapped by searching for tags that span GT–AG splicing consensus sites. Using these methods the 5′ and 3′ boundaries of 80% and 85% of all annotated genes, respectively, were mapped in S. cerevisiae18. Similarly, in S. pombe many boundaries were defined by RNA-Seq data in combination with tiling array data19.

These two studies led to the discovery of many 5′ and 3′ UTRs that had not been analysed previously. In S. cerevisiae, extensive 3′-end heterogeneity was discovered at two levels: first, local heterogeneity exists in which a cluster of sites are involved, typically within a 10 bp window; second, there are distinct regions of poly(A) addition for 540 genes (FIG. 4). It is plausible that these different 3′ ends confer distinct properties to the different mRNA isoforms, such as mRNA localization or degradation signals, which in turn might be responsible for unique biological functions18,19. In addition to 3′ heterogeneity, the list of upstream ORFs within the 5′ UTRs of mRNAs (uORFs) was also greatly expanded from 17 to 340 (6% of yeast genes)18; uORFs regulate mRNA translation37 or stability38, so these sequences might make a previously underappreciated contribution to the regulatory sophistication of eukaryotic genomes. Interestingly, many mRNAs with uORFs encode transcription factors, suggesting that these regulators are themselves heavily regulated.

The mapping of transcript boundaries revealed several novel features of eukaryotic gene organization. Many yeast genes were found to overlap at their 3′ ends18. Using relaxed criteria similar to those employed in a recent study18 we found that 808 pairs, approximately 25% of all yeast ORFs, overlap at their 3′ ends18. Likewise, antisense expression is enriched in the 3′ exons of mouse transcripts22. These features might confer interesting regulatory properties on the affected genes. For multicellular organisms, antisense transcription could modulate gene expression through the production of siRNAs or through dsRNA editing39,40. For yeast, which seems to lack siRNA and dsRNA-editing functions, transcription from one gene might interfere with that from an overlapping gene, or coordinate gene expression through other mechanisms.

Extensive transcript complexity. RNA-Seq can be used to quantitatively examine splicing diversity by searching for reads that span known splice junctions as well as potential new ones. In humans, 31,618 known splicing events were confirmed (11% of all known splicing events) and 379 novel splicing events were discovered24.
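The two sequence signatures used for boundary mapping in the text above, poly(A) tags for 3′ ends and GT–AG dinucleotides for introns, can be sketched in a few lines. This is an illustrative sketch only: the function names, the `min_run` threshold of eight bases and the coordinate convention (0-based, end-exclusive) are assumptions, and a real pipeline would also need to confirm that the homopolymer run is absent from the genomic sequence.

```python
import re


def polya_tag(read, min_run=8):
    """Return 'A', 'T' or None depending on a terminal homopolymer run.

    A run of As at the end of the read suggests a poly(A) tail read from the
    sense strand; a run of Ts at the start suggests the reverse complement.
    min_run is an illustrative threshold, not a value from the text.
    """
    if re.search(r"A{%d,}$" % min_run, read):
        return "A"
    if re.match(r"T{%d,}" % min_run, read):
        return "T"
    return None


def has_gt_ag(genome, intron_start, intron_end):
    """Check whether genome[intron_start:intron_end] starts with GT and ends with AG."""
    intron = genome[intron_start:intron_end]
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


# Hypothetical sequences for illustration.
print(polya_tag("ACGTACGT" + "A" * 10))
genome = "ATGGCC" + "GTAAGTTTTCAG" + "GGTTAA"
print(has_gt_ag(genome, 6, 18))
```

Whether an A run or a T run is observed also indicates the strand of the transcript, which is how poly(A) tags help to resolve overlapping 3′ ends.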
[Figure 5 plots. Panel a: y-axis, Number of ORFs detected; x-axis, Number of uniquely mapped tags (million); labels: 50% coverage, 80% coverage, 90%. Panel b: x-axis, Number of mapped tags (million); series: ES, EB.]

Figure 5 | Coverage versus depth. a | 80% of yeast genes were detected at 4 million uniquely mapped RNA-Seq reads, and coverage reaches a plateau afterwards despite the increasing sequencing depth. Expressed genes are defined as having at least four independent reads from a 50-bp window at the 3′ end. Data is taken from REF. 18. b | The number of unique start sites detected starts to reach a plateau when the depth of sequencing reaches 80 million in two mouse transcriptomes. ES, embryonic stem cells; EB, embryonic body. Figure is modified, with permission, from REF. 22 (2008) Macmillan Publishers Ltd. All rights reserved.
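A coverage-versus-depth curve of the kind summarized in Figure 5a can be approximated by subsampling reads and counting how many genes pass the detection threshold. The sketch below is a hedged illustration: the function names, the input format (one gene identifier per uniquely mapped read that falls in the 3′-end window) and the fixed random seed are assumptions; only the threshold of four reads comes from the figure legend.

```python
import random
from collections import Counter


def detected_genes(read_genes, depth, min_reads=4, seed=0):
    """Count genes with at least min_reads reads in a random subsample.

    read_genes: list with one gene ID per uniquely mapped read
                (assumed to be pre-filtered to the 3'-end window).
    depth: number of reads to draw without replacement.
    """
    rng = random.Random(seed)
    sample = rng.sample(read_genes, depth)
    counts = Counter(sample)
    return sum(1 for n in counts.values() if n >= min_reads)


def saturation_curve(read_genes, depths, min_reads=4, seed=0):
    """Return (depth, genes detected) pairs for each requested depth."""
    return [(d, detected_genes(read_genes, d, min_reads, seed)) for d in depths]


# Hypothetical toy data: one abundant gene, one at threshold, one below it.
reads = ["g1"] * 100 + ["g2"] * 4 + ["g3"] * 2
curve = saturation_curve(reads, [10, 50, len(reads)])
```

Plotting such a curve shows where additional sequencing stops adding newly detected genes, although, as noted in the text, this kind of measure does not capture isoform-level complexity.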
Another study of human cells found 94,241 junctions, among which 4,096 were novel, and further demonstrated that the prevalent form of alternative splicing is exon skipping41. In mice, extensive alternative splicing was observed for 3,462 genes20. In addition, 42 splicing events that join exons from multiple mouse genes were detected22.

Novel transcription. Previous studies using transposon tagging and tiling microarrays have suggested that in the genomes of yeast, D. melanogaster and humans, there are many novel transcribed regions represented in poly(A)+ RNA2,36,42,43. However, the accuracy of the tiling array results is uncertain owing to concerns about cross-hybridization (see below). RNA-Seq, which does not suffer from problems with background noise, has confirmed that at least 75% and perhaps greater than 90% of the S. cerevisiae and S. pombe genomes are expressed18,19. In addition, results from RNA-Seq suggest the existence of a large number of novel transcribed regions in every genome surveyed, including the A. thaliana21, mouse20,22, human24, S. cerevisiae18 and S. pombe19 genomes. 487 and 453 novel transcripts have been discovered in S. cerevisiae and S. pombe, respectively18,19; for S. cerevisiae half of these were not identified using microarrays. Many of these novel transcribed regions in yeast do not seem to encode any protein, and their functions remain to be determined. The current sequencing depth is not sufficient to define the boundaries of novel transcript units in mammals; however, 30–40% of reads map to unannotated regions20,22,24. These novel transcribed regions, combined with many undiscovered novel splicing variants, suggest that there is considerably more transcript complexity than previously appreciated.
Glossary

Cap analysis of gene expression
(CAGE). Similar to SAGE, except that 5′-end information of the transcript is analysed instead of 3′-end information.

Contigs
A group of sequences representing overlapping regions from a genome or transcriptome.

dsRNA editing
Site-specific modification of a pre-mRNA by dsRNA-specific enzymes that leads to the production of variant mRNA from the same gene.

Genomic tiling microarray
A DNA microarray that uses a set of overlapping oligonucleotide probes that represent a subset of or the whole genome at very high resolution.

Massively parallel signature sequencing
(MPSS). A gene expression quantification method that determines 17–20-bp ‘signatures’ from the ends of a cDNA molecule using multiple cycles of enzymatic cleavage and ligation.

MicroRNA
(miRNA). Small RNA molecules that are processed from small hairpin RNA (shRNA) precursors that are produced from miRNA genes. miRNAs are 21–23 nucleotides in length and through the RNA-induced silencing complex they target and silence mRNAs containing imperfectly complementary sequence.

Piwi-interacting RNAs
(piRNA). Small RNA species that are processed from single-stranded precursor RNAs. They are 25–35 nucleotides in length and form complexes with the piwi protein. piRNAs are probably involved in transposon silencing and stem-cell function.

Quantitative PCR
(qPCR). An application of PCR to determine the quantity of DNA or RNA in a sample. The measurements are often made in real time and the method is also called real-time PCR.

Sequencing depth
The total number of all the sequence reads or base pairs represented in a single sequencing experiment or series of experiments.

Serial analysis of gene expression
(SAGE). A method that uses short ~14–20-bp sequence tags from the 3′ ends of transcripts to measure gene expression levels.

Short interfering RNA
(siRNA). RNA molecules that are 21–23 nucleotides long and that are processed from long double-stranded RNAs; they are functional components of the RNAi-induced silencing complex. siRNAs typically target and silence mRNAs by binding perfectly complementary sequences in the mRNA and causing their degradation and/or translation inhibition.

Spike-in RNA
A few species of RNA with known sequence and quantity that are added as internal controls in RNA-Seq experiments.
Defining transcription level
As RNA-Seq is quantitative, it can be used to determine RNA expression levels more accurately than microarrays. In principle, it is possible to determine the absolute quantity of every molecule in a cell population, and directly compare results between experiments. Several methods have been used for quantification. For RNA fragmentation followed by cDNA synthesis, which gives more uniform coverage of each exon, gene expression levels can be deduced from the total number of reads that fall into the exons of a gene, normalized by the length of exons that can be uniquely mapped24; for 3′-biased methods, read counts from a window near the 3′ end are used18. Gene expression levels determined by these methods closely correlate with qPCR and RNA spike-in controls.

One particularly powerful advantage of RNA-Seq is that it can capture transcriptome dynamics across different tissues or conditions without sophisticated normalization of data sets19,20,22. RNA-Seq has been used to accurately monitor gene expression during yeast vegetative growth18, yeast meiosis19 and mouse embryonic stem-cell differentiation22, to track gene expression changes during development, and to provide a ‘digital measurement’ of gene expression difference between different tissues20. Because of these advantages, RNA-Seq will undoubtedly be valuable for understanding transcriptomic dynamics during development and normal physiological changes, and in the analysis of biomedical samples, where it will allow robust comparison between diseased and normal tissues, as well as the subclassification of disease states.

Future directions
Although RNA-Seq is still in the early stages of use, it has clear advantages over previously developed transcriptomic methods. The next big challenge for RNA-Seq is to target more complex transcriptomes to identify and track the expression changes of rare RNA isoforms from all genes. Technologies that will advance achievement of this goal are paired-end sequencing, strand-specific sequencing and the use of longer reads to increase coverage and depth. As the cost of sequencing continues to fall, RNA-Seq is expected to replace microarrays for many applications that involve determining the structure and dynamics of the transcriptome.

Zhong Wang and Michael Snyder are at the Department of Molecular, Cellular and Developmental Biology, and Mark Gerstein is at the Department of Molecular Biophysics and Biochemistry, Yale University, 219 Prospect Street, New Haven, Connecticut 06520, USA.
Correspondence to M.S.
e-mail: michael.snyder@yale.edu
doi:10.1038/nrg2484
Published online 18 November 2008

1. Clark, T. A., Sugnet, C. W. & Ares, M. Jr. Genomewide analysis of mRNA processing in yeast using splicing-specific microarrays. Science 296, 907–910 (2002).
2. David, L. et al. A high-resolution map of transcription in the yeast genome. Proc. Natl Acad. Sci. USA 103, 5320–5325 (2006).
3. Yamada, K. et al. Empirical analysis of transcriptional activity in the Arabidopsis genome. Science 302, 842–846 (2003).
4. Bertone, P. et al. Global identification of human transcribed sequences with genome tiling arrays. Science 306, 2242–2246 (2004).
5. Cheng, J. et al. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science 308, 1149–1154 (2005).
6. Okoniewski, M. J. & Miller, C. J. Hybridization interactions between probesets in short oligo microarrays lead to spurious correlations. BMC Bioinformatics 7, 276 (2006).
7. Royce, T. E., Rozowsky, J. S. & Gerstein, M. B. Toward a universal microarray: prediction of gene expression through nearest-neighbor probe sequence identification. Nucleic Acids Res. 35, e99 (2007).
8. Boguski, M. S., Tolstoshev, C. M. & Bassett, D. E. Jr. Gene discovery in dbEST. Science 265, 1993–1994 (1994).
9. Gerhard, D. S. et al. The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC). Genome Res. 14, 2121–2127 (2004).
10. Velculescu, V. E., Zhang, L., Vogelstein, B. & Kinzler, K. W. Serial analysis of gene expression. Science 270, 484–487 (1995).
11. Harbers, M. & Carninci, P. Tag-based approaches for transcriptome research and genome annotation. Nature Methods 2, 495–502 (2005).
12. Kodzius, R. et al. CAGE: cap analysis of gene expression. Nature Methods 3, 211–222 (2006).
13. Nakamura, M. & Carninci, P. [Cap analysis gene expression: CAGE]. Tanpakushitsu Kakusan Koso 49, 2688–2693 (2004) (in Japanese).
14. Shiraki, T. et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc. Natl Acad. Sci. USA 100, 15776–15781 (2003).
15. Brenner, S. et al. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nature Biotechnol. 18, 630–634 (2000).
16. Peiffer, J. A. et al. A spatial dissection of the Arabidopsis floral transcriptome by MPSS. BMC Plant Biol. 8, 43 (2008).
17. Reinartz, J. et al. Massively parallel signature sequencing (MPSS) as a tool for in-depth quantitative gene expression profiling in all organisms. Brief. Funct. Genomic Proteomic 1, 95–104 (2002).
18. Nagalakshmi, U. et al. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320, 1344–1349 (2008).
19. Wilhelm, B. T. et al. Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature 453, 1239–1243 (2008).
20. Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods 5, 621–628 (2008).
21. Lister, R. et al. Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell 133, 523–536 (2008).
22. Cloonan, N. et al. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nature Methods 5, 613–619 (2008).
23. Marioni, J., Mason, C., Mane, S., Stephens, M. & Gilad, Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 11 Jun 2008 (doi:10.1101/gr.079558.108).
24. Morin, R. et al. Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. Biotechniques 45, 81–94 (2008).
25. Holt, R. A. & Jones, S. J. The new paradigm of flow cell sequencing. Genome Res. 18, 839–846 (2008).
26. Barbazuk, W. B., Emrich, S. J., Chen, H. D., Li, L. & Schnable, P. S. SNP discovery via 454 transcriptome sequencing. Plant J. 51, 910–918 (2007).
27. Vera, J. C. et al. Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Mol. Ecol. 17, 1636–1647 (2008).
28. Emrich, S. J., Barbazuk, W. B., Li, L. & Schnable, P. S. Gene discovery and annotation using LCM-454 transcriptome sequencing. Genome Res. 17, 69–73 (2007).
29. Dutrow, N. et al. Dynamic transcriptome of Schizosaccharomyces pombe shown by RNA–DNA hybrid mapping. Nature Genet. 40, 977–986 (2008).
30. Wu, J. Q. et al. Systematic analysis of transcribed loci in ENCODE regions using RACE sequencing reveals extensive transcription in the human genome. Genome Biol. 9, R3 (2008).
31. Li, R., Li, Y., Kristiansen, K. & Wang, J. SOAP: short oligonucleotide alignment program. Bioinformatics 24, 713–714 (2008).
32. Li, H., Ruan, J. & Durbin, R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 19 Aug 2008 (doi:10.1101/gr.078212.108).
33. Smith, A. D., Xuan, Z. & Zhang, M. Q. Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics 9, 128 (2008).
34. Hillier, L. W. et al. Whole-genome sequencing and variant discovery in C. elegans. Nature Methods 5, 183–188 (2008).
35. Campbell, P. J. et al. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nature Genet. 40, 722–729 (2008).
36. Manak, J. R. et al. Biological function of unannotated transcription during the early development of Drosophila melanogaster. Nature Genet. 38, 1151–1158 (2006).
37. Hinnebusch, A. G. Translational regulation of GCN4 and the general amino acid control of yeast. Annu. Rev. Microbiol. 59, 407–450 (2005).
38. Ruiz-Echevarria, M. J. & Peltz, S. W. The RNA binding protein Pub1 modulates the stability of transcripts containing upstream open reading frames. Cell 101, 741–751 (2000).
39. Tomari, Y. & Zamore, P. D. MicroRNA biogenesis: drosha can’t cut it without a partner. Curr. Biol. 15, R61–64 (2005).
40. Bass, B. L. How does RNA editing affect dsRNA-mediated gene silencing? Cold Spring Harb. Symp. Quant. Biol. 71, 285–292 (2006).
41. Sultan, M. et al. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 321, 956–960 (2008).
42. Ross-Macdonald, P. et al. Large-scale analysis of the yeast genome by transposon tagging and gene disruption. Nature 402, 413–418 (1999).
43. Kumar, A., des Etages, S. A., Coelho, P. S., Roeder, G. S. & Snyder, M. High-throughput methods for the large-scale analysis of gene function by transposon tagging. Methods Enzymol. 328, 550–574 (2000).

Acknowledgements
We thank D. Raha for many valuable comments.

FURTHER INFORMATION
Gerstein laboratory homepage: http://bioinfo.mbb.yale.edu
Snyder laboratory homepage: http://www.yale.edu/snyder
454 Life Science: http://www.454.com
Applied Biosystems: www.appliedbiosystems.com
Helicos Biosciences: http://www.helicosbio.com
Illumina: http://www.illumina.com
Illumina forum: http://www.illumina.com/pagesnrn.ilmn?iD=245
SEQanswers: http://seqanswers.com/forums/showthread.php?t=43
All links are active in the online PDF
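As a supplementary illustration of the exon-based quantification described under "Defining transcription level", the sketch below divides the number of reads falling in a gene's exons by the length of exon sequence that can be uniquely mapped. The function name, the input dictionaries and the per-kilobase scaling are assumptions made for illustration; no correction for total library size is applied here, so values are comparable only within one sample.

```python
def length_normalized_expression(exon_read_counts, mappable_exon_lengths):
    """Reads per kilobase of uniquely mappable exon sequence, per gene.

    exon_read_counts: dict mapping gene -> reads falling in its exons.
    mappable_exon_lengths: dict mapping gene -> uniquely mappable exon length (bp).
    Genes with no uniquely mappable exon sequence are skipped.
    """
    expression = {}
    for gene, reads in exon_read_counts.items():
        length = mappable_exon_lengths.get(gene, 0)
        if length <= 0:
            continue  # cannot normalize without mappable sequence
        expression[gene] = reads / (length / 1000.0)
    return expression


# Hypothetical example: equal read counts, but g2 has twice the exon length.
counts = {"g1": 200, "g2": 200}
lengths = {"g1": 1000, "g2": 2000}
levels = length_normalized_expression(counts, lengths)
```

In the toy example the longer gene receives half the expression value of the shorter one, which is the point of the normalization: longer transcripts yield more fragments at the same molar abundance.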