Académique Documents
Professionnel Documents
Culture Documents
In cap analysis gene expression (CAGE), short (20 nucleotide) sequence tags originating from the 5 end of full-length mRNAs are
sequenced to identify transcription events on a genome-wide scale. The rapid increase in the throughput of present-day sequencers
provides much deeper CAGE tag sequencing, where CAGE tags can be found multiple times for each mRNA in a given experiment.
CAGE tag counts can then be used to reliably estimate the cellular concentration of the corresponding mRNA. In contrast to microarray and SAGE expression profiling, CAGE identifies the location of each transcription start site in addition to its expression level.
This makes it possible for us to infer a genome-wide network of transcriptional regulation by searching the promoter region surrounding each CAGE-defined transcription start site for potential transcription factor binding sites. Hence, deep CAGE is a unique
tool for the construction of a promoter-based network of transcriptional regulation. CAGE-based expression profiling also allows us
to identify dynamic promoter usage in time-course experiments and the specific promoter regulated by a given transcription factor
in disruption experiments. The sheer size of the short-tag datasets produced by modern sequencers spurs a need for new software
development to handle the amount of data generated by next-generation sequencers. In addition, new visualization methods will be
needed to represent a promoter-based transcriptional network.
INTRODUCTION
Cap analysis gene expression (CAGE)
was introduced in 2003 as a method to
determine transcription start sites on
a genome-wide scale by isolating and
sequencing short sequence tags originating
from the 5 end of RNA transcripts (1).
Mapping these tags back to the reference
genome identifies the transcription start
sites from which the transcripts originated.
CAGE relies on a cap-trapper system to
capture full-length RNAs while avoiding
rRNA and tRNA transcripts. First, an oligodT primer is used to reverse-transcribe
poly-A terminated RNAs. Alternatively,
a random primer can be used for RNAs
without a poly-A tail, which may constitute
almost half of the transcriptome (2). RNA/
DNA double-stranded hybrids that contain
a mature mRNA are selected by biotinylating their 5 cap structure, allowing
capture by streptavidin-coated magnetic
beads. Ligation of a linker sequence
containing an MmeI recognition site to
the 5 end of the full-length cDNA creates
a restriction site about 20 nucleotides
Genome Exploration Research Group, RIKEN Genomic Sciences Center, Yokohama Institute, Yokohama, Kanagawa, Japan
Vol. 44 No. 5 2008
4/10/08 2:40:46 PM
Mini-Review
sponding RNA molecule in a digital form.
As one mRNA is not preferentially detected
over another, expression profiling based
on tag sequencing is unbiased, allowing
a direct comparison of the expression
values of different genes measured in
a single deep CAGE experiment. In
contrast, microarray fluorescence levels
are affected both by the mRNA concentration and by the probe-dependent mRNA
affinity, precluding a direct comparison
between genes. In addition, tag counts as
a measure of mRNA concentrations have a
dynamic range that is orders of magnitude
larger than microarray expression levels.
The accuracy and the dynamic range of
CAGE- and SAGE-derived expression
levels as well as the sensitivity of detecting
lowly expressed transcripts can be
improved further by deeper sequencing.
Importantly, microarray and qRT-PCR
expression profiling are restricted to those
transcripts for which a probe or primer
pair is available, whereas methodologies
based on tag sequencing can also measure
the expression of currently unknown
transcripts.
Deep CAGE expression profiling is
unique among high-throughput expression
profiling methods because the 5 end of the
CAGE tag identifies the corresponding
transcription start site. Hence, deep CAGE
allows us to determine the promoter driving
the transcription of each transcript in
addition to its expression level. In contrast,
tags generated by SAGE or MPSS are
located at the 3 end of the transcript and
do not identify the promoter, which may
lie tens of kilobases or more upstream in
the genome sequence. A variant of SAGE
Table 1. Sequencing Capacities of Current and Near-future Next-generation Sequencers, and the Estimated Corresponding Deep CAGE Throughput and Sensitivity
Throughput
CAGE
Sequencer
Base/Run
Base/Read
Read/Run
Base/Day
Mapped Tag/Run
Sensitivity at 99%
454
100,000,000
250
400,000
100,000,000
1,100,000
1 per 1 cell
Solexa
500,000,000
25
20,000,000
120,000,000
5,000,000
1 per 5 cells
SOLiD
1,400,000,000
35
40,000,000
180,000,000
13,000,000
1 per 14 cells
Helicos
10,000,000,000
30
300,000,000
1,000,000,000
150,000,000
The sensitivity of deep CAGE is assessed as the concentration per cell of an mRNA molecule that can be detected at 99% probability, assuming a total mRNA
concentration of 200,000 molecules per cell.
4/10/08 2:40:48 PM
Mini-Review
CAGE experiments, promoters were
constructed by clustering the 5 ends of
the individual CAGE tags based on the
distance between them on the genome (12).
In deep CAGE experiments, transcription
start site clustering may also take into
account the similarity in expression
profiles. Deep CAGE thus allows us to
define the individual promoters from
which genes are transcribed based on the
promoter activities.
Previous CAGE experiments indicate
that transcription units in mammalian
genomes are characterized by alternative
transcription start sites and multiple splice
forms that are active in different cellular
contexts (1216). In any given CAGE
experiment, some of the CAGE tags correspond to previously identified promoters,
whereas others are due to novel promoters
and transcripts that may give rise to novel
protein variants or contain alternative
localization or degradation signals. CAGE
has also led to the discovery of novel
noncoding RNAs (12,17).
CAGE expression profiling has
produced a wide variety of new insights
into the mammalian transcriptome and
its regulation. An analysis of CAGEdefined transcription start sites in human
and mouse illustrated that promoters
characterized by a TATA-box tend to
have a clear, single transcription start
site, whereas promoters associated with
CpG islands tend to have transcription
start sites distributed over a broad area
(14). CAGE expression profiling, with its
ability to localize the promoter as well as
determine its expression level, is ideally
suited to study the regulation of bidirectional promoters (18,19), which may be
coregulated by shared transcription factor
binding sites. Antisense transcripts, which
are transcribed in the opposite direction
to the coding strand, may play a role in
regulation via RNA interference as well as
gene silencing at the chromatin level (20
24). A genome-wide analysis of CAGE
transcriptome data revealed frequent
concordant regulation of sense/antisense
pairs (25).
While genes can have multiple
promoters in principle, their usage will
likely vary depending on cellular conditions. Time-course CAGE experiments
can be used to study the dynamic usage of
promoters by comparing the distribution
of expression levels over transcription
start sites in the upstream region of a gene
to discover which promoters are switched
4/10/08 2:40:49 PM
Mini-Review
The size of the datasets produced by
next-generation sequencers also poses a
challenge to data management systems. As
terabytes of image data may be generated
in a single run, saving all raw data
produced in one experiment may no longer
be an option. With billions of bases being
generated in a single run, even saving only
the sequence data will be a considerable
enterprise.
In addition to new software to analyze
deep CAGE data, the paradigm shift of
gene-based networks to promoter-based
networks of transcriptional regulation
requires new ways to visualize such
networks. Visualization software packages
such as Cytoscape (32) have previously
been developed to represent biomolecular
interaction networks and can be used
to draw gene regulatory networks. In a
promoter-based network, the targets of
regulatory interactions are the individual
promoters of a gene, which may be too
numerous for graphical representation
except in the most detailed drawings. The
situation is further compounded by the
multitude of transcripts that have been
identified for each gene. A multiscale
visualization approach in which users
can choose the level of detail at which
each gene is represented may be suitable
for visualizing promoter-based regulatory
networks.
ACKNOWLEDGMENTS
This work was supported by a
research grant from the RIKEN Genome
Exploration Research Project from the
Ministry of Education, Culture, Sports,
Science and Technology of the Japanese
Government to Y.H., a grant from the
Genome Network Project, also from the
Ministry of Education, Culture, Sports,
Science and Technology, and a grant from
the RIKEN Frontier Research System,
Functional RNA Research Program.
COMPETING INTERESTS
STATEMENT
The authors declare no competing
interests.
REFERENCES
1. Shiraki, T., S. Kondo, S. Katayama, K. Waki, T.
Kasukawa, H. Kawaji, R. Kodzius, A. Watahiki,
et al. 2003. Cap analysis gene expression for highthroughput analysis of transcriptional starting point
M. Oshimura. 2004. Dicer is essential for formation of the heterochromatin structure in vertebrate
cells. Nat. Cell Biol. 6:784-791.
22. Imamura, T., S. Yamamoto, J. Ohgane, N.
Hattori, S. Tanaka, and K. Shiota. 2004. Noncoding RNA directed DNA demethylation of Sphk1
CpG island. Biochem. Biophys. Res. Commun.
322:593-600.
23. Murrell, A., S. Heeson, and W. Reik. 2004.
Interaction between differentially methylated regions partitions the imprinted genes Igf2 and H19
into parent-specific chromatin loops. Nat. Genet.
36:889-893.
24. Andersen, A.A. and B. Panning. 2003. Epigenetic
gene regulation by noncoding RNAs. Curr. Opin.
Cell Biol. 15:281-289.
25. Katayama, S., Y. Tomaru, T. Kasukawa, K. Waki,
M. Nakanishi, M. Nakamura, H. Nishida, C.C.
Yap, et al. 2005. Antisense transcription in the mammalian transcriptome. Science 309:1564-1566.
26. Nilsson, R., V.B. Bajic, H. Suzuki, D. di Bernardo,
J. Bjrkegren, S. Katayama, J.F. Reid, M.J.
Sweet, et al. 2006. Transcriptional network dynamics in macrophage activation. Genomics 88:133142.
27. Altschul, S.F., W. Gish, W. Miller, E.W. Myers,
and D.J. Lipman. 1990. Basic local alignment
search tool. J. Mol. Biol. 215:403-410.
28. Ning, Z., A.J. Cox, and J.C. Mullikin. 2001.
SSAHA: a fast search method for large DNA databases. Genome Res. 11:1725-1729.
29. Lassmann, T., E. Arner, and C.O. Daub. 2008.
Manuscript in preparation.
30. Faulkner, G.J., A.R.R. Forrest, A.M. Chalk,
K. Schroder, Y. Hayashizaki, P. Carninci, D.A.
Hume, and S.M. Grimmond. 2008. A rescue
strategy for multimapping short sequence tags refines surveys of transcriptional activity by CAGE.
Genomics 91:281-288
31. Imanishi, T., T. Itoh, Y. Suzuki, C. ODonovan,
S. Fukuchi, K.O. Koyanagi, R.A. Barrero, T.
Tamura, et al. 2004. Integrative annotation of
21,037 human genes validated by full-length cDNA
clones. PLoS Biol. 2:e162.
32. Shannon, P., A. Markiel, O. Ozier, N.S. Baliga,
J.T. Wang, D. Ramage, N. Amin, B. Schwikowski,
and T. Ideker. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13:2498-2504.
4/10/08 2:40:53 PM