Correspondence: hana.el-samad@ucsf.edu (H.E.-S.), matthew.thomson@ucsf.edu (M.T.)
In Brief
We develop a mathematical framework
that delineates how parameters such as
read depth and sample number influence
the error in transcriptional program
extraction from mRNA-sequencing data.
Our analyses reveal that gene expression
modularity facilitates low error at
surprisingly low read depths, arguing that
increased multiplexing of shallow
sequencing experiments is a viable
approach for applications such as single-
cell profiling of entire tumors.
Highlights
• Mathematical model reveals impact of mRNA-seq read depth on gene expression analysis
Article
Cell Systems 2, 239–250, April 27, 2016 © 2016 The Authors
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Figure 1. A Mathematical Model Reveals Factors Determining the Performance of Shallow mRNA-Seq
(A) mRNA-seq throughput as a function of sequencing depth per sample for a fixed sequencing capacity.
(B) Unsupervised learning techniques are used to identify transcriptional programs. We ask when and why shallow mRNA-seq can accurately identify transcriptional programs.
(C) Decreasing sequencing depth adds measurement noise to the transcriptional programs identified by unsupervised learning. Our approach reveals that
dominant programs, defined as those that explain relatively large variances in the data, are tolerant to measurement noise.
analogy to signal processing, this natural structure suggests that the lower effective dimensionality present in gene expression data can be exploited to make accurate, inexpensive measurements that are not degraded by noise. But when, and at what error tradeoff, can low dimensionality be leveraged to enable low-cost, high-information-content biological measurements?

Here, inspired by these developments in signal processing, we establish a mathematical framework that addresses the impact of reducing coverage depth, and hence increasing measurement noise, on the reconstruction of transcriptional regulatory programs from mRNA-seq data. Our framework reveals that shallow mRNA-seq, which has been proposed to increase mRNA-seq throughput by reducing sequencing depth in individual samples (Jaitin et al., 2014; Pollen et al., 2014; Kliebenstein, 2012) (Figure 1A), can be applied generally to many bulk and single-cell mRNA-seq experiments. By investigating the fundamental limits of shallow mRNA-seq, we define the conditions under which it has utility and complements deep sequencing.

Our analysis reveals that the dominance of a transcriptional program, quantified by the fraction of the variance it explains in the dataset, determines the read depth required to accurately extract it. We demonstrate that common bioinformatic analyses can be performed at 1% of traditional sequencing depths with little loss in inferred biological information at the level of transcriptional programs. We also introduce a simple read depth calculator that determines optimal experimental parameters to achieve a desired analytical accuracy. Our framework and computational results highlight the effective low dimensionality of gene expression, commonly caused by co-regulation of genes, as both a fundamental feature of biological data and a major underpinning of biological signals' tolerance to measurement noise (Figures 1B and 1C). Understanding the fundamental limits and tradeoffs involved in extracting information from mRNA-seq data will guide researchers in designing large-scale bulk mRNA-seq experiments and analyzing single-cell data where transcript coverage is inherently low.
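A minimal simulation makes the framework concrete. The sketch below (our own illustrative Python, not the authors' code) models shallow sequencing as multinomial sampling of reads from each sample's expression profile and scores recovery of a transcriptional program as the overlap between the first principal components of deep and shallow data; the dataset size, read depths, and module structure are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_counts(profiles, depth):
    """Model shallow sequencing: multinomial read sampling per sample."""
    probs = profiles / profiles.sum(axis=1, keepdims=True)
    return np.array([rng.multinomial(depth, p) for p in probs])

def first_pc(counts):
    """Leading principal component of log-transformed counts."""
    x = np.log1p(counts)
    x = x - x.mean(axis=0)
    return np.linalg.svd(x, full_matrices=False)[2][0]

# Toy dataset: 50 samples x 200 genes with one dominant co-varying module
n_samples, n_genes = 50, 200
latent = rng.normal(size=(n_samples, 1))               # shared program activity
module = np.concatenate([np.ones(40), np.zeros(160)])  # genes 0-39 co-vary
expression = np.exp(latent * module
                    + rng.normal(scale=0.1, size=(n_samples, n_genes)))

deep_pc = first_pc(simulate_counts(expression, 1_000_000))
overlaps = {}
for depth in (1_000, 10_000, 100_000):
    shallow_pc = first_pc(simulate_counts(expression, depth))
    overlaps[depth] = abs(deep_pc @ shallow_pc)  # 1.0 = perfect recovery
    print(f"{depth:>7} reads/sample: PC1 overlap = {overlaps[depth]:.3f}")
```

Because the module dominates the variance, its principal component is recovered well even at depths far below the deep reference.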
first three principal components an additional 5% (from 80% to 85%), 55% more reads were required. We confirmed these analytical results by simulating shallow mRNA-seq through direct sub-sampling of reads from the raw dataset (see the Experimental Procedures).

Further, as predicted by Equation 1, the dominant principal components were more robust to shallow sequencing noise than the trailing, minor principal components. This is a direct consequence of the fact that the leading principal values are well separated from other principal values, while the trailing values are spaced closely together. For instance, λ1 is separated from other principal values by at least λ1 − λ2 = 5 × 10⁻⁶, more than two orders of magnitude greater than the minimum separation of λ25 from other principal values (1.5 × 10⁻⁸) (Figure 2B). Therefore, the 25th principal component requires almost four million reads, 140 times more than the first principal component, to be recovered with the same 80% accuracy.

To explore whether the shallow principal components also retained the same biological information as the programs computed from deep mRNA-seq data, we compared results from Gene Set Enrichment Analysis applied to shallow and deep mRNA-seq data. At a read depth of 10⁷ reads per sample, the first three principal components have many significant functional enrichments, with the second and third principal components enriched for neural and hematopoietic processes, respectively (Figure 2C; see Figure S1C for the first principal component). These functional enrichments corroborate the separation seen when the gene expression profiles from each tissue are projected onto the second and third principal components (see the Experimental Procedures). Neural tissues (cerebellum,
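The separation argument can be checked numerically. The sensitivity of an eigenvector to noise scales inversely with the gap between its eigenvalue and its neighbors, a standard result of matrix perturbation theory; the toy computation below (our own sketch, with an invented spectrum and noise scale, not the paper's Equation 1) shows a well-separated leading component surviving a perturbation that scrambles a trailing, nearly degenerate one.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

# Covariance with a decaying spectrum: the leading principal values are
# well separated while the trailing ones are nearly degenerate.
eigenvalues = 1.0 / np.arange(1, n + 1) ** 2
basis = np.linalg.qr(rng.normal(size=(n, n)))[0]  # random orthonormal basis
cov = basis @ np.diag(eigenvalues) @ basis.T

# Small symmetric perturbation standing in for measurement noise
noise = rng.normal(scale=2e-4, size=(n, n))
noisy_cov = cov + (noise + noise.T) / 2

def components_desc(matrix):
    """Eigenvectors sorted by decreasing eigenvalue."""
    w, v = np.linalg.eigh(matrix)
    return v[:, np.argsort(w)[::-1]]

clean, perturbed = components_desc(cov), components_desc(noisy_cov)
overlaps = {i: abs(clean[:, i] @ perturbed[:, i]) for i in (0, 1, 24)}
for i, o in overlaps.items():
    print(f"PC{i + 1}: |overlap with noiseless component| = {o:.3f}")
```

The same noise that leaves PC1 essentially untouched mixes PC25 with its near-degenerate neighbors, mirroring the 140-fold difference in required reads described above.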
cell. These results suggest that low dimensionality enables high-accuracy classification at low read depth across many methods.

Gene Expression Covariance Induces Tolerance to Shallow Sequencing Noise
In the datasets we considered, the dominant noise-robust principal components corresponded directly to large modules of covarying genes. Such modules are common in gene expression data (Eisen et al., 1998; Alter et al., 2000; Bergmann et al., 2003; Segal et al., 2003). We therefore studied the contribution of modularity to principal component robustness in a simple, mathematical model of gene expression (Supplemental Information, section 2.2). Our analysis showed that the variance explained by a principal component, and hence its noise tolerance, increases with the covariance of genes within the associated module (Figure 4A) and also with the number of genes in the module (Figures S3A–S3C). While highly expressed genes also contribute to noise tolerance, in the Shen et al. (2012) dataset we found little correlation between the expression level of a gene and its contribution to the error of the first principal component (R² = 0.13; Figure S3D).

This analysis predicts that the large groups of tightly covarying genes observed in the Shen et al. (2012) and Zeisel et al. (2015) datasets will contribute significantly to principal value separation and noise tolerance. To directly quantify the contribution of covariance to principal value separation in these data, we randomly shuffled the sample labels for each gene. In the shuffled data, genes vary independently, which eliminates gene-gene covariance and raises the effective dimensionality of the data. In contrast to the natural, low-dimensional data, the principal values of the resulting data were nearly uniform in magnitude. This significantly diminished the differences between the leading principal values within the shuffled data (Figure 4B, top).

Consequently, reconstruction of the principal components became more read-depth intensive. For instance, to recover the first principal component with 80% accuracy from the shuffled Zeisel et al. (2015) data, 12.5 times more transcripts are required than for the unshuffled data (Figure 4B, bottom). We reached a similar conclusion for the mouse ENCODE data, where shuffling also decreased the differences between the leading principal values and the rest, causing a 23-fold increase in the sequencing depth required to recover the first principal component with 90% accuracy (Figure S4).

Large-Scale Survey Reveals that Shallow mRNA-Seq Is Widely Applicable due to Gene-Gene Covariance
Both our analysis of Equation 1 and our computational investigations of mRNA-seq datasets suggest that high gene-gene covariances increase the distance of leading principal values from the rest, thereby enabling the recovery of dominant principal components at low mRNA-seq read depths. This finding, if a common phenomenon, suggests that shallow mRNA-seq may be rigorously employed when answering many biological questions. To assess whether our findings are broadly applicable, we performed a broad computational survey of available gene expression data.

Since both gene covariances and principal values are fundamental properties of the biological systems under study, these quantities may be analyzed using the wealth of microarray datasets available, leveraging a larger collection of gene expression datasets as compared to mRNA-seq (see Figure S5A for analyses of several mRNA-seq datasets). We selected 352 gene expression datasets from the GEO (Edgar et al., 2002) spanning three species (yeast, 20 datasets; mouse, 106 datasets; and human, 226 datasets) that each contained at least 20 samples and were performed on the Affymetrix platform.

Despite the differences between these datasets in terms of species and collection conditions, they all possessed favorable principal value distributions reflecting an effective low dimensionality. For instance, on average the first principal value was roughly twice as large as the second principal value, and together the first five principal values explained a significant majority of the variance, suggesting that these datasets contain a few, dominant principal components (Figure 5A, left). By shuffling these datasets to reorder the sample labels for each gene, we again found that these principal components emerge from gene-gene covariance.

We related this pattern of dominant principal components to the ability to recover biological information with shallow mRNA-seq in these datasets. To generate synthetic mRNA-seq data from these microarray datasets, we applied a probabilistic model to simulate mRNA-seq at a given read depth (see the Experimental Procedures). We found that with only 60,000 reads per sample, 84% of the 352 datasets have ≤20% error in their first principal component. This translates into an average of almost 1,000% read depth savings to recover the first principal component with an acceptable PCA error tolerance of 20% (Figure 5A, right). By applying gene set enrichment analysis (GSEA) to the first principal component of each of the 352 datasets at low (100,000 reads per sample) and high read depths (10 million reads per sample), we found that >60% of gene set enrichments were retained with only 1% of the reads (Figures 5B and 5C). This analysis demonstrates that biological information was also retained at low depth.

Collectively, our analyses demonstrate that the success of low-coverage sequencing relies on a few dominant transcriptional programs. We also show that many gene expression datasets contain such noise-resistant programs, as determined by PCA, and identified them with dominant dimensions in the dataset. Furthermore, low dimensionality and noise robustness are properties of the gene expression datasets themselves and exist independent of the choice of analysis technique. Therefore, unsupervised learning methods other than PCA would reach similar conclusions, an expectation we verified using non-negative matrix factorization (Figure S5B).
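The shuffling experiment is easy to reproduce in miniature. In the sketch below (our own illustrative code; the dimensions and module size are invented), each gene's values are permuted independently across samples, which preserves every gene's marginal distribution but destroys gene-gene covariance and flattens the principal value spectrum.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_genes = 100, 500

# Modular data: genes 0-99 co-vary through a shared latent factor
data = rng.normal(scale=0.3, size=(n_samples, n_genes))
data[:, :100] += rng.normal(size=(n_samples, 1))

def principal_values(x):
    """Fraction of total variance explained by each principal component."""
    s = np.linalg.svd(x - x.mean(axis=0), compute_uv=False)
    return s**2 / np.sum(s**2)

# Permute each gene independently across samples: marginals are preserved,
# gene-gene covariance is destroyed
shuffled = np.column_stack([rng.permutation(data[:, g])
                            for g in range(n_genes)])

pv_orig, pv_shuf = principal_values(data), principal_values(shuffled)
print(f"original: PC1 explains {pv_orig[0]:.1%}, PC2 {pv_orig[1]:.1%}")
print(f"shuffled: PC1 explains {pv_shuf[0]:.1%}, PC2 {pv_shuf[1]:.1%}")
```

In the modular data, one component dominates; after shuffling, no component stands out, which is why far more reads are needed to pin down any single component.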
able to accurately identify transcriptional programs. At this scale, researchers can perform entire chemical or genetic knockout screens or profile all ~1,000 cells in an entire Caenorhabditis elegans, 40 times over, in a single 400,000,000-read lane on the Illumina HiSeq 4000. Because shallow mRNA-based screens would provide information at the level of transcriptional programs and not individual genes, complementing these experiments by careful profiling of specific genes with targeted mRNA-seq (Fan et al., 2015) or of samples of interest with conventional deep sequencing would provide a more complete picture of the relevant biology.

Fundamentally, our results rely on a natural property of gene expression data: its effective low dimensionality. We observed that gene expression datasets often have principal values that span orders of magnitude independently of the measurement platform, and that this property is responsible for the noise tolerance of early principal components. These leading, noise-robust principal components are effectively a small number of dimensions that dominate the biological phenomena under investigation. These insights are consistent with previous observations made following the advent of microarray technology (Eisen et al., 1998; Segal et al., 2003; Bergmann et al., 2003), proposing that low dimensionality arises from extensive covariation in gene expression. We suggest that the covariances and principal values in gene expression are determined by the architectural properties of the underlying transcriptional networks, such as the co-regulation of genes, and therefore it is the biological system itself that confers noise tolerance in shallow mRNA-seq measurements. Related work in neuroscience has explored the implications of hierarchical network architecture for learning the dominant dimensions of data (Saxe et al., 2013; Hinton and Salakhutdinov, 2006).
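The throughput arithmetic quoted above follows directly from the read budget. A back-of-the-envelope helper (our own sketch, not the read depth calculator described in the Experimental Procedures; the 10,000 reads-per-cell depth is an assumption implied by the quoted numbers) might look like:

```python
def samples_per_lane(lane_reads: int, reads_per_sample: int) -> int:
    """Number of samples (e.g., single cells) one sequencing lane can hold."""
    return lane_reads // reads_per_sample

# One Illumina HiSeq 4000 lane, as cited in the text
LANE_READS = 400_000_000

# At an assumed 10,000 reads per cell, a single lane covers 40,000 cells:
# all ~1,000 cells of a C. elegans, 40 times over.
cells = samples_per_lane(LANE_READS, 10_000)
print(cells, cells // 1_000)  # 40000 cells, 40 replicate animals
```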
Supplemental Information includes Supplemental Experimental Procedures and five figures and can be found with this article online at http://dx.doi.org/10.1016/j.cels.2016.04.001.

AUTHOR CONTRIBUTIONS

G.H., H.E.-S., and M.T. conceived the idea. G.H. wrote the simulations and analyzed data, with input from M.T. and H.E.-S. R.B. and M.T. performed theoretical analysis. R.B. wrote the mathematical proofs. The manuscript was written by G.H., R.B., H.E.-S., and M.T.

ACKNOWLEDGMENTS

The authors would like to thank Jason Kreisberg, Alex Fields, David Sivak, Patrick Cahan, Jonathan Weissman, Chun Ye, Michael Chevalier, Satwik Rajaram, and Steve Altschuler for careful reading of the manuscript; Eric Chow,

REFERENCES

Fan, H.C., Fu, G.K., and Fodor, S.P.A. (2015). Expression profiling. Combinatorial labeling of single cells for gene expression cytometry. Science 347, 1258367.

Ham, J., Lee, D.D., Mika, S., and Schölkopf, B. (2004). A kernel view of the dimensionality reduction of manifolds. Proceedings of the 21st International Conference on Machine Learning (ICML) (ACM), p. 47.

Hinton, G.E., and Salakhutdinov, R.R. (2006). Reducing the dimensionality of data with neural networks. Science 313, 504–507.

Holter, N.S., Maritan, A., Cieplak, M., Fedoroff, N.V., and Banavar, J.R. (2001). Dynamic modeling of gene expression data. Proc. Natl. Acad. Sci. USA 98, 1693–1698.

Jaitin, D.A., Kenigsberg, E., Keren-Shaul, H., Elefant, N., Paul, F., Zaretsky, I., Mildner, A., Cohen, N., Jung, S., Tanay, A., and Amit, I. (2014). Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science 343, 776–779.
Supplemental Information
Supplemental Figures
Figure S1. Variance explained as a function of reads per sample and program index; mouse ENCODE tissue profiles (bone marrow, liver, spleen, thymus, cerebellum, cortex, olfactory, MEF, mESC, heart, and E14.5 tissues) projected onto PC2 and PC3 at roughly 70,000 reads. (Figure image not preserved in this text.)
Figure S2. Principal value spectra and differences between principal values, variance explained, and low-read-depth projections onto the first two principal components for the Treutlein et al., Kumar et al., and Shalek et al. datasets; t-SNE and LLE embeddings of the Kumar et al. and Shalek et al. data. (Figure image not preserved.)
Figure S3. Model of modular gene expression with high, no, or negative within-cluster and between-cluster covariance; principal value magnitude vs. gene expression level (normalized read counts; R² = 0.13). (Figure image not preserved.)
Figure S4. Variance explained by principal value index for unshuffled vs. shuffled data; modular vs. non-modular (covariance removed) covariance structures. (Figure image not preserved.)
Figure S5. Median NMF error vs. read depth for yeast, mouse, and human datasets; principal value decay (λi/λ1) for human and shuffled human data; first principal value (λ1) vs. sample number (n) for yeast, mouse, and human. (Figure image not preserved.)