
Article

Low Dimensionality in Gene Expression Data


Enables the Accurate Extraction of Transcriptional
Programs from Shallow Sequencing
Authors
Graham Heimberg, Rajat Bhatnagar,
Hana El-Samad, Matt Thomson

Correspondence
hana.el-samad@ucsf.edu (H.E.-S.),
matthew.thomson@ucsf.edu (M.T.)

In Brief
We develop a mathematical framework
that delineates how parameters such as
read depth and sample number influence
the error in transcriptional program
extraction from mRNA-sequencing data.
Our analyses reveal that gene expression
modularity facilitates low error at
surprisingly low read depths, arguing that
increased multiplexing of shallow
sequencing experiments is a viable
approach for applications such as single-
cell profiling of entire tumors.

Highlights

• Mathematical model reveals impact of mRNA-seq read depth on gene expression analysis

• Modularity in gene expression facilitates robust transcriptional program extraction

• Model suggests dramatic increases in sample multiplexing for many applications

• Read depth calculator determines parameters for optimal experimental design

Heimberg et al., 2016, Cell Systems 2, 239–250
April 27, 2016 © 2016 The Authors
http://dx.doi.org/10.1016/j.cels.2016.04.001
Cell Systems

Article

Low Dimensionality in Gene Expression Data


Enables the Accurate Extraction of Transcriptional
Programs from Shallow Sequencing
Graham Heimberg,1,2,3,4,5 Rajat Bhatnagar,1,3,5 Hana El-Samad,1,3,* and Matt Thomson3,4,*
1Department of Biochemistry and Biophysics, California Institute for Quantitative Biosciences, University of California, San Francisco, San Francisco, CA 94158, USA


2Integrative Program in Quantitative Biology, University of California, San Francisco, San Francisco, CA 94158, USA
3Center for Systems and Synthetic Biology, University of California, San Francisco, San Francisco, CA 94158, USA
4Department of Cellular and Molecular Pharmacology, University of California, San Francisco, San Francisco, CA 94158, USA
5Co-first author

*Correspondence: hana.el-samad@ucsf.edu (H.E.-S.), matthew.thomson@ucsf.edu (M.T.)


http://dx.doi.org/10.1016/j.cels.2016.04.001

SUMMARY

A tradeoff between precision and throughput constrains all biological measurements, including sequencing-based technologies. Here, we develop a mathematical framework that defines this tradeoff between mRNA-sequencing depth and error in the extraction of biological information. We find that transcriptional programs can be reproducibly identified at 1% of conventional read depths. We demonstrate that this resilience to noise of shallow sequencing derives from a natural property, low dimensionality, which is a fundamental feature of gene expression data. Accordingly, our conclusions hold for ~350 single-cell and bulk gene expression datasets across yeast, mouse, and human. In total, our approach provides quantitative guidelines for the choice of sequencing depth necessary to achieve a desired level of analytical resolution. We codify these guidelines in an open-source read depth calculator. This work demonstrates that the structure inherent in biological networks can be productively exploited to increase measurement throughput, an idea that is now common in many branches of science, such as image processing.

INTRODUCTION

All measurements, including biological measurements, contain a tradeoff between precision and throughput. In sequencing-based measurements like mRNA-sequencing (mRNA-seq), precision is determined largely by the sequencing depth applied to individual samples. At high sequencing depth, mRNA-seq can detect subtle changes in gene expression including the expression of rare splice variants or quantitative modulations in transcript abundance. However, such precision comes at a cost, and sequencing transcripts from 10,000 single cells at deep sequencing coverage (10⁶ reads per cell) currently requires 2 weeks of sequencing on an Illumina HiSeq 4000.

Not all biological questions require such extreme technical sensitivity. For example, a catalog of human cell types and the transcriptional programs that define them can potentially be generated by querying the general transcriptional state of single cells (Trapnell, 2015). In principle, theoretical and computational methods could elucidate the tradeoff between sequencing depth and the granularity of the information that can be accurately extracted from samples. Accordingly, optimizing this tradeoff based on the granularity required by the biological question at hand would yield significant increases in the scale at which mRNA-seq can be applied, facilitating applications such as drug screening and whole-organ or tumor profiling.

The modern engineering discipline of signal processing has demonstrated that structural properties of natural signals can often be exploited to enable new classes of low-cost measurements. The central insight is that many natural signals are effectively low dimensional. Geometrically, this means that these signals lie on a noisy, low-dimensional manifold embedded in the observed, high-dimensional measurement space. Equivalently, this property indicates that there is a basis representation in which these signals can be accurately captured by a small number of basis vectors relative to the original measurement dimension (Donoho, 2006; Candes et al., 2006; Hinton and Salakhutdinov, 2006). Modern algorithms exploit the fact that the number of measurements required to reconstruct a low-dimensional signal can be far fewer than the apparent number of degrees of freedom. For example, in images of natural scenes, correlations between neighboring pixels induce an effective low dimensionality that allows high-accuracy image reconstruction even in the presence of considerable measurement noise such as point defects in many camera pixels (Duarte et al., 2008).

Like natural images, it has long been appreciated that biological systems contain structural features that can lead to an effective low dimensionality in data. Most notably, genes are commonly co-regulated within transcriptional modules; this produces covariation in the expression of many genes (Eisen et al., 1998; Segal et al., 2003; Bergmann et al., 2003). The widespread presence of such modules indicates that the natural dimensionality of gene expression is determined not by the number of genes in the genome but by the number of regulatory modules.

This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Figure 1. A Mathematical Model Reveals Factors Determining the Performance of Shallow mRNA-Seq
(A) mRNA-seq throughput as a function of sequencing depth per sample for a fixed sequencing capacity.
(B) Unsupervised learning techniques are used to identify transcriptional programs. We ask when and why shallow mRNA-seq can accurately identify tran-
scriptional programs.
(C) Decreasing sequencing depth adds measurement noise to the transcriptional programs identified by unsupervised learning. Our approach reveals that
dominant programs, defined as those that explain relatively large variances in the data, are tolerant to measurement noise.

By analogy to signal processing, this natural structure suggests that the lower effective dimensionality present in gene expression data can be exploited to make accurate, inexpensive measurements that are not degraded by noise. But when, and at what error tradeoff, can low dimensionality be leveraged to enable low-cost, high-information-content biological measurements?

Here, inspired by these developments in signal processing, we establish a mathematical framework that addresses the impact of reducing coverage depth, and hence increasing measurement noise, on the reconstruction of transcriptional regulatory programs from mRNA-seq data. Our framework reveals that shallow mRNA-seq, which has been proposed to increase mRNA-seq throughput by reducing sequencing depth in individual samples (Jaitin et al., 2014; Pollen et al., 2014; Kliebenstein, 2012) (Figure 1A), can be applied generally to many bulk and single-cell mRNA-seq experiments. By investigating the fundamental limits of shallow mRNA-seq, we define the conditions under which it has utility and complements deep sequencing.

Our analysis reveals that the dominance of a transcriptional program, quantified by the fraction of the variance it explains in the dataset, determines the read depth required to accurately extract it. We demonstrate that common bioinformatic analyses can be performed at 1% of traditional sequencing depths with little loss in inferred biological information at the level of transcriptional programs. We also introduce a simple read depth calculator that determines optimal experimental parameters to achieve a desired analytical accuracy. Our framework and computational results highlight the effective low dimensionality of gene expression, commonly caused by co-regulation of genes, as both a fundamental feature of biological data and a major underpinning of biological signals' tolerance to measurement noise (Figures 1B and 1C). Understanding the fundamental limits and tradeoffs involved in extracting information from mRNA-seq data will guide researchers in designing large-scale bulk mRNA-seq experiments and analyzing single-cell data where transcript coverage is inherently low.



RESULTS

Statistical Properties of Gene Expression Data Determine the Accuracy of Principal Component Analysis at Low Read Depth

To delineate the impact of sequencing depth on the analysis of mRNA-seq data, we developed a mathematical framework that models the performance of a common bioinformatics technique, transcriptional program identification, at low sequencing depth. We focus on transcriptional program identification as it is central in many analyses including gene set analysis, network reconstruction (Holter et al., 2001; Bonneau, 2008), and cancer classification (Alon et al., 1999; Shai et al., 2003; Patel et al., 2014), as well as the analysis of single-cell mRNA-seq data. Our model defines exactly how reductions in read depth corrupt the extracted transcriptional programs and determines the precise depth required to recover them with a desired accuracy.

Our analysis focuses on the identification of transcriptional programs from mRNA-seq data through principal component analysis (PCA), because of its prevalence in gene expression analysis (Alter et al., 2000; Ringner, 2008) and its fundamental similarities to other commonly used methods. A recent review called PCA the most widely used method for unsupervised clustering and noted that it has already been successfully applied in many single-cell genomics contexts (Trapnell, 2015). Additionally, research in the computer science community over the last decade has shown that many other unsupervised learning methods, including k-means, spectral clustering, and Locally Linear Embedding, are naturally related to PCA or its generalization, Kernel PCA (Ding and He, 2004; Ng et al., 2001; Ham et al., 2004; Bengio et al., 2004). Because of the deep connection between PCA and other unsupervised learning techniques, we expect that our conclusions in this section will extend to other methods of analysis (and we provide such parallel analysis in the Supplemental Information). Here, we focus on PCA because the well-defined theory behind it provides a unique opportunity to understand, analytically, the factors that determine the robustness of program identification to low-coverage sequencing noise.

PCA identifies transcriptional programs by extracting groups of genes that covary across a set of samples. Covarying genes are grouped into a gene expression vector known as a principal component. Principal components are weighted by their relative importance in capturing the gene expression variation that occurs in the underlying data. Decreasing sequencing depth introduces measurement noise into the gene expression data and corrupts the extracted principal components.

If the transcriptional programs obtained from shallow mRNA-seq data and deep mRNA-seq data are similar, then we can accurately perform many gene expression analyses at low depth while collecting data in much higher throughput (Figure 1). We therefore developed a mathematical model that quantifies how the principal components computed at low and high sequencing depths differ. The model reveals that performance of transcriptional program extraction at low read depth is specific to the dataset and even the program itself. It is the dominant transcriptional programs, which capture most variance, that are the most stable.

Formally, the principal components are defined as the eigenvectors of the gene expression covariance matrix, and the principal values λ_i are the associated eigenvalues that equal the variance of the data projected onto the component (Alter et al., 2000; Holter et al., 2001). We use perturbation theory to model how the eigenvectors of the gene expression covariance matrix change when measurement noise is added (Stewart and Sun, 1990; Shankar, 2012). We perform our analysis in units of normalized read counts for conceptual clarity (or normalized transcript counts where appropriate), but an identical analysis and error equation can be derived in FPKM units through a simple rescaling. The principal component error is defined as the deviation between the deep (pc_i) and shallow (p̂c_i) principal components,

$$\lVert \mathrm{pc}_i - \widehat{\mathrm{pc}}_i \rVert \approx \sqrt{\sum_{j \neq i} \left( \frac{\mathrm{pc}_i^{T} (\hat{C} - C)\, \mathrm{pc}_j}{\lambda_i - \lambda_j} \right)^{2}} \qquad \text{(Equation 1)}$$

where C and Ĉ are the covariance matrices obtained from deep and shallow mRNA-seq data, respectively. Equation 1 can be used to model the impact of shallow sequencing on any given mRNA-seq dataset. Moreover, qualitative analysis of the equation reveals the key factors that determine whether low-depth profiling will accurately identify transcriptional programs. As expected, this equation indicates that the principal component error depends on generic features including read depth and sample number, as these affect the difference between the shallow and deep covariance matrices in the numerator of Equation 1 (see the Supplemental Information, section 2.1). However, Equation 1 also reveals that the principal component error depends on a system-specific property: the relative magnitude of the principal values (captured by λ_i − λ_j). Since the principal values correspond to the variance in the data along a principal component, this term quantifies whether the information in the gene expression data is concentrated among a few transcriptional programs. When genes covary along a small number of principal axes, the dataset has an effective low dimensionality, i.e., the data are concentrated on a low-dimensional subspace, and transcriptional programs can be extracted even in the presence of sequencing noise.
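To make Equation 1 concrete, the following sketch estimates the predicted error for principal component i from a deep count matrix, simulating the shallow covariance by multinomial down-sampling. This is a minimal illustration, not the authors' released code; `X_deep` (a samples × genes count matrix) and `reads_per_sample` are hypothetical inputs, and in practice one would restrict to a tractable subset of genes before the eigendecomposition.

```python
import numpy as np

def pc_error_eq1(X_deep, reads_per_sample, i, seed=0):
    """Predicted error of principal component i (Equation 1)."""
    rng = np.random.default_rng(seed)

    # Normalized read counts: each sample sums to 1.
    P = X_deep / X_deep.sum(axis=1, keepdims=True)

    # Deep covariance matrix; principal values (eigenvalues) and
    # principal components (eigenvectors), sorted by decreasing variance.
    C = np.cov(P, rowvar=False)
    lam, pcs = np.linalg.eigh(C)
    order = np.argsort(lam)[::-1]
    lam, pcs = lam[order], pcs[:, order]

    # Shallow covariance from multinomial resampling at the target depth.
    shallow = np.array([rng.multinomial(reads_per_sample, p) for p in P])
    C_hat = np.cov(shallow / reads_per_sample, rowvar=False)

    # Equation 1: sum over j != i of (pc_i^T (C_hat - C) pc_j / (lam_i - lam_j))^2.
    proj = pcs.T @ (C_hat - C) @ pcs[:, i]
    mask = np.arange(len(lam)) != i
    return np.sqrt(np.sum((proj[mask] / (lam[i] - lam[mask])) ** 2))
```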
Mouse Tissues Can Be Distinguished at Low Depth in Bulk mRNA-Seq Samples

To understand the implications of this result in the context of an established mRNA-seq dataset, we applied Equation 1 to a subset of the mouse ENCODE data that uses deep mRNA-seq (>10⁷ reads per sample) to profile gene expression of 19 different mouse tissues with a biological replicate (Shen et al., 2012) (see the Experimental Procedures). The analysis revealed that the leading, dominant transcriptional programs could be extracted with <1% of the study's original read depth. Specifically, the first three principal components could be recovered with >80% accuracy (i.e., an error of 1 − 0.8 = 20%) with just 55,000 reads per experiment (Figures 2A and S1A). To reach 80% accuracy for all of the first nine principal components, only 145,000 reads were needed (Figure S1B). Increasing read depth further had diminishing returns for principal component accuracy.



Figure 2. Transcriptional States of Mouse Tissues Are Distinguishable at Low Read Coverage
(A) Principal component error as a function of read depth for selected principal components for the Shen et al. (2012) data. For the first three principal components, 1% of the traditional read depth is sufficient for achieving >80% accuracy. Improvements in error exhibit diminishing returns as read depth is increased. Less dominant transcriptional programs (principal components 8 and 15 shown) are more sensitive to sequencing noise.
(B) Variance explained by transcriptional program (blue) and differences between principal values (green) of the Shen et al. (2012) data. The leading, dominant
transcriptional programs have principal values that are well separated from later principal values, suggesting that these should be more robust to measurement
noise.
(C) GSEA significance for the top ten terms of principal component two (top) and three (bottom) as a function of read depth. 32,000 reads are sufficient to recover
all top ten terms in the first three principal components. (Analysis for first principal component shown in Figure S1C.)
(D) Projection of a subset of the Shen et al. (2012) tissue data onto principal components two and three. The ellipses represent uncertainty at specific read depths.
Similar tissues lie close together. Transcriptional program two separates neural tissues from non-neural tissues while transcriptional program three distinguishes
tissues involved in hematopoiesis from other tissues. This is consistent with the GSEA of these transcriptional programs in (C).

To increase the accuracy of the first three principal components an additional 5% (from 80% to 85%), 55% more reads were required. We confirmed these analytical results by simulating shallow mRNA-seq through direct sub-sampling of reads from the raw dataset (see the Experimental Procedures).

Further, as predicted by Equation 1, the dominant principal components were more robust to shallow sequencing noise than the trailing, minor principal components. This is a direct consequence of the fact that the leading principal values are well separated from other principal values, while the trailing values are spaced closely together. For instance, λ₁ is separated from other principal values by at least λ₁ − λ₂ = 5 × 10⁻⁶, more than two orders of magnitude greater than the minimum separation of λ₂₅ from other principal values (1.5 × 10⁻⁸) (Figure 2B). Therefore, the 25th principal component requires almost four million reads, 140 times more than the first principal component, to be recovered with the same 80% accuracy.

To explore whether the shallow principal components also retained the same biological information as the programs computed from deep mRNA-seq data, we compared results from Gene Set Enrichment Analysis applied to shallow and deep mRNA-seq data. At a read depth of 10⁷ reads per sample, the first three principal components have many significant functional enrichments, with the second and third principal components enriched for neural and hematopoietic processes, respectively (Figure 2C; see Figure S1C for the first principal component). These functional enrichments corroborate the separation seen when the gene expression profiles from each tissue are projected onto the second and third principal components (see the Experimental Procedures).



Neural tissues (cerebellum, cortex, olfactory, and embryonic day 14.5 [E14.5] brain) project along the second principal component while the hematopoietic tissues (spleen, liver, thymus, bone marrow, and E14.5 liver) project along the third principal component (Figure 2D).

The statistically significant enrichments of the first three principal components persisted at low sequencing depths. At <32,000 reads per sample, only 0.37% of the total reads, all ten of the top gene sets for these principal components passed our significance threshold of p < 10⁻⁴ (negative predictive value and positive predictive value in Figures S1D and S1E). To put this result in perspective, using only 32,000 reads per sample (corresponding to PCA accuracies of 81%, 79%, and 75% for the first three principal components, respectively) would allow a faithful recapitulation of functional enrichments while still multiplexing thousands of samples, rather than dozens, in a single Illumina HiSeq sequencing lane. Additionally, this low number of reads was still sufficient to separate the different cell types (Figure 2D). We obtained similar results when working in FPKM units, suggesting that the broad conclusions of our analysis are insensitive to gene expression units (Figures S1F, S1G, and S1H).

Transcriptional States in Single Cells Are Distinguishable with Less Than 1,000 Transcripts per Cell

We wanted to explore whether shallow mRNA-seq could also capture gene expression differences between individual single cells within a heterogeneous tissue, arguably a more challenging problem than distinguishing different bulk tissue samples. In addition to the biological importance of quantifying variability at the single-cell level, single-cell mRNA-seq data provide the necessary context for analyzing the performance of shallow sequencing for two reasons. First, single-cell mRNA-seq experiments are inherently low-depth measurements, as current methods can capture only a small fraction (~20%) (Shalek et al., 2014) of the ~300,000 transcripts (Velculescu et al., 1999) typically contained in individual cells. Second, since advances in microfluidics (Macosko et al., 2015) now facilitate the automated preparation of tens of thousands of individual cells for single-cell mRNA-seq, sequencing requirements impose a key bottleneck on the further scaling of single-cell throughput.

To probe the impact of sequencing depth reductions on single-cell mRNA-seq data, we analyzed a dataset characterizing 3,005 single cells from the mouse cerebral cortex and hippocampus (Zeisel et al., 2015) that were classified bioinformatically at full sequencing depth (average of ~15,000 unique transcripts per cell) into nine different neural and non-neural cell types. In addition to providing a rich biological context for analysis, this dataset allows for a quantitative analysis of low-depth transcriptional profiling as it incorporates molecular barcodes known as unique molecular identifiers (UMIs) that enable the precise counting of transcripts from each single cell. The Zeisel et al. (2015) data therefore allowed us to analyze the impact of sequencing depth reductions quantitatively in units of transcript counts rather than in the less precise unit of raw sequencing reads.

Similarly to the bulk tissue data, we found that leading principal components in single cells could be reconstructed with a small fraction of the total transcripts collected in the raw dataset. We focused our analysis on three classes of cell types (two classes of pyramidal neurons with similar gene expression profiles and oligodendrocytes) that are transcriptionally distinct. As the first three principal values were well separated from the others (Figure S2A), Equation 1 estimated that the first three principal components could be reconstructed with 11%, 22%, and 38% error, respectively, with just 1,000 transcripts per cell (Figure 3A).

We confirmed this result computationally. With just 100 unique transcripts, we were able to separate oligodendrocytes from the two classes of pyramidal neurons with >90% accuracy. With 1,000 unique transcripts per cell, we were able to distinguish pyramidal neurons of the hippocampus from those of cortex with the same >90% accuracy (Figure 3B). The different depths required to distinguish these subclasses of neural and non-neural cell types reflect the differing robustness of the corresponding principal components. The first principal component captures a broad distinction between oligodendrocytes and pyramidal cell types (Figure 3C, left) and is the most robust to low read depths. The third principal component captures a more fine-grained distinction between pyramidal neurons but is less robust than the first principal component at low read depth and hence requires more coverage. This is consistent with biological intuition: more depth is required to distinguish between pyramidal neural subtypes than between oligodendrocytes and pyramidal neurons.

We next asked how contributions of individual genes to a principal component change as a function of read depth. For every principal component, we derived a null model consisting of the distribution of the individual gene weightings, called loadings, from a shuffled version of the data (see the Experimental Procedures). Comparing the data to the null model, we found that at a depth of ~340 transcripts, >80% of genes significantly associated with the first principal component could still be detected (Figures 3C and 3D; Experimental Procedures). At just 100 transcripts per cell, we were still able to identify oligodendrocyte markers, such as myelin-associated oligodendrocyte basic protein (Mobp) and myelin-associated glycoprotein (Mag), as well as neural markers, such as Neuronal differentiation 6 (Neurod6) and Neurogranin (Nrgn), as statistically significant, and reliably classify these distinct cell types. However, below 100 transcripts per cell, cell-type classification becomes inaccurate, and this is correlated with markers such as Neurod6 being no longer statistically associated with the first principal component.

We were able to reach similar conclusions with three other single-cell mRNA-seq datasets (Shalek et al., 2013; Treutlein et al., 2014; Kumar et al., 2014). With similarly low sequencing depths, we were able to distinguish transcriptional states of single cells collected across stages of the developing mouse lung (Figures S2B–S2D), wild-type mouse embryonic stem cells from stem cells with a single gene knockout (Figures S2E–S2G), and heterogeneity within a population of bone-marrow-derived dendritic cells (Figures S2H–S2J). These results were also not PCA-specific. We additionally examined two of these datasets with t-distributed Stochastic Neighbor Embedding (t-SNE) and Locally Linear Embedding (LLE), two nonlinear alternatives to PCA (Van der Maaten and Hinton, 2008; Roweis and Saul, 2000), and achieved successful classification of transcriptional states (Figures S2K and S2L), in each case recapitulating the results of the original studies with fewer than 5,000 reads per cell.
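A minimal version of this down-sample-and-classify experiment might look as follows, assuming an integer cell × gene UMI count matrix `X` and full-depth cell-type labels `labels`; this is a sketch of the general approach, not the authors' pipeline.

```python
import numpy as np

def centroid_accuracy(X, labels, n_transcripts, n_pcs=3, seed=0):
    """Cell-type classification accuracy after down-sampling to a fixed depth."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)

    # Keep cells with enough transcripts, then sample each cell down to
    # n_transcripts without replacement (UMIs are counted molecules).
    keep = X.sum(axis=1) >= n_transcripts
    X, labels = X[keep], labels[keep]
    Xs = np.array([rng.multivariate_hypergeometric(c, n_transcripts) for c in X])

    # PCA of the down-sampled data via SVD of the centered matrix.
    Xc = Xs / n_transcripts
    Xc = Xc - Xc.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:n_pcs].T

    # Nearest-centroid classification in the low-dimensional PC space.
    centroids = {l: Z[labels == l].mean(axis=0) for l in np.unique(labels)}
    pred = [min(centroids, key=lambda l: np.linalg.norm(z - centroids[l]))
            for z in Z]
    return np.mean(np.asarray(pred) == labels)
```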



Figure 3. Transcriptional States of Single Cells in the Mouse Brain Are Distinguishable at Low Transcript Coverage
(A) Principal component error as a function of read depth for selected principal components for the Zeisel et al. (2015) data.
(B) Accuracy of cell type classification as a function of transcripts per cell. Accuracy plateaus with increasing transcript coverage. At 1,000 transcripts per cell, all
three cell types can be distinguished with low error. At 100 transcripts per cell, pyramidal cells cannot be distinguished from each other, while oligodendrocytes
remain distinct.
(C) Covariance matrix of genes with high absolute loadings in the first principal component (left). The genes with the 100 highest positive and 100 lowest negative
loadings are displayed. The first principal component is enriched for genes indicative of oligodendrocytes and neurons (middle). Gene significance as a function of
transcript count for the first principal component (right).
(D) True and false detection rates as a function of transcript count for genes significantly associated with the first three principal components. Below 100
transcripts per cell, false positives are common.

These results suggest that low dimensionality enables high accuracy classification at low read depth across many methods.

Gene Expression Covariance Induces Tolerance to Shallow Sequencing Noise

In the datasets we considered, the dominant noise-robust principal components corresponded directly to large modules of covarying genes. Such modules are common in gene expression data (Eisen et al., 1998; Alter et al., 2000; Bergmann et al., 2003; Segal et al., 2003). We therefore studied the contribution of modularity to principal component robustness in a simple, mathematical model of gene expression (Supplemental Information, section 2.2). Our analysis showed that the variance explained by a principal component, and hence its noise tolerance, increases with the covariance of genes within the associated module (Figure 4A) and also the number of genes in the module (Figures S3A–S3C). While highly expressed genes also contribute to noise tolerance, in the Shen et al. (2012) dataset we found little correlation between the expression level of a gene and its contribution to the error of the first principal component (R² = 0.13; Figure S3D).

This analysis predicts that the large groups of tightly covarying genes observed in the Shen et al. (2012) and Zeisel et al. (2015) datasets will contribute significantly to principal value separation and noise tolerance. To directly quantify the contribution of covariance to principal value separation in these data, we randomly shuffled the sample labels for each gene. In the shuffled data, genes vary independently, which eliminates gene-gene covariance and raises the effective dimensionality of the data.
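The shuffling operation itself is simple; a sketch of the per-gene permutation and the comparison of eigenvalue spectra (assuming a samples × genes matrix `X`):

```python
import numpy as np

def shuffle_genes(X, seed=0):
    """Independently permute each gene (column) across samples.

    Marginal gene distributions are preserved, but gene-gene covariance
    is destroyed, raising the effective dimensionality of the data.
    """
    rng = np.random.default_rng(seed)
    return rng.permuted(X, axis=0)  # each column shuffled independently

# Compare eigenvalue spectra before and after shuffling (descending order):
# lam      = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
# lam_shuf = np.linalg.eigvalsh(np.cov(shuffle_genes(X), rowvar=False))[::-1]
```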



Figure 4. Modularity of Gene Expression Enables Accurate, Low-Depth Transcriptional Program Identification
(A) Variance explained and covariance matrix for increasing gene expression covariance in a model.
(B) Variance explained by different principal components for the Zeisel et al. (2015) dataset. Covariance matrix shows large modules of covarying genes (middle).
Dominant transcriptional programs are robust to low-coverage profiling as predicted by the model (bottom). Shuffling the dataset destroys the modular structure, resulting in noise-sensitive transcriptional programs. For the shuffled data, 4,250 transcripts are required for 80% accuracy of the first three principal components, whereas 340 transcripts suffice for the original dataset.

In contrast to the natural, low-dimensional data, the principal values of the resulting data were nearly uniform in magnitude. This significantly diminished the differences between the leading principal values within the shuffled data (Figure 4B, top).

Consequently, reconstruction of the principal components became more read-depth intensive. For instance, to recover the first principal component with 80% accuracy from the shuffled Zeisel et al. (2015) data, 12.5 times more transcripts are required than for the unshuffled data (Figure 4B, bottom). We reached a similar conclusion for the mouse ENCODE data, where shuffling also decreased the differences between the leading principal values and the rest, causing a 23-fold increase in the sequencing depth required to recover the first principal component with 90% accuracy (Figure S4).

Large-Scale Survey Reveals that Shallow mRNA-Seq Is Widely Applicable due to Gene-Gene Covariance

Both our analysis of Equation 1 and our computational investigations of mRNA-seq datasets suggest that high gene-gene covariances increase the distance of leading principal values from the rest, thereby enabling the recovery of dominant principal components at low mRNA-seq read depths. This finding, if a common phenomenon, suggests that shallow mRNA-seq may be rigorously employed when answering many biological questions. To assess whether our findings are broadly applicable, we performed a broad computational survey of available gene expression data.

Since both gene covariances and principal values are fundamental properties of the biological systems under study, these quantities may be analyzed using the wealth of microarray datasets available, leveraging a larger collection of gene expression datasets as compared to mRNA-seq (see Figure S5A for analyses of several mRNA-seq datasets). We selected 352 gene expression datasets from the GEO (Edgar et al., 2002) spanning three species (yeast, 20 datasets; mouse, 106 datasets; and human, 226 datasets) that each contained at least 20 samples and were performed on the Affymetrix platform.

Despite the differences between these datasets in terms of species and collection conditions, they all possessed favorable principal value distributions reflecting an effective low dimensionality. For instance, on average the first principal value was roughly twice as large as the second principal value, and together the first five principal values explained a significant majority of the variance, suggesting that these datasets contain a few, dominant principal components (Figure 5A, left). By shuffling these datasets to reorder the sample labels for each gene, we again found that these principal components emerge from gene-gene covariance.



Figure 5. Gene Expression Survey of 352 Public Datasets Reveals Broad Tolerance of Bioinformatics Analysis to Shallow Profiling
(A) Variance explained by the first five transcriptional programs of 352 published yeast, mouse, and human microarray datasets (left). Shuffling microarray
datasets removes gene-gene covariance and destroys the relative dominance of the leading transcriptional programs. Read depth required to recover with 80%
accuracy the first five principal components of the 352 datasets (right). Removing gene expression covariance from the data requires a median of approximately
ten times more reads to achieve the same accuracy.
(B) Accuracy of GSEA of the human microarray datasets at low read depth (100,000 reads, i.e., 1% of the deep read depth). Reactome pathway database gene sets are correctly identified (blue) or not identified (yellow) at low read depth (false positives in red). ~80% of gene sets can be correctly recovered at 100,000 reads.
(C) Accuracy of GSEA as a function of read depth.

We related this pattern of dominant principal components to the ability to recover biological information with shallow mRNA-seq in these datasets. To generate synthetic mRNA-seq data from these microarray datasets, we applied a probabilistic model to simulate mRNA-seq at a given read depth (see the Experimental Procedures). We found that with only 60,000 reads per sample, 84% of the 352 datasets have ≤20% error in their first principal component. This translates into an average of almost 1,000% read depth savings to recover the first principal component with an acceptable PCA error tolerance of 20% (Figure 5A, right). By applying gene set enrichment analysis (GSEA) to the first principal component of each of the 352 datasets at low (100,000 reads per sample) and high read depths (10 million reads per sample), we found that >60% of gene set enrichments were retained with only 1% of the reads (Figures 5B and 5C). This analysis demonstrates that biological information was also retained at low depth.

Collectively, our analyses demonstrate that the success of low-coverage sequencing relies on a few dominant transcriptional programs. We also show that many gene expression datasets contain such noise-resistant programs as determined by PCA and identified them with dominant dimensions in the dataset. Furthermore, low dimensionality and noise robustness are properties of the gene expression datasets themselves and exist independent of the choice of analysis technique. Therefore, unsupervised learning methods other than PCA would reach similar conclusions, an expectation we verified using non-negative matrix factorization (Figure S5B).
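A minimal sketch of the probabilistic microarray-to-mRNA-seq simulation used in the survey above, treating each sample's intensities as relative transcript abundances (the exact noise model is specified in the Experimental Procedures and Supplemental Information):

```python
import numpy as np

def simulate_mrna_seq(intensities, reads_per_sample, seed=0):
    """Synthetic mRNA-seq counts from a samples x genes intensity matrix.

    Each sample's intensities are normalized to relative transcript
    abundances, and reads are then drawn multinomially at the target depth.
    """
    rng = np.random.default_rng(seed)
    p = intensities / intensities.sum(axis=1, keepdims=True)
    return np.array([rng.multinomial(reads_per_sample, pi) for pi in p])
```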



The Read Depth Calculator: A Quantitative Framework for Selecting Optimal mRNA-Seq Read Depth and Number of Biological Samples

Because the optimal choice of read depth in an mRNA-seq experiment is of widespread practical relevance, we developed a read depth calculator that can provide quantitative guidelines for shallow mRNA-seq experimental design. Having pinpointed the factors that determine the applicability of shallow mRNA-seq, we applied this understanding to determine the read depth and number of biological samples to profile when designing an experiment. To do so, we simplified the principal component error described by Equation 1 by assuming that the principal values of mRNA-seq data are well separated, i.e., that the ratio between consecutive principal values λ_{i+1}/λ_i is small (as defined in the Supplemental Information, section 2.1), an assumption justified by our large-scale microarray survey (see Figures S5C and S5D). These assumptions enable us to provide simple guidelines for making important experimental decisions, for example, choosing read depth, N:

$$N \approx \frac{k^2}{n\, \lambda_i^2\, \lVert \mathrm{pc}_i - \widehat{\mathrm{pc}}_i \rVert^2} \qquad \text{(Equation 2)}$$

where n is the number of biological samples and k is a constant that can be estimated from existing data (see the Supplemental Information, section 2.1 for a derivation of this equation and its limitations). This relationship can be understood intuitively. First, Equation 2 states that the principal component error decreases with read depth, a consequence of the well-known fact that the signal-to-noise ratio of a Poisson random variable is proportional to √N. The read depth also depends on λ_i, which comes from the λ_i − λ_j term of Equation 1. Finally, the influence of the sample number n on read depth follows from the definition of covariance as an average over samples. (Figure S5E shows that n is approximately statistically uncorrelated with principal values across the microarray datasets.)

Equation 2 has implications for optimizing the tradeoff between read depth and sample number in single-cell mRNA-seq experiments. As principal component error depends on the product of read depth and number of samples, error in mRNA-seq analyses can be reduced equivalently in two ways, by either increasing the total number of profiled cells or the transcript coverage. To illustrate this point, we computationally determined the error in the first principal component of the single-cell mouse brain data from Zeisel et al. (2015) as a function of cell number. Consistent with Equation 2, our calculations show that increasing the number of profiled cells reduces error in the first principal component (Figure 6A). Furthermore, we show that with the Zeisel et al. (2015) data, multiple different experimental configurations with the same total number of transcripts can yield the same principal component error. For example, 100,000 transcripts divided between either 50 or 400 cells both yield a principal component error of ~20%. This result is of particular relevance in single-cell experiments because transcript depth per cell is currently limited by a ~20% mRNA capture efficiency, and so cannot be easily increased (Shalek et al., 2014). In such cases, limited sequencing resources might be best used to sequence more cells at low depth rather than allocating sequencing resources to oversampling a few thousand unique transcripts.

Experimentalists can use the read depth calculator to predict requirements for read depth or sample number in high-throughput transcriptional profiling given their desired accuracy, based on the statistics of principal value separation in our global survey. Figure 6B shows the reads required for desired accuracies and an assumed principal value for a human transcriptional experiment with 100 samples (typical values for the first five principal values for human are indicated in dashed lines). As an illustration, a hypothetical experiment with a typical first principal value of 1.4 × 10⁻⁵ (the median principal value from the 226 human microarray datasets) and 100 samples where 80% PCA accuracy is tolerable requires less than 5,000 reads per experiment, or less than 500,000 reads in total, occupying less than 0.125% of a single sequencing lane in the Illumina HiSeq 4000.

The predictions from this analytically derived read depth calculator are demonstrably accurate. We compared the analytically predicted number of reads required for 80% PCA accuracy in the first five transcriptional programs to the value determined through simulated shallow mRNA-seq for 226 microarray and 4 mRNA-seq human datasets. We determined k empirically by fitting 50% of the datasets. Cross-validation with the remaining 50% of the datasets showed remarkable agreement between the analytical predictions and computationally determined values. In these calculations, the analytically predicted number of reads required to reach 80% accuracy deviates from the depth required in simulation by less than 10% (Figure 6C). The read depth calculator is available online (http://thomsonlab.github.io/html/formula.html).

Finally, while we use the first principal component for illustration, Equation 2 can be applied to any principal component, including the trailing principal components. Recent work discusses a statistical method to identify those principal components that are likely to be informative, and this work can be used in conjunction with Equation 2 to pinpoint the relevant principal components and the sequencing parameters needed to estimate them satisfactorily (Klein et al., 2015).
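Equation 2 reduces to a one-line function; in this sketch `k` is the empirically fitted constant described above, left as a placeholder, and the commented example reuses the worked numbers from the text:

```python
def reads_required(error, lam_i, n_samples, k):
    """Equation 2: reads per sample needed for a target PC error.

    error:     tolerable principal component error (0.2 for 80% accuracy)
    lam_i:     principal value of the transcriptional program of interest
    n_samples: number of biological samples (or cells) profiled
    k:         constant fit from existing data (Supplemental Information, 2.1)
    """
    return k**2 / (n_samples * lam_i**2 * error**2)

# Worked example from the text (k must first be fit from data):
# reads_required(error=0.2, lam_i=1.4e-5, n_samples=100, k=...)
```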
DISCUSSION

Single-cell transcriptional profiling is a technology that holds the promise of unlocking the inner workings of cells and uncovering the roots of their individuality (Klein et al., 2015; Macosko et al., 2015). We show that for many applications that rely on the determination of transcriptional programs, biological insights can be recapitulated at a fraction of the widely proposed high read depths. Our results are based on a rigorous mathematical framework that quantifies the tradeoff between read depth and accuracy of transcriptional program identification. Our analytical results pinpoint gene-gene covariance, a ubiquitous biological property, as the key feature that enables uncompromised performance of unsupervised gene expression analysis at low read depth. The same mathematical framework also leads to practical methods to determine the optimal read depth and sample number for the design of mRNA-seq experiments.

Given the principal values that we observe in the human microarray datasets, our analysis suggests that one can profile tens of thousands of samples, as opposed to dozens, while still being able to accurately identify transcriptional programs.



Figure 6. Mathematical Framework Provides a Read Depth Calculator and Guidelines for Shallow mRNA-Seq Experimental Design
(A) Error in the first principal component of the Zeisel et al. (2015) dataset for varying cell number and read-depth. Black circles denote a fixed number of total
transcripts (100,000). Error can be reduced by either increasing transcript coverage or the number of cells profiled.
(B) Number of reads required (color) to achieve a desired error (y axis) for a given principal value (x axis). Typical principal values (dashed black vertical lines) are
the medians across the 352 gene expression datasets.
(C) Error of the read depth calculator (Equation 2) across 176 gene expression datasets used for validation (out of 352 total). The calculator predicts the number of
reads to achieve 80% PCA accuracy in each dataset (colored dots). The predicted values closely agree with simulated results, with the median error <10% for the
first five transcriptional programs.

At this scale, researchers can perform entire chemical or genetic knockout screens or profile all ~1,000 cells in an entire Caenorhabditis elegans, 40 times over, in a single 400,000,000 read lane on the Illumina HiSeq 4000. Because shallow mRNA-based screens would provide information at the level of transcriptional programs and not individual genes, complementing these experiments by careful profiling of specific genes with targeted mRNA-seq (Fan et al., 2015) or samples of interest with conventional deep sequencing would provide a more complete picture of the relevant biology.

Fundamentally, our results rely on a natural property of gene expression data: its effective low dimensionality. We observed that gene expression datasets often have principal values that span orders of magnitude independently of the measurement platform and that this property is responsible for the noise tolerance of early principal components. These leading, noise-robust principal components are effectively a small number of dimensions that dominate the biological phenomena under investigation. These insights are consistent with previous observations that were made following the advent of microarray technology (Eisen et al., 1998; Segal et al., 2003; Bergmann et al., 2003), proposing that low dimensionality arises from extensive covariation in gene expression. We suggest that the covariances and principal values in gene expression are determined by the architectural properties of the underlying transcriptional networks, such as the co-regulation of genes, and therefore it is the biological system itself that confers noise tolerance in shallow mRNA-seq measurements. Related work in neuroscience has explored the implications of hierarchical network architecture for learning the dominant dimensions of data (Saxe et al., 2013; Hinton and Salakhutdinov, 2006).



Discovering and exploiting low dimensionality to reduce uncertainty in measurements is at the heart of modern signal processing techniques (Donoho, 2006; Candes et al., 2006). These methods first found success in imaging applications, where low dimensionality arises from the statistics and redundancies of natural images, enabling most images to be accurately represented by a small number of wavelets or other basis functions. Our results suggest that shallow mRNA-seq is similarly enabled by an inherent low dimensionality in gene expression datasets that emerges from groups of covarying genes. Just as only a few wavelets are needed to represent most images, only a few groups of transcriptional programs seem to be necessary to produce a coarse-grained representation of transcriptional state. We believe that the measurement of many diverse biological systems could benefit from the identification and analysis of hidden low-dimensional representations. For instance, proteome quantification, protein-protein interactions, and human genetic variant data all contain high levels of correlations, suggesting these datasets may all be effectively low dimensional. We anticipate new modes of biological inquiry as advances from signal processing are integrated into biological data analysis and as the underlying structural features of biological networks are exploited for large-scale measurements.
EXPERIMENTAL PROCEDURES

Simulated Shallow Sequencing through Down-sampling of Reads
Transcriptional datasets were obtained from the GEO (Zeisel et al. [2015] was from http://www.linnarssonlab.org). mRNA-seq read counts were normalized by the total number of reads in the sample. For each read depth, we model the sequencing noise with a multinomial distribution. The Zeisel et al. (2015) data were sampled without replacement because of the unique molecular identifiers (see Supplemental Experimental Procedures).
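A sketch of this down-sampling procedure for one sample's deep count vector; raw reads are resampled with replacement (multinomial), while UMI transcript counts are drawn without replacement (multivariate hypergeometric). This is an illustration consistent with the description above, not the authors' released code.

```python
import numpy as np

def downsample_sample(counts, depth, umi=False, seed=0):
    """Simulate shallow sequencing from a vector of deep integer counts.

    Reads are resampled with replacement; UMI transcript counts are
    sampled without replacement, as for the Zeisel et al. (2015) data.
    """
    rng = np.random.default_rng(seed)
    if umi:
        return rng.multivariate_hypergeometric(counts, depth)
    return rng.multinomial(depth, counts / counts.sum())
```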
Finding Genes Significantly Associated with a Principal Component
We first generated a null distribution of gene loadings from the principal components of a shuffled, transcript-count matrix. All p values were computed with respect to this distribution; averages over 15 replicates are reported.
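A minimal sketch of this null-model procedure, assuming a cells × genes transcript-count matrix `X`; loading significance is assessed against loadings from independently permuted genes:

```python
import numpy as np

def loading_pvalues(X, i=0, n_perm=15, seed=0):
    """Empirical p values for gene loadings on principal component i.

    The null distribution pools loadings from n_perm shuffled matrices
    in which each gene is permuted independently across cells.
    """
    rng = np.random.default_rng(seed)

    def loadings(M):
        Mc = M - M.mean(axis=0)
        _, _, Vt = np.linalg.svd(Mc, full_matrices=False)
        return np.abs(Vt[i])

    obs = loadings(X)
    null = np.concatenate([loadings(rng.permuted(X, axis=0))
                           for _ in range(n_perm)])
    # p value: fraction of null loadings at least as large as the observed one.
    return np.array([(null >= o).mean() for o in obs])
```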
Gene Set Enrichment Analysis
GSEA was performed with 1,370 gene lists from MSigDB (Subramanian et al., 2005). The loadings of each principal component were collected in a distribution, and loadings within 2 SDs from the mean of this distribution were considered for analysis. We applied a hypergeometric test with a significance p value cutoff of 10⁻⁴.
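The hypergeometric test can be expressed directly with scipy; the gene universe size, gene-set membership, and selected high-loading genes here are placeholder inputs:

```python
from scipy.stats import hypergeom

def gene_set_pvalue(n_universe, gene_set, selected):
    """Hypergeometric enrichment p value for one gene set.

    n_universe: number of genes in the background universe
    gene_set:   set of gene IDs annotated to the pathway
    selected:   set of high-loading genes (2-SD cutoff applied upstream)
    """
    overlap = len(gene_set & selected)
    # P(X >= overlap) when drawing len(selected) genes without replacement.
    return hypergeom.sf(overlap - 1, n_universe, len(gene_set), len(selected))

# A gene set is then called significant at p < 1e-4, as in the text.
```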
SUPPLEMENTAL INFORMATION

Supplemental Information includes Supplemental Experimental Procedures and five figures and can be found with this article online at http://dx.doi.org/10.1016/j.cels.2016.04.001.

AUTHOR CONTRIBUTIONS

G.H., H.E.-S., and M.T. conceived the idea. G.H. wrote the simulations and analyzed data, with input from M.T. and H.E.-S. R.B. and M.T. performed theoretical analysis. R.B. wrote the mathematical proofs. The manuscript was written by G.H., R.B., H.E.-S., and M.T.

ACKNOWLEDGMENTS

The authors would like to thank Jason Kreisberg, Alex Fields, David Sivak, Patrick Cahan, Jonathan Weissman, Chun Ye, Michael Chevalier, Satwik Rajaram, and Steve Altschuler for careful reading of the manuscript; Eric Chow, John Haliburton, Sisi Chen, and Emeric Charles for their experimental insights; and Paul Rivaud for website design assistance. This work was supported by the UCSF Center for Systems and Synthetic Biology (NIGMS P50 GM081879). H.E.-S. acknowledges support from the Paul G. Allen Family Foundation. M.T. acknowledges support from the NIH Office of the Director, the National Cancer Institute, and the National Institute of Dental and Craniofacial Research (NIH DP5 OD012194).

Received: November 30, 2015
Revised: March 8, 2016
Accepted: April 4, 2016
Published: April 27, 2016

REFERENCES

Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., and Levine, A.J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 96, 6745–6750.

Alter, O., Brown, P.O., and Botstein, D. (2000). Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl. Acad. Sci. USA 97, 10101–10106.

Bengio, Y., Delalleau, O., Le Roux, N., Paiement, J.-F., Vincent, P., and Ouimet, M. (2004). Learning eigenfunctions links spectral embedding and kernel PCA. Neural Comput. 16, 2197–2219.

Bergmann, S., Ihmels, J., and Barkai, N. (2003). Iterative signature algorithm for the analysis of large-scale gene expression data. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 67, 031902.

Bonneau, R. (2008). Learning biological networks: from modules to dynamics. Nat. Chem. Biol. 4, 658–664.

Candes, E.J., Romberg, J.K., and Tao, T. (2006). Stable signal recovery from incomplete and inaccurate measurements. Commun. Pure Appl. Math. 59, 1207–1223.

Ding, C., and He, X. (2004). K-means clustering via principal component analysis. ICML Proceedings of the 21st International Conference on Machine Learning (ACM), p. 29.

Donoho, D.L. (2006). Compressed sensing. IEEE Trans. Inf. Theory 52, 1289–1306.

Duarte, M.F., Davenport, M.A., Takbar, D., Laska, J.N., Sun, T., Kelly, K.F., and Baraniuk, R.G. (2008). Single-pixel imaging via compressive sampling. IEEE Signal Process. Mag. 25, 83–91.

Edgar, R., Domrachev, M., and Lash, A.E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 30, 207–210.

Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95, 14863–14868.

Fan, H.C., Fu, G.K., and Fodor, S.P.A. (2015). Expression profiling. Combinatorial labeling of single cells for gene expression cytometry. Science 347, 1258367.

Ham, J., Lee, D.D., Mika, S., and Scholkopf, B. (2004). A kernel view of the dimensionality reduction of manifolds. ICML Proceedings of the 21st International Conference on Machine Learning (ACM), p. 47.

Hinton, G.E., and Salakhutdinov, R.R. (2006). Reducing the dimensionality of data with neural networks. Science 313, 504–507.

Holter, N.S., Maritan, A., Cieplak, M., Fedoroff, N.V., and Banavar, J.R. (2001). Dynamic modeling of gene expression data. Proc. Natl. Acad. Sci. USA 98, 1693–1698.

Jaitin, D.A., Kenigsberg, E., Keren-Shaul, H., Elefant, N., Paul, F., Zaretsky, I., Mildner, A., Cohen, N., Jung, S., Tanay, A., and Amit, I. (2014). Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science 343, 776–779.



Klein, A.M., Mazutis, L., Akartuna, I., Tallapragada, N., Veres, A., Li, V., Peshkin, L., Weitz, D.A., and Kirschner, M.W. (2015). Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201.

Kliebenstein, D.J. (2012). Exploring the shallow end; estimating information content in transcriptomics studies. Front. Plant Sci. 3, 213.

Kumar, R.M., Cahan, P., Shalek, A.K., Satija, R., DaleyKeyser, A.J., Li, H., Zhang, J., Pardee, K., Gennert, D., Trombetta, J.J., et al. (2014). Deconstructing transcriptional heterogeneity in pluripotent stem cells. Nature 516, 56–61.

Macosko, E.Z., Basu, A., Satija, R., Nemesh, J., Shekhar, K., Goldman, M., Tirosh, I., Bialas, A.R., Kamitaki, N., Martersteck, E.M., et al. (2015). Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214.

Ng, A.Y., Jordan, M.I., and Weiss, Y. (2001). On spectral clustering: analysis and an algorithm. In Advances in Neural Information Processing Systems (MIT Press), pp. 849–856.

Patel, A.P., Tirosh, I., Trombetta, J.J., Shalek, A.K., Gillespie, S.M., Wakimoto, H., Cahill, D.P., Nahed, B.V., Curry, W.T., Martuza, R.L., et al. (2014). Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344, 1396–1401.

Pollen, A.A., Nowakowski, T.J., Shuga, J., Wang, X., Leyrat, A.A., Lui, J.H., Li, N., Szpankowski, L., Fowler, B., Chen, P., et al. (2014). Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat. Biotechnol. 32, 1053–1058.

Ringner, M. (2008). What is principal component analysis? Nat. Biotechnol. 26, 303–304.

Roweis, S.T., and Saul, L.K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326.

Saxe, A.M., Mcclelland, J.L., and Ganguli, S. (2013). Learning hierarchical category structure in deep neural networks. Proceedings of the 35th Annual Meeting of the Cognitive Science Society, pp. 1271–1276.

Segal, E., Shapira, M., Regev, A., Pe'er, D., Botstein, D., Koller, D., and Friedman, N. (2003). Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat. Genet. 34, 166–176.

Shai, R., Shi, T., Kremen, T.J., Horvath, S., Liau, L.M., Cloughesy, T.F., Mischel, P.S., and Nelson, S.F. (2003). Gene expression profiling identifies molecular subtypes of gliomas. Oncogene 22, 4918–4923.

Shalek, A.K., Satija, R., Adiconis, X., Gertner, R.S., Gaublomme, J.T., Raychowdhury, R., Schwartz, S., Yosef, N., Malboeuf, C., Lu, D., et al. (2013). Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells. Nature 498, 236–240.

Shalek, A.K., Satija, R., Shuga, J., Trombetta, J.J., Gennert, D., Lu, D., Chen, P., Gertner, R.S., Gaublomme, J.T., Yosef, N., et al. (2014). Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. Nature 510, 363–369.

Shankar, R. (2012). Principles of Quantum Mechanics (Springer Science & Business Media).

Shen, Y., Yue, F., McCleary, D.F., Ye, Z., Edsall, L., Kuan, S., Wagner, U., Dixon, J., Lee, L., Lobanenkov, V.V., and Ren, B. (2012). A map of the cis-regulatory sequences in the mouse genome. Nature 488, 116–120.

Stewart, G.W., and Sun, J. (1990). Matrix Perturbation Theory (Academic Press).

Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., and Mesirov, J.P. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 102, 15545–15550.

Trapnell, C. (2015). Defining cell types and states with single-cell genomics. Genome Res. 25, 1491–1498.

Treutlein, B., Brownfield, D.G., Wu, A.R., Neff, N.F., Mantalas, G.L., Espinoza, F.H., Desai, T.J., Krasnow, M.A., and Quake, S.R. (2014). Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature 509, 371–375.

Van der Maaten, L., and Hinton, G. (2008). Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605.

Velculescu, V.E., Madden, S.L., Zhang, L., Lash, A.E., Yu, J., Rago, C., Lal, A., Wang, C.J., Beaudry, G.A., Ciriello, K.M., et al. (1999). Analysis of human transcriptomes. Nat. Genet. 23, 387–388.

Zeisel, A., Munoz-Manchado, A.B., Codeluppi, S., Lonnerberg, P., La Manno, G., Jureus, A., Marques, S., Munguba, H., He, L., Betsholtz, C., et al. (2015). Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142.



Cell Systems, Volume 2

Supplemental Information

Low Dimensionality in Gene Expression Data Enables the Accurate Extraction of Transcriptional Programs from Shallow Sequencing

Graham Heimberg, Rajat Bhatnagar, Hana El-Samad, and Matt Thomson


SUPPLEMENTAL INFORMATION

Table of Contents

Supplemental Figures: Figure S1, Figure S2, Figure S3, Figure S4, Figure S5
1. Supplemental Figure Legends
2. Supplemental Theory
2.1 Shallow sequencing
2.2 Gene expression modules
3. Supplemental Experimental Procedures
Supplemental References
Figure S1 (plot residue; images not recoverable from this extraction). Panels: (A) principal component error versus reads per sample for programs 1, 2, 3, 8, and 15; (B) principal component error versus principal component index at read depths from 10 × 10³ to 10,000 × 10³ reads; (C) GSEA significance (−log10 p) versus reads per sample for principal component 1 gene sets (tubulin folding intermediates, prefoldin-mediated transfer of substrate, regulation of the actin cytoskeleton, axon guidance, protein folding, cell cycle, mitotic prometaphase, pathogenic Escherichia coli infection, oocyte meiosis, N-cadherin pathway); (D, E) negative and positive predictive values versus reads per sample for principal components 1–3; (F) principal component error in FPKM units; (G) principal value decay in FPKM units; (H) tissue separation on principal components 2 and 3 in FPKM units at 10,000, 32,000, and 70,000 reads.
Figure S2 (plot residue; images not recoverable). Panels: (A) variance explained and differences between principal values for the Zeisel et al. data; (B–D) Treutlein et al. data: variance explained, principal component error versus read depth, and projections onto principal components 1 and 2 at 3,200, 6,800, and 15,000 reads for lung cells from embryonic days 14.5, 16.5, and 18.5; (E–G) Kumar et al. data: variance explained, principal component error, and projections separating Dgcr8−/−, wild-type, and neural progenitor cells; (H–J) Shalek et al. data: variance explained, principal component error, and projections with a classification line separating mature from not-mature cells; (K, L) t-SNE and LLE embeddings of the Kumar et al. and Shalek et al. data.
Figure S3 (plot residue; images not recoverable). Panels: (A) model gene expression covariance matrix with within-cluster and between-cluster covariance; (B) first principal component loadings and gene variances under high, no, and negative covariance; (C) principal value magnitude with increasing within-cluster covariance; (D) principal component 1 loading error versus gene expression level in normalized read counts (R² = 0.13).
Figure S4 (plot residue; images not recoverable). Variance explained versus principal value index for the unshuffled and shuffled Shen et al. data; covariance matrices for modular versus non-modular (covariance-removed) data; and principal component error versus reads per sample, with ~100,000 reads reaching 90% accuracy for the noise-tolerant (modular) data versus ~2.3 million reads for the noise-sensitive (shuffled) data.
Figure S5 (plot residue; images not recoverable). Panels: (A) differences between consecutive principal values (λi − λi+1) versus index for the Shen, Treutlein, Shalek, Kumar, Chen, and Pollen datasets and the shuffled-data median; (B) median NMF error for the first three programs in yeast, mouse, and human datasets; (C, D) principal value decay (λi/λ1) for human data and shuffled human data; (E) first principal value (λ1) versus sample number (n) for yeast, mouse, and human (R² < 0.09 in all three species).
