OF GENE REGULATION
A DISSERTATION
OF STANFORD UNIVERSITY
DOCTOR OF PHILOSOPHY
Jason Buenrostro
December 2015
© 2016 by Jason Daniel Buenrostro. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Gerald Crabtree
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Jin Li
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
This signature page was generated electronically upon submission of this dissertation in
electronic format. An original signed hard copy of the signature page is on file in
University Archives.
Acknowledgements
here I have had the immense privilege of working with many talented and amazing
people. That list begins with Will Greenleaf, my primary mentor and friend for the last 5
years in graduate school. Will has consistently provided thoughtful advice in science and
Throughout this time I have also had the privilege of mentorship from Howard Chang. I
thank Howard for his creative and thoughtful ideas, but more importantly his unwavering
I thank Stanford and the Stanford Genetics department for the opportunity to do
career. I also thank my faculty committee Mike Snyder, Lars Steinmetz, Jerry Crabtree,
Billy Li and Ravi Majeti for their service, insights and support. In addition, I thank Mike
Snyder for the opportunity to collaborate and for the opportunity to work with Beijing
I thank past mentors Hanlee Ji and Georges Natsoulis for their mentorship and an amazing
start to my scientific life. Specifically, I thank Hanlee Ji for my start in science, providing
2009. I also thank Samuel Myllykangas for his mentorship and friendship throughout
this time.
In addition to the long list of mentors, I thank many close collaborators for
making this work possible. I thank Lauren Chircus and Carlos Araya for their
determination, hard work and creativity, an essential component for the development of
the RNA array. I thank Paul Giresi for teaching me everything I know about chromatin.
Ulli Litzenberger, Dave Ruff and the Fluidigm team for their wonderful insight and
dedication to the development of single-cell ATAC-seq. I also thank new friends, Ryan
Corces and Ansu Satpathy, who have brought new and fresh perspectives to my scientific
thinking. I also deeply thank Beijing Wu; I am thankful for her unwavering dedication to
our work, personal support, and general thoughtfulness, without her much of this work
Specifically, my parents Miguel and Martha Buenrostro, who have made tremendous
sacrifices throughout our lives to provide me with the opportunity and preparation to
pursue my dreams. I also thank my brother, sisters, and niece, Michael, Michelle, Erika
and Sam, for their love, support and patience. I also thank my roommate and partner Sara
Prescott, who has been there for me at my best and my worst, and continues to be my
closest ally in science and in life. Lastly, I thank all of my friends, family, past mentors
and other collaborators, whom I regret not having enough space to mention here.
TABLE OF CONTENTS
Accessibility profiles of purified cell populations identify the ontogeny of human diseases
Discussion
Chapter 5 - Figures and Figure Legends
References
CHAPTER SIX The regulatory landscape of acute myeloid leukemia
Introduction
Leukemogenesis and cancer evolution in AML
AML represents a cooption of normal myelopoiesis
AML cell types exhibit lineage infidelity with regulatory contributions from multiple normal blood cell types
Generation of synthetic normal analogs for assessment of AML-specific biology
Mechanism and clinical consequences of pre-leukemic HSC clonal advantage
Discussion
Chapter 6 - Figures and Figure Legends
References
CHAPTER SEVEN Conclusion
Methods for gene regulation
Future work
CHAPTER ONE - Introduction
The human body is composed of a large collection of highly diverse cell types,
effecting diverse cellular processes such as chromatin accessibility, RNA localization and
nucleosome-free regulatory elements can have highly divergent interactions with gene
promoters in cis, acting as: i) activators, or enhancers, ii) repressors or iii) insulators,
Analogous principles hold true for RNA regulation, wherein RNA structure
defines the binding landscape of microRNAs (miRNAs) and RNA binding protein
RNAs can fold into simple 2D or complex 3D folded structures, which define permissive
wide understanding of these dynamic cellular structures would provide unique insight
into the binding determinants of trans-acting regulators, drivers of cellular function and
cellular potential.
Genome-wide methods
and genome-wide characterization of these diverse cellular processes. For example, high-
proteins (RIP-seq and CLIP-seq) [3], have been shown to be sensitive methods for
as described in the following sections, these methods are limited in several ways. In the
following thesis I will discuss the development of new methods, which focus on the
parameters defining a binding interaction. However, current methods are low throughput
and generalizable platform for performing biochemical assays of RNA called RNA-
parallel biochemistry platform. We use this platform to describe the kinetic parameters of
Measuring chromatin accessibility in rare cells
limiting their application to either cell lines or whole tissues. Applying these methods to
complex cellular populations, derived from tissues, averages over the rich diversity of
methods for measuring chromatin accessibility (ATAC-seq) (Fig. 1a) within rare cellular
within complex tissues can be isolated using flow cytometry and profiled using ATAC-
seq. However, this approach is also limited in that it requires established protocols for
partition cells into relevant subtypes de novo. Together, these assays offer an
Of particular importance to human health and disease, and an excellent model for
wherein a single hematopoietic stem cell (HSC) can give rise to a multitude of distinct
cellular populations ranging from enucleated red blood cells (RBCs) to specialized
immune cells (CD4 and CD8 T cells, B cells and more). Importantly, dysregulation of
this work we also apply ATAC-seq and scATAC-seq to normal human hematopoiesis
References
CHAPTER TWO Quantitative dissection of millions of sequence variants1
Introduction
estimated to bind RNA [3], and recent work has begun to uncover a web of RNA-protein
interactions [4-6] that can control gene expression through splicing, RNA localization, and
other post-transcriptional processes. Protein interactions with long noncoding RNAs also
proven powerful tools in synthetic biology, allowing gene expression control through
post-transcriptional regulation [10,11].
protein interactions lags behind our growing realization of their biological importance.
dimensional RNA structure [12-14] and set the landscape for interactions with RNA-binding
interactions, coupled with the relative paucity of data produced from current biophysical
requirements. Because the relationship between sequence and binding is often opaque,
1 Portions of this chapter were taken from Buenrostro et al. Quantitative analysis of RNA-protein
interactions on a massively parallel array reveals biophysical and evolutionary landscapes. Nature
Biotechnology. 2014. doi:10.1038/nbt.2880.
little is understood regarding the evolutionary constraints on these RNA structures,
array hybridization [18] and recently have been used to generate a catalogue of RNA binding
motifs [19]. While powerful, selection and sequencing methods bias results towards
high-activity variants and do not directly and quantitatively measure the biophysical
parameters that underlie biological function [20]. Recently, methods have been developed to
these instruments for high-throughput binding affinity assays across large DNA sequence
space. In this work, we have leveraged the Illumina DNA sequencing platform, an
TIRF imaging for massively parallel DNA sequencing [27], to create a platform for direct,
developed quantitative image analysis tools for large-scale analysis of these data, and
kinetics. We apply these methods to the MS2 coat protein [2,28-31], a system with widespread
applications in affinity purification [32], RNA imaging [33] and synthetic biology [10,11]. This
>10^7 RNA targets generated directly on the flow cell surface, providing massive
binding energies between primary and secondary structures, and quantitative analysis of
library containing an E. coli RNA polymerase (RNAP) initiation and stall sequence, and
a region coding for diverse sequence variants of the MS2 RNA hairpin synthesized using
RNAP initiation sequence. The barcoding strategy serves to identify individual molecules
within a population by uniquely tagging each molecule using a barcode library. We then
for multiple redundant measurements across the flow cell. The sequencing process
clonal DNA molecules on the flow cell surface [27], and provided the sequence and position
double stranded DNA (dsDNA) using DNA polymerase to extend a biotinylated primer.
We then saturated the flow cell with streptavidin to create a terminal biotin-streptavidin
single molecule investigations [35] designed to generate a single RNA per DNA template.
conditions, which allows RNAP to generate 26 bases of RNA (the footprint of RNAP)
before stalling at the first guanine on the DNA template strand. Second, we washed
excess RNA polymerase from solution and introduced all 4 nucleotides, allowing RNAP
to transcribe the variable region and stall at the biotin-streptavidin roadblock. This
procedure results in transcribed RNA tethered to its parent DNA via RNA polymerase
(Fig. 1a). The resulting RNA array contained 1.2 × 10^7 distinct RNA features comprising
over the RNA array, and imaged bound MS2 at equilibrium using total internal reflection
perfused 1.8 µM unlabeled MS2 and recorded the fluorescence decay caused by
dissociation (Fig. 1c). The high concentration of unlabeled MS2 protein blocks other
quantify bound MS2 we developed image analysis tools that cross-correlate cluster
centers from sequencing data to acquired images and fit the observed binding in each
cluster to a 2D Gaussian (Fig. S1 and S2). Using this approach, we quantified the
fluorescence signal for each cluster in 6,240 images representing 120 tiles imaged in two
fluorescence color channels across 11 equilibrium MS2 concentrations and 15
dissociation time points. Fluorescence signals from single clusters fit canonical
dissociation (Fig. 1d) and binding curves (Fig. 1e, f), yielding binding energy estimates
in excellent agreement with published measurements (R = 0.94, slope = 1.08, Fig. 1g).
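As a rough illustration of the per-cluster curve fitting described above, the following sketch fits a binding curve and a dissociation curve in plain Python. It assumes a single-site binding isotherm, F = fmax * c / (Kd + c), and a single-exponential decay, F = F0 * exp(-koff * t); the dissertation does not spell out its fitting equations in this passage, so these forms, the function names, and the synthetic data are all illustrative assumptions. Binding energies are expressed in units of kBT, which is also an assumption.

```python
import math

def fit_kd(concs, signal):
    """Least-squares fit of F = fmax * c / (Kd + c); concentrations must be positive."""
    def sse_and_fmax(log_kd):
        kd = math.exp(log_kd)
        x = [c / (kd + c) for c in concs]
        # For a fixed Kd, the best fmax has a closed form (linear least squares).
        fmax = sum(f * xi for f, xi in zip(signal, x)) / sum(xi * xi for xi in x)
        sse = sum((f - fmax * xi) ** 2 for f, xi in zip(signal, x))
        return sse, fmax

    # Coarse grid over log(Kd), then golden-section refinement around the best grid point.
    lo, hi = math.log(min(concs) / 100.0), math.log(max(concs) * 100.0)
    step = (hi - lo) / 200
    best = min((lo + i * step for i in range(201)), key=lambda g: sse_and_fmax(g)[0])
    a, b = best - step, best + step
    ratio = (math.sqrt(5) - 1) / 2
    for _ in range(60):
        c1, c2 = b - ratio * (b - a), a + ratio * (b - a)
        if sse_and_fmax(c1)[0] < sse_and_fmax(c2)[0]:
            b = c2
        else:
            a = c1
    log_kd = (a + b) / 2
    return math.exp(log_kd), sse_and_fmax(log_kd)[1]

def fit_koff(times, signal):
    """Off rate from the slope of ln(F) versus t (single-exponential assumption)."""
    y = [math.log(f) for f in signal]
    n = len(times)
    t_mean, y_mean = sum(times) / n, sum(y) / n
    slope = (sum((t - t_mean) * (v - y_mean) for t, v in zip(times, y))
             / sum((t - t_mean) ** 2 for t in times))
    return -slope

def ddg_kbt(kd_variant, kd_reference):
    """Change in binding energy relative to a reference, in units of kBT."""
    return math.log(kd_variant / kd_reference)

# Synthetic, noise-free data (hypothetical): 11 concentrations and 15 time points.
concs_nM = [0.1 * 3 ** i for i in range(11)]
binding = [1000.0 * c / (2.57 + c) for c in concs_nM]
kd, fmax = fit_kd(concs_nM, binding)

times_min = list(range(15))
decay = [500.0 * math.exp(-math.log(2) / 8.39 * t) for t in times_min]
koff = fit_koff(times_min, decay)
half_life = math.log(2) / koff
```

The synthetic inputs reuse the consensus values quoted in the Figure 1 legend (Kd = 2.57 nM, t1/2 = 8.39 minutes) only so that the recovered parameters are easy to check; real cluster data would be noisy and would likely need weighting and quality filters that this sketch omits.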
We calculated off rates (koff) for 3,029 sequences and dissociation constants (Kds)
for 129,248 sequences, encompassing 57 single (100%), 1,539 double (100%), and
24,181 triple (92.4%) mutants (Fig. 2a, b). To investigate how sequence variation in the
RNA hairpin impacts MS2 binding, we examined differential binding energies for all
binding energy change from all possible single-base changes at each position reveals a
sensitivity to mutation throughout the hairpin that complements the effects of mutating
individual residues on the binding surface of MS2 to alanine [36] (Fig. 2c). Specifically, we
observe high mutation sensitivity at base-paired positions near the loop and at specific
of mutations preserve hairpin structure and maintain high binding affinities (Fig. 2e). We
also observed negative epistasis in non-compensating mutants near the base of the stem,
Reciprocal mapping of positive epistasis signatures (1 s.d.) allowed de novo
reconstruction of the bound hairpin structure, identifying base-paired, loop, and bulge
pairing (secondary structure) to binding energy at each position in the hairpin with a
linear regression model from a set of 121 training sequences. This model provides two
free parameters for each unpaired base accounting for primary sequence changes in the
form of transitions or transversions. For each pair of interacting bases, the model
provides a total of 6 free parameters: one for transition and transversion of each base in
the pair (4 parameters) as well as one parameter to account for disruption due to the loss
transversions that occur while holding secondary structure constant) and secondary
To quantify the sensitivity for non-canonical base-pairing at positions in the hairpin stem,
we trained the model 8 separate times (once for each possible non-canonical pairing) with
one free parameter representing the energetic cost of the respective non-canonical
pairing. This re-fitting analysis allowed the model to incorporate a different energetic
penalty for having non-canonical base pairs at a specific position instead of the energetic
penalty for a full loss of base-pairing. In this analysis, G:U base pairs caused substantially
less disruption to the binding energy than other non-canonical base pairs (Fig. 3a),
consistent with the formation of a wobble base pair at G:U positions that allows partial
rescue of the secondary structure [12,37]. Our final model, which incorporated a free
parameter for G:U non-canonical base pairs, captured 92% of the variance in binding
energy of the training set and predicted the binding energy of second and third mutations
for variants with mutations in both paired and unpaired positions with correlation
secondary determinants of affinity across the RNA structure (Fig. 3c, d). Energetic
penalties for disrupting base-pairing increase with proximity to the loop, while
non-canonical G:U base pairs cause substantially less energetic disruption at the -8:-3 and
-11:-1 positions. Altering the primary sequence at -10A (bulge) and -4A (loop), residues
that interact with the Lys61 binding pocket on alternate halves of the dimer [29], confers
energetic costs that exceed those of disrupting the hairpin structure at any single base pair. We
also observed important roles for the -7A and -5C residues, consistent with stacking
interactions at these positions [38]. Altering the primary sequence on the 5′ side of the
hairpin confers a greater energetic penalty compared with altering the 3′ side, which we
contribute to measured ΔΔG values for all mutants with measurable kinetic data. We
association rates, [log(koff/Kd) = log(kon)]. Because log(koff) +
ΔΔG across single and double mutants (Fig. 4a). At the base of the hairpin, only a small
effect suggests that mutations at these positions modulate association rates, possibly by
causing fraying of the hairpin and/or allowing competition with alternate RNA structures,
is reinforced by examining log(koff) and log(kon) in this region (Fig. 4b, c).
Dissociation rates change little while inferred association rates remain similar to that of
the consensus sequence only for structures that maintain base-pairing through
population of structures with ΔΔG driven by association rates (Fig. 4d; P < 2.2 × 10^-16,
Wilcoxon signed rank test, μ = 0.5). These results suggest the kinetic drivers of observed
affinity changes are position-specific and often operate through modulating association
Discussion
have converted a high-throughput DNA sequencing flow cell into an RNA array for
scale. Using this quantitative deep mutational profiling approach we report, to our
knowledge, the largest collection of binding affinities and kinetic constants for an
questions, including i) the relative contributions of primary and secondary structure
structure.
mutations provides a map for quantitative tuning of both the association rate and the
sequence variants will enable affinity tuning of MS2-based RNA sensors enabling new
applications in synthetic biology. Additionally, these data provide a useful framework for
base-pairing, creating a valuable framework for understanding the design and evolution
RNA hairpin formation or competition with alternate secondary structure, reducing the
number of productive binding collisions [39]. These observations suggest the data provided
here may also provide a rich resource for modeling the RNA hairpin stability and
alternate structure formation. While this is an area of inquiry beyond the focus of this
work, the potential for formation of alternate structures and the effects of local sequence
on native folding of RNA are well suited for study using this platform, as the RNA
and sequence methods. In addition, the technique might provide quantitative information
on RNA libraries generated by systematic evolution of ligands by exponential
enrichment (SELEX), allowing affinity tuning for the design of biological parts. While
SELEX methods often begin with large libraries (~10^14) and produce a small number of
selected molecules, this RNA array methodology allows characterization of a much larger
library subset (~10^5), opening the door to a detailed understanding of the sequence-
specific rules driving acquisition of affinity in the selection process. Alternatively, this
and used to directly quantify molecular affinities on in vitro generated RNA, providing
structure inference via fluorescence resonance energy transfer (FRET). In addition, the
of long RNAs and allowing investigations of long non-coding RNAs and catalytic
binding affinities and functional RNA molecules, as well as the identification and
parameters.
Chapter 2 - Figures and Figure Legends
biochemistry. (a) Steps for generating RNA tethered to DNA clusters on a high-
throughput DNA sequencing flow cell. (b) Structure of the MS2 coat protein homodimer
bound to the 19 nt hairpin RNA (PDB ID: 2BU1) [31]. (c) Images of fluorescently bound to
perfusion of unlabeled MS2 competitor. Below, fitted sum of Gaussians used to assign
containing the consensus sequence (-5C) (t1/2=8.39 minutes). (e) Fit binding curves to
clusters labeled in panel (c). (f) The probability distribution of binding energies from all
clusters with labeled variants; mean Kd = 2.57 nM, 36.8 nM, and 415 nM for the -5C,
-5U, and -5A variants, respectively. (g) Correlation between binding energies reported in
the literature and measured on the RNA array (squares, Carey et al. [28], circles, Romaniuk
Figure 2. A quantitative map of MS2 binding across RNA sequence variants. (a)
per molecular variant as a function of mutation number. A median of ~11 clusters are
observed for sequences with 4 mutations. Affinities for the consensus sequence come
from NC=909,385 clusters. (c) Average ΔΔG of point mutations per position. The ΔΔG
of alanine [36] substitutions to the MS2 binding surface are shown in parentheses (kBT).
Solid and dashed lines represent base and phosphate interactions, respectively. (d) Matrix
of ΔΔG for single and double mutants of the consensus sequence. Inset contains the
matrix of ΔΔG for single and double mutants of the +1G variant. All energies are
calculated relative to the consensus (-5C) sequence (arrow, ΔΔG=0), and the number of
quality-filtered double mutants in each matrix is indicated (M2). (e) Epistasis matrix
Figure 3. Binding affinity is dependent on primary sequence and secondary RNA
structure. (a) Fit parameters for linear regression model showing position-specific
contributions. Energetic components for all possible base pair combinations are shown
below. (b) Predicted binding energies of variants with second (M2) and third mutations
(M3) in both single- and double-stranded regions. Primary (i.e. mean energetic
contributions to affinity derived from a, were mapped onto the hairpin (PDB ID:
1ZDH) [38].
Figure 4. Sequence-specific contributions of association and dissociation rates to
binding affinity. (a) Fractional contribution of dissociation rates for 31 single and 289
double mutants with measurable affinities and dissociation rates. Positions at the base of
the hairpin are highlighted. (b) log(koff) and (c) log(kon) at the base of the hairpin. M2 =
association (blue, μ=0.57) and dissociation (red, μ=0.43) rates to ΔΔG for all measured
mutants (N=3,029).
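The fractional contributions summarized in Figure 4 can be sketched as follows. The only relation used is the definition Kd = koff / kon, so that ln(Kd) = ln(koff) - ln(kon) and a change in binding energy (taken here in units of kBT, an assumption) splits exactly into a dissociation part and an association part. The function names and the example rates are hypothetical and are not taken from the dissertation's data.

```python
import math

def infer_kon(koff, kd):
    """Association rate implied by Kd = koff / kon."""
    return koff / kd

def kinetic_fractions(koff_var, kd_var, koff_ref, kd_ref):
    """Return (dissociation fraction, association fraction) of ddG = ln(Kd_var / Kd_ref).

    The reference and variant Kd values must differ, otherwise ddG is zero
    and the fractions are undefined.
    """
    ddg = math.log(kd_var / kd_ref)
    d_ln_koff = math.log(koff_var / koff_ref)
    d_ln_kon = math.log(infer_kon(koff_var, kd_var) / infer_kon(koff_ref, kd_ref))
    # ddG = d_ln_koff - d_ln_kon, so the two fractions below always sum to 1.
    return d_ln_koff / ddg, -d_ln_kon / ddg

# Hypothetical example: a variant whose Kd rises from 2.57 nM to 36.8 nM
# while its off rate increases threefold.
koff_ref = math.log(2) / 8.39          # per minute, from a half-life of 8.39 minutes
frac_off, frac_on = kinetic_fractions(3 * koff_ref, 36.8, koff_ref, 2.57)
```

In this made-up case roughly 41% of the energy change comes from faster dissociation and the remainder from slower association; because kon is inferred rather than measured, any error in Kd or koff propagates directly into the association fraction.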
Supplementary Figure 1. Data Analysis Workflow. (a) Sequencing cluster centers were
derived from the fastq files from the sequencing run. X/Y and tile positions were
extracted from the fastq header lines. Data were cross-correlated with the observed
images to define a global offset. Images were then cleaned to mask any saturated pixels.
Images were broken into smaller sub regions (24x24 pixels) and the fluorescence was
fitted to a sum of overlapping 2D Gaussians. This process was repeated for all 120 tiles
of the GAIIx sequencing lane and across the 26 image series (3,120 images). (b) Binding
images were normalized for RNA content using the all RNA image (Alexa647 oligo
hybridized to the stall sequence). Data were aggregated across the image series by cluster
ID, and the fluorescence values for each cluster across concentrations were fit to a binding
curve. The fit binding energies were grouped by hairpin sequence, and median binding
Supplementary Figure 2: Correlating sequencing data and fitting 2D Gaussians to
acquired images. We found that a simple cross-correlation was sufficient to map x/y
positions from the sequencing data to both the (a) all RNA image and the (b) MS2
binding images (cluster centers shown in green). Shown are unaligned images and cluster
centers (left), the cross-correlation value (middle), and the resulting mapped cluster
centers (right). The plotted cluster centers were adjusted using the least squares image fit.
Images were fit to 2D Gaussians and generated the following distribution for the relevant
parameters: (c) the fit amplitude and (d) the fit standard deviation from a representative
tile. Integrating these values generated (e) the distribution of the integrated fluorescence.
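The two image-analysis steps named in this legend can be sketched in a few lines. The example below assumes an isotropic 2D Gaussian with amplitude A and standard deviation sigma, whose integral is 2 * pi * A * sigma^2, and a brute-force integer-shift cross-correlation between a list of cluster centers and an image. The function names, the 24x24 synthetic tile, and the spot parameters are illustrative assumptions rather than the actual analysis code, which also included a least-squares refinement not shown here.

```python
import math

def gaussian_2d(x, y, x0, y0, amplitude, sigma):
    """Isotropic 2D Gaussian evaluated at pixel (x, y)."""
    return amplitude * math.exp(-((x - x0) ** 2 + (y - y0) ** 2) / (2 * sigma ** 2))

def integrated_fluorescence(amplitude, sigma):
    """Closed-form integral of the isotropic 2D Gaussian over the plane."""
    return 2 * math.pi * amplitude * sigma ** 2

def best_offset(centers, image, max_shift):
    """Integer (dx, dy) that maximizes summed image intensity at the shifted centers.

    This is the cross-correlation of a delta-function map of cluster centers
    with the image, evaluated by exhaustive search over small shifts.
    """
    height, width = len(image), len(image[0])
    best, best_score = (0, 0), float("-inf")
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            score = 0.0
            for x, y in centers:
                xs, ys = x + dx, y + dy
                if 0 <= xs < width and 0 <= ys < height:
                    score += image[ys][xs]
            if score > best_score:
                best, best_score = (dx, dy), score
    return best

# Synthetic 24x24 sub-region: three spots displaced by (+2, -1) from their nominal centers.
centers = [(5, 6), (15, 9), (10, 18)]
image = [[sum(gaussian_2d(x, y, cx + 2, cy - 1, 100.0, 1.5) for cx, cy in centers)
          for x in range(24)] for y in range(24)]
offset = best_offset(centers, image, max_shift=3)

# Pixel sum of a single centered spot, for comparison with the closed form.
pixel_sum = sum(gaussian_2d(x, y, 11.5, 11.5, 100.0, 1.5)
                for x in range(24) for y in range(24))
```

Exhaustive search is practical only for small shifts and small images; a full-tile alignment would more plausibly use an FFT-based correlation, which is a design choice outside what this legend specifies.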
References
of RNA-binding proteins. Nat. Biotechnol. 27, 667–670 (2009).
19. Ray, D. et al. A compendium of RNA-binding motifs for decoding gene
regulation. Nature 499, 172–177 (2013).
20. Araya, C. L. et al. A fundamental protein property, thermodynamic stability,
revealed solely from large-scale measurements of protein function. Proceedings of
the National Academy of Sciences 109, 16858–16863 (2012).
21. Pitt, J. N. & Ferre-D'Amare, A. R. Rapid Construction of Empirical RNA Fitness
Landscapes. Science 330, 376–379 (2010).
22. Guenther, U.-P. et al. Hidden specificity in an apparently nonspecific RNA-
binding protein. Nature (2013). doi:10.1038/nature12543
23. Matzas, M. et al. High-fidelity gene synthesis by retrieval of sequence-verified
DNA identified using high-throughput pyrosequencing. Nat. Biotechnol. 28,
1291–1294 (2010).
24. Myllykangas, S., Buenrostro, J. D., Natsoulis, G., Bell, J. M. & Ji, H. P. Efficient
targeted resequencing of human germline and cancer genomes by oligonucleotide-
selective sequencing. Nat. Biotechnol. 29, 1024–1027 (2011).
25. Uemura, S. et al. Real-time tRNA transit on single translating ribosomes at codon
resolution. Nature 464, 1012–1017 (2010).
26. Nutiu, R. et al. Direct measurement of DNA affinity landscapes on a high-
throughput sequencing instrument. Nat. Biotechnol. 29, 659–664 (2011).
27. Bentley, D. R. et al. Accurate whole human genome sequencing using reversible
terminator chemistry. Nature 456, 53–59 (2008).
28. Carey, J., Lowary, P. T. & Uhlenbeck, O. C. Interaction of R17 coat protein with
synthetic variants of its ribonucleic acid binding site. Biochemistry 22, 4723–4730
(1983).
29. Valegård, K., Murray, J. B., Stockley, P. G., Stonehouse, N. J. & Liljas, L. Crystal
structure of an RNA bacteriophage coat protein–operator complex. Nature 371,
623–626 (1994).
30. Romaniuk, P. J., Lowary, P., Wu, H. N., Stormo, G. & Uhlenbeck, O. C. RNA
binding site of R17 coat protein. Biochemistry 26, 1563–1568 (1987).
31. Grahn, E. et al. Structural basis of pyrimidine specificity in the MS2 RNA hairpin-
coat-protein complex. RNA 7, 1616–1627 (2001).
32. Bardwell, V. J. & Wickens, M. Purification of RNA and RNA-protein complexes
by an R17 coat protein affinity method. Nucleic Acids Res. 18, 6587–6594 (1990).
33. Bertrand, E. et al. Localization of ASH1 mRNA particles in living yeast. Mol. Cell 2,
437–445 (1998).
34. Kivioja, T. et al. Counting absolute numbers of molecules using unique molecular
identifiers. Nat Meth 9, 72–74 (2011).
35. Greenleaf, W. J., Frieda, K. L., Foster, D. A., Woodside, M. T. & Block, S. M.
Direct observation of hierarchical folding in single riboswitch aptamers. Science
319, 630–633 (2008).
36. Hobson, D. & Uhlenbeck, O. C. Alanine Scanning of MS2 Coat Protein Reveals
Protein–Phosphate Contacts Involved in Thermodynamic Hot Spots. Journal of
Molecular Biology 356, 613–624 (2006).
37. Varani, G. & McClain, W. H. The G·U wobble base pair: A fundamental building
block of RNA structure crucial to RNA function in diverse biological systems.
EMBO Reports 1, 18–23 (2000).
38. Valegård, K. et al. The three-dimensional structures of two complexes between
recombinant MS2 capsids and RNA operator fragments reveal sequence-specific
protein-RNA interactions. Journal of Molecular Biology 270, 724–738 (1997).
39. Gell, C. et al. Single-Molecule Fluorescence Resonance Energy Transfer Assays
Reveal Heterogeneous Folding Ensembles in a Simple RNA Stem–Loop. Journal
of Molecular Biology 384, 264–278 (2008).
40. Licatalosi, D. D. et al. HITS-CLIP yields genome-wide insights into brain
alternative RNA processing. Nature 456, 464–469 (2008).
41. Zhao, J. et al. Genome-wide Identification of Polycomb-Associated RNAs by RIP-
seq. Molecular Cell 40, 939–953 (2010).
42. Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J.
Transposition of native chromatin for fast and sensitive epigenomic profiling of
open chromatin, DNA-binding proteins and nucleosome position. Nat Meth 10,
1213–1218 (2013).
CHAPTER THREE Measuring accessibility in rare cellular populations2
Introduction
Eukaryotic genomes are hierarchically packaged into chromatin [1], and the nature
of this packaging plays a central role in gene regulation [2,3]. Major insights into the
come from high-throughput, genome-wide methods for separately assaying the chromatin
accessibility (open chromatin) [4,5], nucleosome positioning [6-8], and transcription factor
factor binding.
These limitations are problematic in three major ways: First, current methods can
average over and drown out heterogeneity in cellular populations. Second, cells must
often be grown ex vivo to obtain sufficient biomaterials, perturbing the in vivo context
and modulating the epigenetic state in unknown ways. Third, input requirements often
and sensitive method for epigenomic profiling that can provide a comprehensive portrait
2 Portions of this chapter were taken from Buenrostro et al. Transposition of native chromatin for
fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and
nucleosome position. Nature Methods. 2013;10(12):1213–1218. doi:10.1038/nmeth.2688.
ATAC-seq measures chromatin accessibility using Tn5 transposase
Hyperactive Tn5 transposase [10,11], loaded in vitro with adapters for high-throughput DNA
sequencing, can simultaneously fragment and tag a genome with sequencing adapters
chromatin (Fig 1A). The entire assay and library construction can be carried out in a
simple two-step process involving Tn5 insertion and PCR. In contrast, published DNase-
and FAIRE-seq protocols for assaying chromatin accessibility involve complex multi-
step protocols and many loss-prone steps, such as adapter ligation, gel purification and
carried out over at least 3 days [13,14]. Furthermore, these protocols require 1-50 million cells
(FAIRE) or 50 million cells (DNase-seq), likely due to their complex workflows [13,14] (Fig
analyses show that ATAC-seq provides an accurate and sensitive measure of chromatin
accessibility genome-wide. We carried out ATAC-seq on 50,000 and 500 unfixed nuclei
isolated from the GM12878 lymphoblastoid cell line (ENCODE Tier 1 [15]) for comparison and
validation with chromatin accessibility data sets, including DNase-seq [13] and FAIRE-seq [16].
At a locus previously highlighted by others [5] (Fig. 1C), ATAC-seq has a signal-to-noise
ratio similar to DNase-seq, which was generated from approximately 3 to 5 orders of
magnitude more cells [13,14]. Peak intensities were highly reproducible between technical
replicates (R=0.98), and highly correlated between ATAC-seq and DNase-seq (R=0.79
and R=0.83). Highly sensitive open chromatin detection is maintained even when using
5,000 or 500 human nuclei as starting material, although sensitivity is diminished for
positioning. The insert size distribution of sequenced fragments from human chromatin
has clear periodicity of approximately 200 base pairs, suggesting many fragments are
protected by integer multiples of nucleosomes (Fig 2A). This fragment size distribution
also shows clear periodicity equal to the helical pitch of DNA [11]. By partitioning insert
models [17], and normalizing to the global insert distribution (Methods) we observe clear
class-specific enrichments across this insert size distribution (Fig. 2B), demonstrating that
these functional states of chromatin have an accessibility fingerprint that can be read
out with ATAC-seq. These differential fragmentation patterns are consistent with the
putative functional state of these classes, as insulator regions are enriched for short
fragments of DNA, while transcription start sites are differentially depleted for mono-, di-
and tri-nucleosome associated fragments. Transcribed and promoter flanking regions are
enriched for longer multi-nucleosomal fragments, suggesting they are more compacted
than other states that require access to DNA by regulatory factors. Finally, repressed
regions are differentially depleted for short fragments, consistent with their expected
compacted state. These data suggest that ATAC-seq reveals differentially compacted
cell line, we partitioned our data into reads generated from putative nucleosome free
regions of DNA, and reads likely derived from nucleosome associated DNA. Using a
simple heuristic that positively weights nucleosome associated fragments and negatively
weights nucleosome free fragments, we calculated a data track used to call nucleosome
positions within regions of accessible chromatin20. An example locus (Fig. 3A) contains a
putative bidirectional promoter with CAGE data showing two transcription start sites
(TSS) separated by ~700 bp. ATAC-seq in fact reveals two distinct nucleosome free
regulatory regions, as the majority of reads are concentrated within accessible regions of
chromatin (Fig. 3B). By averaging signal across all active TSSs, we note nucleosome free
fragments are enriched at a canonical nucleosome free promoter region overlapping the
TSS, while our nucleosome signal is enriched both upstream and downstream of the
nucleosomes6,7 (Fig. 3C). Because ATAC-seq reads are concentrated at regions of open
chromatin, we see strong nucleosome signal at the +1 nucleosome, which decreases at the
distances from the TSS, likely due to over-digestion of more accessible nucleosomes.
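The weighting heuristic described above, in which nucleosome associated fragments contribute positively and nucleosome free fragments contribute negatively to a per-base track, can be illustrated with a minimal sketch. This is not the implementation used in this work: the length cutoffs (under 100 bp for nucleosome free, 180 to 247 bp for mono-nucleosome), the unit weights, and the function name `nucleosome_signal` are assumptions chosen only for illustration.

```python
def nucleosome_signal(fragments, region_start, region_end,
                      free_max=100, nuc_min=180, nuc_max=247):
    """Build a per-base track over [region_start, region_end).

    fragments: iterable of (start, end) half-open coordinates.
    Short fragments subtract 1 and nucleosome-sized fragments add 1 at
    every base they cover. Cutoffs and weights are illustrative assumptions.
    """
    signal = [0] * (region_end - region_start)
    for start, end in fragments:
        length = end - start
        if length < free_max:
            weight = -1   # putative nucleosome free fragment
        elif nuc_min <= length <= nuc_max:
            weight = 1    # putative mono-nucleosome fragment
        else:
            continue      # ambiguous length: ignored in this sketch
        lo = max(start, region_start)
        hi = min(end, region_end)
        for pos in range(lo, hi):
            signal[pos - region_start] += weight
    return signal
```

Local maxima of such a track would be candidate nucleosome positions; any smoothing or peak calling on top of it is left out of the sketch.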
magnitude more sequencing than ATAC-seq (198 million paired reads) to reach similar
resolution at regulatory nucleosomes (Fig. 3B,C). Using our nucleosome calls, we further
partitioned putative distal regulatory regions and TSSs into regions that were nucleosome
free and regions that were predicted to be nucleosome bound. We note that TSSs were
enriched for nucleosome free regions when compared to distal elements, which tend to
remain nucleosome rich (Fig. 3D). These data suggest ATAC-seq can provide high-
understand the relationship between nucleosomes and DNA binding factors. Using ChIP-
seq data, we plotted the position of a variety of DNA binding factors with respect to the
revealed major classes of binding with respect to the proximal nucleosome, including 1) a
strongly nucleosome avoiding group of factors with binding events stereotyped at ~180
bases from the nearest nucleosome dyad (comprising C-FOS, NFYA and IRF3), 2) a
class of factors that nestle up precisely to the expected end of nucleosome DNA
contacts, which notably includes chromatin looping factors CTCF and cohesin complex
subunits RAD21 and SMC3; 3) a large class of primarily transcription factors that have
a class whose binding sites tend to overlap nucleosome associated DNA. Interestingly,
this final class includes chromatin remodeling factors such as CHD1 and SIN3A as well
as RNA polymerase II, which appears to be enriched at the nucleosome boundary8. The
interplay between precise nucleosome positioning and locations of DNA binding factor
of ATAC-seq.
wide. We reasoned that DNA sequences directly occupied by DNA-binding proteins are
protected from transposition; the resulting sequence footprint reveals the presence of
location of the CTCF motif that coincides with the summit of the CTCF ChIP-seq signal
in GM12878 cells (Fig. 4A). We averaged ATAC-seq signal over all expected locations of
CTCF within the genome and observed a well-stereotyped footprint (Fig. 4B). Similar
results were obtained for a variety of common TFs. We inferred the CTCF binding
footprinting data to generate a posterior probability of CTCF binding at all loci (Fig.
4C)25. Results using ATAC-seq closely recapitulate ChIP-seq binding data in this cell line
and compare favorably to DNase-based factor occupancy inference, suggesting that
factor occupancy data can be extracted from these ATAC-seq datasets, and allowing
networks. With this personalized regulatory map, we compared the genomic distribution
of the same 89 transcription factors between GM12878 and proband CD4+ T-cells.
Transcription factors that exhibit large variation in distribution between T-cells and B-
cells are enriched for T-cell specific factors (Fig. 4D). This analysis shows NFAT is
Discussion
insights, but are currently limited in application by their complex workflows and large
cell number requirements. ATAC-seq offers potentially unique advantages over pre-
existing ChIP-, MNase- and DNase-seq methods. ATAC-seq is an information rich assay,
regulatory sites, and chromatin compaction genome-wide. These insights are derived
from both the position of insertion and the distribution of insert lengths captured during
the transposition reaction. While extant methods such as DNase- and MNase-seq can
provide some subsets of the information in ATAC-seq, they each require separate assays
with large cell numbers, which increases time and cost and limits applicability to many
systems. ATAC-seq also provides insert size fingerprints of biologically relevant
expect ATAC-seq to have broad applicability, significantly add to the genomics toolkit,
and improve our understanding of gene regulation, particularly when integrated with
other powerful rare cell techniques, such as FACS, laser capture microdissection (LCM)
combination of speed, simplicity and low input requirements of ATAC-seq will enable
Chapter 3 - Figures and Figure Legends
(red and blue), inserts only in regions of open chromatin (nucleosomes in grey) and
generates sequencing library fragments that can be PCR amplified. B) Approximate input
material and sample preparation time requirements for genome-wide methods of open
Figure 2. ATAC-seq provides genome-wide information on chromatin compaction.
A) ATAC-seq fragment sizes generated from GM12878 nuclei (red) indicate chromatin-
high frequency periodicity consistent with the pitch of the DNA helix for fragments less
than 200 bp. (Inset) log-transformed histogram shows clear periodicity persists to 6
defined17.
Figure 3. ATAC-seq provides genome-wide information on nucleosome positioning
(TSSs) showing nucleosome free read track, calculated nucleosome track (Methods), as
well as DNase, MNase, and H3K27ac, H3K4me3, and H2A.Z tracks for comparison. B)
ATAC-seq (198 million paired reads) and MNase-seq (4 billion single-end reads)
nucleosome signal shown for all active TSSs (n=64,836); TSSs are sorted by CAGE
expression. C) TSSs are enriched for nucleosome free fragments, and show phased
nucleosomes similar to those seen by MNase-seq at the -2, -1, +1, +2, +3 and +4
bases in TSS and distal sites (see Methods). E) Hierarchical clustering of DNA binding
factor position with respect to the nearest nucleosome dyad within accessible chromatin
reveals distinct classes of DNA binding factors. Factors strongly associated with
Figure 4. ATAC-seq assays genome-wide factor occupancy. A) CTCF footprints
ATAC-seq footprint for CTCF (motif shown) generated over binding sites within the
genome. C) CTCF predicted binding probability inferred from ATAC-seq data, position
weight matrix (PWM) scores for the CTCF motif, and evolutionary conservation
(PhyloP). Right-most column is the CTCF ChIP-seq data (ENCODE) for this GM12878
cell line, demonstrating high concordance with predicted binding probability. D) Cell
type-specific regulatory network from proband T cells compared with GM12878 B-cell
line. Each row or column is the footprint profile of a TF versus that of all other TFs in the
same cell type. Color indicates relative similarity (yellow) or distinctiveness (blue) in T
versus B cells. NFAT is one of the most highly differentially regulated TFs (red box).
References
occupancy by sequencing. Genome Research 23, 341-351 (2013).
21. Kundaje, A. et al. Ubiquitous heterogeneity and asymmetry of the chromatin
environment at regulatory elements. Genome Research 22, 1735 (2012).
22. Hesselberth, J. R. et al. Global mapping of protein-DNA interactions in vivo by
digital genomic footprinting. Nat Meth 6, 283-289 (2009).
23. Boyle, A. P. et al. High-resolution genome-wide in vivo footprinting of diverse
transcription factors in human cells. Genome Research 21, 456-464 (2011).
24. Neph, S. et al. An expansive human regulatory lexicon encoded in transcription
factor footprints. Nature 489, 83-90 (2012).
25. Pique-Regi, R. et al. Accurate inference of transcription factor binding from DNA
sequence and chromatin accessibility data. Genome Research 21, 447-455 (2011).
26. Tang, F. et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat Meth
6, 377-382 (2009).
27. Shalek, A. K. et al. Single-cell transcriptomics reveals bimodality in expression
and splicing in immune cells. Nature 498, 236-240 (2013).
CHAPTER FOUR Single-cell accessibility reveals principles of regulatory
variation3
Introduction
Heterogeneity within cellular populations has been evident since the first
for interrogating single cells1-5 has allowed detailed characterization of this molecular
types11 quantifying changes that lead to both activation and repression of gene
expression. Given this broad diversity of activity within regulatory elements when
heterogeneity at the single cell level extends to accessibility variability within cell types
at regulatory elements. However, the lack of methods to probe DNA accessibility within
variation.
3 Portions of this chapter were taken from Buenrostro et al. Single-cell chromatin accessibility
reveals principles of regulatory variation. Nature. 2015. doi:10.1038/nature14590.
(scATAC-seq), improving on the state-of-the-art12 sensitivity by >500-fold. ATAC-seq
uses the prokaryotic Tn5 transposase13,14 to tag regulatory regions by inserting sequencing
adapters into accessible regions of the genome. In scATAC-seq individual cells are
captured and assayed using a programmable microfluidics platform (C1 single-cell Auto
Prep System, Fluidigm) with methods optimized for this task (Fig. 1a). After
transposition and PCR on the Integrated Fluidics Circuit (IFC), libraries are collected and
PCR amplified with cell-identifying barcoded primers. Single-cell libraries are then
10^7 or 10^4 cells, respectively (Fig. 1b,c). Data from single cells recapitulate several
diversity or poor measures of accessibility, which correlate with empty chambers or dead
cells, were excluded from further analysis (Fig. 1d). Chambers passing filter yielded an
average of 7.3 x 10^4 fragments mapping to the nuclear genome. We further validated the
representing 3 tier 1 ENCODE cell lines15 (H1 human embryonic stem cells [ESCs],
from V6.5 mouse ESCs, EML6 (mouse hematopoietic progenitor), TF-1 (human
fibroblast).
elements within individual cells. For example, we estimate that a
total of 9.4% of promoters are represented in a typical scATAC-seq library (Fig. S1c-f).
The sparse nature of scATAC-seq data makes analysis of cellular variation at individual
features (Fig. 2a,b and Fig. S2a-f). To quantify this variation we first choose a set of
open chromatin peaks, identified using the aggregate accessibility track, which share a
common characteristic (such as transcription factor binding motif, ChIP-seq peaks, cell
cycle replication timing domains, etc.). We then calculate the observed fragments in these
regions minus the expected fragments, downsampled from the aggregate profile, within
individual cells. To correct for bias, we divide this by the root mean square of fragments
expected from a background signal (BS) constructed to estimate technical and sampling
error within single-cell data sets. Herein, we refer to this metric as deviation. Finally,
for any set of features, we aggregate the deviation measurements across cells (Fig. 2b) to
obtain an overall variability score, a metric of excess variance over the background
signal.
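The calculation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the analysis code used in this work: the function names are hypothetical, the expected count is taken to be the aggregate count scaled to the depth of the cell, the normalizer is read literally as the root mean square of the background counts, and the variability score is taken to be the root mean square of the per-cell deviations.

```python
import math

def expected_fragments(aggregate_in_set, aggregate_total, cell_total):
    """Aggregate count in the feature set, scaled to this cell's depth (assumption)."""
    return aggregate_in_set * cell_total / aggregate_total

def deviation(observed, expected, background):
    """(observed - expected) divided by the RMS of background fragment counts.

    background: fragment counts for this cell drawn from the background signal.
    """
    rms = math.sqrt(sum(b * b for b in background) / len(background))
    return (observed - expected) / rms

def variability(deviations):
    """One possible aggregate across cells: RMS of the deviations (assumption)."""
    return math.sqrt(sum(d * d for d in deviations) / len(deviations))
```

For instance, a cell with 5,000 fragments, drawn from an aggregate of 100,000 fragments of which 1,000 fall in the feature set, has an expected count of 50; observing 70 fragments against a background RMS of 5 gives a deviation of 4.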
We first focused our analysis on K562 myeloid leukemia cells, a cell type with
associated with trans-factors within individual K562 cells, we computed variability
across all available ENCODE ChIP-seq, transcription factor motifs and regions that
differed in replication timing (as determined from Repli-Seq data sets18) (Fig. 2c,d). We
replicates (Fig. S2g-i). As expected from proliferating cells, we find increased variability
associated with changes in DNA content across the cell cycle. In addition, we discover a
set of trans-factors associated with high variability. These factors include sequence-
specific transcription factors (TFs), such as GATA1/2, JUN, and STAT2, and chromatin
cytometry (Fig. 2e and Fig. S3a-d) confirmed heterogeneous expression of GATA1 and
GATA2. Principal component (PC) analysis of single-cell deviations across all trans-factors
shows seven significant PCs, with PC 5 describing changes in DNA abundance
throughout the cell cycle. This analysis suggests that high-variance trans-factors are
variable independent of the cell-cycle (Fig. 2f and Fig. S3e-g). The remaining PCs show
contributions from several TFs, suggesting that variance across sets of trans-factors
to-site variability in chromatin accessibility. For example, the most variant factors in
K562 cells, GATA1 and GATA2, display expression heterogeneity and also bind an
identical consensus sequence GATA, suggesting these factors may compete for access
to DNA sequences. In support of this hypothesis, we find regulatory elements with both
whereas sites with only GATA1 or GATA2 show substantially less variability (Fig. 2g
binding sites that co-occur with JUN or CEBPB. We also find peaks unique to GATA1
binding are significantly more accessible than peaks unique to GATA2 (Fig. S3j-k)
GATA2 to induce single-cell variability. Extending this analysis to all TF ChIP-seq data
sets revealed a trans-factor synergy landscape for accessibility variation (Fig. 2g). For
significantly enhanced when the same region could also be bound by GATA1, TAL1 or
P300. In contrast, CTCF, SUZ12, and ZNF143 appear to act as general suppressors of
accessibility variance, unless associated with proximal binding of ZNF143 or SMC3, the
caused a marked reduction of variability within peaks associated with DNA replication
timing domains (Repli-seq) (Fig. 3a). The addition of inhibitors of JUN or BCR-ABL
suggesting an increase in the subpopulation of G1/S cells, which was validated with flow
cytometry (Fig. S4). JUN variability was one of the top changes caused by JNKi but not
specifically modulated accessibility variability at NF-kB sites (Fig. 3b), consistent with
the known stochastic and oscillatory property of nuclear shuttling in this system20.
Together, these results show that variability can be experimentally modulated and further
We observe that trans-factors associated with high variability are generally cell
type specific. Hierarchical bi-clustering of single-cell deviations generated from three cell
lines reveals cell-type specific sets of transcription factor motifs associated with high
variability (Fig. 3c). This analysis also shows cells from different biological replicates
cluster with their cell type of origin (with a single exception), suggesting scATAC-seq
all assayed cell types identified high-variance trans-factor motifs that are generally
unique to specific cell types (Fig. 3d). For example, regions associated with GATA TFs
are most variant in K562s while regions associated with master pluripotency TFs Nanog
and Sox2 are most variant in mouse embryonic stem cells (ESCs), consistent with
find high variability of GATA1 and PU.1 (SPI1) binding accessibility in EML cells, a
cell type previously shown to have >200x GATA1 and >15x PU.1 expression differences
within clonal cellular subpopulations6. Interestingly, the complete set of identified high-
localize into the nucleus, including NF-kB, JUN, and ETS/ERG20,23,24, suggesting that
variance among this set of annotated trans-factor motifs, suggesting differences in the
global levels of trans-factor variability across cell lines. Overall these findings suggest
wide.
single cell deviations within sliding windows across the genome, each encompassing a
fixed number of peaks (N=25) (Fig. 4a). We then determined which windows co-varied
within individual cells by calculating the co-correlation of each window across all others
within the same chromosome within individual cells. We then further enhanced this co-
correlation matrix using a secondary correlation analysis using methods similar to those
pairs of positions in the genome where accessibility co-varies within individual cells,
chromatin domains26 (Fig. 4b-d) (R=0.61 for chromosome 1). These data provide
chromatin structure25,26. Moreover, these results suggest that higher-order chromatin
interactions may drive regulatory variability in cis (elements that are close together tend
to be open together), and that ensemble chromosome conformation data may arise in part
DNA loci27.
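The cis co-variation analysis described above (sliding windows of a fixed number of peaks, correlation of each window against all others across cells, then a second correlation of the resulting matrix) can be sketched as follows. This is an illustrative outline rather than the method as implemented here: the function names are hypothetical, the per-window value is simplified to a sum of per-peak scores, and the secondary step is assumed to be a plain Pearson correlation between rows of the first matrix.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def window_profiles(cell_by_peak, window=25, step=1):
    """Return one per-cell vector per sliding window of `window` consecutive peaks.

    cell_by_peak: list of rows (one per cell) of per-peak scores along a chromosome.
    """
    n_peaks = len(cell_by_peak[0])
    return [[sum(row[start:start + window]) for row in cell_by_peak]
            for start in range(0, n_peaks - window + 1, step)]

def correlation_matrix(vectors):
    """All-against-all Pearson correlation of a list of vectors."""
    return [[pearson(u, v) for v in vectors] for u in vectors]
```

A first-pass matrix would be `correlation_matrix(window_profiles(data))`, and applying `correlation_matrix` to the rows of that result gives the doubly correlated map.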
Discussion
occurrence with other factors such as P300 appears to amplify variability, perhaps due to
specific single-cell epigenomic variability across several cell types, suggesting that
this component of epigenomic noise has its roots in higher-order chromatin organization.
Altogether, these data provide exciting new hypotheses of regulatory mechanisms that
We envision that future studies will enhance the utility of scATAC-seq by further
improving the recovery of DNA fragments, increasing throughput, and refining methods
of data analysis. Improvements to throughput and new statistical tools will enable single
cells to be partitioned by cell state and analyzed in aggregate to find the individual peaks
that drive variability (Fig. S5). In addition, we anticipate scATAC-seq may be paired
for systems analysis of individual cells. Such an approach will link regulatory variation to
details of phenotypic variation, promising new insight into the molecular underpinnings
of the epigenomic landscape of small or rare biological samples allowing for detailed,
Chapter 4 - Figures and Figure Legends
0.80). (d) Library size versus percentage of fragments in open chromatin peaks (filtered
as described in methods) within K562 cells (N=288). Dotted lines (15% and 10,000)
Figure 2. Trans-factors are associated with single-cell epigenomic variability. (a)
Schematic showing two cellular states (TF high and TF low) leading to differential
signal (BS; see Supplemental Methods section 3.2) to calculate TF deviations and
variability from scATAC-seq data. The TF value is calculated by subtracting the number
of expected fragments from the observed fragments per cell (see Supplemental Methods
section 3.1). (c) Observed cell-to-cell variability within sets of genomic features
associated with ChIP-seq peaks, transcription factor motifs, and replication timing (error
estimates shown in grey, see Methods for details). Variability measured from permuted
deviations from expected accessibility signal for GATA1 sites in individual cells,
histogram of cells shown in grey, density profile shown in purple (see Methods). (e)
Immunostaining of GATA1 (green) and GATA2 (red) shows protein expression in
K562s. (f) Principal components ranked by fraction of variance explained from observed
data (purple) and permuted data (orange). Bar plot of observed data shown in grey. (g)
Methods). Venn-diagrams show variability associated with GATA1 and/or GATA2 and
Figure 3. Cell type specific epigenomic variability. Change of cellular variability due
to chemical perturbations using (a) CDK4/6 cell-cycle inhibitor (K562) or (b) TNF-alpha
bootstrapped cells across the two conditions. (c) Heat map of deviations from expected
accessibility signal across trans-factors (rows) and of single cells (columns) from 3 cell
clustering. (d) Variability associated with trans-factor motifs across 7 cell types. Each
row is normalized to the maximum variability for that motif across cell types (shown
left).
Figure 4. Structured cis variability across single epigenomes. (a) Per-cell deviations
of expected fragments across a region within chromosome 1 (see Methods). For display,
only large deviation cells are shown (N=186 cells). (b) Pearson correlation coefficient
chromatin conformation capture assay (left, data from Kalhor et al.26) or doubly
Methods). Data in white represents masked regions due to highly repetitive regions. (c)
Permuted cis-correlation map for chromosome 1 (analyzed identically to (b)). (d) Box
Supplemental Figure 1. scATAC-seq data recapitulate bulk assays. (a) Histogram of
aggregated read starts around all TSSs (in K562 cells) comparing ensemble approaches to
scATAC-seq shows high enrichment above background level of reads. (b) DNA fragment
size distribution of ATAC-seq fragments from single cells (grey) and the average of all
Accessibility across all peaks (n=50,000) in GM12878 cells. (d) Accessibility across all
annotated promoters in GM12878 cells. Typical promoters used for subsequent analysis
are boxed with dotted lines. Recovery of typical promoters shown in (a) within single
cells in (e) observed data and (f) extrapolated data using measures of predicted
library complexity.
Supplemental Figure 2. scATAC-seq data analysis pipeline and validation of bias
normalization. Standard deviation of log fold change in reads across cells within peaks
binned by deciles of (a) peak intensity, (b) Tn5 bias and (c) GC bias. Variability scores
(incorporating bias normalization) within the same peaks shown in (a-c), peaks are
binned by deciles of (d) peak intensity, (e) Tn5 bias and (f) GC bias. (g-i) Observed
changes in variability comparing the merged set of replicates (K562) to each individual
biological replicate. Error bars represent 1 standard deviation of the variability scores
Supplemental Figure 3. Characterization of high-variance trans-factors in K562
cells. (a-d) Distribution of (a) GATA1, (b) GATA2, (c) actin and (d) CTCF fluorescence
observed by flow cytometry. Distributions in grey depict isotype controls. (e) Bi-
clustered heat map of single cell deviations as observed within K562 cells (N=239).
Labels on right identify co-clustering of related factors. (f) Bi-clustered heat map of
single-cell deviations observed from permuted data. (g) Projection of factor loadings onto
principal component 1 versus 5 from principal component (PC) analysis of heatmap from
Fig. 2d. Factor loadings do not vary along PC5, while peaks associated with regions with
different replication timings (RepliSeq) have strong variation along this axis. Venn-
diagrams showing variability of (h) GATA1 and/or GATA2, (i) CJUN and/or GATA2
and CEBPB and/or GATA2 (co-) occurring ChIP-seq sites. (j) Distribution of
accessibility among GATA1 only, GATA2 only, and shared sites. (k) Mean accessibility
from GATA1 only, GATA2 only, and shared sites in (j); error bars represent 1 standard
Supplemental Figure 4. Drug treatments modulate factor variability. (a-b) Change in
variability of untreated K562 cells versus cells treated with (a) Imatinib and (b) JUN
inhibitor show an increase in variability in factors associated with the cell cycle or S-phase
and JUN factors respectively. (c-f) Flow cytometry data depicting DNA content, using
DAPI or PI, in (c) control K562 cells or cells showing altered cell-cycle status after
treatment with (d) cell-cycle inhibitor, (e) Imatinib and (f) JUN inhibitor.
Supplemental Figure 5. Measurements of individual peaks within single-cells. (a)
The distribution of GATA1 deviation scores for single K562 cells. Volcano plots of (b)
non-GATA1 peaks and (c) GATA1 peaks in K562 cells, p-values were calculated using a
binomial test. (d) The distribution of NF-kB deviation scores for single GM12878 cells.
Volcano plots of (e) non-NF-kB peaks and (f) NF-kB peaks in GM12878 cells, p-values
were calculated using a binomial test. Inset numbers show the number of points in upper
left or upper right quadrants of the panel. (g) Accessibility at a genomic locus, showing
(top) aggregate NF-kB low (blue) and NF-kB high (red) profiles, (middle) single
GM12878 cells ranked by NF-kB deviation scores and (bottom) unranked single cells.
References
information processing. Nature 466, 267-271 (2010).
21. Grün, D., Kester, L. & van Oudenaarden, A. Validation of noise models for single-
cell transcriptomics. Nat Meth 11, 637-640 (2014).
22. Singer, Z. S. et al. Dynamic Heterogeneity and DNA Methylation in Embryonic
Stem Cells. Molecular Cell 55, 319-331 (2014).
23. Cai, L., Dalal, C. K. & Elowitz, M. B. Frequency-modulated nuclear localization
bursts coordinate gene regulation. Nature 455, 485-490 (2008).
24. Levine, J. H., Lin, Y. & Elowitz, M. B. Functional roles of pulsing in genetic
circuits. Science 342, 1193-1200 (2013).
25. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions
reveals folding principles of the human genome. Science 326, 289-293 (2009).
26. Kalhor, R., Tjong, H., Jayathilaka, N., Alber, F. & Chen, L. Genome architectures
revealed by tethered chromosome conformation capture and population-based
modeling. Nat. Biotechnol. 30, 90-98 (2012).
27. Giorgetti, L. et al. Predictive Polymer Modeling Reveals Coupled Fluctuations in
Chromosome Conformation and Transcription. Cell 157, 950-963 (2014).
CHAPTER FIVE The epigenomic determinants of human hematopoiesis4
Introduction
number of self-renewing hematopoietic stem cells (HSCs). These cells are long-lived and
retain the ability to give rise to multiple distinct lineages of blood cells. During the course
of a single day, more than 200 billion blood cells are produced1, highlighting the need for
tightly controlled regulatory programs that balance self-renewal of the apex stem cells
complexity, the hematopoietic system is the most extensively characterized adult stem
cell hierarchy whereby many diverse cell types can be isolated through the use of multi-
parameter fluorescence activated cell sorting (FACS)2. This enables interrogation of the
precise transcriptional dynamics that govern the cell state transitions associated with
have profiled gene expression patterns in mouse3-5 and human6,7 hematopoiesis providing
a rich resource for characterizing these cellular states. However, measuring gene
expression alone provides limited information regarding the causative regulators of cell
identity. The dynamic expression of key transcription factors (TFs) can dramatically alter
the regulatory landscape, which defines the expression of nearby genes, and forms the
methods for assaying cellular regulation. Importantly, chromatin accessibility measures
active regulatory elements and hotspots for TF binding. However, these assays require
regulation and limiting their application to either cell lines8 or whole tissues9 which do
not accurately represent individual primary cell types. Recent developments have enabled
rare cellular populations, which has successfully led to the identification of regulatory
efficiency have afforded the unique opportunity to provide such comprehensive and data-
protocol, optimized for human blood cells, that allows for more rapid high-quality
measurements with 10-fold fewer cells. We apply this optimized protocol to cells isolated
from 9 healthy human donors, studying 13 of the major cell types of the normal
derive paired expression data. This atlas of normal human hematopoiesis provides a data
rich resource for discovering the molecular determinants of human hematopoiesis and
Identification of chromatin accessibility landscape in primary blood cells
transcriptome map of the normal hematopoietic hierarchy (Fig. 1a,b). Although ATAC-
seq is highly efficacious for a variety of cell sources, further optimizations were required
to profile rare primary human blood cells from cryopreserved specimens. This protocol,
termed Fast-ATAC, was optimized for use on primary blood cells and relies on a 1-step
membrane permeabilization and transposition using the lysis reagent digitonin. We found
that this simplified protocol provides extremely high quality data (Fig. S1a-c), requires
just 5,000 cells, offers an approximately 10-fold improvement in sensitivity, and reduces
the frequency of mitochondrial reads by ~5-fold (Fig. S1d). However, we note that
digitonin is a gentle detergent and may not be ideal for cell lines and other cell types that
are more resistant to lysis. Overall, Fast-ATAC provided hematopoietic epigenomes with
i) increased speed, ii) fewer cells and iii) lower cost, making it readily adaptable for
the human hematopoietic hierarchy via fluorescence activated cell sorting (FACS) (Fig.
1a and Fig. S2). Cells were taken directly from donor bone marrow or peripheral blood
without further in vitro manipulation or treating donors with agents such as granulocyte
high endogenous RNases and proteases as well as mature megakaryocytes, which proved
difficult to isolate in adequate cell numbers. The isolated cell populations included 7
unique stem and progenitor and 6 differentiated cell types spanning the myeloid,
erythroid, and lymphoid lineages2,11-13. Altogether, we performed ATAC-seq and RNA-seq
on 3-4 adult donors for each cell population, totaling 49 transcriptomes and 77
Each individual cell type of the hematopoietic hierarchy displayed a set of uniquely
expressed genes and uniquely open peaks mapping to genes known to be involved in
cellular functions important for the given cell type (Fig. 1c and Fig. S1f,g). Additionally,
the sets of uniquely open peaks were enriched for motifs of transcription factors known to
be involved in the biological processes of the cell type of interest (Fig. S1h).
(R=0.93, Fig. 1d) and biological (R=0.93, Fig. 1e) replicates. We also observed a
significant correlation (R=0.73) between Fast-ATAC and DNase-seq (data from the
Importantly, we find that hematopoietic stem cells (HSCs), a CD34+ subpopulation, can
have significantly different chromatin profiles than the bulk CD34+ HSPC pool (R=0.77,
Fig. 1g), highlighting the value of analysis of highly purified stem and progenitor cell
subpopulations.
clustering of our RNA-seq and ATAC-seq data shows robust classification of cell types
among technical and biological replicates (Fig. 2a,b). In this analysis, ATAC-seq appears
to be more adept at classifying cell types as quantified by the cluster purity14, suggesting
that chromatin accessibility is more cell type-specific and better captures cell identity.
Intriguingly, and in line with previous studies of murine cell subsets5, epigenomes and
transcriptomes also provide different conclusions about the relationship between various
cell types. By RNA-seq, the common myeloid progenitor (CMP) clusters with the
clusters more closely with the HSC and the multipotent progenitor cell (MPP) (Fig. 2d),
which is more consistent with its role in the hematopoietic hierarchy. We reasoned that
latter revealing the cell-type specific and combinatorial logic of regulatory elements that
control the expression of nearby genes. When regulatory elements were subdivided to
gene promoters versus putative distal enhancers (>1000 bp away from the closest TSS),
compared to promoters and transcription profiles (Fig. 2e,f). Notably, we find that
promoter elements are largely invariant within CD34+ stem and progenitor cells,
suggesting that chromatin remodeling associated with these linked developmental lineage
clearly illustrated by the region surrounding the TET2 gene, a gene expressed within a 2-
fold range in all cell types throughout the hematopoietic hierarchy. Despite the invariant
Enhancer cytometry prospectively deconvolves complex cell populations
Given the accuracy with which regulatory landscapes delineate cell types, we
populations into their constituent subsets. For instance, the Epigenomic Roadmap
mixtures of CD34+ HSPCs. These data are very useful for understanding the biology of
these cells; however, these tissues represent an ensemble average of multiple distinct cell
types. While some regulatory elements are ubiquitous among all HSPCs, others show
high cell type specificity (Fig. 3a). For example, the accessible site near micro-RNA 1915
shows a robust peak exclusively in CMP cells but shows almost no accessibility in the
CD34+ DNase-seq data. In fact, regulatory elements that are highly cell-type specific are
averaged out and difficult to detect in this bulk CD34+ data (Fig. 3a).
The highly cell type-specific nature of our ATAC-seq data enabled the
contribution of each individual cell type to the ensemble profile. Analogous to flow
cytometry of cell surface markers, enhancer cytometry with CIBERSORT uses the
patterns of cell identity. To do this, we filtered for high-quality distal regulatory elements
and removed promoter signal (see methods) and applied CIBERSORT to define an array
of cell-type specific regulatory elements (Fig. 3b). CIBERSORT employs support vector
regression (SVR) for deconvolution, a method shown to be robust to noise, unknown
out cross validation and found that enhancer cytometry proved to be highly robust for
classification of all normal hematopoietic cell types (Fig. 3c,d). One exception is the
MPP that showed reasonable but lower accuracy than other cell types. However, we note
that when MPP cells are misclassified, they are most frequently misclassified as HSCs,
their closest normal cell type. Next, we prospectively tested enhancer cytometry on bulk
CD34+ HSPCs and performed flow cytometry in parallel. We found that enhancer
cytometry yielded highly accurate enumeration of the constituent cell types when
compared to flow cytometry (R² = 0.95, Fig. 3e,f). Notably, this cell type deconvolution
was not as accurate without restriction to distal regulatory elements (R² = 0.91). In
addition, we found that enhancer cytometry can also be used to deconvolve CD34+
DNase-seq data (p < 0.001), suggesting that ATAC-seq with enhancer cytometry may be
a general strategy for identifying and counting cells within complex cellular mixtures.
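
The deconvolution idea behind enhancer cytometry can be illustrated with a minimal sketch. This is not the CIBERSORT implementation: instead of support vector regression it uses non-negative least squares solved by projected gradient descent, a deliberately simpler stand-in. The signature matrix, the mixture values, and the function name `deconvolve` are hypothetical and chosen only for illustration.

```python
def deconvolve(signature, mixture, n_iter=5000):
    """Estimate non-negative cell-type fractions f so that signature * f ~ mixture.

    signature: list of rows (one per regulatory element), each row holding the
               accessibility of that element in every reference cell type.
    mixture:   accessibility of the same elements in the bulk sample.
    NOTE: non-negative least squares is an assumption of this sketch; it is
    a simplified substitute for the SVR used by CIBERSORT.
    """
    n_peaks = len(signature)
    n_types = len(signature[0])
    # The trace of S^T S bounds its largest eigenvalue, so 1/trace is a safe step.
    step = 1.0 / sum(v * v for row in signature for v in row)
    f = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        # Residual of the current fit at every element.
        resid = [sum(row[j] * f[j] for j in range(n_types)) - mixture[i]
                 for i, row in enumerate(signature)]
        # Gradient of 0.5 * ||S f - m||^2 with respect to f.
        grad = [sum(signature[i][j] * resid[i] for i in range(n_peaks))
                for j in range(n_types)]
        # Gradient step, then projection onto the constraint f >= 0.
        f = [max(0.0, f[j] - step * grad[j]) for j in range(n_types)]
    total = sum(f)
    # Report fractions that sum to one, as in a cell-count readout.
    return [x / total for x in f] if total > 0 else f


# Hypothetical signature: 4 elements x 3 reference cell types.
signature = [[10, 1, 0], [0, 8, 1], [1, 0, 9], [5, 5, 5]]
# Mixture built from known fractions (0.5, 0.3, 0.2) of the three types.
mixture = [5.3, 2.6, 2.3, 5.0]
print(deconvolve(signature, mixture))
```

In this toy setting the estimated fractions can be compared directly with the known mixing proportions, which mirrors the comparison of predicted fractions against flow cytometry described above.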
by their underlying transcription factor motifs and calculated a bias corrected deviation
score, which represents a differential gain or loss of accessibility across peaks sharing a
given motif for each transition in the hematopoietic hierarchy. We note that, unlike
the number of sequenced reads, DNA sequence bias, and signal-to-noise bias. We
therefore chose this approach to measure the effect that a given TF motif enacts on the
accessibility (Fig. 4a and Fig. S3a). Notably, these factors have also been previously
these TFs are highly cell-type specific, often displaying step-wise gains across
developmental lineages. This is exemplified by the GATA and PAX motifs, which
are strongly enriched in erythroid and lymphoid lineages, respectively (Fig. 4b,c). To
validate this approach for determining global TF motif regulators of cell identity, we
compared GATA TF footprints [20] between MEPs (GATA high) and common lymphoid
progenitors (CLPs) (GATA low) and found that CLPs had no detectable binding at
GATA sites when compared to MEPs (Fig. 4d). For further validation, we employed
PIQ [21], a TF footprinting algorithm, and found drastically fewer GATA footprints in CLPs
compared to MEPs (N=173 and N=27,292, respectively), thus confirming our analytical
We reasoned that the accessibility of a given motif should correlate with the
expression of the associated transcription factor. However, the underlying motif sequence
does not identify the precise causative regulator of accessibility at those motif instances.
This is a common issue in epigenomic studies and particularly important for cases in
which many factors share identical or near-identical TF motifs. For example, the GATA
motif is shared among 6 TFs (GATA1-6), while the PAX motif is shared among 9 TFs. In
an effort to assign motifs to transcription factors, we integrated our ATAC-seq and RNA-
association table linking hematopoietic TF motifs to 806 genes by motif similarity (Fig.
S3b-e). Next, we calculated correlation coefficients for the expression of all known TFs [23]
hematopoiesis (Fig. 4e). For example, the expression of GATA1 and PAX5 is highly
correlated with accessibility at GATA and PAX motifs, respectively (R = 0.75, P = 10^-18
and R = 0.88, P = 10^-230, Fig. 4e-g and Fig. S3f). Interestingly, for some motifs, such as
the HOX motif, we find many putative regulators with weak correlations (N = 11; Fig.
S3g,h), suggesting that regulation of HOX accessibility is more complex. Together, these
data.
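
The pairing of a motif with candidate regulators by correlation can be sketched as follows. The sketch assumes one motif accessibility value and one expression value per cell type; the factor names `TF_A`, `TF_B`, and `TF_C`, the numbers, and the helper `rank_candidate_regulators` are hypothetical and are not taken from the data in this work.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def rank_candidate_regulators(motif_accessibility, tf_expression):
    """Sort TFs sharing a motif by how well their expression tracks its accessibility.

    Returns (TF name, r) pairs ordered from the most positive to the most
    negative correlation.
    """
    scores = [(tf, pearson(motif_accessibility, expr))
              for tf, expr in tf_expression.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical motif accessibility across five cell types.
accessibility = [0.0, 0.5, 1.0, 2.0, 4.0]
# Hypothetical expression of three factors annotated to the same motif.
expression = {
    "TF_A": [1, 2, 3, 5, 9],   # rises with accessibility
    "TF_B": [5, 5, 4, 6, 5],   # largely unrelated
    "TF_C": [9, 8, 7, 5, 1],   # falls as accessibility rises
}
for tf, r in rank_candidate_regulators(accessibility, expression):
    print(tf, round(r, 3))
```

Factors at the top of such a ranking would be candidate positive regulators of the motif, and those at the bottom candidate negative regulators.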
diseases
hematopoietic regulome can trace the ontogeny of activity in the noncoding genome that
impacts human disease. Many genome-wide association studies (GWAS) have linked
diseases to polymorphisms, but have not been able to pinpoint the cells responsible for
those phenotypes. By measuring the activity of regulatory elements that overlap regions
with predicted sites of functional variation from GWAS, it is now possible to more
accurately predict the specific cell types impacted by genetic variants linked to diverse
human diseases [24-26]. To do this, we first filtered for GWAS that were significantly
enriched in hematopoietic cells (Fig. S4a,b; see methods), then calculated deviation
scores for each GWAS across the hematopoietic hierarchy as described above. We found
that each of these associations can be traced through the hematopoietic lineage to predict
the developmental point at which each variant may first exert its effects, thus enriching
our understanding of developmental origins of human disease (Fig. 4h-k and Fig. S4c).
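
A heavily simplified version of a per-cell-type deviation score for a set of GWAS-linked elements might look like the sketch below. It computes only a raw deviation (observed versus expected reads within the peak set) and omits the bias correction for read depth, sequence composition, and signal-to-noise used in this work. The count matrix, the cell-type labels, and the function name `raw_deviation` are illustrative assumptions.

```python
def raw_deviation(counts, peak_set):
    """Relative gain or loss of accessibility within a peak set, per cell type.

    counts:   dict mapping cell type -> list of read counts, one per peak.
    peak_set: indices of peaks overlapping the variants of interest.
    The score is (observed - expected) / expected, where the expectation
    assumes every cell type devotes the same share of reads to the peak set.
    """
    total_all = sum(sum(c) for c in counts.values())
    in_set_all = sum(c[i] for c in counts.values() for i in peak_set)
    expected_fraction = in_set_all / total_all
    scores = {}
    for cell_type, c in counts.items():
        observed = sum(c[i] for i in peak_set)
        expected = sum(c) * expected_fraction
        scores[cell_type] = (observed - expected) / expected
    return scores


# Hypothetical counts for three cell types over four peaks.
counts = {
    "type_1": [10, 10, 40, 40],
    "type_2": [40, 40, 10, 10],
    "type_3": [25, 25, 25, 25],
}
# Suppose peaks 0 and 1 overlap the variants from one GWAS.
print(raw_deviation(counts, [0, 1]))
```

Positive scores mark cell types in which the variant-linked elements are more accessible than expected, and negative scores mark those in which they are less accessible.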
(MCV), a measure of the average volume of an erythrocyte cell, are most strongly
enriched in erythroblasts (Fig. 4h). Intriguingly, many regions associated with MCV
polymorphisms first become accessible at the CMP stage and increase in accessibility in
MEP cells. These non-coding polymorphisms are predicted to affect transcription factor
binding and would, therefore, lead to closure of sites that would otherwise be accessible.
From this, MCV-associated polymorphisms found in the accessible regions of CMPs and
MEPs suggest that these polymorphisms exert their effects prior to full erythroid lineage
(RA) show a strong enrichment in B cells (Fig. 4i). This association is consistent with the
known role of autoantibodies and pathogenic B cells in the pathogenesis of RA, as well
disease in which hair is lost from some or all areas of the body. The autoimmunity
driving this disease has recently been associated with both innate and adaptive immune
responses [29], a result consistent with the enrichment of polymorphisms for alopecia areata
in both CD4+ and CD8+ T cells and monocytes (Fig. 4j). B cells also harbor many active
elements associated with alopecia areata but have not been studied in this disease,
suggesting a new direction of investigation. Importantly, the disease associations that are
highlighted by our data are not limited to diseases canonically associated with
enrichment in B cells and monocytes, two cell types that have predicted roles in the
Discussion
landscape of 13 unique blood cell types. This resource relies on the accurate and precise
determination of the epigenomic landscapes in primary human blood cells, made possible
by Fast-ATAC. The chromatin accessibility profiles of blood cells are highly cell type
specific and allow for a much more robust classification system than more frequently
specifically distal enhancers, groups individual cell types with extremely high cluster
purity, demonstrating that these distal regulatory elements more precisely define cell
identity and developmental trajectory. Enhancer cytometry harnesses this specificity and
enumerates the frequencies of pure cell types in complex cell mixtures. This technique
enabled the accurate deconvolution of data derived from CD34+ bone marrow cells into
the constituent highly similar HSPC cell types. Flow cytometry has become a standard
technique, but it is typically limited to a handful of cell surface markers, each requiring a
different antibody that may have off-target binding and gating idiosyncrasies. In contrast,
method destroys the cell as the measurements are made, and thus does not permit
prospective cell purification at present. We note that while we have used well-
characterized cell types with known cell surface immunophenotypes to generate pure cell
type reference maps, single cell ATAC-seq with enhancer cytometry may be used as an
unbiased measure of cell type identity within a population providing archetypal cell
profiles within complex cellular populations. In principle, this general approach may be
the open chromatin landscapes of specific hematopoietic cell types, notably the
developmental contexts in which the disease-relevant elements first become active. In the
case of mean corpuscular volume, a measurement of the size of red blood cells, the
strongest association occurs in erythroblast cells, but a significant association can be seen
as early as the common myeloid progenitor (CMP) stage. These results are consistent
with the concept that many enhancers are developmentally primed prior to their
human HSPC subtypes, we are able to identify the earliest progenitor cells that may be
relevant in the pathogenesis of specific diseases and elucidate putative targets for
corrective action. It is now well accepted that effective genetic correction of coding
mutations needs to take place in the stem cell compartment - e.g. the HSC in blood or
basal cells in epithelia - in order to achieve long lasting phenotypic correction in the
tissue. The same logic applies to genetic variants in the noncoding genome and suggests
the need to map the developmental ontogeny of regulatory elements. Comprehensive and
cell type-specific regulome maps will help to nominate hypotheses of relevant cell types
in diseases.
regulators that drive blood cell identity and function. Integration of ATAC-seq and RNA-
seq data improves motif-transcription factor pairing and enables the accurate
tools that model both cis [32] and trans [33] determinants of chromatin accessibility and gene
expression.
Chapter 5 - Figures and Figure Legends
Schematic of the human hematopoietic hierarchy shows the 13 primary cell types
analyzed in this work. Granulocytes and megakaryocytes were excluded. (b) Diagram of
analyses performed using paired ATAC-seq and RNA-seq data in both primary human
blood cells and primary patient AML cells. (c) Normalized ATAC-seq profiles at
developmentally important genes. Profiles represent the union of all technical and
biological replicates for each cell type. See Supplementary Table 1 for the exact number
of technical and biological replicates for each cell type. (d-g) Scatter plot showing
correlation of (d) technical replicates, (e) different human donors, (f) ATAC-seq and
DNase-seq data derived from CD34+ HSPCs, and (g) ATAC-seq HSCs with bulk CD34+
HSPCs.
Figure 2. Distal regulatory elements enable accurate classification of the
hematopoietic hierarchy. (a,b) Hierarchical clustering of (a) RNA-seq (N=49) and (b)
ATAC-seq (N=77) data from all biological replicates of 13 normal hematopoietic cell
types. Values shown are Pearson correlation coefficients. Cluster purity quantifies the
degree that cells of the same lineage (color coded in the key) are clustered together. (c,d)
Phylogenetic dendrograms of (c) RNA-seq and (d) ATAC-seq data showing inter-cell
type correlations derived from aggregate averages of all biological and technical
replicates. Length of tree branches represents Euclidean distance. Data represents the
union of all technical and biological replicates for each cell type. (e,f) Hierarchical
clustering of ATAC-seq profiles (N=77) mapping to (e) promoters and (f) distal
regulatory elements. (g) ATAC-seq peaks in the TET2 locus show highly variable distal
regulatory landscapes (left) and relatively constitutive expression of TET2 (right). Data
represents the union of all technical and biological replicates for each cell type.
Figure 3. Enhancer cytometry allows for deconvolution of the hematopoietic
hierarchy. (a) Normalized ATAC-seq profiles of HSPC subsets and ensemble CD34+
subpopulations. Predicted cell fractions are shown on the left and nearest annotated genes
are shown on the bottom. (b) Schematic of enhancer cytometry, including methods to
define a signature matrix of highly cell-type specific enhancers (right panel, N=735).
to test robustness to (c) sequential subtraction and (d) randomized mixture content. Test
data and training data are non-overlapping. Error bars in (c) represent the standard
derived from FACS sorted bulk CD34+ HSPCs identifies fractional contribution from all
expected cell types. (f) Correlation of predicted fractional contribution of each HSPC cell
type by enhancer cytometry versus flow cytometric ground truth data of input CD34+
cells.
Figure 4. Integrative analysis of the hematopoietic regulome refines transcriptional
circuitry driving cell specification and enriches the understanding of human disease
(a) Transcription factor dynamics showing major TFs driving hematopoietic regulomes.
The size of the circle represents the effect of that motif in driving accessibility in human
blood cells. The relative distance between circles represents the co-occurrence of motifs
throughout hematopoietic differentiation (see methods). (b,c) Usage of the (b) GATA and
(c) PAX motif throughout hematopoietic differentiation. Values represent the relative
deviation of the motif accessibility, a measure of motif usage, compared to that in HSCs.
(d) Footprint analysis of the GATA motif in MEP and CLP cells. (e) Correlation
(Pearson) of motif accessibility and significance of gene expression for GATA (top) and
PAX (bottom). Red dots represent DNA-binding factors annotated to bind the given
motif, gray dots represent all other DNA-binding factors. (f,g) Expression of (f) GATA1
and (g) PAX5 phenocopies the usage of the GATA motif throughout hematopoietic
hematopoietic regulatory elements with GWAS SNPs for (h) mean corpuscular volume,
(i) rheumatoid arthritis, (j) alopecia areata, and (k) Alzheimer's disease.
Supplementary Figure 1. Data processing pipelines. (a) ATAC-seq insert size
annotated transcription start sites (TSS) from Fast-ATAC data compared to (b) DNase-seq
and (c) previously published ATAC-seq data using the original ATAC-seq protocol [10].
(d) Fraction of total mitochondrial reads derived from the original ATAC-seq protocol
constitutively accessible region of the genome. Profiles represent the union of all
technical and biological replicates for each cell type. (f,g) GO Term analyses from unique
(f) gene expression and (g) accessible peaks from normal hematopoietic cells. (h)
Supplementary Figure 2. Cell sorting strategies. (a) Representative examples of
sorting strategies for the seven CD34+ HSPC populations isolated in this study.
Supplementary Figure 3. Trans regulators of hematopoiesis. (a) Summary of motif
represented above each column. (b) Clustering of hematopoiesis TF motifs (N=46) with
CIS-BP motifs (N=806) using Pearson correlation (see methods). (c,d) Example of
clustered motifs for (c) GATA4 and (d) MEIS1. (e) Histogram of all correlation values
shown in (b) with lists of putative hematopoietic regulators highlighted (N=255). (f)
developmentally important TFs, GATA1 and PAX5. (g,h) Summary list of putative TF
(g) positive and (h) negative regulators of hematopoiesis. Motifs are listed on the left and
genes are listed on the right. Values represent correlation coefficients (Pearson).
Supplementary Figure 4. GWAS enrichments across hematopoiesis. (a)
shown in (b). (b) Hierarchical clustering of all GWAS (N=235) across diverse tissues. (c)
minimum signal.
References
19. Nerlov, C. & Graf, T. PU.1 induces myeloid lineage commitment in multipotent
hematopoietic progenitors. Genes & Development 12, 2403-2412 (1998).
20. Neph, S. et al. An expansive human regulatory lexicon encoded in transcription
factor footprints. Nature 489, 83-90 (2012).
21. Sherwood, R. I. et al. Discovery of directional and nondirectional pioneer
transcription factors by modeling DNase profile magnitude and shape. Nat.
Biotechnol. 32, 171-178 (2014).
22. Weirauch, M. T. et al. Determination and Inference of Eukaryotic Transcription
Factor Sequence Specificity. Cell 158, 1431-1443 (2014).
23. Vaquerizas, J. M., Kummerfeld, S. K., Teichmann, S. A. & Luscombe, N. M. A
census of human transcription factors: function, expression and evolution. Nature
Reviews Genetics 10, 252-263 (2009).
24. Gjoneska, E. et al. Conserved epigenomic signals in mice and humans reveal
immune basis of Alzheimer's disease. Nature 518, 365-369 (2015).
25. Farh, K. K.-H. et al. Genetic and epigenetic fine mapping of causal autoimmune
disease variants. Nature 518, 337-343 (2015).
26. Maurano, M. T. et al. Systematic Localization of Common Disease-Associated
Variation in Regulatory DNA. Science 337, 1190-1195 (2012).
27. De Vita, S. et al. Efficacy of selective B cell blockade in the treatment of
rheumatoid arthritis: evidence for a pathogenetic role of B cells. Arthritis &
Rheumatology 46, 2029-33 (2002).
28. Coenen, M. J. H. & Gregersen, P. K. Rheumatoid arthritis: a view of the current
genetic landscape. Genes and Immunity 10, 101-111 (2009).
29. Petukhova, L. et al. Genome-wide association study in alopecia areata implicates
both innate and adaptive immunity. Nature 466, 113-117 (2010).
30. Butovsky, O., Kunis, G., Koronyo-Hamaoui, M. & Schwartz, M. Selective
ablation of bone marrow-derived dendritic cells increases amyloid plaques in a
mouse Alzheimer's disease model. European Journal of Neuroscience 26, 413-416
(2007).
31. El Khoury, J. et al. Ccr2 deficiency impairs microglial accumulation and
accelerates progression of Alzheimer-like disease. Nat Med 13, 432-438 (2007).
32. González, A. J., Setty, M. & Leslie, C. S. Early enhancer establishment and
regulatory locus complexity shape transcriptional programs in hematopoietic
differentiation. Nature Genetics (2015).
33. Whitaker, J. W., Chen, Z. & Wang, W. Predicting the human epigenome from
DNA motifs. Nat Methods 12, 265-272 (2015).
CHAPTER SIX The regulatory landscape of acute myeloid leukemia5
Introduction
has been shown to play a critical role in the development of hematologic malignancies [1].
Despite a low overall mutation rate [2] and prolonged periods between cell divisions [3], the
long lifespan of HSCs makes them susceptible to the accumulation of mutations over
time. Recent work [4-8] has demonstrated that HSCs constitute a cellular reservoir for
In particular, in the case of acute myeloid leukemia (AML), HSCs isolated from leukemia
patients have been shown to harbor some but not all of the genetic alterations found in the
frankly leukemic cells and have, therefore, been termed pre-leukemic HSCs. Importantly,
many of the genes found to be recurrently mutated during the pre-leukemic phase of
AML have been shown to regulate the epigenome [5,6], such as DNA methyltransferase 3A
(IDH1/2) [11,12]. However, the role of these epigenetic mutations during the evolutionary
process of leukemogenesis and their effects on the regulatory networks that govern
centered on how cell fate choices are corrupted in human leukemias [13] whether leukemic
cells truly harbor multiple lineage-specific regulatory programs at once (termed lineage
With hematopoiesis as a reference of normal development, we measure the effects
on the leukemogenic process of both early mutations in epigenetic modifiers and late
cells isolated from normal human bone marrow and patient-matched pre-leukemic HSCs,
leukemia stem cells, and leukemic blast cells, we chart the genetic and epigenetic
progression from normal to malignant in AML. We demonstrate that the vast majority of
epigenetic and transcriptomic change that occurs during leukemogenesis is derived from
similar epigenetic alterations, suggesting a common path for the leukemogenic process.
Our results provide key insights into the evolutionary process of leukemogenesis and
identify important transcriptional programs that could be targeted to disrupt this process
during its earliest stages. In summary, this work serves as a rich resource for the study of
leukemia stem cells (LSCs), and leukemic blast cells (blasts). Each of these leukemic cell
Unmutated HSCs serve as the reservoir for mutation acquisition during the early phases
of leukemogenesis (Fig. 1a). Acquisition of mutations, typically in genes that regulate the
epigenome, creates pHSCs that expand to create a pre-leukemic clone. Subsequent
generates LSCs that are capable of self-renewal and the production of AML blasts (Fig.
1a).
HSCs isolated from a leukemia patient that harbor at least the first mutation. We profiled
the mutation frequency of known leukemogenic driver mutations in HSCs, T cells, and
blast cells from 39 AML patients. Pre-leukemic burden is highly variable in this cohort
with some patients exhibiting a complete repopulation of the HSC compartment with pre-
leukemic cells and others exhibiting undetectable levels of pre-leukemic mutations (Fig.
1b). The pre-leukemic mutations found in this large cohort recapitulate previous
findings [5,6] showing that early mutations tend to occur in genes that modify the epigenome
The AML leukemogenic process provides a novel system to study the genesis and
evolution of cancer at the level of the epigenome through the lens of normal
protocol produced robust accessibility profiles from cryopreserved primary patient AML
cells (Fig. 1c). This allowed us to quantify the heterogeneity exhibited among the
different stages in leukemia evolution. We find that the level of epigenetic variance
between all samples of the same cell type increases through progressive stages of
leukemia evolution (Fig. 1d, see methods). As expected, all AML cell types exhibit more
inter-donor variance than normal hematopoietic cells. This may be the consequence of
the epigenetic mutations present in the leukemic cell types or a manifestation of the point
along the normal hematopoietic hierarchy at which the particular AML cell types exist.
variation amongst the AML cell types consistent with different developmental stages
(Fig. 1e). When overlaid across the principal components derived from normal
hematopoiesis, we find that the first four principal components from normal
hematopoietic differentiation account for 60% of the variation observed in our leukemia
samples (Fig. 1f). Assigning a score to the myeloid differentiation component of our data,
we find that the various stages of AML spread across the trajectory from HSC to
monocyte, indicating that the process of leukemogenesis largely mirrors the process of
normal myelopoiesis (Fig. 1g). Consistent with their functional ability to produce both
lymphoid and myeloid cells in xenotransplantation assays [6,15,16], pHSCs are most closely
related to HSCs and MPPs (Fig. 1g). As shown previously [17], LSCs show strong similarity
to GMP and LMPP cells, and leukemic blast cells show a wider distribution, with less
differentiated blasts clustering with GMP cells and more differentiated blasts clustering
with monocyte cells [18,19] (Fig. 1g). These results indicate that the majority of inter-patient
variation in AML is derived from the developmental position along the normal myeloid
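
The question of how much variation in one set of samples is captured by principal components learned from another can be made concrete with a short sketch. It assumes the components are already available as orthonormal vectors (for example, from a PCA of reference samples); the toy samples, the components, and the function name `variance_explained` are hypothetical.

```python
def variance_explained(samples, components):
    """Cumulative fraction of variance in `samples` captured by leading components.

    samples:    list of profiles (one list of values per sample).
    components: orthonormal vectors learned elsewhere, for example principal
                axes of a separate reference data set (an assumption here).
    """
    n, d = len(samples), len(samples[0])
    # Centre on the mean of the samples being explained.
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    centered = [[s[j] - mean[j] for j in range(d)] for s in samples]
    total = sum(v * v for row in centered for v in row)
    cumulative, running = [], 0.0
    for comp in components:
        # Squared length of every sample's projection onto this component.
        running += sum(sum(r[j] * comp[j] for j in range(d)) ** 2
                       for r in centered)
        cumulative.append(running / total)
    return cumulative


# Hypothetical test samples over three features.
samples = [[2, 0, 1], [-2, 0, -1], [0, 1, 0], [0, -1, 0]]
# Two hypothetical reference components (orthonormal).
components = [[1, 0, 0], [0, 1, 0]]
print(variance_explained(samples, components))
```

The remainder, one minus the final cumulative value, is variation that the reference components do not describe.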
AML cell types exhibit lineage infidelity with regulatory contributions from multiple
specific AML might harbor a unique collection of multiple distinct normal regulatory
programs. Using enhancer cytometry, we quantified the contribution of each normal cell
type for each leukemic sample assayed (Fig. 2a). We found that each patient, at each
epigenetic diversity of leukemic cell types. Importantly, we find that the majority of the
patient donors have AML blasts that are clonally derived and harbor all the leukemic
mutations at comparable allele frequencies. Together, these findings raise the intriguing
possibility that AML cell types may either i) exist in stable intermediate cell states that
are not normally maintained during normal hematopoiesis, or ii) show developmental
wide approaches for measuring regulatory elements average over cellular states and
cytometry would be able to resolve these two pressing hypotheses (Fig. 2b).
purified LSCs and blast cells from patient SU070. Although CIBERSORT could
within single cells often contained 0, 1 or 2 fragments, consistent with our previous
work [20], and was simply too sparse for existing deconvolution methods such as
principal component analysis (PCA) of the regulome, learned from normal bulk
developmental lineages and enable enhancer cytometry in single-cells (Fig. 2b). Indeed,
we found that with this approach, single cell accessibility profiles could be projected onto
hematopoietic principal components with high accuracy (Fig. 2c,d and Fig. S1b,c; see
methods). To better visualize and quantify heterogeneity within these cell subsets we
progression (Fig. 2e). Using these projections, we find that primary patient derived LSCs
and blast cells are remarkably homogeneous and indeed exist at intermediate cell states.
cell line HL60, which also shows mixed normal cell contributions using ensemble (Fig.
S1a) and single-cell (Fig. 2e) enhancer cytometry. To further test our ability to project
purified MEP cells. Intriguingly, we find single MEPs show a predominant peak centered
at the MEP position with a prominent tail towards CMP along erythropoietic
differentiation (Fig. 2f and Fig. S1c). This observation is consistent with post-sort
transitional cell-states (Fig. S1d). Importantly, we also find that biological replicates of
scATAC-seq from the erythroleukemia cell line (K562) show highly reproducible
lineage infidelity model wherein primary human AML cells and AML-derived cell lines
can simultaneously access two normally independent regulatory programs within the
same cell.
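
Projecting a sparse single-cell profile onto components learned from bulk data can be sketched in a few lines. This is one simple way to do it, not necessarily the procedure used here (see methods): the cell is depth-normalized, centred on the reference mean, and scored by dot products with the component loadings. The reference mean, the component, and the function name `project_cell` are hypothetical.

```python
def project_cell(fragment_peaks, reference_mean, components):
    """Project one sparse single-cell profile onto reference components.

    fragment_peaks: peak index of every fragment observed in the cell.
    reference_mean: mean share of signal per peak in the reference (sums to 1).
    components:     orthonormal loading vectors learned from bulk reference data.
    """
    depth = len(fragment_peaks)
    if depth == 0:
        raise ValueError("cell has no fragments in peaks")
    # Depth-normalize so that cells with few fragments remain comparable.
    profile = [0.0] * len(reference_mean)
    for i in fragment_peaks:
        profile[i] += 1.0 / depth
    centered = [p - m for p, m in zip(profile, reference_mean)]
    return [sum(c * w for c, w in zip(centered, comp)) for comp in components]


# Hypothetical reference over four peaks with a single component that
# separates peaks 0-1 from peaks 2-3.
reference_mean = [0.25, 0.25, 0.25, 0.25]
components = [[0.5, 0.5, -0.5, -0.5]]
for cell in ([0, 1], [2, 3], [0, 2]):
    print(cell, project_cell(cell, reference_mean, components))
```

A cell with fragments on both sides of the component lands between the two extremes, which is the kind of intermediate position discussed above.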
Generation of synthetic normal analogs for assessment of AML-specific biology
The ability to accurately quantify the contribution of each normal cell regulome to
the epigenetic profile of a leukemic cell type enables a more robust identification of
past have relied on comparing the malignant cells to a carefully chosen normal cell type.
Our data (Fig. 2a) show that this may not be sufficient, and that multiple distinct normal
regulatory patterns are contributing to the biology of AML cells. Due to these mixed
lineages, we suspect that past epigenomic and transcriptomic cancer studies may be
highly biased towards the rediscovery of normal and developmentally dynamic genes
rather than bona fide cancer-specific genes. We reasoned that effective removal of this
which represent admixtures of various normal cells defined by enhancer cytometry (see
methods). While comparison of AML cell types to their closest normal cell analogs yields
a high correlation (R = 0.86, Fig. 2g), comparison of AML cell types to their synthetic
normal analogs yields an even higher correlation (R = 0.91, Fig. 2h) and, more
synthetic normal analogs consistently resulted in higher Pearson correlation values (Fig.
S1e) and provided fewer cancer-specific peaks than comparison to the closest normal
modules that are utilized by AML cells (Fig. 3a and Fig. S2a). We can track the usage of
these modules through leukemogenesis and identify patterns related to specific AML cell
types (Fig. 3b). Additionally, each module shows enrichment for peaks associated with
different key transcription factors (Fig. 3c). For example, modules 1 and 2 show strong
enrichment for JUN and FOS activity, indicating the activation of AP-1-dependent stress
moderate but consistent selective targeting of AML blasts (Fig. S2c-e). This observation
AML21 and indicates that similar strategies may prove efficacious in targeting pre-
leukemic HSC.
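
The synthetic normal analogs introduced earlier in this section can be pictured as a weighted mixture of reference profiles. The sketch below is a minimal illustration under that assumption; the profiles, the fractions, and the function names `synthetic_normal` and `residual_signal` are hypothetical and do not reproduce the exact procedure in the methods.

```python
def synthetic_normal(normal_profiles, fractions):
    """Weighted average of normal reference profiles.

    normal_profiles: dict mapping cell type -> accessibility per peak.
    fractions:       dict mapping cell type -> estimated contribution,
                     for example from a deconvolution step.
    """
    n_peaks = len(next(iter(normal_profiles.values())))
    total = sum(fractions.values())
    return [sum(fractions[ct] * normal_profiles[ct][i] for ct in fractions) / total
            for i in range(n_peaks)]


def residual_signal(sample, baseline):
    """Per-peak signal in the sample not accounted for by the baseline."""
    return [s - b for s, b in zip(sample, baseline)]


# Hypothetical reference profiles over three peaks.
normal_profiles = {"type_1": [10, 0, 2], "type_2": [0, 10, 2]}
# Hypothetical estimated contributions for one leukemic sample.
fractions = {"type_1": 0.75, "type_2": 0.25}
baseline = synthetic_normal(normal_profiles, fractions)
# Peaks 0 and 1 match the mixture; peak 2 carries extra signal.
sample = [7.5, 2.5, 6.0]
print(baseline, residual_signal(sample, baseline))
```

Peaks whose residual stays large after subtracting the mixture are the ones worth considering as sample-specific rather than inherited from normal developmental programs.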
in primary AML samples have not been characterized. Using ATAC-seq and enhancer
cytometry we show that pHSCs share many regulatory programs with HSCs and MPPs
the earliest known event of AML evolution (Fig. 3b). This repressed regulatory module is
enriched for motifs associated with HSPCs (i.e. HOX and GATA) and provides direct
evidence to support a model where pHSCs maintain a unique epigenetic and functional
state.
associated with HSPCs, we probed pHSCs for phenotypic changes related to self-renewal
and differentiation. When pushed to differentiate down the myeloid and erythroid
lineages (Fig. S2f), pHSCs showed a strong resistance towards differentiation, instead
favoring maintenance of the stem cell state (Fig. 3d,e). Given the decreased accessibility
of module 6, this suggests that accessibility at certain stem cell-related motifs may confer
the ability to properly differentiate rather than properly self-renew. We have previously
assessed the effect of depletion of GATA1 and GATA2 on HSPC differentiation and self-
observation excludes these GATA factors from mediating the defects in differentiation
associated with repression of module 6. Given the well-studied role of HOX factors in
stem cells [22], in particular the role of HOXA9 in HSCs, we hypothesized that HOXA9
might mediate the observed stemness phenotype. In fact, previous studies have shown an
increase in the number of HSCs in mice deficient for HOXA9 [23]. From this, we reasoned
that loss of accessibility at HOXA9 target sites may confer an increase in stemness and
HOXA9 by short hairpin RNA (shRNA) knockdown (Fig. S2g) in umbilical cord blood
CD34+ HSPCs led to a retention of stemness in the context of both myeloid (Fig. 3f) and
granulocytes and erythroid cells was also observed (Fig. S2h,i), consistent with results
from mouse models of HOXA9 deficiency [23,24]. In addition, we note that this retention of
stemness is also observed in the absence of a differentiation stimulus (Fig. S2j). Together,
these results suggest that decreased HOX accessibility in pHSCs may promote retention
motifs helps to explain the observation that pHSCs outcompete their normal HSC
One implication of this model is that pre-leukemic burden may have adverse effects on
patient survival, despite the fact that pHSCs do not confer disease in xenograft transplant
inversely correlates with overall survival and relapse-free survival (Fig. 3h,i). High pre-
leukemia relapse (hazard ratio = 3.30 for overall survival and 2.99 for relapse free
survival, p < 0.05). These results further implicate pHSCs in AML pathology and suggest
a mechanism wherein AML arises from the presence of a pre-leukemic clone that is
capable of outcompeting its normal HSC counterparts (Fig. S7k) and predispose patients
regulomes enables the identification of novel features of pHSC biology that have
Discussion
The study of acute myeloid leukemia sheds light on the biology and step-wise
leukemic HSC, LSC, and blast cells representing three distinct time points in AML
evolution. Examination of the average epigenetic variance across the genome shows that
variance increases through the stages of leukemia evolution with the majority of this
differentiation. The epigenetic landscapes of AML blast cells isolated from various
patients are extremely divergent, highlighting the need for personalized approaches to
A longstanding debate in cancer biology is how cancer cells violate cell lineage
rules. Cancer cells with markers or morphologies of one cell type have been shown to
also express markers of a different cell type [25], which raises diagnostic challenges and
treatment conundrums. Two classic but competing models posited (i) lineage infidelity:
a single cancer cell simultaneously accesses two normally distinct regulatory programs;
and the cancer cell is simply an expansion of this rare but physiologic bipotential state.
evidence of lineage infidelity: a single cell accessing a mixed regulatory program. This
result has potentially important diagnostic and mechanistic implications, and we build
upon both classical models to address this challenge. Comparison of cancer to matched
normal cells is one of the most basic and commonplace experiments in cancer biology,
but lineage infidelity demonstrates that there may be no appropriate normal for
aberrations.
discover the loss of HOXA9-mediated accessibility as the most consistent defect in pre-
leukemic HSCs. We found that HOXA9 loss can, in fact, cause defects in differentiation
pathogenesis. These results provide potential avenues for therapeutic intervention during
widespread phenomenon in many types of cancer, and that our integrative approach using
Chapter 6 - Figures and Figure Legends
activated signal transduction such as FLT3 and RAS lead to generation of leukemia stem
cells which both self-renew and produce leukemic blast cells. (b) Genotype and mutation
frequencies of HSCs isolated from AML patients (N=39). Color indicates the percent of
cells mutated as estimated from the variant allele frequency. Gray color indicates a
mutation known to be present in leukemic cells but not observed during the pre-leukemic
phase of AML evolution (i.e. a late mutation event). Asterisks indicate the predicted first
mutation. If a mutation is bi-allelic, the representative bar is divided in half. Patients with
more than 20% of HSCs harboring a pre-leukemic mutation were classified as high
burden and those patients with less than 20% of HSCs harboring a pre-leukemic
mutation were classified as low burden. (c) Normalized sequencing track of control
loci on chromosome 19 from FACS-purified AML cell types. Profiles represent the union
of all biological replicates for each cell type. (d) Mean variance of chromatin accessibility
across the genome as calculated by a moving average across each leukemic cell stage (see
GATA2 (left) and CEBPB (right). Profiles represent the union of all biological replicates
for each cell type: pHSC (N=12), LSC (N=8), and blasts (N=12). (f) Cumulative variance of
AML ATAC-seq data explained by the first N principal components derived from normal
hematopoiesis. (g) Myeloid development score in normal blood cell types (N=4
biological replicates) and AML cell types. The myeloid score is calculated from the first
myelopoiesis.
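
The burden classification described in panel (b) can be illustrated with a minimal Python sketch. It converts a variant allele frequency (VAF) into an estimated fraction of mutated cells and applies the 20% threshold. The conversion rule (twice the VAF for a heterozygous mutation at a copy-neutral diploid locus, capped at 100%), the function names, and the example patients are assumptions made for illustration; they are not taken from the analysis code used in this work.

```python
def fraction_cells_mutated(vaf, biallelic=False):
    """Estimate the fraction of cells carrying a mutation from its VAF.

    Assumption (not stated in the legend): a copy-neutral diploid locus,
    so a heterozygous mutation present in every cell gives VAF = 0.5 and
    a bi-allelic mutation present in every cell gives VAF = 1.0.
    """
    if not 0.0 <= vaf <= 1.0:
        raise ValueError("VAF must be between 0 and 1")
    fraction = vaf if biallelic else 2.0 * vaf
    return min(fraction, 1.0)  # cap at 100% of cells


def classify_burden(first_mutation_vaf, biallelic=False, threshold=0.20):
    """Label a patient as 'high' or 'low' pre-leukemic burden.

    The 20% threshold follows the legend; treating exactly 20% as 'high'
    is a choice made for this sketch.
    """
    fraction = fraction_cells_mutated(first_mutation_vaf, biallelic)
    return "high" if fraction >= threshold else "low"


# Hypothetical patients: (identifier, VAF of the predicted first mutation)
patients = [("P1", 0.02), ("P2", 0.15), ("P3", 0.35)]
for name, vaf in patients:
    print(name, classify_burden(vaf))
```

A real analysis would also need to account for copy-number changes and sequencing noise, both of which this sketch ignores.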
Figure 2. Enhancer cytometry and single-cell regulomes support a model of lineage
epigenetic landscape of different AML cell types. (b) Schematic of single-cell ATAC-seq
protocol and analysis. (c,d) Projection of ATAC-seq data derived from (c) single SU070
LSCs and (d) single SU070 blast cells onto the principal components derived from the
normal hematopoietic hierarchy. (e,f) Relative density of (e) single SU070 LSCs, SU070
blasts, and HL60 and (f) single MEP and K562 cells projected onto a one-dimensional
replicates of K562 cells are marked as K562-1 and K562-2. (g) Scatter plot showing
the correlation of ATAC-seq data derived from SU353 blast cells with the closest normal
cell type (GMP) (R=0.86). Using a log2(fold change) cutoff of 4, we identify 8,209 peaks
depleted and 10,954 peaks enriched in SU353 blast cells. (h) Scatter plot, as shown in (g),
showing the correlation between SU353 blast cells with the enhancer cytometry-defined
synthetic normal analog (R=0.91). A log2(fold change) cutoff of 4 identifies 5,887
peaks enriched in the synthetic normal analog and 8,003 peaks enriched in SU353 blast
cells. (i) Comparison of AML cell types to synthetic normal analogs. The closest normal
is shown in color. The percent of the total significant peaks that are removed by
Figure 3. Early chromatin accessibility alterations within pHSCs promote stemness
peaks identifies 6 distinct regulatory modules. (b) Enrichment of each module, identified
Gray bars shown represent 1 S.D. across all samples of that given cell type. (c)
expression after 6 days of enforced differentiation down the (d) myeloid lineage and (e)
erythroid lineage. Error bars represent 1 S.D. Experiments were done in triplicate. (f,g) Fold
change in the percent of cells expressing CD34 as measured by flow cytometric analysis
of human umbilical cord blood-derived HSCs transduced with shRNAs targeting HOXA9
or a non-targeting control. Percent CD34+ cells measured after 6 days of enforced
differentiation down the (f) myeloid lineage and (g) erythroid lineage. Only GFP+
transduced cells were analyzed. Error bars represent 1 S.D. Experiments were done in triplicate. (h)
N=15). High pre-leukemic burden was defined as greater than or equal to 20% of HSCs
harboring at least the first pre-leukemic mutation. Survival analysis was performed using
the Kaplan-Meier estimate method. All patients were included for the analysis regardless
of their treatment. P values comparing two Kaplan-Meier survival curves were calculated
using the log-rank (Mantel-Cox) test. Hazard ratios were determined using the Mantel-
Supplementary Figure 1. Validation of enhancer cytometry in AML cell lines and
derived from various blood cell lines demonstrates mixed regulatory contribution from
various normal hematopoietic cell types. (b) Projection of downsampled bulk
hematopoiesis data onto myeloid (left) and erythroid (right) progression. (c) Projection of
single MEPs onto hematopoiesis principal components 2 and 3. (d) Post-sort analysis of
gated for CMP (2.54%), MEP (97.5%) and GMP (0%). (e) Pearson correlations of AML
cell types with the closest normal analog (color) and the enhancer cytometry-derived
synthetic normal (gray). (f) Total significant peaks observed after comparison of AML
cell types to synthetic normal analogs. Significance measured as log2(fold change) > 3.
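
The peak-level comparison between an AML sample and its normal reference can be sketched as follows. The example computes a per-peak log2(fold change), counts peaks beyond the cutoff of 3 stated above, and computes a Pearson correlation between the two profiles. The pseudocount, the function names, and the toy accessibility values are assumptions for illustration only; the legend does not specify how the fold changes were normalized.

```python
import math


def log2_fold_changes(sample, reference, pseudocount=1.0):
    """Per-peak log2(sample / reference).

    The pseudocount (an assumption of this sketch) avoids division by
    zero and log(0) for peaks with no signal.
    """
    if len(sample) != len(reference):
        raise ValueError("sample and reference must cover the same peaks")
    return [math.log2((s + pseudocount) / (r + pseudocount))
            for s, r in zip(sample, reference)]


def count_significant(lfc, cutoff=3.0):
    """Return (n_enriched, n_depleted) for peaks beyond +/- cutoff."""
    enriched = sum(1 for x in lfc if x > cutoff)
    depleted = sum(1 for x in lfc if x < -cutoff)
    return enriched, depleted


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy normalized accessibility values for five peaks (hypothetical data)
blast = [159, 10, 7, 0, 50]
synthetic_normal = [9, 10, 127, 31, 45]

lfc = log2_fold_changes(blast, synthetic_normal)
print(count_significant(lfc, cutoff=3.0))
print(round(pearson(blast, synthetic_normal), 3))
```

Replacing the closest normal cell type with a better-matched reference should raise the correlation and lower the number of peaks called significant, which is the comparison panels (e) and (f) summarize.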
Supplementary Figure 2. Validation of regulatory network analysis in AML cell
types. (a) Principal component analysis of the log2(fold change) values of each AML cell
type compared to its synthetic normal. (b) Expression of JUN in various normal
hematopoietic cells, pHSCs, and blasts. *p<0.05, two-tailed t-test. (c-e) The effect of
JNK/ERK inhibition by (c) JNK-IN-8, (d) SP600125, and (e) SCH772984 was
determined by IC50 of sorted primary AML blast cells in comparison to CD34+ HSPCs
derived from umbilical cord blood. Viability was determined by flow cytometric assessment
of Annexin V and DAPI. (f) Strategy for in vitro differentiation of HSPCs down the
myeloid and erythroid lineages. HSPCs are grown in defined culture media for 6 days
and then analyzed for cell surface markers of stemness or differentiation. Immature cells
at day 6 express CD34 and have not yet upregulated CD33. (g) Quantitative reverse-
in THP1 cells for 72 hours and validated with two separate primer sets. (h,i) Fold change
in the percent of (h) CD15+ granulocytes or (i) CD71+GPA+ erythroblasts between cord
shRNAs after 6 days of differentiation down the (h) myeloid or (i) erythroid lineage.
***p<0.001, ****p<0.0001 by two-tailed t-test. (j) Fold change in the percent of CD34+
HSPCs after 6 days of culture in stemness retention media (see methods) between cord
shRNAs. (k) Burden of mutations in DNMT3A, TET2, IDH1/2, or other genes when
detected in pre-leukemic HSCs. *p < 0.05, **p < 0.01 by two-tailed t-test.
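
For the inhibitor experiments in panels (c-e), an IC50 can be approximated from a viability dose series as sketched below. The legend does not state how the IC50 values were derived, and a fitted dose-response curve is the more common choice. The log-linear interpolation used here is a simpler stand-in, and the function name and example values are hypothetical.

```python
import math


def ic50_by_interpolation(doses, viabilities):
    """Estimate IC50 by linear interpolation on a log10(dose) axis.

    doses: increasing drug concentrations (all > 0).
    viabilities: fraction of viable cells relative to untreated control.
    Returns None if viability never crosses 50% in the tested range.
    """
    points = list(zip(doses, viabilities))
    for (d0, v0), (d1, v1) in zip(points, points[1:]):
        # Find the first pair of adjacent doses that brackets 50% viability
        if v0 >= 0.5 >= v1 and v0 != v1:
            frac = (v0 - 0.5) / (v0 - v1)
            log_ic50 = math.log10(d0) + frac * (math.log10(d1) - math.log10(d0))
            return 10 ** log_ic50
    return None


# Hypothetical dose series (arbitrary concentration units) and viabilities
doses = [0.1, 1.0, 10.0, 100.0]
viabilities = [0.95, 0.75, 0.25, 0.05]
print(ic50_by_interpolation(doses, viabilities))
```

Comparing the estimate for sorted AML blasts against that for CD34+ HSPCs treated with the same dose series gives the kind of selectivity comparison the panels describe.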
References
1. Shih, A. H., Abdel-Wahab, O., Patel, J. P. & Levine, R. L. The role of mutations in
epigenetic regulators in myeloid malignancies. Nature Reviews Cancer 263, 2235
(2015).
2. Araten, D. J. et al. A quantitative measurement of the human somatic mutation
rate. Cancer research 65, 8111-7 (2005).
3. Sun, J. et al. Clonal dynamics of native haematopoiesis. Nature (2014).
4. Jan, M. et al. Clonal evolution of preleukemic hematopoietic stem cells precedes
human acute myeloid leukemia. Science translational medicine 4, 110 (2012).
5. Corces-Zimmerman, M. R. & Majeti, R. Pre-leukemic evolution of hematopoietic
stem cells: the importance of early mutations in leukemogenesis. Leukemia 28,
2276-2282 (2014).
6. Shlush, L. I. et al. Identification of pre-leukaemic haematopoietic stem cells in
acute leukaemia. Nature 506, 328-333 (2014).
7. Lindberg, J. et al. Clonal Hematopoiesis and Blood-Cancer Risk Inferred from
Blood DNA Sequence. N Engl J Med 371, 2477-2487 (2014).
8. Jaiswal, S. et al. Age-Related Clonal Hematopoiesis Associated with Adverse
Outcomes. N Engl J Med 371, 2488-2498 (2014).
9. Okano, M., Xie, S. & Li, E. Cloning and characterization of a family of novel
mammalian DNA (cytosine-5) methyltransferases. Nature Genetics 19, 219-220 (1998).
10. Tahiliani, M. et al. Conversion of 5-methylcytosine to 5-hydroxymethylcytosine in
mammalian DNA by MLL partner TET1. Science 324, 930-5 (2009).
11. Dang, L. et al. Cancer-associated IDH1 mutations produce 2-hydroxyglutarate.
Nature 462, 739-44 (2009).
12. Figueroa, M. E. et al. Leukemic IDH1 and IDH2 Mutations Result in a
Hypermethylation Phenotype, Disrupt TET2 Function, and Impair Hematopoietic
Differentiation. Cancer Cell 18, 553-567 (2010).
13. Greaves, M. F., Chan, L. C., Furley, A. J. W., Watt, S. M. & Molgaard, H. V.
Lineage Promiscuity in Hemopoietic Differentiation and Leukemia. Blood 67, 1-11
(1986).
14. Dohner, H., Weisdorf, D. J. & Bloomfield, C. D. Acute Myeloid Leukemia. N
Engl J Med 373, 1136-52 (2015).
15. Jan, M. & Majeti, R. Clonal evolution of acute leukemia genomes. Oncogene 16
(2012).
16. Corces-Zimmerman, M. R., Hong, W.-J., Weissman, I. L., Medeiros, B. C. &
Majeti, R. Preleukemic mutations in human acute myeloid leukemia affect
epigenetic regulators and persist in remission. Proceedings of the National
Academy of Sciences of the United States of America 111, 2548-53 (2014).
17. Goardon, N. et al. Coexistence of LMPP-like and GMP-like Leukemia Stem Cells
in Acute Myeloid Leukemia. Cancer Cell 19, 138-152 (2011).
18. Bennett, J. M. et al. Proposals for the classification of the acute leukaemias.
French-American-British (FAB) co-operative group. British Journal of
Haematology 33, 451-8 (1976).
19. van't Veer, M. B. The diagnosis of acute leukemia with undifferentiated or
minimally differentiated blasts. Annals of Hematology 64, 161-5 (1992).
20. Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of
regulatory variation. Nature 523, 486-490 (2015).
21. Volk, A. et al. Co-inhibition of NF-κB and JNK is synergistic in TNF-expressing
human AML. Journal of Experimental Medicine 211, 1093-1108 (2014).
22. Abramovich, C. & Humphries, R. K. Hox regulation of normal and leukemic
hematopoietic stem cells. Current opinion in hematology 12, 210-216 (2005).
23. Magnusson, M., Brun, A. C. M., Lawrence, H. J. & Karlsson, S.
Hoxa9/hoxb3/hoxb4 compound null mice display severe hematopoietic defects.
Experimental Hematology 35, 1421.e1-1421.e9 (2007).
24. Lawrence, H. J. et al. Mice bearing a targeted interruption of the homeobox gene
HOXA9 have defects in myeloid, erythroid, and lymphoid hematopoiesis. Blood
89, 1922-1930 (1997).
25. Levine, J. H. et al. Data-Driven Phenotypic Dissection of AML Reveals
Progenitor-like Cells that Correlate with Prognosis. Cell 162, 184-197 (2015).
CHAPTER SEVEN Conclusion
protein interaction across sequence mutants. This platform may be extended to profile a
diversity of RNA-protein interactions and may form the basis of methods that quantify
trans-factor binding.
Measuring these regulatory processes in vivo provides unique insight into the
characteristics and potential of cellular behavior. Preceding this work, methods for
measuring chromatin structure genome-wide often required tens of millions of cells and
seq for profiling chromatin accessibility within rare cellular populations and/or from
inference of the trans-acting regulatory proteins that define them. In addition, these
Future work
expression of nearby genes would greatly enhance our ability to causally link the
epigenome to gene expression and subsequently disease mutations to phenotypes. Such a
critical for understanding TF binding landscapes and gene expression in vivo. Further
unique opportunity to develop regulatory models, wherein natural variation within single
cells can be used to infer causal changes of expression at nearby genes. Integrating
throughput may serve to quantify trans-acting regulators, their binding to cis regulatory
patterns; however, only a combined or systems approach to these data would yield a
computational models that integrate these data sets and infer causality, for example the
cellular regulation and form the basis of our understanding of human disease.