Buenrostro Thesis Augmented

METHODS FOR QUANTITATIVE DISSECTION
OF GENE REGULATION
A DISSERTATION
SUBMITTED TO THE GENETICS DEPARTMENT
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Jason Buenrostro
December 2015
© 2016 by Jason Daniel Buenrostro. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States
License.
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/mn616fx6627
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
William Greenleaf, Primary Adviser
Howard Chang, Co-Adviser
Gerald Crabtree
Jin Li
Michael Snyder, PhD
Approved for the Stanford University Committee on Graduate Studies.

Patricia J. Gumport, Vice Provost for Graduate Education
This signature page was generated electronically upon submission of this dissertation in
electronic format. An original signed hard copy of the signature page is on file in
University Archives.
Acknowledgements
I have had an incredible experience at Stanford University. Throughout my time
here I have had the immense privilege of working with many talented and amazing
people. That list begins with Will Greenleaf, my primary mentor and friend for the last 5
years in graduate school. Will has consistently provided thoughtful advice in science and
in life, and an incredible amount of patience throughout my graduate experience.
Throughout this time I have also had the privilege of mentorship from Howard Chang. I
thank Howard for his creative and thoughtful ideas but, more importantly, for his unwavering
support of all my scientific endeavors.
I thank Stanford and the Stanford Genetics department for the opportunity to do
my research here and for providing such a wonderful environment in which to start my
academic career. I also thank my faculty committee, Mike Snyder, Lars Steinmetz, Jerry
Crabtree, Billy Li and Ravi Majeti, for their service, insights and support. In addition, I
thank Mike Snyder for the opportunity to collaborate and to work with Beijing
Wu, experiences that have significantly enriched my graduate experience.
I thank past mentors Hanlee Ji and Georges Natsoulis for their mentorship and an amazing
start to my scientific life. Specifically, I thank Hanlee Ji for my start in science and for
providing me an amazing opportunity to join the Stanford community as a research assistant in
2009. I also thank Samuel Myllykangas for his mentorship and friendship throughout
this time.
In addition to the long list of mentors, I thank many close collaborators for
making this work possible. I thank Lauren Chircus and Carlos Araya for their
determination, hard work and creativity, which were essential to the development of
the RNA array. I thank Paul Giresi for teaching me everything I know about chromatin.
I thank Ulli Litzenberger, Dave Ruff and the Fluidigm team for their wonderful insight and
dedication to the development of single-cell ATAC-seq. I also thank new friends, Ryan
Corces and Ansu Satpathy, who have brought new and fresh perspectives to my scientific
thinking. I also deeply thank Beijing Wu for her unwavering dedication to
our work, personal support, and general thoughtfulness; without her, much of the work
presented here would not have been possible.
Importantly, I'd like to thank my family, who have provided life-long support.
Specifically, I thank my parents, Miguel and Martha Buenrostro, who have made tremendous
sacrifices throughout our lives to provide me with the opportunity and preparation to
pursue my dreams. I also thank my brother, sisters, and niece, Michael, Michelle, Erika
and Sam, for their love, support and patience. I also thank my roommate and partner Sara
Prescott, who has been there for me at my best and my worst, and continues to be my
closest ally in science and in life. Lastly, I thank all of my friends, family, past mentors
and other collaborators whom I regret not having enough space to mention here.

TABLE OF CONTENTS
CHAPTER ONE - Introduction
Cellular regulation in cis and trans
Genome-wide methods
A quantitative and high-throughput approach to binding
Measuring chromatin accessibility in rare cells
CHAPTER TWO - Quantitative dissection of millions of sequence variants
Introduction
A high-throughput RNA array platform for quantitative binding measurements
The RNA-array enables quantitative measurement of both binding and dissociation
Binding affinity can be partitioned between primary and secondary structure
Changes in association rate substantially contribute to changes in binding energies
Discussion
Chapter 2 - Figures and Figure Legends
References
CHAPTER THREE - Measuring accessibility in rare cellular populations
Introduction
ATAC-seq measures chromatin accessibility using Tn5 transposase
Insert size yields information regarding nucleosome packing and positioning
ATAC-seq reveals distinct classes of factor-nucleosome spacing
Footprints can be used to infer factor occupancy genome-wide
Discussion
References
CHAPTER FOUR - Single-cell accessibility reveals principles of regulatory variation
Introduction
Single-cell ATAC-seq a measure of chromatin accessibility genome-wide
Cell-cell variability in trans
Trans-factors synergize to induce cell-cell variability
Cell-state and chemical perturbation effects on cell-cell variability
Single-cells vary in cis
Discussion
References
CHAPTER FIVE - The epigenomic determinants of human hematopoiesis
Introduction
Identification of chromatin accessibility landscape in primary blood cells
Chromatin accessibility at distal elements delineates the hematopoietic hierarchy
Enhancer cytometry prospectively deconvolves complex cell populations
Regulatory networks of normal hematopoiesis
Accessibility profiles of purified cell populations identify the ontogeny of human diseases
Discussion
References
CHAPTER SIX - The regulatory landscape of acute myeloid leukemia
Introduction
Leukemogenesis and cancer evolution in AML
AML represents a cooption of normal myelopoiesis
AML cell types exhibit lineage infidelity with regulatory contributions from multiple normal blood cell types
Generation of synthetic normal analogs for assessment of AML-specific biology
Mechanism and clinical consequences of pre-leukemic HSC clonal advantage
Discussion
Chapter 6 - Figures and Figure Legends
References
CHAPTER SEVEN - Conclusion
Methods for gene regulation
Future work

CHAPTER ONE - Introduction
Cellular regulation in cis and trans
The human body comprises a large collection of highly diverse cell types,
each providing a specialized and context-specific function. The establishment and
maintenance of a cell's identity are largely determined by defined regulatory programs
affecting diverse cellular processes such as chromatin accessibility, RNA localization and
degradation, or protein modifications. The expression of transcription factors (TFs) and
chromatin remodelers drives chromatin accessibility, which spans a continuum from
nucleosome-free to nucleosome-associated to higher-order compacted chromatin.
Highly compacted chromatin is sequestered from regulatory machinery, whereas
nucleosome-free chromatin demarcates regions of active regulation in cells. Distal
nucleosome-free regulatory elements can have highly divergent interactions with gene
promoters in cis, acting as i) activators, or enhancers, ii) repressors or iii) insulators,
which together determine the expression of nearby genes.
Analogous principles hold true for RNA regulation, wherein RNA structure
defines the binding landscape of microRNAs (miRNAs) and RNA-binding proteins
(RBPs), which have diverse effects on post-transcriptional processes. Here, eukaryotic
RNAs can fold into simple 2D or complex 3D structures, which define permissive
or occluded binding substrates for trans-acting regulators. A quantitative and genome-wide
understanding of these dynamic cellular structures would provide unique insight
into the binding determinants of trans-acting regulators, drivers of cellular function and
cellular potential.
Genome-wide methods
The advent of high-throughput sequencing [1] methodologies has enabled unbiased
and genome-wide characterization of these diverse cellular processes. For example,
high-throughput assays measuring chromatin-bound proteins (ChIP-seq) [2] or RNA-bound
proteins (RIP-seq and CLIP-seq) [3] have been shown to be sensitive methods for
identifying the binding locations of trans-acting proteins. In addition, assays for
measuring chromatin accessibility (DNase-seq) [4,5] or RNA structure (PARS) [6] enable a
genome-wide analysis of the structural determinants of this binding landscape. However,
as described in the following sections, these methods are limited in several ways. In
this thesis I will discuss the development of new methods that focus on the
structural determinants of trans-factor binding to chromatin or RNA.
A quantitative and high-throughput approach to binding
Carefully controlled in vitro assays can be used to determine the biophysical
parameters defining a binding interaction. However, current methods are either low
throughput or not quantitative. In this thesis, I will describe the development of a
high-throughput and generalizable platform for performing biochemical assays of RNA,
called RNA-MaP [7]. Here, we repurpose a high-throughput sequencing instrument to serve
as a massively parallel biochemistry platform. We use this platform to describe the
kinetic parameters of an RNA-binding protein binding to >10^7 mutants of an RNA stem loop.
Measuring chromatin accessibility in rare cells
Current methods to profile chromatin accessibility require millions of cells,
limiting their application to either cell lines or whole tissues. Applying these methods to
complex cellular populations derived from tissues averages over the rich diversity of
chromatin structures within different cell states, leading to an incorrect understanding of
regulatory processes within these tissues. In response, we have developed genome-wide
methods for measuring chromatin accessibility (ATAC-seq) (Fig. 1a) within rare cellular
populations [8] or within single cells [9]. With such methods, defined cellular populations
within complex tissues can be isolated using flow cytometry and profiled using ATAC-seq.
However, this approach is also limited in that it requires established protocols for
cell-type isolation; in contrast, single-cell ATAC-seq (scATAC-seq) may be used to
partition cells into relevant subtypes de novo. Together, these assays offer an
unprecedented view of chromatin structure in vivo.
Of particular importance to human health and disease, and an excellent model for
understanding dynamic cellular behavior, is the hematopoietic hierarchy (Fig. 1b),
wherein a single hematopoietic stem cell (HSC) can give rise to a multitude of distinct
cellular populations ranging from enucleated red blood cells (RBCs) to specialized
immune cells (CD4 and CD8 T cells, B cells and more). Importantly, dysregulation of
these intricate regulatory networks leads to a multitude of hematologic malignancies [11]. In
this work we also apply ATAC-seq and scATAC-seq to normal human hematopoiesis
and acute myeloid leukemia (AML) in an effort to elucidate the governing biochemical
principles defining normal human development and disease.
References
1. Bentley, D. R. et al. Accurate whole human genome sequencing using reversible
terminator chemistry. Nature 456, 53-59 (2008).
2. Johnson, D. S., Mortazavi, A., Myers, R. M. & Wold, B. Genome-wide mapping of
in vivo protein-DNA interactions. Science 316, 1497-1502 (2007).
3. Zhao, J. et al. Genome-wide Identification of Polycomb-Associated RNAs by
RIP-seq. Molecular Cell 40, 939-953 (2010).
4. Thurman, R. E. et al. The accessible chromatin landscape of the human genome.
Nature 489, 75-82 (2012).
5. Boyle, A. P. et al. High-Resolution Mapping and Characterization of Open
Chromatin across the Genome. Cell 132, 311-322 (2008).
6. Wan, Y., Kertesz, M., Spitale, R. C., Segal, E. & Chang, H. Y. Understanding the
transcriptome through RNA structure. Nature Reviews Genetics 12, 641-655 (2011).
7. Buenrostro, J. D. et al. Quantitative analysis of RNA-protein interactions on a
massively parallel array reveals biophysical and evolutionary landscapes. Nat.
Biotechnol. 32, 562-568 (2014).
8. Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J.
Transposition of native chromatin for fast and sensitive epigenomic profiling of
open chromatin, DNA-binding proteins and nucleosome position. Nat. Methods 10,
1213-1218 (2013).
9. Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of
regulatory variation. Nature 523, 486-490 (2015).

CHAPTER TWO - Quantitative dissection of millions of sequence variants¹
Introduction
RNA-protein interactions drive a wide variety of critical biological processes,
from gene expression [1] to viral assembly [2]. Up to 10% of the eukaryotic proteome is
estimated to bind RNA [3], and recent work has begun to uncover a web of RNA-protein
interactions [4-6] that can control gene expression through splicing, RNA localization, and
other post-transcriptional processes. Protein interactions with long noncoding RNAs also
play a role in epigenetic state changes during differentiation [7], perhaps through
scaffolding chromatin remodelers [8,9]. Furthermore, RNA-protein interactions have
proven to be powerful tools in synthetic biology, allowing gene expression control through
post-transcriptional regulation [10,11].
A biophysical understanding of the nucleic-acid sequence determinants of RNA-protein
interactions lags behind our growing realization of their biological importance.
Unlike double-stranded DNA, RNA substrates demonstrate diverse intramolecular
interactions (including mismatched base bulges, stem loops, pseudoknots, G-quartets,
divalent cation interactions, and non-canonical base pairs) that determine three-dimensional
RNA structure [12-14] and set the landscape for interactions with RNA-binding
proteins (RBPs) [15]. The combinatorial nature of RNA sequence and intramolecular
interactions, coupled with the relative paucity of data produced by current biophysical
methods, has precluded a high-resolution, predictive understanding of both the sequence
dependence of affinity and the resulting evolutionary constraints imposed by these
requirements. Because the relationship between sequence and binding is often opaque,
little is understood regarding the evolutionary constraints on these RNA structures,
making bioinformatic identification of functional RNAs difficult [16].

¹ Portions of this chapter were taken from Buenrostro et al. Quantitative analysis of RNA-protein
interactions on a massively parallel array reveals biophysical and evolutionary landscapes. Nature
Biotechnology. 2014. doi:10.1038/nbt.2880.

Current methods for investigating the sequence dependence of RNA-protein
interactions include medium-throughput microfluidic methods [17] and high-throughput
methods coupling affinity-based selection with high-throughput DNA sequencing or
array hybridization [18]; the latter have recently been used to generate a catalogue of
RNA-binding motifs [19]. While powerful, selection and sequencing methods bias results
towards high-activity variants and do not directly and quantitatively measure the biophysical
parameters that underlie biological function [20]. Recently, methods have been developed to
quantitatively measure catalysis [21,22]; however, no such high-throughput method exists for
determining the binding parameters k_on, k_off, and K_d for RNA-protein interactions.
The technological innovations that have propelled the high-throughput sequencing
revolution provide the foundations for massively parallel, fluorescence-based
observations over a large variety of nucleic acid structures immobilized on a surface [23-26].
Recent work characterizing DNA-protein interactions [26] has demonstrated the utility of
these instruments for high-throughput binding affinity assays across large DNA sequence
space. In this work, we have leveraged the Illumina DNA sequencing platform, an
instrument that integrates solid-phase molecular biology, fluidics, and high-throughput
TIRF imaging for massively parallel DNA sequencing [27], to create a platform for direct,
ultra-high-throughput measurement of RNA-protein interactions. In addition, we have
developed quantitative image analysis tools for large-scale analysis of these data, and
demonstrate measurement of both equilibrium binding constants and dissociation
kinetics. We apply these methods to the MS2 coat protein [2,28-31], a system with widespread
applications in affinity purification [32], RNA imaging [33] and synthetic biology [10,11]. This
approach enables quantitative measurement of binding and dissociation of a protein to
>10^7 RNA targets generated directly on the flow cell surface, providing massive
biophysical datasets that enable predictive models for affinity tuning, decomposition of
binding energies between primary and secondary structures, and quantitative analysis of
evolutionary trajectories across sequence space.
A high-throughput RNA array platform for quantitative binding measurements
To generate a library of RNA targets, we first generated an Illumina sequencing
library containing an E. coli RNA polymerase (RNAP) initiation and stall sequence, and
a region coding for diverse sequence variants of the MS2 RNA hairpin synthesized using
doped oligonucleotides (Fig. 1a,b). To ensure multiple measurements of each RNA
variant and reduce sequencing error [34], we introduced single-molecule barcodes 5' of the
RNAP initiation sequence. The barcoding strategy serves to identify individual molecules
within a population by uniquely tagging each molecule using a barcode library. We then
bottlenecked the DNA variant population by diluting ~8 x 10^5 molecules into a
subsequent PCR amplification reaction. Bottlenecking allowed each barcoded
molecular species to be sequenced a median of 15 times per sequencing lane, providing
multiple redundant measurements across the flow cell. The sequencing process
converted individual molecules within the library to ~1 µm diameter clusters of ~1,000
clonal DNA molecules on the flow cell surface [27], and provided the sequence and position
of the DNA templates across the 2D array.
Following sequencing, we removed the sequenced DNA strand and regenerated
double-stranded DNA (dsDNA) using DNA polymerase to extend a biotinylated primer.
We then saturated the flow cell with streptavidin to create a terminal biotin-streptavidin
roadblock on these dsDNA fragments. To synthesize RNA, we adapted methods from
single-molecule investigations [35] designed to generate a single RNA per DNA template.
First, we initiated E. coli RNA polymerase holoenzyme (RNAP) in CTP-starved
conditions, which allows RNAP to generate 26 bases of RNA (the footprint of RNAP)
before stalling at the first guanine on the DNA template strand. Second, we washed
excess RNA polymerase from solution and introduced all 4 nucleotides, allowing RNAP
to transcribe the variable region and stall at the biotin-streptavidin roadblock. This
procedure results in transcribed RNA tethered to its parent DNA via RNA polymerase
(Fig. 1a). The resulting RNA array contained 1.2 x 10^7 distinct RNA features comprising
1.48 x 10^5 unique sequences in a single sequencing lane.
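As a rough consistency check on the library numbers quoted in this section, the bottleneck size and the per-barcode redundancy can be multiplied out. This is only a back-of-the-envelope sketch: it treats the median of 15 clusters per barcode as if it were the mean, and the variable names are illustrative rather than taken from the analysis code.

```python
# Back-of-the-envelope check of the array size (assumption: the median
# redundancy of 15 clusters per barcode is used as a stand-in for the mean).
bottleneck_molecules = 8e5      # ~8 x 10^5 barcoded molecules after dilution
clusters_per_barcode = 15       # median times each barcode was sequenced
unique_sequences = 1.48e5       # unique RNA sequences reported for one lane

expected_features = bottleneck_molecules * clusters_per_barcode
features_per_sequence = expected_features / unique_sequences

print(f"expected features: {expected_features:.2e}")
print(f"average features per unique sequence: {features_per_sequence:.0f}")
```

Under these assumptions the product is 1.2 x 10^7, in line with the reported number of RNA features, and each unique sequence is represented by roughly 80 clusters on average.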
The RNA-array enables quantitative measurement of both binding and dissociation
To measure binding energies, we flowed fluorescent SNAP-Surface 549-MS2
over the RNA array, and imaged bound MS2 at equilibrium using total internal reflection
fluorescence (TIRF) at 10 increasing concentrations. After the final measurement, we
perfused 1.8 µM unlabeled MS2 and recorded the fluorescence decay caused by
dissociation (Fig. 1c). The high concentration of unlabeled MS2 protein blocks other
binding sites on the array, preventing re-binding of fluorescently labeled MS2. To
quantify bound MS2, we developed image analysis tools that cross-correlate cluster
centers from sequencing data to acquired images and fit the observed binding in each
cluster to a 2D Gaussian (Fig. S1 and S2). Using this approach, we quantified the
fluorescence signal for each cluster in 6,240 images, representing 120 tiles imaged in two
fluorescence color channels across 11 equilibrium MS2 concentrations and 15
dissociation time points (120 x 2 x (11 + 15) = 6,240). Fluorescence signals from single
clusters fit canonical dissociation (Fig. 1d) and binding curves (Fig. 1e, f), yielding
binding energy estimates in excellent agreement with published measurements
(R = 0.94, slope = 1.08, Fig. 1g) and in vitro binding assays (R = 0.92, slope = 0.76).
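To make the two curve types concrete, the sketch below fits a single-exponential dissociation curve and a single-site binding isotherm to synthetic data. The text does not specify the fitting procedure, so the functional forms (F(t) = F0 exp(-k_off t) and F(C) = Fmax C / (C + K_d)), the linearized least-squares approach, and all numerical values are assumptions chosen for illustration.

```python
import math

def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_koff(times, signal):
    """Fit F(t) = F0 * exp(-k_off * t) via a line through log(F) vs t."""
    slope, _ = linear_fit(times, [math.log(f) for f in signal])
    return -slope

def fit_kd(concs, signal):
    """Fit F(C) = Fmax * C / (C + Kd) via the double-reciprocal line
    1/F = 1/Fmax + (Kd/Fmax) * (1/C)."""
    slope, intercept = linear_fit([1 / c for c in concs], [1 / f for f in signal])
    return slope / intercept

# Synthetic, noise-free example (all values are made up for illustration).
true_koff, true_kd = 0.002, 50.0          # 1/s and nM
times = [0, 60, 120, 300, 600, 1200]      # s
decay = [1000 * math.exp(-true_koff * t) for t in times]
concs = [1, 3, 10, 30, 100, 300, 1000]    # nM
bound = [1000 * c / (c + true_kd) for c in concs]

koff_hat = fit_koff(times, decay)
kd_hat = fit_kd(concs, bound)
print(koff_hat, kd_hat)
```

Linearization keeps the example dependency-free, but it weights noisy low-signal points heavily; with real cluster intensities a nonlinear least-squares fit of the same two models would likely be the more robust choice.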
We calculated off rates (k_off) for 3,029 sequences and dissociation constants (K_d)
for 129,248 sequences, encompassing 57 single (100%), 1,539 double (100%), and
24,181 triple (92.4%) mutants (Fig. 2a, b). To investigate how sequence variation in the
RNA hairpin impacts MS2 binding, we examined differential binding energies for all
single mutants compared to the consensus sequence (ΔΔG_consensus = 0 k_BT). The average
binding energy change from all possible single-base changes at each position reveals a
sensitivity to mutation throughout the hairpin that complements the effects of mutating
individual residues on the binding surface of MS2 to alanine [36] (Fig. 2c). Specifically, we
observe high mutation sensitivity at base-paired positions near the loop and at specific
single-stranded positions, suggesting significant primary sequence and secondary
structure requirements for RNA recognition.
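The coverage percentages quoted above can be reproduced by counting how many substitution variants exist at each mutational order. The sketch assumes a 19-nucleotide variable region; that length is not stated in this section and is inferred here only because it yields exactly 57 single and 1,539 double mutants.

```python
from math import comb

HAIRPIN_LENGTH = 19   # assumption: inferred from 57 = 19 x 3 single mutants
ALTERNATIVES = 3      # each position can change to 3 other bases

def possible_mutants(order, length=HAIRPIN_LENGTH):
    """Number of distinct variants carrying exactly `order` substitutions."""
    return comb(length, order) * ALTERNATIVES ** order

observed = {1: 57, 2: 1539, 3: 24181}   # counts reported in the text
for order, count in observed.items():
    total = possible_mutants(order)
    print(order, total, f"{100 * count / total:.1f}%")
```

With this assumed length there are 26,163 possible triple mutants, so the 24,181 measured correspond to the 92.4% coverage reported.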
Binding affinity can be partitioned between primary and secondary structure
To comprehensively examine these primary and secondary structure effects on
binding, we calculated the ΔΔG of all double mutants (Fig. 2d). We observed high
positive epistasis in a population of compensating mutants, suggesting that these pairs
of mutations preserve hairpin structure and maintain high binding affinities (Fig. 2e). We
also observed negative epistasis in non-compensating mutants near the base of the stem,
potentially due to cooperative effects on hairpin destabilization in these regions.
Reciprocal mapping of positive epistasis signatures (1 s.d.) allowed de novo
reconstruction of the bound hairpin structure, identifying base-paired, loop, and bulge
positions and demonstrating the feasibility of reconstructing molecular RNA structures from
large-scale sequence-function data.
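A minimal sketch of how epistasis between two mutations could be scored from measured dissociation constants follows. Two things are assumed rather than taken from the text: that ΔΔG is computed as ln(K_d,variant / K_d,consensus) in units of k_BT, and that epistasis is signed so that a double mutant binding more tightly than the additive expectation scores positive. The K_d values are hypothetical.

```python
import math

def ddg(kd_variant, kd_consensus):
    """Binding energy change relative to consensus, in units of k_B*T."""
    return math.log(kd_variant / kd_consensus)

def epistasis(ddg_a, ddg_b, ddg_ab):
    """Additive expectation minus observed double-mutant ddG.
    Positive values mean the double mutant binds better than expected."""
    return (ddg_a + ddg_b) - ddg_ab

# Hypothetical Kd values (nM), chosen only to illustrate a compensating pair.
kd_consensus = 10.0
kd_a, kd_b, kd_ab = 400.0, 250.0, 20.0

eps = epistasis(
    ddg(kd_a, kd_consensus),
    ddg(kd_b, kd_consensus),
    ddg(kd_ab, kd_consensus),
)
print(f"epistasis = {eps:.2f} kBT")
```

In this toy case each single mutation is strongly disruptive while the pair together nearly restores consensus affinity, which is the pattern expected when the two positions form a base pair.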
We modeled the contributions of base-specificity (primary structure) and
base-pairing (secondary structure) to binding energy at each position in the hairpin with a
linear regression model from a set of 121 training sequences. This model provides two
free parameters for each unpaired base, accounting for primary sequence changes in the
form of transitions or transversions. For each pair of interacting bases, the model
provides a total of 6 free parameters: one each for transition and transversion of each base in
the pair (4 parameters), one parameter to account for disruption due to the loss
of base-pairing, and one parameter representing possible non-canonical base-pairing
interactions. These parameters were optimized jointly, in order to identify (via
regression) the energetic contributions of primary sequence changes (i.e. transitions or
transversions that occur while holding secondary structure constant) and secondary
structure changes (i.e. inferred energetic consequences of secondary structure disruptions
or formation of non-canonical base pairs in isolation from primary sequence perturbations).
To quantify the sensitivity for non-canonical base-pairing at positions in the hairpin stem,
we trained the model 8 separate times (once for each possible non-canonical pairing) with
one free parameter representing the energetic cost of the respective non-canonical
pairing. This re-fitting analysis allowed the model to incorporate a different energetic
penalty for having non-canonical base pairs at a specific position instead of the energetic
penalty for a full loss of base-pairing. In this analysis, G:U base pairs caused substantially
less disruption to the binding energy than other non-canonical base pairs (Fig. 3a),
consistent with the formation of a wobble base pair at G:U positions that allows partial
rescue of the secondary structure [12,37]. Our final model, which incorporated a free
parameter for G:U non-canonical base pairs, captured 92% of the variance in binding
energy of the training set and predicted the binding energy of second and third mutations
for variants with mutations in both paired and unpaired positions with correlation
coefficients R=0.94 and R=0.83, respectively (Fig. 3b).
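The parameterization described above can be illustrated by writing out the indicator features for a single base pair. This is a sketch of one plausible encoding, not the model code used in this work: the feature order, the function names, and the example weights are all hypothetical, and the non-canonical term is restricted to G:U wobble pairs as in the final model.

```python
PURINES = {"A", "G"}
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}

def substitution_features(ref, alt):
    """Return (transition, transversion) indicators for one base change."""
    if ref == alt:
        return (0, 0)
    same_class = (ref in PURINES) == (alt in PURINES)
    return (1, 0) if same_class else (0, 1)

def pair_features(ref_pair, alt_pair):
    """Six indicators for a base pair: transition/transversion on each
    side, loss of pairing, and a G:U wobble (non-canonical) pair."""
    ts5, tv5 = substitution_features(ref_pair[0], alt_pair[0])
    ts3, tv3 = substitution_features(ref_pair[1], alt_pair[1])
    pair = tuple(alt_pair)
    wobble = int(pair in WOBBLE)
    unpaired = int(pair not in WATSON_CRICK and not wobble)
    return [ts5, tv5, ts3, tv3, unpaired, wobble]

def predict_ddg(features, weights):
    """Linear model: ddG is the dot product of indicators and fitted weights."""
    return sum(f * w for f, w in zip(features, weights))

# Hypothetical weights (kBT) for one stem position, for illustration only.
weights = [0.3, 0.6, 0.2, 0.5, 3.0, 0.8]
features = pair_features("GC", "GU")
print(features, predict_ddg(features, weights))
```

Treating the wobble and loss-of-pairing indicators as mutually exclusive mirrors the description that the non-canonical penalty is applied instead of the full unpairing penalty; in a full model these per-position feature vectors would be concatenated into one design matrix and the weights fit jointly by least squares.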
The model fit parameters allowed quantitative decomposition of primary and
secondary determinants of affinity across the RNA structure (Fig. 3c, d). Energetic
penalties for disrupting base-pairing increase with proximity to the loop, while
non-canonical G:U base pairs cause substantially less energetic disruption at the -8:-3 and
-11:-1 positions. Altering the primary sequence at -10A (bulge) and -4A (loop), residues
that interact with the Lys61 binding pocket on alternate halves of the dimer [29], confers
energetic costs that exceed those of disrupting the hairpin structure at any single base pair. We
also observed important roles for the -7A and -5C residues, consistent with stacking
interactions at these positions [38]. Altering the primary sequence on the 5' side of the
hairpin confers a greater energetic penalty than altering the 3' side, which we
speculate results from direct interactions with MS2 on the 5' side [36].
Changes in association rate substantially contribute to changes in binding energies
We sought to quantify how changes in association and dissociation rates
contribute to measured ΔΔG values for all mutants with measurable kinetic data. We
calculated the energetic contributions to ΔΔG from changes in dissociation rates
($\log(k_{\mathrm{off}}^{\mathrm{mutant}} / k_{\mathrm{off}}^{\mathrm{consensus}}) \equiv \Delta\log(k_{\mathrm{off}}^{\mathrm{mutant}})$),
and inferred the contribution from changes in association rates
($-\log(k_{\mathrm{on}}^{\mathrm{mutant}} / k_{\mathrm{on}}^{\mathrm{consensus}}) \equiv -\Delta\log(k_{\mathrm{on}}^{\mathrm{mutant}})$).
Because $\Delta\log(k_{\mathrm{off}}) - \Delta\log(k_{\mathrm{on}}) = \Delta\Delta G$, we treated
these parameters as pseudo-energies. Using this
decomposition, we examined the fractional contribution of change in dissociation rates to
ΔΔG across single and double mutants (Fig. 4a). At the base of the hairpin, only a small
fraction of the measured ΔΔG is explained by dissociation rate changes. This small
effect suggests that mutations at these positions modulate association rates, possibly by
causing fraying of the hairpin and/or allowing competition with alternate RNA structures,
thereby reducing the per-collision probability of productive binding. This interpretation
is reinforced by examining Δlog(k_off) and Δlog(k_on) in this region (Fig. 4b, c).
Dissociation rates change little, while inferred association rates remain similar to that of
the consensus sequence only for structures that maintain base-pairing through
compensating mutations. Across all measured variants, we observe a significant
population of structures with ΔΔG driven by association rates (Fig. 4d; P < 2.2 x 10^-16,
Wilcoxon signed rank test, μ = 0.5). These results suggest the kinetic drivers of observed
affinity changes are position-specific and often operate through modulating association
rates, likely by changing hairpin stability.
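The decomposition can be sketched numerically as follows. The sketch assumes K_d = k_off / k_on, so that the association term is inferred as the part of ΔΔG not accounted for by the change in k_off, and it assumes the fractional contribution is defined as Δlog(k_off) / ΔΔG. The function name and the example rate constants are hypothetical.

```python
import math

def kinetic_decomposition(koff_mut, koff_cons, kd_mut, kd_cons):
    """Split ddG (in k_B*T) into dissociation and association pseudo-energies.

    Uses Kd = k_off / k_on, so ddG = dlog(k_off) - dlog(k_on); the
    association term is inferred rather than measured directly.
    """
    ddg = math.log(kd_mut / kd_cons)
    dlog_koff = math.log(koff_mut / koff_cons)
    minus_dlog_kon = ddg - dlog_koff        # inferred -dlog(k_on)
    frac_koff = dlog_koff / ddg if ddg != 0 else float("nan")
    return ddg, dlog_koff, minus_dlog_kon, frac_koff

# Hypothetical mutant: Kd rises 20-fold while k_off only doubles.
ddg, d_off, d_on, frac = kinetic_decomposition(
    koff_mut=2e-3, koff_cons=1e-3, kd_mut=200.0, kd_cons=10.0
)
print(f"ddG={ddg:.2f}, off={d_off:.2f}, on={d_on:.2f}, frac_off={frac:.2f}")
```

In this example less than half of the affinity change comes from faster dissociation, so the variant would fall in the association-driven population described above.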
Discussion
Using in situ transcription and inter-molecular tethering of RNA to DNA, we
have converted a high-throughput DNA sequencing flow cell into an RNA array for
quantitatively measuring both binding kinetics and thermodynamics at an unprecedented
scale. Using this quantitative deep mutational profiling approach we report, to our
knowledge, the largest collection of binding affinities and kinetic constants for an
intermolecular interaction. Using this dataset, we addressed long-standing biophysical
questions, including i) the relative contributions of primary and secondary structure
elements to binding energy, ii) the sequence-dependent kinetic contributions to observed
affinities, and iii) the context-dependence of preference for G:U intermediates in secondary
structure.
Our predictive model for RNA-protein affinity across thousands of point
mutations provides a map for quantitative tuning of both the association rate and the
equilibrium constants of this RNA-protein interaction. We anticipate this resource of
sequence variants will allow affinity tuning of MS2-based RNA sensors, enabling new
applications in synthetic biology. Additionally, these data quantify the effects of
primary sequence, secondary structure and non-canonical base-pairing on binding,
creating a valuable framework for understanding the design and evolution
of new RNA aptamers.
We hypothesize that inferred changes in on-rates are due to destabilization of the
RNA hairpin or competition with alternate secondary structures, reducing the
number of productive binding collisions39. These observations suggest the data provided
here may also provide a rich resource for modeling the RNA hairpin stability and
alternate structure formation. While this is an area of inquiry beyond the focus of this
work, the potential for formation of alternate structures and the effects of local sequence
on native folding of RNA are well suited for study using this platform, as the RNA
transcripts are synthesized by E. coli RNAP and folded co-transcriptionally, closely
approximating synthesis conditions in vivo.
We anticipate this RNA-MaP methodology will be a powerful addition to select
and sequence methods. In addition, the technique might provide quantitative information
on RNA libraries generated by systematic evolution of ligands by exponential
enrichment (SELEX), allowing affinity tuning for the design of biological parts. While
SELEX methods often begin with large libraries (~10^14) and produce a small number of
selected molecules, this RNA array methodology allows characterization of a much larger
library subset (~10^5), opening the door to a detailed understanding of the
sequence-specific rules driving acquisition of affinity in the selection process. Alternatively, this
platform might be coupled to sequenced in vivo RNA immunoprecipitation libraries40,41
and used to directly quantify molecular affinities on in vitro generated RNA, providing
measurements of interactions in well-defined conditions. The multicolor imaging
capabilities of the sequencer enable measurement of more complex biological
interactions such as cooperativity between differentially labeled binding partners or RNA
structure inference via fluorescence resonance energy transfer (FRET). In addition, the
sequencing platform is capable of generating DNA clusters >1kb42, enabling transcription
of long RNAs and allowing investigations of long non-coding RNAs and catalytic
ribozymes. In short, we believe future application of RNA-MaP to diverse RNA-protein
and RNA-RNA interactions promises to enable quantitative prediction and engineering of
binding affinities and functional RNA molecules, as well as the identification and
understanding of evolutionary sequence constraints based on underlying biophysical
parameters.
Chapter 2 - Figures and Figure Legends
Figure 1. A massively parallel RNA array for quantitative, high-throughput
biochemistry. (a) Steps for generating RNA tethered to DNA clusters on a high-
throughput DNA sequencing flow cell. (b) Structure of the MS2 coat protein homodimer
bound to the 19 nt hairpin RNA (PDB ID: 2BU1)31. (c) Images of fluorescently labeled MS2 bound to
RNA clusters at increasing concentrations of protein and at time points following
perfusion of unlabeled MS2 competitor. Below, fitted sum of Gaussians used to assign
fluorescence to clusters. (d) Fluorescence decay of MS2 dissociating from clusters
containing the consensus sequence (-5C) (t1/2=8.39 minutes). (e) Fit binding curves to
clusters labeled in panel (c). (f) The probability distribution of binding energies from all
clusters with labeled variants; mean Kd = 2.57 nM, 36.8 nM, and 415 nM for the -5C,
-5U, and -5A variants, respectively. (g) Correlation between binding energies reported in
the literature and measured on the RNA array (squares, Carey et al.28, circles, Romaniuk
et al.30). (Grey bar indicates our affinity measurement cutoff.)
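The values quoted in this legend can be related to rate constants and relative binding energies with two standard conversions. The sketch below is illustrative only: it assumes a single-exponential dissociation (koff = ln 2 / t1/2) and expresses energy differences in units of kBT as ln(Kd,variant / Kd,consensus). The function names are hypothetical.

```python
import math

T_HALF_MIN = 8.39                                   # consensus (-5C) half-life, panel (d)
KD_NM = {"-5C": 2.57, "-5U": 36.8, "-5A": 415.0}    # mean Kd values, panel (f)

def koff_from_half_life(t_half):
    """Off-rate for a single-exponential decay: koff = ln(2) / t_half."""
    return math.log(2) / t_half

def ddg_kbt(kd_variant, kd_reference):
    """Binding energy change relative to a reference, in units of kBT."""
    return math.log(kd_variant / kd_reference)

koff = koff_from_half_life(T_HALF_MIN)
print(f"koff ~ {koff:.4f} per minute")
for name, kd in KD_NM.items():
    print(f"{name}: ddG = {ddg_kbt(kd, KD_NM['-5C']):.2f} kBT")
```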
Figure 2. A quantitative map of MS2 binding across RNA sequence variants. (a)
Distribution of observed RNA variants by number of mutations. (b) Clusters measured
per molecular variant as a function of mutation number. A median of ~11 clusters are
observed for sequences with 4 mutations. Affinities for the consensus sequence come
from NC=909,385 clusters. (c) Average ΔΔG of point mutations per position. The ΔΔG values
of alanine36 substitutions to the MS2 binding surface are shown in parentheses (kBT).
Solid and dashed lines represent base and phosphate interactions, respectively. (d) Matrix
of ΔΔG for single and double mutants of the consensus sequence. Inset contains the
matrix of ΔΔG for single and double mutants of the +1G variant. All energies are
calculated relative to the consensus (-5C) sequence (arrow, ΔΔG=0), and the number of
quality-filtered double mutants in each matrix is indicated (M2). (e) Epistasis matrix
derived from (d) allows de novo reconstruction of the hairpin structure.
Figure 3. Binding affinity is dependent on primary sequence and secondary RNA
structure. (a) Fit parameters for linear regression model showing position-specific
contributions. Energetic components for all possible base pair combinations are shown
below. (b) Predicted binding energies of variants with second (M2) and third mutations
(M3) in both single- and double-stranded regions. Primary (i.e. mean energetic
contributions of transitions and transversions) (c) and secondary (d) structure
contributions to affinity derived from a, were mapped onto the hairpin (PDB ID:
1ZDH)38.
Figure 4. Sequence-specific contributions of association and dissociation rates to
binding affinity. (a) Fractional contribution of dissociation rates for 31 single and 289
double mutants with measurable affinities and dissociation rates. Positions at the base of
the hairpin are highlighted. (b) log(koff) and (c) log(kon) at the base of the hairpin. M2 =
number of quality-filtered double mutants. (d) Distribution of fractional contributions of
association (blue, μ=0.57) and dissociation (red, μ=0.43) rates to ΔΔG for all measured
mutants (N=3,029).
Supplementary Figure 1. Data Analysis Workflow. (a) Sequencing cluster centers were
derived from the fastq files from the sequencing run. X/Y and tile positions were
extracted from the fastq header lines. Data were cross-correlated with the observed
images to define a global offset. Images were then cleaned to mask any saturated pixels.
Images were broken into smaller sub regions (24x24 pixels) and the fluorescence was
fitted to a sum of overlapping 2D Gaussians. This process was repeated for all 120 tiles
of the GAIIx sequencing lane and across the 26 image series (3,120 images). (b) Binding
images were normalized for RNA content using the all RNA image (Alexa647 oligo
hybridized to the stall sequence). Data were aggregated across the image series by cluster
ID, and the fluorescence values for each cluster across concentrations were fit to a binding
curve. The fit binding energies were grouped by hairpin sequence, and median binding
energies for each sequence were reported.
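The last two steps of panel (b), fitting each cluster to a binding curve and reporting a per-sequence median, can be sketched as follows. This is an assumption-laden illustration rather than the fitting code used here: it assumes a single-site isotherm with zero background, signal = fmax * c / (c + Kd), scans Kd on a log-spaced grid, and takes the energy of each cluster as ln(Kd). All function names and example values are hypothetical.

```python
import math
from collections import defaultdict
from statistics import median

def fit_binding_curve(concs, signal):
    """Fit signal = fmax * c / (c + Kd) by scanning Kd on a log-spaced grid.

    For each trial Kd the best fmax has a closed form (linear least squares),
    so only Kd needs to be searched.
    """
    kd_grid = [10 ** (e / 50) for e in range(-100, 201)]   # 0.01 to 10,000
    best = None
    for kd in kd_grid:
        occ = [c / (c + kd) for c in concs]                # fractional occupancy
        fmax = sum(o * s for o, s in zip(occ, signal)) / sum(o * o for o in occ)
        sse = sum((s - fmax * o) ** 2 for o, s in zip(occ, signal))
        if best is None or sse < best[0]:
            best = (sse, kd, fmax)
    return best[1], best[2]

def median_energy_by_sequence(cluster_fits):
    """Group per-cluster Kd fits by sequence; report the median ln(Kd)."""
    groups = defaultdict(list)
    for seq, kd in cluster_fits:
        groups[seq].append(math.log(kd))
    return {seq: median(vals) for seq, vals in groups.items()}

# Noise-free simulated cluster with Kd = 2.57 nM (hypothetical concentrations).
concs = [0.5, 1, 2, 4, 8, 16, 32, 64, 128]
signal = [1000 * c / (c + 2.57) for c in concs]
kd_fit, fmax_fit = fit_binding_curve(concs, signal)
energies = median_energy_by_sequence([("seqA", 2.0), ("seqA", 3.0), ("seqA", 50.0)])
```

Taking the median across clusters, as in the legend, keeps a single poorly fit cluster (the 50 nM outlier in the toy input) from shifting the reported energy.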
Supplementary Figure 2: Correlating sequencing data and fitting 2D Gaussians to
acquired images. We found that a simple cross-correlation was sufficient to map x/y
positions from the sequencing data to both the (a) all RNA image and the (b) MS2
binding images (cluster centers shown in green). Shown are unaligned images and cluster
centers (left), the cross-correlation value (middle), and the resulting mapped cluster
centers (right). The plotted cluster centers were adjusted using the least squares image fit.
Images were fit to 2D Gaussians and generated the following distribution for the relevant
parameters: (c) the fit amplitude and (d) the fit standard deviation from a representative
tile. Integrating these values generated (e) the distribution of the integrated fluorescence.
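The registration step can be sketched as a search for the integer shift that best aligns sequencing-derived cluster centers with an image. The example below is a simplified stand-in, not the pipeline used here: it evaluates the cross-correlation between a point mask of cluster centers and the image by brute force over a small window of shifts. The function name, the window size, and the toy image are hypothetical, and the subsequent least-squares refinement is not shown.

```python
def find_global_offset(image, centers, max_shift=5):
    """Return the (dx, dy) shift maximizing summed intensity at cluster centers."""
    height, width = len(image), len(image[0])
    best_score, best_shift = float("-inf"), (0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            score = 0.0
            for x, y in centers:
                xs, ys = x + dx, y + dy
                if 0 <= xs < width and 0 <= ys < height:
                    score += image[ys][xs]      # image is indexed [row][column]
            if score > best_score:
                best_score, best_shift = score, (dx, dy)
    return best_shift

# Toy 20x20 image whose bright pixels sit at the cluster centers shifted by (+3, -2).
image = [[0] * 20 for _ in range(20)]
centers = [(5, 5), (10, 12), (14, 7)]
for x, y in centers:
    image[y - 2][x + 3] = 1
offset = find_global_offset(image, centers)
print(offset)
```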
References
1. Keene, J. D. RNA regulons: coordination of post-transcriptional events. Nature
Reviews Genetics 8, 533–543 (2007).
2. Carey, J., Cameron, V., De Haseth, P. L. & Uhlenbeck, O. C. Sequence-specific
interaction of R17 coat protein with its ribonucleic acid binding site. Biochemistry
22, 2601–2610 (1983).
3. Tsvetanova, N. G., Klass, D. M., Salzman, J. & Brown, P. O. Proteome-Wide
Search Reveals Unexpected RNA-Binding Proteins in Saccharomyces cerevisiae.
PLoS ONE 5, e12671 (2010).
4. Scherrer, T., Mittal, N., Janga, S. C. & Gerber, A. P. A Screen for RNA-Binding
Proteins in Yeast Indicates Dual Functions for Many Enzymes. PLoS ONE 5,
e15499 (2010).
5. Butter, F., Scheibe, M., Mörl, M. & Mann, M. Unbiased RNA–protein interaction
screen by quantitative proteomics. Proceedings of the National Academy of
Sciences 106, 10626–10631 (2009).
6. Castello, A. et al. Insights into RNA Biology from an Atlas of Mammalian
mRNA-Binding Proteins. Cell 149, 1393–1406 (2012).
7. Wang, K. C. et al. A long noncoding RNA maintains active chromatin to
coordinate homeotic gene expression. Nature 472, 120–124 (2011).
8. Tsai, M. C. et al. Long Noncoding RNA as Modular Scaffold of Histone
Modification Complexes. Science 329, 689–693 (2010).
9. Guttman, M. & Rinn, J. L. Modular regulatory principles of large non-coding
RNAs. Nature 482, 339–346 (2012).
10. Culler, S. J., Hoff, K. G. & Smolke, C. D. Reprogramming Cellular Behavior with
RNA Controllers Responsive to Endogenous Proteins. Science 330, 1251–1255
(2010).
11. Ausländer, S., Ausländer, D., Müller, M., Wieland, M. & Fussenegger, M.
Programmable single-cell mammalian biocomputers. Nature (2012).
doi:10.1038/nature11149
12. SantaLucia, J. & Turner, D. H. Measuring the thermodynamics of RNA secondary
structure formation. Biopolymers 44, 309–319 (1997).
13. Ding, Y. et al. In vivo genome-wide profiling of RNA secondary structure reveals
novel regulatory features. Nature (2013). doi:10.1038/nature12756
14. Rouskin, S., Zubradt, M., Washietl, S., Kellis, M. & Weissman, J. Genome-wide
probing of RNA structure reveals active unfolding of mRNA structures in vivo.
Nature (2013). doi:10.1038/nature12894
15. Ban, N., Nissen, P., Hansen, J., Moore, P. B. & Steitz, T. A. The Complete Atomic
Structure of the Large Ribosomal Subunit at 2.4 Å Resolution. Science 289,
905–920 (2000).
16. Wan, Y., Kertesz, M., Spitale, R. C., Segal, E. & Chang, H. Y. Understanding the
transcriptome through RNA structure. Nature Reviews Genetics 12, 641–655
(2011).
17. Martin, L. et al. Systematic reconstruction of RNA functional motifs with high-
throughput microfluidics. Nat Meth 9, 1192–1194 (2012).
18. Ray, D. et al. Rapid and systematic analysis of the RNA recognition specificities
of RNA-binding proteins. Nat. Biotechnol. 27, 667–670 (2009).
19. Ray, D. et al. A compendium of RNA-binding motifs for decoding gene
regulation. Nature 499, 172–177 (2013).
20. Araya, C. L. et al. A fundamental protein property, thermodynamic stability,
revealed solely from large-scale measurements of protein function. Proceedings of
the National Academy of Sciences 109, 16858–16863 (2012).
21. Pitt, J. N. & Ferre-D'Amare, A. R. Rapid Construction of Empirical RNA Fitness
Landscapes. Science 330, 376–379 (2010).
22. Guenther, U.-P. et al. Hidden specificity in an apparently nonspecific RNA-
binding protein. Nature (2013). doi:10.1038/nature12543
23. Matzas, M. et al. High-fidelity gene synthesis by retrieval of sequence-verified
DNA identified using high-throughput pyrosequencing. Nat. Biotechnol. 28,
1291–1294 (2010).
24. Myllykangas, S., Buenrostro, J. D., Natsoulis, G., Bell, J. M. & Ji, H. P. Efficient
targeted resequencing of human germline and cancer genomes by oligonucleotide-
selective sequencing. Nat. Biotechnol. 29, 1024–1027 (2011).
25. Uemura, S. et al. Real-time tRNA transit on single translating ribosomes at codon
resolution. Nature 464, 1012–1017 (2010).
26. Nutiu, R. et al. Direct measurement of DNA affinity landscapes on a high-
throughput sequencing instrument. Nat. Biotechnol. 29, 659–664 (2011).
27. Bentley, D. R. et al. Accurate whole human genome sequencing using reversible
terminator chemistry. Nature 456, 53–59 (2008).
28. Carey, J., Lowary, P. T. & Uhlenbeck, O. C. Interaction of R17 coat protein with
synthetic variants of its ribonucleic acid binding site. Biochemistry 22, 4723–4730
(1983).
29. Valegård, K., Murray, J. B., Stockley, P. G., Stonehouse, N. J. & Liljas, L. Crystal
structure of an RNA bacteriophage coat protein–operator complex. Nature 371,
623–626 (1994).
30. Romaniuk, P. J., Lowary, P., Wu, H. N., Stormo, G. & Uhlenbeck, O. C. RNA
binding site of R17 coat protein. Biochemistry 26, 1563–1568 (1987).
31. Grahn, E. et al. Structural basis of pyrimidine specificity in the MS2 RNA hairpin-
coat-protein complex. RNA 7, 1616–1627 (2001).
32. Bardwell, V. J. & Wickens, M. Purification of RNA and RNA-protein complexes
by an R17 coat protein affinity method. Nucleic Acids Res. 18, 6587–6594 (1990).
33. Bertrand, E. et al. Localization of ASH1 mRNA particles in living yeast. 2,
437–445 (1998).
34. Kivioja, T. et al. Counting absolute numbers of molecules using unique molecular
identifiers. Nat Meth 9, 72–74 (2011).
35. Greenleaf, W. J., Frieda, K. L., Foster, D. A., Woodside, M. T. & Block, S. M.
Direct observation of hierarchical folding in single riboswitch aptamers. Science
319, 630–633 (2008).
36. Hobson, D. & Uhlenbeck, O. C. Alanine Scanning of MS2 Coat Protein Reveals
Protein–Phosphate Contacts Involved in Thermodynamic Hot Spots. Journal of
Molecular Biology 356, 613–624 (2006).
37. Varani, G. & McClain, W. H. The GU wobble base pair: A fundamental building
block of RNA structure crucial to RNA function in diverse biological systems.
EMBO Reports 1, 18–23 (2000).
38. Valegård, K. et al. The three-dimensional structures of two complexes between
recombinant MS2 capsids and RNA operator fragments reveal sequence-specific
protein-RNA interactions. Journal of Molecular Biology 270, 724–738 (1997).
39. Gell, C. et al. Single-Molecule Fluorescence Resonance Energy Transfer Assays
Reveal Heterogeneous Folding Ensembles in a Simple RNA Stem–Loop. Journal
of Molecular Biology 384, 264–278 (2008).
40. Licatalosi, D. D. et al. HITS-CLIP yields genome-wide insights into brain
alternative RNA processing. Nature 456, 464–469 (2008).
41. Zhao, J. et al. Genome-wide Identification of Polycomb-Associated RNAs by RIP-
seq. Molecular Cell 40, 939–953 (2010).
1213–1218 (2013).
CHAPTER THREE Measuring accessibility in rare cellular populations2
Introduction
Eukaryotic genomes are hierarchically packaged into chromatin1, and the nature
of this packaging plays a central role in gene regulation2,3. Major insights into the
epigenetic information encoded within the nucleoprotein structure of chromatin have
come from high-throughput, genome-wide methods for separately assaying the chromatin
accessibility ("open chromatin")4,5, nucleosome positioning6-8, and transcription factor
occupancy9. While powerful, existing methods require millions of cells as starting
material, complex and time-consuming sample preparations, and cannot simultaneously
probe the interplay of nucleosome positioning, chromatin accessibility, and transcription
factor binding.
These limitations are problematic in three major ways: First, current methods can
average over and drown out heterogeneity in cellular populations. Second, cells must
often be grown ex vivo to obtain sufficient biomaterials, perturbing the in vivo context
and modulating the epigenetic state in unknown ways. Third, input requirements often
prevent application of these assays to well-defined clinical samples, precluding
generation of "personal epigenomes" on diagnostic timescales. Here we report a robust
and sensitive method for epigenomic profiling that can provide a comprehensive portrait
of gene regulatory processes.

2 Portions of this chapter were taken from Buenrostro et al. Transposition of native chromatin for
fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and
nucleosome position. Nature Methods. 2013;10(12):1213–1218. doi:10.1038/nmeth.2688.
ATAC-seq measures chromatin accessibility using Tn5 transposase
Hyperactive Tn5 transposase10,11, loaded in vitro with adapters for high-throughput DNA
sequencing, can simultaneously fragment and tag a genome with sequencing adapters
(previously described as tagmentation11). Because transposons have been shown to
integrate into active regulatory elements in vivo12, we hypothesized that transposition by
purified Tn5, a prokaryotic transposase, on small numbers of unfixed nuclei would
interrogate regions of accessible chromatin. Here we describe the Assay for Transposase
Accessible Chromatin (ATAC-seq). ATAC-seq uses Tn5 transposase to integrate its
adapter payload into regions of accessible chromatin, whereas compaction and
sequestration of chromatin make transposition improbable. Therefore, amplifiable DNA
fragments suitable for high-throughput sequencing are generated at locations of open
chromatin (Fig 1A). The entire assay and library construction can be carried out in a
simple two-step process involving Tn5 insertion and PCR. In contrast, published DNase-
and FAIRE-seq protocols for assaying chromatin accessibility involve complex multi-
step protocols and many loss-prone steps, such as adapter ligation, gel purification and
reversal of crosslinks. Specifically, DNase-seq consists of 44 steps and two overnight
incubations, while published FAIRE-seq protocols require two overnight incubations
carried out over at least 3 days13,14. Furthermore, these protocols require 1-50 million cells
(FAIRE) or 50 million cells (DNase-seq), likely due to their complex workflows13,14 (Fig
1B). In comparison to pre-existing methods, ATAC-seq enables rapid and efficient
library generation while radically reducing sample input requirements. Extensive
analyses show that ATAC-seq provides an accurate and sensitive measure of chromatin
accessibility genome-wide. We carried out ATAC-seq on 50,000 and 500 unfixed nuclei
isolated from the GM12878 lymphoblastoid cell line (ENCODE Tier 1; ref. 15) for comparison and
validation with chromatin accessibility data sets, including DNase-seq13 and FAIRE-
seq16. At a locus previously highlighted by others5 (Fig. 1C), ATAC-seq has a signal-to-noise
ratio similar to DNase-seq, which was generated from approximately 3 to 5 orders of
magnitude more cells13,14. Peak intensities were highly reproducible between technical
replicates (R=0.98), and highly correlated between ATAC-seq and DNase-seq (R=0.79
and R=0.83). Highly sensitive open chromatin detection is maintained even when using
5,000 or 500 human nuclei as starting material, although sensitivity diminishes as the
amount of input material decreases, as can be seen in Fig 1C.
Insert size yields information regarding nucleosome packing and positioning
Unlike pre-existing assays that measure chromatin accessibility, ATAC-seq
paired-end reads produce detailed information about nucleosome packing and
positioning. The insert size distribution of sequenced fragments from human chromatin
has clear periodicity of approximately 200 base pairs, suggesting many fragments are
protected by integer multiples of nucleosomes (Fig 2A). This fragment size distribution
also shows clear periodicity equal to the helical pitch of DNA11. By partitioning insert
size distribution according to functional classes of chromatin as defined by previous
models17, and normalizing to the global insert distribution (Methods) we observe clear
class-specific enrichments across this insert size distribution (Fig. 2B), demonstrating that
these functional states of chromatin have an accessibility fingerprint that can be read
out with ATAC-seq. These differential fragmentation patterns are consistent with the
putative functional state of these classes, as insulator regions are enriched for short
fragments of DNA, while transcription start sites are differentially depleted for mono-, di-
and tri-nucleosome associated fragments. Transcribed and promoter flanking regions are
enriched for longer multi-nucleosomal fragments, suggesting they are more compacted
than other states that require access to DNA by regulatory factors. Finally, repressed
regions are differentially depleted for short fragments, consistent with their expected
compacted state. These data suggest that ATAC-seq reveals differentially compacted
forms of chromatin, which have been long hypothesized to exist in vivo2,18,19.
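The class-specific enrichment described above can be illustrated with a short sketch. The exact normalization is given in the Methods and is not reproduced here; the example below simply assumes that enrichment at each insert size is the ratio of the class-specific fragment-length distribution to the global one. The function names and toy fragment lengths are hypothetical.

```python
from collections import Counter

def size_distribution(fragment_lengths, max_len=1000):
    """Fraction of fragments observed at each insert size up to max_len."""
    counts = Counter(n for n in fragment_lengths if 1 <= n <= max_len)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {n: c / total for n, c in counts.items()}

def class_enrichment(class_lengths, all_lengths, max_len=1000):
    """Ratio of a class-specific insert-size distribution to the global one."""
    global_dist = size_distribution(all_lengths, max_len)
    class_dist = size_distribution(class_lengths, max_len)
    return {n: class_dist.get(n, 0.0) / global_dist[n] for n in global_dist}

# Toy data: a class enriched for short fragments relative to all fragments.
all_lengths = [50] * 4 + [200] * 4
class_lengths = [50] * 3 + [200]
enrichment = class_enrichment(class_lengths, all_lengths)
print(enrichment)
```

Values above 1 mark insert sizes over-represented in the class, as for short fragments in insulator regions; values below 1 mark depletion.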
To explore nucleosome positioning within accessible chromatin in the GM12878
cell line, we partitioned our data into reads generated from putative nucleosome free
regions of DNA, and reads likely derived from nucleosome associated DNA. Using a
simple heuristic that positively weights nucleosome associated fragments and negatively
weights nucleosome free fragments, we calculated a data track used to call nucleosome
positions within regions of accessible chromatin20. An example locus (Fig. 3A) contains a
putative bidirectional promoter with CAGE data showing two transcription start sites
(TSSs) separated by ~700 bp. ATAC-seq in fact reveals two distinct nucleosome free
regions, separated by a single well-positioned mononucleosome (Fig. 3A). Compared to
MNase-seq21, ATAC-seq data is more amenable to detecting nucleosomes within putative
regulatory regions, as the majority of reads are concentrated within accessible regions of
chromatin (Fig. 3B). By averaging signal across all active TSSs, we note nucleosome free
fragments are enriched at a canonical nucleosome free promoter region overlapping the
TSS, while our nucleosome signal is enriched both upstream and downstream of the
active TSS, and displays characteristic phasing of upstream and downstream
nucleosomes6,7 (Fig. 3C). Because ATAC-seq reads are concentrated at regions of open
chromatin, we see strong nucleosome signal at the +1 nucleosome, which decreases at the
+2, +3 and +4 nucleosomes; in contrast, MNase-seq nucleosome signal increases at larger
distances from the TSS, likely due to over-digestion of more accessible nucleosomes.
Additionally, MNase-seq (4 billion reads) assays all nucleosomes, requiring orders of
magnitude more sequencing than ATAC-seq (198 million paired reads) to reach similar
resolution at regulatory nucleosomes (Fig. 3B,C). Using our nucleosome calls, we further
partitioned putative distal regulatory regions and TSSs into regions that were nucleosome
free and regions that were predicted to be nucleosome bound. We note that TSSs were
enriched for nucleosome free regions when compared to distal elements, which tend to
remain nucleosome rich (Fig. 3D). These data suggest ATAC-seq can provide high-
resolution readout of nucleosome associated and nucleosome free regions in regulatory
elements genome wide.
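The weighting heuristic described above can be sketched as a per-base score in which nucleosome-sized fragments add signal and short fragments subtract it. This is an illustration under stated assumptions, not the procedure used to generate the tracks shown: the fragment-length cutoffs, the unit weights, and the function name are hypothetical, and the subsequent calling of nucleosome positions from the track is not shown.

```python
def nucleosome_track(fragments, region_start, region_end,
                     free_max=100, nuc_min=180, nuc_max=247):
    """Per-base score: +1 per covering nucleosome-sized fragment,
    -1 per covering short (putatively nucleosome-free) fragment.

    The length cutoffs are illustrative assumptions.
    """
    track = [0] * (region_end - region_start)
    for start, end in fragments:                # half-open [start, end) coordinates
        length = end - start
        if length < free_max:
            weight = -1
        elif nuc_min <= length <= nuc_max:
            weight = 1
        else:
            continue                            # other sizes ignored in this sketch
        for pos in range(max(start, region_start), min(end, region_end)):
            track[pos - region_start] += weight
    return track

# Two mononucleosome-sized fragments followed by two short fragments.
fragments = [(100, 300), (110, 305), (320, 380), (330, 390)]
track = nucleosome_track(fragments, 0, 400)
print(track[150], track[310], track[350])
```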
ATAC-seq reveals distinct classes of factor-nucleosome spacing
ATAC-seq high-resolution regulatory nucleosome maps can be used to
understand the relationship between nucleosomes and DNA binding factors. Using ChIP-
seq data, we plotted the position of a variety of DNA binding factors with respect to the
dyad of the nearest nucleosome. Unsupervised hierarchical clustering (Fig. 3E)
revealed major classes of binding with respect to the proximal nucleosome, including 1) a
strongly nucleosome avoiding group of factors with binding events stereotyped at ~180
bases from the nearest nucleosome dyad (comprising C-FOS, NFYA and IRF3); 2) a
class of factors that "nestle up" precisely to the expected end of nucleosome DNA
contacts, which notably includes chromatin looping factors CTCF and cohesin complex
subunits RAD21 and SMC3; 3) a large class of primarily transcription factors that have
gradations of nucleosome avoiding or nucleosome-overlapping binding behavior; and 4)
a class whose binding sites tend to overlap nucleosome associated DNA. Interestingly,
this final class includes chromatin remodeling factors such as CHD1 and SIN3A as well
as RNA polymerase II, which appears to be enriched at the nucleosome boundary8. The
interplay between precise nucleosome positioning and locations of DNA binding factors
immediately suggests specific hypotheses for mechanistic studies, a potential advantage
of ATAC-seq.
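The quantity underlying this analysis, the position of each binding event relative to the nearest nucleosome dyad, can be computed as in the sketch below. It is a simplified illustration: it assumes a single chromosome, a non-empty sorted list of dyad coordinates, and point-like binding sites (for example ChIP-seq summits). The function names, bin size, and coordinates are hypothetical, and the hierarchical clustering step is not shown.

```python
from bisect import bisect_left
from collections import Counter

def distance_to_nearest_dyad(site, sorted_dyads):
    """Signed distance (bp) from a binding site to the closest nucleosome dyad."""
    i = bisect_left(sorted_dyads, site)
    candidates = sorted_dyads[max(i - 1, 0):i + 1]   # neighbors on either side
    nearest = min(candidates, key=lambda d: abs(site - d))
    return site - nearest

def dyad_distance_profile(sites, sorted_dyads, bin_size=10):
    """Histogram of absolute site-to-dyad distances for one factor."""
    return Counter(
        abs(distance_to_nearest_dyad(s, sorted_dyads)) // bin_size * bin_size
        for s in sites
    )

dyads = [1000, 1200, 1400]                      # hypothetical dyad calls
profile = dyad_distance_profile([1180, 1020, 1390, 820], dyads)
print(sorted(profile.items()))
```

A factor in the nucleosome-avoiding class would concentrate its counts in the bins near 180 bp, whereas a nucleosome-overlapping factor would concentrate them near zero.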
Footprints can be used to infer factor occupancy genome-wide
ATAC-seq enables accurate inference of DNA binding factor occupancy genome-
wide. We reasoned that DNA sequences directly occupied by DNA-binding proteins are
protected from transposition; the resulting sequence footprint reveals the presence of
the DNA-binding protein at each site, analogous to DNase digestion footprints22. At a
specific CTCF binding site on chromosome 1, we observed a clear footprint (a deep
notch of ATAC-seq signal), similar to footprints seen by DNase-seq23,24, at the precise
location of the CTCF motif that coincides with the summit of the CTCF ChIP-seq signal
in GM12878 cells (Fig 4A). We averaged ATAC-seq signal over all expected locations of
CTCF within the genome and observed a well-stereotyped footprint (Fig. 4B). Similar
results were obtained for a variety of common TFs. We inferred the CTCF binding
probability from motif consensus score, evolutionary conservation, and ATAC-seq
footprinting data to generate a posterior probability of CTCF binding at all loci (Fig.
4C)25. Results using ATAC-seq closely recapitulate ChIP-seq binding data in this cell line
and compare favorably to DNase-based factor occupancy inference, suggesting that
factor occupancy data can be extracted from these ATAC-seq datasets, allowing
reconstruction of regulatory networks.
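The aggregate footprint described above can be sketched as an average of transposase insertion counts at fixed offsets from motif centers. The example is illustrative only: it assumes insertion counts stored per position on a single chromosome, ignores strand and any insertion-offset correction, and summarizes protection with a simple flank-minus-core depth. The function names, window sizes, and toy counts are hypothetical, and the probabilistic binding model is not reproduced.

```python
def aggregate_footprint(insertions, motif_centers, flank=50):
    """Mean insertion count at each offset from -flank to +flank around motif centers."""
    profile = [0.0] * (2 * flank + 1)
    for center in motif_centers:
        for offset in range(-flank, flank + 1):
            profile[offset + flank] += insertions.get(center + offset, 0)
    return [v / len(motif_centers) for v in profile]

def footprint_depth(profile, motif_half_width=10):
    """Mean flanking signal minus mean signal over the motif core.

    Positive values indicate protection from transposition at the motif.
    """
    mid = len(profile) // 2
    core = profile[mid - motif_half_width:mid + motif_half_width + 1]
    sides = profile[:mid - motif_half_width] + profile[mid + motif_half_width + 1:]
    return sum(sides) / len(sides) - sum(core) / len(core)

# Toy data: 5 insertions per base in the flanks, 1 per base over the motif.
insertions = {}
for center in (1000, 2000):
    for off in range(-50, 51):
        insertions[center + off] = 1 if abs(off) <= 10 else 5
profile = aggregate_footprint(insertions, [1000, 2000])
depth = footprint_depth(profile)
print(depth)
```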
Using ATAC-seq footprints we generated the occupancy profiles of 89
transcription factors in proband T-cells, enabling systematic reconstruction of regulatory
networks. With this personalized regulatory map, we compared the genomic distribution
of the same 89 transcription factors between GM12878 and proband CD4+ T-cells.
Transcription factors that exhibit large variation in distribution between T-cells and B-
cells are enriched for T-cell specific factors (Fig. 4D). This analysis shows NFAT is
differentially regulated, while canonical CTCF occupancy is highly correlated between
these two cell types (Fig. 4D).
Discussion
Epigenomic studies of chromatin accessibility have yielded tremendous biological
insights, but are currently limited in application by their complex workflows and large
cell number requirements. ATAC-seq offers potentially unique advantages over pre-
existing ChIP-, MNase- and DNase-seq methods. ATAC-seq is an information rich assay,
allowing simultaneous interrogation of factor occupancy, nucleosome positions in
regulatory sites, and chromatin compaction genome-wide. These insights are derived
from both the position of insertion and the distribution of insert lengths captured during
the transposition reaction. While extant methods such as DNase- and MNase-seq can
provide some subsets of the information in ATAC-seq, they each require separate assays
with large cell numbers, which increases time and cost and limits applicability to many
systems. ATAC-seq also provides insert size fingerprints of biologically relevant
genomic regions, suggesting that it captures information on chromatin compaction. We
expect ATAC-seq to have broad applicability, significantly add to the genomics toolkit,
and improve our understanding of gene regulation, particularly when integrated with
other powerful rare cell techniques, such as FACS, laser capture microdissection (LCM)
and recent advancements in RNA-seq26,27. In summary, we believe that the attractive
combination of speed, simplicity and low input requirements of ATAC-seq will enable
new gene regulatory insights into biology and medicine.
Figure 1. ATAC-seq is a sensitive, accurate probe of open chromatin state. A)
ATAC-seq reaction schematic. Transposase (green), loaded with sequencing adapters
(red and blue), inserts only in regions of open chromatin (nucleosomes in grey) and
generates sequencing library fragments that can be PCR amplified. B) Approximate input
material and sample preparation time requirements for genome-wide methods of open
chromatin analysis. C) A comparison of ATAC-seq to other open chromatin assays at a
locus in GM12878 lymphoblastoid cells displaying high concordance. Lower ATAC-seq
track was generated from 500 FACS-sorted cells.
Figure 2. ATAC-seq provides genome-wide information on chromatin compaction.
A) ATAC-seq fragment sizes generated from GM12878 nuclei (red) indicate chromatin-
dependent periodicity with a spatial frequency consistent with nucleosomes, as well as a
high frequency periodicity consistent with the pitch of the DNA helix for fragments less
than 200 bp. (Inset) log-transformed histogram shows clear periodicity persists to 6
nucleosomes. B) Normalized read enrichments for 7 classes of chromatin state previously
defined17.
Figure 3. ATAC-seq provides genome-wide information on nucleosome positioning
in regulatory regions. A) An example locus containing two transcription start sites
(TSSs) showing nucleosome free read track, calculated nucleosome track (Methods), as
well as DNase, MNase, and H3K27ac, H3K4me3, and H2A.Z tracks for comparison. B)
ATAC-seq (198 million paired reads) and MNase-seq (4 billion single-end reads)
nucleosome signal shown for all active TSSs (n=64,836). TSSs are sorted by CAGE
expression. C) TSSs are enriched for nucleosome free fragments, and show phased
nucleosomes similar to those seen by MNase-seq at the -2, -1, +1, +2, +3 and +4
positions. D) Relative fraction of nucleosome associated vs. nucleosome free (NFR)
bases in TSS and distal sites (see Methods). E) Hierarchical clustering of DNA binding
factor position with respect to the nearest nucleosome dyad within accessible chromatin
reveals distinct classes of DNA binding factors. Factors strongly associated with
nucleosomes are enriched for chromatin remodelers.
Figure 4: ATAC-seq assays genome-wide factor occupancy. A) CTCF footprints
observed in ATAC-seq and DNase-seq data, at a specific locus on chr1. B) Aggregate
ATAC-seq footprint for CTCF (motif shown) generated over binding sites within the
genome. C) CTCF predicted binding probability inferred from ATAC-seq data, position
weight matrix (PWM) scores for the CTCF motif, and evolutionary conservation
(PhyloP). Right-most column is the CTCF ChIP-seq data (ENCODE) for this GM12878
cell line, demonstrating high concordance with predicted binding probability. D) Cell
type-specific regulatory network from proband T cells compared with GM12878 B-cell
line. Each row or column is the footprint profile of a TF versus that of all other TFs in the
same cell type. Color indicates relative similarity (yellow) or distinctiveness (blue) in T
versus B cells. NFAT is one of the most highly differentially regulated TFs (red box)
whereas canonical CTCF binding is essentially similar in T and B cells.
References
1. Kornberg, R. D. Chromatin structure: a repeating unit of histones and DNA.
Science 184, 868–871 (1974).
2. Kornberg, R. D. & Lorch, Y. Chromatin Structure and Transcription. Annu. Rev.
Cell. Biol. 8, 563–587 (1992).
3. Mellor, J. The Dynamics of Chromatin Remodeling at Promoters. Molecular Cell
19, 147–157 (2005).
4. Boyle, A. P. et al. High-Resolution Mapping and Characterization of Open
Chromatin across the Genome. Cell 132, 311–322 (2008).
Nature 489, 75–82 (2012).
6. Schones, D. E. et al. Dynamic Regulation of Nucleosome Positioning in the
Human Genome. Cell 132, 887–898 (2008).
7. Valouev, A. A. et al. Determinants of nucleosome organization in primary human
cells. Nature 474, 516–520 (2011).
8. Barski, A. et al. High-Resolution Profiling of Histone Methylations in the Human
Genome. Cell 129, 823–837 (2007).
9. Gerstein, M. B. et al. Architecture of the human regulatory network derived from
ENCODE data. Nature 489, 91–100 (2012).
10. Goryshin, I. Y. & Reznikoff, W. S. Tn5 in vitro transposition. J. Biol. Chem. 273,
7367–7374 (1998).
11. Adey, A. et al. Rapid, low-input, low-bias construction of shotgun fragment
libraries by high-density in vitro transposition. Genome Biol 11, R119 (2010).
12. Gangadharan, S., Mularoni, L., Fain-Thornton, J., Wheelan, S. J. & Craig, N. L.
DNA transposon Hermes inserts into DNA in nucleosome-free regions in vivo.
Proceedings of the National Academy of Sciences 107, 21966–21972 (2010).
13. Song, L. & Crawford, G. E. DNase-seq: a high-resolution technique for mapping
active gene regulatory elements across the genome from mammalian cells. Cold
Spring Harb Protoc 2010, (2010).
14. Simon, J. M., Giresi, P. G., Davis, I. J. & Lieb, J. D. Using formaldehyde-assisted
isolation of regulatory elements (FAIRE) to isolate active regulatory DNA. Nature
Protocols 7, 256–267 (2012).
15. Consortium, T. E. P. A User's Guide to the Encyclopedia of DNA Elements
(ENCODE). PLoS Biol 9, e1001046 (2011).
16. Giresi, P. G. & Lieb, J. D. Isolation of active regulatory elements from eukaryotic
chromatin using FAIRE (Formaldehyde Assisted Isolation of Regulatory
Elements). Methods 48, 233–239 (2009).
17. Hoffman, M. M. et al. Integrative annotation of chromatin elements from
ENCODE data. Nucleic Acids Res. 41, 827–841 (2013).
18. Kornberg, R. D. & Lorch, Y. Chromatin and transcription: where do we go from
here. Current Opinion in Genetics & Development 12, 249–251 (2002).
19. Zhou, J., Fan, J. Y., Rangasamy, D. & Tremethick, D. J. The nucleosome surface
regulates chromatin compaction and couples it with transcriptional repression. Nat
Struct Mol Biol 14, 1070–1076 (2007).
20. Chen, K. et al. DANPOS: Dynamic analysis of nucleosome position and
37
occupancy by sequencing. Genome Research 23, 341351 (2013).
21. Kundaje, A. et al. Ubiquitous heterogeneity and asymmetry of the chromatin
environment at regulatory elements. Genome Research 22, 1735 (2012).
22. Hesselberth, J. R. et al. Global mapping of protein-DNA interactions in vivo by
digital genomic footprinting. Nat Meth 6, 283289 (2009).
23. Boyle, A. P. et al. High-resolution genome-wide in vivo footprinting of diverse
transcription factors in human cells. Genome Research 21, 456464 (2011).
24. Neph, S. et al. An expansive human regulatory lexicon encoded in transcription
factor footprints. Nature 489, 8390 (2012).
25. Pique-Regi, R. et al. Accurate inference of transcription factor binding from DNA
sequence and chromatin accessibility data. Genome Research 21, 447455 (2011).
26. Tang, F. et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat Meth
6, 377382 (2009).
27. Shalek, A. K. et al. Single-cell transcriptomics reveals bimodality in expression
and splicing in immune cells. Nature 498, 236240 (2013).
CHAPTER FOUR Single-cell accessibility reveals principles of regulatory
variation3
Introduction
Heterogeneity within cellular populations has been evident since the first
microscopic observations of individual cells. Recent proliferation of powerful methods
for interrogating single cells1-5 has allowed detailed characterization of this molecular
variation, and provided deep insight into characteristics underlying developmental
plasticity6,7, cancer heterogeneity8, and drug resistance9. In parallel, genome-wide
mapping of regulatory elements in large ensembles of cells has unveiled tremendous
variation in chromatin structure across cell types, particularly at distal regulatory
regions10. Methods for probing genome-wide DNA accessibility, in particular, have
proven extremely effective in identifying regulatory elements across a variety of cell
types11, quantifying changes that lead to both activation and repression of gene
expression. Given this broad diversity of activity within regulatory elements when
comparing phenotypically distinct cell populations, it is reasonable to hypothesize that
heterogeneity at the single cell level extends to accessibility variability within cell types
at regulatory elements. However, the lack of methods to probe DNA accessibility within
individual cells has prevented quantitative dissection of this hypothesized regulatory
variation.
3 Portions of this chapter were taken from Buenrostro et al. Single-cell chromatin accessibility
reveals principles of regulatory variation. Nature. 2015. doi:10.1038/nature14590.
Single-cell ATAC-seq: a measure of chromatin accessibility genome-wide
We have developed a single-cell Assay for Transposase-Accessible Chromatin
(scATAC-seq), improving on the state-of-the-art12 sensitivity by >500-fold. ATAC-seq
uses the prokaryotic Tn5 transposase13,14 to tag regulatory regions by inserting sequencing
adapters into accessible regions of the genome. In scATAC-seq individual cells are
captured and assayed using a programmable microfluidics platform (C1 single-cell Auto
Prep System, Fluidigm) with methods optimized for this task (Fig. 1a). After
transposition and PCR on the Integrated Fluidics Circuit (IFC), libraries are collected and
PCR amplified with cell-identifying barcoded primers. Single-cell libraries are then
pooled and sequenced on a high-throughput sequencing instrument. Using single-cell
ATAC-seq we generated DNA accessibility maps from 254 individual GM12878
lymphoblastoid cells. Aggregate profiles of scATAC-seq data closely reproduce
ensemble measures of accessibility profiled by DNase-seq and ATAC-seq generated from
10^7 or 10^4 cells, respectively (Fig. 1b,c). Data from single cells recapitulate several
characteristics of bulk ATAC-seq data, including fragment size periodicity corresponding
to integer multiples of nucleosomes, and a strong enrichment of fragments within regions
of accessible chromatin (Fig. S1a,b). Microfluidic chambers generating low library
diversity or poor measures of accessibility, which correlate with empty chambers or dead
cells, were excluded from further analysis (Fig. 1d). Chambers passing filter yielded an
average of 7.3 x 10^4 fragments mapping to the nuclear genome. We further validated the
approach by measuring chromatin accessibility from a total of 1,632 IFC chambers
representing 3 tier 1 ENCODE cell lines15 (H1 human embryonic stem cells [ESCs],
K562 chronic myelogenous leukemia and GM12878 lymphoblastoid cells) as well as
from V6.5 mouse ESCs, EML6 (mouse hematopoietic progenitor), TF-1 (human
erythroblast), HL-60 (human promyeloblast) and BJ fibroblasts (human foreskin
fibroblast).
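The chamber filter described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in this work: the `Chamber` record and `passes_filter` function are hypothetical names, the default cutoffs (10,000 fragments and 15% of fragments in peaks) are the dotted lines reported in Fig. 1d for K562 cells, and treating them as inclusive lower bounds is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Chamber:
    """One microfluidic chamber (hypothetical record for illustration)."""
    name: str
    library_size: int      # estimated number of unique fragments
    frac_in_peaks: float   # fraction of fragments in open chromatin peaks


def passes_filter(chamber, min_library_size=10_000, min_frac_in_peaks=0.15):
    """Keep chambers with enough fragments and enough signal in peaks.

    Cutoffs follow the dotted lines in Fig. 1d; inclusive bounds are assumed.
    """
    return (chamber.library_size >= min_library_size
            and chamber.frac_in_peaks >= min_frac_in_peaks)


# Toy values, not measured data.
chambers = [
    Chamber("live_cell", 73_000, 0.40),
    Chamber("empty_chamber", 800, 0.40),
    Chamber("dead_cell", 50_000, 0.05),
]
kept = [c.name for c in chambers if passes_filter(c)]
print(kept)
```

Low library size is expected to flag empty chambers, while a low fraction of fragments in peaks is expected to flag dead cells, so both thresholds are applied together.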
Cell-cell variability in trans
Because regulatory elements are generally present at two copies in a diploid
genome, we observe a near digital (0 or 1) measurement of accessibility at individual
elements within individual cells. For example, we estimate that a typical scATAC-seq
library from a single cell represents a total of 9.4% of promoters (Fig. S1c-f).
The sparse nature of scATAC-seq data makes analysis of cellular variation at individual
regulatory elements impractical. We therefore developed an analysis infrastructure to
measure regulatory variation using changes of accessibility across sets of genomic
features (Fig. 2a,b and Fig. S2a-f). To quantify this variation we first choose a set of
open chromatin peaks, identified using the aggregate accessibility track, which share a
common characteristic (such as transcription factor binding motif, ChIP-seq peaks, cell
cycle replication timing domains, etc.). We then calculate the observed fragments in these
regions minus the expected fragments, downsampled from the aggregate profile, within
individual cells. To correct for bias, we divide this by the root mean square of fragments
expected from a background signal (BS) constructed to estimate technical and sampling
error within single-cell data sets. Herein, we refer to this metric as deviation. Finally,
for any set of features, we aggregate the deviation measurements across cells (Fig. 2b) to
obtain an overall variability score, a metric of excess variance over the background
signal.
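The deviation and variability calculation described above can be sketched in a few lines. This is an illustrative reading of the text, not the implementation in the Supplemental Methods: all function names are hypothetical, the expected fragments are approximated by scaling the aggregate profile to each cell's depth (instead of explicit downsampling), the background signal is represented by a list of user-supplied background peak sets, and the variability score is assumed to be the root mean square of deviations across cells.

```python
import numpy as np


def raw_deviation(counts, peak_mask):
    """Observed minus expected fragments per cell for one peak set.

    counts: (n_cells, n_peaks) fragment counts; peak_mask: boolean per peak.
    """
    counts = np.asarray(counts, dtype=float)
    aggregate = counts.sum(axis=0)                  # aggregate profile
    set_fraction = aggregate[peak_mask].sum() / aggregate.sum()
    depth = counts.sum(axis=1)                      # fragments per cell
    observed = counts[:, peak_mask].sum(axis=1)
    expected = depth * set_fraction                 # aggregate scaled to depth
    return observed - expected


def deviation(counts, peak_mask, background_masks):
    """Raw deviation divided by the per-cell RMS over background peak sets."""
    raw = raw_deviation(counts, peak_mask)
    background = np.stack([raw_deviation(counts, m) for m in background_masks])
    rms = np.sqrt(np.mean(background ** 2, axis=0))
    return raw / rms


def variability(dev):
    """Aggregate deviations across cells (assumed here: root mean square)."""
    return float(np.sqrt(np.mean(np.square(dev))))


# Toy data: cells 0-1 are "TF high" (more fragments in peaks 0-1).
counts = np.array([[4, 4, 1, 1, 1, 1],
                   [4, 4, 1, 1, 1, 1],
                   [1, 1, 2, 2, 2, 2],
                   [1, 1, 2, 2, 2, 2]])
tf_peaks = np.array([True, True, False, False, False, False])
background_sets = [np.array([False, False, True, True, False, False]),
                   np.array([False, False, False, False, True, True])]
dev = deviation(counts, tf_peaks, background_sets)
print(dev, variability(dev))
```

In this toy case the two TF-high cells receive positive deviations and the two TF-low cells receive negative ones. In practice the background sets would need to be matched to the peak set for technical biases, which this sketch does not attempt.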
We first focused our analysis on K562 myeloid leukemia cells, a cell type with
extensive epigenomic data sets16,17. To comprehensively characterize variability
associated with trans-factors within individual K562 cells, we computed variability
across all available ENCODE ChIP-seq, transcription factor motifs and regions that
differed in replication timing (as determined from Repli-Seq data sets18) (Fig. 2c,d). We
found measures of cell-to-cell variability were highly reproducible across biological
replicates (Fig. S2g-i). As expected from proliferating cells, we find increased variability
within different replication timing domains, representing variable ATAC-seq signal
associated with changes in DNA content across the cell cycle. In addition, we discover a
set of trans-factors associated with high variability. These factors include sequence-
specific transcription factors (TFs), such as GATA1/2, JUN, and STAT2, and chromatin
effectors, such as BRG1 and P300. Immunostaining followed by microscopy or flow
cytometry (Fig. 2e and Fig. S3a-d) confirmed heterogeneous expression of GATA1 and
GATA2. Principal component (PC) analysis of single-cell deviations across all trans-
factors shows seven significant PCs, with PC 5 describing changes in DNA abundance
throughout the cell cycle. This analysis suggests that high-variance trans-factors vary
independently of the cell cycle (Fig. 2f and Fig. S3e-g). The remaining PCs show
contributions from several TFs, suggesting that variance across sets of trans-factors
represents distinct regulatory states in individual cells.
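One way to decide how many PCs are significant is to compare the variance explained in observed data against permuted data, as displayed in Fig. 2f. The sketch below is an assumption about how such a comparison could be made, not the procedure used here: the function names, the column-wise permutation scheme, the z-scoring step, and the use of the maximum over permutations as a threshold are all illustrative choices.

```python
import numpy as np


def variance_explained(x):
    """Fraction of variance per principal component of z-scored columns."""
    z = (x - x.mean(axis=0)) / x.std(axis=0)
    singular_values = np.linalg.svd(z, compute_uv=False)
    var = singular_values ** 2
    return var / var.sum()


def count_significant_pcs(x, n_perm=20, seed=0):
    """Count leading PCs whose variance exceeds every permuted data set.

    Each column (trans-factor) is shuffled independently across cells,
    which preserves its distribution but removes factor-factor covariance.
    """
    rng = np.random.default_rng(seed)
    observed = variance_explained(x)
    null = np.empty((n_perm, observed.size))
    for i in range(n_perm):
        shuffled = np.column_stack([rng.permutation(col) for col in x.T])
        null[i] = variance_explained(shuffled)
    threshold = null.max(axis=0)
    n_significant = 0
    for is_above in observed > threshold:
        if not is_above:
            break
        n_significant += 1
    return n_significant, observed, threshold


# Toy deviations: 200 cells x 30 factors driven by 3 latent regulatory states.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 3))
loadings = np.kron(np.eye(3), np.ones((1, 10)))   # each state drives 10 factors
x = latent @ loadings + 0.5 * rng.normal(size=(200, 30))
n_sig, observed, threshold = count_significant_pcs(x)
print(n_sig)
```

Counting only the leading run of PCs above the permuted threshold avoids calling a low-ranked component significant by chance.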
Trans-factors synergize to induce cell-cell variability
We hypothesized that variation associated with different trans-factors can
synergize, either through cooperative or competitive binding, to induce or suppress site-
to-site variability in chromatin accessibility. For example, the most variant factors in
K562 cells, GATA1 and GATA2, display expression heterogeneity and also bind an
identical consensus sequence (GATA), suggesting these factors may compete for access
to DNA sequences. In support of this hypothesis, we find regulatory elements with both
GATA1 and GATA2 ChIP-seq signals show increased variability in accessibility,
whereas sites with only GATA1 or GATA2 show substantially less variability (Fig. 2g
and Fig. S3h-i). In contrast, we find no substantial change in variability of GATA1
binding sites that co-occur with JUN or CEBPB. We also find peaks unique to GATA1
binding are significantly more accessible than peaks unique to GATA2 (Fig. S3j-k),
supporting the hypothesis that GATA1, an activator of accessibility, competes with
GATA2 to induce single-cell variability. Extending this analysis to all TF ChIP-seq data
sets revealed a trans-factor synergy landscape for accessibility variation (Fig. 2g). For
example, chromatin accessibility variance associated with GATA2 binding is
significantly enhanced when the same region could also be bound by GATA1, TAL1 or
P300. In contrast, CTCF, SUZ12, and ZNF143 appear to act as general suppressors of
accessibility variance, unless associated with proximal binding of ZNF143 or SMC3, the
latter a cohesin subunit involved in chromosome looping17,19. Thus, single cell
accessibility profiles nominate distinct trans-factors that, in combination, induce or
suppress cell-to-cell regulatory variation.
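The comparison of co-occurring versus independent binding sites amounts to partitioning peaks into three sets and scoring each one. The sketch below illustrates only that bookkeeping under stated assumptions: `partition_sites` and `synergy` are hypothetical names, and `variability_of` stands for any function that returns a variability score for a boolean peak mask (here a toy per-peak score is averaged purely for demonstration).

```python
import numpy as np


def partition_sites(bound_a, bound_b):
    """Split peaks into A-only, B-only and shared sets (boolean masks)."""
    bound_a = np.asarray(bound_a, dtype=bool)
    bound_b = np.asarray(bound_b, dtype=bool)
    return {
        "a_only": bound_a & ~bound_b,
        "b_only": bound_b & ~bound_a,
        "shared": bound_a & bound_b,
    }


def synergy(bound_a, bound_b, variability_of):
    """Score each set and report the change when factors co-occur."""
    sets = partition_sites(bound_a, bound_b)
    scores = {name: variability_of(mask) for name, mask in sets.items()}
    scores["shared_minus_a_only"] = scores["shared"] - scores["a_only"]
    scores["shared_minus_b_only"] = scores["shared"] - scores["b_only"]
    return scores


# Toy example: six peaks with ChIP-seq overlap for two factors.
bound_a = [1, 1, 1, 0, 0, 0]
bound_b = [0, 1, 1, 1, 0, 0]
toy_score = np.array([1.0, 3.0, 3.0, 1.0, 0.0, 0.0])  # stand-in per-peak value
scores = synergy(bound_a, bound_b, lambda mask: float(toy_score[mask].mean()))
print(scores)
```

A positive difference for the shared set corresponds to the enhanced variability described for GATA1 and GATA2, while a negative difference would correspond to suppression.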
Cell-state and chemical perturbation effects on cell-cell variability
To validate our ability to detect changes in accessibility variance, we used
chemical inhibitors to modulate potential sources of cell-cell variability. Inhibition of
cyclin-dependent kinases 4 and 6 (CDK4/6), essential components of the cell cycle,
caused a marked reduction of variability within peaks associated with DNA replication
timing domains (Repli-seq) (Fig. 3a). The addition of inhibitors of JUN or BCR-ABL
kinases (JNKi and Imatinib, respectively) increased G1/S-associated variability,
suggesting an increase in the subpopulation of G1/S cells, which was validated with flow
cytometry (Fig. S4). JUN variability was one of the top changes caused by JNKi but not
Imatinib, suggesting that high-variance trans-factors can also be specifically and
pharmacologically modulated. Tumor necrosis factor (TNF) treatment of GM12878 cells
specifically modulated accessibility variability at NF-kB sites (Fig. 3b), consistent with
the known stochastic and oscillatory property of nuclear shuttling in this system20.
Together, these results show that variability can be experimentally modulated and further
demonstrate that variability is not solely dependent on the cell cycle.
We observe that trans-factors associated with high variability are generally cell
type specific. Hierarchical bi-clustering of single-cell deviations generated from three cell
lines reveals cell-type specific sets of transcription factor motifs associated with high
variability (Fig. 3c). This analysis also shows cells from different biological replicates
cluster with their cell type of origin (with a single exception), suggesting scATAC-seq
can also be used to deconvolve heterogeneous cellular mixtures. Systematic analysis of
all assayed cell types identified high-variance trans-factor motifs that are generally
unique to specific cell types (Fig. 3d). For example, regions associated with GATA TFs
are most variant in K562s while regions associated with master pluripotency TFs Nanog
and Sox2 are most variant in mouse embryonic stem cells (ESCs), consistent with
previous observations of expression variation of these factors21,22. Importantly, we also
find high variability of GATA1 and PU.1 (SPI1) binding accessibility in EML cells, a
cell type previously shown to have >200x GATA1 and >15x PU.1 expression differences
within clonal cellular subpopulations6. Interestingly, the complete set of identified high-
variance trans-factors contains a number of TFs previously reported to dynamically
localize into the nucleus, including NF-kB, JUN, and ETS/ERG20,23,24, suggesting that
temporal fluctuations in TF concentration may be driving observed chromatin
accessibility heterogeneity. Finally, we find BJ fibroblasts and HL-60s exhibit less
variance among this set of annotated trans-factor motifs, suggesting differences in the
global levels of trans-factor variability across cell lines. Overall these findings suggest
that trans-factors promote cell-type specific chromatin accessibility variation genome-
wide.
Single cells vary in cis
Patterns of variation in accessibility along the linear genome in individual cells
reveal an unexpected connection to higher order chromosome folding. We calculated
single cell deviations within sliding windows across the genome, each encompassing a
fixed number of peaks (N=25) (Fig. 4a). We then determined which windows co-varied
from cell to cell by calculating the correlation of each window with all others on the
same chromosome across individual cells. We then further enhanced this co-correlation
matrix with a secondary correlation analysis, using methods similar to those
employed in chromosome conformation studies25. The resulting matrix, which identifies
pairs of positions in the genome where accessibility co-varies within individual cells,
yields Mb-scale correlation domains highly concordant with previously observed
chromatin domains26 (Fig. 4b-d) (R=0.61 for chromosome 1). These data provide
independent biological validation of large-scale compartmentalization of higher-order
chromatin structure25,26. Moreover, these results suggest that higher-order chromatin
interactions may drive regulatory variability in cis (elements that are close together tend
to be open together), and that ensemble chromosome conformation data may arise in part
from the statistical properties of single cell variation in co-regulated accessibility, a
hypothesis also supported by single-cell FISH measurements of interactions between
DNA loci27.
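The windowed co-correlation analysis described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the analysis used to produce Fig. 4: the function names are hypothetical, windows are assumed to slide one peak at a time, the per-window signal is taken as the sum of per-peak deviation-like values, and the secondary analysis is taken to be the Pearson correlation of the rows of the first correlation matrix.

```python
import numpy as np


def window_sums(per_peak, n_peaks=25):
    """Sum per-peak values in sliding windows of a fixed number of peaks.

    per_peak: (n_cells, n_total_peaks), peaks ordered along one chromosome.
    Returns an array of shape (n_cells, n_total_peaks - n_peaks + 1).
    """
    per_peak = np.asarray(per_peak, dtype=float)
    csum = np.cumsum(np.pad(per_peak, ((0, 0), (1, 0))), axis=1)
    return csum[:, n_peaks:] - csum[:, :-n_peaks]


def cis_correlation(per_peak, n_peaks=25):
    """Window-by-window correlation across cells, then its correlation."""
    windows = window_sums(per_peak, n_peaks)
    first = np.corrcoef(windows, rowvar=False)   # windows x windows
    second = np.corrcoef(first)                  # correlation of correlation rows
    return first, second


# Toy chromosome: 100 peaks in two domains, each following its own latent state.
rng = np.random.default_rng(0)
state = rng.normal(size=(200, 2))                # 200 cells, 2 domain states
domain = np.repeat([0, 1], 50)                   # peak -> domain assignment
per_peak = state[:, domain] + 0.5 * rng.normal(size=(200, 100))
first, second = cis_correlation(per_peak, n_peaks=5)
print(second.shape)
```

In the toy data, windows within the same domain end up strongly correlated in the second matrix and windows in different domains anti-correlated, which is the kind of block structure compared against chromosome conformation data in Fig. 4b.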
Discussion
Using scATAC-seq we dissected single-cell epigenomic heterogeneity and linked
cis- and trans- effectors to variability in accessibility profiles within individual
epigenomes. We identify trans-factors associated with increased accessibility variance,
which we call high-variance trans-factors. Additionally, other trans-factors such as
CTCF appear to buffer variability, perhaps by providing a stable anchor of chromatin
accessibility or insulator function that dampens potential fluctuations. Conversely, co-
occurrence with other factors such as P300 appears to amplify variability, perhaps due to
synergistic interactions. Lineage-specific master regulators are associated with cell-type
specific single-cell epigenomic variability across several cell types, suggesting that
control of single-cell variance is a fundamental characteristic of different biological
states. Finally, variation of chromatin accessibility in cis is highly correlated with
previously reported chromosome compartments, opening the intriguing possibility that
this component of epigenomic noise has its roots in higher-order chromatin organization.
Altogether, these data provide exciting new hypotheses about regulatory mechanisms that
give rise to single-cell heterogeneity.
We envision that future studies will enhance the utility of scATAC-seq by further
improving the recovery of DNA fragments, increasing throughput, and refining methods
of data analysis. Improvements to throughput and new statistical tools will enable single
cells to be partitioned by cell state and analyzed in aggregate to find the individual peaks
that drive variability (Fig. S5). In addition, we anticipate scATAC-seq may be paired
with existing approaches in microscopy and single-cell RNA-seq to provide opportunities
for systems analysis of individual cells. Such an approach will link regulatory variation to
details of phenotypic variation, promising new insight into the molecular underpinnings
of cellular heterogeneity. We believe scATAC-seq will likewise enable the interrogation
of the epigenomic landscape of small or rare biological samples allowing for detailed,
and potentially de novo, reconstruction of cellular differentiation or disease at the
fundamental unit of investigation: the single cell.
Figure 1. Single-cell ATAC-seq provides an accurate measure of chromatin
accessibility genome-wide. (a) Workflow for measuring single epigenomes using
scATAC-seq on a microfluidic device (Fluidigm). (b) Aggregate single-cell accessibility
profiles closely recapitulate profiles of DNase-seq and ATAC-seq. (c) Genome-wide
accessibility patterns observed by scATAC-seq are correlated with DNase-seq data (R =
0.80). (d) Library size versus percentage of fragments in open chromatin peaks (filtered
as described in methods) within K562 cells (N=288). Dotted lines (15% and 10,000)
represent cutoffs used for downstream analysis.
Figure 2. Trans-factors are associated with single-cell epigenomic variability. (a)
Schematic showing two cellular states (TF high and TF low) leading to differential
chromatin accessibility. (b) Analysis infrastructure, which uses a calculated background
signal (BS; see Supplemental Methods section 3.2) to calculate TF deviations and
variability from scATAC-seq data. The TF value is calculated by subtracting the number
of expected fragments from the observed fragments per cell (see Supplemental Methods
section 3.1). (c) Observed cell-to-cell variability within sets of genomic features
associated with ChIP-seq peaks, transcription factor motifs, and replication timing (error
estimates shown in grey, see Methods for details). Variability measured from permuted
background (see Methods) is shown in grey dots. (d) Distribution of normalized
deviations from expected accessibility signal for GATA1 sites in individual cells,
histogram of cells shown in grey, density profile shown in purple (see Methods). (e)
Immunostaining of GATA1 (green) and GATA2 (red) shows protein expression in
K562s. (f) Principal components ranked by fraction of variance explained from observed
data (purple) and permuted data (orange). Bar plot of observed data shown in grey. (g)
Calculated changes in associated variability of factors when present together versus
independently, depicting a context-specific trans-factor variability landscape (see
Methods). Venn-diagrams show variability associated with GATA1 and/or GATA2 and
CTCF and/or SMC3 (co-) occurring ChIP-seq sites.
Figure 3. Cell type specific epigenomic variability. Change of cellular variability due
to chemical perturbations using (a) CDK4/6 cell-cycle inhibitor (K562) or (b) TNF-alpha
stimulation (GM12878), error bars (shown in grey) represent 1 standard deviation of
bootstrapped cells across the two conditions. (c) Heat map of deviations from expected
accessibility signal across trans-factors (rows) and of single cells (columns) from 3 cell
types. Bottom color map represents assignment classification from hierarchical
clustering. (d) Variability associated with trans-factor motifs across 7 cell types. Each
row is normalized to the maximum variability for that motif across cell types (shown
left).
Figure 4. Structured cis variability across single epigenomes. (a) Per-cell deviations
of expected fragments across a region within chromosome 1 (see Methods). For display,
only large deviation cells are shown (N=186 cells). (b) Pearson correlation coefficient
representing topological domain signal (see Methods) of interaction frequency from a
chromatin conformation capture assay (left, data from Kalhor et al.26) or doubly
correlated normalized deviations of scATAC-seq (right) from chromosome 1 (see
Methods). Data in white represents masked regions due to highly repetitive regions. (c)
Permuted cis-correlation map for chromosome 1 (analyzed identically to (b)). (d) Box
highlights a representative region depicting long-range covariability.
Supplemental Figure 1. scATAC-seq data recapitulate bulk assays. (a) Histogram of
aggregated read starts around all TSSs (in K562 cells) comparing ensemble approaches to
scATAC-seq shows high enrichment above background level of reads. (b) DNA fragment
size distribution of ATAC-seq fragments from single cells (grey) and the average of all
single cells (red) display characteristic nucleosome-associated periodicity. (c)
Accessibility across all peaks (n=50,000) in GM12878 cells. (d) Accessibility across all
annotated promoters in GM12878 cells. Typical promoters used for subsequent analysis
are boxed with dotted lines. Recovery of typical promoters shown in (a) within single-
cells within (e) observed data and (f) extrapolated data using measures of predicted
library complexity.
Supplemental Figure 2. scATAC-seq data analysis pipeline and validation of bias
normalization. Standard deviation of log fold change in reads across cells within peaks
binned by deciles of (a) peak intensity, (b) Tn5 bias and (c) GC bias. Variability scores
(incorporating bias normalization) within the same peaks shown in (a-c), peaks are
binned by deciles of (d) peak intensity, (e) Tn5 bias and (f) GC bias. (g-i) Observed
changes in variability comparing the merged set of replicates (K562) to each individual
biological replicate. Error bars represent 1 standard deviation of the variability scores
after bootstrapping cells from each replicate.
Supplemental Figure 3. Characterization of high-variance trans-factors in K562
cells. (a-d) Distribution of (a) GATA1, (b) GATA2, (c) actin and (d) CTCF fluorescence
observed by flow cytometry. Distributions in grey depict isotype controls. (e) Bi-
clustered heat map of single cell deviations as observed within K562 cells (N=239).
Labels on right identify co-clustering of related factors. (f) Bi-clustered heat map of
single-cell deviations observed from permuted data. (g) Projection of factor loadings onto
principal component 1 versus 5 from principal component (PC) analysis of heatmap from
Fig. 2d. Factor loadings do not vary along PC5, while peaks associated with regions with
different replication timings (RepliSeq) have strong variation along this axis. Venn-
diagrams showing variability of (h) GATA1 and/or GATA2, (i) CJUN and/or GATA2
and CEBPB and/or GATA2 (co-) occurring ChIP-seq sites. (j) Distribution of
accessibility among GATA1 only, GATA2 only, and shared sites. (k) Mean accessibility
from GATA1 only, GATA2 only, and shared sites in (j), error bars represent 1 standard
deviation generated by bootstrapping ChIP-seq peaks.
Supplemental Figure 4. Drug treatments modulate factor variability. (a-b) Change in
variability of untreated K562 cells versus cells treated with (a) Imatinib and (b) JUN
inhibitor show increase of variability in factors associated with the cell cycle or s-phase
and JUN factors respectively. (c-f) Flow cytometry data depicting DNA content, using
DAPI or PI, in (c) control K562 cells or cells showing altered cell-cycle status after
treatment with (d) cell-cycle inhibitor, (e) Imatinib and (f) JUN inhibitor.
Supplemental Figure 5. Measurements of individual peaks within single-cells. (a)
The distribution of GATA1 deviation scores for single K562 cells. Volcano plots of (b)
non-GATA1 peaks and (c) GATA1 peaks in K562 cells, p-values were calculated using a
binomial test. (d) The distribution of NF-kB deviation scores for single GM12878 cells.
Volcano plots of (e) non-NFKB peaks and (f) NF-kB peaks in GM12878 cells, p-values
were calculated using a binomial test. Inset numbers show the number of points in upper
left or upper right quadrants of the panel. (g) Accessibility at a genomic locus, showing
(top) aggregate NFKB low (blue) and NFKB high (red) profiles, (middle) single
GM12878 cells ranked by NFKB deviation scores and (bottom) unranked single cells.
References
1. Bendall, S. C. et al. Single-cell mass cytometry of differential immune and drug
responses across a human hematopoietic continuum. Science 332, 687-696 (2011).
2. Raj, A., Rifkin, S. A., Andersen, E. & van Oudenaarden, A. Variability in gene
expression underlies incomplete penetrance. Nature 463, 913-918 (2010).
3. Jaitin, D. A. et al. Massively parallel single-cell RNA-seq for marker-free
decomposition of tissues into cell types. Science 343, 776-779 (2014).
4. Smallwood, S. A. et al. Single-cell genome-wide bisulfite sequencing for assessing
epigenetic heterogeneity. Nat Meth 11, 817-820 (2014).
5. Zong, C., Lu, S., Chapman, A. R. & Xie, X. S. Genome-wide detection of single-
nucleotide and copy-number variations of a single human cell. Science 338,
1622-1626 (2012).
6. Chang, H. H., Hemberg, M., Barahona, M., Ingber, D. E. & Huang, S.
Transcriptome-wide noise controls lineage choice in mammalian progenitor cells.
Nature 453, 544-547 (2008).
7. Imayoshi, I. et al. Oscillatory control of factors determining multipotency and fate
in mouse neural progenitors. Science 342, 1203-1208 (2013).
8. Patel, A. P. et al. Single-cell RNA-seq highlights intratumoral heterogeneity in
primary glioblastoma. Science 344, 1396-1401 (2014).
9. Michor, F. et al. Dynamics of chronic myeloid leukaemia. Nature 435, 1267-1270
(2005).
10. Consortium, T. E. P. An integrated encyclopedia of DNA elements in the human
genome. Nature 489, 57-74 (2012).
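11. Thurman, R. E. et al. The accessible chromatin landscape of the human genome.
Nature 489, 75-82 (2012).
12. Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J.
Transposition of native chromatin for fast and sensitive epigenomic profiling of
open chromatin, DNA-binding proteins and nucleosome position. Nat Meth 10,
1213-1218 (2013).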
13. Goryshin, I. Y. & Reznikoff, W. S. Tn5 in vitro transposition. J. Biol. Chem. 273,
7367-7374 (1998).
14. Adey, A. et al. Rapid, low-input, low-bias construction of shotgun fragment
libraries by high-density in vitro transposition. Genome Biol 11, R119 (2010).
15. Consortium, T. E. P. A User's Guide to the Encyclopedia of DNA Elements
(ENCODE). PLoS Biol 9, e1001046 (2011).
16. Gerstein, M. B. et al. Architecture of the human regulatory network derived from
ENCODE data. Nature 489, 91-100 (2012).
17. Xie, D. et al. Dynamic trans-Acting Factor Colocalization in Human Cells. Cell
155, 713-724 (2013).
18. Hansen, R. S. et al. Sequencing newly replicated DNA reveals widespread
plasticity in human replication timing. Proceedings of the National Academy of
Sciences 107, 139-144 (2010).
19. Parelho, V. et al. Cohesins Functionally Associate with CTCF on Mammalian
Chromosome Arms. Cell 132, 422-433 (2008).
20. Tay, S. et al. Single-cell NF-kB dynamics reveal digital activation and analogue
information processing. Nature 466, 267-271 (2010).
21. Grün, D., Kester, L. & van Oudenaarden, A. Validation of noise models for single-
cell transcriptomics. Nat Meth 11, 637-640 (2014).
22. Singer, Z. S. et al. Dynamic Heterogeneity and DNA Methylation in Embryonic
Stem Cells. Molecular Cell 55, 319-331 (2014).
23. Cai, L., Dalal, C. K. & Elowitz, M. B. Frequency-modulated nuclear localization
bursts coordinate gene regulation. Nature 455, 485-490 (2008).
24. Levine, J. H., Lin, Y. & Elowitz, M. B. Functional roles of pulsing in genetic
circuits. Science 342, 1193-1200 (2013).
25. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions
reveals folding principles of the human genome. Science 326, 289-293 (2009).
26. Kalhor, R., Tjong, H., Jayathilaka, N., Alber, F. & Chen, L. Genome architectures
revealed by tethered chromosome conformation capture and population-based
modeling. Nat. Biotechnol. 30, 90-98 (2012).
27. Giorgetti, L. et al. Predictive Polymer Modeling Reveals Coupled Fluctuations in
Chromosome Conformation and Transcription. Cell 157, 950-963 (2014).
CHAPTER FIVE The epigenomic determinants of human hematopoiesis4
Introduction
The entire human hematopoietic system is maintained by the activity of a small
number of self-renewing hematopoietic stem cells (HSCs). These cells are long-lived and
retain the ability to give rise to multiple distinct lineages of blood cells. During the course
of a single day, more than 200 billion blood cells are produced1, highlighting the need for
tightly controlled regulatory programs that balance self-renewal of the apex stem cells
with downstream production of differentiated effector cells. Despite its functional
complexity, the hematopoietic system is the most extensively characterized adult stem
cell hierarchy whereby many diverse cell types can be isolated through the use of multi-
parameter fluorescence activated cell sorting (FACS)2. This enables interrogation of the
precise transcriptional dynamics that govern the cell state transitions associated with
differentiation and lineage commitment.
Genome-wide sequencing methods are exquisite sensors for assessing the
molecular determinants governing these distinct regulatory programs. Previous studies
have profiled gene expression patterns in mouse3-5 and human6,7 hematopoiesis providing
a rich resource for characterizing these cellular states. However, measuring gene
expression alone provides limited information regarding the causative regulators of cell
identity. The dynamic expression of key transcription factors (TFs) can dramatically alter
the regulatory landscape, which defines the expression of nearby genes, and forms the
molecular basis of specialized regulatory programs. Genome-wide chromatin-based
assays measuring chromatin accessibility or chromatin-bound proteins are sensitive
methods for assaying cellular regulation. Importantly, chromatin accessibility measures
nucleosome-free or nucleosome-depleted sites throughout the genome, which demarcate
active regulatory elements and hotspots for TF binding. However, these assays require
millions of cells, hindering efforts to comprehensively catalogue hematopoietic
regulation and limiting their application to either cell lines8 or whole tissues9, which do
not accurately represent individual primary cell types. Recent developments have enabled
genome-wide chromatin immunoprecipitation5 or chromatin accessibility10 profiling in
rare cellular populations, which has successfully led to the identification of regulatory
elements within mouse hematopoiesis5. These remarkable improvements in speed and
efficiency have afforded the unique opportunity to provide such comprehensive and data-
rich measurements in human hematopoietic cells and provide a platform for
understanding the molecular underpinnings of human blood development and disease.
4 Portions of this chapter were taken from Corces R & Buenrostro et al. Lineage-specific and
single cell chromatin accessibility charts human hematopoiesis and leukemia evolution.
Submitted.
We previously described the Assay for Transposase Accessible Chromatin using
sequencing (ATAC-seq), a method capable of measuring chromatin accessibility in rare
cellular populations10. Here, we report the development of an improved ATAC-seq
protocol, optimized for human blood cells, that allows for more rapid high-quality
measurements with 10-fold fewer cells. We apply this optimized protocol to cells isolated
from 9 healthy human donors, studying 13 of the major cell types of normal
hematopoiesis. In addition, we measure the transcriptomes of the same healthy donors to
derive paired expression data. This atlas of normal human hematopoiesis provides a data-
rich resource for discovering the molecular determinants of human hematopoiesis and
allows for the deconvolution of complex biological data.
Identification of chromatin accessibility landscape in primary blood cells
To better understand the regulatory networks controlling human hematopoietic
differentiation and leukemogenesis, we sought to create a reference regulome and
transcriptome map of the normal hematopoietic hierarchy (Fig. 1a,b). Although ATAC-
seq is highly efficacious for a variety of cell sources, further optimizations were required
to profile rare primary human blood cells from cryopreserved specimens. This protocol,
termed Fast-ATAC, was optimized for use on primary blood cells and relies on a 1-step
membrane permeabilization and transposition using the lysis reagent digitonin. We found
that this simplified protocol provides extremely high quality data (Fig. S1a-c), requires
just 5,000 cells, offers an approximately 10-fold improvement in sensitivity, and reduces
the frequency of mitochondrial reads by ~5 fold (Fig. S1d). However, we note that
digitonin is a gentle detergent and may not be ideal for cell lines and other cell types that
are more resistant to lysis. Overall, Fast-ATAC provided hematopoietic epigenomes with
i) increased speed, ii) fewer cells and iii) lower cost, making it readily adaptable for
large-scale studies of rare cellular populations.
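The mitochondrial read fraction discussed above can be computed directly from per-chromosome read counts. The following is a minimal sketch in Python; the function name `mito_fraction`, the contig label `chrM`, and all read counts are hypothetical values chosen for illustration and are not measurements from this work.

```python
def mito_fraction(reads_per_chrom, mito_name="chrM"):
    """Return the fraction of mapped reads assigned to the mitochondrial contig."""
    total = sum(reads_per_chrom.values())
    if total == 0:
        raise ValueError("no mapped reads")
    return reads_per_chrom.get(mito_name, 0) / total


# Hypothetical mapped-read counts for two libraries (illustration only).
standard = {"chr1": 600_000, "chr2": 400_000, "chrM": 1_000_000}
fast = {"chr1": 1_080_000, "chr2": 720_000, "chrM": 200_000}

# Fold reduction in mitochondrial reads between the two hypothetical libraries.
fold_reduction = mito_fraction(standard) / mito_fraction(fast)
print(f"fold reduction in mitochondrial reads: {fold_reduction:.1f}")
```

With these made-up counts the two libraries differ 5-fold in mitochondrial fraction, which mirrors the order of magnitude described in the text but says nothing about the real data.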
Using Fast-ATAC and RNA-seq, we profiled the chromatin accessibility
landscapes (regulomes) and transcriptomes of 13 distinct cellular populations isolated from
the human hematopoietic hierarchy via fluorescence activated cell sorting (FACS) (Fig.
1a and Fig. S2). Cells were taken directly from donor bone marrow or peripheral blood
without further in vitro manipulation or treating donors with agents such as granulocyte
colony-stimulating factor (G-CSF). These analyses excluded mature granulocytes due to
high endogenous RNases and proteases as well as mature megakaryocytes, which proved
difficult to isolate in adequate cell numbers. The isolated cell populations included 7
unique stem and progenitor and 6 differentiated cell types spanning the myeloid,
erythroid, and lymphoid lineages2,11-13. Altogether, we performed ATAC-seq and RNA-seq
on 3-4 adult donors for each cell population, totaling 49 transcriptomes and 77
regulomes (Fig. S1e).
With this dataset we identified a total of 590,650 hematopoietic accessible peaks.
Each individual cell type of the hematopoietic hierarchy displayed a set of uniquely
expressed genes and uniquely open peaks mapping to genes known to be involved in
cellular functions important for the given cell type (Fig. 1c and Fig. S1f,g). Additionally,
the sets of uniquely open peaks were enriched for motifs of transcription factors known to
be involved in the biological processes of the cell type of interest (Fig. S1h).
We found Fast-ATAC profiles to be highly reproducible between technical
(R=0.93, Fig. 1d) and biological (R=0.93, Fig. 1e) replicates. We also observed a
significant correlation (R=0.73) between Fast-ATAC and DNase-seq (data from the
Epigenomic Roadmap Consortium) of bone-marrow derived CD34+ HSPCs (Fig. 1f).
Importantly, we find that hematopoietic stem cells (HSCs), a CD34+ subpopulation, can
have significantly different chromatin profiles than the bulk CD34+ HSPC pool (R=0.77,
Fig. 1g), highlighting the value of analysis of highly purified stem and progenitor cell
subpopulations.
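Replicate concordance of this kind is typically summarized as a Pearson correlation over per-peak signal. The sketch below shows one way to compute it; the depth normalization (counts per million), the log2 transform with a pseudocount, the function names, and the read counts are all assumptions made for illustration, since the exact normalization behind the reported R values is not stated here.

```python
import math


def log2_cpm(counts, pseudocount=1.0):
    """Scale raw per-peak counts to counts per million, then log2-transform."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads in peaks")
    return [math.log2(c / total * 1e6 + pseudocount) for c in counts]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical read counts in the same six peaks for two replicates.
replicate_1 = [10, 200, 35, 5, 80, 500]
replicate_2 = [12, 180, 40, 6, 95, 450]
r = pearson_r(log2_cpm(replicate_1), log2_cpm(replicate_2))
print(f"R = {r:.2f}")
```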
Chromatin accessibility at distal elements delineates the hematopoietic hierarchy
Paired regulome and transcriptome data provide a unique opportunity to
understand the regulatory networks of human hematopoiesis. Unsupervised hierarchical
clustering of our RNA-seq and ATAC-seq data shows robust classification of cell types
among technical and biological replicates (Fig. 2a,b). In this analysis, ATAC-seq appears
to be more adept at classifying cell types as quantified by the cluster purity14, suggesting
that chromatin accessibility is more cell type-specific and better captures cell identity.
Intriguingly, and in line with previous studies of murine cell subsets5, epigenomes and
transcriptomes also provide different conclusions about the relationship between various
cell types. By RNA-seq, the common myeloid progenitor (CMP) clusters with the
megakaryocyte erythroid progenitor (MEP) (Fig. 2c), whereas, by ATAC-seq CMP
clusters more closely with the HSC and the multipotent progenitor cell (MPP) (Fig. 2d),
which is more consistent with its role in the hematopoietic hierarchy. We reasoned that
inaccurate lineage classification by RNA-seq may be resolved by ATAC-seq, with the
latter revealing the cell-type specific and combinatorial logic of regulatory elements that
control the expression of nearby genes. When regulatory elements were subdivided into
gene promoters and putative distal enhancers (>1000 bp away from the closest TSS),
we found that distal enhancers provide significantly improved cell-type classification
compared to promoters and transcription profiles (Fig. 2e,f). Notably, we find that
promoter elements are largely invariant within CD34+ stem and progenitor cells,
suggesting that chromatin remodeling associated with these linked developmental lineage
decisions occurs predominantly within distal regulatory elements. This observation is
clearly illustrated by the region surrounding the TET2 gene, a gene expressed within a 2-
fold range in all cell types throughout the hematopoietic hierarchy. Despite the invariant
expression of TET2 and ubiquitous accessibility of the TET2 promoter, we find highly
diverse accessibility profiles within nearby distal regulatory elements, clearly
distinguishing HSPCs, NK cells, and T cells (Fig. 2g).
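The promoter versus distal split used above reduces to a nearest-TSS distance test. The following sketch applies the >1000 bp threshold from the text to peak centers on a single chromosome; the function name, the use of peak centers rather than peak edges, and the coordinates are hypothetical choices for illustration.

```python
import bisect


def classify_peaks(peak_centers, tss_positions, distal_cutoff=1000):
    """Label each peak 'distal' if it lies more than distal_cutoff bp from
    the closest TSS, otherwise 'promoter'. Single-chromosome coordinates."""
    if not tss_positions:
        raise ValueError("at least one TSS is required")
    tss_sorted = sorted(tss_positions)
    labels = []
    for center in peak_centers:
        # Only the TSSs flanking the insertion point can be the closest one.
        i = bisect.bisect_left(tss_sorted, center)
        neighbors = tss_sorted[max(i - 1, 0):i + 1]
        distance = min(abs(center - tss) for tss in neighbors)
        labels.append("distal" if distance > distal_cutoff else "promoter")
    return labels


# Hypothetical TSS positions and peak centers (bp).
tss = [5_000, 20_000]
peaks = [5_200, 9_000, 19_000, 21_001]
labels = classify_peaks(peaks, tss)
print(list(zip(peaks, labels)))
```

A genome-wide version would repeat this lookup per chromosome; a peak exactly 1000 bp from a TSS is treated as promoter-proximal here because the text specifies strictly more than 1000 bp for distal elements.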
Enhancer cytometry prospectively deconvolves complex cell populations
Given the accuracy with which regulatory landscapes delineate cell types, we
hypothesized that Fast-ATAC can be used to deconvolve highly complex cellular
populations into their constituent subsets. For instance, the Epigenomic Roadmap
Consortium has provided multiple datasets on heterogeneous tissues, and in particular,
mixtures of CD34+ HSPCs. These data are very useful for understanding the biology of
these cells; however, these tissues represent an ensemble average of multiple distinct cell
types. While some regulatory elements are ubiquitous among all HSPCs, others show
high cell type specificity (Fig. 3a). For example, the accessible site near micro-RNA 1915
shows a robust peak exclusively in CMP cells but shows almost no accessibility in the
CD34+ DNase-seq data. In fact, regulatory elements that are highly cell-type specific are
averaged out and difficult to detect in this bulk CD34+ data (Fig. 3a).
The highly cell type-specific nature of our ATAC-seq data enabled the
development of a strategy we term enhancer cytometry, wherein we enumerate the
frequency of cell types in complex cellular mixtures based on chromatin accessibility
data. To do this, we employ the deconvolution algorithm CIBERSORT14 to quantify the
contribution of each individual cell type to the ensemble profile. Analogous to flow
cytometry of cell surface markers, enhancer cytometry with CIBERSORT uses the
presence or absence of accessibility at tens of thousands of elements to match pre-defined
patterns of cell identity. To do this, we filtered for high-quality distal regulatory elements
and removed promoter signal (see methods) and applied CIBERSORT to define an array
of cell-type specific regulatory elements (Fig. 3b). CIBERSORT employs support vector
regression (SVR) for deconvolution, a method shown to be robust to noise, unknown
mixture content, and multicollinearity14. We validated this approach using a leave-one-
out cross validation and found that enhancer cytometry proved to be highly robust for
classification of all normal hematopoietic cell types (Fig. 3c,d). One exception is the
MPP that showed reasonable but lower accuracy than other cell types. However, we note
that when MPP cells are misclassified, they are most frequently misclassified as HSCs,
their closest normal cell type. Next, we prospectively tested enhancer cytometry on bulk
CD34+ HSPCs and performed flow cytometry in parallel. We found that enhancer
cytometry yielded highly accurate enumeration of the constituent cell types when
compared to flow cytometry (R² = 0.95, Fig. 3e,f). Notably, this cell type deconvolution
was not as accurate without restriction to distal regulatory elements (R² = 0.91). In
addition, we found that enhancer cytometry can also be used to deconvolve CD34+
DNase-seq data (p < 0.001), suggesting that ATAC-seq with enhancer cytometry may be
a general strategy for identifying and counting cells within complex cellular mixtures.
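The core idea of enhancer cytometry, expressing a mixture profile as a non-negative combination of reference cell-type profiles, can be illustrated with a small worked example. Note that the sketch below deliberately swaps in plain non-negative least squares solved by projected gradient descent; it is not the support vector regression used by CIBERSORT, and the signature matrix, cell-type labels, and mixture values are hypothetical.

```python
def deconvolve(signature, mixture, n_iter=2000):
    """Estimate cell-type fractions by non-negative least squares.

    signature: list of rows, one per regulatory element; each row holds the
               reference accessibility of that element in every cell type.
    mixture:   accessibility of the same elements in the mixed sample.
    """
    n_types = len(signature[0])
    # Step size 1/trace(S^T S) never exceeds 1/lambda_max, so descent is stable.
    step = 1.0 / sum(v * v for row in signature for v in row)
    fractions = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        residual = [
            sum(s * f for s, f in zip(row, fractions)) - m
            for row, m in zip(signature, mixture)
        ]
        gradient = [
            sum(row[k] * r for row, r in zip(signature, residual))
            for k in range(n_types)
        ]
        # Gradient step followed by projection onto non-negative values.
        fractions = [max(0.0, f - step * g) for f, g in zip(fractions, gradient)]
    total = sum(fractions)
    if total == 0:
        raise ValueError("all estimated fractions are zero")
    return [f / total for f in fractions]


# Hypothetical signature: three cell-type-specific elements plus one shared element.
SIGNATURE = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
CELL_TYPES = ["HSC", "MPP", "CMP"]
mixture = [0.6, 0.3, 0.1, 1.0]  # synthetic mixture: 60% HSC, 30% MPP, 10% CMP
estimated = deconvolve(SIGNATURE, mixture)
print(dict(zip(CELL_TYPES, (round(f, 3) for f in estimated))))
```

Least squares is used here only because it is compact enough to read in full. It lacks the robustness to noise and unknown mixture content that the text attributes to SVR, so it should be read as a conceptual illustration rather than a substitute.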
Regulatory networks of normal hematopoiesis
To better understand the mechanisms governing these diverse regulatory
landscapes, we sought to quantify the effect of specific trans-factors at each
developmental transition. To do this we adapted a computational framework we
previously developed to measure accessibility across regulatory elements sharing a
common feature, such as a TF motif15. In brief, we classified hematopoietic regulatory elements
by their underlying transcription factor motifs and calculated a bias corrected deviation
score, which represents a differential gain or loss of accessibility across peaks sharing a
given motif for each transition in the hematopoietic hierarchy. We note that, unlike
current methods for TF footprinting16, this measure of TF accessibility is highly robust to
the number of sequenced reads, DNA sequence bias, and signal-to-noise bias. We
therefore chose this approach to measure the effect that a given TF motif enacts on the
accessible genome at each stage of hematopoiesis; for subsequent visualization, we
condensed similar motifs to create a non-overlapping list (see methods). We find TF
motifs such as GATA, RUNX, and SPI1 to be dominant regulators of chromatin
accessibility (Fig. 4a and Fig. S3a). Notably, these factors have previously been
shown to be master regulators of hematopoiesis17-19. We find that activation of
these TFs is highly cell-type specific, often displaying step-wise gains across
developmental lineages. This is exemplified by the GATA and PAX motifs, which
are strongly enriched in erythroid and lymphoid lineages, respectively (Fig. 4b,c). To
validate this approach for determining global TF motif regulators of cell identity, we
compared GATA TF footprints20 between MEPs (GATA high) and common lymphoid
progenitors (CLPs) (GATA low) and found that CLPs had no detectable binding at
GATA sites when compared to MEPs (Fig. 4d). For further validation, we employed
PIQ21, a TF footprinting algorithm, and found drastically fewer GATA footprints in CLPs
compared to MEPs (N=173 and N=27,292, respectively), thus confirming our analytical
strategy for measuring TF binding.
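The deviation score described above can be illustrated in its simplest, uncorrected form: for each sample, compare the reads observed in peaks carrying a motif with the number expected if that sample matched the pooled average. The sketch below omits the bias correction mentioned in the text (which requires matched background peak sets), and the function name, sample labels, and counts are hypothetical.

```python
def raw_deviation(counts, has_motif):
    """Uncorrected motif deviation per sample.

    counts:    dict mapping sample name -> list of per-peak read counts.
    has_motif: list of booleans, True where the peak contains the motif.
    Returns (observed - expected) / expected for each sample.
    """
    n_peaks = len(has_motif)
    pooled = [sum(c[p] for c in counts.values()) for p in range(n_peaks)]
    pooled_total = sum(pooled)
    if pooled_total == 0:
        raise ValueError("no reads in any peak")
    # Share of all reads, pooled over samples, that fall in motif-containing peaks.
    motif_share = sum(v for v, m in zip(pooled, has_motif) if m) / pooled_total
    deviations = {}
    for sample, sample_counts in counts.items():
        observed = sum(v for v, m in zip(sample_counts, has_motif) if m)
        expected = sum(sample_counts) * motif_share
        if expected == 0:
            raise ValueError(f"expected count is zero for sample {sample}")
        deviations[sample] = (observed - expected) / expected
    return deviations


# Hypothetical counts in four peaks; the first two peaks carry the motif.
counts = {"MEP": [30, 30, 20, 20], "CLP": [10, 10, 40, 40]}
has_motif = [True, True, False, False]
deviations = raw_deviation(counts, has_motif)
print(deviations)
```

In this toy input the motif-containing peaks gain accessibility in one sample and lose it in the other, so the two deviations have opposite signs.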
We reasoned that the accessibility of a given motif should correlate with the
expression of the associated transcription factor. However, the underlying motif sequence
does not identify the precise causative regulator of accessibility at those motif instances.
This is a common issue in epigenomic studies and particularly important for cases in
which many factors share identical or near-identical TF motifs. For example, the GATA
motif is shared among 6 TFs (GATA1-6), while the PAX motif is shared among 9 TFs. In
an effort to assign motifs to transcription factors, we integrated our ATAC-seq and RNA-
seq data to predict causative regulators of motif accessibility. To do this, we employed
CIS-BP22, a comprehensive database of in vitro and in silico derived motifs, to create an
association table linking hematopoietic TF motifs to 806 genes by motif similarity (Fig.
S3b-e). Next, we calculated correlation coefficients for the expression of all known TFs23
to deviation scores across hematopoiesis. Using this approach we find a striking
correlation of motif usage with the expression of known master regulators of
hematopoiesis (Fig. 4e). For example, the expression levels of GATA1 and PAX5 are highly
correlated with accessibility at GATA and PAX motifs, respectively (R = 0.75, P = 10^-18
and R = 0.88, P = 10^-230; Fig. 4e-g and Fig. S3f). Interestingly, for some motifs, such as
the HOX motif, we find many putative regulators with weak correlations (N = 11; Fig.
S3g,h), suggesting that regulation of HOX accessibility is more complex. Together, these
results highlight the utility of a systems-level analysis of epigenome and transcriptome
data.
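The motif-to-factor assignment described above amounts to ranking the TFs that share a motif by how well their expression tracks that motif's accessibility across cell types. The sketch below shows this ranking step only; the function names, the five cell types, and every expression and deviation value are hypothetical, and the motif-similarity table built from CIS-BP is assumed to have already narrowed the candidate list.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def rank_candidate_tfs(motif_deviation, tf_expression):
    """Sort candidate TFs by correlation of expression with motif deviation."""
    scores = {
        tf: pearson_r(expression, motif_deviation)
        for tf, expression in tf_expression.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical values across five cell types (HSC, CMP, MEP, Ery, CLP).
gata_motif_deviation = [0.0, 0.2, 1.5, 2.8, -0.9]
candidate_expression = {
    "GATA1": [0.5, 1.0, 5.0, 8.0, 0.1],
    "GATA2": [4.0, 4.5, 5.0, 1.0, 0.5],
    "GATA3": [2.0, 1.0, 0.5, 0.2, 3.0],
}
ranked = rank_candidate_tfs(gata_motif_deviation, candidate_expression)
for tf, r in ranked:
    print(f"{tf}: R = {r:.2f}")
```

A strongly negative correlation is also informative, since it may point to a factor whose expression accompanies loss of accessibility at the motif.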
Accessibility profiles of purified cell populations identify the ontogeny of human
diseases
In addition to enhancing our understanding of developmental gene regulation, the
hematopoietic regulome can trace the ontogeny of activity in the noncoding genome that
impacts human disease. Many genome-wide association studies (GWAS) have linked
diseases to polymorphisms, but have not been able to pinpoint the cells responsible for
those phenotypes. By measuring the activity of regulatory elements that overlap regions
with predicted sites of functional variation from GWAS, it is now possible to more
accurately predict the specific cell types impacted by genetic variants linked to diverse
human diseases24-26. To do this we first filtered for GWAS that were significantly
enriched in hematopoietic cells (Fig. S4a,b; see methods), then calculated deviation
scores for each GWAS across the hematopoietic hierarchy as described above. We found
that each of these associations can be traced through the hematopoietic lineage to predict
the developmental point at which each variant may first exert its effects, thus enriching
our understanding of developmental origins of human disease (Fig. 4h-k and Fig. S4c).
As a positive control example, polymorphisms linked to mean corpuscular volume
(MCV), a measure of the average volume of an erythrocyte, are most strongly
enriched in erythroblasts (Fig. 4h). Intriguingly, many regions associated with MCV
polymorphisms first become accessible at the CMP stage and increase in accessibility in
MEP cells. These non-coding polymorphisms are predicted to affect transcription factor
binding and would, therefore, lead to closure of sites that would otherwise be accessible.
From this, MCV-associated polymorphisms found in the accessible regions of CMPs and
MEPs suggest that these polymorphisms exert their effects prior to full erythroid lineage
commitment. As a second example, polymorphisms associated with rheumatoid arthritis
(RA) show a strong enrichment in B cells (Fig. 4i). This association is consistent with the
known role of autoantibodies and pathogenic B cells in the pathogenesis of RA, as well
as the documented success of B cell depletion therapy in the treatment of RA27,28.
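The first step of this GWAS analysis, finding which disease-linked SNPs fall inside accessible elements of a given cell type, is an interval lookup. The sketch below covers only that overlap step on a single chromosome; it does not reproduce the enrichment filtering or the deviation scores used in this work, and the function name, half-open coordinate convention, peak sets, and SNP positions are hypothetical.

```python
import bisect


def count_snps_in_peaks(peaks, snp_positions):
    """Count SNPs that fall inside any peak.

    peaks: non-overlapping (start, end) intervals, half-open [start, end),
           all on one chromosome.
    """
    ordered = sorted(peaks)
    starts = [start for start, _ in ordered]
    hits = 0
    for position in snp_positions:
        # Index of the last peak whose start is <= position, if any.
        i = bisect.bisect_right(starts, position) - 1
        if i >= 0 and position < ordered[i][1]:
            hits += 1
    return hits


# Hypothetical accessible peaks per cell type and trait-associated SNP positions.
peaks_by_cell_type = {
    "Erythroblast": [(100, 200), (500, 650)],
    "B cell": [(1000, 1200)],
}
snps = [150, 199, 200, 600, 1100]
overlap = {
    cell_type: count_snps_in_peaks(peaks, snps)
    for cell_type, peaks in peaks_by_cell_type.items()
}
print(overlap)
```

Raw overlap counts depend on how much of the genome each cell type has accessible, which is one reason a normalized score such as the deviation described in the text is preferable for comparing cell types.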
We find a more complex pattern in the disease alopecia areata, an autoimmune
disease in which hair is lost from some or all areas of the body. The autoimmunity
driving this disease has recently been associated with both innate and adaptive immune
responses29, a result consistent with the enrichment of polymorphisms for alopecia areata
in both CD4+ and CD8+ T cells and monocytes (Fig. 4j). B cells also harbor many active
elements associated with alopecia areata but have not been studied in this disease,
suggesting a new direction of investigation. Importantly, the disease associations that are
highlighted by our data are not limited to diseases canonically associated with
hematopoietic cells; polymorphisms linked to Alzheimer's disease show a strong
enrichment in B cells and monocytes, two cell types that have predicted roles in the
pathogenesis of the disease24,30,31 (Fig. 4k).
Discussion
Here we report a rich resource charting the epigenomic and transcriptomic
landscape of 13 unique blood cell types. This resource relies on the accurate and precise
determination of the epigenomic landscapes in primary human blood cells, made possible
by Fast-ATAC. The chromatin accessibility profiles of blood cells are highly cell type
specific and allow for a much more robust classification system than more frequently
used transcriptional profiles. Unsupervised clustering of accessible chromatin regions,
specifically distal enhancers, groups individual cell types with extremely high cluster
purity, demonstrating that these distal regulatory elements more precisely define cell
identity and developmental trajectory. Enhancer cytometry harnesses this specificity and
proves to be a useful strategy to navigate regulome data. By matching patterns of distal
element accessibility to known profiles of pure cell types, enhancer cytometry
enumerates the frequencies of pure cell types in complex cell mixtures. This technique
enabled the accurate deconvolution of data derived from CD34+ bone marrow cells into
the constituent highly similar HSPC cell types. Flow cytometry has become a standard
technique, but it is typically limited to a handful of cell surface markers, each requiring a
different antibody that may have off-target binding and gating idiosyncrasies. In contrast,
enhancer cytometry employs a universal probe system (a transposase) to simultaneously
interrogate hundreds of thousands of regulatory elements, empowering an extremely
robust classification system. An important limitation of enhancer cytometry is that the
method destroys the cell as the measurements are made, and thus does not permit
prospective cell purification at present. We note that while we have used well-
characterized cell types with known cell surface immunophenotypes to generate pure cell
type reference maps, single cell ATAC-seq with enhancer cytometry may be used as an
unbiased measure of cell type identity within a population providing archetypal cell
profiles within complex cellular populations. In principle, this general approach may be
used to resolve cellular heterogeneity in any tissue or organism.
This atlas of human hematopoiesis enriches the interpretation of GWAS results in
several ways. First, we identify strong associations of disease-linked polymorphisms with
the open chromatin landscapes of specific hematopoietic cell types, notably the
developmental contexts in which the disease-relevant elements first become active. In the
case of mean corpuscular volume, a measurement of the size of red blood cells, the
strongest association occurs in erythroblast cells, but a significant association can be seen
as early as the common myeloid progenitor stage (CMP). These results are consistent
with the concept that many enhancers are developmentally primed prior to their
activation following cell differentiation5. Given our in-depth characterization of known
human HSPC subtypes, we are able to identify the earliest progenitor cells that may be
relevant in the pathogenesis of specific diseases and elucidate putative targets for
corrective action. It is now well accepted that effective genetic correction of coding
mutations needs to take place in the stem cell compartment - e.g. the HSC in blood or
basal cells in epithelia - in order to achieve long lasting phenotypic correction in the
tissue. The same logic applies to genetic variants in the noncoding genome and suggests
the need to map the developmental ontogeny of regulatory elements. Comprehensive and
cell type-specific regulome maps will help to nominate hypotheses of relevant cell types
in diseases.
Lastly, this resource provides a platform to identify specific trans-acting
regulators that drive blood cell identity and function. Integration of ATAC-seq and RNA-
seq data improves motif-transcription factor pairing and enables the accurate
determination of causative regulators of chromatin accessibility throughout hematopoietic
differentiation. We anticipate this combined data set, which represents a dynamic
developmental process, to be a rich resource for continued efforts to build computational
tools that model both cis32 and trans33 determinants of chromatin accessibility and gene
expression.
Figure 1. Interrogation of chromatin landscapes in primary blood cells. (a)
Schematic of the human hematopoietic hierarchy shows the 13 primary cell types
analyzed in this work. Granulocytes and megakaryocytes were excluded. (b) Diagram of
analyses performed using paired ATAC-seq and RNA-seq data in both primary human
blood cells and primary patient AML cells. (c) Normalized ATAC-seq profiles at
developmentally important genes. Profiles represent the union of all technical and
biological replicates for each cell type. See Supplementary Table 1 for the exact number
of technical and biological replicates for each cell type. (d-g) Scatter plot showing
correlation of (d) technical replicates, (e) different human donors, (f) ATAC-seq and
DNase-seq data derived from CD34+ HSPCs, and (g) ATAC-seq HSCs with bulk CD34+
HSPCs.
Figure 2. Distal regulatory elements enable accurate classification of the
hematopoietic hierarchy. (a,b) Hierarchical clustering of (a) RNA-seq (N=49) and (b)
ATAC-seq (N=77) data from all biological replicates of 13 normal hematopoietic cell
types. Values shown are Pearson correlation coefficients. Cluster purity quantifies the
degree to which cells of the same lineage (color coded in the key) are clustered together. (c,d)
Phylogenetic dendrograms of (c) RNA-seq and (d) ATAC-seq data showing inter-cell
type correlations derived from aggregate averages of all biological and technical
replicates. Length of tree branches represents Euclidean distance. Data represents the
union of all technical and biological replicates for each cell type. (e,f) Hierarchical
clustering of ATAC-seq profiles (N=77) mapping to (e) promoters and (f) distal
regulatory elements. (g) ATAC-seq peaks in the TET2 locus show highly variable distal
regulatory landscapes (left) and relatively constitutive expression of TET2 (right). Data
represents the union of all technical and biological replicates for each cell type.
Figure 3. Enhancer cytometry allows for deconvolution of the hematopoietic
hierarchy. (a) Normalized ATAC-seq profiles of HSPC subsets and ensemble CD34+
HSPC DNase-seq profiles illustrating heterogeneity amongst CD34+ HSPC
subpopulations. Predicted cell fractions are shown on the left and nearest annotated genes
are shown on the bottom. (b) Schematic of enhancer cytometry, including methods to
define a signature matrix of highly cell-type specific enhancers (right panel, N=735).
(c,d) Benchmarking of enhancer cytometry using randomly permuted synthetic mixtures
to test robustness to (c) sequential subtraction and (d) randomized mixture content. Test
data and training data are non-overlapping. Error bars in (c) represent the standard
deviation of 100 random permutations. (e) Enhancer cytometry of ATAC-seq data
derived from FACS sorted bulk CD34+ HSPCs identifies fractional contribution from all
expected cell types. (f) Correlation of predicted fractional contribution of each HSPC cell
type by enhancer cytometry versus flow cytometric ground truth data of input CD34+
cells.
Figure 4. Integrative analysis of the hematopoietic regulome refines transcriptional
circuitry driving cell specification and enriches the understanding of human disease.
(a) Transcription factor dynamics showing major TFs driving hematopoietic regulomes.
The size of the circle represents the effect of that motif in driving accessibility in human
blood cells. The relative distance between circles represents the co-occurrence of motifs
throughout hematopoietic differentiation (see methods). (b,c) Usage of the (b) GATA and
(c) PAX motif throughout hematopoietic differentiation. Values represent the relative
deviation of the motif accessibility, a measure of motif usage, compared to that in HSCs.
(d) Footprint analysis of the GATA motif in MEP and CLP cells. (e) Correlation
(Pearson) of motif accessibility and significance of gene expression for GATA (top) and
PAX (bottom). Red dots represent DNA-binding factors annotated to bind the given
motif; gray dots represent all other DNA-binding factors. (f,g) Expression of (f) GATA1
and (g) PAX5 phenocopies the usage of the GATA and PAX motifs, respectively, throughout
hematopoietic differentiation. (h-k) Relative deviation scores of chromatin accessibility within
hematopoietic regulatory elements with GWAS SNPs for (h) mean corpuscular volume,
(i) rheumatoid arthritis, (j) alopecia areata, and (k) Alzheimer's disease.
Supplementary Figure 1. Data processing pipelines. (a) ATAC-seq insert size
distribution for three biological replicates of HSCs. (b,c) Enrichment of signal at
annotated transcription start sites (TSS) from Fast-ATAC data compared to (b) DNase-
seq and (c) previously published ATAC-seq data using the original ATAC-seq protocol10.
(d) Fraction of total mitochondrial reads derived from the original ATAC-seq protocol
and the Fast-ATAC protocol. (e) Accessible chromatin landscapes surrounding a
constitutively accessible region of the genome. Profiles represent the union of all
technical and biological replicates for each cell type. (f,g) GO Term analyses from unique
(f) gene expression and (g) accessible peaks from normal hematopoietic cells. (h)
Enrichment of developmentally relevant motifs in accessible peaks.
Supplementary Figure 2. Cell sorting strategies. (a) Representative examples of
sorting strategies for the seven CD34+ HSPC populations isolated in this study.
Supplementary Figure 3. Trans regulators of hematopoiesis. (a) Summary of motif
deviations across hematopoiesis normalized by maximum and minimum signal. Scale is
represented above each column. (b) Clustering of hematopoiesis TF motifs (N=46) with
CIS-BP motifs (N=806) using Pearson correlation (see methods). (c,d) Example of
clustered motifs for (c) GATA4 and (d) MEIS1. (e) Histogram of all correlation values
shown in (b) with lists of putative hematopoietic regulators highlighted (N=255). (f)
Correlation of motif deviations to gene expression changes in hematopoiesis for two
developmentally important TFs, GATA1 and PAX5. (g,h) Summary list of putative TF
(g) positive and (h) negative regulators of hematopoiesis. Motifs are listed on the left and
genes are listed on the right. Values represent correlation coefficients (Pearson).
Supplementary Figure 4. GWAS enrichments across hematopoiesis. (a)
Representative example of GWAS enrichment across tissues (see methods). Colors as
shown in (b). (b) Hierarchical clustering of all GWAS (N=235) across diverse tissues. (c)
Summary of GWAS deviations across hematopoiesis normalized by maximum and
minimum signal.
References
1. Quesenberry, P. J. & Colvin, G. A. Hematopoietic Stem Cells, Progenitor Cells,
and Cytokines. In Williams Hematology. 153-174 (McGraw-Hill, 2005).
2. Seita, J. & Weissman, I. L. Hematopoietic stem cell: self-renewal versus
differentiation. Wiley Interdisciplinary Reviews: Systems Biology and Medicine 2,
640-653 (2010).
3. Ji, H. et al. Comprehensive methylome map of lineage commitment from
haematopoietic progenitors. Nature 467, 338-342 (2010).
4. Inlay, M. A. et al. Ly6d marks the earliest stage of B-cell specification and
identifies the branchpoint between B-cell and T-cell development. Genes and
Development 23, 2376-2381 (2009).
5. Lara-Astiaso, D. et al. Chromatin state dynamics during blood formation. Science
55, 110 (2014).
6. Chen, L. et al. Transcriptional diversity during lineage commitment of human
blood progenitors. Science 345, 1251033 (2014).
7. Novershtern, N. et al. Densely interconnected transcriptional circuits control cell
states in human hematopoiesis. Cell 144, 296-309 (2011).
8. Bernstein, B. E. et al. An integrated encyclopedia of DNA elements in the human
genome. Nature 489, 57-74 (2012).
9. Consortium, R. E. et al. Integrative analysis of 111 reference human epigenomes.
Nature 518, 317-330 (2015).
1213-1218 (2013).
11. Majeti, R., Park, C. Y. & Weissman, I. L. Identification of a hierarchy of
multipotent hematopoietic progenitors in human cord blood. Cell Stem Cell 1,
635-45 (2007).
12. Manz, M. G., Miyamoto, T., Akashi, K. & Weissman, I. L. Prospective isolation of
human clonogenic common myeloid progenitors. Proceedings of the National
Academy of Sciences of the United States of America 99, 11872-11877 (2002).
13. Kohn, L. A. et al. Lymphoid priming in human bone marrow begins before
expression of CD10 with upregulation of L-selectin. Nature Immunology 13,
963-971 (2012).
14. Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression
profiles. Nat Meth 12, 110 (2015).
16. He, H. H. et al. Refined DNase-seq protocol and data analysis reveals intrinsic bias
in transcription factor footprint identification. Nat Meth 11, 73-78 (2013).
17. Weiss, M. J. & Orkin, S. H. GATA transcription factors: key regulators of
hematopoiesis. Experimental Hematology 23, 99-107 (1995).
18. Burns, C. E., Traver, D., Mayhall, E., Shepard, J. L. & Zon, L. I. Hematopoietic
stem cell fate is established by the Notch-Runx pathway. Genes & development 19,
2331-42 (2005).
19. Nerlov, C. & Graf, T. PU.1 induces myeloid lineage commitment in multipotent
hematopoietic progenitors. Genes & development 12, 2403-2412 (1998).
20. Neph, S. et al. An expansive human regulatory lexicon encoded in transcription
factor footprints. Nature 489, 83-90 (2012).
21. Sherwood, R. I. et al. Discovery of directional and nondirectional pioneer
transcription factors by modeling DNase profile magnitude and shape. Nat.
Biotechnol. 32, 171-8 (2014).
22. Weirauch, M. T. et al. Determination and Inference of Eukaryotic Transcription
Factor Sequence Specificity. Cell 158, 1431-1443 (2014).
23. Vaquerizas, J. M., Kummerfeld, S. K., Teichmann, S. A. & Luscombe, N. M. A
census of human transcription factors: function, expression and evolution. Nature
Reviews Genetics 10, 252-263 (2009).
24. Gjoneska, E. et al. Conserved epigenomic signals in mice and humans reveal
immune basis of Alzheimer's disease. Nature 518, 365-369 (2015).
25. Farh, K. K.-H. et al. Genetic and epigenetic fine mapping of causal autoimmune
disease variants. Nature 518, 337-343 (2015).
26. Maurano, M. T. et al. Systematic Localization of Common Disease-Associated
Variation in Regulatory DNA. Science 337, 1190-1195 (2012).
27. De Vita, S. et al. Efficacy of selective B cell blockade in the treatment of
rheumatoid arthritis: evidence for a pathogenetic role of B cells. Arthritis &
Rheumatology 46, 2029-33 (2002).
28. Coenen, M. J. H. & Gregersen, P. K. Rheumatoid arthritis: a view of the current
genetic landscape. Genes and Immunity 10, 101-111 (2009).
29. Petukhova, L. et al. Genome-wide association study in alopecia areata implicates
both innate and adaptive immunity. Nature 466, 113-117 (2010).
30. Butovsky, O., Kunis, G., Koronyo-Hamaoui, M. & Schwartz, M. Selective
ablation of bone marrow-derived dendritic cells increases amyloid plaques in a
mouse Alzheimer's disease model. European Journal of Neuroscience 26, 413-416
(2007).
31. Khoury, El, J. et al. Ccr2 deficiency impairs microglial accumulation and
accelerates progression of Alzheimer-like disease. Nat Med 13, 432-438 (2007).
32. González, A. J., Setty, M. & Leslie, C. S. Early enhancer establishment and
regulatory locus complexity shape transcriptional programs in hematopoietic
differentiation. Nature Genetics (2015).
33. Whitaker, J. W., Chen, Z. & Wang, W. Predicting the human epigenome from
DNA motifs. Nat Methods 12, 265-272 (2015).
CHAPTER SIX The regulatory landscape of acute myeloid leukemia⁵
Introduction
Dysregulation of the intricate regulatory networks of the hematopoietic system
has been shown to play a critical role in the development of hematologic malignancies1.
Despite a low overall mutation rate2 and prolonged periods between cell divisions3, the
long lifespan of HSCs makes them susceptible to the accumulation of mutations over
time. Recent work4-8 has demonstrated that HSCs constitute a cellular reservoir for
mutation acquisition that plays a causative role in multiple hematopoietic malignancies.
In particular, in the case of acute myeloid leukemia (AML), HSCs isolated from leukemia
patients have been shown to harbor some but not all of the genetic alterations found in the
frankly leukemic cells and have, therefore, been termed pre-leukemic HSCs. Importantly,
many of the genes found to be recurrently mutated during the pre-leukemic phase of
AML have been shown to regulate the epigenome5,6, such as DNA methyltransferase 3A
(DNMT3A)9, ten-eleven translocation 2 (TET2)10, and isocitrate dehydrogenase 1 and 2
(IDH1/2)11,12. However, the role of these epigenetic mutations during the evolutionary
process of leukemogenesis and their effects on the regulatory networks that govern
normal hematopoiesis remain poorly understood. Notably, longstanding debates have
centered on how cell fate choices are corrupted in human leukemias13: whether leukemic
cells truly harbor multiple lineage-specific regulatory programs at once (termed "lineage
infidelity") or merely maintain bipotential progenitor states that normally exist in
development (termed "lineage promiscuity"). These are fundamental issues that may be
resolved by modern epigenomic technologies.

⁵ Portions of this chapter were taken from Corces R & Buenrostro et al. Lineage-specific and
single cell chromatin accessibility charts human hematopoiesis and leukemia evolution.
Submitted.
With hematopoiesis as a reference of normal development, we measure the effects
on the leukemogenic process of both early mutations in epigenetic modifiers and late
mutations in proliferative oncogenes, providing the first characterization of the full
evolutionary process of leukemogenesis. Through direct comparison of hematopoietic
cells isolated from normal human bone marrow and patient-matched pre-leukemic HSCs,
leukemia stem cells, and leukemic blast cells, we chart the genetic and epigenetic
progression from normal to malignant in AML. We demonstrate that the vast majority of
epigenetic and transcriptomic change that occurs during leukemogenesis is derived from
normal hematopoietic differentiation. Moreover, diverse genetic mutations can lead to
similar epigenetic alterations, suggesting a common path for the leukemogenic process.
Our results provide key insights into the evolutionary process of leukemogenesis and
identify important transcriptional programs that could be targeted to disrupt this process
during its earliest stages. In summary, this work serves as a rich resource for the study of
regulatory dynamics in normal and malignant hematopoiesis.
Leukemogenesis and cancer evolution in AML
We sought to characterize the evolution of AML, one of the most aggressive
hematopoietic malignancies14, in the context of normal hematopoiesis. To this end, we
first identified 3 distinct stages of AML evolution: pre-leukemic HSCs (pHSCs),
leukemia stem cells (LSCs), and leukemic blast cells (blasts). Each of these leukemic cell
populations can be enriched based on immunophenotype via FACS purification.
Unmutated HSCs serve as the reservoir for mutation acquisition during the early phases
of leukemogenesis (Fig. 1a). Acquisition of mutations, typically in genes that regulate the
epigenome, creates pHSCs that expand to create a pre-leukemic clone. Subsequent
acquisition of progressor mutations, typically in genes that lead to increased proliferation,
generates LSCs that are capable of self-renewal and the production of AML blasts (Fig.
1a).
Importantly, the population of HSCs isolated from leukemia patients by FACS
represents a heterogeneous mixture of healthy unmutated HSCs and pre-leukemic HSCs.
To quantify this heterogeneity, we define the pre-leukemic burden as the percentage of
HSCs isolated from a leukemia patient that harbor at least the first mutation. We profiled
the mutation frequency of known leukemogenic driver mutations in HSCs, T cells, and
blast cells from 39 AML patients. Pre-leukemic burden is highly variable in this cohort
with some patients exhibiting a complete repopulation of the HSC compartment with pre-
leukemic cells and others exhibiting undetectable levels of pre-leukemic mutations (Fig.
1b). The pre-leukemic mutations found in this large cohort recapitulate previous
findings5,6 showing that early mutations tend to occur in genes that modify the epigenome
while later mutations occur in genes involved in activated signal transduction.
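The burden definition above can be illustrated with a minimal sketch. The conversion from variant allele frequency (VAF) to percent of cells mutated is an assumption made here for illustration (a diploid locus without copy-number change, so a heterozygous mutation present in every cell gives a VAF of 0.5); the function names are hypothetical, and the 20% cutoff follows the figure legends of this chapter.

```python
def percent_cells_mutated(vaf, biallelic=False):
    """Estimate the percent of cells carrying a mutation from its VAF.

    Assumption (not from the text): diploid locus, no copy-number change.
    A heterozygous mutation contributes one of two alleles per mutated cell,
    so the mutated-cell fraction is 2 * VAF; a bi-allelic mutation
    contributes both alleles, so the fraction equals the VAF.
    """
    if not 0.0 <= vaf <= 1.0:
        raise ValueError("VAF must lie between 0 and 1")
    fraction = vaf if biallelic else 2.0 * vaf
    return 100.0 * min(fraction, 1.0)  # cap at 100% to absorb sampling noise


def classify_burden(percent_mutated, threshold=20.0):
    """Label a patient as high or low pre-leukemic burden (>= threshold is high)."""
    return "high" if percent_mutated >= threshold else "low"
```

For example, a heterozygous first mutation observed at a VAF of 0.25 in sorted HSCs would correspond to roughly half of the HSC compartment being pre-leukemic, which this sketch labels as high burden.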
AML represents a cooption of normal myelopoiesis
The AML leukemogenic process provides a novel system to study the genesis and
evolution of cancer at the level of the epigenome through the lens of normal
hematopoiesis. We performed Fast-ATAC and compared the chromatin accessibility
landscapes of patient-matched pHSCs, LSCs, and blasts. The optimized Fast-ATAC
protocol produced robust accessibility profiles from cryopreserved primary patient AML
cells (Fig. 1c). This allowed us to quantify the heterogeneity exhibited among the
different stages in leukemia evolution. We find that the level of epigenetic variance
between all samples of the same cell type increases through progressive stages of
leukemia evolution (Fig. 1d, see methods). As expected, all AML cell types exhibit more
inter-donor variance than normal hematopoietic cells. This may be the consequence of
the epigenetic mutations present in the leukemic cell types or a manifestation of the point
along the normal hematopoietic hierarchy at which the particular AML cell types exist.
Indeed, key developmentally-associated genes such as GATA2 and CEBPB show
variation amongst the AML cell types consistent with different developmental stages
(Fig. 1e). When overlaid across the principal components derived from normal
hematopoiesis, we find that the first four principal components from normal
hematopoietic differentiation account for 60% of the variation observed in our leukemia
samples (Fig. 1f). Assigning a score to the myeloid differentiation component of our data,
we find that the various stages of AML spread across the trajectory from HSC to
monocyte, indicating that the process of leukemogenesis largely mirrors the process of
normal myelopoiesis (Fig. 1g). Consistent with their functional ability to produce both
lymphoid and myeloid cells in xenotransplantation assays6,15,16, pHSCs are most closely
related to HSCs and MPPs (Fig. 1g). As shown previously17, LSCs show strong similarity
to GMP and LMPP cells and leukemic blast cells show a wider distribution with less
differentiated blasts clustering with GMP cells and more differentiated blasts clustering
with monocyte cells18,19 (Fig. 1g). These results indicate that the majority of inter-patient
variation in AML is derived from the developmental position along the normal myeloid
differentiation trajectory where each leukemia has arrested.
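The projection analysis described in this section can be sketched as follows. This is an illustrative outline rather than the pipeline used for Fig. 1f,g: it assumes normalized accessibility matrices with samples as rows and peaks as columns, learns principal components from the normal samples by singular value decomposition, and then asks how much of the variance among leukemia samples falls along those components. All function names are hypothetical, and centering the leukemia samples on their own mean is an assumption.

```python
import numpy as np


def fit_reference_pcs(normal, n_components):
    """Learn principal components from normal samples (rows = samples, columns = peaks)."""
    mean = normal.mean(axis=0)
    # Rows of vt are orthonormal principal axes, ordered by variance in `normal`.
    _, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(samples, mean, components):
    """Coordinates of new samples in the reference principal component space."""
    return (samples - mean) @ components.T


def cumulative_variance_explained(samples, components):
    """Fraction of the variance among `samples` captured by the first k reference PCs."""
    centered = samples - samples.mean(axis=0)
    per_pc = np.sum((centered @ components.T) ** 2, axis=0)
    return np.cumsum(per_pc) / np.sum(centered ** 2)
```

Under these assumptions, the fourth entry of `cumulative_variance_explained(aml, components)` plays the role of the 60% figure quoted above, and the first column of `project(aml, mean, components)` plays the role of the myeloid development score.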
AML cell types exhibit lineage infidelity with regulatory contributions from multiple
normal blood cell types
These intermediate positions across myelopoiesis suggest that each patient-
specific AML might harbor a unique collection of multiple distinct normal regulatory
programs. Using enhancer cytometry, we quantified the contribution of each normal cell
type for each leukemic sample assayed (Fig. 2a). We found that each patient, at each
stage of leukemogenesis, harbors multiple distinct regulatory networks contributing to the
epigenetic diversity of leukemic cell types. Importantly, we find that the majority of the
patient donors have AML blasts that are clonally derived and harbor all the leukemic
mutations at comparable allele frequencies. Together, these findings raise the intriguing
possibility that AML cell types may either i) exist in stable intermediate cell states that
are not normally maintained during normal hematopoiesis, or ii) show developmental
heterogeneity within individual clonally derived cells. Traditional ensemble genome-
wide approaches for measuring regulatory elements average over cellular states and
cannot distinguish between these two hypotheses; however, we recently developed
single-cell ATAC-seq (scATAC-seq)20 and reasoned that scATAC-seq with enhancer
cytometry would be able to resolve these two pressing hypotheses (Fig. 2b).
To discriminate between these two possibilities, we performed scATAC-seq on
purified LSCs and blast cells from patient SU070. Although CIBERSORT could
accurately deconvolve bulk populations, we found that individual regulatory elements
within single cells often contained only 0, 1 or 2 fragments, consistent with our previous
work20, and that these data were simply too sparse for existing deconvolution methods such as
CIBERSORT. Rather than relying on individual regulatory elements, we reasoned that
principal component analysis (PCA) of the regulome, learned from normal bulk
hematopoiesis, could be used to assign chromatin accessibility at all enhancers to
developmental lineages and enable enhancer cytometry in single cells (Fig. 2b). Indeed,
we found that with this approach, single cell accessibility profiles could be projected onto
hematopoietic principal components with high accuracy (Fig. 2c,d and Fig. S1b,c; see
methods). To better visualize and quantify heterogeneity within these cell subsets we
flattened these components onto a one-dimensional myelopoietic developmental
progression (Fig. 2e). Using these projections, we find that primary patient-derived LSCs
and blast cells are remarkably homogeneous and indeed exist at intermediate cell states.
This observation is corroborated by enhancer cytometry of a widely used clonal AML
cell line, HL60, which also shows mixed normal cell contributions using ensemble (Fig.
S1a) and single-cell (Fig. 2e) enhancer cytometry. To further test our ability to project
single cells onto hematopoietic components, we performed scATAC-seq on FACS-
purified MEP cells. Intriguingly, we find single MEPs show a predominant peak centered
at the MEP position with a prominent tail towards CMP along erythropoietic
differentiation (Fig. 2f and Fig. S1c). This observation is consistent with post-sort
analysis of MEPs suggesting a low level of contribution of CMP or CMP-to-MEP
transitional cell-states (Fig. S1d). Importantly, we also find that biological replicates of
scATAC-seq from the erythroleukemia cell line (K562) show highly reproducible
measures of erythroid differentiation (Fig. 2f). Together, these results corroborate a
lineage infidelity model wherein primary human AML cells and AML-derived cell lines
can simultaneously access two normally independent regulatory programs within the
same cell.
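One way to picture the flattening of principal component coordinates onto a one-dimensional developmental progression is as a projection onto the axis that joins two reference cell types. The sketch below is an assumption made for illustration (the exact procedure is described in the methods, which are not reproduced here); `position_along_axis` is a hypothetical name, and `start` and `end` stand for the centroids of two normal populations, such as HSC and monocyte, in principal component space.

```python
import numpy as np


def position_along_axis(points, start, end):
    """Scalar position of each point along the start-to-end axis.

    Returns 0 for a point at `start`, 1 for a point at `end`, and
    intermediate values for cells lying between the two reference states.
    Displacement perpendicular to the axis is ignored.
    """
    points = np.asarray(points, dtype=float)
    start = np.asarray(start, dtype=float)
    axis = np.asarray(end, dtype=float) - start
    return (points - start) @ axis / (axis @ axis)
```

In this picture, a homogeneous population at an intermediate state would give a single narrow peak between 0 and 1, whereas a mixture of two normal-like states would give two peaks near the ends of the axis.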
Generation of synthetic normal analogs for assessment of AML-specific biology
The ability to accurately quantify the contribution of each normal cell regulome to
the epigenetic profile of a leukemic cell type enables a more robust identification of
AML-specific regulatory elements. In particular, analyses of leukemic cell types in the
past have relied on comparing the malignant cells to a carefully chosen normal cell type.
Our data (Fig. 2a) show that this may not be sufficient, and that multiple distinct normal
regulatory patterns are contributing to the biology of AML cells. Due to these mixed
lineages, we suspect that past epigenomic and transcriptomic cancer studies may be
highly biased towards the rediscovery of normal and developmentally dynamic genes
rather than bona fide cancer-specific genes. We reasoned that effective removal of this
normal contribution is possible through the generation of "synthetic normal" analogs,
which represent admixtures of various normal cells defined by enhancer cytometry (see
methods). While comparison of AML cell types to their closest normal cell analogs yields
a high correlation (R = 0.86, Fig. 2g), comparison of AML cell types to their synthetic
normal analogs yields an even higher correlation (R = 0.91, Fig. 2h) and, more
importantly, leads to a reduction in the number of predicted AML-specific peaks (N =
10,954 to N = 8,003). Notably, we found that comparison of AML epigenomes to
synthetic normal analogs consistently resulted in higher Pearson correlation values (Fig.
S1e) and provided fewer cancer-specific peaks than comparison to the closest normal
analog (Fig. 2i and Fig. S1f).
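The synthetic normal comparison can be sketched as a weighted mixture followed by a fold-change filter. The code below is a minimal illustration under stated assumptions: accessibility profiles are normalized counts per peak, the mixing fractions come from a deconvolution step that is not shown, and the pseudocount and function names are hypothetical choices rather than the parameters used in this chapter.

```python
import numpy as np


def synthetic_normal(normal_profiles, fractions):
    """Mix normal cell-type profiles (rows = cell types, columns = peaks) by their fractions."""
    profiles = np.asarray(normal_profiles, dtype=float)
    weights = np.asarray(fractions, dtype=float)
    weights = weights / weights.sum()  # renormalize so the fractions sum to 1
    return weights @ profiles


def enriched_peaks(sample, reference, log2_cutoff, pseudocount=1.0):
    """Return log2 fold changes and the indices of peaks enriched in `sample`."""
    sample = np.asarray(sample, dtype=float)
    reference = np.asarray(reference, dtype=float)
    lfc = np.log2((sample + pseudocount) / (reference + pseudocount))
    return lfc, np.flatnonzero(lfc > log2_cutoff)


def pearson_r(a, b):
    """Pearson correlation between two accessibility profiles."""
    return float(np.corrcoef(a, b)[0, 1])
```

Running `enriched_peaks` once against the closest normal cell type and once against the output of `synthetic_normal` mirrors the comparison of Fig. 2g and Fig. 2h: peaks that are explained by a second normal lineage drop out of the enriched set in the second comparison.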
By examining co-association of AML-specific peaks, we identified 6 regulatory
modules that are utilized by AML cells (Fig. 3a and Fig. S2a). We can track the usage of
these modules through leukemogenesis and identify patterns related to specific AML cell
types (Fig. 3b). Additionally, each module shows enrichment for peaks associated with
different key transcription factors (Fig. 3c). For example, modules 1 and 2 show strong
enrichment for JUN and FOS activity, indicating the activation of AP-1-dependent stress
response pathways in these cells. This increase in accessibility of JUN/FOS motifs is
echoed by an increase in expression of these factors by RNA-seq (Fig. S2b) and is
maintained through the stages of leukemogenesis, identifying inhibition of these
pathways as a potential therapeutic strategy in AML. Indeed, JNK inhibition showed a
moderate but consistent selective targeting of AML blasts (Fig. S2c-e). This observation
is consistent with previous publications that identify JNK as a therapeutic target in
AML (ref. 21) and indicates that similar strategies may prove efficacious in targeting pre-
leukemic HSCs.
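The grouping of AML-specific peaks into regulatory modules relies on k-means clustering (Fig. 3a). A minimal sketch of the standard Lloyd iteration is given below for orientation. It assumes a matrix with one row per peak and one column per sample (for example, log2 fold changes against the synthetic normal); the initialization, iteration count and function name are illustrative assumptions, not the settings used for the figure.

```python
import numpy as np


def kmeans(data, k, n_iter=50, seed=0):
    """Cluster rows of `data` into k groups with Lloyd's algorithm."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct rows chosen at random.
    centers = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each row to its nearest center (squared Euclidean distance).
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its members; keep it if the cluster is empty.
        for j in range(k):
            members = data[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels, centers
```

With `k=6`, the returned labels would correspond to module membership, and the per-sample mean of each cluster gives a module usage profile of the kind tracked through leukemogenesis in Fig. 3b.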
Mechanism and clinical consequences of pre-leukemic HSC clonal advantage
Despite previous work on the acquisition of mutations during the pre-leukemic
phase of AML evolution5, it remains unclear whether pre-leukemic HSCs represent a
unique functional state or merely serve as long-lived reservoirs for mutation
accumulation. Moreover, functional epigenetic consequences of pre-leukemic mutations
in primary AML samples have not been characterized. Using ATAC-seq and enhancer
cytometry we show that pHSCs share many regulatory programs with HSCs and MPPs
(Fig. 2a). Nevertheless, comparison to synthetic normal analogs identifies a distinct
regulatory module (module 6) that shows decreased accessibility in pHSCs, representing
the earliest known event of AML evolution (Fig. 3b). This repressed regulatory module is
enriched for motifs associated with HSPCs (i.e. HOX and GATA) and provides direct
evidence to support a model where pHSCs maintain a unique epigenetic and functional
state.
In order to better understand the consequences of a loss in accessibility at motifs
associated with HSPCs, we probed pHSCs for phenotypic changes related to self-renewal
and differentiation. When pushed to differentiate down the myeloid and erythroid
lineages (Fig. S2f), pHSCs showed a strong resistance towards differentiation, instead
favoring maintenance of the stem cell state (Fig. 3d,e). Given the decreased accessibility
of module 6, this suggests that accessibility at certain stem cell-related motifs may confer
the ability to properly differentiate rather than properly self-renew. We have previously
assessed the effect of depletion of GATA1 and GATA2 on HSPC differentiation and self-
renewal (Mazumdar et al., 2015 in press), finding that knockdown of GATA2 led to a
decrease in self-renewal of HSPCs while knockdown of GATA1 had no effect. This
observation excludes these GATA factors from mediating the defects in differentiation
associated with repression of module 6. Given the well-studied role of HOX factors in
stem cells22, in particular the role of HOXA9 in HSCs, we hypothesized that HOXA9
might mediate the observed stemness phenotype. In fact, previous studies have shown an
increase in the number of HSCs in mice deficient for HOXA9 (ref. 23). From this, we reasoned
that loss of accessibility at HOXA9 target sites may confer an increase in stemness and
prevent proper differentiation, a hallmark of AML. Indeed, we found depletion of
HOXA9 by short hairpin RNA (shRNA) knockdown (Fig. S2g) in umbilical cord blood
CD34+ HSPCs led to a retention of stemness in the context of both myeloid (Fig. 3f) and
erythroid (Fig. 3g) differentiation. Moreover, a concomitant decrease in differentiated
granulocytes and erythroid cells was also observed (Fig. S2h,i), consistent with results
from mouse models of HOXA9 deficiency23,24. In addition, we note that this retention of
stemness is also observed in the absence of a differentiation stimulus (Fig. S2j). Together,
these results suggest that decreased HOX accessibility in pHSCs may promote retention
of stemness and prevent differentiation of these cells.
The retention of stemness in pHSCs caused by loss of accessibility at HOXA9
motifs helps to explain the observation that pHSCs outcompete their normal HSC
counterparts in vivo (Fig. S2k). Retention of stemness provides pHSCs with an
evolutionary advantage in that resisting differentiation maintains cells in an HSC-like
state, which increases the likelihood of acquiring additional leukemogenic mutations.
One implication of this model is that pre-leukemic burden may have adverse effects on
patient survival, despite the fact that pHSCs do not confer disease in xenograft transplant
assays4,6,16. Characterization of our patient cohort shows that pre-leukemic burden
inversely correlates with overall survival and relapse-free survival (Fig. 3h,i). High pre-
leukemic burden is associated with an approximately threefold higher hazard of death or
leukemia relapse (hazard ratio = 3.30 for overall survival and 2.99 for relapse-free
survival, p < 0.05). These results further implicate pHSCs in AML pathology and suggest
a mechanism wherein AML arises from the presence of a pre-leukemic clone that is
capable of outcompeting its normal HSC counterparts (Fig. S2k) and predisposes patients
to more aggressive or refractory leukemia. In sum, detailed analysis of AML-specific
regulomes enables the identification of novel features of pHSC biology that have
important prognostic implications.
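The survival curves in Fig. 3h,i use the Kaplan-Meier estimate. As a reminder of what that estimate computes, a minimal pure-Python sketch follows; the function name and input layout are hypothetical, and the log-rank test and hazard ratio calculation are not reproduced here.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times:  follow-up duration for each patient
    events: 1 if death (or relapse) was observed at that time, 0 if censored
    Returns a list of (time, survival probability) pairs, one per event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0
        leaving = 0
        # Collect every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored patients both leave the risk set
    return curve
```

Applying the estimator separately to the high-burden and low-burden groups yields the two curves that are then compared; censored patients lower the number at risk without producing a step in the curve.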
Discussion
The study of acute myeloid leukemia sheds light on the biology and step-wise
progression of leukemia evolution. We measured regulomes in patient-matched pre-
leukemic HSC, LSC, and blast cells representing three distinct time points in AML
evolution. Examination of the average epigenetic variance across the genome shows that
variance increases through the stages of leukemia evolution with the majority of this
variance being explained by differences observed during normal hematopoietic
differentiation. The epigenetic landscapes of AML blast cells isolated from various
patients are extremely divergent, highlighting the need for personalized approaches to
adequately target each patient's unique cancer cells.
A longstanding debate in cancer biology is how cancer cells violate cell lineage
rules. Cancer cells with markers or morphologies of one cell type have been shown to
also express markers of a different cell type25, which raises diagnostic challenges and
treatment conundrums. Two classic but competing models posited (i) "lineage infidelity",
in which a single cancer cell simultaneously accesses two normally distinct regulatory programs;
or alternatively (ii) "lineage promiscuity", in which a normally bipotential progenitor cell exists,
and the cancer cell is simply an expansion of this rare but physiologic bipotential state.
By using our comprehensive map of hematopoiesis, patient-matched AML cell subsets,
and single-cell ATAC-seq of hundreds of individual leukemic cells, we show direct
evidence of lineage infidelity: a single cell accessing a mixed regulatory program. This
result has potentially important diagnostic and mechanistic implications, and we build
upon both classical models to address this challenge. Comparison of cancer to matched
normal cells is one of the most basic and commonplace experiments in cancer biology,
but lineage infidelity demonstrates that there may be no appropriate "normal" for
comparison in epigenomic and transcriptomic studies. Instead, we use enhancer
cytometry to construct "synthetic normals" (proportionally matching the fractional
contribution of cell type-specific regulomes) in order to pinpoint cancer-specific
aberrations.
This approach streamlined the discovery of candidate drivers and led us to
discover the loss of HOXA9-mediated accessibility as the most consistent defect in pre-
leukemic HSCs. We found that HOXA9 loss can, in fact, cause defects in differentiation
as observed in these pre-leukemic HSCs and confer an evolutionary advantage.
Importantly, higher pre-leukemic burden is predictive of poor overall and relapse-free
survival in AML, indicating an important role for pre-leukemic HSCs in disease
pathogenesis. These results provide potential avenues for therapeutic intervention during
the earliest stages of leukemogenesis. Moreover, we anticipate that lineage infidelity is a
widespread phenomenon in many types of cancer, and that our integrative approach using
enhancer cytometry to construct synthetic normal analogs should be broadly applicable to
many disease pathologies.
Figure 1. Acute myeloid leukemia regulomes reveal a cooption of normal
myelopoiesis. (a) Schematic of the leukemogenic process. HSCs serve as a reservoir of
mutation acquisition. Early mutations in epigenetic modifiers such as DNMT3A, TET2,
and IDH1/2 generate pre-leukemic HSCs. Downstream acquisition of mutations in genes involved in
activated signal transduction such as FLT3 and RAS leads to generation of leukemia stem
cells which both self-renew and produce leukemic blast cells. (b) Genotype and mutation
frequencies of HSCs isolated from AML patients (N=39). Color indicates the percent of
cells mutated as estimated from the variant allele frequency. Gray color indicates a
mutation known to be present in leukemic cells but not observed during the pre-leukemic
phase of AML evolution (i.e. a late mutation event). Asterisks indicate the predicted first
mutation. If a mutation is bi-allelic, the representative bar is divided in half. Patients with
more than 20% of HSCs harboring a pre-leukemic mutation were classified as "high
burden" and those patients with less than 20% of HSCs harboring a pre-leukemic
mutation were classified as "low burden". (c) Normalized sequencing track of control
loci on chromosome 19 from FACS-purified AML cell types. Profiles represent the union
of all biological replicates for each cell type. (d) Mean variance of chromatin accessibility
across the genome as calculated by a moving average across each leukemic cell stage (see
methods). (e) Normalized sequencing tracks of developmentally-associated genes
GATA2 (left) and CEBPB (right). Profiles represent the union of all biological replicates
for each cell type: pHSC (N=12), LSC (N=8), Blasts (N=12). (f) Cumulative variance of
AML ATAC-seq data explained by the first N principal components derived from normal
hematopoiesis. (g) Myeloid development score in normal blood cell types (N=4
biological replicates) and AML cell types. The myeloid score is calculated from the first
principal component which encompasses the majority of variation observed in
myelopoiesis.
Figure 2. Enhancer cytometry and single-cell regulomes support a model of lineage
infidelity and allow for deconvolution of AML-specific biology. (a) Enhancer
cytometry deconvolution showing contribution of various normal cell types to the
epigenetic landscape of different AML cell types. (b) Schematic of single-cell ATAC-seq
protocol and analysis. (c,d) Projection of ATAC-seq data derived from (c) single SU070
LSCs and (d) single SU070 blast cells onto the principal components derived from the
normal hematopoietic hierarchy. (e,f) Relative density of (e) single SU070 LSCs, SU070
blasts, and HL60 and (f) single MEP and K562 cells projected onto a one-dimensional
representation of the myeloid and erythroid progression, respectively. Two biological
replicates of K562 cells are marked as K562-1 and K562-2. (g) Scatter plot showing
the correlation of ATAC-seq data derived from SU353 blast cells with the closest normal
cell type (GMP) (R=0.86). Using a log2(fold change) cutoff of 4 we identify 8,209 peaks
depleted and 10,954 peaks enriched in SU353 blast cells. (h) Scatter plot, as shown in (g),
showing the correlation between SU353 blast cells with the enhancer cytometry-defined
synthetic normal analog (R=0.91). Using a log2(fold change) cutoff of 4 identifies 5,887
peaks enriched in the synthetic normal analog and 8,003 peaks enriched in SU353 blast
cells. (i) Comparison of AML cell types to synthetic normal analogs. The closest normal
is shown in color. The percent of the total significant peaks that are removed by
comparison to synthetic normal analogs is plotted for each sample.
Figure 3. Early chromatin accessibility alterations within pHSCs promote stemness
which predicts adverse patient outcomes. (a) K-means clustering of cancer-specific
peaks identifies 6 distinct regulatory modules. (b) Enrichment of each module, identified
in Figure 3a, identifies activated and repressed patterns in leukemogenic progression.
Gray bars shown represent 1 S.D. across all samples of that given cell type. (c)
Enrichment and hierarchical clustering of motifs in AML-specific regulatory modules.
(d,e) Retention of stemness as measured by flow cytometric analysis of CD34 protein
expression after 6 days of enforced differentiation down the (d) myeloid lineage and (e)
erythroid lineage. Error bars represent 1 S.D. Experiments done in triplicate. (f,g) Fold
change in the percent of cells expressing CD34 as measured by flow cytometric analysis
of human umbilical cord blood-derived HSCs transduced with shRNAs targeting HOXA9
or a non-targeting control. Percent CD34+ cells measured after 6 days of enforced
differentiation down the (f) myeloid lineage and (g) erythroid lineage. Only GFP+
transduced cells analyzed. Error bars represent 1 S.D. Experiments done in triplicate. (h)
Overall and (i) relapse-free survival of patients stratified by pre-leukemic burden as
described in Figure 1b (High pre-leukemic burden, N=24; Low pre-leukemic burden,
N=15). High pre-leukemic burden defined as greater than or equal to 20% of HSCs
harboring at least the first pre-leukemic mutation. Survival analysis was performed using
the Kaplan-Meier estimate method. All patients were included for the analysis regardless
of their treatment. P values comparing two Kaplan-Meier survival curves were calculated
using the log-rank (Mantel-Cox) test. Hazard ratios were determined using the Mantel-
Haenszel approach. **p<0.01, ***p<0.001, ****p<0.0001 derived from two-tailed t-test.
Supplementary Figure 1. Validation of enhancer cytometry in AML cell lines and
primary cells by single-cell ATAC-seq. (a) Enhancer cytometry of ATAC-seq data
derived from various blood cell lines demonstrates mixed regulatory contribution from
various normal hematopoietic cell types. (b) Projection of downsampled bulk
hematopoiesis data onto myeloid (left) and erythroid (right) progression. (c) Projection of
single MEPs onto hematopoiesis principal components 2 and 3. (d) Post-sort analysis of
MEPs used in scATAC-seq analyses presented in Figure 2f and Supplementary Figure 1c
gated for CMP (2.54%), MEP (97.5%) and GMP (0%). (e) Pearson correlations of AML
cell types with the closest normal analog (color) and the enhancer cytometry-derived
synthetic normal (gray). (f) Total significant peaks observed after comparison of AML
cell types to synthetic normal analogs. Significance measured as log2(fold change) > 3.
Supplementary Figure 2. Validation of regulatory network analysis in AML cell
types. (a) Principal component analysis of the log2(fold change) values of each AML cell
type compared to its synthetic normal. (b) Expression of JUN in various normal
hematopoietic cells, pHSCs, and blasts. *p<0.05, two-tailed t-test. (c-e) The effect of
JNK/ERK inhibition by (c) JNK-IN-8, (d) SP600125, and (e) SCH772984 was
determined by IC50 of sorted primary AML blast cells in comparison to CD34+ HSPCs
derived from umbilical cord blood. Viability determined by flow cytometric assessment
of Annexin V and DAPI. (f) Strategy for in vitro differentiation of HSPCs down the
myeloid and erythroid lineages. HSPCs are grown in defined culture media for 6 days
and then analyzed for cell surface markers of stemness or differentiation. Immature cells
at day 6 express CD34 and have not yet upregulated CD33. (g) Quantitative reverse-
transcriptase PCR validation of HOXA9 knockdown via shRNA. Knockdown performed
in THP1 cells for 72 hours and validated with two separate primer sets. (h,i) Fold change
in the percent of (h) CD15+ granulocytes or (i) CD71+GPA+ erythroblasts between cord
blood-derived CD34+ HSPCs infected with lentiviruses with non-targeting or HOXA9
shRNAs after 6 days of differentiation down the (h) myeloid or (i) erythroid lineage.
***p<0.001, ****p<0.0001 by two-tailed t-test. (j) Fold change in the percent of CD34+
HSPCs after 6 days of culture in stemness retention media (see methods) between cord
blood-derived CD34+ HSPCs infected with lentiviruses with non-targeting or HOXA9
shRNAs. (k) Burden of mutations in DNMT3A, TET2, IDH1/2, or other genes when
detected in pre-leukemic HSCs. *p < 0.05, **p < 0.01 by two-tailed t-test.
References
1. Shih, A. H., Abdel-Wahab, O., Patel, J. P. & Levine, R. L. The role of mutations in
epigenetic regulators in myeloid malignancies. Nature Reviews Cancer 263, 2235
(2015).
2. Araten, D. J. et al. A quantitative measurement of the human somatic mutation
rate. Cancer research 65, 81117 (2005).
3. Sun, J. et al. Clonal dynamics of native haematopoiesis. Nature (2014).
4. Jan, M. et al. Clonal evolution of preleukemic hematopoietic stem cells precedes
human acute myeloid leukemia. Science translational medicine 4, 110 (2012).
5. Corces-Zimmerman, M. R. & Majeti, R. Pre-leukemic evolution of hematopoietic
stem cells: the importance of early mutations in leukemogenesis. Leukemia 28,
22762282 (2014).
6. Shlush, L. I. et al. Identification of pre-leukaemic haematopoietic stem cells in
acute leukaemia. Nature 506, 328333 (2014).
7. Lindberg, J. et al. Clonal Hematopoiesis and Blood-Cancer Risk Inferred from
Blood DNA Sequence. N Engl J Med 371, 24772487 (2014).
8. Jaiswal, S. et al. Age-Related Clonal Hematopoiesis Associated with Adverse
Outcomes. N Engl J Med 371, 24882498 (2014).
9. Okano, M., Xie, S. & Li, E. Cloning and characterization of a family of novel
mammalian DNA ( cytosine-5 ) methyltransferases Non-invasive sexing of
preimplantation stage mammalian embryos. Nature Genetics 19, 219220 (1998).
10. Tahiliani, M. et al. Conversion of 5-methylcytosine to 5-hydroxymethylcytosine in
mammalian DNA by MLL partner TET1. Science 324, 9305 (2009).
11. Dang, L. et al. Cancer-associated IDH1 mutations produce 2-hydroxyglutarate.
Nature 462, 73944 (2009).
12. Figueroa, M. E. et al. Leukemic IDH1 and IDH2 Mutations Result in a
Hypermethylation Phenotype, Disrupt TET2 Function, and Impair Hematopoietic
Differentiation. Cancer Cell 18, 553567 (2010).
13. Greaves, M. F., Chan, L. C., Furley, A. J. W., Watt, S. M. & Molgaard, H. V.
Lineage Promiscuity in Hemopoietic Differentiation and Leukemia. Blood 67, 1
11 (1986).
14. Dohner, H., Weisdorf, D. J. & Bloomfield, C. D. Acute Myeloid Leukemia. N
Engl J Med 373, 113652 (2015).
15. Jan, M. & Majeti, R. Clonal evolution of acute leukemia genomes. Oncogene 16
(2012).
16. Corces-Zimmerman, M. R., Hong, W.-J., Weissman, I. L., Medeiros, B. C. &
Majeti, R. Preleukemic mutations in human acute myeloid leukemia affect
epigenetic regulators and persist in remission. Proceedings of the National
Academy of Sciences of the United States of America 111, 2548-53 (2014).
17. Goardon, N. et al. Coexistence of LMPP-like and GMP-like Leukemia Stem Cells
in Acute Myeloid Leukemia. Cancer Cell 19, 138-152 (2011).
18. Bennett, J. M. et al. Proposals for the classification of the acute leukaemias.
French-American-British (FAB) co-operative group. British Journal of
Haematology 33, 451-8 (1976).
19. van't Veer, M. B. The diagnosis of acute leukemia with undifferentiated or
minimally differentiated blasts. Annals of Hematology 64, 161-5 (1992).
21. Volk, A. et al. Co-inhibition of NF-κB and JNK is synergistic in TNF-expressing
human AML. Journal of Experimental Medicine 211, 1093-1108 (2014).
22. Abramovich, C. & Humphries, R. K. Hox regulation of normal and leukemic
hematopoietic stem cells. Current opinion in hematology 12, 210-216 (2005).
23. Magnusson, M., Brun, A. C. M., Lawrence, H. J. & Karlsson, S.
Hoxa9/hoxb3/hoxb4 compound null mice display severe hematopoietic defects.
Experimental Hematology 35, 1421.e1-1421.e9 (2007).
24. Lawrence, H. J. et al. Mice bearing a targeted interruption of the homeobox gene
HOXA9 have defects in myeloid, erythroid, and lymphoid hematopoiesis. Blood
89, 1922-1930 (1997).
25. Levine, J. H. et al. Data-Driven Phenotypic Dissection of AML Reveals
Progenitor-like Cells that Correlate with Prognosis. Cell 162, 184-197 (2015).
CHAPTER SEVEN - Conclusion
Methods for gene regulation
The work presented in this thesis leverages high-throughput methodologies in an effort
to provide a quantitative understanding of cellular regulation. Using an in vitro
approach, we make >10^7 quantitative measurements of the biophysical parameters defining
an RNA-protein interaction across sequence mutants. This platform may be extended to
profile a diversity of RNA-protein interactions and may form the basis of methods that
quantify protein-protein or chromatin-TF interactions. Such efforts promise to provide
a principled biochemical understanding of the sequence and structure determinants of
trans-factor binding.
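
As an illustration of what one such biophysical parameter looks like in practice, the sketch below fits a dissociation constant (Kd) to a binding curve for a single RNA variant and converts it to a binding free energy. It is a minimal example under stated assumptions, not the analysis pipeline used in this thesis: it assumes simple single-site binding (signal = fmax * c / (c + Kd)), and the function names, concentration series, and signal values are hypothetical.

```python
import math

def bound_fraction(conc, kd):
    """Fraction of RNA bound at protein concentration `conc` (same units as `kd`)."""
    return conc / (conc + kd)

def fit_kd(concs, signals, kd_min=1e-2, kd_max=1e5, steps=2000):
    """Least-squares fit of signal = fmax * c / (c + Kd).

    Kd is searched on a log-spaced grid; for each candidate Kd the best
    fmax has a closed-form solution. Returns (kd, fmax).
    """
    best = None
    for i in range(steps + 1):
        kd = kd_min * (kd_max / kd_min) ** (i / steps)
        x = [bound_fraction(c, kd) for c in concs]
        fmax = sum(xi * yi for xi, yi in zip(x, signals)) / sum(xi * xi for xi in x)
        sse = sum((yi - fmax * xi) ** 2 for xi, yi in zip(x, signals))
        if best is None or sse < best[0]:
            best = (sse, kd, fmax)
    return best[1], best[2]

def delta_g_kcal(kd_molar, temp_kelvin=298.15):
    """Binding free energy dG = RT ln(Kd), with Kd relative to a 1 M standard state."""
    gas_constant = 1.987e-3  # kcal / (mol * K)
    return gas_constant * temp_kelvin * math.log(kd_molar)

# Hypothetical, noise-free binding curve for one RNA variant (concentrations in nM).
concs_nM = [0.5, 1.5, 5, 15, 50, 150, 500, 1500]
signals = [1000.0 * bound_fraction(c, 40.0) for c in concs_nM]

kd_nM, fmax = fit_kd(concs_nM, signals)
print(f"Kd ~ {kd_nM:.1f} nM, fmax ~ {fmax:.0f}, dG ~ {delta_g_kcal(kd_nM * 1e-9):.2f} kcal/mol")
```

Repeating such a fit independently for every sequence variant is what turns a binding assay into a map from sequence to affinity; with real data, a noise model and confidence intervals on Kd would also be needed.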
Measuring these regulatory processes in vivo provides unique insight into the
characteristics and potential of cellular behavior. Before this work, methods for
measuring chromatin structure genome-wide often required tens of millions of cells and
complex experimental workflows. We have developed ATAC-seq and scATAC-seq for
profiling chromatin accessibility within rare cellular populations and/or from
single cells. Together, these methods enable genome-wide chromatin accessibility
measurements of carefully isolated or de novo defined cellular populations, and the
inference of the trans-acting regulatory proteins that define them. In addition, these
methods can measure chromatin accessibility in human tissues derived in vivo, as
demonstrated by our efforts to understand human hematopoiesis and leukemogenesis.
Future work
Regulatory rules describing promoter-enhancer interactions and their effect on the
expression of nearby genes would greatly enhance our ability to causally link the
epigenome to gene expression and, subsequently, disease mutations to phenotypes. Such a
lofty endeavor will require new experimental and computational methodologies.
Specifically, TF-TF, TF-remodeler, and other TF-protein interactions are critical for
understanding TF binding landscapes and gene expression in vivo. Further development of
in vitro methods or high-throughput in vivo reporter assays may help to elucidate these
mechanisms.
Furthermore, combining genome-wide assays within single cells provides a
unique opportunity to develop regulatory models, wherein natural variation among single
cells can be used to infer causal changes in expression at nearby genes. Integrating
ATAC-seq, RNA-seq, and protein measurements in the same single cell at high
throughput may serve to quantify trans-acting regulators, their binding to cis-regulatory
elements, and the resulting effect on the expression of nearby genes.
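
A first step toward such models could be as simple as asking whether accessibility at a regulatory element co-varies with expression of a nearby gene across cells. The sketch below computes a Pearson correlation for one element-gene pair. It is illustrative only: the per-cell values are hypothetical, the function name is not taken from any existing package, and a correlation of this kind suggests a candidate link but does not by itself establish causality.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical measurements from the same eight single cells:
# fragment counts at one regulatory element and normalized expression
# of one nearby gene.
accessibility = [0, 1, 0, 2, 1, 3, 0, 2]
expression = [1.0, 2.5, 0.5, 4.0, 2.0, 6.5, 1.5, 4.5]

r = pearson(accessibility, expression)
print(f"element-gene correlation across cells: r = {r:.2f}")
```

In practice, single-cell counts are sparse and confounded by sequencing depth and cell state, so a realistic model would need normalization and appropriate null distributions before any element-gene link is called.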
Together, these approaches provide deep insight into individual regulatory
patterns; however, only a combined, systems-level approach to these data would yield a
complete understanding of cellular regulation within single cells. To this end,
computational models that integrate these data sets and infer causality, for example for
the expression of a gene or a cellular response to a stimulus, are required. In summary, a
multidisciplinary and collaborative approach promises to enrich our understanding of
cellular regulation and form the basis of our understanding of human disease.
