METHODS FOR QUANTITATIVE DISSECTION

OF GENE REGULATION

A DISSERTATION

SUBMITTED TO THE GENETICS DEPARTMENT

AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Jason Buenrostro

December 2015
© 2016 by Jason Daniel Buenrostro. All Rights Reserved.
Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/mn616fx6627

I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.

William Greenleaf, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Howard Chang, Co-Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Gerald Crabtree

I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Jin Li

I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Michael Snyder, PhD

Approved for the Stanford University Committee on Graduate Studies.


Patricia J. Gumport, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in
electronic format. An original signed hard copy of the signature page is on file in
University Archives.

Acknowledgements

I have had an incredible experience at Stanford University. Throughout my time

here I have had the immense privilege of working with many talented and amazing

people. That list begins with Will Greenleaf, my primary mentor and friend for the last 5

years in graduate school. Will has consistently provided thoughtful advice in science and

in life, and an incredible amount of patience throughout my graduate experience.

Throughout this time I have also had the privilege of mentorship from Howard Chang. I

thank Howard for his creative and thoughtful ideas, but more importantly his unwavering

support of all my scientific endeavors.

I thank Stanford and the Stanford Genetics department for the opportunity to do

my research here and for providing such a wonderful environment to start my academic

career. I also thank my faculty committee Mike Snyder, Lars Steinmetz, Jerry Crabtree,

Billy Li and Ravi Majeti for their service, insights and support. In addition, I thank Mike

Snyder for the opportunity to collaborate and for the opportunity to work with Beijing

Wu, experiences that have significantly enriched my graduate experience.

I thank past mentors Hanlee Ji and Georges Natsoulis for their mentorship and an amazing

start to my scientific life. Specifically, I thank Hanlee Ji for my start in science, providing

me an amazing opportunity to join the Stanford community as a research assistant in

2009. I also thank Samuel Myllykangas for his mentorship and friendship throughout

this time.

In addition to the long list of mentors, I thank many close collaborators for

making this work possible. I thank Lauren Chircus and Carlos Araya for their

determination, hard work and creativity, an essential component for the development of

the RNA array. I thank Paul Giresi for teaching me everything I know about chromatin.

I thank Ulli Litzenberger, Dave Ruff and the Fluidigm team for their wonderful insight and

dedication to the development of single-cell ATAC-seq. I also thank new friends, Ryan

Corces and Ansu Satpathy, who have brought new and fresh perspectives to my scientific

thinking. I also deeply thank Beijing Wu for her unwavering dedication to

our work, personal support, and general thoughtfulness; without her, much of the work

presented here would not have been possible.

Importantly, I'd like to thank my family, who have provided life-long support.

Specifically, my parents Miguel and Martha Buenrostro, who have made tremendous

sacrifices throughout our lives to provide me with the opportunity and preparation to

pursue my dreams. I also thank my brother, sisters, and niece, Michael, Michelle, Erika

and Sam, for their love, support and patience. I also thank my roommate and partner Sara

Prescott, who has been there for me at my best and my worst, and continues to be my

closest ally in science and in life. Lastly, I thank all of my friends, family, past mentors

and other collaborators, whom I regret not having enough space to mention here.

TABLE OF CONTENTS

CHAPTER ONE - Introduction
Cellular regulation in cis and trans
Genome-wide methods
A quantitative and high-throughput approach to binding
Measuring chromatin accessibility in rare cells

CHAPTER TWO - Quantitative dissection of millions of sequence variants
Introduction
A high-throughput RNA array platform for quantitative binding measurements
The RNA-array enables quantitative measurement of both binding and dissociation
Binding affinity can be partitioned between primary and secondary structure
Changes in association rate substantially contribute to changes in binding energies
Discussion
Chapter 2 - Figures and Figure Legends
References

CHAPTER THREE - Measuring accessibility in rare cellular populations
Introduction
ATAC-seq measures chromatin accessibility using Tn5 transposase
Insert size yields information regarding nucleosome packing and positioning
ATAC-seq reveals distinct classes of factor-nucleosome spacing
Footprints can be used to infer factor occupancy genome-wide
Discussion
Chapter 3 - Figures and Figure Legends
References

CHAPTER FOUR - Single-cell accessibility reveals principles of regulatory variation
Introduction
Single-cell ATAC-seq a measure of chromatin accessibility genome-wide
Cell-cell variability in trans
Trans-factors synergize to induce cell-cell variability
Cell-state and chemical perturbation effects on cell-cell variability
Single-cells vary in cis
Discussion
Chapter 4 - Figures and Figure Legends
References

CHAPTER FIVE - The epigenomic determinants of human hematopoiesis
Introduction
Identification of chromatin accessibility landscape in primary blood cells
Chromatin accessibility at distal elements delineates the hematopoietic hierarchy
Enhancer cytometry prospectively deconvolves complex cell populations
Regulatory networks of normal hematopoiesis
Accessibility profiles of purified cell populations identify the ontogeny of human diseases
Discussion
Chapter 5 - Figures and Figure Legends
References

CHAPTER SIX - The regulatory landscape of acute myeloid leukemia
Introduction
Leukemogenesis and cancer evolution in AML
AML represents a cooption of normal myelopoiesis
AML cell types exhibit lineage infidelity with regulatory contributions from multiple normal blood cell types
Generation of synthetic normal analogs for assessment of AML-specific biology
Mechanism and clinical consequences of pre-leukemic HSC clonal advantage
Discussion
Chapter 6 - Figures and Figure Legends
References

CHAPTER SEVEN - Conclusion
Methods for gene regulation
Future work

CHAPTER ONE - Introduction

Cellular regulation in cis and trans

The human body is comprised of a large collection of highly diverse cell types,

each providing a specialized and context-specific function. The establishment and

maintenance of a cell's identity is largely determined by defined regulatory programs

affecting diverse cellular processes such as chromatin accessibility, RNA localization and

degradation, or protein modifications. The expression of transcription factors (TFs) and

chromatin remodelers drives chromatin accessibility, which spans a continuum from

nucleosome-free to nucleosome-associated to higher-order chromatin compaction.

Highly compacted chromatin is sequestered from regulatory machinery, whereas

nucleosome-free chromatin demarcates regions of active regulation in cells. Distal

nucleosome-free regulatory elements can have highly divergent interactions with gene

promoters in cis, acting as: i) activators, or enhancers, ii) repressors or iii) insulators,

which together determine the expression of nearby genes.

Analogous principles hold true for RNA regulation, wherein RNA structure

defines the binding landscape of microRNAs (miRNAs) and RNA-binding proteins

(RBPs), which have diverse effects on post-transcriptional processes. Here, eukaryotic

RNAs can fold into simple 2D or complex 3D folded structures, which define permissive

or occluded binding substrates for trans-acting regulators. A quantitative and genome-

wide understanding of these dynamic cellular structures would provide unique insight

into the binding determinants of trans-acting regulators, drivers of cellular function and

cellular potential.

Genome-wide methods

The advent of high-throughput sequencing1 methodologies has enabled unbiased

and genome-wide characterization of these diverse cellular processes. For example, high-

throughput assays measuring chromatin-bound proteins (ChIP-seq)2 or RNA-bound

proteins (RIP-seq and CLIP-seq)3 have been shown to be sensitive methods for

identifying the binding locations of trans-acting proteins. In addition, assays for

measuring chromatin accessibility (DNase-seq)4,5 or RNA structure (PARS)6 enable a

genome-wide analysis of the structural determinants of this binding landscape. However,

as described in the following sections, these methods are limited in several ways. In the

following thesis I will discuss the development of new methods, which focus on the

structural determinants of trans-factor binding to chromatin or RNA.

A quantitative and high-throughput approach to binding

Carefully controlled in vitro assays can be used to determine the biophysical

parameters defining a binding interaction. However, current methods are low throughput

or not quantitative. In this thesis, I will describe the development of a high-throughput

and generalizable platform for performing biochemical assays of RNA called RNA-

MaP7. Here, we repurpose a high-throughput sequencing instrument to serve as a massively

parallel biochemistry platform. We use this platform to measure the kinetic parameters of

an RNA-binding protein binding to >10^7 mutants of an RNA stem loop.

Measuring chromatin accessibility in rare cells

Current methods to profile chromatin accessibility require millions of cells,

limiting their application to either cell lines or whole tissues. Applying these methods to

complex cellular populations, derived from tissues, averages over the rich diversity of

chromatin structures within different cell states, leading to an incorrect understanding of

regulatory processes within these tissues. In response, we have developed genome-wide

methods for measuring chromatin accessibility (ATAC-seq) (Fig. 1a) within rare cellular

populations8 or within single cells9. With such methods, defined cellular populations

within complex tissues can be isolated using flow cytometry and profiled using ATAC-

seq. However, this approach is also limited in that it requires established protocols for

cell-type isolation. In contrast, single-cell ATAC-seq (scATAC-seq) may be used to

partition cells into relevant subtypes de novo. Together, these assays offer an

unprecedented view of chromatin structure in vivo.

Of particular importance to human health and disease, and an excellent model for

understanding dynamic cellular behavior, is the hematopoietic hierarchy (Fig. 1b),

wherein a single hematopoietic stem cell (HSC) can give rise to a multitude of distinct

cellular populations ranging from enucleated red blood cells (RBCs) to specialized

immune cells (CD4 and CD8 T cells, B cells and more). Importantly, dysregulation of

these intricate regulatory networks leads to a multitude of hematologic malignancies11. In

this work we also apply ATAC-seq and scATAC-seq to normal human hematopoiesis

and acute myeloid leukemia (AML) in an effort to elucidate the governing biochemical

principles defining normal human development and disease.

References

1. Bentley, D. R. et al. Accurate whole human genome sequencing using reversible
terminator chemistry. Nature 456, 53-59 (2008).
2. Johnson, D. S., Mortazavi, A., Myers, R. M. & Wold, B. Genome-wide mapping of
in vivo protein-DNA interactions. Science 316, 1497-1502 (2007).
3. Zhao, J. et al. Genome-wide identification of Polycomb-associated RNAs by RIP-
seq. Mol. Cell 40, 939-953 (2010).
4. Thurman, R. E. et al. The accessible chromatin landscape of the human genome.
Nature 489, 75-82 (2012).
5. Boyle, A. P. et al. High-resolution mapping and characterization of open
chromatin across the genome. Cell 132, 311-322 (2008).
6. Wan, Y., Kertesz, M., Spitale, R. C., Segal, E. & Chang, H. Y. Understanding the
transcriptome through RNA structure. Nat. Rev. Genet. 12, 641-655 (2011).
7. Buenrostro, J. D. et al. Quantitative analysis of RNA-protein interactions on a
massively parallel array reveals biophysical and evolutionary landscapes. Nat.
Biotechnol. 32, 562-568 (2014).
8. Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J.
Transposition of native chromatin for fast and sensitive epigenomic profiling of
open chromatin, DNA-binding proteins and nucleosome position. Nat. Methods 10,
1213-1218 (2013).
9. Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of
regulatory variation. Nature 523, 486-490 (2015).

CHAPTER TWO - Quantitative dissection of millions of sequence variants1

Introduction

RNA-protein interactions drive a wide variety of critical biological processes

from gene expression1 to viral assembly2. Up to 10% of the eukaryotic proteome is

estimated to bind RNA3, and recent work has begun to uncover a web of RNA-protein

interactions4-6 that can control gene expression through splicing, RNA localization, and

other post-transcriptional processes. Protein interactions with long noncoding RNAs also

play a role in epigenetic state changes during differentiation7, perhaps through

scaffolding chromatin remodelers8,9. Furthermore, RNA-protein interactions have

proven powerful tools in synthetic biology, allowing gene expression control through

post-transcriptional regulation10,11.

A biophysical understanding of the nucleic-acid sequence determinants of RNA-

protein interactions lags behind our growing realization of their biological importance.

Unlike double-stranded DNA, RNA substrates demonstrate diverse intramolecular

interactions (including mismatched base bulges, stem loops, pseudoknots, G-quartets,

divalent cation interactions, and non-canonical base pairs) that determine three-

dimensional RNA structure12-14 and set the landscape for interactions with RNA-binding

proteins (RBPs)15. The combinatorial nature of RNA sequence and intramolecular

interactions, coupled with the relative paucity of data produced from current biophysical

methods, has precluded a high-resolution, predictive understanding of both the sequence

dependence of affinity and the resulting evolutionary constraints imposed by these

requirements. Because the relationship between sequence and binding is often opaque,

1 Portions of this chapter were taken from Buenrostro et al. Quantitative analysis of RNA-protein
interactions on a massively parallel array reveals biophysical and evolutionary landscapes. Nature
Biotechnology. 2014. doi:10.1038/nbt.2880.

little is understood regarding the evolutionary constraints on these RNA structures,

making bioinformatic identification of functional RNAs difficult16.

Current methods for investigating the sequence dependence of RNA-protein

interactions include medium-throughput microfluidic methods17 and high-throughput

methods coupling affinity-based selection with high-throughput DNA sequencing or

array hybridization18, which have recently been used to generate a catalogue of RNA binding

motifs19. While powerful, selection and sequencing methods bias results towards high-

activity variants and do not directly and quantitatively measure the biophysical

parameters that underlie biological function20. Recently, methods have been developed to

quantitatively measure catalysis21,22; however, no such high-throughput method exists for

determining the binding parameters k_on, k_off, and K_d for RNA-protein interactions.

The technological innovations that have propelled the high-throughput sequencing

revolution provide the foundations for massively parallel, fluorescence-based

observations over a large variety of nucleic acid structures immobilized on a surface23-26.

Recent work characterizing DNA-protein interactions26 has demonstrated the utility of

these instruments for high-throughput binding affinity assays across large DNA sequence

space. In this work, we have leveraged the Illumina DNA sequencing platform, an

instrument that integrates solid-phase molecular biology, fluidics, and high-throughput

TIRF imaging for massively parallel DNA sequencing27, to create a platform for direct,

ultra-high throughput measurement of RNA-protein interactions. In addition, we have

developed quantitative image analysis tools for large-scale analysis of these data, and

demonstrate measurement of both equilibrium binding constants and dissociation

kinetics. We apply these methods to the MS2 coat protein2,28-31, a system with widespread

applications in affinity purification32, RNA imaging33 and synthetic biology10,11. This

approach enables quantitative measurement of binding and dissociation of a protein to

>10^7 RNA targets generated directly on the flow cell surface, providing massive

biophysical datasets enabling predictive models for affinity tuning, decomposition of

binding energies between primary and secondary structures, and quantitative analysis of

evolutionary trajectories across sequence space.

A high-throughput RNA array platform for quantitative binding measurements

To generate a library of RNA targets, we first generated an Illumina sequencing

library containing an E. coli RNA polymerase (RNAP) initiation and stall sequence, and

a region coding for diverse sequence variants of the MS2 RNA hairpin synthesized using

doped oligonucleotides (Fig. 1a,b). To ensure multiple measurements of each RNA

variant and reduce sequencing error34, we introduced single-molecule barcodes 5′ of the

RNAP initiation sequence. The barcoding strategy serves to identify individual molecules

within a population by uniquely tagging each molecule using a barcode library. We then

bottlenecked the DNA variant population by diluting ~8 x 10^5 molecules into a

subsequent PCR amplification reaction. Bottlenecking allowed for each barcoded

molecular species to be sequenced a median of 15 times per sequencing lane, allowing

for multiple redundant measurements across the flow cell. The sequencing process

converted individual molecules within the library to ~1 μm diameter clusters of ~1,000

clonal DNA molecules on the flow cell surface27, and provided the sequence and position

of the DNA templates across the 2D array.

Following sequencing, we removed the sequenced DNA strand, and regenerated

double stranded DNA (dsDNA) using DNA polymerase to extend a biotinylated primer.

We then saturated the flow cell with streptavidin to create a terminal biotin-streptavidin

roadblock on these dsDNA fragments. To synthesize RNA we adapted methods from

single molecule investigations35 designed to generate a single RNA per DNA template.

First, we initiated E. coli RNA polymerase holoenzyme (RNAP) in CTP-starved

conditions, which allows RNAP to generate 26 bases of RNA (the footprint of RNAP)

before stalling at the first guanine on the DNA template strand. Second, we washed

excess RNA polymerase from solution and introduced all 4 nucleotides, allowing RNAP

to transcribe the variable region and stall at the biotin-streptavidin roadblock. This

procedure results in transcribed RNA tethered to its parent DNA via RNA polymerase

(Fig. 1a). The resulting RNA array contained 1.2 x 10^7 distinct RNA features comprising

1.48 x 10^5 unique sequences in a single sequencing lane.
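
As a rough consistency check on the numbers quoted above (a back-of-the-envelope sketch, not part of the original analysis), dividing the number of RNA features by the number of bottlenecked molecules gives the expected redundancy per barcoded molecule. The variable names are illustrative, and the calculation yields a mean rather than the median reported in the text.

```python
# Values quoted in the text; variable names are illustrative assumptions.
bottlenecked_molecules = 8e5   # ~8 x 10^5 barcoded DNA molecules after dilution
rna_features = 1.2e7           # distinct RNA features (clusters) in one lane
unique_sequences = 1.48e5      # unique variant sequences in the same lane

# Mean number of clusters per barcoded molecule.
clusters_per_molecule = rna_features / bottlenecked_molecules

# Mean number of clusters per unique sequence (several barcodes may share a sequence).
clusters_per_sequence = rna_features / unique_sequences

print(f"clusters per barcoded molecule: {clusters_per_molecule:.0f}")
print(f"clusters per unique sequence:   {clusters_per_sequence:.0f}")
```

The first ratio comes out at about 15, in line with the reported median of 15 reads per barcoded molecular species.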

The RNA-array enables quantitative measurement of both binding and dissociation

To measure binding energies, we flowed fluorescent SNAP-Surface 549-MS2

over the RNA array, and imaged bound MS2 at equilibrium using total internal reflection

fluorescence (TIRF) at 10 increasing concentrations. After the final measurement, we

perfused 1.8 μM unlabeled MS2 and recorded the fluorescence decay caused by

dissociation (Fig. 1c). The high concentration of unlabeled MS2 protein blocks other

binding sites on the array, preventing re-binding of fluorescently labeled MS2. To

quantify bound MS2 we developed image analysis tools that cross-correlate cluster

centers from sequencing data to acquired images and fit the observed binding in each

cluster to a 2D Gaussian (Fig. S1 and S2). Using this approach, we quantified the

fluorescence signal for each cluster in 6,240 images representing 120 tiles imaged in two

fluorescence color channels across 11 equilibrium MS2 concentrations and 15

dissociation time points. Fluorescence signals from single clusters fit canonical

dissociation (Fig. 1d) and binding curves (Fig. 1e, f), yielding binding energy estimates

in excellent agreement with published measurements (R = 0.94, slope = 1.08, Fig. 1g)

and in vitro binding assays (R = 0.92, slope = 0.76).
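
To illustrate how per-cluster fluorescence can be turned into binding parameters, the following is a minimal sketch that assumes background-subtracted signals, a single-site binding isotherm, and a single-exponential decay. The fitting procedure actually used is not specified here, so the function names, the grid search, and the synthetic data are all assumptions made for illustration.

```python
import math

def fit_koff(times, signal):
    """Estimate k_off from signal = A * exp(-k_off * t) by ordinary least
    squares on log(signal). Assumes positive, background-subtracted signal."""
    logs = [math.log(s) for s in signal]
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(logs) / n
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in zip(times, logs))
             / sum((t - t_mean) ** 2 for t in times))
    return -slope

def fit_kd(concs, signal, kd_grid):
    """Estimate K_d for signal = f_max * c / (c + K_d) by scanning candidate
    K_d values; for each candidate the best f_max has a closed form."""
    best = None
    for kd in kd_grid:
        x = [c / (c + kd) for c in concs]
        f_max = sum(xi * si for xi, si in zip(x, signal)) / sum(xi * xi for xi in x)
        sse = sum((si - f_max * xi) ** 2 for xi, si in zip(x, signal))
        if best is None or sse < best[0]:
            best = (sse, kd, f_max)
    return best[1], best[2]

# Synthetic, noise-free data for one hypothetical cluster.
true_kd, true_fmax, true_koff = 50.0, 1000.0, 0.01
concs = [1, 3, 10, 30, 100, 300, 1000]            # protein concentration, nM
binding = [true_fmax * c / (c + true_kd) for c in concs]
times = [0, 30, 60, 120, 240, 480]                # seconds after chase
decay = [true_fmax * math.exp(-true_koff * t) for t in times]

kd_grid = [10 ** (i / 100) for i in range(0, 401)]  # 1 to 10^4 nM, log-spaced
kd_hat, fmax_hat = fit_kd(concs, binding, kd_grid)
koff_hat = fit_koff(times, decay)
print(f"K_d ~ {kd_hat:.1f} nM, k_off ~ {koff_hat:.4f} per s")
```

A grid search is used only to keep the sketch dependency-free; a nonlinear least-squares routine would be the more usual choice for real, noisy data.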

We calculated off rates (k_off) for 3,029 sequences and dissociation constants (K_d)

for 129,248 sequences, encompassing 57 single (100%), 1,539 double (100%), and

24,181 triple (92.4%) mutants (Fig. 2a, b). To investigate how sequence variation in the

RNA hairpin impacts MS2 binding, we examined differential binding energies for all

single-mutants compared to the consensus sequence (ΔΔG_consensus = 0 k_BT). The average

binding energy change from all possible single-base changes at each position reveals a

sensitivity to mutation throughout the hairpin that complements the effects of mutating

individual residues on the binding surface of MS2 to alanine36 (Fig. 2c). Specifically, we

observe high mutation sensitivity at base-paired positions near the loop and at specific

single-stranded positions, suggesting significant primary sequence and secondary

structure requirements for RNA recognition.
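
The mutant counts above are consistent with a variable region of 19 positions, each with 3 alternative bases. That figure is an inference from the reported totals rather than a number stated in the text. The sketch below reproduces the counts under this assumption and shows one common way to express a binding energy change relative to consensus in units of k_BT; the function names are illustrative.

```python
from math import comb, log

def n_mutants(n_positions, n_mutations):
    """Number of distinct variants with exactly n_mutations substitutions:
    choose the mutated positions, then one of 3 alternative bases at each."""
    return comb(n_positions, n_mutations) * 3 ** n_mutations

def ddg_kbt(kd_variant, kd_consensus):
    """Binding energy change relative to consensus in units of k_B*T,
    assuming ddG = ln(K_d,variant / K_d,consensus)."""
    return log(kd_variant / kd_consensus)

n_positions = 19  # assumption inferred from 57 single mutants = 19 * 3

singles = n_mutants(n_positions, 1)
doubles = n_mutants(n_positions, 2)
triples = n_mutants(n_positions, 3)
triple_coverage = 24181 / triples  # measured triple mutants / possible triple mutants

print(singles, doubles, triples, f"{triple_coverage:.1%}")
print(f"10-fold weaker K_d -> ddG = {ddg_kbt(100.0, 10.0):.2f} k_BT")
```

Under this assumption the single and double mutant totals match the reported 57 and 1,539 exactly, and the 24,181 measured triple mutants correspond to the reported 92.4% coverage.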

Binding affinity can be partitioned between primary and secondary structure

To comprehensively examine these primary and secondary structure effects on

binding, we calculated the ΔΔG of all double-mutants (Fig. 2d). We observed high

positive epistasis in a population of compensating mutants, suggesting that these pairs

of mutations preserve hairpin structure and maintain high binding affinities (Fig. 2e). We

also observed negative epistasis in non-compensating mutants near the base of the stem,

potentially due to cooperative effects on hairpin destabilization in these regions.

Reciprocal mapping of positive epistasis signatures (1 s.d.) allowed de novo

reconstruction of the bound hairpin structure, identifying base-paired, loop, and bulge

positions, demonstrating the feasibility of reconstructing molecular RNA structures from

large-scale sequence-function data.
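
One way to make the notion of epistasis concrete is as the deviation of a double mutant from the sum of its two single mutants. The sketch below assumes binding energy changes in k_BT where larger values mean weaker binding, and it adopts a sign convention (positive when the double mutant binds better than additivity predicts) that is an assumption of this example, as are the function name and the numerical values.

```python
def epistasis(ddg_double, ddg_single_a, ddg_single_b):
    """Deviation of a double mutant from additivity, in k_B*T.

    Sign convention (an assumption of this sketch): positive when the double
    mutant binds better (lower ddG) than the sum of its single mutants."""
    return (ddg_single_a + ddg_single_b) - ddg_double

# Hypothetical compensating pair: each single mutant breaks a base pair,
# but together they restore pairing and nearly restore binding.
compensating = epistasis(ddg_double=0.5, ddg_single_a=3.0, ddg_single_b=3.2)

# Hypothetical independent pair: effects simply add.
additive = epistasis(ddg_double=4.0, ddg_single_a=1.5, ddg_single_b=2.5)

print(f"compensating pair: {compensating:.1f} k_BT")
print(f"additive pair:     {additive:.1f} k_BT")
```

Position pairs whose mutants repeatedly show large positive values under this convention are the natural candidates for base-paired positions in a structure reconstruction.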

We modeled the contributions of base-specificity (primary structure) and base-

pairing (secondary structure) to binding energy at each position in the hairpin with a

linear regression model from a set of 121 training sequences. This model provides two

free parameters for each unpaired base accounting for primary sequence changes in the

form of transitions or transversions. For each pair of interacting bases, the model

provides a total of 6 free parameters: one for transition and transversion of each base in

the pair (4 parameters) as well as one parameter to account for disruption due to the loss

of base-pairing and one parameter representing possible non-canonical base-pairing

interactions. These parameters were optimized jointly, in order to identify (via

regression) the energetic contributions of primary sequence changes (i.e. transitions or

transversions that occur while holding secondary structure constant) and secondary

structure changes (i.e. inferred energetic consequences of secondary structure disruptions

or formation of non-canonical bases in isolation from primary sequence perturbations).

To quantify the sensitivity for non-canonical base-pairing at positions in the hairpin stem,

we trained the model 8 separate times (once for each possible non-canonical pairing) with

one free parameter representing the energetic cost of the respective non-canonical

pairing. This re-fitting analysis allowed the model to incorporate a different energetic

penalty for having non-canonical base pairs at a specific position instead of the energetic

penalty for a full loss of base-pairing. In this analysis, G:U base pairs caused substantially

less disruption to the binding energy than other non-canonical base pairs (Fig. 3a),

consistent with the formation of a wobble base pair at G:U positions that allows partial

rescue of the secondary structure12,37. Our final model, which incorporated a free

parameter for G:U non-canonical base pairs, captured 92% of the variance in binding

energy of the training set and predicted the binding energy of second and third mutations

for variants with mutations in both paired and unpaired positions with correlation

coefficients R=0.94 and R=0.83, respectively (Fig. 3b).
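
The parameterization described above can be sketched as a feature encoding for a linear regression: two indicators per unpaired base and six per base pair. The encoding below is an illustrative assumption rather than the implementation used in this work. In particular, it treats a G:U pair as the non-canonical class and does not also flag it as a loss of pairing.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "U"}
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}

def substitution_class(ref, alt):
    """Classify a base change as 'none', 'transition' or 'transversion'."""
    if ref == alt:
        return "none"
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

def unpaired_features(ref, alt):
    """Two indicator features for an unpaired base: [transition, transversion]."""
    cls = substitution_class(ref, alt)
    return [int(cls == "transition"), int(cls == "transversion")]

def paired_features(ref5, ref3, alt5, alt3):
    """Six indicator features for a base pair: transition/transversion for the
    5' base, transition/transversion for the 3' base, loss of pairing, and
    G:U wobble pairing (assumed encoding)."""
    pair = (alt5, alt3)
    is_wobble = pair in WOBBLE
    lost_pairing = pair not in WATSON_CRICK and not is_wobble
    return (unpaired_features(ref5, alt5) + unpaired_features(ref3, alt3)
            + [int(lost_pairing), int(is_wobble)])

# A G:C pair mutated three ways (hypothetical examples).
print(paired_features("G", "C", "A", "U"))  # compensating double mutant
print(paired_features("G", "C", "G", "U"))  # G:U wobble
print(paired_features("G", "C", "C", "C"))  # pairing lost
```

Concatenating these vectors over all positions gives one row of a design matrix; regressing measured binding energies on that matrix then yields one energetic coefficient per feature.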

The model fit parameters allowed quantitative decomposition of primary and

secondary determinants of affinity across the RNA structure (Fig. 3c, d). Energetic

penalties for disrupting base-pairing increase with proximity to the loop, while non-

canonical G:U base pairs cause substantially less energetic disruption at the -8:-3 and

-11:-1 positions. Altering the primary sequence at -10A (bulge) and -4A (loop), residues

that interact with the Lys61 binding pocket on alternate halves of the dimer29, confers

energetic costs that exceed disrupting the hairpin structure at any single base pair. We

also observed important roles for the -7A and -5C residues, consistent with stacking

interactions at these positions38. Altering the primary sequence on the 5′ side of the

hairpin confers a greater energetic penalty compared with altering the 3′ side, which we

speculate results from direct interactions with MS2 on the 5′ side36.

Changes in association rate substantially contribute to changes in binding energies

We sought to quantify how changes in association and dissociation rates

contribute to measured ΔΔG values for all mutants with measurable kinetic data. We

calculated the energetic contributions to ΔΔG from changes in dissociation rates,

[$\log(k_{\mathrm{off}}^{\mathrm{mutant}}/k_{\mathrm{off}}^{\mathrm{consensus}}) \equiv \Delta\log(k_{\mathrm{off}}^{\mathrm{mutant}})$], and inferred the contribution from changes in

association rates, [$-\log(k_{\mathrm{on}}^{\mathrm{mutant}}/k_{\mathrm{on}}^{\mathrm{consensus}}) \equiv -\Delta\log(k_{\mathrm{on}}^{\mathrm{mutant}})$]. Because

$\Delta\log(k_{\mathrm{off}}) + (-\Delta\log(k_{\mathrm{on}})) = \Delta\Delta G$, we treated these parameters as pseudo-energies. Using this

decomposition, we examined the fractional contribution of change in dissociation rates to

ΔΔG across single and double mutants (Fig. 4a). At the base of the hairpin, only a small

fraction of ΔΔG measurements are explained by dissociation rate changes. This small

effect suggests that mutations at these positions modulate association rates, possibly by

causing fraying of the hairpin and/or allowing competition with alternate RNA structures,

thereby reducing the per-collision probability of productive binding. This interpretation

is reinforced by examining $\Delta\log(k_{\mathrm{off}})$ and $-\Delta\log(k_{\mathrm{on}})$ in this region (Fig. 4b, c).

Dissociation rates change little while inferred association rates remain similar to that of

the consensus sequence only for structures that maintain base-pairing through

compensating mutations. Across all measured variants, we observe a significant

population of structures with ΔΔG driven by association rates (Fig. 4d; P < 2.2 x 10^-16,

Wilcoxon signed rank test, μ = 0.5). These results suggest the kinetic drivers of observed

affinity changes are position-specific and often operate through modulating association

rates, likely by changing hairpin stability.
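
The decomposition described above can be sketched in a few lines. This is a minimal illustration and not the analysis code used for this work; it assumes natural logarithms, energies in units of kBT, and the standard relation Kd = koff/kon. The on-rate term is computed as the remainder ΔΔG − Δlog(koff), which under that relation equals log(kon consensus / kon mutant); this sign convention is an assumption made so that the two pseudo-energies add up to ΔΔG. The function name `decompose_ddg` and the mutant off-rate are hypothetical; the Kd values and the consensus half-life are taken from the Figure 1 legend.

```python
import math

def decompose_ddg(kd_mut, kd_cons, koff_mut, koff_cons):
    """Split ddG (units of kBT) into an off-rate term and an inferred on-rate term."""
    ddg = math.log(kd_mut / kd_cons)          # ddG = ln(Kd_mutant / Kd_consensus)
    d_off = math.log(koff_mut / koff_cons)    # measured off-rate pseudo-energy
    d_on = ddg - d_off                        # inferred on-rate pseudo-energy (remainder)
    frac_off = d_off / ddg if ddg != 0 else float("nan")
    return ddg, d_off, d_on, frac_off

# Mean Kd values for the consensus (-5C) and -5U variants (Fig. 1f legend).
KD_CONS_NM, KD_MUT_NM = 2.57, 36.8
# Consensus off-rate from the reported half-life, t1/2 = 8.39 min (Fig. 1d legend).
KOFF_CONS = math.log(2) / 8.39
# Hypothetical mutant off-rate, chosen only for illustration (not a measured value).
KOFF_MUT = 3.0 * KOFF_CONS

ddg, d_off, d_on, frac_off = decompose_ddg(KD_MUT_NM, KD_CONS_NM, KOFF_MUT, KOFF_CONS)
print(f"ddG = {ddg:.2f} kBT, off-rate fraction = {frac_off:.2f}")
```

A fractional off-rate contribution near zero corresponds to the behavior reported at the base of the hairpin, where affinity changes are attributed mostly to association rates.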

Discussion

Using in situ transcription and inter-molecular tethering of RNA to DNA, we

have converted a high-throughput DNA sequencing flow cell into an RNA array for

quantitatively measuring both binding kinetics and thermodynamics at an unprecedented

scale. Using this quantitative deep mutational profiling approach we report, to our

knowledge, the largest collection of binding affinities and kinetic constants for an

intermolecular interaction. Using this dataset, we addressed long-standing biophysical

questions, including i) the relative contributions of primary and secondary structure

elements to binding energy, ii) the sequence-dependent kinetic contributions to observed

affinities, and iii) the context-dependence of preference for G:U intermediates in secondary

structure.

Our predictive model for RNA-protein affinity across thousands of point

mutations provides a map for quantitative tuning of both the association rate and the

equilibrium constants of this RNA-protein interaction. We anticipate this resource of

sequence variants will enable affinity tuning of MS2-based RNA sensors, enabling new

applications in synthetic biology. Additionally, these data provide a framework for

understanding the effects of primary sequence, secondary structure and non-canonical

base-pairing on binding, and thus a guide to the design and evolution

of new RNA aptamers.

We hypothesize that inferred changes in on-rates are due to destabilization of the

RNA hairpin or competition with alternate secondary structures, reducing the

number of productive binding collisions39. These observations suggest the data provided

here may also provide a rich resource for modeling the RNA hairpin stability and

alternate structure formation. While this is an area of inquiry beyond the focus of this

work, the potential for formation of alternate structures and the effects of local sequence

on native folding of RNA are well suited for study using this platform, as the RNA

transcripts are synthesized by E. coli RNAP and folded co-transcriptionally, closely

approximating synthesis conditions in vivo.

We anticipate this RNA-MaP methodology will be a powerful addition to select

and sequence methods. In addition, the technique might provide quantitative information

on RNA libraries generated by systematic evolution of ligands by exponential

enrichment (SELEX), allowing affinity tuning for the design of biological parts. While

SELEX methods often begin with large libraries (~10^14) and produce a small number of

selected molecules, this RNA array methodology allows characterization of a much larger

library subset (~10^5), opening the door to a detailed understanding of the sequence-specific

rules driving acquisition of affinity in the selection process. Alternatively, this

platform might be coupled to sequenced in vivo RNA immunoprecipitation libraries40,41

and used to directly quantify molecular affinities on in vitro generated RNA, providing

measurements of interactions in well-defined conditions. The multicolor imaging

capabilities of the sequencer enable measurement of more complex biological

interactions such as cooperativity between differentially labeled binding partners or RNA

structure inference via fluorescence resonance energy transfer (FRET). In addition, the

sequencing platform is capable of generating DNA clusters >1kb42, enabling transcription

of long RNAs and allowing investigations of long non-coding RNAs and catalytic

ribozymes. In short, we believe future application of RNA-MaP to diverse RNA-protein

and RNA-RNA interactions promises to enable quantitative prediction and engineering of

binding affinities and functional RNA molecules, as well as the identification and

understanding of evolutionary sequence constraints based on underlying biophysical

parameters.

Chapter 2 - Figures and Figure Legends

Figure 1. A massively parallel RNA array for quantitative, high-throughput

biochemistry. (a) Steps for generating RNA tethered to DNA clusters on a high-

throughput DNA sequencing flow cell. (b) Structure of the MS2 coat protein homodimer

bound to the 19 nt hairpin RNA (PDB ID: 2BU1)31. (c) Images of fluorescently labeled MS2 bound to

RNA clusters at increasing concentrations of protein and at time points following

perfusion of unlabeled MS2 competitor. Below, fitted sum of Gaussians used to assign

fluorescence to clusters. (d) Fluorescence decay of MS2 dissociating from clusters

containing the consensus sequence (-5C) (t1/2=8.39 minutes). (e) Fit binding curves to

clusters labeled in panel (c). (f) The probability distribution of binding energies from all

clusters with labeled variants; mean Kd = 2.57 nM, 36.8 nM, and 415 nM for the -5C,

-5U, and -5A variants, respectively. (g) Correlation between binding energies reported in

the literature and measured on the RNA array (squares, Carey et al.28, circles, Romaniuk

et al.30). (Grey bar indicates our affinity measurement cutoff.)

Figure 2. A quantitative map of MS2 binding across RNA sequence variants. (a)

Distribution of observed RNA variants by number of mutations. (b) Clusters measured

per molecular variant as a function of mutation number. A median of ~11 clusters are

observed for sequences with 4 mutations. Affinities for the consensus sequence come

from NC=909,385 clusters. (c) Average ΔΔG of point mutations per position. The ΔΔG

of alanine36 substitutions to the MS2 binding surface are shown in parentheses (kBT).

Solid and dashed lines represent base and phosphate interactions, respectively. (d) Matrix

of ΔΔG for single and double mutants of the consensus sequence. Inset contains the

matrix of ΔΔG for single and double mutants of the +1G variant. All energies are

calculated relative to the consensus (-5C) sequence (arrow, ΔΔG=0), and the number of

quality-filtered double mutants in each matrix is indicated (M2). (e) Epistasis matrix

derived from (d) allows de novo reconstruction of the hairpin structure.

Figure 3. Binding affinity is dependent on primary sequence and secondary RNA

structure. (a) Fit parameters for linear regression model showing position-specific

contributions. Energetic components for all possible base pair combinations are shown

below. (b) Predicted binding energies of variants with second (M2) and third mutations

(M3) in both single- and double-stranded regions. Primary (i.e. mean energetic

contributions of transitions and transversions) (c) and secondary (d) structure

contributions to affinity derived from a, were mapped onto the hairpin (PDB ID:

1ZDH)38.

Figure 4. Sequence-specific contributions of association and dissociation rates to

binding affinity. (a) Fractional contribution of dissociation rates for 31 single and 289

double mutants with measurable affinities and dissociation rates. Positions at the base of

the hairpin are highlighted. (b) Δlog(koff) and (c) Δlog(kon) at the base of the hairpin. M2 =

number of quality-filtered double mutants. (d) Distribution of fractional contributions of

association (blue, μ=0.57) and dissociation (red, μ=0.43) rates to ΔΔG for all measured

mutants (N=3,029).

Supplementary Figure 1. Data Analysis Workflow. (a) Sequencing cluster centers were

derived from the fastq files from the sequencing run. X/Y and tile positions were

extracted from the fastq header lines. Data were cross-correlated with the observed

images to define a global offset. Images were then cleaned to mask any saturated pixels.

Images were broken into smaller sub regions (24x24 pixels) and the fluorescence was

fitted to a sum of overlapping 2D Gaussians. This process was repeated for all 120 tiles

of the GAIIx sequencing lane and across the 26 image series (3,120 images). (b) Binding

images were normalized for RNA content using the all RNA image (Alexa647 oligo

hybridized to the stall sequence). Data were aggregated across the image series by cluster

ID, and the fluorescence values for each cluster across concentrations were fit to a binding

curve. The fit binding energies were grouped by hairpin sequence, and median binding

energies for each sequence were reported.
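
The per-cluster fitting and per-sequence aggregation in panel (b) can be sketched as follows. This is an illustrative outline under stated assumptions, not the pipeline used here: it assumes a single-site binding curve f(c) = fmax · c / (c + Kd), uses a simple log-spaced grid search in place of whatever optimizer was actually used, and reports ln(Kd) as the binding energy in kBT up to an additive constant. The names `fit_kd` and `median_energy_by_sequence` and the input layout are hypothetical.

```python
import math
from collections import defaultdict
from statistics import median

def fit_kd(concs, signal, n_grid=400):
    """Fit f(c) = fmax * c / (c + Kd) by scanning Kd on a log-spaced grid.

    For each trial Kd the best fmax has a closed form (linear least squares).
    """
    lo = math.log(min(concs) / 100.0)
    hi = math.log(max(concs) * 100.0)
    best = None
    for i in range(n_grid + 1):
        kd = math.exp(lo + (hi - lo) * i / n_grid)
        x = [c / (c + kd) for c in concs]          # fraction bound at each concentration
        fmax = sum(f * v for f, v in zip(signal, x)) / sum(v * v for v in x)
        sse = sum((f - fmax * v) ** 2 for f, v in zip(signal, x))
        if best is None or sse < best[0]:
            best = (sse, kd, fmax)
    return best[1], best[2]

def median_energy_by_sequence(clusters, concs):
    """clusters: iterable of (sequence, rna_signal, [binding signal per concentration])."""
    energies = defaultdict(list)
    for seq, rna_signal, binding in clusters:
        normalized = [f / rna_signal for f in binding]   # normalize for RNA content
        kd, _ = fit_kd(concs, normalized)
        energies[seq].append(math.log(kd))               # ln(Kd), in kBT up to a constant
    return {seq: median(vals) for seq, vals in energies.items()}
```

Taking the median rather than the mean per sequence makes the reported energy less sensitive to the occasional poorly fit cluster.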

Supplementary Figure 2. Correlating sequencing data and fitting 2D Gaussians to

acquired images. We found that a simple cross-correlation was sufficient to map x/y

positions from the sequencing data to both the (a) all RNA image and the (b) MS2

binding images (cluster centers shown in green). Shown are unaligned images and cluster

centers (left), the cross-correlation value (middle), and the resulting mapped cluster

centers (right). The plotted cluster centers were adjusted using the least squares image fit.

Fitting the images to 2D Gaussians generated the following distributions for the relevant

parameters: (c) the fit amplitude and (d) the fit standard deviation from a representative

tile. Integrating these values generated (e) the distribution of the integrated fluorescence.
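
The global-offset step can be illustrated with a brute-force cross-correlation between an image and a set of cluster centers treated as point sources. This is a sketch under assumptions, not the registration code used here: the search window `max_shift`, the integer pixel coordinates, and the function name `best_offset` are all hypothetical, and no sub-pixel refinement (such as the least squares image fit mentioned above) is attempted.

```python
def best_offset(image, centers, max_shift=5):
    """Return the (dx, dy) shift of cluster centers that maximizes summed image intensity.

    image:   2D list indexed as image[y][x]
    centers: list of integer (x, y) cluster positions from the sequencing data
    """
    height, width = len(image), len(image[0])
    best_score, best_dx, best_dy = float("-inf"), 0, 0
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            # Cross-correlation of the image with a point mask at this shift.
            score = 0.0
            for x, y in centers:
                xs, ys = x + dx, y + dy
                if 0 <= xs < width and 0 <= ys < height:
                    score += image[ys][xs]
            if score > best_score:
                best_score, best_dx, best_dy = score, dx, dy
    return best_dx, best_dy
```

For full-size tiles an FFT-based correlation would be far faster; the exhaustive loop is used here only because it makes the definition explicit.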

References

1. Keene, J. D. RNA regulons: coordination of post-transcriptional events. Nature
Reviews Genetics 8, 533–543 (2007).
2. Carey, J., Cameron, V., De Haseth, P. L. & Uhlenbeck, O. C. Sequence-specific
interaction of R17 coat protein with its ribonucleic acid binding site. Biochemistry
22, 2601–2610 (1983).
3. Tsvetanova, N. G., Klass, D. M., Salzman, J. & Brown, P. O. Proteome-Wide
Search Reveals Unexpected RNA-Binding Proteins in Saccharomyces cerevisiae.
PLoS ONE 5, e12671 (2010).
4. Scherrer, T., Mittal, N., Janga, S. C. & Gerber, A. P. A Screen for RNA-Binding
Proteins in Yeast Indicates Dual Functions for Many Enzymes. PLoS ONE 5,
e15499 (2010).
5. Butter, F., Scheibe, M., Mörl, M. & Mann, M. Unbiased RNA–protein interaction
screen by quantitative proteomics. Proceedings of the National Academy of
Sciences 106, 10626–10631 (2009).
6. Castello, A. et al. Insights into RNA Biology from an Atlas of Mammalian
mRNA-Binding Proteins. Cell 149, 1393–1406 (2012).
7. Wang, K. C. et al. A long noncoding RNA maintains active chromatin to
coordinate homeotic gene expression. Nature 472, 120–124 (2011).
8. Tsai, M. C. et al. Long Noncoding RNA as Modular Scaffold of Histone
Modification Complexes. Science 329, 689–693 (2010).
9. Guttman, M. & Rinn, J. L. Modular regulatory principles of large non-coding
RNAs. Nature 482, 339–346 (2012).
10. Culler, S. J., Hoff, K. G. & Smolke, C. D. Reprogramming Cellular Behavior with
RNA Controllers Responsive to Endogenous Proteins. Science 330, 1251–1255
(2010).
11. Ausländer, S., Ausländer, D., Müller, M., Wieland, M. & Fussenegger, M.
Programmable single-cell mammalian biocomputers. Nature (2012).
doi:10.1038/nature11149
12. SantaLucia, J. & Turner, D. H. Measuring the thermodynamics of RNA secondary
structure formation. Biopolymers 44, 309–319 (1997).
13. Ding, Y. et al. In vivo genome-wide profiling of RNA secondary structure reveals
novel regulatory features. Nature (2013). doi:10.1038/nature12756
14. Rouskin, S., Zubradt, M., Washietl, S., Kellis, M. & Weissman, J. Genome-wide
probing of RNA structure reveals active unfolding of mRNA structures in vivo.
Nature (2013). doi:10.1038/nature12894
15. Ban, N., Nissen, P., Hansen, J., Moore, P. B. & Steitz, T. A. The Complete Atomic
Structure of the Large Ribosomal Subunit at 2.4 Å Resolution. Science 289,
905–920 (2000).
16. Wan, Y., Kertesz, M., Spitale, R. C., Segal, E. & Chang, H. Y. Understanding the
transcriptome through RNA structure. Nature Reviews Genetics 12, 641–655
(2011).
17. Martin, L. et al. Systematic reconstruction of RNA functional motifs with
high-throughput microfluidics. Nat Meth 9, 1192–1194 (2012).
18. Ray, D. et al. Rapid and systematic analysis of the RNA recognition specificities
of RNA-binding proteins. Nat. Biotechnol. 27, 667–670 (2009).
19. Ray, D. et al. A compendium of RNA-binding motifs for decoding gene
regulation. Nature 499, 172–177 (2013).
20. Araya, C. L. et al. A fundamental protein property, thermodynamic stability,
revealed solely from large-scale measurements of protein function. Proceedings of
the National Academy of Sciences 109, 16858–16863 (2012).
21. Pitt, J. N. & Ferre-D'Amare, A. R. Rapid Construction of Empirical RNA Fitness
Landscapes. Science 330, 376–379 (2010).
22. Guenther, U.-P. et al. Hidden specificity in an apparently nonspecific
RNA-binding protein. Nature (2013). doi:10.1038/nature12543
23. Matzas, M. et al. High-fidelity gene synthesis by retrieval of sequence-verified
DNA identified using high-throughput pyrosequencing. Nat. Biotechnol. 28,
1291–1294 (2010).
24. Myllykangas, S., Buenrostro, J. D., Natsoulis, G., Bell, J. M. & Ji, H. P. Efficient
targeted resequencing of human germline and cancer genomes by
oligonucleotide-selective sequencing. Nat. Biotechnol. 29, 1024–1027 (2011).
25. Uemura, S. et al. Real-time tRNA transit on single translating ribosomes at codon
resolution. Nature 464, 1012–1017 (2010).
26. Nutiu, R. et al. Direct measurement of DNA affinity landscapes on a
high-throughput sequencing instrument. Nat. Biotechnol. 29, 659–664 (2011).
27. Bentley, D. R. et al. Accurate whole human genome sequencing using reversible
terminator chemistry. Nature 456, 53–59 (2008).
28. Carey, J., Lowary, P. T. & Uhlenbeck, O. C. Interaction of R17 coat protein with
synthetic variants of its ribonucleic acid binding site. Biochemistry 22, 4723–4730
(1983).
29. Valegård, K., Murray, J. B., Stockley, P. G., Stonehouse, N. J. & Liljas, L. Crystal
structure of an RNA bacteriophage coat protein–operator complex. Nature 371,
623–626 (1994).
30. Romaniuk, P. J., Lowary, P., Wu, H. N., Stormo, G. & Uhlenbeck, O. C. RNA
binding site of R17 coat protein. Biochemistry 26, 1563–1568 (1987).
31. Grahn, E. et al. Structural basis of pyrimidine specificity in the MS2 RNA
hairpin-coat-protein complex. RNA 7, 1616–1627 (2001).
32. Bardwell, V. J. & Wickens, M. Purification of RNA and RNA-protein complexes
by an R17 coat protein affinity method. Nucleic Acids Res. 18, 6587–6594 (1990).
33. Bertrand, E. et al. Localization of ASH1 mRNA particles in living yeast.
Molecular Cell 2, 437–445 (1998).
34. Kivioja, T. et al. Counting absolute numbers of molecules using unique molecular
identifiers. Nat Meth 9, 72–74 (2011).
35. Greenleaf, W. J., Frieda, K. L., Foster, D. A., Woodside, M. T. & Block, S. M.
Direct observation of hierarchical folding in single riboswitch aptamers. Science
319, 630–633 (2008).
36. Hobson, D. & Uhlenbeck, O. C. Alanine Scanning of MS2 Coat Protein Reveals
Protein–Phosphate Contacts Involved in Thermodynamic Hot Spots. Journal of
Molecular Biology 356, 613–624 (2006).
37. Varani, G. & McClain, W. H. The G·U wobble base pair: A fundamental building
block of RNA structure crucial to RNA function in diverse biological systems.
EMBO Reports 1, 18–23 (2000).
38. Valegård, K. et al. The three-dimensional structures of two complexes between
recombinant MS2 capsids and RNA operator fragments reveal sequence-specific
protein-RNA interactions. Journal of Molecular Biology 270, 724–738 (1997).
39. Gell, C. et al. Single-Molecule Fluorescence Resonance Energy Transfer Assays
Reveal Heterogeneous Folding Ensembles in a Simple RNA Stem–Loop. Journal
of Molecular Biology 384, 264–278 (2008).
40. Licatalosi, D. D. et al. HITS-CLIP yields genome-wide insights into brain
alternative RNA processing. Nature 456, 464–469 (2008).
41. Zhao, J. et al. Genome-wide Identification of Polycomb-Associated RNAs by
RIP-seq. Molecular Cell 40, 939–953 (2010).
42. Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J.
Transposition of native chromatin for fast and sensitive epigenomic profiling of
open chromatin, DNA-binding proteins and nucleosome position. Nat Meth 10,
1213–1218 (2013).

CHAPTER THREE Measuring accessibility in rare cellular populations2

Introduction

Eukaryotic genomes are hierarchically packaged into chromatin1, and the nature

of this packaging plays a central role in gene regulation2,3. Major insights into the

epigenetic information encoded within the nucleoprotein structure of chromatin have

come from high-throughput, genome-wide methods for separately assaying the chromatin

accessibility (open chromatin)4,5, nucleosome positioning6-8, and transcription factor

occupancy9. While powerful, existing methods require millions of cells as starting

material, involve complex and time-consuming sample preparations, and cannot simultaneously

probe the interplay of nucleosome positioning, chromatin accessibility, and transcription

factor binding.

These limitations are problematic in three major ways: First, current methods can

average over and drown out heterogeneity in cellular populations. Second, cells must

often be grown ex vivo to obtain sufficient biomaterials, perturbing the in vivo context

and modulating the epigenetic state in unknown ways. Third, input requirements often

prevent application of these assays to well-defined clinical samples, precluding

generation of personal epigenomes in diagnostic timescales. Here we report a robust

and sensitive method for epigenomic profiling that can provide a comprehensive portrait

of gene regulatory processes.


2 Portions of this chapter were taken from Buenrostro et al. Transposition of native chromatin for
fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and
nucleosome position. Nature Methods. 2013;10(12):1213–1218. doi:10.1038/nmeth.2688.

ATAC-seq measures chromatin accessibility using Tn5 transposase

Hyperactive Tn5 transposase10,11, loaded in vitro with adapters for high-throughput DNA

sequencing, can simultaneously fragment and tag a genome with sequencing adapters

(previously described as tagmentation11). Because transposons have been shown to

integrate into active regulatory elements in vivo12, we hypothesized that transposition by

purified Tn5, a prokaryotic transposase, on small numbers of unfixed nuclei would

interrogate regions of accessible chromatin. Here we describe the Assay for Transposase

Accessible Chromatin (ATAC-seq). ATAC-seq uses Tn5 transposase to integrate its

adapter payload into regions of accessible chromatin, whereas compaction and

sequestration of chromatin make transposition improbable. Therefore, amplifiable DNA

fragments suitable for high-throughput sequencing are generated at locations of open

chromatin (Fig 1A). The entire assay and library construction can be carried out in a

simple two-step process involving Tn5 insertion and PCR. In contrast, published DNase-

and FAIRE-seq protocols for assaying chromatin accessibility involve complex multi-

step protocols and many loss-prone steps, such as adapter ligation, gel purification and

reversal of crosslinks. Specifically, DNase-seq consists of 44 steps and two overnight

incubations, while published FAIRE-seq protocols require two overnight incubations

carried out over at least 3 days13,14. Furthermore, these protocols require 1-50 million cells

(FAIRE) or 50 million cells (DNase-seq), likely due to their complex workflows13,14 (Fig

1B). In comparison to pre-existing methods, ATAC-seq enables rapid and efficient

library generation while radically reducing sample input requirements. Extensive

analyses show that ATAC-seq provides an accurate and sensitive measure of chromatin

accessibility genome-wide. We carried out ATAC-seq on 50,000 and 500 unfixed nuclei

isolated from the GM12878 lymphoblastoid cell line (ENCODE Tier 1)15 for comparison and

validation with chromatin accessibility data sets, including DNase-seq13 and FAIRE-seq16.

At a locus previously highlighted by others5 (Fig. 1C), ATAC-seq has a signal-to-noise

ratio similar to DNase-seq, which was generated from approximately 3 to 5 orders of magnitude

more cells13,14. Peak intensities were highly reproducible between technical

replicates (R=0.98), and highly correlated between ATAC-seq and DNase-seq (R=0.79

and R=0.83). Highly sensitive open chromatin detection is maintained even when using

5,000 or 500 human nuclei as starting material, although sensitivity is diminished for

smaller numbers of input material, as can be seen in Fig 1C.

Insert size yields information regarding nucleosome packing and positioning

Unlike pre-existing assays that measure chromatin accessibility, ATAC-seq

paired-end reads produce detailed information about nucleosome packing and

positioning. The insert size distribution of sequenced fragments from human chromatin

has clear periodicity of approximately 200 base pairs, suggesting many fragments are

protected by integer multiples of nucleosomes (Fig 2A). This fragment size distribution

also shows clear periodicity equal to the helical pitch of DNA11. By partitioning insert

size distribution according to functional classes of chromatin as defined by previous

models17, and normalizing to the global insert distribution (Methods) we observe clear

class-specific enrichments across this insert size distribution (Fig. 2B), demonstrating that

these functional states of chromatin have an accessibility fingerprint that can be read

out with ATAC-seq. These differential fragmentation patterns are consistent with the

putative functional state of these classes, as insulator regions are enriched for short

fragments of DNA, while transcription start sites are differentially depleted for mono-, di-

and tri-nucleosome associated fragments. Transcribed and promoter flanking regions are

enriched for longer multi-nucleosomal fragments, suggesting they are more compacted

than other states that require access to DNA by regulatory factors. Finally, repressed

regions are differentially depleted for short fragments, consistent with their expected

compacted state. These data suggest that ATAC-seq reveals differentially compacted

forms of chromatin, which have been long hypothesized to exist in vivo2,18,19.
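
The class-specific enrichment of insert sizes can be sketched as a ratio of two length distributions. The exact normalization is described in the Methods, which are not reproduced here; the per-length ratio of class frequency to global frequency used below is therefore an assumption, and the names `size_distribution` and `class_enrichment` and the input layout are hypothetical.

```python
from collections import Counter

def size_distribution(lengths, max_len=1000):
    """Normalized histogram of fragment lengths (fraction of fragments at each length)."""
    counts = Counter(length for length in lengths if 0 < length <= max_len)
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()}

def class_enrichment(fragments, max_len=1000):
    """fragments: iterable of (chromatin_class, fragment_length) pairs.

    Returns, for each class, the per-length frequency divided by the global frequency.
    Values above 1 mean that length is over-represented in the class.
    """
    fragments = list(fragments)
    global_dist = size_distribution([length for _, length in fragments], max_len)
    by_class = {}
    for cls, length in fragments:
        by_class.setdefault(cls, []).append(length)
    return {
        cls: {length: p / global_dist[length]
              for length, p in size_distribution(lengths, max_len).items()}
        for cls, lengths in by_class.items()
    }
```

In this toy form, a class such as insulators would show values above 1 at short lengths, while repressed regions would show values below 1 there, matching the qualitative pattern described above.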

To explore nucleosome positioning within accessible chromatin in the GM12878

cell line, we partitioned our data into reads generated from putative nucleosome free

regions of DNA, and reads likely derived from nucleosome associated DNA. Using a

simple heuristic that positively weights nucleosome associated fragments and negatively

weights nucleosome free fragments, we calculated a data track used to call nucleosome

positions within regions of accessible chromatin20. An example locus (Fig. 3A) contains a

putative bidirectional promoter with CAGE data showing two transcription start sites

(TSS) separated by ~700 bp. ATAC-seq in fact reveals two distinct nucleosome free

regions, separated by a single well-positioned mononucleosome (Fig. 3A). Compared to

MNase-seq21, ATAC-seq data is more amenable to detecting nucleosomes within putative

regulatory regions, as the majority of reads are concentrated within accessible regions of

chromatin (Fig. 3B). By averaging signal across all active TSSs, we note nucleosome free

fragments are enriched at a canonical nucleosome free promoter region overlapping the

TSS, while our nucleosome signal is enriched both upstream and downstream of the

active TSS, and displays characteristic phasing of upstream and downstream

nucleosomes6,7 (Fig. 3C). Because ATAC-seq reads are concentrated at regions of open

chromatin, we see strong nucleosome signal at the +1 nucleosome, which decreases at the

+2, +3 and +4 nucleosomes. In contrast, MNase-seq nucleosome signal increases at larger

distances from the TSS, likely due to over-digestion of more accessible nucleosomes.

Additionally, MNase-seq (4 billion reads) assays all nucleosomes, requiring orders of

magnitude more sequencing than ATAC-seq (198 million paired reads) to reach similar

resolution at regulatory nucleosomes (Fig. 3B,C). Using our nucleosome calls, we further

partitioned putative distal regulatory regions and TSSs into regions that were nucleosome

free and regions that were predicted to be nucleosome bound. We note that TSSs were

enriched for nucleosome free regions when compared to distal elements, which tend to

remain nucleosome rich (Fig. 3D). These data suggest ATAC-seq can provide high-

resolution readout of nucleosome associated and nucleosome free regions in regulatory

elements genome wide.
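
The weighting heuristic for the nucleosome track can be sketched as follows. This is only an illustration of the idea of scoring nucleosome associated fragments positively and nucleosome free fragments negatively; the length thresholds, the weights of +1 and -1, the use of the fragment midpoint as a proxy for the dyad, and the function name `nucleosome_track` are all assumptions and are not taken from the text above.

```python
def nucleosome_track(fragments, region_start, region_end,
                     nfr_max=100, nuc_min=180, nuc_max=247):
    """Build a per-base signal from paired-end inserts.

    fragments: iterable of (start, end) half-open coordinates.
    Fragments shorter than nfr_max count as nucleosome free (weight -1);
    fragments with nuc_min <= length <= nuc_max count as mono-nucleosomal (weight +1);
    all other fragments are ignored in this simplified sketch.
    """
    track = [0.0] * (region_end - region_start)
    for start, end in fragments:
        length = end - start
        if length < nfr_max:
            weight = -1.0
        elif nuc_min <= length <= nuc_max:
            weight = 1.0
        else:
            continue
        mid = (start + end) // 2              # midpoint as a proxy for the dyad
        if region_start <= mid < region_end:
            track[mid - region_start] += weight
    return track
```

Local maxima of a smoothed version of such a track would be candidate nucleosome positions, and extended negative stretches would be candidate nucleosome free regions.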

ATAC-seq reveals distinct classes of factor-nucleosome spacing

ATAC-seq high-resolution regulatory nucleosome maps can be used to

understand the relationship between nucleosomes and DNA binding factors. Using ChIP-

seq data, we plotted the position of a variety of DNA binding factors with respect to the

dyad of the nearest nucleosome. Unsupervised hierarchical clustering (Fig. 3E)

revealed major classes of binding with respect to the proximal nucleosome, including 1) a

strongly nucleosome avoiding group of factors with binding events stereotyped at ~180

bases from the nearest nucleosome dyad (comprising C-FOS, NFYA and IRF3), 2) a

class of factors that nestle up precisely to the expected end of nucleosome DNA

contacts, which notably includes chromatin looping factors CTCF and cohesin complex

subunits RAD21 and SMC3; 3) a large class of primarily transcription factors that have

gradations of nucleosome avoiding or nucleosome-overlapping binding behavior, and 4)

a class whose binding sites tend to overlap nucleosome associated DNA. Interestingly,

this final class includes chromatin remodeling factors such as CHD1 and SIN3A as well

as RNA polymerase II, which appears to be enriched at the nucleosome boundary8. The

interplay between precise nucleosome positioning and the locations of DNA binding factors

immediately suggests specific hypotheses for mechanistic studies, a potential advantage

of ATAC-seq.
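
The positional profiles that feed the clustering can be sketched by measuring, for each binding site, the signed distance to the nearest nucleosome dyad and then binning those distances. This is a minimal sketch assuming sites and dyads on a single chromosome given as integer coordinates; the function names and the 10 bp bin size are hypothetical choices, not parameters from this work.

```python
from bisect import bisect_left
from collections import Counter

def distance_to_nearest_dyad(site, dyads):
    """Signed distance (site - dyad) to the nearest dyad; dyads must be sorted and non-empty."""
    i = bisect_left(dyads, site)
    candidates = dyads[max(i - 1, 0): i + 1]     # neighbor on the left and on the right
    nearest = min(candidates, key=lambda d: abs(site - d))
    return site - nearest

def spacing_profile(sites, dyads, bin_size=10):
    """Histogram of binned distances from binding sites to their nearest dyad."""
    dyads = sorted(dyads)
    return Counter(
        (distance_to_nearest_dyad(site, dyads) // bin_size) * bin_size
        for site in sites
    )
```

One such profile per factor, normalized to sum to one, gives the rows that a hierarchical clustering routine would then group.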

Footprints can be used to infer factor occupancy genome-wide

ATAC-seq enables accurate inference of DNA binding factor occupancy genome-

wide. We reasoned that DNA sequences directly occupied by DNA-binding proteins are

protected from transposition; the resulting sequence footprint reveals the presence of

the DNA-binding protein at each site, analogous to DNase digestion footprints22. At a

specific CTCF binding site on chromosome 1, we observed a clear footprint (a deep

notch of ATAC-seq signal), similar to footprints seen by DNase-seq23,24, at the precise

location of the CTCF motif that coincides with the summit of the CTCF ChIP-seq signal

in GM12878 cells (Fig 4A). We averaged ATAC-seq signal over all expected locations of

CTCF within the genome and observed a well-stereotyped footprint (Fig. 4B). Similar

results were obtained for a variety of common TFs. We inferred the CTCF binding

probability from motif consensus score, evolutionary conservation, and ATAC-seq

footprinting data to generate a posterior probability of CTCF binding at all loci (Fig.

4C)25. Results using ATAC-seq closely recapitulate ChIP-seq binding data in this cell line

and compare favorably to DNase-based factor occupancy inference, suggesting that

factor occupancy data can be extracted from these ATAC-seq datasets, and allowing

reconstruction of regulatory networks.
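
The aggregate footprint described above can be sketched by averaging insertion counts in a fixed window around each motif site. This is a simplified illustration and not the occupancy inference method cited in the text: it ignores motif strand, uses a hypothetical dictionary of per-base insertion counts, and summarizes the footprint with a simple flank-to-core ratio that is an assumption of this sketch.

```python
def aggregate_footprint(insertions, motif_centers, flank=50):
    """Mean insertion profile around motif centers.

    insertions:    dict mapping genomic position -> transposase insertion count
    motif_centers: list of motif center positions (single chromosome assumed)
    """
    width = 2 * flank + 1
    profile = [0.0] * width
    for center in motif_centers:
        for offset in range(-flank, flank + 1):
            profile[offset + flank] += insertions.get(center + offset, 0)
    n = len(motif_centers)
    return [value / n for value in profile] if n else profile

def footprint_depth(profile, motif_half_width=10):
    """Ratio of mean flanking signal to mean signal over the motif core.

    Values above 1 suggest protection from transposition at the motif.
    """
    mid = len(profile) // 2
    core = profile[mid - motif_half_width: mid + motif_half_width + 1]
    sides = profile[: mid - motif_half_width] + profile[mid + motif_half_width + 1:]
    core_mean = sum(core) / len(core)
    side_mean = sum(sides) / len(sides)
    return side_mean / core_mean if core_mean else float("inf")
```

A per-site version of the same ratio is one simple feature that could be combined with motif score and conservation when estimating binding probability.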

Using ATAC-seq footprints we generated the occupancy profiles of 89

transcription factors in proband T-cells, enabling systematic reconstruction of regulatory

networks. With this personalized regulatory map, we compared the genomic distribution

of the same 89 transcription factors between GM12878 and proband CD4+ T-cells.

Transcription factors that exhibit large variation in distribution between T-cells and B-

cells are enriched for T-cell specific factors (Fig. 4D). This analysis shows NFAT is

differentially regulated, while canonical CTCF occupancy is highly correlated between

these two cell types (Fig. 4D).

Discussion

Epigenomic studies of chromatin accessibility have yielded tremendous biological

insights, but are currently limited in application by their complex workflows and large

cell number requirements. ATAC-seq offers potentially unique advantages over pre-

existing ChIP-, MNase- and DNase-seq methods. ATAC-seq is an information rich assay,

allowing simultaneous interrogation of factor occupancy, nucleosome positions in

regulatory sites, and chromatin compaction genome-wide. These insights are derived

from both the position of insertion and the distribution of insert lengths captured during

the transposition reaction. While extant methods such as DNase- and MNase-seq can

provide some subsets of the information in ATAC-seq, they each require separate assays

with large cell numbers, which increases time and cost and limits applicability to many

systems. ATAC-seq also provides insert size fingerprints of biologically relevant

genomic regions, suggesting that it captures information on chromatin compaction. We

expect ATAC-seq to have broad applicability, significantly add to the genomics toolkit,

and improve our understanding of gene regulation, particularly when integrated with

other powerful rare cell techniques, such as FACS, laser capture microdissection (LCM)

and recent advancements in RNA-seq26,27. In summary, we believe that the attractive

combination of speed, simplicity and low input requirements of ATAC-seq will enable

new gene regulatory insights into biology and medicine.

Chapter 3 - Figures and Figure Legends

Figure 1. ATAC-seq is a sensitive, accurate probe of open chromatin state. A)

ATAC-seq reaction schematic. Transposase (green), loaded with sequencing adapters

(red and blue), inserts only in regions of open chromatin (nucleosomes in grey) and

generates sequencing library fragments that can be PCR amplified. B) Approximate input

material and sample preparation time requirements for genome-wide methods of open

chromatin analysis. C) A comparison of ATAC-seq to other open chromatin assays at a

locus in GM12878 lymphoblastoid cells displaying high concordance. Lower ATAC-seq

track was generated from 500 FACS-sorted cells.

Figure 2. ATAC-seq provides genome-wide information on chromatin compaction.

A) ATAC-seq fragment sizes generated from GM12878 nuclei (red) indicate chromatin-

dependent periodicity with a spatial frequency consistent with nucleosomes, as well as a

high frequency periodicity consistent with the pitch of the DNA helix for fragments less

than 200 bp. (Inset) log-transformed histogram shows clear periodicity persists to 6

nucleosomes. B) Normalized read enrichments for 7 classes of chromatin state previously

defined17.

Figure 3. ATAC-seq provides genome-wide information on nucleosome positioning

in regulatory regions. A) An example locus containing two transcription start sites

(TSSs) showing nucleosome free read track, calculated nucleosome track (Methods), as

well as DNase, MNase, and H3K27ac, H3K4me3, and H2A.Z tracks for comparison. B)

ATAC-seq (198 million paired reads) and MNase-seq (4 billion single-end reads)

nucleosome signal shown for all active TSSs (n=64,836); TSSs are sorted by CAGE

expression. C) TSSs are enriched for nucleosome free fragments, and show phased

nucleosomes similar to those seen by MNase-seq at the -2, -1, +1, +2, +3 and +4

positions. D) Relative fraction of nucleosome associated vs. nucleosome free (NFR)

bases in TSS and distal sites (see Methods). E) Hierarchical clustering of DNA binding

factor position with respect to the nearest nucleosome dyad within accessible chromatin

reveals distinct classes of DNA binding factors. Factors strongly associated with

nucleosomes are enriched for chromatin remodelers.

Figure 4. ATAC-seq assays genome-wide factor occupancy. A) CTCF footprints

observed in ATAC-seq and DNase-seq data, at a specific locus on chr1. B) Aggregate

ATAC-seq footprint for CTCF (motif shown) generated over binding sites within the

genome. C) CTCF predicted binding probability inferred from ATAC-seq data, position

weight matrix (PWM) scores for the CTCF motif, and evolutionary conservation

(PhyloP). Right-most column is the CTCF ChIP-seq data (ENCODE) for this GM12878

cell line, demonstrating high concordance with predicted binding probability. D) Cell

type-specific regulatory network from proband T cells compared with GM12878 B-cell

line. Each row or column is the footprint profile of a TF versus that of all other TFs in the

same cell type. Color indicates relative similarity (yellow) or distinctiveness (blue) in T

versus B cells. NFAT is one of the most highly differentially regulated TFs (red box)

whereas canonical CTCF binding is essentially similar in T and B cells.

References

1. Kornberg, R. D. Chromatin structure: a repeating unit of histones and DNA.
Science 184, 868-871 (1974).
2. Kornberg, R. D. & Lorch, Y. Chromatin Structure and Transcription. Annu. Rev.
Cell. Biol. 8, 563-587 (1992).
3. Mellor, J. The Dynamics of Chromatin Remodeling at Promoters. Molecular Cell
19, 147-157 (2005).
4. Boyle, A. P. et al. High-Resolution Mapping and Characterization of Open
Chromatin across the Genome. Cell 132, 311-322 (2008).
5. Thurman, R. E. et al. The accessible chromatin landscape of the human genome.
Nature 489, 75-82 (2012).
6. Schones, D. E. et al. Dynamic Regulation of Nucleosome Positioning in the
Human Genome. Cell 132, 887-898 (2008).
7. Valouev, A. A. et al. Determinants of nucleosome organization in primary human
cells. Nature 474, 516-520 (2011).
8. Barski, A. et al. High-Resolution Profiling of Histone Methylations in the Human
Genome. Cell 129, 823-837 (2007).
9. Gerstein, M. B. et al. Architecture of the human regulatory network derived from
ENCODE data. Nature 489, 91-100 (2012).
10. Goryshin, I. Y. & Reznikoff, W. S. Tn5 in vitro transposition. J. Biol. Chem. 273,
7367-7374 (1998).
11. Adey, A. et al. Rapid, low-input, low-bias construction of shotgun fragment
libraries by high-density in vitro transposition. Genome Biol 11, R119 (2010).
12. Gangadharan, S., Mularoni, L., Fain-Thornton, J., Wheelan, S. J. & Craig, N. L.
DNA transposon Hermes inserts into DNA in nucleosome-free regions in vivo.
Proceedings of the National Academy of Sciences 107, 21966-21972 (2010).
13. Song, L. & Crawford, G. E. DNase-seq: a high-resolution technique for mapping
active gene regulatory elements across the genome from mammalian cells. Cold
Spring Harb Protoc 2010, (2010).
14. Simon, J. M., Giresi, P. G., Davis, I. J. & Lieb, J. D. Using formaldehyde-assisted
isolation of regulatory elements (FAIRE) to isolate active regulatory DNA. Nature
Protocols 7, 256-267 (2012).
15. Consortium, T. E. P. A User's Guide to the Encyclopedia of DNA Elements
(ENCODE). PLoS Biol 9, e1001046 (2011).
16. Giresi, P. G. & Lieb, J. D. Isolation of active regulatory elements from eukaryotic
chromatin using FAIRE (Formaldehyde Assisted Isolation of Regulatory
Elements). Methods 48, 233-239 (2009).
17. Hoffman, M. M. et al. Integrative annotation of chromatin elements from
ENCODE data. Nucleic Acids Res. 41, 827-841 (2013).
18. Kornberg, R. D. & Lorch, Y. Chromatin and transcription: where do we go from
here. Current Opinion in Genetics & Development 12, 249-251 (2002).
19. Zhou, J., Fan, J. Y., Rangasamy, D. & Tremethick, D. J. The nucleosome surface
regulates chromatin compaction and couples it with transcriptional repression. Nat
Struct Mol Biol 14, 1070-1076 (2007).
20. Chen, K. et al. DANPOS: Dynamic analysis of nucleosome position and
occupancy by sequencing. Genome Research 23, 341-351 (2013).
21. Kundaje, A. et al. Ubiquitous heterogeneity and asymmetry of the chromatin
environment at regulatory elements. Genome Research 22, 1735 (2012).
22. Hesselberth, J. R. et al. Global mapping of protein-DNA interactions in vivo by
digital genomic footprinting. Nat Meth 6, 283-289 (2009).
23. Boyle, A. P. et al. High-resolution genome-wide in vivo footprinting of diverse
transcription factors in human cells. Genome Research 21, 456-464 (2011).
24. Neph, S. et al. An expansive human regulatory lexicon encoded in transcription
factor footprints. Nature 489, 83-90 (2012).
25. Pique-Regi, R. et al. Accurate inference of transcription factor binding from DNA
sequence and chromatin accessibility data. Genome Research 21, 447-455 (2011).
26. Tang, F. et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat Meth
6, 377-382 (2009).
27. Shalek, A. K. et al. Single-cell transcriptomics reveals bimodality in expression
and splicing in immune cells. Nature 498, 236-240 (2013).

CHAPTER FOUR Single-cell accessibility reveals principles of regulatory

variation3

Introduction

Heterogeneity within cellular populations has been evident since the first

microscopic observations of individual cells. Recent proliferation of powerful methods

for interrogating single cells1-5 has allowed detailed characterization of this molecular

variation, and provided deep insight into characteristics underlying developmental

plasticity6,7, cancer heterogeneity8, and drug resistance9. In parallel, genome-wide

mapping of regulatory elements in large ensembles of cells has unveiled tremendous

variation in chromatin structure across cell-types, particularly at distal regulatory

regions10. Methods for probing genome-wide DNA accessibility, in particular, have

proven extremely effective in identifying regulatory elements across a variety of cell

types11, quantifying changes that lead to both activation and repression of gene

expression. Given this broad diversity of activity within regulatory elements when

comparing phenotypically distinct cell populations, it is reasonable to hypothesize that

heterogeneity at the single cell level extends to accessibility variability within cell types

at regulatory elements. However, the lack of methods to probe DNA accessibility within

individual cells has prevented quantitative dissection of this hypothesized regulatory

variation.

Single-cell ATAC-seq: a measure of chromatin accessibility genome-wide

We have developed a single-cell Assay for Transposase-Accessible Chromatin


3
Portions of this chapter were taken from Buenrostro et al. Single-cell chromatin accessibility
reveals principles of regulatory variation. Nature. 2015. doi:10.1038/nature14590.

(scATAC-seq), improving on the state-of-the-art12 sensitivity by >500-fold. ATAC-seq

uses the prokaryotic Tn5 transposase13,14 to tag regulatory regions by inserting sequencing

adapters into accessible regions of the genome. In scATAC-seq, individual cells are

captured and assayed using a programmable microfluidics platform (C1 single-cell Auto

Prep System, Fluidigm) with methods optimized for this task (Fig. 1a). After

transposition and PCR on the Integrated Fluidics Circuit (IFC), libraries are collected and

PCR amplified with cell-identifying barcoded primers. Single-cell libraries are then

pooled and sequenced on a high-throughput sequencing instrument. Using single-cell

ATAC-seq we generated DNA accessibility maps from 254 individual GM12878

lymphoblastoid cells. Aggregate profiles of scATAC-seq data closely reproduce

ensemble measures of accessibility profiled by DNase-seq and ATAC-seq generated from

10^7 or 10^4 cells, respectively (Fig. 1b,c). Data from single cells recapitulate several

characteristics of bulk ATAC-seq data, including fragment size periodicity corresponding

to integer multiples of nucleosomes, and a strong enrichment of fragments within regions

of accessible chromatin (Fig. S1a,b). Microfluidic chambers generating low library

diversity or poor measures of accessibility, which correlate with empty chambers or dead

cells, were excluded from further analysis (Fig. 1d). Chambers passing filter yielded an

average of 7.3 x 10^4 fragments mapping to the nuclear genome. We further validated the

approach by measuring chromatin accessibility from a total of 1,632 IFC chambers

representing 3 tier 1 ENCODE cell lines15 (H1 human embryonic stem cells [ESCs],

K562 chronic myelogenous leukemia and GM12878 lymphoblastoid cells) as well as

from V6.5 mouse ESCs, EML6 (mouse hematopoietic progenitor), TF-1 (human

erythroblast), HL-60 (human promyeloblast) and BJ fibroblasts (human foreskin

fibroblast).
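
The chamber filter described above can be sketched as follows. This is a minimal illustration, assuming the two cutoffs drawn in Fig. 1d (a library size of 10,000 fragments and 15% of fragments in open chromatin peaks) are applied as inclusive lower bounds. The `Chamber` record and the `passes_filter` and `filter_chambers` functions are hypothetical names introduced here and are not part of the published pipeline.

```python
from dataclasses import dataclass


@dataclass
class Chamber:
    """Summary statistics for one IFC chamber (hypothetical record type)."""
    name: str
    library_size: int          # estimated number of unique fragments
    fraction_in_peaks: float   # fraction of fragments in open chromatin peaks


def passes_filter(chamber, min_library_size=10_000, min_fraction_in_peaks=0.15):
    """Keep chambers at or above both cutoffs (inclusive bounds are an assumption)."""
    return (chamber.library_size >= min_library_size
            and chamber.fraction_in_peaks >= min_fraction_in_peaks)


def filter_chambers(chambers):
    """Return only the chambers that pass both cutoffs."""
    return [c for c in chambers if passes_filter(c)]
```

Chambers with low library diversity (for example, empty chambers) fail the first cutoff, and chambers with poor enrichment in peaks (for example, dead cells) fail the second.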

Cell-cell variability in trans

Because regulatory elements are generally present at two copies in a diploid

genome, we observe a near digital (0 or 1) measurement of accessibility at individual

elements within individual cells. For example, we estimate that a

total of 9.4% of promoters are represented in a typical scATAC-seq library (Fig. S1c-f).

The sparse nature of scATAC-seq data makes analysis of cellular variation at individual

regulatory elements impractical. We therefore developed an analysis infrastructure to

measure regulatory variation using changes of accessibility across sets of genomic

features (Fig. 2a,b and Fig. S2a-f). To quantify this variation we first choose a set of

open chromatin peaks, identified using the aggregate accessibility track, which share a

common characteristic (such as transcription factor binding motif, ChIP-seq peaks, cell

cycle replication timing domains, etc.). We then calculate the observed fragments in these

regions minus the expected fragments, downsampled from the aggregate profile, within

individual cells. To correct for bias, we divide this by the root mean square of fragments

expected from a background signal (BS) constructed to estimate technical and sampling

error within single-cell data sets. Herein, we refer to this metric as deviation. Finally,

for any set of features, we aggregate the deviation measurements across cells (Fig. 2b) to

obtain an overall variability score, a metric of excess variance over the background

signal.
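
The calculation described in this paragraph can be sketched in a few lines. This is a simplified illustration, not the implementation used for the analyses in this chapter. It assumes that the expected fragments for a cell are the aggregate fraction of fragments falling in the feature set scaled by that cell's total fragments in peaks, that the background signal is supplied as a list of comparison peak sets (their construction is described in the Supplemental Methods and is not reproduced here), and that the variability score is the root mean square of deviations across cells. The function names are hypothetical.

```python
import numpy as np


def deviations(counts, in_set, background_sets):
    """Per-cell deviation for one feature set.

    counts          : cells x peaks matrix of fragment counts
    in_set          : boolean mask over peaks defining the feature set
    background_sets : list of boolean masks used as the background signal
                      (assumed to be provided by the caller)
    """
    counts = np.asarray(counts, dtype=float)
    per_cell_total = counts.sum(axis=1)      # fragments in peaks, per cell
    aggregate = counts.sum(axis=0)           # aggregate accessibility per peak
    fraction = aggregate / aggregate.sum()   # share of fragments per peak

    def observed_minus_expected(mask):
        mask = np.asarray(mask, dtype=bool)
        observed = counts[:, mask].sum(axis=1)
        expected = per_cell_total * fraction[mask].sum()
        return observed - expected

    raw = observed_minus_expected(in_set)
    background = np.array([observed_minus_expected(m) for m in background_sets])
    # Root mean square over background sets, computed separately for each cell.
    rms = np.sqrt(np.mean(background ** 2, axis=0))
    return raw / rms


def variability(cell_deviations):
    """Aggregate deviations across cells (root mean square is an assumption)."""
    d = np.asarray(cell_deviations, dtype=float)
    return float(np.sqrt(np.mean(d ** 2)))
```

Under these assumptions, a feature set that behaves like the background yields a variability near 1, and larger values indicate excess cell-to-cell variance. The sketch does not guard against a background root mean square of zero.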

We first focused our analysis on K562 myeloid leukemia cells, a cell type with

extensive epigenomic data sets16,17. To comprehensively characterize variability

associated with trans-factors within individual K562 cells, we computed variability

across all available ENCODE ChIP-seq, transcription factor motifs and regions that

differed in replication timing (as determined from Repli-Seq data sets18) (Fig. 2c,d). We

found measures of cell-to-cell variability were highly reproducible across biological

replicates (Fig. S2g-i). As expected from proliferating cells, we find increased variability

within different replication timing domains, representing variable ATAC-seq signal

associated with changes in DNA content across the cell cycle. In addition, we discover a

set of trans-factors associated with high variability. These factors include sequence-

specific transcription factors (TFs), such as GATA1/2, JUN, and STAT2, and chromatin

effectors, such as BRG1 and P300. Immunostaining followed by microscopy or flow

cytometry (Fig. 2e and Fig. S3a-d) confirmed heterogeneous expression of GATA1 and

GATA2. Principal component (PC) analysis of single-cell deviations across all

trans-factors shows seven significant PCs, with PC 5 describing changes in DNA abundance

throughout the cell cycle. This analysis suggests that high-variance trans-factors are

variable independent of the cell-cycle (Fig. 2f and Fig. S3e-g). The remaining PCs show

contributions from several TFs, suggesting that variance across sets of trans-factors

represents distinct regulatory states in individual cells.
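
One way to decide how many principal components are significant is to compare the observed data with permuted data, as in Fig. 2f. The sketch below is illustrative only and assumes a specific permutation scheme that the text does not state: the values of each trans-factor are shuffled independently across cells, and an observed PC is counted as significant when its fraction of variance exceeds the largest first-PC fraction seen in any permutation. The function names and the default number of permutations are hypothetical.

```python
import numpy as np


def explained_variance(matrix):
    """Fraction of variance explained by each PC of a cells x factors matrix."""
    centered = matrix - matrix.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    return variance / variance.sum()


def count_significant_pcs(matrix, n_permutations=20, seed=0):
    """Count PCs that explain more variance than any permuted first PC."""
    matrix = np.asarray(matrix, dtype=float)
    rng = np.random.default_rng(seed)
    observed = explained_variance(matrix)
    null_max = 0.0
    for _ in range(n_permutations):
        # Shuffle each factor independently to break factor-factor covariance.
        permuted = np.column_stack([rng.permutation(col) for col in matrix.T])
        null_max = max(null_max, explained_variance(permuted)[0])
    return int(np.sum(observed > null_max)), observed
```

Shuffling within columns preserves the distribution of each factor while removing the shared structure across factors, so any PC exceeding the permuted threshold reflects coordinated variation.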

Trans-factors synergize to induce cell-cell variability

We hypothesized that variation associated with different trans-factors can

synergize, either through cooperative or competitive binding, to induce or suppress site-

to-site variability in chromatin accessibility. For example, the most variant factors in

K562 cells, GATA1 and GATA2, display expression heterogeneity and also bind an

identical consensus sequence (GATA), suggesting these factors may compete for access

to DNA sequences. In support of this hypothesis, we find regulatory elements with both

GATA1 and GATA2 ChIP-seq signals show increased variability in accessibility,

whereas sites with only GATA1 or GATA2 show substantially less variability (Fig. 2g

and Fig. S3h-i). In contrast, we find no substantial change in variability of GATA1

binding sites that co-occur with JUN or CEBPB. We also find peaks unique to GATA1

binding are significantly more accessible than peaks unique to GATA2 (Fig. S3j-k)

supporting the hypothesis that GATA1, an activator of accessibility, competes with

GATA2 to induce single-cell variability. Extending this analysis to all TF ChIP-seq data

sets revealed a trans-factor synergy landscape for accessibility variation (Fig. 2g). For

example, chromatin accessibility variance associated with GATA2 binding is

significantly enhanced when the same region could also be bound by GATA1, TAL1 or

P300. In contrast, CTCF, SUZ12, and ZNF143 appear to act as general suppressors of

accessibility variance, unless associated with proximal binding of ZNF143 or SMC3, the

latter a cohesin subunit involved in chromosome looping17,19. Thus, single cell

accessibility profiles nominate distinct trans-factors that, in combination, induce or

suppress cell-to-cell regulatory variation.
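
The comparison of co-occurring and independent binding sites starts from a partition of peaks into sites bound by one factor only, the other factor only, or both. A minimal sketch of that partition is shown below, assuming peaks are represented by hypothetical identifiers and that overlap is decided by shared identifier rather than by intersecting genomic intervals. Each resulting subset can then be scored for variability as a separate feature set.

```python
def partition_sites(peaks_a, peaks_b):
    """Split peaks into factor-A-only, factor-B-only, and shared sets."""
    a, b = set(peaks_a), set(peaks_b)
    return {"a_only": a - b, "b_only": b - a, "shared": a & b}
```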

Cell-state and chemical perturbation effects on cell-cell variability

To validate our ability to detect changes in accessibility variance, we used

chemical inhibitors to modulate potential sources of cell-cell variability. Inhibition of

cyclin-dependent kinases 4 and 6 (CDK4/6), essential components of the cell cycle,

caused a marked reduction of variability within peaks associated with DNA replication

timing domains (Repli-seq) (Fig. 3a). The addition of inhibitors of JUN or BCR-ABL

kinases (JNKi and Imatinib, respectively) increased G1/S-associated variability

suggesting an increase in the subpopulation of G1/S cells, which was validated with flow

cytometry (Fig. S4). JUN variability was one of the top changes caused by JNKi but not

Imatinib, suggesting that high-variance trans-factors can also be specifically and

pharmacologically modulated. Tumor necrosis factor (TNF) treatment of GM12878 cells

specifically modulated accessibility variability at NF-kB sites (Fig. 3b), consistent with

the known stochastic and oscillatory property of nuclear shuttling in this system20.

Together, these results show that variability can be experimentally modulated and further

demonstrate that variability is not solely dependent on the cell cycle.

We observe that trans-factors associated with high variability are generally cell

type specific. Hierarchical bi-clustering of single-cell deviations generated from three cell

lines reveals cell-type specific sets of transcription factor motifs associated with high

variability (Fig. 3c). This analysis also shows cells from different biological replicates

cluster with their cell type of origin (with a single exception), suggesting scATAC-seq

can also be used to deconvolve heterogeneous cellular mixtures. Systematic analysis of

all assayed cell types identified high-variance trans-factor motifs that are generally

unique to specific cell types (Fig. 3d). For example, regions associated with GATA TFs

are most variant in K562s while regions associated with master pluripotency TFs Nanog

and Sox2 are most variant in mouse embryonic stem cells (ESCs), consistent with

previous observations of expression variation of these factors21,22. Importantly, we also

find high variability of GATA1 and PU.1 (SPI1) binding accessibility in EML cells, a

cell type previously shown to have >200x GATA1 and >15x PU.1 expression differences

within clonal cellular subpopulations6. Interestingly, the complete set of identified high-

variance trans-factors contains a number of TFs previously reported to dynamically

localize into the nucleus, including NF-kB, JUN, and ETS/ERG20,23,24, suggesting that

temporal fluctuations in TF concentration may be driving observed chromatin

accessibility heterogeneity. Finally, we find BJ fibroblasts and HL-60s exhibit less

variance among this set of annotated trans-factor motifs, suggesting differences in the

global levels of trans-factor variability across cell lines. Overall these findings suggest

that trans-factors promote cell-type specific chromatin accessibility variation genome-

wide.

Single-cells vary in cis

Patterns of variation in accessibility along the linear genome in individual cells

reveal an unexpected connection to higher order chromosome folding. We calculated

single cell deviations within sliding windows across the genome, each encompassing a

fixed number of peaks (N=25) (Fig. 4a). We then determined which windows co-varied

within individual cells by calculating the co-correlation of each window across all others

within the same chromosome within individual cells. We then further enhanced this co-

correlation matrix using a secondary correlation analysis using methods similar to those

employed in chromosome conformation studies25. The resulting matrix, which identifies

pairs of positions in the genome where accessibility co-varies within individual cells,

yields Mb-scale correlation domains highly concordant with previously observed

chromatin domains26 (Fig. 4b-d) (R=0.61 for chromosome 1). These data provide

independent biological validation of large-scale compartmentalization of higher-order

chromatin structure25,26. Moreover, these results suggest that higher-order chromatin

interactions may drive regulatory variability in cis (elements that are close together tend

to be open together), and that ensemble chromosome conformation data may arise in part

from the statistical properties of single cell variation in co-regulated accessibility, a

hypothesis also supported by single-cell FISH measurements of interactions between

DNA loci27.
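
The cis analysis above can be sketched as two steps: summing per-peak signal into windows of a fixed number of peaks, then correlating windows across cells and correlating the resulting matrix with itself. The example below is a simplified illustration under stated assumptions. The input is a cells x peaks matrix of per-peak observed-minus-expected values ordered along one chromosome, the window step size is a free parameter (the text specifies only N=25 peaks per window), and the secondary correlation is taken as the Pearson correlation between rows of the first correlation matrix. Function names are hypothetical.

```python
import numpy as np


def window_sums(per_peak, width=25, step=25):
    """Sum per-peak values into windows of `width` consecutive peaks.

    per_peak : cells x peaks matrix, peaks ordered along the chromosome
    returns  : windows x cells matrix
    """
    per_peak = np.asarray(per_peak, dtype=float)
    n_cells, n_peaks = per_peak.shape
    # Cumulative sums with a leading zero column give window sums by subtraction.
    csum = np.concatenate(
        [np.zeros((n_cells, 1)), np.cumsum(per_peak, axis=1)], axis=1
    )
    starts = np.arange(0, n_peaks - width + 1, step)
    return (csum[:, starts + width] - csum[:, starts]).T


def doubly_correlated(windows):
    """Correlate windows across cells, then correlate the correlation matrix."""
    first = np.corrcoef(windows)   # windows x windows, co-variation across cells
    return np.corrcoef(first)      # secondary correlation sharpens domain structure
```

Windows that share a pattern of co-variation with the rest of the chromosome end up strongly positively correlated in the second matrix, which is what produces block-like domains.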

Discussion

Using scATAC-seq we dissected single-cell epigenomic heterogeneity and linked

cis- and trans- effectors to variability in accessibility profiles within individual

epigenomes. We identify trans-factors associated with increased accessibility variance,

which we call high-variance trans-factors. Additionally, other trans-factors such as

CTCF appear to buffer variability, perhaps by providing a stable anchor of chromatin

accessibility or insulator function that dampens potential fluctuations. Conversely,

co-occurrence with other factors such as P300 appears to amplify variability, perhaps due to

synergistic interactions. Lineage-specific master regulators are associated with cell-type

specific single-cell epigenomic variability across several cell types, suggesting that

control of single-cell variance is a fundamental characteristic of different biological

states. Finally, variation of chromatin accessibility in cis is highly correlated with

previously reported chromosome compartments, opening the intriguing possibility that

this component of epigenomic noise has its roots in higher-order chromatin organization.

Altogether, these data provide exciting new hypotheses about regulatory mechanisms that

give rise to single-cell heterogeneity.

We envision that future studies will enhance the utility of scATAC-seq by further

improving the recovery of DNA fragments, increasing throughput, and refining methods

of data analysis. Improvements to throughput and new statistical tools will enable single-

cells to be partitioned by cell-state and analyzed in aggregate to find the individual peaks

that drive variability (Fig. S5). In addition, we anticipate scATAC-seq may be paired

with existing approaches in microscopy and single-cell RNA-seq to provide opportunities

for systems analysis of individual cells. Such an approach will link regulatory variation to

details of phenotypic variation, promising new insight into the molecular underpinnings

of cellular heterogeneity. We believe scATAC-seq will likewise enable the interrogation

of the epigenomic landscape of small or rare biological samples allowing for detailed,

and potentially de novo, reconstruction of cellular differentiation or disease at the

fundamental unit of investigation: the single cell.

Chapter 4 - Figures and Figure Legends

Figure 1. Single-cell ATAC-seq provides an accurate measure of chromatin

accessibility genome-wide. (a) Workflow for measuring single epigenomes using

scATAC-seq on a microfluidic device (Fluidigm). (b) Aggregate single-cell accessibility

profiles closely recapitulate profiles of DNase-seq and ATAC-seq. (c) Genome-wide

accessibility patterns observed by scATAC-seq are correlated with DNase-seq data (R =

0.80). (d) Library size versus percentage of fragments in open chromatin peaks (filtered

as described in methods) within K562 cells (N=288). Dotted lines (15% and 10,000)

represent cutoffs used for downstream analysis.

Figure 2. Trans-factors are associated with single-cell epigenomic variability. (a)

Schematic showing two cellular states (TF high and TF low) leading to differential

chromatin accessibility. (b) Analysis infrastructure, which uses a calculated background

signal (BS; see Supplemental Methods section 3.2) to calculate TF deviations and

variability from scATAC-seq data. The TF value is calculated by subtracting the number

of expected fragments from the observed fragments per cell (see Supplemental Methods

section 3.1). (c) Observed cell-to-cell variability within sets of genomic features

associated with ChIP-seq peaks, transcription factor motifs, and replication timing (error

estimates shown in grey, see Methods for details). Variability measured from permuted

background (see Methods) is shown in grey dots. (d) Distribution of normalized

deviations from expected accessibility signal for GATA1 sites in individual cells,

histogram of cells shown in grey, density profile shown in purple (see Methods). (e)

Immunostaining of GATA1 (green) and GATA2 (red) shows protein expression in

K562s. (f) Principal components ranked by fraction of variance explained from observed

data (purple) and permuted data (orange). Bar plot of observed data shown in grey. (g)

Calculated changes in associated variability of factors when present together versus

independently, depicting a context-specific trans-factor variability landscape (see

Methods). Venn-diagrams show variability associated with GATA1 and/or GATA2 and

CTCF and/or SMC3 (co-) occurring ChIP-seq sites.

Figure 3. Cell type specific epigenomic variability. Change of cellular variability due

to chemical perturbations using (a) CDK4/6 cell-cycle inhibitor (K562) or (b) TNF-alpha

stimulation (GM12878), error bars (shown in grey) represent 1 standard deviation of

bootstrapped cells across the two conditions. (c) Heat map of deviations from expected

accessibility signal across trans-factors (rows) and of single cells (columns) from 3 cell

types. Bottom color map represents assignment classification from hierarchical

clustering. (d) Variability associated with trans-factor motifs across 7 cell types. Each

row is normalized to the maximum variability for that motif across cell types (shown

left).

Figure 4. Structured cis variability across single epigenomes. (a) Per-cell deviations

of expected fragments across a region within chromosome 1 (see Methods). For display,

only large deviation cells are shown (N=186 cells). (b) Pearson correlation coefficient

representing topological domain signal (see Methods) of interaction frequency from a

chromatin conformation capture assay (left, data from Kalhor et al.26) or doubly

correlated normalized deviations of scATAC-seq (right) from chromosome 1 (see

Methods). Data in white represents masked regions due to highly repetitive regions. (c)

Permuted cis-correlation map for chromosome 1 (analyzed identically to (b)). (d) Box

highlights a representative region depicting long-range covariability.

Supplemental Figure 1. scATAC-seq data recapitulate bulk assays. (a) Histogram of

aggregated read starts around all TSSs (in K562 cells) comparing ensemble approaches to

scATAC-seq shows high enrichment above background level of reads. (b) DNA fragment

size distribution of ATAC-seq fragments from single cells (grey) and the average of all

single cells (red) display characteristic nucleosome-associated periodicity. (c)

Accessibility across all peaks (n=50,000) in GM12878 cells. (d) Accessibility across all

annotated promoters in GM12878 cells. Typical promoters used for subsequent analysis

are boxed with dotted lines. Recovery of typical promoters shown in (a) within single-

cells within (e) observed data and (f) extrapolated data using measures of predicted

library complexity.

Supplemental Figure 2. scATAC-seq data analysis pipeline and validation of bias

normalization. Standard deviation of log fold change in reads across cells within peaks

binned by deciles of (a) peak intensity, (b) Tn5 bias and (c) GC bias. Variability scores

(incorporating bias normalization) within the same peaks shown in (a-c), peaks are

binned by deciles of (d) peak intensity, (e) Tn5 bias and (f) GC bias. (g-i) Observed

changes in variability comparing the merged set of replicates (K562) to each individual

biological replicate. Error bars represent 1 standard deviation of the variability scores

after bootstrapping cells from each replicate.

Supplemental Figure 3. Characterization of high-variance trans-factors in K562

cells. (a-d) Distribution of (a) GATA1, (b) GATA2, (c) actin and (d) CTCF fluorescence

observed by flow cytometry. Distributions in grey depict isotype controls. (e) Bi-

clustered heat map of single cell deviations as observed within K562 cells (N=239).

Labels on right identify co-clustering of related factors. (f) Bi-clustered heat map of

single-cell deviations observed from permuted data. (g) Projection of factor loadings onto

principal component 1 versus 5 from principal component (PC) analysis of heatmap from

Fig. 2d. Factor loadings do not vary along PC5, while peaks associated with regions with

different replication timings (RepliSeq) have strong variation along this axis.

Venn-diagrams showing variability of (h) GATA1 and/or GATA2, (i) CJUN and/or GATA2

and CEBPB and/or GATA2 (co-) occurring ChIP-seq sites. (j) Distribution of

accessibility among GATA1 only, GATA2 only, and shared sites. (k) Mean accessibility

from GATA1 only, GATA2 only, and shared sites in (j); error bars represent 1 standard

deviation generated by bootstrapping ChIP-seq peaks.

Supplemental Figure 4. Drug treatments modulate factor variability. (a-b) Change in

variability of untreated K562 cells versus cells treated with (a) Imatinib and (b) JUN

inhibitor show an increase in variability in factors associated with the cell cycle or S-phase

and JUN factors respectively. (c-f) Flow cytometry data depicting DNA content, using

DAPI or PI, in (c) control K562 cells or cells showing altered cell-cycle status after

treatment with (d) cell-cycle inhibitor, (e) Imatinib and (f) JUN inhibitor.

Supplemental Figure 5. Measurements of individual peaks within single-cells. (a)

The distribution of GATA1 deviation scores for single K562 cells. Volcano plots of (b)

non-GATA1 peaks and (c) GATA1 peaks in K562 cells, p-values were calculated using a

binomial test. (d) The distribution of NF-kB deviation scores for single GM12878 cells.

Volcano plots of (e) non-NFKB peaks and (f) NF-kB peaks in GM12878 cells, p-values

were calculated using a binomial test. Inset numbers show the number of points in upper

left or upper right quadrants of the panel. (g) Accessibility at a genomic locus, showing

(top) aggregate NFKB low (blue) and NFKB high (red) profiles, (middle) single

GM12878 cells ranked by NFKB deviation scores and (bottom) unranked single-cells.

References

1. Bendall, S. C. et al. Single-cell mass cytometry of differential immune and drug
responses across a human hematopoietic continuum. Science 332, 687-696 (2011).
2. Raj, A., Rifkin, S. A., Andersen, E. & van Oudenaarden, A. Variability in gene
expression underlies incomplete penetrance. Nature 463, 913-918 (2010).
3. Jaitin, D. A. et al. Massively parallel single-cell RNA-seq for marker-free
decomposition of tissues into cell types. Science 343, 776-779 (2014).
4. Smallwood, S. A. et al. Single-cell genome-wide bisulfite sequencing for assessing
epigenetic heterogeneity. Nat Meth 11, 817-820 (2014).
5. Zong, C., Lu, S., Chapman, A. R. & Xie, X. S. Genome-wide detection of single-
nucleotide and copy-number variations of a single human cell. Science 338,
1622-1626 (2012).
6. Chang, H. H., Hemberg, M., Barahona, M., Ingber, D. E. & Huang, S.
Transcriptome-wide noise controls lineage choice in mammalian progenitor cells.
Nature 453, 544-547 (2008).
7. Imayoshi, I. et al. Oscillatory control of factors determining multipotency and fate
in mouse neural progenitors. Science 342, 1203-1208 (2013).
8. Patel, A. P. et al. Single-cell RNA-seq highlights intratumoral heterogeneity in
primary glioblastoma. Science 344, 1396-1401 (2014).
9. Michor, F. et al. Dynamics of chronic myeloid leukaemia. Nature 435, 1267-1270
(2005).
10. Consortium, T. E. P. An integrated encyclopedia of DNA elements in the human
genome. Nature 489, 57-74 (2012).
11. Thurman, R. E. et al. The accessible chromatin landscape of the human genome.
Nature 489, 75-82 (2012).
12. Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J.
Transposition of native chromatin for fast and sensitive epigenomic profiling of
open chromatin, DNA-binding proteins and nucleosome position. Nat Meth 10,
1213-1218 (2013).
13. Goryshin, I. Y. & Reznikoff, W. S. Tn5 in vitro transposition. J. Biol. Chem. 273,
7367-7374 (1998).
14. Adey, A. et al. Rapid, low-input, low-bias construction of shotgun fragment
libraries by high-density in vitro transposition. Genome Biol 11, R119 (2010).
15. Consortium, T. E. P. A User's Guide to the Encyclopedia of DNA Elements
(ENCODE). PLoS Biol 9, e1001046 (2011).
16. Gerstein, M. B. et al. Architecture of the human regulatory network derived from
ENCODE data. Nature 489, 91-100 (2012).
17. Xie, D. et al. Dynamic trans-Acting Factor Colocalization in Human Cells. Cell
155, 713-724 (2013).
18. Hansen, R. S. et al. Sequencing newly replicated DNA reveals widespread
plasticity in human replication timing. Proceedings of the National Academy of
Sciences 107, 139-144 (2010).
19. Parelho, V. et al. Cohesins Functionally Associate with CTCF on Mammalian
Chromosome Arms. Cell 132, 422-433 (2008).
20. Tay, S. et al. Single-cell NF-kB dynamics reveal digital activation and analogue
information processing. Nature 466, 267-271 (2010).
21. Grün, D., Kester, L. & van Oudenaarden, A. Validation of noise models for single-
cell transcriptomics. Nat Meth 11, 637-640 (2014).
22. Singer, Z. S. et al. Dynamic Heterogeneity and DNA Methylation in Embryonic
Stem Cells. Molecular Cell 55, 319-331 (2014).
23. Cai, L., Dalal, C. K. & Elowitz, M. B. Frequency-modulated nuclear localization
bursts coordinate gene regulation. Nature 455, 485-490 (2008).
24. Levine, J. H., Lin, Y. & Elowitz, M. B. Functional roles of pulsing in genetic
circuits. Science 342, 1193-1200 (2013).
25. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions
reveals folding principles of the human genome. Science 326, 289-293 (2009).
26. Kalhor, R., Tjong, H., Jayathilaka, N., Alber, F. & Chen, L. Genome architectures
revealed by tethered chromosome conformation capture and population-based
modeling. Nat. Biotechnol. 30, 90-98 (2012).
27. Giorgetti, L. et al. Predictive Polymer Modeling Reveals Coupled Fluctuations in
Chromosome Conformation and Transcription. Cell 157, 950-963 (2014).

CHAPTER FIVE: The epigenomic determinants of human hematopoiesis⁴

Introduction

The entire human hematopoietic system is maintained by the activity of a small

number of self-renewing hematopoietic stem cells (HSCs). These cells are long-lived and

retain the ability to give rise to multiple distinct lineages of blood cells. During the course

of a single day, more than 200 billion blood cells are produced¹, highlighting the need for

tightly controlled regulatory programs that balance self-renewal of the apex stem cells

with downstream production of differentiated effector cells. Despite its functional

complexity, the hematopoietic system is the most extensively characterized adult stem

cell hierarchy, in which many diverse cell types can be isolated through the use of multi-

parameter fluorescence-activated cell sorting (FACS)². This enables interrogation of the

precise transcriptional dynamics that govern the cell state transitions associated with

differentiation and lineage commitment.

Genome-wide sequencing methods are exquisite sensors for assessing the

molecular determinants governing these distinct regulatory programs. Previous studies

have profiled gene expression patterns in mouse³⁻⁵ and human⁶,⁷ hematopoiesis, providing

a rich resource for characterizing these cellular states. However, measuring gene

expression alone provides limited information regarding the causative regulators of cell

identity. The dynamic expression of key transcription factors (TFs) can dramatically alter

the regulatory landscape, which defines the expression of nearby genes, and forms the

molecular basis of specialized regulatory programs. Genome-wide chromatin-based

assays measuring chromatin accessibility or chromatin-bound proteins are sensitive



⁴ Portions of this chapter were taken from Corces R & Buenrostro et al. Lineage-specific and
single cell chromatin accessibility charts human hematopoiesis and leukemia evolution.
Submitted.

methods for assaying cellular regulation. Importantly, chromatin accessibility measures

nucleosome-free or nucleosome-depleted sites throughout the genome, which demarcate

active regulatory elements and hotspots for TF binding. However, these assays require

millions of cells, hindering efforts to comprehensively catalogue hematopoietic

regulation and limiting their application to either cell lines⁸ or whole tissues⁹, which do

not accurately represent individual primary cell types. Recent developments have enabled

genome-wide chromatin immunoprecipitation⁵ or chromatin accessibility¹⁰ profiling in

rare cellular populations, which has successfully led to the identification of regulatory

elements within mouse hematopoiesis⁵. These remarkable improvements in speed and

efficiency have afforded the unique opportunity to provide such comprehensive and data-

rich measurements in human hematopoietic cells and provide a platform for

understanding the molecular underpinnings of human blood development and disease.

We previously described the Assay for Transposase Accessible Chromatin using

sequencing (ATAC-seq), a method capable of measuring chromatin accessibility in rare

cellular populations¹⁰. Here, we report the development of an improved ATAC-seq

protocol, optimized for human blood cells, that allows for more rapid high-quality

measurements with 10-fold fewer cells. We apply this optimized protocol to cells isolated

from 9 healthy human donors, studying 13 of the major cell types of normal

hematopoiesis. In addition, we measure the transcriptomes of the same healthy donors to

derive paired expression data. This atlas of normal human hematopoiesis provides a data-

rich resource for discovering the molecular determinants of human hematopoiesis and

allows for the deconvolution of complex biological data.

Identification of chromatin accessibility landscape in primary blood cells

To better understand the regulatory networks controlling human hematopoietic

differentiation and leukemogenesis, we sought to create a reference regulome and

transcriptome map of the normal hematopoietic hierarchy (Fig. 1a,b). Although ATAC-

seq is highly efficacious for a variety of cell sources, further optimizations were required

to profile rare primary human blood cells from cryopreserved specimens. This protocol,

termed Fast-ATAC, was optimized for use on primary blood cells and relies on a 1-step

membrane permeabilization and transposition using the lysis reagent digitonin. We found

that this simplified protocol provides extremely high quality data (Fig. S1a-c), requires

just 5,000 cells, offers an approximately 10-fold improvement in sensitivity, and reduces

the frequency of mitochondrial reads by ~5 fold (Fig. S1d). However, we note that

digitonin is a gentle detergent and may not be ideal for cell lines and other cell types that

are more resistant to lysis. Overall, Fast-ATAC provided hematopoietic epigenomes with

i) increased speed, ii) fewer cells and iii) lower cost, making it readily adaptable for

large-scale studies of rare cellular populations.

Using Fast-ATAC and RNA-seq, we profiled the chromatin accessibility

landscapes (regulomes) and transcriptomes of 13 distinct cellular populations isolated from

the human hematopoietic hierarchy via fluorescence-activated cell sorting (FACS) (Fig.

1a and Fig. S2). Cells were taken directly from donor bone marrow or peripheral blood

without further in vitro manipulation or treating donors with agents such as granulocyte

colony-stimulating factor (G-CSF). These analyses excluded mature granulocytes due to

high endogenous RNases and proteases as well as mature megakaryocytes, which proved

difficult to isolate in adequate cell numbers. The isolated cell populations included 7

unique stem and progenitor and 6 differentiated cell types spanning the myeloid,

erythroid, and lymphoid lineages²,¹¹⁻¹³. Altogether, we performed ATAC-seq and RNA-

seq on 3–4 adult donors for each cell population, totaling 49 transcriptomes and 77

regulomes (Fig. S1e).

With this dataset we identified a total of 590,650 hematopoietic accessible peaks.

Each individual cell type of the hematopoietic hierarchy displayed a set of uniquely

expressed genes and uniquely open peaks mapping to genes known to be involved in

cellular functions important for the given cell type (Fig. 1c and Fig. S1f,g). Additionally,

the sets of uniquely open peaks were enriched for motifs of transcription factors known to

be involved in the biological processes of the cell type of interest (Fig. S1h).

We found Fast-ATAC profiles to be highly reproducible between technical

(R=0.93, Fig. 1d) and biological (R=0.93, Fig. 1e) replicates. We also observed a

significant correlation (R=0.73) between Fast-ATAC and DNase-seq (data from the

Roadmap Epigenomics Consortium) of bone-marrow derived CD34+ HSPCs (Fig. 1f).

Importantly, we find that hematopoietic stem cells (HSCs), a CD34+ subpopulation, can

have significantly different chromatin profiles than the bulk CD34+ HSPC pool (R=0.77,

Fig. 1g), highlighting the value of analysis of highly purified stem and progenitor cell

subpopulations.
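
The replicate and cross-assay comparisons above are reported as correlation coefficients. As a minimal sketch, assuming that per-peak read counts are depth-normalized and log-transformed before a Pearson correlation is computed (the normalization, the pseudocount, and the function names `log_cpm` and `pearson` are illustrative assumptions, not details given in this chapter), such a comparison could look like:

```python
import math

def log_cpm(counts, pseudocount=1.0):
    """Depth-normalize raw per-peak counts to log2(counts per million + pseudocount)."""
    total = sum(counts)
    return [math.log2(c / total * 1e6 + pseudocount) for c in counts]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical read counts over the same six peaks in two replicates
rep1 = [120, 4, 0, 35, 980, 15]
rep2 = [240, 10, 2, 60, 2100, 28]
replicate_r = pearson(log_cpm(rep1), log_cpm(rep2))
print(f"replicate correlation R = {replicate_r:.2f}")
```

Normalizing by sequencing depth before the comparison keeps a replicate that was simply sequenced more deeply from appearing different.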

Chromatin accessibility at distal elements delineates the hematopoietic hierarchy

Paired regulome and transcriptome data provide a unique opportunity to

understand the regulatory networks of human hematopoiesis. Unsupervised hierarchical

clustering of our RNA-seq and ATAC-seq data shows robust classification of cell types

among technical and biological replicates (Fig. 2a,b). In this analysis, ATAC-seq appears

to be more adept at classifying cell types as quantified by the cluster purity¹⁴, suggesting

that chromatin accessibility is more cell type-specific and better captures cell identity.

Intriguingly, and in line with previous studies of murine cell subsets⁵, epigenomes and

transcriptomes also provide different conclusions about the relationship between various

cell types. By RNA-seq, the common myeloid progenitor (CMP) clusters with the

megakaryocyte erythroid progenitor (MEP) (Fig. 2c), whereas by ATAC-seq the CMP

clusters more closely with the HSC and the multipotent progenitor cell (MPP) (Fig. 2d),

which is more consistent with its role in the hematopoietic hierarchy. We reasoned that

inaccurate lineage classification by RNA-seq may be resolved by ATAC-seq, with the

latter revealing the cell-type specific and combinatorial logic of regulatory elements that

control the expression of nearby genes. When regulatory elements were subdivided into

gene promoters and putative distal enhancers (>1000 bp away from the closest TSS),

we find that distal enhancers provide significantly improved cell-type classification

compared to promoters and transcription profiles (Fig. 2e,f). Notably, we find that

promoter elements are largely invariant within CD34+ stem and progenitor cells,

suggesting that chromatin remodeling associated with these linked developmental lineage

decisions occurs predominantly within distal regulatory elements. This observation is

clearly illustrated by the region surrounding the TET2 gene, a gene expressed within a 2-

fold range in all cell types throughout the hematopoietic hierarchy. Despite the invariant

expression of TET2 and ubiquitous accessibility of the TET2 promoter, we find highly

diverse accessibility profiles within nearby distal regulatory elements, clearly

distinguishing HSPCs, NK cells, and T cells (Fig. 2g).
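
The promoter versus distal split described above depends only on the distance from each peak to the closest TSS. The following is a minimal sketch of that rule, assuming the distance is measured from the peak center (the chapter does not state the reference point) and using hypothetical names (`classify_peaks`, `tss_by_chrom`) and made-up coordinates:

```python
from bisect import bisect_left

def distance_to_nearest_tss(position, sorted_tss):
    """Distance in bp from a position to the closest TSS in a sorted, non-empty list."""
    i = bisect_left(sorted_tss, position)
    # Only the TSS just before and just after the position can be the closest
    candidates = sorted_tss[max(i - 1, 0):i + 1]
    return min(abs(position - t) for t in candidates)

def classify_peaks(peaks, tss_by_chrom, distal_cutoff=1000):
    """Label each (chrom, start, end) peak as 'promoter' or 'distal'.

    A peak is 'distal' when its center lies more than `distal_cutoff` bp
    from the closest annotated TSS on the same chromosome.
    """
    sorted_tss = {chrom: sorted(sites) for chrom, sites in tss_by_chrom.items()}
    labels = []
    for chrom, start, end in peaks:
        sites = sorted_tss.get(chrom)
        if not sites:
            labels.append("distal")  # no annotated TSS on this chromosome
            continue
        center = (start + end) // 2
        distance = distance_to_nearest_tss(center, sites)
        labels.append("distal" if distance > distal_cutoff else "promoter")
    return labels

# Hypothetical coordinates, for illustration only
tss = {"chr4": [105146000, 105300000]}
peaks = [("chr4", 105145800, 105146300), ("chr4", 105200000, 105200500)]
print(classify_peaks(peaks, tss))
```

Sorting the TSS positions once per chromosome and using a binary search keeps the lookup fast even for hundreds of thousands of peaks.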

Enhancer cytometry prospectively deconvolves complex cell populations

Given the accuracy with which regulatory landscapes delineate cell types, we

hypothesized that Fast-ATAC can be used to deconvolve highly complex cellular

populations into their constituent subsets. For instance, the Roadmap Epigenomics

Consortium has provided multiple datasets on heterogeneous tissues, and in particular,

mixtures of CD34+ HSPCs. These data are very useful for understanding the biology of

these cells; however, these tissues represent an ensemble average of multiple distinct cell

types. While some regulatory elements are ubiquitous among all HSPCs, others show

high cell type specificity (Fig. 3a). For example, the accessible site near micro-RNA 1915

shows a robust peak exclusively in CMP cells but shows almost no accessibility in the

CD34+ DNase-seq data. In fact, regulatory elements that are highly cell-type specific are

averaged out and difficult to detect in this bulk CD34+ data (Fig. 3a).

The highly cell type-specific nature of our ATAC-seq data enabled the

development of a strategy we term enhancer cytometry, wherein we enumerate the

frequency of cell types in complex cellular mixtures based on chromatin accessibility

data. To do this, we employ the deconvolution algorithm CIBERSORT¹⁴ to quantify the

contribution of each individual cell type to the ensemble profile. Analogous to flow

cytometry of cell surface markers, enhancer cytometry with CIBERSORT uses the

presence or absence of accessibility at tens of thousands of elements to match pre-defined

patterns of cell identity. To do this, we filtered for high-quality distal regulatory elements

and removed promoter signal (see methods) and applied CIBERSORT to define an array

of cell-type specific regulatory elements (Fig. 3b). CIBERSORT employs support vector

regression (SVR) for deconvolution, a method shown to be robust to noise, unknown

mixture content, and multicollinearity¹⁴. We validated this approach using leave-one-

out cross-validation and found that enhancer cytometry proved to be highly robust for

classification of all normal hematopoietic cell types (Fig. 3c,d). One exception is the

MPP, which showed reasonable but lower accuracy than other cell types. However, we note

that when MPP cells are misclassified, they are most frequently misclassified as HSCs,

their closest normal cell type. Next, we prospectively tested enhancer cytometry on bulk

CD34+ HSPCs and performed flow cytometry in parallel. We found that enhancer

cytometry yielded highly accurate enumeration of the constituent cell types when

compared to flow cytometry (R² = 0.95, Fig. 3e,f). Notably, this cell type deconvolution

was not as accurate without restriction to distal regulatory elements (R² = 0.91). In

addition, we found that enhancer cytometry can also be used to deconvolve CD34+

DNase-seq data (p < 0.001), suggesting that ATAC-seq with enhancer cytometry may be

a general strategy for identifying and counting cells within complex cellular mixtures.
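
The idea behind enhancer cytometry can be illustrated with a small deconvolution example. The analysis above uses CIBERSORT, which relies on support vector regression; the sketch below deliberately swaps in a simpler technique, non-negative least squares solved by projected gradient descent, so it shows the general principle rather than the method used in this work. The function name `deconvolve`, the toy signature matrix, and the mixture values are all hypothetical.

```python
def deconvolve(signature, mixture, n_iter=5000):
    """Estimate non-negative cell-type fractions f that minimize ||S f - m||^2.

    signature: one row per regulatory element, each row holding the
               accessibility of that element in every reference cell type.
    mixture:   accessibility of the same elements in the mixed sample.
    Returns fractions rescaled to sum to 1.
    """
    n_elements, n_types = len(signature), len(signature[0])
    # trace(S^T S) bounds the largest eigenvalue of S^T S, so 1/trace is a safe step size
    step = 1.0 / sum(v * v for row in signature for v in row)
    f = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        residual = [sum(s * w for s, w in zip(row, f)) - m
                    for row, m in zip(signature, mixture)]
        gradient = [sum(signature[i][j] * residual[i] for i in range(n_elements))
                    for j in range(n_types)]
        # Gradient step followed by projection onto the non-negative orthant
        f = [max(0.0, w - step * g) for w, g in zip(f, gradient)]
    total = sum(f)
    return [w / total for w in f] if total > 0 else f

# Toy signature: six elements (rows) by three reference cell types (columns)
signature = [
    [10, 0, 0],
    [9, 1, 0],
    [0, 8, 1],
    [0, 10, 0],
    [1, 0, 9],
    [0, 0, 10],
]
# Synthetic mixture built from fractions 0.5, 0.3 and 0.2 of the three types
mixture = [5.0, 4.8, 2.6, 3.0, 2.3, 2.0]
print([round(x, 3) for x in deconvolve(signature, mixture)])
```

Elements that are accessible in only one reference cell type make the columns of the signature matrix nearly orthogonal, which is why restricting to cell-type specific distal elements helps the estimate.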

Regulatory networks of normal hematopoiesis

To better understand the mechanisms governing these diverse regulatory

landscapes, we sought to quantify the effect of specific trans-factors at each

developmental transition. To do this we adapted a computational framework we

previously developed to measure accessibility across regulatory elements sharing a

common feature, such as a TF motif¹⁵. In brief, we classified hematopoietic regulatory elements

by their underlying transcription factor motifs and calculated a bias corrected deviation

score, which represents a differential gain or loss of accessibility across peaks sharing a

given motif for each transition in the hematopoietic hierarchy. We note that, unlike

current methods for TF footprinting¹⁶, this measure of TF accessibility is highly robust to

the number of sequenced reads, DNA sequence bias, and signal-to-noise bias. We

therefore chose this approach to measure the effect that a given TF motif exerts on the

accessible genome at each stage of hematopoiesis; for subsequent visualization, we

condensed similar motifs to create a non-overlapping list (see methods). We find TF

motifs such as GATA, RUNX, and SPI1 to be dominant regulators of chromatin

accessibility (Fig. 4a and Fig. S3a). Notably, these factors have also been previously

shown to be master regulators governing hematopoiesis¹⁷⁻¹⁹. We find that activation of

these TFs is highly cell-type specific, often displaying step-wise gains across

developmental lineages. This is exemplified by the GATA and PAX motifs, which

are strongly enriched in erythroid and lymphoid lineages, respectively (Fig. 4b,c). To

validate this approach for determining global TF motif regulators of cell identity, we

compared GATA TF footprints²⁰ between MEPs (GATA high) and common lymphoid

progenitors (CLPs) (GATA low) and found that CLPs had no detectable binding at

GATA sites when compared to MEPs (Fig. 4d). For further validation, we employed

PIQ²¹, a TF footprinting algorithm, and found drastically fewer GATA footprints in CLPs

compared to MEPs (N=173 and N=27,292, respectively), thus confirming our analytical

strategy for measuring TF binding.
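
To make the deviation score more concrete, the sketch below computes an uncorrected deviation for one motif: the observed reads in peaks carrying the motif are compared with the number expected from the pooled peak signal and each sample's sequencing depth. This is a simplified illustration under stated assumptions. It omits the bias correction against matched background peak sets that the framework described above applies, and the function name `raw_deviations` and the toy counts are hypothetical.

```python
def raw_deviations(counts, motif_peaks):
    """Uncorrected motif deviation, (observed - expected) / expected, per sample.

    counts:      dict mapping sample name -> list of per-peak read counts,
                 with peaks in the same order for every sample.
    motif_peaks: indices of the peaks that contain the motif of interest.
    """
    samples = list(counts)
    n_peaks = len(counts[samples[0]])
    depth = {s: sum(counts[s]) for s in samples}
    total_reads = sum(depth.values())
    # Share of all reads falling in each peak, pooled across samples
    peak_share = [sum(counts[s][i] for s in samples) / total_reads
                  for i in range(n_peaks)]
    motif_share = sum(peak_share[i] for i in motif_peaks)
    deviations = {}
    for s in samples:
        observed = sum(counts[s][i] for i in motif_peaks)
        expected = motif_share * depth[s]
        deviations[s] = (observed - expected) / expected
    return deviations

# Toy example: peaks 0 and 1 carry the motif and are more accessible in "MEP"
toy_counts = {"MEP": [30, 30, 20, 20], "CLP": [10, 10, 40, 40]}
print(raw_deviations(toy_counts, motif_peaks=[0, 1]))
```

Because the score aggregates reads over every peak sharing the motif, it stays stable at sequencing depths where footprints at individual sites would be too sparse to call.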

We reasoned that the accessibility of a given motif should correlate with the

expression of the associated transcription factor. However, the underlying motif sequence

does not identify the precise causative regulator of accessibility at those motif instances.

This is a common issue in epigenomic studies and particularly important for cases in

which many factors share identical or near-identical TF motifs. For example, the GATA

motif is shared among 6 TFs (GATA1-6), while the PAX motif is shared among 9 TFs. In

an effort to assign motifs to transcription factors, we integrated our ATAC-seq and RNA-

seq data to predict causative regulators of motif accessibility. To do this, we employed

CIS-BP²², a comprehensive database of in vitro and in silico derived motifs, to create an

association table linking hematopoietic TF motifs to 806 genes by motif similarity (Fig.

S3b-e). Next, we calculated correlation coefficients between the expression of all known TFs²³

and deviation scores across hematopoiesis. Using this approach we find a striking

correlation of motif usage with the expression of known master regulators of

hematopoiesis (Fig. 4e). For example, the expression levels of GATA1 and PAX5 are highly

correlated with accessibility at GATA and PAX motifs, respectively (R = 0.75, P = 10⁻¹⁸

and R = 0.88, P = 10⁻²³⁰, Fig. 4e-g and Fig. S3f). Interestingly, for some motifs, such as

the HOX motif, we find many putative regulators with weak correlations (N = 11; Fig.

S3g,h), suggesting that regulation of HOX accessibility is more complex. Together, these

results highlight the utility of a systems-level analysis of epigenome and transcriptome

data.
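
The motif-to-factor assignment described above amounts to ranking the TFs annotated to a motif by how well their expression tracks the motif's deviation score across cell types. The sketch below illustrates that ranking on invented numbers. The function name `rank_candidate_regulators`, the expression values, and the deviation values are hypothetical, and no significance testing is included.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def rank_candidate_regulators(motif_deviation, tf_expression):
    """Rank TFs sharing a motif by expression-to-deviation correlation.

    motif_deviation: deviation score of the motif in each cell type.
    tf_expression:   dict mapping TF name -> expression in the same cell types.
    Returns (TF, correlation) pairs, most positively correlated first.
    """
    scored = [(tf, pearson(expression, motif_deviation))
              for tf, expression in tf_expression.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Invented values across five cell types, for illustration only
gata_motif_deviation = [0.0, 0.1, 0.8, 1.5, -0.6]
candidate_expression = {
    "GATA1": [1.0, 2.0, 9.0, 15.0, 0.5],
    "GATA3": [5.0, 4.0, 1.0, 0.5, 9.0],
}
for tf, r in rank_candidate_regulators(gata_motif_deviation, candidate_expression):
    print(f"{tf}: R = {r:.2f}")
```

A strongly negative correlation is also informative, since it would point to a candidate negative regulator of accessibility at that motif.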

Accessibility profiles of purified cell populations identify the ontogeny of human

diseases

In addition to enhancing our understanding of developmental gene regulation, the

hematopoietic regulome can trace the ontogeny of activity in the noncoding genome that

impacts human disease. Many genome-wide association studies (GWAS) have linked

diseases to polymorphisms, but have not been able to pinpoint the cells responsible for

those phenotypes. By measuring the activity of regulatory elements that overlap regions

with predicted sites of functional variation from GWAS, it is now possible to more

accurately predict the specific cell types impacted by genetic variants linked to diverse

human diseases²⁴⁻²⁶. To do this we first filtered for GWAS that were significantly

enriched in hematopoietic cells (Fig. S4a,b; see methods), then calculated deviation

scores for each GWAS across the hematopoietic hierarchy as described above. We found

that each of these associations can be traced through the hematopoietic lineage to predict

the developmental point at which each variant may first exert its effects, thus enriching

our understanding of developmental origins of human disease (Fig. 4h-k and Fig. S4c).

As a positive control example, polymorphisms linked to mean corpuscular volume

(MCV), a measure of the average volume of an erythrocyte cell, are most strongly

enriched in erythroblasts (Fig. 4h). Intriguingly, many regions associated with MCV

polymorphisms first become accessible at the CMP stage and increase in accessibility in

MEP cells. These non-coding polymorphisms are predicted to affect transcription factor

binding and would, therefore, lead to closure of sites that would otherwise be accessible.

From this, MCV-associated polymorphisms found in the accessible regions of CMPs and

MEPs suggest that these polymorphisms exert their effects prior to full erythroid lineage

commitment. As a second example, polymorphisms associated with rheumatoid arthritis

(RA) show a strong enrichment in B cells (Fig. 4i). This association is consistent with the

known role of autoantibodies and pathogenic B cells in the pathogenesis of RA, as well

as the documented success of B cell depletion therapy in the treatment of RA²⁷,²⁸.

We find a more complex pattern in the disease alopecia areata, an autoimmune

disease in which hair is lost from some or all areas of the body. The autoimmunity

driving this disease has recently been associated with both innate and adaptive immune

responses²⁹, a result consistent with the enrichment of polymorphisms for alopecia areata

in both CD4+ and CD8+ T cells and monocytes (Fig. 4j). B cells also harbor many active

elements associated with alopecia areata but have not been studied in this disease,

suggesting a new direction of investigation. Importantly, the disease associations that are

highlighted by our data are not limited to diseases canonically associated with

hematopoietic cells; polymorphisms linked to Alzheimer's disease show a strong

enrichment in B cells and monocytes, two cell types that have predicted roles in the

pathogenesis of the disease²⁴,³⁰,³¹ (Fig. 4k).

Discussion

Here we report a rich resource charting the epigenomic and transcriptomic

landscape of 13 unique blood cell types. This resource relies on the accurate and precise

determination of the epigenomic landscapes in primary human blood cells, made possible

by Fast-ATAC. The chromatin accessibility profiles of blood cells are highly cell type

specific and allow for a much more robust classification system than more frequently

used transcriptional profiles. Unsupervised clustering of accessible chromatin regions,

specifically distal enhancers, groups individual cell types with extremely high cluster

purity, demonstrating that these distal regulatory elements more precisely define cell

identity and developmental trajectory. Enhancer cytometry harnesses this specificity and

proves to be a useful strategy to navigate regulome data. By matching patterns of distal

element accessibility to known profiles of pure cell types, enhancer cytometry

enumerates the frequencies of pure cell types in complex cell mixtures. This technique

enabled the accurate deconvolution of data derived from CD34+ bone marrow cells into

the constituent, highly similar HSPC cell types. Flow cytometry has become a standard

technique, but it is typically limited to a handful of cell surface markers, each requiring a

different antibody that may have off-target binding and gating idiosyncrasies. In contrast,

enhancer cytometry employs a universal probe system (a transposase) to simultaneously

interrogate hundreds of thousands of regulatory elements, empowering an extremely

robust classification system. An important limitation of enhancer cytometry is that the

method destroys the cell as the measurements are made, and thus does not permit

prospective cell purification at present. We note that while we have used well-

characterized cell types with known cell surface immunophenotypes to generate pure cell

type reference maps, single cell ATAC-seq with enhancer cytometry may be used as an

unbiased measure of cell type identity within a population providing archetypal cell

profiles within complex cellular populations. In principle, this general approach may be

used to resolve cellular heterogeneity in any tissue or organism.

This atlas of human hematopoiesis enriches the interpretation of GWAS results in

several ways. First, we identify strong associations of disease-linked polymorphisms with

the open chromatin landscapes of specific hematopoietic cell types, notably the

developmental contexts in which the disease-relevant elements first become active. In the

case of mean corpuscular volume, a measurement of the size of red blood cells, the

strongest association occurs in erythroblast cells, but a significant association can be seen

as early as the common myeloid progenitor stage (CMP). These results are consistent

with the concept that many enhancers are developmentally primed prior to their

activation following cell differentiation⁵. Given our in-depth characterization of known

human HSPC subtypes, we are able to identify the earliest progenitor cells that may be

relevant in the pathogenesis of specific diseases and elucidate putative targets for

corrective action. It is now well accepted that effective genetic correction of coding

mutations needs to take place in the stem cell compartment - e.g. the HSC in blood or

basal cells in epithelia - in order to achieve long lasting phenotypic correction in the

tissue. The same logic applies to genetic variants in the noncoding genome and suggests

the need to map the developmental ontogeny of regulatory elements. Comprehensive and

cell type-specific regulome maps will help to nominate hypotheses of relevant cell types

in diseases.

Lastly, this resource provides a platform to identify specific trans-acting

regulators that drive blood cell identity and function. Integration of ATAC-seq and RNA-

seq data improves motif-transcription factor pairing and enables the accurate

determination of causative regulators of chromatin accessibility throughout hematopoietic

differentiation. We anticipate this combined data set, which represents a dynamic

developmental process, to be a rich resource for continued efforts to build computational

tools that model both cis³² and trans³³ determinants of chromatin accessibility and gene

expression.

Chapter 5 - Figures and Figure Legends

Figure 1. Interrogation of chromatin landscapes in primary blood cells. (a)

Schematic of the human hematopoietic hierarchy shows the 13 primary cell types

analyzed in this work. Granulocytes and megakaryocytes were excluded. (b) Diagram of

analyses performed using paired ATAC-seq and RNA-seq data in both primary human

blood cells and primary patient AML cells. (c) Normalized ATAC-seq profiles at

developmentally important genes. Profiles represent the union of all technical and

biological replicates for each cell type. See Supplementary Table 1 for the exact number

of technical and biological replicates for each cell type. (d-g) Scatter plot showing

correlation of (d) technical replicates, (e) different human donors, (f) ATAC-seq and

DNase-seq data derived from CD34+ HSPCs, and (g) ATAC-seq HSCs with bulk CD34+

HSPCs.

Figure 2. Distal regulatory elements enable accurate classification of the

hematopoietic hierarchy. (a,b) Hierarchical clustering of (a) RNA-seq (N=49) and (b)

ATAC-seq (N=77) data from all biological replicates of 13 normal hematopoietic cell

types. Values shown are Pearson correlation coefficients. Cluster purity quantifies the

degree that cells of the same lineage (color coded in the key) are clustered together. (c,d)

Phylogenetic dendrograms of (c) RNA-seq and (d) ATAC-seq data showing inter-cell

type correlations derived from aggregate averages of all biological and technical

replicates. Length of tree branches represents Euclidean distance. Data represents the

union of all technical and biological replicates for each cell type. (e,f) Hierarchical

clustering of ATAC-seq profiles (N=77) mapping to (e) promoters and (f) distal

regulatory elements. (g) ATAC-seq peaks in the TET2 locus show highly variable distal

regulatory landscapes (left) and relatively constitutive expression of TET2 (right). Data

represents the union of all technical and biological replicates for each cell type.

Figure 3. Enhancer cytometry allows for deconvolution of the hematopoietic

hierarchy. (a) Normalized ATAC-seq profiles of HSPC subsets and ensemble CD34+

HSPC DNase-seq profiles illustrating heterogeneity amongst CD34+ HSPC

subpopulations. Predicted cell fractions are shown on the left and nearest annotated genes

are shown on the bottom. (b) Schematic of enhancer cytometry, including methods to

define a signature matrix of highly cell-type specific enhancers (right panel, N=735).

(c,d) Benchmarking of enhancer cytometry using randomly permuted synthetic mixtures

to test robustness to (c) sequential subtraction and (d) randomized mixture content. Test

data and training data are non-overlapping. Error bars in (c) represent the standard

deviation of 100 random permutations. (e) Enhancer cytometry of ATAC-seq data

derived from FACS sorted bulk CD34+ HSPCs identifies fractional contribution from all

expected cell types. (f) Correlation of predicted fractional contribution of each HSPC cell

type by enhancer cytometry versus flow cytometric ground truth data of input CD34+

cells.

Figure 4. Integrative analysis of the hematopoietic regulome refines transcriptional

circuitry driving cell specification and enriches the understanding of human disease.

(a) Transcription factor dynamics showing major TFs driving hematopoietic regulomes.

The size of the circle represents the effect of that motif in driving accessibility in human

blood cells. The relative distance between circles represents the co-occurrence of motifs

throughout hematopoietic differentiation (see methods). (b,c) Usage of the (b) GATA and

(c) PAX motif throughout hematopoietic differentiation. Values represent the relative

deviation of the motif accessibility, a measure of motif usage, compared to that in HSCs.

(d) Footprint analysis of the GATA motif in MEP and CLP cells. (e) Correlation

(Pearson) of motif accessibility and significance of gene expression for GATA (top) and

PAX (bottom). Red dots represent DNA-binding factors annotated to bind the given

motif, gray dots represent all other DNA-binding factors. (f,g) Expression of (f) GATA1

and (g) PAX5 phenocopies the usage of the GATA and PAX motifs, respectively, throughout hematopoietic

differentiation. (h-k) Relative deviation scores of chromatin accessibility within

hematopoietic regulatory elements with GWAS SNPs for (h) mean corpuscular volume,

(i) rheumatoid arthritis, (j) alopecia areata, and (k) Alzheimer's disease.

Supplementary Figure 1. Data processing pipelines. (a) ATAC-seq insert size

distribution for three biological replicates of HSCs. (b,c) Enrichment of signal at

annotated transcription start sites (TSS) from Fast-ATAC data compared to (b) DNase-

seq and (c) previously published ATAC-seq data using the original ATAC-seq protocol¹⁰.

(d) Fraction of total mitochondrial reads derived from the original ATAC-seq protocol

and the Fast-ATAC protocol. (e) Accessible chromatin landscapes surrounding a

constitutively accessible region of the genome. Profiles represent the union of all

technical and biological replicates for each cell type. (f,g) GO Term analyses from unique

(f) gene expression and (g) accessible peaks from normal hematopoietic cells. (h)

Enrichment of developmentally relevant motifs in accessible peaks.

Supplementary Figure 2. Cell sorting strategies. (a) Representative examples of

sorting strategies for the seven CD34+ HSPC populations isolated in this study.

Supplementary Figure 3. Trans regulators of hematopoiesis. (a) Summary of motif

deviations across hematopoiesis normalized by maximum and minimum signal. Scale is

represented above each column. (b) Clustering of hematopoiesis TF motifs (N=46) with

CIS-BP motifs (N=806) using Pearson correlation (see methods). (c,d) Example of

clustered motifs for (c) GATA4 and (d) MEIS1. (e) Histogram of all correlation values

shown in (b) with lists of putative hematopoietic regulators highlighted (N=255). (f)

Correlation of motif deviations to gene expression changes in hematopoiesis for two

developmentally important TFs, GATA1 and PAX5. (g,h) Summary list of putative TF

(g) positive and (h) negative regulators of hematopoiesis. Motifs are listed on the left and

genes are listed on the right. Values represent correlation coefficients (Pearson).

Supplementary Figure 4. GWAS enrichments across hematopoiesis. (a)

Representative example of GWAS enrichment across tissues (see methods). Colors as

shown in (b). (b) Hierarchical clustering of all GWAS (N=235) across diverse tissues. (c)

Summary of GWAS deviations across hematopoiesis normalized by maximum and

minimum signal.

References

1. Quesenberry, P. J. & Colvin, G. A. Hematopoietic Stem Cells, Progenitor Cells,
and Cytokines. In Williams Hematology. 153–174 (McGraw-Hill, 2005).
2. Seita, J. & Weissman, I. L. Hematopoietic stem cell: self-renewal versus
differentiation. Wiley Interdisciplinary Reviews: Systems Biology and Medicine 2,
640–653 (2010).
3. Ji, H. et al. Comprehensive methylome map of lineage commitment from
haematopoietic progenitors. Nature 467, 338–342 (2010).
4. Inlay, M. A. et al. Ly6d marks the earliest stage of B-cell specification and
identifies the branchpoint between B-cell and T-cell development. Genes and
Development 23, 2376–2381 (2009).
5. Lara-Astiaso, D. et al. Chromatin state dynamics during blood formation. Science
55, 1–10 (2014).
6. Chen, L. et al. Transcriptional diversity during lineage commitment of human
blood progenitors. Science 345, 1251033 (2014).
7. Novershtern, N. et al. Densely interconnected transcriptional circuits control cell
states in human hematopoiesis. Cell 144, 296–309 (2011).
8. Bernstein, B. E. et al. An integrated encyclopedia of DNA elements in the human
genome. Nature 489, 57–74 (2012).
9. Consortium, R. E. et al. Integrative analysis of 111 reference human epigenomes.
Nature 518, 317–330 (2015).
10. Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J.
Transposition of native chromatin for fast and sensitive epigenomic profiling of
open chromatin, DNA-binding proteins and nucleosome position. Nat Meth 10,
1213–1218 (2013).
11. Majeti, R., Park, C. Y. & Weissman, I. L. Identification of a hierarchy of
multipotent hematopoietic progenitors in human cord blood. Cell Stem Cell 1,
635–45 (2007).
12. Manz, M. G., Miyamoto, T., Akashi, K. & Weissman, I. L. Prospective isolation of
human clonogenic common myeloid progenitors. Proceedings of the National
Academy of Sciences of the United States of America 99, 11872–11877 (2002).
13. Kohn, L. A. et al. Lymphoid priming in human bone marrow begins before
expression of CD10 with upregulation of L-selectin. Nature Immunology 13, 963–
971 (2012).
14. Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression
profiles. Nat Meth 12, 1–10 (2015).
15. Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of
regulatory variation. Nature 523, 486–490 (2015).
16. He, H. H. et al. Refined DNase-seq protocol and data analysis reveals intrinsic bias
in transcription factor footprint identification. Nat Meth 11, 73–78 (2013).
17. Weiss, M. J. & Orkin, S. H. GATA transcription factors: key regulators of
hematopoiesis. Experimental Hematology 23, 99–107 (1995).
18. Burns, C. E., Traver, D., Mayhall, E., Shepard, J. L. & Zon, L. I. Hematopoietic
stem cell fate is established by the Notch-Runx pathway. Genes & development 19,
233142 (2005).

19. Nerlov, C. & Graf, T. PU.1 induces myeloid lineage commitment in multipotent
hematopoietic progenitors. Genes & development 12, 2403-2412 (1998).
20. Neph, S. et al. An expansive human regulatory lexicon encoded in transcription
factor footprints. Nature 489, 83-90 (2012).
21. Sherwood, R. I. et al. Discovery of directional and nondirectional pioneer
transcription factors by modeling DNase profile magnitude and shape. Nat.
Biotechnol. 32, 171-8 (2014).
22. Weirauch, M. T. et al. Determination and Inference of Eukaryotic Transcription
Factor Sequence Specificity. Cell 158, 1431-1443 (2014).
23. Vaquerizas, J. M., Kummerfeld, S. K., Teichmann, S. A. & Luscombe, N. M. A
census of human transcription factors: function, expression and evolution. Nature
Reviews Genetics 10, 252-263 (2009).
24. Gjoneska, E. et al. Conserved epigenomic signals in mice and humans reveal
immune basis of Alzheimer's disease. Nature 518, 365-369 (2015).
25. Farh, K. K.-H. et al. Genetic and epigenetic fine mapping of causal autoimmune
disease variants. Nature 518, 337-343 (2015).
26. Maurano, M. T. et al. Systematic Localization of Common Disease-Associated
Variation in Regulatory DNA. Science 337, 1190-1195 (2012).
27. De Vita, S. et al. Efficacy of selective B cell blockade in the treatment of
rheumatoid arthritis: evidence for a pathogenetic role of B cells. Arthritis &
Rheumatology 46, 2029-33 (2002).
28. Coenen, M. J. H. & Gregersen, P. K. Rheumatoid arthritis: a view of the current
genetic landscape. Genes and Immunity 10, 101-111 (2009).
29. Petukhova, L. et al. Genome-wide association study in alopecia areata implicates
both innate and adaptive immunity. Nature 466, 113-117 (2010).
30. Butovsky, O., Kunis, G., Koronyo-Hamaoui, M. & Schwartz, M. Selective
ablation of bone marrow-derived dendritic cells increases amyloid plaques in a
mouse Alzheimer's disease model. European Journal of Neuroscience 26, 413-416
(2007).
31. El Khoury, J. et al. Ccr2 deficiency impairs microglial accumulation and
accelerates progression of Alzheimer-like disease. Nat Med 13, 432-438 (2007).
32. González, A. J., Setty, M. & Leslie, C. S. Early enhancer establishment and
regulatory locus complexity shape transcriptional programs in hematopoietic
differentiation. Nature Genetics (2015).
33. Whitaker, J. W., Chen, Z. & Wang, W. Predicting the human epigenome from
DNA motifs. Nat Methods 12, 265-272 (2015).

CHAPTER SIX The regulatory landscape of acute myeloid leukemia5

Introduction

Dysregulation of the intricate regulatory networks of the hematopoietic system

has been shown to play a critical role in the development of hematologic malignancies1.

Despite a low overall mutation rate2 and prolonged periods between cell divisions3, the

long lifespan of HSCs makes them susceptible to the accumulation of mutations over

time. Recent work4-8 has demonstrated that HSCs constitute a cellular reservoir for

mutation acquisition that plays a causative role in multiple hematopoietic malignancies.

In particular, in the case of acute myeloid leukemia (AML), HSCs isolated from leukemia

patients have been shown to harbor some but not all of the genetic alterations found in the

frankly leukemic cells and have, therefore, been termed pre-leukemic HSCs. Importantly,

many of the genes found to be recurrently mutated during the pre-leukemic phase of

AML have been shown to regulate the epigenome5,6 such as DNA methyltransferase 3A

(DNMT3A)9, ten-eleven translocated 2 (TET2)10, and isocitrate dehydrogenase 1 and 2

(IDH1/2)11,12. However, the role of these epigenetic mutations during the evolutionary

process of leukemogenesis and their effects on the regulatory networks that govern

normal hematopoiesis remain poorly understood. Notably, longstanding debates have

centered on how cell fate choices are corrupted in human leukemias13: whether leukemic

cells truly harbor multiple lineage-specific regulatory programs at once (termed "lineage

infidelity") or merely maintain bipotential progenitor states that normally exist in

development (termed "lineage promiscuity"). These are fundamental issues that may be resolved

by modern epigenomic technologies.



5 Portions of this chapter were taken from Corces R & Buenrostro et al. Lineage-specific and
single cell chromatin accessibility charts human hematopoiesis and leukemia evolution.
Submitted.

With hematopoiesis as a reference of normal development, we measure the effects

on the leukemogenic process of both early mutations in epigenetic modifiers and late

mutations in proliferative oncogenes, providing the first characterization of the full

evolutionary process of leukemogenesis. Through direct comparison of hematopoietic

cells isolated from normal human bone marrow and patient-matched pre-leukemic HSCs,

leukemia stem cells, and leukemic blast cells, we chart the genetic and epigenetic

progression from normal to malignant in AML. We demonstrate that the vast majority of

epigenetic and transcriptomic change that occurs during leukemogenesis is derived from

normal hematopoietic differentiation. Moreover, diverse genetic mutations can lead to

similar epigenetic alterations, suggesting a common path for the leukemogenic process.

Our results provide key insights into the evolutionary process of leukemogenesis and

identify important transcriptional programs that could be targeted to disrupt this process

during its earliest stages. In summary, this work serves as a rich resource for the study of

regulatory dynamics in normal and malignant hematopoiesis.

Leukemogenesis and cancer evolution in AML

We sought to characterize the evolution of AML, one of the most aggressive

hematopoietic malignancies14, in the context of normal hematopoiesis. To this end, we

first identified three distinct stages of AML evolution: pre-leukemic HSCs (pHSCs),

leukemia stem cells (LSCs), and leukemic blast cells (blasts). Each of these leukemic cell

populations can be enriched based on immunophenotype via FACS purification.

Unmutated HSCs serve as the reservoir for mutation acquisition during the early phases

of leukemogenesis (Fig. 1a). Acquisition of mutations, typically in genes that regulate the

epigenome, creates pHSCs that expand to create a pre-leukemic clone. Subsequent

acquisition of progressor mutations, typically in genes that lead to increased proliferation,

generates LSCs that are capable of self-renewal and the production of AML blasts (Fig.

1a).

Importantly, the population of HSCs isolated from leukemia patients by FACS

represents a heterogeneous mixture of healthy unmutated HSCs and pre-leukemic HSCs.

To quantify this heterogeneity, we define the pre-leukemic burden as the percentage of

HSCs isolated from a leukemia patient that harbor at least the first mutation. We profiled

the mutation frequency of known leukemogenic driver mutations in HSCs, T cells, and

blast cells from 39 AML patients. Pre-leukemic burden is highly variable in this cohort

with some patients exhibiting a complete repopulation of the HSC compartment with pre-

leukemic cells and others exhibiting undetectable levels of pre-leukemic mutations (Fig.

1b). The pre-leukemic mutations found in this large cohort recapitulate previous

findings5,6 showing that early mutations tend to occur in genes that modify the epigenome

while later mutations occur in genes involved in activated signal transduction.
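
The pre-leukemic burden defined above can be illustrated with a short sketch. This is not the pipeline used in this work: it assumes a diploid locus, so that a heterozygous mutation at variant allele frequency (VAF) v is present in a fraction 2v of cells and a bi-allelic mutation in a fraction v, and it takes the burden to be the largest such fraction among the mutations detected in a patient's HSCs. The function names and the example VAFs are hypothetical; the 20% cutoff is the one given in the figure legends.

```python
def percent_cells_mutated(vaf, biallelic=False):
    """Estimate the percent of cells carrying a mutation from its VAF (0 to 1).

    Assumption: diploid locus. A heterozygous mutation contributes one mutant
    allele per cell; a bi-allelic mutation contributes two.
    """
    if not 0.0 <= vaf <= 1.0:
        raise ValueError("VAF must lie between 0 and 1")
    mutant_alleles_per_cell = 2 if biallelic else 1
    fraction = min(1.0, 2.0 * vaf / mutant_alleles_per_cell)
    return 100.0 * fraction


def preleukemic_burden(hsc_mutations):
    """Percent of HSCs harboring at least the first (most prevalent) mutation.

    hsc_mutations: list of (vaf, biallelic) tuples measured in sorted HSCs.
    """
    if not hsc_mutations:
        return 0.0
    return max(percent_cells_mutated(v, b) for v, b in hsc_mutations)


def classify_burden(burden_percent, threshold=20.0):
    """Label a patient as 'high' or 'low' burden at the 20% cutoff."""
    return "high" if burden_percent >= threshold else "low"


# Hypothetical patient: two heterozygous mutations at 15% and 4% VAF in HSCs.
burden = preleukemic_burden([(0.15, False), (0.04, False)])
print(round(burden, 1), classify_burden(burden))
```

Taking the maximum across mutations reflects the definition in the text (cells carrying at least the first mutation); a clonal-ordering analysis would be needed to confirm which mutation is actually first.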

AML represents a cooption of normal myelopoiesis

The AML leukemogenic process provides a novel system to study the genesis and

evolution of cancer at the level of the epigenome through the lens of normal

hematopoiesis. We performed Fast-ATAC and compared the chromatin accessibility

landscapes of patient-matched pHSCs, LSCs, and blasts. The optimized Fast-ATAC

protocol produced robust accessibility profiles from cryopreserved primary patient AML

cells (Fig. 1c). This allowed us to quantify the heterogeneity exhibited among the

different stages in leukemia evolution. We find that the level of epigenetic variance

between all samples of the same cell type increases through progressive stages of

leukemia evolution (Fig. 1d, see methods). As expected, all AML cell types exhibit more

inter-donor variance than normal hematopoietic cells. This may be the consequence of

the epigenetic mutations present in the leukemic cell types or a manifestation of the point

along the normal hematopoietic hierarchy at which the particular AML cell types exist.

Indeed, key developmentally-associated genes such as GATA2 and CEBPB show

variation amongst the AML cell types consistent with different developmental stages

(Fig. 1e). When overlaid across the principal components derived from normal

hematopoiesis, we find that the first four principal components from normal

hematopoietic differentiation account for 60% of the variation observed in our leukemia

samples (Fig. 1f). Assigning a score to the myeloid differentiation component of our data,

we find that the various stages of AML spread across the trajectory from HSC to

monocyte, indicating that the process of leukemogenesis largely mirrors the process of

normal myelopoiesis (Fig. 1g). Consistent with their functional ability to produce both

lymphoid and myeloid cells in xenotransplantation assays6,15,16, pHSCs are most closely

related to HSCs and MPPs (Fig. 1g). As shown previously17, LSCs show strong similarity

to GMP and LMPP cells and leukemic blast cells show a wider distribution with less

differentiated blasts clustering with GMP cells and more differentiated blasts clustering

with monocyte cells18,19 (Fig. 1g). These results indicate that the majority of inter-patient

variation in AML is derived from the developmental position along the normal myeloid

differentiation trajectory where each leukemia has arrested.
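
The statement that normal principal components account for a given share of leukemic variation can be made concrete with a small sketch. It is an illustration under stated assumptions, not the analysis code used here: the principal axes are assumed to be orthonormal vectors already learned from normal hematopoiesis, the leukemia samples are centered on the mean of the normal reference, and the function name cumulative_variance_explained is hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cumulative_variance_explained(samples, reference_mean, components):
    """Fraction of variation in `samples` captured by the first N reference axes.

    samples:        accessibility vectors, one per leukemia sample
    reference_mean: per-peak mean of the normal reference data
    components:     orthonormal principal axes learned from the normal data
    Returns one cumulative fraction per component, in order.
    """
    centered = [[x - m for x, m in zip(s, reference_mean)] for s in samples]
    total = sum(dot(row, row) for row in centered)
    if total == 0:
        raise ValueError("samples show no variation around the reference mean")
    cumulative, running = [], 0.0
    for axis in components:
        # Squared projection onto this axis, summed over all samples.
        running += sum(dot(row, axis) ** 2 for row in centered)
        cumulative.append(running / total)
    return cumulative


# Toy example: three peaks, two samples, two reference axes.
toy_samples = [[3.0, 4.0, 5.0], [-3.0, -4.0, -5.0]]
toy_axes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(cumulative_variance_explained(toy_samples, [0.0, 0.0, 0.0], toy_axes))
```

Because the axes come from normal cells only, the remainder (one minus the final value) is the variation that normal differentiation does not describe.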

AML cell types exhibit lineage infidelity with regulatory contributions from multiple

normal blood cell types

These intermediate positions across myelopoiesis suggest that each patient-

specific AML might harbor a unique collection of multiple distinct normal regulatory

programs. Using enhancer cytometry, we quantified the contribution of each normal cell

type for each leukemic sample assayed (Fig. 2a). We found that each patient, at each

stage of leukemogenesis, harbors multiple distinct regulatory networks contributing to the

epigenetic diversity of leukemic cell types. Importantly, we find that the majority of the

patient donors have AML blasts that are clonally derived and harbor all the leukemic

mutations at comparable allele frequencies. Together, these findings raise the intriguing

possibility that AML cell types may either i) exist in stable intermediate cell states that

are not normally maintained during normal hematopoiesis, or ii) show developmental

heterogeneity within individual clonally derived cells. Traditional ensemble genome-

wide approaches for measuring regulatory elements average over cellular states and

cannot distinguish between these two hypotheses; however, we recently developed

single-cell ATAC-seq (scATAC-seq)20 and reasoned that scATAC-seq with enhancer

cytometry would be able to resolve these two pressing hypotheses (Fig. 2b).

To discriminate between these two possibilities, we performed scATAC-seq on

purified LSCs and blast cells from patient SU070. Although CIBERSORT could

accurately deconvolve bulk populations, we found that individual regulatory elements

within single cells often contained 0, 1 or 2 fragments, consistent with our previous

work20, and were simply too sparse for existing deconvolution methods such as

CIBERSORT. Rather than relying on individual regulatory elements, we reasoned that

principal component analysis (PCA) of the regulome, learned from normal bulk

hematopoiesis, could be used to assign chromatin accessibility at all enhancers to

developmental lineages and enable enhancer cytometry in single cells (Fig. 2b). Indeed,

we found that with this approach, single cell accessibility profiles could be projected onto

hematopoietic principal components with high accuracy (Fig. 2c,d and Fig. S1b,c; see

methods). To better visualize and quantify heterogeneity within these cell subsets we

flattened these components onto a one-dimensional myelopoietic developmental

progression (Fig. 2e). Using these projections, we find that primary patient derived LSCs

and blast cells are remarkably homogeneous and indeed exist at intermediate cell states.

This observation is corroborated by enhancer cytometry of a widely used clonal AML

cell line HL60, which also shows mixed normal cell contributions using ensemble (Fig.

S1a) and single-cell (Fig. 2e) enhancer cytometry. To further test our ability to project

single cells onto hematopoietic components, we performed scATAC-seq on FACS-

purified MEP cells. Intriguingly, we find single MEPs show a predominant peak centered

at the MEP position with a prominent tail towards CMP along erythropoietic

differentiation (Fig. 2f and Fig. S1c). This observation is consistent with post-sort

analysis of MEPs suggesting a low level of contribution of CMP or CMP-to-MEP

transitional cell-states (Fig. S1d). Importantly, we also find that biological replicates of

scATAC-seq from the erythroleukemia cell line (K562) show highly reproducible

measures of erythroid differentiation (Fig. 2f). Together, these results corroborate a

lineage infidelity model wherein primary human AML cells and AML-derived cell lines

can simultaneously access two normally independent regulatory programs within the

same cell.
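
The projection of single cells onto bulk-derived components, and their flattening onto a one-dimensional progression, can be sketched as follows. This is a simplified illustration and not the method detailed in the methods section: it assumes the principal axes and the per-peak mean come from normal bulk data, and it assumes that the one-dimensional position is the scalar projection of a cell onto the straight line joining two reference centroids (for example HSC and monocyte). The names project_cell and trajectory_position are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_cell(counts, reference_mean, components):
    """Scores of one single-cell accessibility vector on the reference axes."""
    centered = [c - m for c, m in zip(counts, reference_mean)]
    return [dot(centered, axis) for axis in components]


def trajectory_position(scores, start_centroid, end_centroid):
    """Fractional position of a cell along the line from start to end centroid.

    A value of 0 places the cell at the start cell type, 1 at the end cell
    type; values in between indicate an intermediate state.
    """
    direction = [e - s for s, e in zip(start_centroid, end_centroid)]
    offset = [x - s for x, s in zip(scores, start_centroid)]
    length_sq = dot(direction, direction)
    if length_sq == 0:
        raise ValueError("start and end centroids coincide")
    return dot(offset, direction) / length_sq


# Toy example: three peaks, two axes, centroids at (0, 0) and (2, 0).
cell_scores = project_cell([2.0, 0.0, 1.0], [1.0, 1.0, 1.0],
                           [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(cell_scores, trajectory_position(cell_scores, [0.0, 0.0], [2.0, 0.0]))
```

Summing over all peaks through the component loadings is what makes the approach tolerant of sparse single-cell data, since no individual regulatory element has to be well covered.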

Generation of synthetic normal analogs for assessment of AML-specific biology

The ability to accurately quantify the contribution of each normal cell regulome to

the epigenetic profile of a leukemic cell type enables a more robust identification of

AML-specific regulatory elements. In particular, analyses of leukemic cell types in the

past have relied on comparing the malignant cells to a carefully chosen normal cell type.

Our data (Fig. 2a) show that this may not be sufficient, and that multiple distinct normal

regulatory patterns are contributing to the biology of AML cells. Due to these mixed

lineages, we suspect that past epigenomic and transcriptomic cancer studies may be

highly biased towards the rediscovery of normal and developmentally dynamic genes

rather than bona fide cancer-specific genes. We reasoned that effective removal of this

normal contribution is possible through the generation of synthetic normal analogs

which represent admixtures of various normal cells defined by enhancer cytometry (see

methods). While comparison of AML cell types to their closest normal cell analogs yields

a high correlation (R = 0.86, Fig. 2g), comparison of AML cell types to their synthetic

normal analogs yields an even higher correlation (R = 0.91, Fig. 2h) and, more

importantly, leads to a reduction in the number of predicted AML-specific peaks (N =

10,954 to N = 8,003). Notably, we found that comparison of AML epigenomes to

synthetic normal analogs consistently resulted in higher Pearson correlation values (Fig.

S1e) and provided fewer cancer-specific peaks than comparison to the closest normal

analog (Fig. 2i and Fig. S1f).
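
The construction of a synthetic normal analog and its use for calling AML-specific peaks can be sketched as below. This is a minimal sketch under stated assumptions rather than the implementation used in this work: normal profiles are assumed to be normalized accessibility values per peak, the enhancer cytometry fractions are assumed to be non-negative weights, a pseudocount of 1 is an arbitrary choice to stabilize the ratio, and all function names are hypothetical.

```python
import math


def synthetic_normal(profiles, fractions):
    """Weighted mixture of normal profiles using enhancer cytometry fractions.

    profiles:  dict mapping cell type to a list of per-peak accessibility values
    fractions: dict mapping cell type to its estimated contribution
    """
    total = sum(fractions.values())
    n_peaks = len(next(iter(profiles.values())))
    mixture = [0.0] * n_peaks
    for cell_type, fraction in fractions.items():
        weight = fraction / total  # renormalize so the weights sum to 1
        for i, value in enumerate(profiles[cell_type]):
            mixture[i] += weight * value
    return mixture


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def differential_peaks(sample, reference, cutoff, pseudocount=1.0):
    """Indices of peaks enriched or depleted in `sample` by log2 fold change."""
    enriched, depleted = [], []
    for i, (s, r) in enumerate(zip(sample, reference)):
        lfc = math.log2((s + pseudocount) / (r + pseudocount))
        if lfc >= cutoff:
            enriched.append(i)
        elif lfc <= -cutoff:
            depleted.append(i)
    return enriched, depleted


# Toy example: a blast sample modeled as an equal mix of two normal cell types.
normals = {"GMP": [10.0, 0.0, 4.0], "Mono": [0.0, 10.0, 4.0]}
analog = synthetic_normal(normals, {"GMP": 0.5, "Mono": 0.5})
blast = [5.0, 5.0, 63.0]
print(analog, pearson(blast, analog), differential_peaks(blast, analog, cutoff=3.0))
```

In the toy example the first two peaks differ from either single normal cell type but match the mixture, so only the third peak is reported, which is the intended effect of comparing against an admixture instead of the closest normal.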

By examining co-association of AML-specific peaks, we identified 6 regulatory

modules that are utilized by AML cells (Fig. 3a and Fig. S2a). We can track the usage of

these modules through leukemogenesis and identify patterns related to specific AML cell

types (Fig. 3b). Additionally, each module shows enrichment for peaks associated with

different key transcription factors (Fig. 3c). For example, modules 1 and 2 show strong

enrichment for JUN and FOS activity, indicating the activation of AP-1-dependent stress

response pathways in these cells. This increase in accessibility of JUN/FOS motifs is

echoed by an increase in expression of these factors by RNA-seq (Fig. S2b) and is

maintained through the stages of leukemogenesis, identifying inhibition of these

pathways as a potential therapeutic strategy in AML. Indeed, JNK inhibition showed a

moderate but consistent selective targeting of AML blasts (Fig. S2c-e). This observation

is consistent with previous publications that identify JNK as a therapeutic target in

AML21 and indicates that similar strategies may prove efficacious in targeting pre-

leukemic HSC.
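
Grouping AML-specific peaks into modules by K-means clustering, as noted in the legend of Figure 3a, can be illustrated with a plain implementation of Lloyd's algorithm. This is a sketch under stated assumptions, not the code used in this work: each peak is assumed to be represented by a vector of per-sample log2 fold changes relative to the synthetic normal, the initialization (a seeded random draw of k peaks) is an arbitrary choice, and the function names are hypothetical.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, max_iterations=100, seed=0):
    """Cluster points (lists of floats) into k groups with Lloyd's algorithm.

    Returns (labels, centers), where labels[i] is the cluster of points[i].
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the previous center if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy example: four peaks falling into two obvious groups.
toy_peaks = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
toy_labels, toy_centers = kmeans(toy_peaks, k=2)
print(toy_labels, toy_centers)
```

For the analysis described above, k would be set to 6; the choice of k and the sensitivity of the modules to initialization are not addressed by this sketch.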

Mechanism and clinical consequences of pre-leukemic HSC clonal advantage

Despite previous work on the acquisition of mutations during the pre-leukemic

phase of AML evolution5, it remains unclear whether pre-leukemic HSC represent a

unique functional state or merely serve as long-lived reservoirs for mutation

accumulation. Moreover, functional epigenetic consequences of pre-leukemic mutations

in primary AML samples have not been characterized. Using ATAC-seq and enhancer

cytometry we show that pHSCs share many regulatory programs with HSCs and MPPs

(Fig. 2a). Nevertheless, comparison to synthetic normal analogs identifies a distinct

regulatory module (module 6) that shows decreased accessibility in pHSCs, representing

the earliest known event of AML evolution (Fig. 3b). This repressed regulatory module is

enriched for motifs associated with HSPCs (i.e. HOX and GATA) and provides direct

evidence to support a model where pHSCs maintain a unique epigenetic and functional

state.

In order to better understand the consequences of a loss in accessibility at motifs

associated with HSPCs, we probed pHSCs for phenotypic changes related to self-renewal

and differentiation. When pushed to differentiate down the myeloid and erythroid

lineages (Fig. S2f), pHSCs showed a strong resistance towards differentiation, instead

favoring maintenance of the stem cell state (Fig. 3d,e). Given the decreased accessibility

of module 6, this suggests that accessibility at certain stem cell-related motifs may confer

the ability to properly differentiate rather than properly self-renew. We have previously

assessed the effect of depletion of GATA1 and GATA2 on HSPC differentiation and self-

renewal (Mazumdar et al., 2015 in press), finding that knockdown of GATA2 led to a

decrease in self-renewal of HSPCs while knockdown of GATA1 had no effect. This

observation excludes these GATA factors from mediating the defects in differentiation

associated with repression of module 6. Given the well-studied role of HOX factors in

stem cells22, in particular the role of HOXA9 in HSCs, we hypothesized that HOXA9

might mediate the observed stemness phenotype. In fact, previous studies have shown an

increase in the number of HSCs in mice deficient for HOXA923. From this, we reasoned

that loss of accessibility at HOXA9 target sites may confer an increase in stemness and

prevent proper differentiation, a hallmark of AML. Indeed, we found depletion of

HOXA9 by short hairpin RNA (shRNA) knockdown (Fig. S2g) in umbilical cord blood

CD34+ HSPCs led to a retention of stemness in the context of both myeloid (Fig. 3f) and

erythroid (Fig. 3g) differentiation. Moreover, a concomitant decrease in differentiated

granulocytes and erythroid cells was also observed (Fig. S2h,i), consistent with results

from mouse models of HOXA9 deficiency23,24. In addition, we note that this retention of

stemness is also observed in the absence of a differentiation stimulus (Fig. S2j). Together,

these results suggest that decreased HOX accessibility in pHSCs may promote retention

of stemness and prevent differentiation of these cells.

The retention of stemness in pHSCs caused by loss of accessibility at HOXA9

motifs helps to explain the observation that pHSCs outcompete their normal HSC

counterparts in vivo (Fig. S2k). Retention of stemness provides pHSCs with an

evolutionary advantage in that resisting differentiation maintains cells in an HSC-like

state, which increases the likelihood of acquiring additional leukemogenic mutations.

One implication of this model is that pre-leukemic burden may have adverse effects on

patient survival, despite the fact that pHSCs do not confer disease in xenograft transplant

assays4,6,16. Characterization of our patient cohort shows that pre-leukemic burden

inversely correlates with overall survival and relapse-free survival (Fig. 3h,i). High pre-

leukemic burden is associated with an approximately three-fold higher hazard of death or

leukemia relapse (hazard ratio = 3.30 for overall survival and 2.99 for relapse free

survival, p < 0.05). These results further implicate pHSCs in AML pathology and suggest

a mechanism wherein AML arises from the presence of a pre-leukemic clone that is

capable of outcompeting its normal HSC counterparts (Fig. S2k) and predisposes patients

to more aggressive or refractory leukemia. In sum, detailed analysis of AML-specific

regulomes enables the identification of novel features of pHSC biology that have

important prognostic implications.
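
The survival curves underlying this comparison are Kaplan-Meier estimates (see the legend of Figure 3). The estimator itself is simple enough to sketch; the code below is an illustration with hypothetical follow-up data, not the statistical software used for the reported analysis, and it does not implement the log-rank test or the Mantel-Haenszel hazard ratio.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times:  follow-up time for each patient
    events: 1 if the event (death or relapse) was observed, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # censored patients leave the risk set silently
    return curve


# Hypothetical follow-up (months) for four patients; the second is censored.
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
```

Computing one curve per burden group and comparing them is what the log-rank test formalizes; the hazard ratio then summarizes how much faster events accrue in the high-burden group.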

Discussion

The study of acute myeloid leukemia sheds light on the biology and step-wise

progression of leukemia evolution. We measured regulomes in patient-matched pre-

leukemic HSC, LSC, and blast cells representing three distinct time points in AML

evolution. Examination of the average epigenetic variance across the genome shows that

variance increases through the stages of leukemia evolution with the majority of this

variance being explained by differences observed during normal hematopoietic

differentiation. The epigenetic landscapes of AML blast cells isolated from various

patients are extremely divergent, highlighting the need for personalized approaches to

adequately target each patient's unique cancer cells.

A longstanding debate in cancer biology is how cancer cells violate cell lineage

rules. Cancer cells with markers or morphologies of one cell type have been shown to

also express markers of a different cell type25, which raises diagnostic challenges and

treatment conundrums. Two classic but competing models posited (i) lineage infidelity, in which

a single cancer cell simultaneously accesses two normally distinct regulatory programs;

or alternatively (ii) lineage promiscuity, in which a normally bipotential progenitor cell exists,

and the cancer cell is simply an expansion of this rare but physiologic bipotential state.

By using our comprehensive map of hematopoiesis, patient-matched AML cell subsets,

and single-cell ATAC-seq of hundreds of individual leukemic cells, we show direct

evidence of lineage infidelity: a single cell accessing a mixed regulatory program. This

result has potentially important diagnostic and mechanistic implications, and we build

upon both classical models to address this challenge. Comparison of cancer to matched

normal cells is one of the most basic and commonplace experiments in cancer biology,

but lineage infidelity demonstrates that there may be no appropriate normal for

comparison in epigenomic and transcriptomic studies. Instead, we use enhancer

cytometry to construct synthetic normals, proportionally matching the fractional

contribution of cell type-specific regulomes, in order to pinpoint cancer-specific

aberrations.

This approach streamlined the discovery of candidate drivers and led us to

discover the loss of HOXA9-mediated accessibility as the most consistent defect in pre-

leukemic HSCs. We found that HOXA9 loss can, in fact, cause defects in differentiation

as observed in these pre-leukemic HSC and confer an evolutionary advantage.

Importantly, higher pre-leukemic burden is predictive of poor overall and relapse-free

survival in AML, indicating an important role for pre-leukemic HSC in disease

pathogenesis. These results provide potential avenues for therapeutic intervention during

the earliest stages of leukemogenesis. Moreover, we anticipate that lineage infidelity is a

widespread phenomenon in many types of cancer, and that our integrative approach using

enhancer cytometry to construct synthetic normal analogs should be broadly applicable to

many disease pathologies.

Chapter 6 - Figures and Figure Legends

Figure 1. Acute myeloid leukemia regulomes reveal a cooption of normal

myelopoiesis. (a) Schematic of the leukemogenic process. HSCs serve as a reservoir of

mutation acquisition. Early mutations in epigenetic modifiers such as DNMT3A, TET2,

and IDH1/2 generate pre-leukemic HSCs. Downstream acquisition of mutations in genes involved in

activated signal transduction, such as FLT3 and RAS, leads to generation of leukemia stem

cells which both self-renew and produce leukemic blast cells. (b) Genotype and mutation

frequencies of HSCs isolated from AML patients (N=39). Color indicates the percent of

cells mutated as estimated from the variant allele frequency. Gray color indicates a

mutation known to be present in leukemic cells but not observed during the pre-leukemic

phase of AML evolution (i.e. a late mutation event). Asterisks indicate the predicted first

mutation. If a mutation is bi-allelic, the representative bar is divided in half. Patients with

more than 20% of HSCs harboring a pre-leukemic mutation were classified as high

burden and those patients with less than 20% of HSCs harboring a pre-leukemic

mutation were classified as low burden. (c) Normalized sequencing track of control

loci on chromosome 19 from FACS-purified AML cell types. Profiles represent the union

of all biological replicates for each cell type. (d) Mean variance of chromatin accessibility

across the genome as calculated by a moving average across each leukemic cell stage (see

methods). (e) Normalized sequencing tracks of developmentally-associated genes

GATA2 (left) and CEBPB (right). Profiles represent the union of all biological replicates

for each cell type: pHSC (N=12), LSC (N=8), blasts (N=12). (f) Cumulative variance of

AML ATAC-seq data explained by the first N principal components derived from normal

hematopoiesis. (g) Myeloid development score in normal blood cell types (N=4

biological replicates) and AML cell types. The myeloid score is calculated from the first

principal component which encompasses the majority of variation observed in

myelopoiesis.

Figure 2. Enhancer cytometry and single-cell regulomes support a model of lineage

infidelity and allow for deconvolution of AML-specific biology. (a) Enhancer

cytometry deconvolution showing contribution of various normal cell types to the

epigenetic landscape of different AML cell types. (b) Schematic of single-cell ATAC-seq

protocol and analysis. (c,d) Projection of ATAC-seq data derived from (c) single SU070

LSCs and (d) single SU070 blast cells onto the principal components derived from the

normal hematopoietic hierarchy. (e,f) Relative density of (e) single SU070 LSCs, SU070

blasts, and HL60 and (f) single MEP and K562 cells projected onto a one-dimensional

representation of the myeloid and erythroid progression, respectively. Two biological

replicates of K562 cells are marked as K562-1 and K562-2. (g) Scatter plot showing

the correlation of ATAC-seq data derived from SU353 blast cells with the closest normal

cell type (GMP) (R=0.86). Using a log2(fold change) cutoff of 4 we identify 8,209 peaks

depleted and 10,954 peaks enriched in SU353 blast cells. (h) Scatter plot, as shown in (g),

showing the correlation between SU353 blast cells with the enhancer cytometry-defined

synthetic normal analog (R=0.91). Using a log2(fold change) cutoff of 4 identifies 5,887

peaks enriched in the synthetic normal analog and 8,003 peaks enriched in SU353 blast

cells. (i) Comparison of AML cell types to synthetic normal analogs. The closest normal

is shown in color. The percent of the total significant peaks that are removed by

comparison to synthetic normal analogs is plotted for each sample.

Figure 3. Early chromatin accessibility alterations within pHSCs promote stemness

which predicts adverse patient outcomes. (a) K-means clustering of cancer-specific

peaks identifies 6 distinct regulatory modules. (b) Enrichment of each module identified

in Figure 3a reveals activated and repressed patterns in leukemogenic progression.

Gray bars shown represent 1 S.D. across all samples of that given cell type. (c)

Enrichment and hierarchical clustering of motifs in AML-specific regulatory modules.

(d,e) Retention of stemness as measured by flow cytometric analysis of CD34 protein

expression after 6 days of enforced differentiation down the (d) myeloid lineage and (e)

erythroid lineage. Error bars represent 1 S.D. Experiments done in triplicate. (f,g) Fold

change in the percent of cells expressing CD34 as measured by flow cytometric analysis

of human umbilical cord blood-derived HSCs transduced with shRNAs targeting HOXA9

or a non-targeting control. Percent CD34+ cells measured after 6 days of enforced

differentiation down the (f) myeloid lineage and (g) erythroid lineage. Only GFP+

transduced cells analyzed. Error bars represent 1 S.D. Experiments done in triplicate. (h)

Overall and (i) relapse-free survival of patients stratified by pre-leukemic burden as

described in Figure 1b (High pre-leukemic burden, N=24; Low pre-leukemic burden

N=15). High pre-leukemic burden defined as greater than or equal to 20% of HSCs

harboring at least the first pre-leukemic mutation. Survival analysis was performed using

the Kaplan-Meier estimate method. All patients were included for the analysis regardless

of their treatment. P values comparing two Kaplan-Meier survival curves were calculated

using the log-rank (Mantel-Cox) test. Hazard ratios were determined using the Mantel-

Haenszel approach. **p<0.01, ***p<0.001, ****p<0.0001 derived from two-tailed t-test.

Supplementary Figure 1. Validation of enhancer cytometry in AML cell lines and

primary cells by single-cell ATAC-seq. (a) Enhancer cytometry of ATAC-seq data

derived from various blood cell lines demonstrates mixed regulatory contribution from

various normal hematopoietic cell types. (b) Projection of down sampled bulk

hematopoiesis data onto myeloid (left) and erythroid (right) progression. (c) Projection of

single MEPs onto hematopoiesis principal components 2 and 3. (d) Post-sort analysis of

MEPs used in scATAC-seq analyses presented in Figure 2f and Supplementary Figure 1c

gated for CMP (2.54%), MEP (97.5%) and GMP (0%). (e) Pearson correlations of AML

cell types with the closest normal analog (color) and the enhancer cytometry-derived

synthetic normal (gray). (f) Total significant peaks observed after comparison of AML

cell types to synthetic normal analogs. Significance measured as log2(fold change) > 3.

Supplementary Figure 2. Validation of regulatory network analysis in AML cell

types. (a) Principal component analysis of the log2(fold change) values of each AML cell

type compared to its synthetic normal. (b) Expression of JUN in various normal

hematopoietic cells, pHSCs, and blasts. *p<0.05, two-tailed t-test. (c-e) The effect of

JNK/ERK inhibition by (c) JNK-IN-8, (d) SP600125, and (e) SCH772984 was

determined by IC50 of sorted primary AML blast cells in comparison to CD34+ HSPCs

derived from umbilical cord blood. Viability determined by flow cytometric assessment

of Annexin V and DAPI. (f) Strategy for in vitro differentiation of HSPCs down the

myeloid and erythroid lineages. HSPCs are grown in defined culture media for 6 days

and then analyzed for cell surface markers of stemness or differentiation. Immature cells

at day 6 express CD34 and have not yet upregulated CD33. (g) Quantitative reverse-

transcriptase PCR validation of HOXA9 knockdown via shRNA. Knockdown performed

in THP1 cells for 72 hours and validated with two separate primer sets. (h,i) Fold change

in the percent of (h) CD15+ granulocytes or (i) CD71+GPA+ erythroblasts between cord

blood-derived CD34+ HSPCs infected with lentiviruses with non-targeting or HOXA9

shRNAs after 6 days of differentiation down the (h) myeloid or (i) erythroid lineage.

***p<0.001, ****p<0.0001 by two-tailed t-test. (j) Fold change in the percent of CD34+

HSPCs after 6 days of culture in stemness retention media (see methods) between cord

blood-derived CD34+ HSPCs infected with lentiviruses with non-targeting or HOXA9

shRNAs. (k) Burden of mutations in DNMT3A, TET2, IDH1/2, or other genes when

detected in pre-leukemic HSCs. *p < 0.05, **p < 0.01 by two-tailed t-test.

References

1. Shih, A. H., Abdel-Wahab, O., Patel, J. P. & Levine, R. L. The role of mutations in
epigenetic regulators in myeloid malignancies. Nature Reviews Cancer 263, 2235
(2015).
2. Araten, D. J. et al. A quantitative measurement of the human somatic mutation
rate. Cancer research 65, 8111–8117 (2005).
3. Sun, J. et al. Clonal dynamics of native haematopoiesis. Nature (2014).
4. Jan, M. et al. Clonal evolution of preleukemic hematopoietic stem cells precedes
human acute myeloid leukemia. Science translational medicine 4, 110 (2012).
5. Corces-Zimmerman, M. R. & Majeti, R. Pre-leukemic evolution of hematopoietic
stem cells: the importance of early mutations in leukemogenesis. Leukemia 28,
2276–2282 (2014).
6. Shlush, L. I. et al. Identification of pre-leukaemic haematopoietic stem cells in
acute leukaemia. Nature 506, 328–333 (2014).
7. Lindberg, J. et al. Clonal Hematopoiesis and Blood-Cancer Risk Inferred from
Blood DNA Sequence. N Engl J Med 371, 2477–2487 (2014).
8. Jaiswal, S. et al. Age-Related Clonal Hematopoiesis Associated with Adverse
Outcomes. N Engl J Med 371, 2488–2498 (2014).
9. Okano, M., Xie, S. & Li, E. Cloning and characterization of a family of novel
mammalian DNA (cytosine-5) methyltransferases. Nature Genetics 19, 219–220 (1998).
10. Tahiliani, M. et al. Conversion of 5-methylcytosine to 5-hydroxymethylcytosine in
mammalian DNA by MLL partner TET1. Science 324, 930–5 (2009).
11. Dang, L. et al. Cancer-associated IDH1 mutations produce 2-hydroxyglutarate.
Nature 462, 739–44 (2009).
12. Figueroa, M. E. et al. Leukemic IDH1 and IDH2 Mutations Result in a
Hypermethylation Phenotype, Disrupt TET2 Function, and Impair Hematopoietic
Differentiation. Cancer Cell 18, 553–567 (2010).
13. Greaves, M. F., Chan, L. C., Furley, A. J. W., Watt, S. M. & Molgaard, H. V.
Lineage Promiscuity in Hemopoietic Differentiation and Leukemia. Blood 67,
1–11 (1986).
14. Döhner, H., Weisdorf, D. J. & Bloomfield, C. D. Acute Myeloid Leukemia. N
Engl J Med 373, 1136–52 (2015).
15. Jan, M. & Majeti, R. Clonal evolution of acute leukemia genomes. Oncogene 16
(2012).
16. Corces-Zimmerman, M. R., Hong, W.-J., Weissman, I. L., Medeiros, B. C. &
Majeti, R. Preleukemic mutations in human acute myeloid leukemia affect
epigenetic regulators and persist in remission. Proceedings of the National
Academy of Sciences of the United States of America 111, 2548–53 (2014).
17. Goardon, N. et al. Coexistence of LMPP-like and GMP-like Leukemia Stem Cells
in Acute Myeloid Leukemia. Cancer Cell 19, 138–152 (2011).
18. Bennett, J. M. et al. Proposals for the classification of the acute leukaemias.
French-American-British (FAB) co-operative group. British Journal of
Haematology 33, 451–8 (1976).
19. van't Veer, M. B. The diagnosis of acute leukemia with undifferentiated or
minimally differentiated blasts. Annals of Hematology 64, 161–5 (1992).
20. Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of
regulatory variation. Nature 523, 486–490 (2015).
21. Volk, A. et al. Co-inhibition of NF-κB and JNK is synergistic in TNF-expressing
human AML. Journal of Experimental Medicine 211, 1093–1108 (2014).
22. Abramovich, C. & Humphries, R. K. Hox regulation of normal and leukemic
hematopoietic stem cells. Current opinion in hematology 12, 210–216 (2005).
23. Magnusson, M., Brun, A. C. M., Lawrence, H. J. & Karlsson, S.
Hoxa9/hoxb3/hoxb4 compound null mice display severe hematopoietic defects.
Experimental Hematology 35, 1421.e1–1421.e9 (2007).
24. Lawrence, H. J. et al. Mice bearing a targeted interruption of the homeobox gene
HOXA9 have defects in myeloid, erythroid, and lymphoid hematopoiesis. Blood
89, 1922–1930 (1997).
25. Levine, J. H. et al. Data-Driven Phenotypic Dissection of AML Reveals
Progenitor-like Cells that Correlate with Prognosis. Cell 162, 184–197 (2015).

CHAPTER SEVEN Conclusion

Methods for gene regulation

The thesis work presented here leverages high-throughput methodologies in an effort to

provide a quantitative understanding of cellular regulation. Using an in vitro approach we

make >10^7 quantitative measurements of the biophysical parameters defining an

RNA-protein interaction across sequence mutants. This platform may be extended to profile a

diversity of RNA-protein interactions and may form the basis of methods that quantify

protein-protein interactions or chromatin-TF interactions. Such efforts promise to provide

a principled biochemical understanding of the sequence and structure determinants of

trans-factor binding.

Measuring these regulatory processes in vivo provides unique insight into the

characteristics and potential of cellular behavior. Preceding this work, methods for

measuring chromatin structure genome-wide often required tens of millions of cells and

included complex experimental workflows. We have developed ATAC-seq and scATAC-

seq for profiling chromatin accessibility within rare cellular populations and/or from

single cells. Together, these methods enable genome-wide chromatin accessibility

measurements of carefully isolated or de novo defined cellular populations, and the

inference of the trans-acting regulatory proteins that define them. In addition, these

methods can measure chromatin accessibility in human tissues derived in vivo, as

demonstrated by our efforts to understand human hematopoiesis and leukemogenesis.

Future work

Regulatory rules describing promoter-enhancer interactions and their effect on the

expression of nearby genes would greatly enhance our ability to causally link the

epigenome to gene expression and subsequently disease mutations to phenotypes. Such a

lofty endeavor will require new experimental and computational methodologies.

Specifically, TF-TF, TF-remodeler, and other TF-protein interactions are

critical for understanding TF binding landscapes and gene expression in vivo. Further

development of in vitro methods or high-throughput in vivo reporter assays may help

elucidate these mechanisms.

Furthermore, combining genome-wide assays within single cells provides a

unique opportunity to develop regulatory models, wherein natural variation within single

cells can be used to infer causal changes of expression at nearby genes. Integrating

ATAC-seq, RNA-seq and protein measurements in the same single cell at high

throughput may serve to quantify trans-acting regulators, their binding to cis-regulatory

elements and the resulting effect on the expression of nearby genes.

Together, these approaches provide deep insight into individual regulatory

patterns; however, only a combined or systems approach to these data would yield a

complete understanding of cellular regulation within single cells. To this end,

computational models that integrate these data sets and infer causality, for example the

expression of a gene or cellular response to a stimulus, are required. In summary, a

multidisciplinary and collaborative approach promises to enrich our understanding of

cellular regulation and form the basis of our understanding of human disease.
