Vous êtes sur la page 1sur 33

Algorithmic Analysis of Human DNA Replication Timing from Discrete Microarray Data

Prof. Rushen Chahal

Thesis Statement
The DNA replication timing profile can be reconstructed efficiently and accurately from discrete time points.

(Glossary)

Prof. Rushen Chahal

Presentation Outline
Biology background Microarray technology Experimental data
Challenges

Algorithms Research Plans


Replication timing Origins Scale up
Prof. Rushen Chahal

Why Study DNA Replication?


Natural Science
DNA is the blueprint for organisms
It must be passed on (organism, cell)

Engineering
Gene therapy
Insertion, deletion, modification

Cancer is unchecked replication

Prof. Rushen Chahal

... A ... T

G C

G C

T A

C G

G C

A T

C G

A T

C ... G ...

Human genome > 3 billion bp Replication rate ~ 1000 bp/min Serial replication 5.7 years 6 to 10 hours (speedup > 5000)

Prof. Rushen Chahal

Background
Prokaryotes
E. Coli
DnaA binds to oriC

Eukaryotes ORC
S. Cerevisiae (yeast)
ARS 11 bp consensus
Mapping of origins

Human
No known consensus Few origins characterized

Prof. Rushen Chahal

Genome Tiling Microarrays


Interrogation at genomic scale Large increase in data
Microarray data analysis

Array of probes tiles genome


ATGGACTACGGATCAGTAAATCGATTAGGCACCAGATCAAGTACGATCCAGAGTACATAGCATACCATGACTAGA GAGTACATAGCATACCATGACTAGA CTCATGTATCGTATGGTACTGATCT TACCTGATGCCTAGTCATTTAGCTAATCCGTGGTCTAGTTCATGCTAGGTCTCATGTATCGTATGGTACTGATCT

Cross-hybridization
Repeats not tiled
Gaps in genome
PM probe MM probe
GAGTACATAGCATACCATGACTAGA A

Prof. Rushen Chahal

Image analysis computes intensity of each array probe


Prof. Rushen Chahal

The Cell Cycle

S-Phase

Start of S-phase Prof. Rushen Chahal (0 hour)

Profiling DNA Replication Timing


Ideal: f(chr, bp) = rtime Isolate DNA replicated in discrete parts of S-phase
One cell is not enough Synchronize S-phase entry
Apply drugs Release together
Synchronization error

Label in two hour intervals

Allelic Variation
mf(chr, bp) = {rtime1, rtime2, }
Prof. Rushen Chahal

0hr

0hr

Allelic Variation
2hr 2hr

Fluorescent in-situ Hybridization (FISH)


4hr

Replication timing at a given site


Temporally non-specific replication (TNS)

4hr

6hr

6hr

Temporally specific replication (TS)

8hr

8hr

10hr Prof. Rushen Chahal 11

10hr

What is the Problem?


Reconstruct a continuous replication profile
Temporally (time points) Spatially (probes)

from noisy data


Biological experiments Synchronization error Microarray artifacts

efficiently
Genomic data (> 3 billion bp)

Prof. Rushen Chahal

Initial Analysis
Tiling Analysis Software (TAS)
Wilcoxon Rank Sum test in sliding window
Assess enrichment of treatment over control

Window slides to get p-value for each probe


O(kn) time complexity
n = # probes on array k = # probes in a window k scales linearly with window size
Prof. Rushen Chahal

New Analysis
Thesis Statement (revisited):
The DNA replication timing profile can be reconstructed efficiently and accurately from discrete time points.

Incorporate information from all time points


Continuous view of replication timing (TR50)

Address temporally non-specific replication Scale up to the whole genome efficiently


Prof. Rushen Chahal

Allelic Variation Examples


Temporally specific replication 0 0 2 0 4 1/1 5 TR50 6 0 8 0 10

Temporally non-specific replication 1/6 0 2 1/6 4 1/3 5 TR50 6 0 8 1/3 10

Challenge: From distribution of array signal, determine replication category.


Prof. Rushen Chahal

Temporal Specificity Algorithm


// Is there evidence that all alleles are replicating together? If (max sum of two adjacent time points 5/6 * total sum) then {probe is temporally specific} // Is at least one allele replicating apart from the majority? Else If (max sum of two adjacent time points not including the maximum time point 1/3 * total sum) then {probe is temporally non-specific} // Isolated signal is not strong enough to be an allele. Else {probe is temporally specific}

Prof. Rushen Chahal

Plotting TR50
8 6 4 2 TR50 (hours) 33 33.5 34 Chromosomal Position (in millions of bp)

Smoothed TR50 curve recovers replication pattern Local minima Possible locations of replication origin
Prof. Rushen Chahal

Segregation Algorithm
Ratio 2-to-1 & Avg < 3.4 Avg 3.4 Avg > 3.9 3.4 Avg 3.9 Avg > 3.9

TNS

Ratio < 2-to-1

Early

Avg < 3.4

Mid

Avg 3.9

Late

Ratio < 2-to-1 Ratio < 2-to-1

Sliding window passes over probes to generate intervals


Ratio of TSP to TNSP determines temporal specificity Average TR50 determines timing category
Prof. Rushen Chahal

Research Plan: Profile Generation


0-2hr 2-4hr 4-6hr 6-8hr 8-10hr Probe Classification (Temporal Specificity Algorithm) & TR50 Calculation No Signal Probes TNS Probes TS Probes & TR50 Segregation Algorithm (Sliding Window) Low Probe Density TNS Regions TS Regions Join Intervals Mask TS probes with JTS Regions Joined TNS Regions Joined TS Regions

TS Probes that fall into JTS Regions

TR50 Smoothing Early Smoothed TR50 Segregate JTS Regions into 1/3s based on STR50 Mid Late Join Intervals Joined Early Joined Mid Joined Late

Parameters to evaluate:
Segregation Algorithm: sliding window size, minimum probe density Join Intervals: minimum interval size
Prof. Rushen Chahal

Evaluation
Concordance of biological phenomena
Segregation intervals FISH STR50 local minima Other origin methods Correlation with other biological data
Gene density Early replication AT content Late replication Gene expression Early replication Activating acetylation/methylation Early replication

Performance on random data


Large quantity of TNS replication

Prof. Rushen Chahal

Research Plan: Replication Origins


Drive DNA replication pattern Smoothed TR50 local minima
Cleaned up with new profiles

Other biological assays


Early labeling fragments Nascent strands Bubble trapping ORC binding

Prof. Rushen Chahal

Approach and Evaluation


Correlation between methods
Consensus sets
Motif analysis

Positional attributes
Replication timing Proximity to genes

Evaluation is difficult (few validated origins)


Agreement between methods Testing proposed correlations Paper in preparation
Prof. Rushen Chahal

Scaling Up to Whole Genome


Pilot 1% 100% of human genome
Linear time

Algorithms developed with scalability in mind


Incremental update sliding windows

Performance based evaluation


If 100% data available
Profile multiple runs

Else
Profile many 1% runs

Prof. Rushen Chahal

Implementation Details
Java
Class representation of proprietary microarray files Algorithms to process raw microarray data Diagnostic tools

Perl
Scripts to process intermediate and final data Correlations, data transformation, quality assurance

R statistical language
Smoothing, statistical plots, correlation studies

Shell scripts
Automated processing of microarray sets

Prof. Rushen Chahal

Current/Expected Contributions
Algorithms, Software Infrastructure, Analysis Probe-by-probe TR50 analysis
Temporal Specificity Algorithm
Combinatorial analysis of allele locations

Segregation Algorithm
TNS, Early, Mid, Late replicating areas
Used to design validation experiments

Smoothed TR50 profile


Local minima provide candidate origin set

Linear algorithms enable scale up Randomness testing


Prof. Rushen Chahal

Publications
Completed: ENCODE Project Consortium. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science. 2004 Oct 22; 306(5696):636-40. ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. {In Press, to appear in June 14, 2007 issue} Karnani N., Taylor C., Malhotra A., Dutta A. Pan-S replication patterns and chromosomal domains defined by genome tiling arrays of encode genomic areas. Genome Research. {In Press, to appear in June 2007 issue} UCSC Browser Tracks: TR50, Smoothed TR50, Local Minima, Segregation In Progress: Multi-million dollar NIH grant for scale up to full human genome Paper detailing origin methods, correlations, etc.
Prof. Rushen Chahal

Why is this work computer science?


Fred Brooks: The Computer Scientist as Toolsmith II
Hitching our research to someone elses driving problems, and solving those problems on the owners terms, leads us to richer computer science research.

Not an incremental improvement


Algorithmic techniques and analysis used to solve a problem previously addressed inadequately with a statistical approach that performed poorly

Collaboration outside of engineering disciplines enhances visibility, funding opportunities, and demand for CS work Developed algorithms, time complexity analysis, combinatorial analysis, feedback to experimental design
Prof. Rushen Chahal

Will this work lead to any CS publications?


The Nature article focused on analysis of the biological data and includes descriptions of some of my algorithms The Genome Research paper and origins paper will also contain writeups of my algorithms and analysis techniques The Pacific Symposium on Biocomputing focuses on algorithms and computational techniques
Prof. Rushen Chahal

Isn't your approach too simple?


The approach isnt simple:
Combinatorial analysis Temporal specificity algorithm (many iterations) Probewise computation to deal with binding affinity Incremental updating sliding windows
Cross-hybridiztion Synchronization error

Smoothing
Parameterization

Linear algorithms for scale up


Prof. Rushen Chahal

Can't your algorithm be replaced by a well-known statistical method?


HMMs were used for segregation of intervals
Performed poorly in comparison to my algorithm
Less accurate categorization of replication intervals Prone to rapid oscillation, producing tiny intervals Parameterization was difficult

Lowess smoothing is a statistical method


Parameterization was not easy

Prof. Rushen Chahal

What are the biggest challenges in this work?


Noise!
The data to analyze comes from biological experiments with several sources of noise that compound upon one another

Biology
I havent had a course in biology since 10th grade

Microarrays
New, evolving technology were still learning to deal with

Data size
Hundreds of GB of data to process Replicates, failed experiments Algorithms must be efficient
Prof. Rushen Chahal

What kind of career are you aiming for after graduation, and why?
Teaching Computer Science (Small College)
I enjoyed learning in my undergraduate curriculum with meaningful interactions with professors I taught Discrete Math at UVa in Fall 02 and Spring 03
Enjoyable, but 60-70 students too large

Post-doctoral (Biological Computing)


Many opportunities around the world Further exploration of the field

Prof. Rushen Chahal

How will you know when your work/thesis is done?


Research is never really done, but you have to declare victory at some point The replication profiling algorithms Ive developed already perform quite well
I have concrete plans to improve and finalize them

Prof. Rushen Chahal

Vous aimerez peut-être aussi