
STATISTICAL LEARNING IN CANCER BIOLOGY: LECTURE 1

Donald Geman, Michael Ochs, Laurent Younes


Johns Hopkins University

ENS-Cachan February 7, 2013

OUTLINE

Scope of the Series
Systems Biology to Systems Medicine
Early Success Stories and Challenges
Central Dogma
Cancer: First Model

LECTURE SERIES

Lecture 1: Introduction (DG)
Lecture 2: Cancer Biology (MO)
Lecture 3: Cell Signaling Networks (MO)
Lecture 4: Genetic Variation (DG)
Lecture 5: Massive Testing (LY)
Lecture 6: Biomarker Discovery (LY)
Lecture 7: Phenotype Prediction (DG)
Lecture 8: Embedding Mechanism (DG)

INSIDE THE SERIES

Content: Computational Molecular Medicine

Computational:
Three pillars: hypothesis testing, statistical learning, and stochastic modeling and simulation. Statistical learning serves as a proxy for all three.

Molecular:
As in molecular biology, so involving biomolecules in cells, mainly DNA, RNA and proteins.

Medicine:
Associating genetic variation with disease.
Predicting phenotypes from molecular concentrations.
Understanding disease in the context of molecular networks of genes and gene products.

OUTSIDE THE SERIES

Technology:
(Almost) nothing about advances in generating omics data. In particular, nothing about next-gen xyz, data acquisition, data preprocessing. For us, data are a matrix of numbers.

Software:
"Computational" does not stand for "computer-based." Nothing on packages for implementing specific algorithms, on tools for the community, or on R or any other programming language.

Medical Informatics:
Nothing about storing, retrieving, organizing, sharing data.

VARIABLES

Modern capture devices enable the simultaneous and quantitative assessment of many molecular states:
RNA (mRNA, miRNA, etc.) expression.
Metabolite quantification.
DNA polymorphism measurements (e.g., SNPs).
Protein quantification (e.g., via MassSpec).
Methylation arrays.
DNA-protein binding (e.g., ChIP-chip).

WHY STATISTICAL LEARNING?

Molecular data are stochastic.
Within single cells, RNA and protein expression are inherently stochastic. There is massive variability in the human genome from person to person, and similarly from tissue to tissue and phenotype to phenotype.

Molecular data are high-dimensional.
Analyzing them by hand or by inspection is impossible. Computational methods are necessary for unbiased discovery, hypothesis generation and simulation.


LAST CENTURY: GENE-CENTRIC WORLD

Crick's central dogma is a one-way view of the genetic code and living systems.
The genome is the blueprint and book of life.
Function and disease arise from specific genes (genetic determinism).
An organism's phenotype and response to any environmental stimulus can theoretically be determined if its genome is completely characterized.
A gold rush for disease genes by the pharmaceutical industry.
Human Genome Project.

REVISION: SYSTEMS BIOLOGY

Understanding disease, development and cell function involves a complex, interwoven network of parts.
The HGP provides only a parts list, and mice and men have about the same list.
Genes and their products do not act independently, but instead are organized into functional units.
Function and disease are determined by network interactions.
Species differences arise from the orchestration of gene expression.

The objective is then to analyze relationships among biomolecules in the context of a network.

SYSTEMS BIOLOGY (CONT)

Therefore, an integrative, computational approach to biology designed to achieve a systems understanding.
Enabled by the data revolution, i.e., the omics technologies. System properties are learned from data.
Integrative means over multiple biological scales and molecular agents, hence different from the traditional reductionist approach.
At this level of complexity, mathematical modeling is indispensable for predicting system properties.
In particular, a deep understanding requires a statistical characterization: learning the likely and unlikely molecular concentrations, not just individually but collectively.

TOWARDS SYSTEMS MEDICINE

Diseases can be explained as perturbations of genetic, molecular and cellular networks.
For example, for drug design it is necessary to understand target response within the context of physiological networks.
In principle, topological and statistical signatures (e.g., likelihood ratios) can inform clinical decisions.
Consequently, some tools are in place to apply systems biology (SB) to human diseases.
Systems medicine: the application of systems biology approaches to medical research and practice.

BIOMARKERS

Something measurable which carries information about the occult state of the disease. Hence a surrogate for the state of interest. Types:
Diagnostic: screening biomarkers from serum, imaging, saliva or urine. Specificity dominates.
Prognostic: based on tumor tissue (e.g., expression, CNV). Sensitivity dominates.
Predictive of treatment outcome: based on tumor tissue.
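The terms sensitivity and specificity used for the biomarker types can be made concrete with a small calculation. The sketch below is illustrative only: the function name and the counts are hypothetical and do not come from any study.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of diseased cases the marker flags
    specificity = tn / (tn + fp)  # fraction of healthy cases the marker clears
    return sensitivity, specificity


# Hypothetical screening result: 1000 people, 10 of whom have the disease.
sens, spec = sensitivity_specificity(tp=9, fn=1, tn=940, fp=50)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.3f}")
```

In this made-up screening population, 50 of the 59 positive calls are false even though specificity is about 95%, which suggests why specificity dominates for screening markers applied to mostly healthy people.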

SPECIFIC OBJECTIVES

Discover biomarkers and biomarker interactions for disease progression which are clinically useful for early diagnosis, risk assessment, prognosis and personalized treatment.
Elucidate the pathophysiological mechanisms of multifactorial/chronic diseases (e.g., cancer, diabetes, obesity, metabolic disorders, aging).
Develop strategies for combinatorial drug therapies and screenings, test effective treatment, and predict drug effects and side effects.


CANCER DIAGNOSIS: GOLUB ET AL 1999

In the 1960s, acute leukemias were divided into acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML).
Separation was based on histochemical testing, and later on antibody-based testing. In either case, there was no single established test to make this diagnosis.
Golub et al used supervised machine learning to learn a predictor of ALL vs AML based on a signature of gene expression values.
New leukemia cases could then be classified from microarray data extracted from tissue.
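One way to picture such a predictor is a weighted vote over genes, in the spirit of the signal-to-noise weighting described by Golub et al. The sketch below is a simplified assumption rather than their published procedure: the function names and the tiny two-gene data set are invented for illustration, and no gene selection or prediction-strength threshold is modeled.

```python
from statistics import mean, stdev


def fit_weighted_voting(class_a, class_b):
    """Per-gene (weight, boundary) pairs from two lists of expression samples."""
    params = []
    for gene_a, gene_b in zip(zip(*class_a), zip(*class_b)):
        mu_a, mu_b = mean(gene_a), mean(gene_b)
        # Signal-to-noise weight: large when class means differ and spread is small.
        weight = (mu_a - mu_b) / (stdev(gene_a) + stdev(gene_b))
        boundary = (mu_a + mu_b) / 2  # midpoint between the two class means
        params.append((weight, boundary))
    return params


def predict(params, sample, labels=("ALL", "AML")):
    """Sum the per-gene votes; a positive total favors the first class."""
    total = sum(w * (x - b) for (w, b), x in zip(params, sample))
    return labels[0] if total > 0 else labels[1]


# Toy expression values (rows = patients, columns = genes); purely illustrative.
all_samples = [[5.0, 1.0], [6.0, 1.5], [5.5, 0.5]]
aml_samples = [[1.0, 4.0], [1.5, 5.0], [0.5, 4.5]]
model = fit_weighted_voting(all_samples, aml_samples)
print(predict(model, [5.2, 1.1]), predict(model, [0.8, 4.8]))
```

Each gene contributes a transparent, signed vote, so the decision rule can be inspected gene by gene.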

SUBTYPE DISCOVERY: ALIZADEH ET AL 2000

The authors measured gene expression in samples from patients diagnosed with diffuse large B-cell lymphoma, a cancer of B-lymphocytes.
Previously, such cancers were divided into low-, intermediate- and high-grade categories based on growth patterns and immunohistochemistry.
Alizadeh et al applied hierarchical clustering to these data, revealing an even split of the B-cell lymphoma samples into two subtypes.
Moreover, the patients in the two subtypes retrospectively showed significant differences in survival (Kaplan-Meier analysis).
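The idea of hierarchical clustering can be sketched in a few lines. This is a generic agglomerative procedure, not the pipeline of Alizadeh et al: average linkage with Euclidean distance is an assumed choice, the function name is hypothetical, and the six two-dimensional "samples" are toy values.

```python
from math import dist


def agglomerate(points, n_clusters=2):
    """Merge the two closest clusters (average linkage) until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]

    def average_link(a, b):
        # Mean pairwise Euclidean distance between members of clusters a and b.
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        _, p, q = min(
            (average_link(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return sorted(sorted(c) for c in clusters)


# Toy expression profiles: two visibly separated groups of three samples.
samples = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
print(agglomerate(samples))
```

No labels are used at any point; the two groups emerge from the distances alone, which is what distinguishes subtype discovery from the supervised setting of the previous slide.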

THERAPEUTICS: BUTTE ET AL 2000

Generated hypotheses about functional relationships between pairs of genes and pharmaceuticals. Two databases:
Baseline microarray measurements of 6701 genes in a standardized set of 60 human cancer cell lines.
Drug susceptibility measurements for the same cell lines, across nearly 5000 anti-cancer agents.

Used mutual information to build a graph of associations between baseline RNA expression levels and inhibition of growth by thousands of anti-cancer agents.
Discovered a previously unknown association between a gene and a measure of anti-cancer agent susceptibility.
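Mutual information between a gene and a drug response can be estimated from discretized measurements across cell lines. The following is a minimal sketch under stated assumptions: the plug-in estimator and the median split are illustrative choices, not necessarily those of Butte et al, and the function names and values are hypothetical.

```python
from collections import Counter
from math import log2
from statistics import median


def binarize(values):
    """Discretize a continuous profile into low (0) / high (1) at its median."""
    cut = median(values)
    return [int(v > cut) for v in values]


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete observations."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


# Hypothetical expression and growth-inhibition values over eight cell lines.
expression = [0.2, 0.4, 0.1, 0.3, 2.1, 2.5, 1.9, 2.2]
inhibition = [5.0, 4.8, 5.2, 5.1, 1.0, 0.8, 1.2, 0.9]
print(mutual_information(binarize(expression), binarize(inhibition)))
```

Unlike correlation, this measure is symmetric in sign: the perfectly anti-correlated toy profiles above carry a full bit of shared information. A graph of associations would keep only gene-drug pairs whose estimate exceeds a chosen threshold.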

HISTOPATHOLOGY: RAMASWAMY ET AL 2001

Histopathology: study of diseased tissue by sectioning, staining, and multi-resolution microscopy.
Computational methods can determine whether such samples quantitatively resemble a disease.
Using support vector machines, Ramaswamy et al, 2001, predicted the original source of a cancer given just a metastatic sample.


BUT NOT YET

Despite these promising beginnings and post-HGP technological advances, and with few exceptions, the results to date from computational learning are not sufficiently accurate or reproducible for clinical use.
In particular, the effect of omics data on drug discovery and the identification of novel, more effective therapies has been limited.
And it remains unclear how exactly to extract useful medical knowledge from experimental data, where "useful" means achieving higher precision and patient benefit than can currently be obtained with traditional clinical practice.

DATA

D: High-dimensional, high-throughput genomic data.
The traditional approach, experimental and molecule-by-molecule, is not feasible for this level of complexity.
A principled mathematical approach has become indispensable for extracting knowledge from D.
For example, over the last decade statistical learning has emerged as a core methodology for the analysis of D.

BARRIER I: TECHNOLOGICAL

Collecting D usually requires an invasive procedure.
D is often of low quality, degraded by lab and batch effects.
The number of samples in D is usually insufficient to represent the populations under study, and too small to generate results which are statistically robust and consistent across studies and devices.

BARRIER II: TRANSLATIONAL

With off-the-shelf statistical learning techniques, the learned models and decision rules usually involve nonlinear functions of a great many variables.
Consequently, it is difficult to look under the hood, yet there is added value in transparency for knowledge discovery and treatment design.

BARRIER III: MATHEMATICAL

The mathematical challenges are formidable because $n$, the number of samples, is very small relative to $d$, the number of molecular species assayed.
Ideally, $n \gg d$. In practice, $n \ll d$.
Hence, in view of trade-offs between bias and variance, incorporating rich a priori knowledge to constrain the representations of D may be unavoidable.

TYPICAL CASES

Detect disease phenotypes from microarray data with d = 10,000 transcripts and n = 100 (or fewer) patients.
Measure the degree of phenotypic regulation in pathways with d = 100 genes for n = 100 samples.
Infer the statistics (and possibly the wiring diagram) of signaling and gene regulatory networks with d = 100 variables (genes, proteins) and n = 100 samples.

SMALL N, LARGE D: ONE CONSEQUENCE

Determining significance of findings is problematic. Must compensate for massive testing.
Example: Using microarrays obtained for two phenotypes A, B, test the hypothesis of no difference in distribution for each of 40,000 genes (transcripts).
This generates a p-value for each gene; take 0.05 as the threshold.
Then even if A = B, on average you "discover" 40,000 × 0.05 = 2000 differentially expressed genes.
Motivates a deeper statistical analysis, including false discovery rates.
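The 2000 spurious discoveries can be reproduced with a short simulation. The sketch assumes that p-values are uniformly distributed when A = B, and it uses the Benjamini-Hochberg step-up rule as one standard false discovery rate procedure; the slide does not say which procedure the series adopts, and the function name is hypothetical.

```python
import random


def benjamini_hochberg(p_values, q=0.05):
    """Number of hypotheses rejected by the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    rejected = 0
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= q * rank / m:
            rejected = rank  # keep the largest rank that passes its threshold
    return rejected


rng = random.Random(0)
m = 40_000
null_p = [rng.random() for _ in range(m)]  # one p-value per gene, no true effect

naive = sum(p < 0.05 for p in null_p)      # per-gene threshold of 0.05
controlled = benjamini_hochberg(null_p)    # FDR controlled at 0.05
print(naive, controlled)
```

The naive count lands near 2000, while the step-up rule typically rejects nothing when no gene is truly different.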

EXAMPLE: IMPROVED PREVENTIVE CARE

Standard lab report (e.g., from blood or urine) uses biomarkers $X_1, X_2, \ldots$:
If $X_i > \tau$, check out diseases a, b.
Measuring interactions among genetic markers and molecular concentrations can reveal more information:
If $g(X_i, X_j, X_k) > \tau$, check ...
But what is the right set $\{g\}$ of functions, models or combinatorial logic? What will small n, large d allow?

EXAMPLE: PERSONALIZED CANCER TREATMENT

Standard treatment (prognosis, choice of drugs) is based on population statistics.
But there is massive diversity among tumors of the same general category due to differing mutational signatures.
How rich a sub-categorization can we afford based on the available data?

EXAMPLE: MOLECULAR DIAGNOSIS OF DISEASE

Diseases are often the result of perturbed biomolecular networks, leading to differences in the abundances of biomolecules (e.g., mRNA, proteins, metabolites).
Analyzing these differences enables learning predictors of disease presence, status and response to treatment.
In particular, transcriptomics provides the global mRNA expression of a particular tissue, exposing transcriptional differences among diseases.
What marker interactions do these data reveal?

DANGEROUS LIAISONS

Computational biology, vision and speech:
High-dimensional data (large d).
A pattern discovery core.

But importing methodology to biology is dangerous:
Our n is orders of magnitude smaller.
Black boxes are often a liability.

Camera face detectors were developed with more samples than the total number of (publicly available) human gene chips (roughly n = 45,000 for d = 50,000 transcripts).


DNA

DNA is a double helix (right-handed).

PACKAGING IN THE NUCLEUS

The human genome is about 1.8 meters long and the nucleus of a cell is $6 \times 10^{-6}$ meters in diameter.
DNA is carefully packaged into the nucleus in a regulated way; packaging is correlated with gene expression and therefore phenotype.

RNA

Ribonucleic acid.
Single-stranded sugar-phosphate backbone with nucleotides.
Uracil (U) instead of thymine (T).
Can bind to DNA or RNA.
Roles: information carrier, regulatory, enzymatic.

TYPES OF RNA

mRNA: takes info from DNA and encodes proteins
rRNA: platform and environment for protein synthesis
tRNA: brings amino acids in to form proteins
miRNA: regulatory and other functions
siRNA: regulatory and other functions
ribozymes: huge array of functions

CENTRAL DOGMA

TRANSCRIPTION: DNA TO RNA

Lots of regulation required.
DNA in the nucleus is very compact.
RNA polymerase needs associated factors in order to bind.

TRANSLATION: RNA TO PROTEIN

Translation: mRNA specifies a protein.

GENETIC CODE: MRNA SEQUENCE TO PROTEIN SEQUENCE

Amino acids are encoded by triplets of nucleotides called codons.
The code is non-overlapping and comma-free.
It is also redundant: there are 64 possible codons but only 20 amino acids, plus three stop codons.
The start codon is AUG (methionine).
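The codon rules above translate directly into a lookup table. The sketch builds the standard genetic code from its conventional one-letter layout (bases ordered U, C, A, G) and reads an mRNA from the first AUG to the first stop codon; the function name and the example sequence are hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, with '*' marking stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(mrna):
    """Translate from the first AUG up to (not including) the first stop codon."""
    start = mrna.find("AUG")
    if start < 0:
        return ""  # no start codon, nothing to translate
    protein = []
    for i in range(start, len(mrna) - 2, 3):  # non-overlapping triplets
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("GGAUGGCCUAAUU"))
```

The redundancy is visible in the table itself: 61 codons map onto 20 letters, so several codons share each amino acid.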

PROTEINS

Polymers with 20 amino acids as building blocks.
No complementary pairing.
Perform virtually all work in the organism: enzymes, transport, signaling.
Proteins come in many different shapes and sizes.

CHROMOSOMES

Each chromosome is a single long DNA molecule.
Species have different chromosome numbers and layouts.
Prokaryotes: one single circular chromosome, no nucleus.
Viruses: lots of little pieces.
Humans: 22 chromosomes plus sex chromosomes (X and Y) (haploid numbers).
Humans (like most eukaryotes) are diploid.

[Figure labels: mom, dad, kid]

EPIGENETICS

Means "outside the gene."
The epigenome controls cell-type-specific behaviors.
Epigenetic marks and modifications do not alter the DNA sequence.
These marks have profound functional consequences and are heritable, and responsible for imprinting.
They can cause or be altered by disease.
Examples:
Cytosine methylation (CpG)
Histone modifications (methylation, acetylation, etc.)
Other nucleotide modifications (hydroxy-A, hydroxy-C, etc.)

EPIGENETICS IN CANCER

Cancer cells show serious disruption in overall methylation; generally hypomethylated but extremely patchy.
Methylation suppresses transcription and is important in gene regulation (and transposon regulation).
Dysregulation of epigenetic marks causes large-scale changes in gene expression.
Successful drugs have come from HDAC inhibitors and DNMT inhibitors.
Methylation status is obtained from microarrays and sequencing technologies.



CANCER

A disease of the genes due to the accumulation of genetic alterations over time that leads to uncontrolled cell growth and proliferation. An acquired genetic disorder.
Ninety percent of deaths result from metastasis, meaning that cancer cells migrate to distant organs and replace normal cells until the organ no longer functions.
Differences in age at the onset of cancer reflect different latency periods of the various types of cancer.

FITNESS OF CANCER CELLS

Ability to proliferate.
Propensity to invade: break away from the tumor and enter surrounding tissues.
Ability to metastasize: spread to a non-adjacent body organ through the blood stream.
Resistance to drugs and therapies, e.g., insensitivity to drug-induced apoptosis.

MUTATIONS

In a group of replicating cells, the probability of a mutation arising from DNA copying is $10^{-9}$ per nucleotide per cell.
Should a cell acquire a mutation in a gene that confers a growth advantage, it will begin to outgrow its brethren. More replication events mean more mutation events.
Subsequent hits can affect whether or not cells die when they are supposed to (a process called apoptosis), and cells that do not die can replicate more.
When members of the clonal population have accrued enough mutations, they will experience gross changes in morphology and longevity, and they will be able to break away to lodge in other organs (a process called metastasis).

PICTURES

[Figure panels: Normal Colon, Adenoma, Carcinoma]
Melstrom et al, doi:10.1158/1078-0432.CCR-07-4631

Cancer can be observed histologically as deviation from normal morphology and biochemically as deviation from normal gene expression. Here you can see the gradual dysplasia of cells progressing from normal colon cells to carcinoma. Cells here are stained for expression of genes involved in cell growth and magnified 20X. More brown means higher expression of a growth gene.

PROGRESSION

Alberts et al, Molecular Biology of the Cell

In a group of replicating cells, the probability of a mutation arising from DNA copying is $10^{-9}$ per nucleotide per cell. Should a cell acquire a mutation in a gene that confers a growth advantage, it will begin to outgrow its brethren. More replication events mean more mutation events. Subsequent hits can affect whether or not cells die when they are supposed to (a process called apoptosis), and cells that don't die can replicate more. When members of the clonal population have accrued enough mutations, they will experience gross changes in morphology and longevity, and they will be able to break away to lodge in other organs (a process called metastasis). In this process many mutations may be acquired that neither help nor slow cancer development: they are passenger mutations, noise that adds to the difficulty of identifying cancer processes.


CANCER INVASION AND METASTASIS

Cell surface of a liver showing multiple metastatic nodules originating from pancreatic cancer.

TYPES OF GENES IMPLICATED

Oncogenes: Protein-coding genes that are up-regulated in cancer. Mutations render these genes constitutively active.
Tumor Suppressor Genes: Protein-coding genes that are down-regulated in cancer. Mutations reduce the activity of their gene products.
Genetic Instability Genes: Responsible for repairing subtle mistakes due to DNA replication or exposure to carcinogens, e.g., mismatch repair, nucleotide-excision repair, base-excision repair.

TYPES (CONT)

Tumor Suppressor Genes
loss of function ~ no brakes
APC, p53, RB1, NF1

Oncogenes
gain of function ~ stuck gas pedal
K-ras, RET, KIT, MET

Genetic Instability Genes
loss of function ~ missing mechanic
BRCA1, DNA repair genes

Tumor suppressor genes and oncogenes act on growth advantage; genetic instability genes act on mutation rate.

TUMORIGENESIS

Genetic Progression and the Waiting Time to Cancer
Niko Beerenwinkel (1), Tibor Antal (1), David Dingli (1), Arne Traulsen (1), Kenneth W. Kinzler (2), Victor E. Velculescu (2), Bert Vogelstein (2, 3), Martin A. Nowak (1)
(1) Program for Evolutionary Dynamics, Harvard University, Cambridge, Massachusetts, United States of America
(2) Ludwig Center, Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins, Baltimore, Maryland, United States of America
(3) Howard Hughes Medical Institute, Johns Hopkins University, Baltimore, Maryland, United States of America

Cancer results from genetic alterations that disturb the normal cooperative behavior of cells. Recent high-throughput genomic studies of cancer cells have shown that the mutational landscape of cancer is complex and that individual cancers may evolve through mutations in as many as 20 different cancer-associated genes. We use data published by Sjöblom et al. (2006) to develop a new mathematical model for the somatic evolution of colorectal cancers. We employ the Wright-Fisher process for exploring the basic parameters of this evolutionary process and derive an analytical approximation for the expected waiting time to the cancer phenotype. Our results highlight the relative importance of selection over both the size of the cell population at risk and the mutation rate. The model predicts that the observed genetic diversity of cancer genomes can arise under a normal mutation rate if the average selective advantage per mutation is on the order of 1%. Increased mutation rates due to genetic instability would allow even smaller selective advantages during tumorigenesis. The complexity of cancer progression can be understood as the result of multiple sequential mutations, each of which has a relatively small but positive effect on net cell growth.

Balaji Veeramani & Sarah Richardson, 550.635 Topics in Bioinformatics, Monday September 13, 2010

NUMBER OF MUTATIONS PER GENE IN TISSUE SAMPLE

A panel of 35 tumors and 78 genes ordered by frequency of mutation in colon cancers.
The number of driver genes that must acquire mutations is small for colon cancer.

Figure 2. Mutational Patterns in 35 Late-Stage Colorectal Cancer Tumors from Sjöblom et al. (2006). Matrix rows are indexed by tumors, columns are indexed by cancer-associated genes as identified by Sjöblom et al. (2006). Dark spots indicate mutated genes. Both tumors and genes have been sorted by an increasing number of mutations. The three genes mutated most often, labeled in the figure, are APC (in 24 tumors), p53 and K-ras.


WRIGHT-FISHER MODEL

At time $t$, the number of cells is $N(t)$.
Cells are in one of $d + 1$ states, i.e., cells without mutation, with one mutation, etc., up to $d$ mutations.
$N_k(t)$: the number of cells at time $t$ with $k$ (driver) mutations, $k = 0, \ldots, d$, so $N_0(t) + \cdots + N_d(t) = N(t)$.
In the Wright-Fisher model, at time $t$ all cells in the current population disappear, and are replaced with the new generation. The process is Markov in an appropriate sense.
Question: when does a $k$-mutant cell first arise in the population?

WF (CONT)

Let $\mathbf{N}(t) = (N_0(t), \ldots, N_d(t))$.
All cells are generated independently at each generation.
Assume that the population size follows a deterministic evolution.
Let $\pi(k \mid \mathbf{N}(t))$ be the conditional probability that a cell is in state $k$ at time $t + 1$ given $\mathbf{N}(t)$. Then

$$P(\mathbf{N}(t+1) \mid \mathbf{N}(t)) = \frac{N(t+1)!}{N_0(t+1)! \cdots N_d(t+1)!} \prod_{k=0}^{d} \pi(k \mid \mathbf{N}(t))^{N_k(t+1)}.$$


WF (CONT)

Suppose each cell picks its parent prototype at random. Selective advantage is modeled by assigning weights $(w_0, \ldots, w_d)$ to the parents:

$$\pi(k \mid \mathbf{N}(t)) = \frac{w_k N_k(t)}{w_0 N_0(t) + \cdots + w_d N_d(t)}.$$

Let $w_k = (1 + s)^k$, where $s$ is the selective advantage.
Let $u$ be the mutation rate, i.e., the probability for each locus to mutate from one generation to the next. Then

$$\pi(k \mid \mathbf{N}(t)) = \sum_{j=0}^{k} \frac{(d-j)!}{(k-j)!\,(d-k)!}\, u^{k-j} (1-u)^{d-k}\, \frac{w_j N_j(t)}{w_0 N_0(t) + \cdots + w_d N_d(t)}.$$

Finally, $N(t+1) = (1 + \alpha) N(t)$, where $\alpha$ is the growth rate of the population per generation.
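One generation of the process with selection and mutation can be sketched as follows. This is an illustrative Python stand-in, not the MATLAB code shown in the lecture: the function names are hypothetical, `random.choices` is used as a simple multinomial sampler, and the population size and rates are deliberately small toy values so that the loop runs quickly.

```python
import random
from math import comb


def next_state_probs(counts, s, u):
    """Probability that a next-generation cell has k mutations, k = 0..d."""
    d = len(counts) - 1
    weights = [(1 + s) ** j * n for j, n in enumerate(counts)]  # w_j * N_j(t)
    total = sum(weights)
    return [
        sum(
            # Parent of type j, then k - j of its d - j intact loci mutate.
            comb(d - j, k - j) * u ** (k - j) * (1 - u) ** (d - k) * weights[j] / total
            for j in range(k + 1)
        )
        for k in range(d + 1)
    ]


def step(counts, s, u, growth, rng):
    """Replace the whole population by a multinomial draw of the new generation."""
    n_next = round((1 + growth) * sum(counts))  # deterministic population size
    probs = next_state_probs(counts, s, u)
    new_counts = [0] * len(counts)
    for k in rng.choices(range(len(counts)), weights=probs, k=n_next):
        new_counts[k] += 1
    return new_counts


rng = random.Random(1)
d = 5
counts = [1000] + [0] * d  # start from a homogeneous wild-type population
for generation in range(200):
    counts = step(counts, s=0.05, u=1e-3, growth=0.0, rng=rng)
print(counts)
```

Sampling cell by cell keeps the sketch dependency-free but scales linearly with the population size; a dedicated multinomial sampler would be needed for populations of $10^6$ to $10^9$ cells.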

MATLAB SIMULATION

A sample MATLAB code and simulation: Wright-Fisher with mutation and selection.
Portions of code omitted for illustration.
The MATLAB command 'mnrnd' is very sensitive to floating-point effects.

Excerpt from Beerenwinkel et al., where $x_i$ denotes the relative abundance of cells with $i$ mutations:

$$\theta_j = \sum_{i=0}^{j} \binom{d-i}{j-i} u^{j-i} (1-u)^{d-j} \frac{(1+s)^i x_i}{\sum_{l=0}^{d} (1+s)^l x_l}.$$

The parameter $\theta_j$ is the probability that a cell in the next generation will have $j$ mutations. If the mutation rate is small, $u \ll 1$, we can neglect multiple mutations, and $\theta_j$ simplifies to

$$\theta_j = \frac{(1+s)^j x_j}{\sum_{l=0}^{d} (1+s)^l x_l} + u (d - j + 1) \frac{(1+s)^{j-1} x_{j-1}}{\sum_{l=0}^{d} (1+s)^l x_l}.$$

The first term is the probability to produce an additional cell of type $j$ without mutation, while the second term accounts for a cell of type $j - 1$ acquiring one additional mutation.

EXPECTED WAITING TIME FOR CANCER

$$t_k = k \, \frac{\left( \log \frac{s}{ud} \right)^2}{s \, \log (N_{\mathrm{init}} N_{\mathrm{fin}})}$$

N = number of cells in the population ($N_{\mathrm{init}}$ at the start, $N_{\mathrm{fin}}$ at the end)
s = constant selective advantage, s > 0
u = constant mutation rate per gene
d = number of driver genes considered
k = number of driver genes with mutations

TRAVELING WAVE

$u = 10^{-7}$, $N_{\mathrm{init}} = 10^{6}$, $N_{\mathrm{fin}} = 10^{9}$, $s = 0.01$, $d = 100$ (Figure 3): for $k = 20$, $t_k$ will be between 5 and 15 years.

A single simulation. The first mutations in a homogeneous wild-type population set off a traveling wave. Each class has a Gaussian distribution; turnover is 1 cell division per cell per day.
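The waiting-time formula can be checked numerically against the quoted range of 5 to 15 years. The sketch assumes the form $t_k = k \, (\log \frac{s}{ud})^2 / (s \log(N_{\mathrm{init}} N_{\mathrm{fin}}))$ with natural logarithms, the parameter set $u = 10^{-7}$, $N_{\mathrm{init}} = 10^{6}$, $N_{\mathrm{fin}} = 10^{9}$, $s = 0.01$, $d = 100$, and one generation per day; the function name is hypothetical.

```python
from math import log


def expected_waiting_time(k, s, u, d, n_init, n_fin):
    """Approximate number of generations until a cell with k driver mutations appears."""
    return k * log(s / (u * d)) ** 2 / (s * log(n_init * n_fin))


generations = expected_waiting_time(k=20, s=0.01, u=1e-7, d=100, n_init=1e6, n_fin=1e9)
years = generations / 365  # turnover of one cell division per cell per day
print(f"{generations:.0f} generations, about {years:.1f} years")
```

Under these assumptions the result is roughly 2800 generations, or between 7 and 8 years, which sits inside the stated range. Because $s$ appears both in the denominator and inside the squared logarithm, the waiting time is far more sensitive to the selective advantage than to the population sizes, which enter only through a logarithm.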
