
CS109 / Stat121 / AC209 / E-109
Data Science

Markov Chain Monte Carlo

Hanspeter Pfister & Joe Blitzstein
pfister@seas.harvard.edu / blitzstein@stat.harvard.edu

Note that

    q_ij^(2) = Σ_k q_ik q_kj,    (2)

since to get from i to j in two steps, the chain must go from i to some intermediary state k, and then from k to j (these transitions are independent because of the Markov property). So the matrix Q^2 gives the 2-step transition probabilities. Similarly (by induction), powers of the transition matrix give the n-step transition probabilities: q_ij^(n) is the (i, j) entry of Q^n.

This Week
HW3 due next Thursday (Oct 17) at 11:59 pm
start now!

Friday lab 10-11:30 am in MD G115


What is a Markov Chain?

Chain with 4 states:

[Figure 1: A Markov Chain with 4 Recurrent States]

The chain can be visualized by thinking of a particle wandering around from state to state, if at each stage an arrow is followed uniformly at random. Here we assume that if there are a arrows originating at state i, then each is chosen with probability 1/a, but in general each arrow could be given any probability, such that the sum of the probabilities on all arrows leaving i is 1. The transition matrix of the chain shown above is

        ( 1/3  1/3  1/3   0  )
    Q = (  0    0   1/2  1/2 )
        (  0    1    0    0  )
        ( 1/2   0    0   1/2 ).

To compute, say, the probability that the chain is in state 3 after 5 steps, starting at state 1, we would look at the (1, 3) entry of Q^5.
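This matrix-power computation can be sketched in a few lines of pure Python (the multiplication helpers are written out so the snippet is self-contained):

```python
# Transition matrix of the 4-state chain above (each row sums to 1).
Q = [
    [1/3, 1/3, 1/3, 0.0],
    [0.0, 0.0, 1/2, 1/2],
    [0.0, 1.0, 0.0, 0.0],
    [1/2, 0.0, 0.0, 1/2],
]

def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(A, p):
    """Compute A**p by repeated multiplication (p >= 1)."""
    result = A
    for _ in range(p - 1):
        result = mat_mul(result, A)
    return result

Q5 = mat_pow(Q, 5)
# P(X5 = 3 | X0 = 1) is the (1, 3) entry of Q^5 (row 1, column 3, 1-indexed).
print(Q5[0][2])
```

Each row of Q^5 is itself a probability distribution over states, so it sums to 1.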
Definition of Markov Chain

To see where the Markov model comes from, consider first an i.i.d. sequence of random variables X0, X1, ..., Xn, ... where we think of n as time. Independence is a very strong assumption: it means that the Xj's provide no information about each other. At the other extreme, allowing general interactions between the Xj's makes it very difficult to compute even basic things. Markov chains are a happy medium between complete independence and complete dependence.
The space on which a Markov process lives can be either discrete or continuous, and time can be either discrete or continuous. In Stat 110, we will focus on Markov chains X0, X1, X2, ... in discrete space and time (continuous time would be a process Xt defined for all real t ≥ 0). Most of the ideas can be extended to the other cases. Specifically, we will assume that Xn takes values in a finite set (the state space), which we usually take to be {1, 2, ..., M} (or {0, 1, ..., M} if it is more convenient). In Stat 110, we will always assume that our Markov chains are on finite state spaces.

Definition 1. A sequence of random variables X0, X1, X2, ... taking values in the state space {1, ..., M} is called a Markov chain if there is an M by M matrix Q = (q_ij) such that for any n ≥ 0,

    P(X_{n+1} = j | X_n = i, X_{n-1} = i_{n-1}, ..., X_0 = i_0) = P(X_{n+1} = j | X_n = i) = q_ij.

The matrix Q is called the transition matrix of the chain, and q_ij is the transition probability from i to j.
Example: Birth-Death Chain

From j, the chain can only go to j-1 or j+1, or stay at j (at the boundaries, only 2 of these are possible). For example, the chain below is a birth-death chain if the labeled transitions have positive probabilities (except for the loops from a state to itself, which are allowed to have 0 probability).

        q(1,2)    q(2,3)    q(3,4)    q(4,5)
    1  <----->  2  <----->  3  <----->  4  <----->  5
        q(2,1)    q(3,2)    q(4,3)    q(5,4)

    (with self-loop probabilities q(1,1), q(2,2), q(3,3), q(4,4), q(5,5))

We will now show that any birth-death chain is reversible, and construct the stationary distribution. Let s1 be a positive number (to be specified later). Since we want s1 q12 = s2 q21, let s2 = s1 q12 / q21. Then since we want s2 q23 = s3 q32, let s3 = s2 q23 / q32 = s1 q12 q23 / (q32 q21). Continuing in this way, let

    s_j = s1 q12 q23 ... q_{j-1,j} / (q_{j,j-1} q_{j-1,j-2} ... q21),

for all states j with 2 ≤ j ≤ M. Choose s1 so that the s_j's sum to 1. Then the chain is reversible with respect to s, since q_ij = q_ji = 0 if |i - j| ≥ 2 and by construction s_i q_ij = s_j q_ji if |i - j| = 1. Thus, s is the stationary distribution.
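The construction above is easy to check numerically. Here is a sketch for a hypothetical 5-state birth-death chain; the q values below are made up, subject only to being positive:

```python
# Made-up birth-death transition probabilities for a 5-state chain.
M = 5
q_up   = {1: 0.5, 2: 0.4, 3: 0.3, 4: 0.2}   # q(j, j+1)
q_down = {2: 0.3, 3: 0.3, 4: 0.4, 5: 0.5}   # q(j, j-1)

# Build s_j by the telescoping recipe: s_j = s_{j-1} * q(j-1, j) / q(j, j-1).
s = {1: 1.0}
for j in range(2, M + 1):
    s[j] = s[j - 1] * q_up[j - 1] / q_down[j]

# Choose s1 so that the s_j sum to 1 (i.e., normalize).
total = sum(s.values())
s = {j: v / total for j, v in s.items()}

# Check reversibility on each neighboring pair: s_i q(i, i+1) = s_{i+1} q(i+1, i).
for j in range(1, M):
    assert abs(s[j] * q_up[j] - s[j + 1] * q_down[j + 1]) < 1e-12

print(s)
```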
Application: DNA Sequence Analysis, CpG Islands

A Markov chain can be depicted graphically as a collection of states, each of which corresponds to a particular residue, with arrows between the states. A Markov chain for DNA can be drawn like this:

    A   T

    C   G

(source: Durbin et al, Biological Sequence Analysis)

Here we see a state for each of the four letters A, C, G, and T in the DNA alphabet. A probability parameter is associated with each arrow in the figure, which determines the probability of a certain residue following another residue, or one state following another state. These probability parameters are called the transition probabilities, which we will write as a_st:

    a_st = P(x_i = t | x_{i-1} = s).    (3.1)
In C-G dinucleotides, the C often mutates to a T due to methylation. In a CpG island the methylation is suppressed.

The transition probabilities for each model were set using the equation

    a+_st = c+_st / Σ_{t'} c+_{st'},    (3.3)

and its analogue for a-_st, where c+_st is the number of times letter t followed letter s in the labelled regions. These are the maximum likelihood (ML) estimators for the transition probabilities, as described in Chapter 1. (In this case there were almost 60 000 nucleotides, and ML estimators are adequate. If the number of counts of each type had been small, then a Bayesian estimation process would have been more appropriate, as discussed in Chapter 11 and below for HMMs.) The resulting tables are

    +    A      C      G      T          -    A      C      G      T
    A    0.180  0.274  0.426  0.120      A    0.300  0.205  0.285  0.210
    C    0.171  0.368  0.274  0.188      C    0.322  0.298  0.078  0.302
    G    0.161  0.339  0.375  0.125      G    0.248  0.246  0.298  0.208
    T    0.079  0.355  0.384  0.182      T    0.177  0.239  0.292  0.292

where the first row in each case contains the frequencies with which an A is followed by each of the four bases, and so on for the other rows, so each row sums to one. These numbers are not the same; for example, G following A is more common than T following A. Notice also that the tables are asymmetric: in both tables the probability for G following C is lower than that for C following G, although the effect is stronger in the - table, as expected.

(source: Durbin et al)

Now we can use likelihood ratios to decide whether a given sequence was from a CpG island or the rest of the ocean!
To use these models for discrimination, we calculate the log-odds ratio

    S(x) = log [ P(x | model+) / P(x | model-) ]
         = Σ_{i=1}^{L} log ( a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i} )
         = Σ_{i=1}^{L} β_{x_{i-1} x_i},

where x is the sequence and the β_{x_{i-1} x_i} are the log likelihood ratios of the corresponding transition probabilities. The first row of the table for β, in bits:

    β    A       C      G      T
    A    -0.740  0.419  0.580  -0.803

(source: Durbin et al)
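As a sketch of this discrimination rule, S(x) can be computed directly from the two tables above; the example sequences below are made up, and the score is in bits (log base 2):

```python
import math

bases = "ACGT"
# Transition tables from Durbin et al: a+_st (CpG island) and a-_st (background).
plus = [[0.180, 0.274, 0.426, 0.120],
        [0.171, 0.368, 0.274, 0.188],
        [0.161, 0.339, 0.375, 0.125],
        [0.079, 0.355, 0.384, 0.182]]
minus = [[0.300, 0.205, 0.285, 0.210],
         [0.322, 0.298, 0.078, 0.302],
         [0.248, 0.246, 0.298, 0.208],
         [0.177, 0.239, 0.292, 0.292]]

def log_odds(x):
    """Sum log2(a+_st / a-_st) over consecutive pairs (s, t) in the sequence x."""
    score = 0.0
    for s, t in zip(x, x[1:]):
        i, j = bases.index(s), bases.index(t)
        score += math.log2(plus[i][j] / minus[i][j])
    return score

print(log_odds("CGCGCG"))   # CpG-rich: positive score
print(log_odds("ATATAT"))   # negative score
```

Positive scores point to the + model (CpG island), negative scores to the - model.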
Google PageRank (Page-Brin)

Imagine someone randomly surfing the web...

How important a page is depends not only on how many other pages link to it, but also on how important those pages are!
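A random-surfer sketch of this idea: run power iteration on a small hypothetical link graph, with the usual damping factor of 0.85 (the link structure below is invented for illustration):

```python
# Hypothetical link graph: page -> list of pages it links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n, d = 4, 0.85                                 # pages, damping factor

rank = [1.0 / n] * n                           # start uniform
for _ in range(100):                           # power iteration
    new = [(1 - d) / n] * n                    # teleport with probability 1 - d
    for page, outs in links.items():
        for target in outs:                    # follow a random outgoing link
            new[target] += d * rank[page] / len(outs)
    rank = new

# Page 2 is linked from pages 0, 1, and 3, so it ends up ranked highest.
print(rank)
```

The result is the stationary distribution of the random surfer's Markov chain.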
Markov Chain Monte Carlo (MCMC)

Key idea: simulate complicated distributions and


approximate hard-to-compute averages by designing
and running a Markov chain!

The chain is constructed so that its stationary


distribution (the distribution it converges to in the
long run) is the desired distribution.

But why isn't that at least as hard as the original problem?


Darwin's Finches

http://en.wikipedia.org/wiki/Darwin's_finches

Darwin's Finches

"Seeing this gradation and diversity of structure in one small, intimately related group of birds, one might really fancy that from an original paucity of birds in this archipelago, one species had been taken and modified for different ends." - Charles Darwin
Darwin's Finches and Binary Tables

Island
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 Total
1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 14
2 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 0 0 13
3 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 14
4 0 0 1 1 1 0 0 1 0 1 0 1 1 0 1 1 1 10
5 1 1 1 0 1 1 1 1 1 1 0 1 0 1 1 0 0 12
Species

6 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 2
7 0 0 1 1 1 1 1 1 1 0 0 1 0 1 1 0 0 10
8 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1
9 0 0 1 1 1 1 1 1 1 1 0 1 0 0 1 0 0 10
10 0 0 1 1 1 1 1 1 1 1 0 1 0 1 1 0 0 11
11 0 0 1 1 1 0 1 1 0 1 0 0 0 0 0 0 0 6
12 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2
13 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 17
Total 4 4 11 10 10 8 9 10 8 9 3 10 4 7 9 3 3 122

Table 1.1: Presence of 13 finch species (rows) on 17 islands (columns). A 1 in entry (i, j) indicates that species i was observed on island j. Data are from Sanderson (2000).

Jared Diamond defined a checkerboard as a pair of species that never co-occur on an island. Here there are 10 checkerboards out of 78 possible pairs. Is that a lot or a little?
Darwin's Finches: A Monte Carlo Algorithm

Given these data, we might be interested in knowing whether the pattern of 0s and 1s observed in the table is anomalous in some way. For example, does there appear to be dependence between the rows and columns? Do some pairs of species frequently occur together on the same islands, more often than one would expect by chance? These patterns may shed light on the dynamics of inter-species cooperation or competition. One way to test for such patterns is by looking at a lot of random tables with the same row and column sums as the observed table, to see how the observed table compares to the random ones. This is a common technique in statistics known as a goodness-of-fit test.

But how do we generate random tables with the same row and column sums as Table 1.1? The number of tables satisfying these constraints is impossible to enumerate. Here's where MCMC comes to the rescue: we'll create a Markov chain on the space of all tables with these row and column sums, whose stationary distribution is uniform over all such tables.

One move of a Markov chain that preserves row and column sums:

1. Pick 2 random rows and 2 random columns.

2. If the resulting 2 x 2 submatrix has one of the two patterns

       0 1        1 0
       1 0   or   0 1

   then switch to the opposite pattern with probability 1/2; otherwise stay at the current state.
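The move above can be sketched as follows; `checkerboards` counts Diamond-style checkerboard pairs so the statistic can be tracked along the chain (the function names are our own):

```python
import random

def checkerboards(table):
    """Count pairs of rows (species) that never share a 1 in any column."""
    count = 0
    for i in range(len(table)):
        for j in range(i + 1, len(table)):
            if all(not (a and b) for a, b in zip(table[i], table[j])):
                count += 1
    return count

def swap_move(table, rng=random):
    """One chain move: maybe flip a checkerboard 2x2 submatrix, in place."""
    r1, r2 = rng.sample(range(len(table)), 2)       # 2 random rows
    c1, c2 = rng.sample(range(len(table[0])), 2)    # 2 random columns
    sub = (table[r1][c1], table[r1][c2], table[r2][c1], table[r2][c2])
    # Only the patterns 0 1 / 1 0 and 1 0 / 0 1 can be flipped without
    # changing any row or column sum.
    if sub in ((0, 1, 1, 0), (1, 0, 0, 1)) and rng.random() < 0.5:
        table[r1][c1], table[r1][c2] = table[r1][c2], table[r1][c1]
        table[r2][c1], table[r2][c2] = table[r2][c2], table[r2][c1]
```

Running many such moves from Table 1.1 and recording `checkerboards` after each gives the null distribution against which the observed count of 10 can be compared.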
Hidden Markov Models (HMM)

Many applications in biology (e.g., identifying special regions in


a long sequence, sequence alignment for two sequences), in
speech recognition, and elsewhere.

Assume there is an underlying Markov chain running but that we can't directly observe the sequence of states. Instead, we observe emissions released just after each step in the chain.

s1 s2 s3 s4 hidden

x1 x2 x3 x4 observed
HMM Application: CpG Islands

[Figure 3.2: The histogram of the length-normalised scores for all the sequences. CpG islands are shown with dark grey and non-CpG with light grey.]

We expect CpG islands to stand out with positive values. However, this is somewhat unsatisfactory if we believe that in fact CpG islands have sharp boundaries, and are of variable length. Why use a window size of 100? A more satisfactory approach is to build a single model for the entire sequence that incorporates both.

    A+  C+  G+  T+        A-  C-  G-  T-

Figure 3.3: An HMM for CpG islands. In addition to the transitions shown, there is also a complete set of transitions within each set, as in the earlier simple Markov chains. (source: Durbin et al)

The challenging part is that we can only observe the sequence of As, Cs, Gs, Ts, not whether the state was an island (+) or non-island (-).
HMM Application: Speech Recognition
Three Fundamental Questions for HMMs

1. Find p(x), the probability of the observed sequence.


2. Find the most likely state sequence s, given the data x.
3. Estimate the model parameters (transition and emission
probabilities), given the data x.

s1 s2 s3 s4

x1 x2 x3 x4
Three Fundamental Questions for HMMs

1. Find p(x), the probability of the observed sequence.


2. Find the most likely state sequence s, given the data x.
3. Estimate the model parameters (transition and emission
probabilities), given the data x.

Methods:
1. Forward algorithm, backward algorithm (dynamic
programming, recursive)
2. Viterbi algorithm (dynamic programming, recursive)
3. Baum-Welch algorithm (a form of the EM algorithm
[Dempster-Laird-Rubin])
Finding the probability p(x)

Naive method:   p(x) = Σ_s p(x, s) = Σ_s p(x | s) p(s)

Each term is easy to compute (assuming the


transition and emission probabilities are known).

But note how many terms there are... For 10 possible states and 100 observations, the number of terms is 10^100, intractable even on the fastest supercomputer!
Finding the probability p(x): forward algorithm

Let f_n(k) = p(x_1, ..., x_n, s_n = k) and compute these recursively:

    f_{n+1}(l) = e_l(x_{n+1}) Σ_k f_n(k) q_kl

where the e's are emission probabilities and the q's are transition probabilities. Then sum over k to get p(x).
Approximate number of multiplication operations needed for 10 possible states and 100 observations:

  naive method: 2 x 10^102

  forward method: 2 x 10^4
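A sketch of the forward recursion for a hypothetical 2-state HMM with a 2-symbol alphabet (all probabilities below are made up):

```python
# Hypothetical 2-state HMM over the alphabet {"A", "B"}.
init = [0.5, 0.5]                       # start distribution over hidden states
q = [[0.9, 0.1],                        # transition probabilities q_kl
     [0.2, 0.8]]
e = [{"A": 0.7, "B": 0.3},              # emission probabilities e_k(x)
     {"A": 0.2, "B": 0.8}]

def forward(x):
    """Compute p(x) via the forward recursion, then sum over the last state."""
    f = [init[k] * e[k][x[0]] for k in range(2)]          # f_1(k)
    for obs in x[1:]:
        # f_{n+1}(l) = e_l(x_{n+1}) * sum_k f_n(k) * q_kl
        f = [e[l][obs] * sum(f[k] * q[k][l] for k in range(2))
             for l in range(2)]
    return sum(f)

print(forward("AABBA"))
```

The cost is linear in the sequence length (times the squared number of states), rather than exponential as in the naive sum over all state sequences.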
Social Networks: Geometric Features

Define three features:

  Edge ((n choose 2) possible)
  Triangle ((n choose 3) possible)
  Two-star (3 (n choose 3) possible)

These can overlap; e.g., a triangle contains three two-stars.

[Six example networks with their feature counts:
 53 edges, 1 triangle, 148 two-stars;   30 edges, 8 triangles, 58 two-stars;   53 edges, 2 triangles, 158 two-stars;
 27 edges, 6 triangles, 53 two-stars;   21 edges, 3 triangles, 66 two-stars;   27 edges, 2 triangles, 90 two-stars]
Exponential Random Graph Model

Idea: Make edge, triangle, two-star totals be sufficient statistics in an exponential family:

    p(G) ∝ exp( θ_edges (# edges) + θ_triangles (# triangles) + θ_two-stars (# two-stars) )

(Number of nodes presumed fixed.) More generally,

    p(G) ∝ exp( θᵀ x(G) )

To get = instead of ∝, we need a normalizing constant:

    p(G) = exp( θᵀ x(G) ) / c(θ)

The normalizing constant c(θ) is unknown! For 20 nodes, the sum involves 2^190 ≈ 10^57 terms...
MCMC for Generating Random Networks

Pick a random pair of nodes, and toggle whether there is an edge there.
This gives a uniform stationary distribution.

To sample from p_θ0 instead (Metropolis), pick a random pair of nodes, and try to toggle whether there is an edge there. The acceptance ratio has no dependence on c:

    p_θ0(G') / p_θ0(G) = [ exp(θ0ᵀ x(G')) / c(θ0) ] / [ exp(θ0ᵀ x(G)) / c(θ0) ] = exp( θ0ᵀ (x(G') − x(G)) )

Flip a coin with this probability of Heads (or one, if this exceeds one), and accept the toggle if Heads. This gives the desired stationary distribution on networks!

Likelihood Approximation: plug in an estimate ĉ(θ) to get p̂_θ(g). Optimize to estimate the MLE:

    θ̂_MLE = argmax_θ p̂_θ(g).
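The edge-toggle Metropolis step can be sketched as follows, for a hypothetical 10-node ERGM with edge and triangle terms only (the θ values are made up, and triangle counting is done naively for clarity):

```python
import math
import random
from itertools import combinations

n = 10
theta_edge, theta_tri = -1.0, 0.2       # made-up parameter values
rng = random.Random(0)

def stats(es):
    """Return (# edges, # triangles) of an edge set (a set of frozenset pairs)."""
    tris = sum(1 for a, b, c in combinations(range(n), 3)
               if {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))} <= es)
    return len(es), tris

edges = set()
for _ in range(2000):
    pair = frozenset(rng.sample(range(n), 2))     # random pair of nodes
    proposal = edges ^ {pair}                     # try to toggle that edge
    old_e, old_t = stats(edges)
    new_e, new_t = stats(proposal)
    # Acceptance probability exp(theta . (x(G') - x(G))), capped at 1;
    # the unknown constant c(theta) cancels in the ratio.
    log_ratio = theta_edge * (new_e - old_e) + theta_tri * (new_t - old_t)
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        edges = proposal

print(len(edges))
```

After burn-in, the visited graphs are (dependent) draws from the ERGM, so feature averages can be estimated without ever computing c(θ).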
Likelihood Surfaces: True and Estimated

[Two contour plots of the likelihood surface: True and Approximated]
Gibbs Sampler

Explore the space by updating one coordinate at a time. 2D parameter space version:

  Draw a new θ1 from the conditional distribution of θ1 | θ2
  Draw a new θ2 from the conditional distribution of θ2 | θ1
  Repeat
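For a bivariate normal with correlation ρ, each conditional is itself normal, θ1 | θ2 ~ N(ρ θ2, 1 − ρ²), so the two-step scan above can be sketched directly (ρ = 0.8 as in the Gelman et al figure; the starting point and run lengths are our own choices):

```python
import random
import statistics

rho = 0.8
sd = (1 - rho ** 2) ** 0.5              # conditional standard deviation

rng = random.Random(0)
t1, t2 = -4.0, 4.0                      # overdispersed starting point
samples = []
for step in range(20000):
    t1 = rng.gauss(rho * t2, sd)        # draw new theta1 from theta1 | theta2
    t2 = rng.gauss(rho * t1, sd)        # draw new theta2 from theta2 | theta1
    if step >= 10000:                   # keep the second half of the run
        samples.append((t1, t2))

means = [statistics.mean(p[0] for p in samples),
         statistics.mean(p[1] for p in samples)]
print(means)                            # both should be near 0, the true mean
```

Discarding the first half of the run mirrors the figure: the early iterations still remember the overdispersed start.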
Gibbs Sampler

Figure 11.3: Four independent sequences of the Gibbs sampler for a bivariate normal distribution with correlation ρ = 0.8, with overdispersed starting points indicated by solid squares. (a) First 10 iterations, showing the component-by-component updating of the Gibbs iterations. (b) After 500 iterations, the sequences have reached approximate convergence. Figure (c) shows the iterates from the second halves of the sequences.

Thus, each subvector θ_j is updated conditional on the latest values of the other components of θ, which are the iteration t values for the components already updated and the iteration t − 1 values for the others.

(Gelman et al, Bayesian Data Analysis)
Metropolis-Hastings Algorithm

Modify a Markov chain on a state space of interest to obtain


a new chain with any desired stationary distribution!

1. If Xn = i, propose a new state j using the transition probabilities pij of the


original Markov chain.

2. Compute an acceptance probability,

       a_ij = min( (s_j p_ji) / (s_i p_ij), 1 ).

3. Flip a coin that lands Heads with probability aij , independently of the Markov
chain.

4. If the coin lands Heads, accept the proposal and set Xn+1 = j. Otherwise, stay
in state i; set Xn+1 = i.

In other words, the modified Markov chain uses the original transition probabilities p_ij to propose where to go next, then accepts the proposal with probability a_ij, and otherwise stays in its current state.
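A sketch of these four steps on a tiny discrete example (the example is ours, not from the notes): target s = (0.5, 0.3, 0.2) with a uniform proposal, which is symmetric, so p_ji / p_ij = 1.

```python
import random
from collections import Counter

s = [0.5, 0.3, 0.2]                      # desired stationary distribution
rng = random.Random(1)

x = 0
counts = Counter()
steps = 100000
for _ in range(steps):
    j = rng.randrange(3)                 # step 1: propose j (p_ij = p_ji = 1/3)
    a = min(s[j] / s[x], 1.0)            # step 2: acceptance probability a_ij
    if rng.random() < a:                 # step 3: flip a coin with P(Heads) = a_ij
        x = j                            # step 4: Heads, accept: X_{n+1} = j
    counts[x] += 1                       # otherwise stay: X_{n+1} = i

print({k: counts[k] / steps for k in sorted(counts)})
```

The long-run state frequencies come out close to (0.5, 0.3, 0.2), even though the chain only ever evaluates ratios of the s values.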
Metropolis-Hastings Algorithm

Figure 11.2: Five independent sequences of a Markov chain simulation for the bivariate unit normal distribution, with overdispersed starting points indicated by solid squares. (a) After 50 iterations, the sequences are still far from convergence. (b) After 1000 iterations, the sequences are nearer to convergence. Figure (c) shows the iterates from the second halves of the sequences. The points in Figure (c) have been jittered so that steps in which the random walks stood still are not hidden. The simulation is a Metropolis algorithm described in the example on page 290.

The samples are drawn sequentially, with the distribution of the sampled draws depending on the last value drawn; hence, the draws form a Markov chain whose stationary distribution is the target posterior distribution, p(θ | y).

(Gelman et al, Bayesian Data Analysis)
Regression

Example: student performances in school

    y_i = β0 + β1 x_i + ε_i

But what about differences between schools?

    y_i = β0 + β1 x_i + β2 I_2i + β3 I_3i + ... + β_m I_mi + ε_i

Ugly...

    y_i = α_j[i] + x_i β1 + ε_i

But then what about school-level covariates? What about prior information?

Multilevel (Hierarchical) Models

    y_i = α_j[i] + x_i β1 + ε_i
    α_j ~ N(μ0, σ0²)

    y_i = α_j[i] + x_i β1 + ε_i
    α_j = γ0 + γ1 z_j + η_j
    η_j ~ N(0, σ1²)

Then use MCMC to study the joint posterior (density of all the parameters, given the data).
