
Bayesian inference in hidden Markov models

through reversible jump Markov chain Monte Carlo


Christian P. Robert*, Tobias Rydén†, and D.M. Titterington‡
CREST-INSEE, Paris, Lund University, and University of Glasgow

Abstract
Hidden Markov models form an extension of mixture models, providing a flexible class of models exhibiting dependence and a possibly large degree of variability. In this paper we show how reversible jump Markov chain Monte Carlo techniques can be used to estimate the parameters as well as the number of components of a hidden Markov model in a Bayesian framework. We employ a mixture of zero-mean normal distributions as our main example and apply this model to three sets of data from finance, meteorology and geomagnetism, respectively.

AMS 1991 classification. Primary 62M09. Secondary 62F15, 62P05.


Key words. Hidden Markov model, Bayesian inference, model selection, Markov chain Monte Carlo.

1 Introduction
Hidden Markov models (HMMs) have been used in many areas as convenient representations of weakly dependent heterogeneous phenomena; a few examples are econometrics (Hamilton, 1989; Chib, 1996; Krolzig, 1997; Billio et al., 1998), finance (Rydén et al., 1998), biology (Leroux and Puterman, 1992), genetics (Churchill, 1995), neurophysiology (Fredkin and Rice, 1992) and speech processing (Rabiner, 1989). We also refer to the recent monograph by MacDonald and Zucchini (1997).
In hidden Markov models, the heterogeneity in the data is represented by a mixture structure, that is, a pair $(z_t, y_t)$ with $z_t \in \{1, \ldots, k\}$ and $y_t \mid z_t \sim f_{z_t}(y_t)$. The $y_t$'s are assumed to be independent conditional on the $z_t$'s. The Markov nomenclature comes from assuming that $\{z_t\}$ is distributed as a (finite-state) Markov chain, and the hidden part is due to the fact that $\{z_t\}$ is not observed. Commonly, the conditional distributions $f_i$ belong to a single parametric family, such as the normal or Poisson families, so that $z_t$ corresponds to the parameter used to generate $y_t$.
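To make this setup concrete, the following short Python sketch (our own illustration, not taken from the paper) simulates such a model: a finite-state Markov chain $\{z_t\}$ and, conditionally on it, independent observations $y_t \sim N(0, \sigma_{z_t}^2)$. The function name and the example transition matrix are arbitrary choices.

```python
import numpy as np

def simulate_hmm(A, sigma, n, rng=None):
    """Simulate n observations from an HMM with transition matrix A and
    zero-mean normal emissions with standard deviations sigma."""
    rng = rng or np.random.default_rng(0)
    k = A.shape[0]
    z = np.empty(n, dtype=int)
    z[0] = 0                       # fixed initial state, for simplicity
    for t in range(1, n):
        z[t] = rng.choice(k, p=A[z[t - 1]])
    y = rng.normal(0.0, sigma[z])  # y_t | z_t ~ N(0, sigma_{z_t}^2)
    return z, y

A = np.array([[0.95, 0.05],
              [0.10, 0.90]])
z, y = simulate_hmm(A, np.array([1.0, 3.0]), n=500)
```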
Inference for HMMs was first considered by Baum and Petrie (1966), who treated the case when $y_t$ takes values in a finite set. Petrie (1969) provided identifiability conditions for HMMs,
* Laboratoire de Statistique, CREST, INSEE Timbre J340, 75675 Paris cedex 14, France
† Department of Mathematical Statistics, Lund University, Box 118, 221 00 Lund, Sweden
‡ Department of Statistics, University of Glasgow, Glasgow G12 8QQ, Scotland, U.K.

while Baum et al. (1970) considered maximum likelihood estimation further, implementing an early version of the EM algorithm as well as deriving a general proof of monotonicity for the algorithm. They also came up with a procedure which allows for the derivation of the conditional joint distribution of the $z_t$'s given the $y_t$'s and which has been used ever since (see Appendix A). Later, Leroux (1992) proved consistency of the MLE for general HMMs under mild conditions, while asymptotic normality has been proved by Bickel et al. (1998).
The above references apply to the estimation of HMM parameters when the number of values of $z_t$, $k$ say, is known. This paper is concerned with inference when $k$ is also unknown. The approach adopted here is to use Bayesian inference and implement it by the reversible jump Markov chain Monte Carlo techniques first proposed by Green (1995).
Non-Bayesian approaches to the problem include tests based on likelihood ratios and penalized likelihood methods. With the former method, we first postulate $H_0: k = 1$ as our null hypothesis and $H_1: k = 2$ as the alternative, compute the likelihood ratio and reject $H_0$ if this statistic is large enough. Standard theory would say that, if $H_0$ is true, the likelihood ratio is asymptotically distributed as a $\chi^2$ random variable. This result is not true for HMMs or mixtures, however, essentially because, if $H_0$ is true, then the parameters are not uniquely identified under $H_1$. For HMMs, the limiting distribution of the likelihood ratio cannot be computed numerically, but must be approximated by simulation. McLachlan (1987), in the context of mixtures, proposed to approximate it by parametric bootstrap, and Rydén et al. (1998) indeed applied this idea to HMMs. The approach requires enormous computational effort, however; even one likelihood ratio statistic is time-consuming to compute, and usually 50 or 100 bootstrap replications are simulated. Rydén et al. (1998) were not able to take the approach further than testing $k = 2$ vs. $k = 3$ because of the computational difficulties. Penalized likelihood methods such as the Akaike and Bayesian information criteria (AIC, BIC) are less demanding since simulated replications are not involved. On the other hand, they produce no number that quantifies the confidence in the result, such as a $p$-value. If this is desired, we are again confined to parametric bootstrap.
Bayesian inference through reversible jump Markov chain Monte Carlo methods is a viable alternative to frequentist analysis that both explores models with different values of $k$ and provides figures representing confidence in the different models. A most appealing feature is that this approach bypasses the traditional model choice structure, which requires prior modelling and testing for the different models. This alternative is based on standard hierarchical modelling and estimates $k$ as well as the posterior probabilities of the different models. Richardson and Green (1997) developed similar methods for mixture distribution analysis, and our work is in many respects based on, and a development of, their achievements. Our work also continues that of Robert et al. (1993), Chib (1996) and Robert and Titterington (1998) on Bayesian estimation for HMMs with a fixed number of components.
As in the case of mixtures, HMMs with an unknown number of components are appealing in two settings: first, a heterogeneous population with an unknown degree of heterogeneity $k$ may come under scrutiny and the value of $k$ matters for the subsequent analysis; secondly, the density of a dependent sample may be estimated and the HMM structure provides a parsimonious alternative to standard nonparametric estimates.
The paper is organised as follows: Section 2 details the prior modelling, Section 3 describes the steps in the reversible jump MCMC algorithm, since the Markovian structure together with the reversibility constraint leads to a higher level of complexity than in Richardson and Green (1997), while Section 4 examines the performance of this method on several datasets.

2 The Bayesian model
We choose to study mixtures of zero-mean normals, partly because we want to analyse data to which this model was previously fitted using likelihood techniques (see Section 4), and partly to provide some variety over the structure considered by Richardson and Green (1997). A wide range of other mixtures can be dealt with along similar lines. For a given $k$, the marginal distribution of an observation $y_t$ will thus be
$$\sum_{i=1}^{k} \pi_i \, N(0, \sigma_i^2),$$
where the $\pi_i$'s are the components of the stationary vector of the transition matrix of the hidden states, $A = (a_{ij})$, such that $P(z_{t+1} = j \mid z_t = i) = a_{ij}$. The set of $\sigma_i$'s is denoted by $\sigma$.
Along lines similar to Section 2.2 of Richardson and Green (1997), we assume that the joint density of all variables mentioned so far takes the form
$$p(k, A, z, \sigma, y) = p(k)\,p(A \mid k)\,p(z \mid A, k)\,p(\sigma \mid k)\,p(y \mid \sigma, z).$$
In particular, $k$ is a priori uniform on $\{1, 2, \ldots, k_{\max}\}$, where $k_{\max}$ is specified; given $k$ the rows of $A$ are independent with a Dirichlet distribution $\mathcal{D}(\delta, \ldots, \delta)$; and given $k$ the $\sigma_i$'s are uniform on $(0, \beta)$ but sorted in ascending order, so that the parameters and components are identifiable. In addition, we assume that $\beta$ has a negative exponential distribution with mean $1/\lambda$, with $1/\lambda = 30 \max_t |y_t|$. The reason for putting a hyperprior on $\beta$ is to make the posterior robust; this is not the case if $\beta$ is fixed, but it is with the current model, as we expand on further below. On the other hand, we chose a particular value for $\delta$, in fact $\delta = 1$. Thus, the complete joint density, corresponding to (6) in Richardson and Green (1997), is
$$p(\beta, k, A, z, \sigma, y) = p(\beta)\,p(k)\,p(A \mid k, \delta)\,p(z \mid A, k)\,p(\sigma \mid k, \beta)\,p(y \mid \sigma, z). \qquad (1)$$
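For illustration, a minimal sketch (not the authors' code) of drawing $(\beta, k, A, \sigma)$ from this prior could read as follows; it hard-codes the stated choices $1/\lambda = 30 \max_t |y_t|$ and $\delta = 1$ by default, and the argument names are our own.

```python
import numpy as np

def draw_prior(y, k_max=30, delta=1.0, rng=None):
    """Draw (beta, k, A, sigma) from the hierarchical prior of Section 2."""
    rng = rng or np.random.default_rng(0)
    lam_inv = 30.0 * np.max(np.abs(y))               # 1/lambda = 30 max|y_t|
    beta = rng.exponential(scale=lam_inv)            # beta ~ exponential, mean 1/lambda
    k = rng.integers(1, k_max + 1)                   # k uniform on {1, ..., k_max}
    A = rng.dirichlet(np.full(k, delta), size=k)     # each row of A ~ D(delta, ..., delta)
    sigma = np.sort(rng.uniform(0.0, beta, size=k))  # sigma_i ~ U(0, beta), sorted ascending
    return beta, k, A, sigma
```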

3 MCMC methodology
General theory and methodology of MCMC procedures is now fairly standard; see for instance Tier-
ney (1994) and Gilks et al. (1996). For the particular approach used here, we are clearly especially
indebted to the formative work on reversible jump MCMC in Green (1995) and Richardson and
Green (1997); for further background references, the latter paper and its discussion are particularly
valuable. We simply recall here that the reversible jump techniques were proposed by Green (1995)
to overcome the measure theoretic diculties in implementing standard MCMC algorithms in prob-
lems with varying dimension parameters. The fundamental idea behind Green's (1995) technique
is to impose a one-to-one transition in steps when the dimension of the space varies (so that the
dominating measure can be chosen positive). In a mixture set-up, the moves where the dimension
changes can be restricted to so-called split and merge moves, where a component is broken into
two components or, conversely, when two components are reunited in a single component.
Inferences are based on a sample y1 ; : : : ; yn , that is the sample size is n. Our objective is to
generate realizations from the conditional joint density p( ; k; A; z; jy) de ned through (1), and
the detailed description of one sweep of our MCMC procedure is as follows:
(a) update the transition probability matrix $A$;
(b) update the standard deviations $\sigma$;
(c) update the allocations $z$;
(d) update the hyperparameter $\beta$;
(e) consider splitting a component into two or combining two into one;
(f) consider the birth or death of an empty component.
3.1 Gibbs moves
Move types (a)-(d) are Gibbs moves, and (a) and (c) follow Robert et al. (1993). In (a), the $i$-th row of $A$ is sampled from a Dirichlet distribution $\mathcal{D}(\delta + n_{i1}, \ldots, \delta + n_{ik})$, where $n_{ij} = \sum_{t=1}^{n-1} I\{z_t = i, z_{t+1} = j\}$ is the number of jumps from component $i$ to component $j$; $I\{\cdot\}$ denotes an indicator function. In (b), each $\sigma_i^{-2}$ is sampled from a truncated Gamma distribution with density proportional to
$$u^{(n_i - 3)/2} e^{-(s_i^2/2)u}\, I\{u \geq 1/\beta^2\}, \qquad (2)$$
where $n_i = \sum_{t=1}^{n} I\{z_t = i\}$ and $s_i^2 = \sum_{t=1}^{n} I\{z_t = i\}\, y_t^2$ are respectively the number of observations allocated to component $i$ and the corresponding sum of squares. Sampling from (2) was carried out by simply rejecting Gamma random variates until the condition $u \geq 1/\beta^2$ was fulfilled, except when $n_i = 0$ or $1$; more sophisticated approaches include the slice sampler of Damien and Walker (1996). After sampling each $\sigma_i$, the new set of $\sigma_i$'s was accepted if they were ascending; otherwise the update was rejected. In (c), $z_1, \ldots, z_n$ were resampled one at a time from $t = 1$ to $t = n$, with conditional probabilities given by
$$P(z_t = i \mid \cdots) \propto a_{z_{t-1}\, i}\, \varphi(y_t; \sigma_i)\, a_{i\, z_{t+1}}, \qquad (3)$$
where $\varphi(\cdot; \sigma)$ is the density of a normal random variable with zero mean and standard deviation $\sigma$; for $t = 1$, the first factor is replaced by the stationary probability $\pi_i$, and, for $t = n$, the last factor is replaced by unity. In (d), $\beta$ was sampled from a distribution with density proportional to
$$u^{-k} e^{-\lambda u}\, I\{u \geq \max_i \sigma_i\}. \qquad (4)$$
Sampling from this density was carried out using the method of Damien and Walker (1996).
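As an illustration of the rejection step in move (b), a minimal Python sketch (our own, not the authors' implementation) is given below; the handling of components with $n_i \leq 1$ and of failed rejections is a simplification, and the function name is ours. The density in (2) corresponds, in $u = \sigma_i^{-2}$, to a Gamma$((n_i - 1)/2,\, s_i^2/2)$ distribution truncated to $u \geq 1/\beta^2$.

```python
import numpy as np

def update_sigma(z, y, beta, k, rng, max_tries=1000):
    """One Gibbs update of the standard deviations, move (b).

    Each sigma_i^{-2} is sampled from a Gamma((n_i - 1)/2, rate = s_i^2/2)
    density truncated to u >= 1/beta^2 by plain rejection, as in the paper;
    the whole update is discarded unless the new sigmas are ascending."""
    sigma_new = np.empty(k)
    for i in range(k):
        n_i = np.sum(z == i)
        s2_i = np.sum(y[z == i] ** 2)
        if n_i <= 1:
            return None  # degenerate case; the paper uses a different sampler here
        for _ in range(max_tries):
            u = rng.gamma(shape=(n_i - 1) / 2.0, scale=2.0 / s2_i)
            if u >= 1.0 / beta ** 2:
                sigma_new[i] = u ** -0.5
                break
        else:
            return None  # rejection sampling failed; keep current values
    # enforce the identifiability (ordering) constraint
    return sigma_new if np.all(np.diff(sigma_new) > 0) else None
```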
3.2 Split and combine moves
Moves (e) and (f) are more involved Metropolis-Hastings steps and allow for increasing or decreasing the number of components by one. In (e), we choose to split with probability $b_k$ and to combine with probability $d_k = 1 - b_k$. Naturally, $d_1 = b_{k_{\max}} = 0$, and we used $b_k = d_k = 1/2$ for $k = 2, 3, \ldots, k_{\max} - 1$. To describe the combine move, suppose that the current state of the MCMC algorithm is $\tilde{x}$, with parameters $\tilde{a}_{ij}$, etc., and that this representation has $k + 1$ components. We randomly select a pair $(j_1, j_2)$ of adjacent components and try to combine them into a single new component $j^*$, thus creating a new state $x$ with $k$ components. This is done in several steps. First, for any $t$ with $\tilde{z}_t$ equal to $j_1$ or $j_2$, $z_t$ is set to $j^*$, while the remaining $\tilde{z}_t$'s are simply copied. Secondly, the standard deviation of the new state is given by
$$\sigma_{j^*}^2 = \frac{\tilde{\pi}_{j_1} \tilde{\sigma}_{j_1}^2 + \tilde{\pi}_{j_2} \tilde{\sigma}_{j_2}^2}{\tilde{\pi}_{j_1} + \tilde{\pi}_{j_2}}, \qquad (5)$$
while the remaining $\tilde{\sigma}_i$'s are copied. Here, $\tilde{\pi}_j$ is the stationary probability of the hidden chain $\tilde{z}$ being in state $j$, so that the vector $\tilde{\pi}$ satisfies $\tilde{\pi} \tilde{A} = \tilde{\pi}$. Thirdly, the transition probabilities from and to the components involved in the move are set as
$$a_{j^* j} = \frac{\tilde{\pi}_{j_1}}{\tilde{\pi}_{j_1} + \tilde{\pi}_{j_2}}\, \tilde{a}_{j_1 j} + \frac{\tilde{\pi}_{j_2}}{\tilde{\pi}_{j_1} + \tilde{\pi}_{j_2}}\, \tilde{a}_{j_2 j} \quad \text{for } j \neq j^*, \qquad a_{i j^*} = \tilde{a}_{i j_1} + \tilde{a}_{i j_2} \quad \text{for } i \neq j^*. \qquad (6)$$
The remaining $a_{ij}$'s are obtained by equating row sums to one or by copying. Note that the expression for $a_{j^* j}$ in (6) is the conditional probability of jumping to component $j$, given that the current component is either $j_1$ or $j_2$. Note also that the new transition probability matrix $A$ has essentially the same stationary probabilities as does $\tilde{A}$; one can readily verify that $\pi_j = \tilde{\pi}_j$ for $j \neq j^*$ and $\pi_{j^*} = \tilde{\pi}_{j_1} + \tilde{\pi}_{j_2}$. This in turn implies that the new hidden Markov model $(A, \sigma)$ has the same second moment as does $(\tilde{A}, \tilde{\sigma})$: $\sum_j \pi_j \sigma_j^2 = \sum_j \tilde{\pi}_j \tilde{\sigma}_j^2$. Obviously, the first moments of the HMMs we consider are identically zero.
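A sketch of the combine step (5)-(6) in code (our own illustration, not the authors' implementation; the stationary vector is obtained by a least-squares solve, and the merged component is assumed to take the slot of the smaller of the two adjacent indices, so that the ordering of the $\sigma_i$'s is preserved):

```python
import numpy as np

def stationary_dist(A):
    """Stationary vector pi of a transition matrix A: pi A = pi, sum(pi) = 1."""
    k = A.shape[0]
    M = np.vstack([A.T - np.eye(k), np.ones(k)])
    b = np.zeros(k + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(M, b, rcond=None)
    return pi

def combine(A_t, sigma_t, j1):
    """Merge the adjacent components j1 and j1 + 1 of (A_t, sigma_t),
    following equations (5) and (6)."""
    j2 = j1 + 1
    k1 = A_t.shape[0]                      # current number of components (k + 1)
    pi_t = stationary_dist(A_t)
    w1 = pi_t[j1] / (pi_t[j1] + pi_t[j2])  # weight pi~_{j1} / (pi~_{j1} + pi~_{j2})
    old = [i for i in range(k1) if i != j2]
    A = A_t[np.ix_(old, old)].copy()
    # transitions into the merged component: sum of the two old columns, eq. (6)
    A[:, j1] = A_t[old, j1] + A_t[old, j2]
    # transitions out of it: pi-weighted average of the two old rows, eq. (6)
    A[j1, :] = w1 * A_t[j1, old] + (1 - w1) * A_t[j2, old]
    A[j1, j1] = (w1 * (A_t[j1, j1] + A_t[j1, j2])
                 + (1 - w1) * (A_t[j2, j1] + A_t[j2, j2]))  # row sums to one
    sigma = sigma_t[old].copy()
    # merged variance: weighted average of the two old variances, eq. (5)
    sigma[j1] = np.sqrt(w1 * sigma_t[j1] ** 2 + (1 - w1) * sigma_t[j2] ** 2)
    return A, sigma
```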
In the split move, a component $j^*$ is randomly chosen and split into two new ones, $j_1$ and $j_2$. In describing this, we assume that the old representation is $x$, with $k$ components, and that the new one is $\tilde{x}$, with $k + 1$ components. In designing the split move, our starting point was the following result.
Theorem. Let $0 < u_0, u_1 < 1$. Assume that the new transition probabilities are given by
$$\tilde{a}_{j_1 j} = a_{j^* j}, \quad \tilde{a}_{j_2 j} = a_{j^* j} \quad \text{for } j \neq j_1, j_2,$$
$$\tilde{a}_{i j_1} = u_0\, a_{i j^*}, \quad \tilde{a}_{i j_2} = (1 - u_0)\, a_{i j^*} \quad \text{for } i \neq j_1, j_2, \qquad (7)$$
$$\tilde{a}_{j_1 j_1} = (1 - u_1)\, a_{j^* j^*}, \quad \tilde{a}_{j_1 j_2} = u_1\, a_{j^* j^*},$$
and that $\tilde{a}_{j_2 j_1}$ and $\tilde{a}_{j_2 j_2}$ are taken so that the new stationary probabilities are $\tilde{\pi}_j = \pi_j$ for $j \neq j_1, j_2$, $\tilde{\pi}_{j_1} = u_0 \pi_{j^*}$ and $\tilde{\pi}_{j_2} = (1 - u_0) \pi_{j^*}$; the remaining $\tilde{a}_{ij}$'s are obtained by copying. It is assumed that the pair $(u_0, u_1)$ is chosen so that all $\tilde{a}_{ij}$'s are non-negative.
Then if $\nu$ is an eigenvalue of $A$ with right eigenvector $h$, $\nu$ is also an eigenvalue of $\tilde{A}$ with right eigenvector $\tilde{h}$ given by $\tilde{h}_j = h_j$ for $j \neq j_1, j_2$ and $\tilde{h}_{j_1} = \tilde{h}_{j_2} = h_{j^*}$. In addition, $(1 - u_1 - u_0)/(1 - u_0) \cdot a_{j^* j^*}$ is an eigenvalue of $\tilde{A}$ with right eigenvector $\tilde{h}$ given by $\tilde{h}_j = 0$ for $j \neq j_1, j_2$, $\tilde{h}_{j_1} = 1 - u_0$ and $\tilde{h}_{j_2} = -u_0$. Finally, it holds that $(A^m)_{ij} = (\tilde{A}^m)_{ij}$ for each $m > 0$ and $i, j \neq j^*, j_1, j_2$. $\Box$
The theorem shows that, by splitting the transition probabilities as in (7), the dynamics of the hidden Markov chain are essentially preserved; the difference is found in the dynamics within the new components $j_1$ and $j_2$. The variable $u_0$ determines how much of the stationary probability for $j^*$ is assigned to the new component $j_1$, and $u_1$ determines the dynamics within the two new components. Since this way of splitting the component $j^*$ into two is essentially deterministic, it cannot be used in a reversible jump MCMC algorithm. However, we can take it as a guideline for designing a more random split move.
Our aim is to split $j^*$ in such a way that the stationary probabilities of the hidden chain are preserved as above, that is, $\tilde{\pi}_j = \pi_j$ for $j \neq j_1, j_2$, $\tilde{\pi}_{j_1} = u_0 \pi_{j^*}$ and $\tilde{\pi}_{j_2} = (1 - u_0) \pi_{j^*}$. We can accomplish this as follows. Let $u_0 \sim \mathrm{Be}(2, 2)$, $u_j \sim \mathrm{Be}(r, s)$ for each $j \neq j_1, j_2$, and $v_i \sim \mathrm{Be}(r, s)$ for each $i \neq j_1, j_2$. The shape parameters $r$ and $s$ are given below. Then set $\tilde{a}_{ij} = a_{ij}$ for $i, j \neq j_1, j_2$, set
$$\tilde{a}_{j_1 j} = \frac{u_j}{u_0}\, a_{j^* j}, \qquad \tilde{a}_{j_2 j} = \frac{1 - u_j}{1 - u_0}\, a_{j^* j} \qquad \text{for } j \neq j_1, j_2,$$
$$\tilde{a}_{i j_1} = v_i\, a_{i j^*}, \qquad \tilde{a}_{i j_2} = (1 - v_i)\, a_{i j^*} \qquad \text{for } i \neq j_1, j_2, \qquad (8)$$
$$\tilde{a}_{j_1 j_2} = u_1 \Bigl(1 - \sum_{j \neq j^*} \frac{u_j}{u_0}\, a_{j^* j}\Bigr),$$
$$\tilde{a}_{j_2 j_1} = \Bigl\{(1 - u_1) \sum_{j \neq j^*} u_j\, a_{j^* j} + u_0 u_1 - \sum_{i \neq j^*} \rho_i v_i\, a_{i j^*}\Bigr\} \Big/ (1 - u_0),$$
where $\rho_i = \pi_i / \pi_{j^*}$, and set $\tilde{a}_{j_1 j_1}$ and $\tilde{a}_{j_2 j_2}$ to make rows sum to unity. Since $u_0 \tilde{a}_{j_1 j} + (1 - u_0) \tilde{a}_{j_2 j} = a_{j^* j}$ for $j \neq j_1, j_2$, the equilibrium equation $\tilde{\pi}_j = \sum_i \tilde{\pi}_i \tilde{a}_{ij}$ holds for $j \neq j_1, j_2$. The expressions for $\tilde{a}_{j_1 j_2}$ and $\tilde{a}_{j_2 j_1}$ ensure that it holds also for $j = j_1, j_2$.
The shape parameters $r$ and $s$ are taken as
$$r = \frac{1 - u_0(1 + c^2)}{c^2}, \quad s = r\, \frac{1 - u_0}{u_0} \quad \text{if } u_0 \leq 1/2; \qquad s = \frac{1 - (1 - u_0)(1 + c^2)}{c^2}, \quad r = s\, \frac{u_0}{1 - u_0} \quad \text{if } u_0 > 1/2. \qquad (9)$$
This produces a Beta distribution with mean $u_0$ and, if $u_0 \leq 1/2$, squared coefficient of variation $c^2$; if $u_0 > 1/2$ the squared coefficient of variation is not $c^2$, but rather the distribution is a mirrored (around $x = 1/2$) version of the distribution obtained for $1 - u_0 \leq 1/2$. For our numerical results, we used $c^2 = 0.5$. Note that, given $u_0$, the conditional means of $\tilde{a}_{j_1 j}$ and $\tilde{a}_{j_2 j}$, $j \neq j_1, j_2$, are $a_{j^* j}$, and the conditional means of $\tilde{a}_{i j_1}$ and $\tilde{a}_{i j_2}$, $i \neq j_1, j_2$, are $u_0 a_{i j^*}$ and $(1 - u_0) a_{i j^*}$ respectively. This is all consistent with the theorem above.
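In code, (9) amounts to the following small helper (our own sketch; for example, $u_0 = 0.3$ and $c^2 = 0.5$ give $r = 1.1$ and $s \approx 2.57$).

```python
def beta_shape_params(u0, c2=0.5):
    """Shape parameters (r, s) of the Be(r, s) proposal with mean u0 and,
    for u0 <= 1/2, squared coefficient of variation c2; see equation (9)."""
    if u0 <= 0.5:
        r = (1.0 - u0 * (1.0 + c2)) / c2
        s = r * (1.0 - u0) / u0
    else:  # mirrored case
        s = (1.0 - (1.0 - u0) * (1.0 + c2)) / c2
        r = s * u0 / (1.0 - u0)
    return r, s
```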
We now discuss $u_1$. Given $u_0$, the $u_j$'s and the $v_i$'s, the valid range for $u_1$, that is, the range in which the resulting $\tilde{A}$ is stochastic, is $[u_1^L, u_1^U]$, with
$$u_1^L = \max\Biggl\{1 - \frac{1 - \sum_{i \neq j^*} (\rho_i/u_0)\, \tilde{a}_{i j_1}}{1 - \sum_{j \neq j^*} \tilde{a}_{j_1 j}},\; 0\Biggr\},$$
$$u_1^U = \min\Biggl\{1 - \frac{1 - \sum_{i \neq j^*} (\rho_i/u_0)\, \tilde{a}_{i j_1} - \bigl((1 - u_0)/u_0\bigr)\bigl(1 - \sum_{j \neq j^*} \tilde{a}_{j_2 j}\bigr)}{1 - \sum_{j \neq j^*} \tilde{a}_{j_1 j}},\; 1\Biggr\}.$$
It may happen that $u_1^L > u_1^U$, in which case there is no valid $u_1$ and the split move is rejected. If $u_1^L < u_1^U$, we draw $u_1$ as, symbolically written, $u_1 \sim u_1^L + (u_1^U - u_1^L)\, \mathrm{Be}(1, 1)$.
The new standard deviations are created as $\tilde{\sigma}_j = \sigma_j$ for $j \neq j_1, j_2$ and
$$\tilde{\sigma}_{j_1}^2 = \sigma_{j^*}^2 \Bigl(1 - w \sqrt{\frac{1 - u_0}{u_0}}\Bigr), \qquad \tilde{\sigma}_{j_2}^2 = \sigma_{j^*}^2 \Bigl(1 + w \sqrt{\frac{u_0}{1 - u_0}}\Bigr). \qquad (10)$$
Here the range for $w$ in which the $\tilde{\sigma}_i$'s remain properly sorted is $[0, w^U]$, where
$$w^U = \min\Biggl\{\frac{h_1}{\sigma_{j^*}^2} \sqrt{\frac{u_0}{1 - u_0}},\; \frac{h_2}{\sigma_{j^*}^2} \sqrt{\frac{1 - u_0}{u_0}}\Biggr\}, \qquad (11)$$
$h_1 = \sigma_{j^*}^2 - \sigma_{j^*-1}^2$ if $j^* > 1$ and $h_1 = \sigma_{j^*}^2$ otherwise, and $h_2 = \sigma_{j^*+1}^2 - \sigma_{j^*}^2$ if $j^* < k$ and $h_2 = \beta^2 - \sigma_{j^*}^2$ otherwise. We draw $w$ as $w \sim w^U\, \mathrm{Be}(1, 1)$. All the above Beta variables are independent.
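A small sketch of proposing the new standard deviations via (10)-(11) (our own code, using 0-based component indices; sigma is assumed sorted in ascending order and beta is the upper end of its prior range):

```python
import numpy as np

def split_sigmas(sigma, j_star, u0, beta, rng):
    """Propose (sigma~_{j1}, sigma~_{j2}) from component j_star, eqs. (10)-(11)."""
    s2 = sigma[j_star] ** 2
    h1 = s2 - sigma[j_star - 1] ** 2 if j_star > 0 else s2
    h2 = (sigma[j_star + 1] ** 2 - s2 if j_star < len(sigma) - 1
          else beta ** 2 - s2)
    # largest w keeping the enlarged set of variances properly sorted, eq. (11)
    wU = min(h1 / s2 * np.sqrt(u0 / (1 - u0)),
             h2 / s2 * np.sqrt((1 - u0) / u0))
    w = wU * rng.beta(1.0, 1.0)                      # w ~ wU * Be(1, 1)
    var_j1 = s2 * (1 - w * np.sqrt((1 - u0) / u0))   # eq. (10)
    var_j2 = s2 * (1 + w * np.sqrt(u0 / (1 - u0)))
    return np.sqrt(var_j1), np.sqrt(var_j2), w, wU
```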
Finally we discuss how the hidden chain $z$ is split, giving $\tilde{z}$. The $z_t$'s different from $j^*$ are simply copied, but those equal to $j^*$ should be relabelled as $j_1$ or $j_2$. Assume that $z_t = j^*$ for $t_1 \leq t \leq t_2$, with $z_{t_1 - 1} \neq j^*$ and $z_{t_2 + 1} \neq j^*$. We then sample $\tilde{z}_{t_1}, \ldots, \tilde{z}_{t_2}$ from the conditional distribution of this vector given the remaining $\tilde{z}_t$'s, the $y_t$'s, and the proposed new $\tilde{a}_{ij}$'s and $\tilde{\sigma}_i$'s. To do this, we employ a restricted backward algorithm described in Appendix A. Note that if we first split the state $x$ as in (8) and (10), giving $\tilde{x}$, and then combine the components again as in (5) and (6), we recover $x$. This makes the split/combine pair reversible. Also, note that, from (5) and (6), we can compute the corresponding values of $u_0$, $u_1$, the $u_j$'s, the $v_i$'s and $w$ in the combine move.
The acceptance probabilities for the split and combine moves are $\min(1, R)$ and $\min(1, R^{-1})$, respectively, where
$$R = \frac{p(y \mid \tilde{z}, \tilde{\sigma})}{p(y \mid z, \sigma)} \times \frac{p(k+1)}{p(k)} \times \frac{\Bigl\{\dfrac{\Gamma[(k+1)\delta]}{\Gamma(\delta)^{k+1}}\Bigr\}^{k+1} \prod_{i=1}^{k+1}\prod_{j=1}^{k+1} \tilde{a}_{ij}^{\,\delta - 1}}{\Bigl\{\dfrac{\Gamma(k\delta)}{\Gamma(\delta)^{k}}\Bigr\}^{k} \prod_{i=1}^{k}\prod_{j=1}^{k} a_{ij}^{\,\delta - 1}} \times \frac{p(\tilde{z} \mid \tilde{A})}{p(z \mid A)} \times \frac{(k+1)!\, \prod_{i=1}^{k+1} \beta^{-1} I\{\tilde{\sigma}_i \leq \beta\}}{k!\, \prod_{i=1}^{k} \beta^{-1} I\{\sigma_i \leq \beta\}} \times \frac{d_{k+1}}{b_k P_{\mathrm{alloc}}}$$
$$\times \Biggl(g_{2,2}(u_0) \prod_{j} g_{r,s}(u_j) \prod_{i} g_{r,s}(v_i)\; \frac{1}{u_1^U - u_1^L}\, g_{1,1}\Bigl(\frac{u_1 - u_1^L}{u_1^U - u_1^L}\Bigr)\; \frac{1}{w^U}\, g_{1,1}\Bigl(\frac{w}{w^U}\Bigr)\Biggr)^{-1} \times J,$$
where $g_{r,s}$ is the $\mathrm{Be}(r, s)$ density and $P_{\mathrm{alloc}}$ is the probability of making the particular allocation of the $\tilde{z}_t$'s that was made. Furthermore, $J$ is the Jacobian determinant of the transformation given by (8) and (10). This determinant, which is partly evaluated numerically, is further discussed in Appendix B. The conditional densities involved in $R$ are simple products: $p(y \mid z, \sigma) = \prod_{t=1}^{n} \varphi(y_t; \sigma_{z_t})$, $p(z \mid A) = \pi_{z_1} \prod_{t=1}^{n-1} a_{z_t z_{t+1}}$, etc. When $\delta = 1$, which is the value we used for the numerical results below, the expression for $R$ simplifies to
$$R = \frac{p(y \mid \tilde{z}, \tilde{\sigma})}{p(y \mid z, \sigma)} \times \frac{p(k+1)}{p(k)} \times k^k\, k! \times \frac{p(\tilde{z} \mid \tilde{A})}{p(z \mid A)} \times (k+1)\, \beta^{-1} \times \frac{d_{k+1}}{b_k P_{\mathrm{alloc}}}$$
$$\times \Biggl(g_{2,2}(u_0) \prod_{j} g_{r,s}(u_j) \prod_{i} g_{r,s}(v_i)\; \frac{1}{u_1^U - u_1^L}\, g_{1,1}\Bigl(\frac{u_1 - u_1^L}{u_1^U - u_1^L}\Bigr)\; \frac{1}{w^U}\, g_{1,1}\Bigl(\frac{w}{w^U}\Bigr)\Biggr)^{-1} \times J. \qquad (12)$$
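The two likelihood-type ratios entering $R$ are simple products; a sketch of the corresponding log-densities (our own helper functions, names and signatures assumed; the stationary vector pi is supplied by the caller, e.g. computed as in the Section 3.2 sketch):

```python
import numpy as np

def log_p_y_given_z_sigma(y, z, sigma):
    """log p(y | z, sigma) = sum_t log phi(y_t; sigma_{z_t})."""
    s = sigma[z]
    return np.sum(-0.5 * np.log(2 * np.pi * s ** 2) - 0.5 * (y / s) ** 2)

def log_p_z_given_A(z, A, pi):
    """log p(z | A) = log pi_{z_1} + sum_t log a_{z_t, z_{t+1}},
    with pi the stationary vector of A."""
    return np.log(pi[z[0]]) + np.sum(np.log(A[z[:-1], z[1:]]))
```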

3.3 Birth and death moves


We now proceed to the birth and death moves. In (f), we first choose at random between birth and death with probabilities $b_k$ and $d_k$, respectively. The death move is accomplished by selecting a component at random among the empty ones and deleting it. The remaining rows of $A$ are then renormalised and the $z$'s are unaltered.
In the birth move, we start with a model with $k$ components and we want to create a new empty component $j^*$. To do this, we draw the $j^*$-th row $\tilde{a}_{j^* \cdot} = u$ of the new transition probability matrix from the prior $\mathcal{D}(\delta, \ldots, \delta)$. Furthermore, we draw $v_i$, $i \neq j^*$, from $\mathrm{Be}(1, k)$ and, for $i \neq j^*$, set
$$\tilde{a}_{ij} = (1 - v_i)\, a_{ij} \quad \text{for } j \neq j^*, \qquad \tilde{a}_{i j^*} = v_i. \qquad (13)$$
The new standard deviation $\tilde{\sigma}_{j^*}$ is drawn from the uniform prior $U(0, \beta)$, the remaining $\tilde{\sigma}_i$'s are copied, and, again, $\tilde{z} = z$ because the new component is empty. As for the split/combine pair, if we first create a new component using the birth move and then delete it through the death move, we recover the original state. The acceptance probabilities for the birth and death moves are $\min(1, R)$ and $\min(1, R^{-1})$, respectively, where
$$R = \frac{p(k+1)}{p(k)} \times \frac{\Bigl\{\dfrac{\Gamma[(k+1)\delta]}{\Gamma(\delta)^{k+1}}\Bigr\}^{k+1} \prod_{i=1}^{k+1}\prod_{j=1}^{k+1} \tilde{a}_{ij}^{\,\delta - 1}}{\Bigl\{\dfrac{\Gamma(k\delta)}{\Gamma(\delta)^{k}}\Bigr\}^{k} \prod_{i=1}^{k}\prod_{j=1}^{k} a_{ij}^{\,\delta - 1}} \times \frac{p(\tilde{z} \mid \tilde{A})}{p(z \mid A)} \times \frac{(k+1)!\, \prod_{i=1}^{k+1} \beta^{-1} I\{\tilde{\sigma}_i \leq \beta\}}{k!\, \prod_{i=1}^{k} \beta^{-1} I\{\sigma_i \leq \beta\}}$$
$$\times \frac{d_{k+1}}{b_k (k_0 + 1)} \times \Bigl(p_{\mathcal{D}}(u) \prod_{i} g_{1,k}(v_i)\; \beta^{-1}\Bigr)^{-1} \times J,$$
where $k_0$ is the number of empty components before the birth and $p_{\mathcal{D}}(u)$ is the $\mathcal{D}(\delta, \ldots, \delta)$ density. Furthermore, $J$ is the Jacobian determinant of the transformation from $(a_{ij}, u_j)$, where $i, j \neq j^*$, to $(\tilde{a}_{ij}, \tilde{a}_{j^* j})$, where $j \neq j^*$. Note that we can omit the index $j^*$ because the row sums of $A$ and $\tilde{A}$ equal unity. Since the new row is drawn from the prior, cancellations occur in the expression for $R$, and when $\delta = 1$ it simplifies to
$$R = \frac{p(k+1)}{p(k)} \times k^k \times \frac{p(\tilde{z} \mid \tilde{A})}{p(z \mid A)} \times (k+1) \times \frac{d_{k+1}}{b_k (k_0 + 1)} \times \Bigl(\prod_{i} g_{1,k}(v_i)\Bigr)^{-1} \times J. \qquad (14)$$
Finally,
$$J = \prod_{i \neq j^*} (1 - v_i)^{k-1}; \qquad (15)$$
this expression corresponds to the last factor of (12) in Richardson and Green (1997), but there is a misprint in their formula, in that the exponent $k$ should be $k - 1$; see Richardson and Green (1998). For the death move, the corresponding $u$ is given by the deleted row of $\tilde{A}$ and the $v_i$'s are given by (13).
That the sampler behaves as desired, in terms of converging to a realization from the joint distribution in (1), follows by arguments similar to those used at the end of Section 3.2 in Richardson and Green (1997) to establish aperiodicity and irreducibility. For instance, irreducibility obtains because the chain can move among the possible values of $k$ by increasing or decreasing its value by one with positive probability, in step (c) all allocations $z$ have positive probability, in step (a) the conditional density is positive on the natural parameter space, and the same holds true for steps (b) and (d) if we consider two consecutive sweeps.

4 Applications
4.1 Description of examples
To investigate the behaviour and performance of the reversible jump MCMC algorithm, we applied it to three different sets of data. The first set is an extract from the S&P 500 stock index. It consists of 1700 observations of daily returns during the 1950's. This series was previously analysed in Rydén et al. (1998); it is the subseries called series E in that paper. We also refer readers to this paper for more information about the data. Our second set of data is a series of 500 hourly wind measurements at Athens in January 1990. It has previously been analysed by Francq and Roussignol (1997) using a model almost identical to ours. The difference is that their hidden Markov chain $z$ cannot jump arbitrarily between states; rather, there is a `base state' that communicates with all other states, but any other jumps are disallowed. Our third set of data consists of 2700 residuals from a fit of an ARMA model to three-hour planetary geomagnetic activity index measurements. This dataset was also analysed by Francq and Roussignol (1997), using a restricted model as above, and we refer to that paper for further information. The datasets are plotted in Figures 2-4. The wind data are discretized in steps of 0.1 and contain 56 observations that equal zero. The likelihood is then unbounded; indeed, fixing $k > 1$, $A$ and all $\sigma_i$'s but one, and letting the remaining $\sigma_i$ tend to zero, we see that the likelihood grows indefinitely. This of course makes unrestricted maximum-likelihood estimation inappropriate, and it also has a substantial effect on the result of a Bayesian analysis. Therefore, before using the data, we added a uniform $(-0.05, 0.05)$ random variable (all independent) to each observation.

4.2 Some results
For each set of data our results are based on 1,000,000 sweeps of the MCMC algorithm after a burn-in of 100,000 sweeps. Table 1 shows the posterior distribution of the number of components $k$. For the S&P 500 series, there is an almost dead heat between $k = 2$ and $k = 3$. Rydén et al. (1998) tested the null hypothesis $k = 2$ vs. $k = 3$ using a bootstrapped likelihood ratio test, and obtained the simulated $p$-value 0.57. Even if there is no exact correspondence between this figure and the present results, they indicate a similar degree of belief in $k = 2$. Figure 1 shows density plots of the loglikelihood values sampled at every 10th sweep. There is a clear difference between $k = 2$ and $k = 3$, after which the curves essentially overlap. We also note that the parameter posterior means given $k = 2$ are $a_{12} = 0.044$, $a_{21} = 0.083$, $\sigma_1 = 0.0046$ and $\sigma_2 = 0.0093$, which are close to the MLEs 0.037, 0.069, 0.0046 and 0.0092, respectively, found in Rydén et al. (1998).
For the wind data, Table 1 suggests $k = 3$. Francq and Roussignol (1997) proposed $k = 2$ but, as noted above, they used a somewhat different model. Also, they did not preprocess the data and do not discuss the problem of unbounded likelihoods. The loglikelihood obtained by their MLE is about $-689$ for both the original and the preprocessed data, which can be compared to the loglikelihood curves in Figure 1. This figure shows a clear distinction between $k = 2$ and $k = 3$, but also a distinction when going to $k = 4$. For $k = 5$, however, the curve overlaps.
For the magnetic dataset, Table 1 again suggests $k = 3$ components. Francq and Roussignol (1997) proposed $k = 2$, with their MLE yielding a loglikelihood of about $-6101$. In Figure 1, curves for $k = 2$ and $k = 5$ are not plotted because there were very few sweeps with such $k$.
One of the observations that Rydén et al. (1998) made was that the dependence structure of the S&P 500 subseries was difficult to capture with maximum-likelihood estimation. First notice that the HMMs we consider are uncorrelated. Therefore, Rydén et al. (1998) rather looked at the covariance function of the absolute values of the series and found that it was quite slowly decaying in the S&P 500 data. However, this property was not present to the same extent in the estimated models. Figures 2-4 show standard non-parametric estimates of the covariance functions of $\{|y_t|\}$, together with a corresponding Bayesian estimate. By Bayesian estimate we mean that in each sweep the covariance function of the current model was computed, and these functions were then averaged over all sweeps. For details of the computation of the covariance function of the HMMs, see Rydén et al. (1998). Note that the averaged covariance function is not a covariance function of an HMM. Figure 2 shows a reasonable agreement between the two estimates up to lags of about 15, but after this the Bayesian estimate decays faster. Similar features emerge for the other sets of data. On the other hand, in the S&P 500 series, for larger lags the estimates are of the order $10^{-6}$, whence it is difficult to judge whether the true covariance function is different from zero. The figures also show Bayesian estimates of the marginal density together with kernel density estimates, and these agree very well.
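For reference, the model covariance function of $\{|y_t|\}$ used in these comparisons has a closed form for a zero-mean normal HMM, since $E(|y_t| \mid z_t = i) = \sigma_i \sqrt{2/\pi}$. The following sketch (our own, following this standard calculation rather than any code from Rydén et al., 1998) evaluates it up to a maximum lag; pi is the stationary vector of A.

```python
import numpy as np

def abs_cov_function(A, sigma, pi, max_lag):
    """Covariance function of |y_t| for a zero-mean normal HMM."""
    mean_abs = np.sqrt(2.0 / np.pi) * pi @ sigma          # E|y_t|
    cov = np.empty(max_lag + 1)
    cov[0] = pi @ (sigma ** 2) - mean_abs ** 2            # Var|y_t|
    Ah = np.eye(len(pi))
    for h in range(1, max_lag + 1):
        Ah = Ah @ A                                       # A^h
        e_prod = (2.0 / np.pi) * (pi * sigma) @ Ah @ sigma  # E(|y_t| |y_{t+h}|)
        cov[h] = e_prod - mean_abs ** 2
    return cov
```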

k    S&P 500   wind     magnetic
1    0.0000    0.0000   0.0000
2    0.4877    0.0050   0.0000
3    0.4521    0.8725   0.9830
4    0.0550    0.1168   0.0169
5    0.0040    0.0055   0.0002
6    0.0009    0.0001   0.0000
7    0.0003    0.0000   0.0000

Table 1: Posterior distribution of the number of components k.


Figure 1: Density plots for the loglikelihood values sampled at every 10th sweep. Upper panel:
S&P 500 data; middle panel: wind data; lower panel: magnetic data. Solid line: k = 2; dashed
line: k = 3; dotted line: k = 4; dash-dotted line: k = 5.


Figure 2: The S&P 500 data (upper panel), Bayesian estimate and kernel estimate of its marginal
density (middle panel, solid and dashed lines), and Bayesian estimate and non-parametric estimate
of its covariance function (lower panel, base 10 logarithmic scale, solid and dashed lines).


Figure 3: The wind data (upper panel), Bayesian estimate and two kernel density estimates (with different bandwidths) of its marginal density (middle panel, solid, dashed and dotted lines), and Bayesian estimate and non-parametric estimate of its covariance function (lower panel, base 10 logarithmic scale, solid and dashed lines).


Figure 4: The magnetic data (upper panel), Bayesian estimate and kernel estimate of its marginal
density (middle panel, solid and dashed lines), and Bayesian estimate and non-parametric estimate
of its covariance function (lower panel, base 10 logarithmic scale, solid and dashed lines).

k    c = 5    c = 10   c = 20   c = 30   c = 50   c = 70   c = 100
2    0.4607   0.4586   0.4785   0.4793   0.4701   0.4585   0.4831
3    0.4745   0.4756   0.4580   0.4608   0.4723   0.4811   0.4568
4    0.0597   0.0607   0.0583   0.0545   0.0528   0.0550   0.0549
5    0.0047   0.0044   0.0049   0.0048   0.0043   0.0051   0.0050
6    0.0004   0.0007   0.0003   0.0005   0.0005   0.0004   0.0003
7    0.0000   0.0000   0.0000   0.0001   0.0000   0.0000   0.0000

Table 2: Posterior distribution of the number of components k for the S&P 500 data, with $1/\lambda = c \max_t |y_t|$ and different choices of c.

4.3 Stability with respect to the prior


The parameter $\beta$ that determines the range of the prior for the $\sigma_i$'s is obviously a main design parameter. If $\beta$ is fixed, then one would like to set it larger than $\max_t |y_t|$ in order not to restrict the range for the $\sigma_i$'s too much. On the other hand, it is immediate from (12) that the acceptance probability of the split move is inversely proportional to $\beta$, so that a large $\beta$ pushes the posterior distribution for $k$ towards $k = 1$. That $\beta$ affects the acceptance ratio and posterior distribution so directly is of course undesirable, and it is the reason for making $\beta$ a hyperparameter. Our choice was to let $\beta$ have a negative exponential distribution with mean $1/\lambda$. Table 2 shows, for the S&P 500 data, the posterior distribution of $k$ for different choices of $1/\lambda$ in the range $5 \max_t |y_t|$ to $100 \max_t |y_t|$. The posteriors more or less agree, showing that this prior structure is much less informative than that with a fixed $\beta$. The results in this paper were obtained using $1/\lambda = 30 \max_t |y_t|$.
4.4 Performance of the MCMC algorithm
The acceptance rates for the split/combine move were 4.4%, 1.4% and 0.26%, respectively, for the S&P 500 data, the wind data and the magnetic data. These rates are a bit lower than desired, but since the split and combine moves involve a change of model only, and all other parameters are updated in each sweep, we feel that they are not too low, except for the magnetic data. The results for this series should thus be handled with caution. It should be pointed out that the number of invalid split proposals, that is, proposals with $\tilde{A}$ being non-stochastic, was very small, and hence proposals were rejected mainly because of the acceptance probabilities $R$.
One should also keep in mind that the degree of precision in the posterior distribution of $k$ bounds the achievable acceptance rate. Suppose, for simplicity, that there are only two possible values 1 and 2 for $k$, and that the posterior probability for $k = 1$ is 0.9. Then, in the long run, only one sweep out of ten may have $k = 2$, a change of dimension may occur only in two sweeps out of ten, and the acceptance rate is thus bounded by 0.2. For the wind and magnetic data the posteriors for $k$ are quite concentrated on $k = 3$, but of course we are far from attaining the corresponding bound for the acceptance rates.
The acceptance rates are affected by the parameters of the Beta distributions in the split move. We have run the above version and variants of the present algorithm with a large number of choices for the parameters, among which the ones given in Section 3 were the best. The birth/death move was completely redundant in the sense that its acceptance rates were practically zero (not above $3 \times 10^{-5}$). Hence, these moves can just as well be removed from the algorithm. Figures 5-7 show plots of the first 40,000 values of $k$, that is, the beginning of the burn-in, for the three sets of data (plotted every 20th sweep), the settling period of the empirical estimate of the posterior distribution of $k$, and the first 20,000 values of the $\sigma_i$'s in sweeps with $k$ equalling the value at which the posterior distribution attains its maximum.


Figure 5: Upper panel: first 40,000 values of k for the S&P 500 data, plotted every 20th sweep. Middle panel: estimated posterior distribution of k as a function of the number of sweeps. Lower panel: $\sigma_1$ and $\sigma_2$ in the first 20,000 sweeps with k = 2.


Figure 6: Upper panel: first 40,000 values of k for the wind data, plotted every 20th sweep. Middle panel: estimated posterior distribution of k as a function of the number of sweeps. Lower panel: $\sigma_1$, $\sigma_2$ and $\sigma_3$ in the first 20,000 sweeps with k = 3.


Figure 7: Upper panel: first 40,000 values of k for the magnetic data, plotted every 20th sweep. Middle panel: estimated posterior distribution of k as a function of the number of sweeps. Lower panel: $\sigma_1$, $\sigma_2$ and $\sigma_3$ in the first 20,000 sweeps with k = 3.

Figure 8: Posterior distribution of k under i.i.d. assumptions for S&P 500 data (solid line), wind
data (dashed line) and magnetic data (dotted line).

4.5 Comparison with an i.i.d. analysis


We found it interesting to compare our results to what is obtained under i.i.d. assumptions, that is, when the $z_t$'s are independent. Obviously, the $y_t$'s are then also independent, with each $y_t$ having the mixture distribution $\sum_{i=1}^{k} \pi_i N(0, \sigma_i^2)$, where $\pi_i = P(z_t = i)$. We implemented a reversible jump MCMC algorithm similar to the one presented in Richardson and Green (1997) for this case. The prior for $k$ was as above, that is, uniform on $\{1, 2, \ldots, k_{\max}\}$, the prior for the weights $\pi_1, \ldots, \pi_k$ conditional on $k$ was $\mathcal{D}(1, \ldots, 1)$, and the prior structure for the $\sigma_i$'s was as above. In the split move, the component weight $\pi_j$ was split as $\tilde{\pi}_{j_1} = u_0 \pi_j$ and $\tilde{\pi}_{j_2} = (1 - u_0) \pi_j$ with $u_0 \sim \mathrm{Be}(2, 2)$, and the standard deviation $\sigma_j$ was split as $\tilde{\sigma}_{j_1} = \sigma_j \bigl(1 - w\sqrt{(1 - u_0)/u_0}\bigr)$ and $\tilde{\sigma}_{j_2} = \sigma_j \bigl(1 + w\sqrt{u_0/(1 - u_0)}\bigr)$ with $w \sim \mathrm{Be}(1, k)$.
We ran the algorithm as above, that is, 1,000,000 sweeps after a 100,000-sweep burn-in. The results are quite different from what is obtained when modelling with HMMs. Figure 8 shows the posterior distributions for $k$, which are much more spread out than in Table 1. For the S&P 500 and wind data, there is even mass at $k = 30$, which was our $k_{\max}$. For the magnetic dataset the distribution is less spread out than for the other two datasets and actually has its mode at $k = 3$, just as in the HMM case. Comparing the posterior means for the $\sigma_i$'s given some $k$, we find similarities, however. For the S&P 500 data and $k = 2$, the figures are $(0.00465, 0.00934)$ and $(0.00464, 0.00995)$ for the HMM and i.i.d. analyses, respectively; for the wind data and $k = 3$ they are $(0.375, 1.08, 2.87)$ and $(0.0867, 0.657, 2.59)$; and for the magnetic data and $k = 3$ they are $(1.90, 3.28, 8.84)$ and $(1.91, 3.65, 10.1)$. For the wind data, there are only 2,488 samples with $k = 3$ in the i.i.d. analysis, which may explain the larger discrepancy.
The acceptance rates for the birth/death move are much larger for the i.i.d. algorithm, despite it being less sophisticated in the sense that the valid range of $w$ is not computed. The rates were 31% for the S&P 500 data, 33% for the wind data and 22% for the magnetic data.

5 Conclusions and scope for the future


We have seen that it is possible to construct reversible jump MCMC algorithms for Bayesian analysis of hidden Markov models with an unknown number of components. We believe that such algorithms may be a strong competitor to a frequentist approach based on maximum likelihood, not least from a computational point of view, because they combine parameter estimation and model selection in a powerful way. Obviously, it is desirable to find dimension-changing moves that achieve larger acceptance rates than those presented here. It may be fruitful to consider moves that preserve the marginal distribution to a lesser extent than do the above moves, but that instead preserve some of the dependence over time, such as the first-order autocorrelation. We have also experimented with a split move that first splits the standard deviation $\sigma_{j^*}$, then reallocates the appropriate $z_t$'s as if they were independent, and finally draws rows $j_1$ and $j_2$ of $\tilde{A}$ as in (a) of Section 3. The acceptance rates obtained using this split/combine pair were lower than those reported above, however.
We find the differences between the results using hidden Markov models and the results based on i.i.d. assumptions quite striking. Based on our experience with the latter analysis we draw two tentative conclusions. First, adding structure (here Markov dependence) to a model pushes the posterior distribution of $k$ towards smaller values. Secondly, adding structure to a model makes it more difficult to design split/combine moves, or more generally dimension-changing moves, that yield high acceptance rates.

Acknowledgements
The research of CPR relates to the TMR Network on Spatial and Computational Statistics, and was partially supported by CREST, INSEE. The research of TR was supported by the Swedish Natural Science Research Council (contract no. M-AA/MA 10538-303) and by the Royal Swedish Academy of Sciences through a grant from the Gustav Sigurd Magnuson foundation. This work was initiated when CPR and TR visited the Department of Statistics, University of Glasgow, and they wish to thank the department for its friendly environment and financial support.
The density estimates in Figures 1-4 were computed in Matlab using the kernel density estimation toolbox of C.C. Beardah, Nottingham Trent University, U.K.

References
Baum, L.E. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Statist. 37, 1554-1563.
Baum, L.E., Petrie, T., Soules, G. and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist. 41, 164-171.
Bickel, P.J., Ritov, Y. and Rydén, T. (1998). Asymptotic normality of the maximum-likelihood estimator for general hidden Markov models. Ann. Statist. (to appear).
Billio, M., Monfort, A. and Robert, C.P. (1998). Bayesian estimation of switching ARMA models. J. Econometrics (to appear).
Chib, S. (1996). Calculating posterior distributions and modal estimates in Markov mixture models. J. Econometrics 75, 79-97.
Churchill, G.A. (1995). Accurate restoration of DNA sequences (with discussion). In Case Studies in Bayesian Statistics, C. Gatsonis, J.S. Hodges, R.E. Kass and N.D. Singpurwalla (Eds.), Vol. II, 90-148. Springer-Verlag, New York.
Damien, P. and Walker, S. (1996). Sampling probability densities via uniform random variables and a Gibbs sampler. Preprint.
Fredkin, D.R. and Rice, J.A. (1992). Maximum likelihood estimation and identification directly from single-channel recordings. Proc. Royal Soc. Lond. B 249, 125-132.
Francq, C. and Roussignol, M. (1997). On white noise driven by hidden Markov chains. J. Time Series Anal. 18, 553-578.
Gilks, W.R., Richardson, S. and Spiegelhalter, D.J. (1996). Markov Chain Monte Carlo in Practice. Chapman and Hall, London.
Green, P. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711-732.
Hamilton, J.D. (1989). A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica 57, 357-384.
Krolzig, H.-M. (1997). Markov-Switching Vector Autoregressions. Lecture Notes in Economics and Mathematical Systems 454. Springer-Verlag, New York.
Leroux, B.G. (1992). Maximum-likelihood estimation for hidden Markov models. Stoch. Proc. Appl. 40, 127-143.
Leroux, B.G. and Puterman, M.L. (1992). Maximum-penalized-likelihood estimation for independent and Markov-dependent mixture models. Biometrics 48, 545-558.
MacDonald, I.L. and Zucchini, W. (1997). Hidden Markov and Other Models for Discrete-valued Time Series. Chapman and Hall, London.
McLachlan, G.J. (1987). On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. Appl. Statist. 36, 318-324.
Petrie, T. (1969). Probabilistic functions of finite state Markov chains. Ann. Math. Statist. 40, 97-115.
Rabiner, L.R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 257-284.
Richardson, S. and Green, P.J. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). J. Roy. Statist. Soc. B 59, 731-792.
Richardson, S. and Green, P.J. (1998). Corrigendum: On Bayesian analysis of mixtures with an unknown number of components. J. Roy. Statist. Soc. B 60, 661.
Robert, C.P., Celeux, G. and Diebolt, J. (1993). Bayesian estimation of hidden Markov chains: a stochastic implementation. Statist. Prob. Letters 16, 77-83.
Robert, C.P. and Titterington, D.M. (1998). Resampling schemes for hidden Markov models and their application for maximum likelihood estimation. Statistics and Computing (to appear).
Rydén, T., Teräsvirta, T. and Åsbrink, S. (1998). Stylized facts of daily return series and the hidden Markov model. J. Appl. Econometrics (to appear).
Tierney, L. (1994). Markov chains for exploring posterior distributions (with discussion). Ann. Statist. 22, 1701-1762.

Appendix A
In this appendix we describe the restricted backward algorithm used to update the hidden Markov chain $z$ in the split move. We are given two time indices $t_1 \leq t_2$ such that $z_t = j^*$ for $t_1 \leq t \leq t_2$ but $z_{t_1 - 1}, z_{t_2 + 1} \neq j^*$. We are also given all the $y_t$'s and the proposed $\tilde{A}$ and $\tilde{\sigma}_i$'s, and we want to sample $\tilde{z}_{t_1}, \ldots, \tilde{z}_{t_2}$ from the conditional distribution given $\tilde{z}_{t_1 - 1} = z_{t_1 - 1}$, $\tilde{z}_{t_2 + 1} = z_{t_2 + 1}$, given the above variables and given $\tilde{z}_t \in \{j_1, j_2\}$ for all $t_1 \leq t \leq t_2$. It is easy to see that $\tilde{z}_{t_1}, \ldots, \tilde{z}_{t_2}$ is an inhomogeneous Markov chain given the $y_t$'s and the parameters, and it still is when restricted to $\{j_1, j_2\}$.
Define the two-dimensional vectors $b_t = (b_t(j_1), b_t(j_2))$, $t_1 \leq t \leq t_2$, by
$$b_t(i) = P(y_{t+1}, \ldots, y_{t_2}, \tilde{z}_{t+1} \in C, \ldots, \tilde{z}_{t_2} \in C, \tilde{z}_{t_2 + 1} \mid \tilde{z}_t = i, \tilde{A}, \tilde{\sigma}),$$
where $C = \{j_1, j_2\}$. These vectors may be computed recursively as
$$b_{t_2}(i) = \tilde{a}_{i\, \tilde{z}_{t_2 + 1}} \qquad (16)$$
and, for $t = t_2 - 1, t_2 - 2, \ldots, t_1$,
$$b_t(i) = \sum_{j = j_1, j_2} b_{t+1}(j)\, \tilde{a}_{ij}\, \varphi(y_{t+1}; \tilde{\sigma}_j).$$
Now for each $t = t_1, \ldots, t_2$, in ascending order, we first compute the two-dimensional vector $c_t = (c_t(j_1), c_t(j_2))$ as
$$c_t(j) = \tilde{a}_{\tilde{z}_{t-1}\, j}\, \varphi(y_t; \tilde{\sigma}_j)\, b_t(j), \qquad (17)$$
then renormalise this vector to make it sum to one, and finally draw $\tilde{z}_t$ from the probability distribution so obtained. The normalised $c$-vector is the conditional distribution $P(\tilde{z}_t = j \mid \tilde{z}_{t-1}, y_1, \ldots, y_n, \tilde{z}_{t+1} \in C, \ldots, \tilde{z}_{t_2} \in C, \tilde{A}, \tilde{\sigma})$, so that this procedure provides us with a sample $\tilde{z}_{t_1}, \ldots, \tilde{z}_{t_2}$ from the desired conditional distribution. If $t_2 = n$, then replace the right-hand side of (16) by unity, and, if $t_1 = 1$, then replace $\tilde{a}_{\tilde{z}_{t_1 - 1}\, j}$ on the right-hand side of (17) by the stationary probability $\tilde{\pi}_j$.
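A sketch of this restricted backward algorithm in code (our own illustration, using 0-based time indices and assuming $t_1 \geq 1$; the boundary adjustments for $t_1 = 1$ and $t_2 = n$ described above would be added analogously):

```python
import numpy as np

def restricted_backward_sample(y, z_til, A_til, sig_til, t1, t2, j1, j2, rng):
    """Resample z~_{t1..t2} over the two states {j1, j2}, given the rest of
    the chain and the proposed parameters (Appendix A)."""
    C = [j1, j2]
    n = len(y)
    phi = lambda x, s: np.exp(-0.5 * (x / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
    m = t2 - t1 + 1
    b = np.ones((m, 2))
    # terminal condition (16): b_{t2}(i) = a~_{i, z~_{t2+1}}, or 1 if t2 is the last time point
    if t2 < n - 1:
        b[m - 1] = [A_til[i, z_til[t2 + 1]] for i in C]
    # backward recursion
    for t in range(t2 - 1, t1 - 1, -1):
        for a, i in enumerate(C):
            b[t - t1, a] = sum(b[t - t1 + 1, c] * A_til[i, j] * phi(y[t + 1], sig_til[j])
                               for c, j in enumerate(C))
    # forward sampling pass (17): draw z~_t given z~_{t-1} and the b-vector
    for t in range(t1, t2 + 1):
        c_vec = np.array([A_til[z_til[t - 1], j] * phi(y[t], sig_til[j]) * b[t - t1, a]
                          for a, j in enumerate(C)])
        c_vec /= c_vec.sum()
        z_til[t] = rng.choice(C, p=c_vec)
    return z_til
```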

Appendix B
In this appendix we evaluate the Jacobian determinant of the split move. We look at the transformation given by (8) and (10) from $(a_{ij}, a_{i j^*}, v_i, a_{j^* j}, u_j, u_0, u_1, \sigma_{j^*}, w)$, where $i, j \neq j^*$, to $(\tilde{a}_{ij}, \tilde{a}_{i j_1}, \tilde{a}_{i j_2}, \tilde{a}_{j_1 j}, \tilde{a}_{j_2 j}, \tilde{a}_{j_1 j_2}, \tilde{a}_{j_2 j_1}, \tilde{\sigma}_{j_1}, \tilde{\sigma}_{j_2})$, where $i, j \neq j_1, j_2$. Also, we do not include any diagonal elements of $A$ or $\tilde{A}$ in the tuples above; we can omit these because row sums equal unity. We obtain the following table of partial derivatives, which defines the Jacobian matrix:

              ã_ij   ã_{i j1}         ã_{i j2}          ã_{j1 j}  ã_{j2 j}  ã_{j1 j2}  ã_{j2 j1}  σ̃_{j1}  σ̃_{j2}
a_{ij}         I         0                 0                0         0          0         *         0        0
a_{i j*}       0     diag(v_i)        diag(1 − v_i)         0         0          0         *         0        0
v_i            0     diag(a_{i j*})   −diag(a_{i j*})       0         0          0         *         0        0
a_{j* j}       0         0                 0                *         *          *         *         0        0
u_j            0         0                 0                *         *          *         *         0        0
u_0            0         0                 0                *         *          *         *         *        *
u_1            0         0                 0                0         0          *         *         0        0
σ_{j*}         0         0                 0                0         0          0         0         *        *
w              0         0                 0                0         0          0         0         *        *

Here $I$ denotes an identity matrix, $0$ denotes a suitably sized vector or matrix of zeros, and $*$ denotes non-zero entries. Since the Jacobian matrix has a block diagonal structure, it follows that we can disregard the rows and columns corresponding to $a_{ij}$ and $\tilde{a}_{ij}$. What then remains of $J$ is upper block diagonal, whence it follows that we can consider the two subdeterminants corresponding to $(a_{i j^*}, v_i) \mapsto (\tilde{a}_{i j_1}, \tilde{a}_{i j_2})$ and $(\sigma_{j^*}, w) \mapsto (\tilde{\sigma}_{j_1}, \tilde{\sigma}_{j_2})$, respectively, separately.
Since adding a column of a matrix to another column does not change the determinant, the first subdeterminant is given by
$$\begin{vmatrix} \mathrm{diag}(v_i) & \mathrm{diag}(1 - v_i) \\ \mathrm{diag}(a_{i j^*}) & -\mathrm{diag}(a_{i j^*}) \end{vmatrix} = \begin{vmatrix} I & \mathrm{diag}(1 - v_i) \\ 0 & -\mathrm{diag}(a_{i j^*}) \end{vmatrix} = \prod_{i \neq j^*} a_{i j^*}.$$
The second subdeterminant is given by
$$\begin{vmatrix} \Bigl(1 - w\sqrt{\dfrac{1 - u_0}{u_0}}\Bigr)^{1/2} & \Bigl(1 + w\sqrt{\dfrac{u_0}{1 - u_0}}\Bigr)^{1/2} \\[2mm] -\dfrac{\sigma_{j^*}}{2}\Bigl(1 - w\sqrt{\dfrac{1 - u_0}{u_0}}\Bigr)^{-1/2}\sqrt{\dfrac{1 - u_0}{u_0}} & \dfrac{\sigma_{j^*}}{2}\Bigl(1 + w\sqrt{\dfrac{u_0}{1 - u_0}}\Bigr)^{-1/2}\sqrt{\dfrac{u_0}{1 - u_0}} \end{vmatrix} = \frac{\sigma_{j^*}/2}{\Bigl[u_0 - w\sqrt{u_0(1 - u_0)}\Bigr]^{1/2}\Bigl[1 - u_0 + w\sqrt{u_0(1 - u_0)}\Bigr]^{1/2}}.$$
What now remains of $J$ is

            ã_{j1 j}              ã_{j2 j}                      ã_{j1 j2}                                ã_{j2 j1}
a_{j* j}    diag(u_j/u_0)         diag((1 − u_j)/(1 − u_0))     −u_1 u_j/u_0                             *
u_j         diag(a_{j* j}/u_0)    −diag(a_{j* j}/(1 − u_0))     −u_1 a_{j* j}/u_0                        (1 − u_1) a_{j* j}/(1 − u_0)
u_0         −u_j a_{j* j}/u_0²    (1 − u_j) a_{j* j}/(1 − u_0)²  (u_1/u_0)(1 − ã_{j1 j1} − ã_{j1 j2})    (u_1 + ã_{j2 j1})/(1 − u_0)
u_1         0                     0                              ã_{j1 j1} + ã_{j1 j2}                  (u_0/(1 − u_0))(ã_{j1 j1} + ã_{j1 j2})

where most of the non-zero entries are now explicitly given. The determinant of this matrix is evaluated numerically. In doing this, we must compute the partial derivatives $\partial \tilde{a}_{j_2 j_1} / \partial a_{j^* j}$ for $j \neq j^*$. Looking at (8), we see that the main problem here is to calculate $\partial \rho_i / \partial a_{j^* j}$ for $i \neq j^*$ (recall that $\rho_i = \pi_i / \pi_{j^*}$). We now show how to do this.
Recall that the diagonal entries of $A$ are not considered as independent variables. The equilibrium equations are
$$\pi_\ell = \pi_{j^*} a_{j^* \ell} + \sum_{i \neq j^*, \ell} \pi_i a_{i\ell} + \pi_\ell a_{\ell\ell} = \pi_{j^*} a_{j^* \ell} + \sum_{i \neq j^*, \ell} \pi_i a_{i\ell} + \pi_\ell \Bigl(1 - \sum_{i \neq \ell} a_{\ell i}\Bigr),$$
whence
$$a_{j^* \ell} + \sum_{i \neq j^*, \ell} \rho_i a_{i\ell} - \rho_\ell \sum_{i \neq \ell} a_{\ell i} = 0$$
for each $\ell \neq j^*$. Differentiating with respect to $a_{j^* j}$ and writing $\rho_i' = \partial \rho_i / \partial a_{j^* j}$, we obtain
$$\delta_{\ell j} + \sum_{i \neq j^*, \ell} \rho_i' a_{i\ell} - \rho_\ell' \sum_{i \neq \ell} a_{\ell i} = \delta_{\ell j} + \sum_{i \neq j^*} \rho_i' a_{i\ell} - \rho_\ell' = 0 \qquad (18)$$
for each $\ell \neq j^*$, where $\delta_{\ell j}$ is Kronecker's $\delta$. The coefficient matrix of this linear system of equations is the restriction to states $\neq j^*$ of the matrix $A - I$. Our prior for $A$ implies that all its entries are positive, whence $A$ is irreducible and aperiodic. Hence unity is a simple eigenvalue of $A$, so that $A - I$ has rank $k - 1$. Therefore, its restriction to $k - 1$ states has full rank and (18) has a unique solution $\rho_\ell'$, $\ell \neq j^*$.
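A sketch of this computation (our own code): for a given $j \neq j^*$, the system (18) is solved by restricting $A$ to the states different from $j^*$ and transposing.

```python
import numpy as np

def drho_da(A, j_star, j):
    """Solve (18) for rho'_l = d(pi_l / pi_{j*}) / d a_{j*, j}, l != j_star."""
    k = A.shape[0]
    idx = [l for l in range(k) if l != j_star]
    # equations: sum_{i != j*} rho'_i a_{i,l} - rho'_l = -delta_{l,j}, for l != j*
    M = A[np.ix_(idx, idx)].T - np.eye(k - 1)
    b = np.array([-1.0 if l == j else 0.0 for l in idx])
    rho_prime = np.linalg.solve(M, b)
    return dict(zip(idx, rho_prime))
```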

