
Applied Probability Trust (1 April 2013)

MARKOV CHAIN ORDER ESTIMATION
AND THE CHI-SQUARE DIVERGENCE

A.R. BAIGORRI, C.R. GONÇALVES and P.A.A. RESENDE, University of Brasília

Email addresses: baig@mat.unb.br, catia@mat.unb.br and pa@mat.unb.br
Abstract

We define a few objects which capture relevant information from the sample of a Markov chain and we use the chi-square divergence as a measure of diversity between probability densities to define the estimators LDL and GDL for a Markov chain sample. After exploring their properties, we propose a new estimator for the Markov chain order. Finally we show numerical simulation results, comparing the performance of the proposed alternative with the well known and already established AIC, BIC and EDC estimators.

Keywords: GDL; LDL; AIC; BIC; EDC; Markov order estimation; chi-square divergence.

2010 Mathematics Subject Classification: Primary 62M05; Secondary 60J10; 62F12
1. Introduction
A sequence $\{X_i\}_{i=0}^{\infty}$ of random variables taking values in $E = \{1, \ldots, m\}$ is a Markov chain of order $\ell$ if for all $(a_0, \ldots, a_{n+1}) \in E^{n+2}$

$$P(X_{n+1} = a_{n+1} \mid X_0 = a_0, \ldots, X_n = a_n) = P(X_{n+1} = a_{n+1} \mid X_{n-\ell+1} = a_{n-\ell+1}, \ldots, X_n = a_n) \quad (1)$$
and $\ell$ is the smallest integer with this property. For simplicity, we assume $\{X_i\}_{i=0}^{\infty}$ time homogeneous, define $a_1^{l+k} = a_1^{l}\, a_{l+1}^{l+k} = (a_1, \ldots, a_{k+l}) \in E^{k+l}$ and

$$p(a_{\ell+1} \mid a_1^{\ell}) = P(X_{\ell+1} = a_{\ell+1} \mid X_1 = a_1, \ldots, X_{\ell} = a_{\ell}).$$

* Postal address: Department of Mathematics, University of Brasília, 70910-900, Brasília-DF, Brazil
Also, we have the i.i.d. case for $\ell = 0$. The class of processes satisfying condition (1) for a given $\ell \geq 0$ will be denoted by $\mathcal{M}_{\ell}$. In this setting, the order of a process in $\bigcup_{i=0}^{\infty} \mathcal{M}_i$ is the smallest integer $\ell$ such that $X = \{X_i\}_{i=0}^{\infty} \in \mathcal{M}_{\ell}$.
Over the last few decades there has been a great deal of research on the estimation of the order of a Markov chain, starting with Bartlett [4], Hoel [13], Good [12], Anderson and Goodman [3] and Billingsley [5], among others, dealing with hypothesis tests; more recently, using information criteria, Tong [19], Schwarz [17], Katz [14], Csiszár and Shields [6], Zhao et al. [20] and Dorea [11] have contributed new Markov chain order estimators.
Akaike's [1] entropic information criterion, known as AIC, has had a fundamental impact on statistical model evaluation problems. The AIC has been applied by Tong, for example, to the problem of estimating the order of autoregressive processes, autoregressive integrated moving average processes, and Markov chains. The Akaike-Tong (AIC) estimator was derived as an asymptotic estimate of the Kullback-Leibler information discrepancy and provides a useful tool for evaluating models estimated by the maximum likelihood method. Later on, Katz [14] derived the asymptotic distribution of the estimator and showed its inconsistency, proving that there is a positive probability of overestimating the true order no matter how large the sample size. Nevertheless, AIC is the most used Markov chain order estimator at the present time, mainly because it is more efficient than BIC for small samples.
The main consistent alternative, the BIC estimator, does not perform too well for relatively small samples, as was pointed out by Katz [14] and Csiszár and Shields [6]. It is natural to admit that the growth of the Markov chain complexity (size of the state space and order) has a significant influence on the sample size required for the identification of the unknown order, even though, most of the time, it is difficult to obtain sufficiently large samples in practical settings. In this sense, looking for a better strongly consistent alternative, Zhao et al. [20] and Dorea [11] established the EDC, properly adjusting the penalty term used in AIC and BIC.
All the estimators mentioned above are based on the penalized log-likelihood method and thus have their common roots in the likelihood ratio for hypothesis tests. In these notes we use a different entropic object, the $\chi^2$-divergence, and study its behaviour when applied to samples from random variables with multinomial empirical distributions derived from a Markov chain sample. Finally, we propose a new strongly consistent Markov chain order estimator, more efficacious than the already established AIC, BIC and EDC, as shall be exhibited through the outcomes of several numerical simulations.
This paper is organized as follows. Section 2 presents the $f$-divergences and a first order Markov chain derived from $X$, which is useful to extend the already known asymptotic results to orders larger than one. Section 3 introduces the proposed order estimator, namely GDL, and proves its strong consistency. Finally, Section 4 provides numerical simulations, where one can observe a better performance of GDL compared to AIC, BIC and EDC.
2. Auxiliary Results
2.1. Entropy and f-divergences

Basically, an $f$-divergence is a function that measures the discrepancy between two probability distributions $P$ and $Q$. The divergence is intuitively an average of the function $f$ of the odds ratio given by $P$ and $Q$.

These divergences were introduced and studied independently by Csiszár [7, 8] and Ali and Silvey [2], among others. Sometimes these divergences are referred to as Ali-Silvey distances.
Definition 2.1. Let $P$ and $Q$ be discrete probability densities with support $E = \{1, \ldots, m\}$. For a convex function $f(t)$ defined for $t > 0$, with $f(1) = 0$, the $f$-divergence of the distributions $P$ and $Q$ is

$$D_f(P \| Q) = \sum_{a \in E} Q(a)\, f\!\left(\frac{P(a)}{Q(a)}\right).$$

Here we take $0 f(\tfrac{0}{0}) = 0$, $f(0) = \lim_{t \downarrow 0} f(t)$ and $0 f(\tfrac{a}{0}) = \lim_{t \downarrow 0} t f(\tfrac{a}{t}) = a \lim_{u \to \infty} \tfrac{f(u)}{u}$.
For example, assuming $f(t) = t \log(t)$ or $f(t) = (1 - t)^2$ we have:

$$f(t) = t \log(t) \implies D_f(P \| Q) = \sum_{a \in E} P(a) \log\!\left(\frac{P(a)}{Q(a)}\right),$$

$$f(t) = (1 - t)^2 \implies D_f(P \| Q) = \sum_{a \in E} \frac{(P(a) - Q(a))^2}{Q(a)},$$

which are called relative entropy and $\chi^2$-divergence, respectively. From now on, the $\chi^2$-divergence shall be denoted by $D_{\chi^2}(P \| Q)$.
Observe that the triangle inequality is not satisfied in general, so that $D_{\chi^2}(P \| Q)$ is not a distance in the strict sense.

The $\chi^2$-divergence $D_{\chi^2}(P \| Q)$ underlies a well known statistical test procedure closely related to the chi-square distribution [15].
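For concreteness, both divergences above are immediate to compute. The following sketch (ours, not part of the paper; function names are illustrative) evaluates the relative entropy and the $\chi^2$-divergence for discrete densities on $E$, using the conventions of Definition 2.1 for vanishing terms.

```python
import numpy as np

def chi_square_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """D_{chi^2}(P||Q) = sum_a (P(a) - Q(a))^2 / Q(a), with 0/0 read as 0."""
    mask = q > 0
    return float(np.sum((p[mask] - q[mask]) ** 2 / q[mask]))

def relative_entropy(p: np.ndarray, q: np.ndarray) -> float:
    """D_f(P||Q) with f(t) = t log t, using the convention 0 log 0 = 0.
    Assumes Q(a) > 0 wherever P(a) > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([1/3, 1/3, 1/3])
print(chi_square_divergence(p, q), relative_entropy(p, q))
```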
2.2. Derived Markov Chains
Let $X_1^n = (X_1, \ldots, X_n)$ be a sample from a Markov chain $X = \{X_i\}_{i=0}^{\infty}$ of unknown order $\ell$, as already defined. Assume that, for all $x_1^{\ell+1} \in E^{\ell+1}$,

$$p(x_{\ell+1} \mid x_1^{\ell}) = P(X_{n+1} = x_{\ell+1} \mid X_{n-\ell+1}^{n} = x_1^{\ell}) > 0. \quad (2)$$
Following Doob [10], from the process $X$ we can derive a first order Markov chain $Y^{(\ell)} = \{Y_n^{(\ell)}\}$ by setting $Y_n^{(\ell)} = (X_n, \ldots, X_{n+\ell-1})$, so that, for $v = (i_1, \ldots, i_{\ell})$ and $w = (i'_1, \ldots, i'_{\ell})$,

$$P(Y_{n+1}^{(\ell)} = w \mid Y_n^{(\ell)} = v) = \begin{cases} p(i'_{\ell} \mid i_1 \ldots i_{\ell}), & \text{if } i'_j = i_{j+1},\ j = 1, \ldots, \ell - 1, \\ 0, & \text{otherwise.} \end{cases}$$
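As an illustration of Doob's construction, the sketch below (our code; the function name and the representation of the conditional probabilities as a Python callable are assumptions, not the paper's) builds the transition matrix of the derived chain on $E^{\ell}$, enforcing the overlap condition $i'_j = i_{j+1}$.

```python
import itertools
import numpy as np

def derived_kernel(p, m: int, ell: int) -> np.ndarray:
    """Transition matrix of Y^(ell): from state v, the only reachable states
    are the words w obtained by shifting v one step and appending a symbol j,
    with probability p(v, j) = p(j | v)."""
    states = list(itertools.product(range(1, m + 1), repeat=ell))
    index = {s: k for k, s in enumerate(states)}
    P = np.zeros((len(states), len(states)))
    for v in states:
        for j in range(1, m + 1):
            w = v[1:] + (j,)              # overlap condition i'_t = i_{t+1}
            P[index[v], index[w]] = p(v, j)
    return P

# With m = 2 and ell = 2 this yields a 4x4 stochastic matrix whose left
# eigenvector for eigenvalue 1 is the stationary distribution pi_ell.
```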
Clearly $Y^{(\ell)}$ is a first order, time homogeneous Markov chain that from now on shall be called the derived process, which by (2) is irreducible and positive recurrent, having a unique stationary distribution, say $\pi_{\ell}$. It is well known [10] that the derived Markov chain $Y^{(\ell)}$ is irreducible and aperiodic, consequently ergodic. Thus, there exists an equilibrium distribution $\pi_{\ell}(\cdot)$ satisfying, for any initial distribution $\mu$ on $E^{\ell}$,

$$\lim_{n \to \infty} \left| P_{\mu}(Y_n^{(\ell)} = x_1^{\ell}) - \pi_{\ell}(x_1^{\ell}) \right| = 0,$$
and

$$\pi_{\ell}(x_1^{\ell}) = \sum_{x \in E} \pi_{\ell}(x\, x_1^{\ell-1})\, p(x_{\ell} \mid x\, x_1^{\ell-1}).$$
Likewise, we can define, for $l > \ell$, $Y^{(l)}$ and verify that

$$\pi_l(x_1^l) = \pi_{\ell}(x_1^{\ell})\, p(x_{\ell+1} \mid x_1^{\ell}) \cdots p(x_l \mid x_{l-\ell}^{l-1}) = \sum_{x \in E} \pi_l(x\, x_1^{l-1})\, p(x_l \mid x\, x_{l-\ell}^{l-1}), \quad (3)$$
which shows that $\pi_l$ defined above is a stationary distribution for $Y^{(l)}$. For the sake of notational simplicity we will use from now on

$$\pi(a_1^l) = \pi_l(a_1^l), \quad l \geq \ell, \quad (4)$$

$$\pi(a_1^l) = \sum_{b_1^{\ell-l} \in E^{\ell-l}} \pi(b_1^{\ell-l} a_1^l), \quad l < \ell \quad (5)$$

and

$$p(j \mid a_1^l) = \frac{\sum_{b_1^{\ell-l} \in E^{\ell-l}} \pi(b_1^{\ell-l} a_1^l)\, p(j \mid b_1^{\ell-l} a_1^l)}{\sum_{b_1^{\ell-l} \in E^{\ell-l}} \pi(b_1^{\ell-l} a_1^l)}, \quad l < \ell. \quad (6)$$
Now, let us return to $X_1^n = (X_1, X_2, \ldots, X_n)$ and define

$$N(a_1^{\ell}) = \sum_{j=1}^{n-\ell+1} \mathbf{1}(X_j = a_1, \ldots, X_{j+\ell-1} = a_{\ell}), \quad (7)$$

that is, the number of occurrences of $a_1^{\ell}$ in $X_1^n$. If $\ell = 0$, we take $N(\,\cdot\,) = n$. From now on, the sums involving $N(a_1^{\ell})$ are taken over positive terms, or else we adopt the conventions $0/0 = 0$ and $0 \cdot \infty = 0$.
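The counts (7) are simple sliding-window statistics. A minimal sketch (our code; the helper name is ours and is reused in later sketches):

```python
from collections import Counter

def block_counts(sample, l):
    """N(a_1^l) of (7): occurrences of each l-block in X_1^n, scanning the
    n - l + 1 windows of length l. For l = 0 the paper's convention
    N(.) = n is applied by the caller."""
    n = len(sample)
    return Counter(tuple(sample[j:j + l]) for j in range(n - l + 1))

x = [1, 2, 1, 1, 2, 1, 2, 2, 1]
print(block_counts(x, 2)[(1, 2)])   # occurrences of the word (1, 2)
```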
The main interest in defining the derived process is the possibility of using the well established asymptotic results for first order Markov chains. Lemma 2.1 below is a version of the Law of the Iterated Logarithm, used by Dorea [11] to obtain Lemma 2.2, which will be used in the establishment of subsequent results. The Strong Law of Large Numbers (SLLN) is needed too and can be found in [9].
Lemma 2.1. (Meyn and Tweedie (1993).) Let $X = \{X_n\}_{n \geq 0}$ be an ergodic Markov chain with finite state space $E$ and stationary distribution $\pi$, $g : E \to \mathbb{R}$, $S_n(g) = \sum_{j=1}^{n} g(X_j)$ and

$$\sigma_g^2 = E_{\pi}(g^2(X_1)) + 2 \sum_{j=2}^{\infty} E_{\pi}(g(X_1)\, g(X_j)).$$

(a) If $\sigma_g^2 = 0$, then

$$\lim_{n \to \infty} \frac{1}{\sqrt{n}} \left[ S_n(g) - E_{\pi}(S_n(g)) \right] = 0 \quad \text{a.s.}$$

(b) If $\sigma_g^2 > 0$, then

$$\limsup_{n \to \infty} \frac{S_n(g) - E_{\pi}(S_n(g))}{\sqrt{2 \sigma_g^2\, n \log(\log(n))}} = 1 \quad \text{a.s.}$$

and

$$\liminf_{n \to \infty} \frac{S_n(g) - E_{\pi}(S_n(g))}{\sqrt{2 \sigma_g^2\, n \log(\log(n))}} = -1 \quad \text{a.s.}$$

Here $E_{\pi}$ denotes the expectation with initial distribution $\pi$ and a.s. abbreviates almost surely.
Lemma 2.2. (Dorea (2008).) If $Y^{(\ell)}$ is an ergodic Markov chain with finite state space $E^{\ell}$, initial distribution $\mu$, $\ell \geq 1$ and $i a_1^{\ell} j \in E^{\ell+2}$, then

$$\limsup_{n \to \infty} \frac{\left( N(i a_1^{\ell} j) - N(i a_1^{\ell})\, p(j \mid i a_1^{\ell}) \right)^2}{n \log(\log(n))} = 2\, \pi(i a_1^{\ell} j) \left( 1 - p(j \mid i a_1^{\ell}) \right) \quad \text{a.s.}$$
3. Main Results
Basically our approach consists in defining, for each sequence $a_1^{\ell} \in E^{\ell}$ and $i, j \in E$, two densities $P_{a_1^{\ell}}(i, j)$ and $Q_{a_1^{\ell}}(i, j)$. Comparing them by means of the $\chi^2$-divergence, we capture relevant information about the dependency related to $i a_1^{\ell} j$. In the sequel, we take a sum over all possible $i, j \in E$ and obtain an object carrying local information on the dependency order for $a_1^{\ell}$. Finally, summing over all $a_1^{\ell}$, rescaling properly and making some adjustments, we define the GDL Markov chain order estimator.
Definition 3.1. Assuming $N(a_1^l)$ as defined in (7), with $l \geq 0$ the length of the block $a_1^l$, consider

$$P_{a_1^l}(i, j) = \frac{N(i a_1^l j)}{\sum_{i,j} N(i a_1^l j)}, \qquad Q_{a_1^l}(i, j) = \frac{N(i a_1^l)\, N(a_1^l j)}{\sum_i N(i a_1^l)\, \sum_j N(a_1^l j)}$$

and

$$\chi^2(P \| Q) := n \sum_{i,j \in E} D_{\chi^2}\!\left( P_{a_1^l}(i, j) \,\middle\|\, Q_{a_1^l}(i, j) \right) = n \sum_{i,j \in E} \frac{\left( P_{a_1^l}(i, j) - Q_{a_1^l}(i, j) \right)^2}{Q_{a_1^l}(i, j)}. \quad (8)$$
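A sketch of the statistic (8) for a fixed context $a = a_1^l$ (our code, not the authors'); following (13), the marginals of $N(i a j)$ stand in here for $N(i a)$ and $N(a j)$, from which they differ only by $O(1)$ terms.

```python
from collections import Counter

def chi2_statistic(sample, a, alphabet):
    """n * sum_{i,j} (P_a(i,j) - Q_a(i,j))^2 / Q_a(i,j), as in (8)."""
    a, n, l = tuple(a), len(sample), len(a)
    Niaj = Counter()                        # counts of the words i a j
    for t in range(n - l - 1):
        if tuple(sample[t + 1:t + 1 + l]) == a:
            Niaj[(sample[t], sample[t + 1 + l])] += 1
    total = sum(Niaj.values())
    if total == 0:
        return 0.0
    Ni, Nj = Counter(), Counter()           # marginals ~ N(i a), N(a j)
    for (i, j), c in Niaj.items():
        Ni[i] += c
        Nj[j] += c
    stat = 0.0
    for i in alphabet:
        for j in alphabet:
            Q = Ni[i] * Nj[j] / total ** 2
            if Q > 0:
                P = Niaj[(i, j)] / total
                stat += (P - Q) ** 2 / Q
    return n * stat
```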
Using the SLLN and assuming $l \geq \ell$, we conclude

$$\lim_{n \to \infty} P_{a_1^l}(i, j) = \lim_{n \to \infty} \frac{N(i a_1^l j)}{\sum_{i,j} N(i a_1^l j)} = \lim_{n \to \infty} \frac{N(i a_1^l)/n}{\sum_{i,j} N(i a_1^l j)/n} \cdot \frac{N(i a_1^l j)}{N(i a_1^l)} = \frac{\pi(i a_1^l)}{\pi(a_1^l)}\, p(j \mid i a_1^l) \quad \text{a.s.} \quad (9)$$

and analogously,

$$\lim_{n \to \infty} Q_{a_1^l}(i, j) = \lim_{n \to \infty} \frac{N(i a_1^l)\, N(a_1^l j)}{\sum_i N(i a_1^l)\, \sum_j N(a_1^l j)} = \lim_{n \to \infty} \frac{N(i a_1^l)/n}{\sum_i N(i a_1^l)/n} \cdot \frac{N(a_1^l j)}{\sum_j N(a_1^l j)} = \frac{\pi(i a_1^l)}{\pi(a_1^l)}\, p(j \mid a_1^l) \quad \text{a.s.} \quad (10)$$

In the same manner, but using the notation defined in (5), we conclude for $l < \ell$ that

$$\lim_{n \to \infty} P_{a_1^l}(i, j) = \frac{\pi(i a_1^l)}{\pi(a_1^l)}\, p(j \mid i a_1^l) \quad \text{a.s.} \quad (11)$$

and

$$\lim_{n \to \infty} Q_{a_1^l}(i, j) = \frac{\pi(i a_1^l)}{\pi(a_1^l)}\, p(j \mid a_1^l) \quad \text{a.s.} \quad (12)$$

In (9) and (10) we used the easily computed equivalence¹

$$\sum_i N(i a_1^l) = \sum_j N(a_1^l j) + O(1) = \sum_{i,j} N(i a_1^l j) + O(1) = N(a_1^l) + O(1). \quad (13)$$

¹ Here we use the O notation: $g(n) = O(f(n))$ means that $\lim_{n \to \infty} g(n)/f(n) = \text{constant} > 0$.
Theorem 3.1. For $\chi^2(P \| Q)$ as defined in (8):

(a) If $l \geq \ell$, there exists $L_{a_1^l} \in \mathbb{R}$ such that

$$P\!\left( \limsup_{n \to \infty} \frac{\chi^2(P \| Q)}{2 \log\log(n)} \leq L_{a_1^l} \right) = 1. \quad (14)$$

(b) If $l = \ell - 1$, there exist $a_1^{\ell-1}$ and $i, j, k \in E$ with $k \neq i$ such that $p(j \mid i a_1^{\ell-1}) \neq p(j \mid k a_1^{\ell-1})$; for these ones

$$P\!\left( \limsup_{n \to \infty} \frac{\chi^2(P \| Q)}{2 \log\log(n)} = \infty \right) = 1.$$
Proof.

(a) $l \geq \ell$.

Replacing $P_{a_1^l}(i, j)$ and $Q_{a_1^l}(i, j)$, and using (12) and (13), we have

$$\limsup_{n \to \infty} \frac{\chi^2(P \| Q)}{2 \log\log(n)} = \sum_{i,j \in E} \limsup_{n \to \infty} \frac{n \left( P_{a_1^l}(i, j) - Q_{a_1^l}(i, j) \right)^2}{2 \log\log(n)\, Q_{a_1^l}(i, j)}$$

$$= \sum_{i,j \in E} \frac{\pi(a_1^l)}{2 \pi(i a_1^l)\, p(j \mid a_1^l)} \limsup_{n \to \infty} \frac{\left( \dfrac{n\, N(i a_1^l j)}{N(a_1^l) + O(1)} - \dfrac{n\, N(i a_1^l)}{N(a_1^l) + O(1)} \cdot \dfrac{N(a_1^l j)}{N(a_1^l) + O(1)} \right)^2}{n \log\log(n)} \quad \text{a.s.} \quad (15)$$

By the SLLN,

$$\frac{n}{N(a_1^l) + O(1)} \xrightarrow{\ \text{a.s.}\ } \frac{1}{\pi(a_1^l)} \qquad \text{and} \qquad \frac{N(a_1^l j)}{N(a_1^l) + O(1)} \xrightarrow{\ \text{a.s.}\ } p(j \mid a_1^l).$$

Applying these in (15), and using Lemma 2.2, we have

$$\limsup_{n \to \infty} \frac{\chi^2(P \| Q)}{2 \log\log(n)} = \sum_{i,j \in E} \frac{1}{2 \pi(a_1^l)\, \pi(i a_1^l)\, p(j \mid a_1^l)} \limsup_{n \to \infty} \frac{\left( N(i a_1^l j) - N(i a_1^l)\, p(j \mid a_1^l) \right)^2}{n \log\log(n)} \quad \text{a.s.} \quad (16)$$

$$= \sum_{i,j \in E} \frac{1}{2 \pi(a_1^l)\, \pi(i a_1^l)\, p(j \mid a_1^l)} \limsup_{n \to \infty} \frac{\left( N(i a_1^l j) - N(i a_1^l)\, p(j \mid i a_1^l) \right)^2}{n \log\log(n)}$$

$$= \sum_{i,j \in E} \frac{\pi(i a_1^l j) \left( 1 - p(j \mid i a_1^l) \right)}{\pi(a_1^l)\, \pi(i a_1^l)\, p(j \mid a_1^l)} \quad \text{a.s.} \quad < \infty.$$

In the second equality we used that $p(j \mid a_1^l) = p(j \mid i a_1^l)$, which is a consequence of $l \geq \ell$. Now, taking $L_{a_1^l}$ sufficiently large, we conclude (14).
(b) $l = \ell - 1$.

Continuing from (16) and considering the notation (6) and (12),

$$\limsup_{n \to \infty} \frac{\chi^2(P \| Q)}{2 \log\log(n)} = \sum_{i,j \in E} \frac{1}{2 \pi(a_1^l)\, \pi(i a_1^l)\, p(j \mid a_1^l)} \limsup_{n \to \infty} \frac{\left( N(i a_1^l j) - N(i a_1^l)\, p(j \mid a_1^l) \right)^2}{n \log\log(n)} \quad \text{a.s.}$$

$$= \sum_{i,j \in E} \frac{1}{2 \pi(a_1^l)\, \pi(i a_1^l)\, p(j \mid a_1^l)} \limsup_{n \to \infty} \frac{n^2 \left( \dfrac{N(i a_1^l j)}{n} - \dfrac{N(i a_1^l)}{n}\, p(j \mid a_1^l) \right)^2}{n \log\log(n)} \quad \text{a.s.}$$

$$= \sum_{i,j \in E} \frac{1}{2 \pi(a_1^l)\, \pi(i a_1^l)\, p(j \mid a_1^l)} \limsup_{n \to \infty} \frac{n \left[ \pi(i a_1^l) \left( p(j \mid i a_1^l) - p(j \mid a_1^l) \right) \right]^2}{\log\log(n)} = \infty \quad \text{a.s.} \quad (17)$$

In the last equality we used the hypothesis that there exist $i, j, k \in E$ with $p(j \mid i a_1^l) \neq p(j \mid k a_1^l)$, so that $p(j \mid i a_1^l) = p(j \mid a_1^l)$ cannot hold for all $i, j \in E$. $\square$
Herein we define the Local Dependency Level (LDL) and the Global Dependency Level (GDL).
Definition 3.2. Let $X^n = \{X_i\}_{i=1}^n$ be a sample of a Markov chain $X$ of order $\ell \geq 0$. Assume $l \geq 0$, $P$, $Q$ and $\chi^2(P \| Q)$ as previously defined. Also, consider $V$ a $\chi^2$ random variable with $(m-1)^2$ degrees of freedom and $\bar{P} : \mathbb{R}_+ \to [0, 1]$ the continuous strictly decreasing function defined by

$$\bar{P}(x) = P(V \geq x), \quad x \in \mathbb{R}_+.$$

(a) The Local Dependency Level $\widehat{LDL}_n(a_1^l)$ for $a_1^l$ is

$$\widehat{LDL}_n(a_1^l) = \frac{\chi^2(P \| Q)}{2 \log(\log(n))},$$

where $\chi^2(P \| Q)$ is computed for the block $a_1^l$ as in (8).

(b) The Global Dependency Level $\widehat{GDL}_n(l)$ is

$$\widehat{GDL}_n(l) = \bar{P}\!\left( \sum_{a_1^l \in E^l} \left( \frac{N(a_1^l)}{n} \right) \widehat{LDL}_n(a_1^l) \right).$$
The LDL provides a measure of dependency for a specific $a_1^l$, which could be analysed separately. In the GDL we rescale a weighted average of LDLs to fit a proper variability. Observe that, if the true order is $\ell$, then, by Theorem 3.1(a), for every $a_1^l$ with $l \geq \ell$ there is a finite constant $L_l$ such that

$$P\!\left( \liminf_{n \to \infty} \widehat{GDL}_n(l) \geq \bar{P}(L_l) \right) = 1 \quad (18)$$

and for $l = \ell - 1$

$$P\!\left( \lim_{n \to \infty} \widehat{GDL}_n(l) = \bar{P}(\infty) = 0 \right) = 1. \quad (19)$$

Consequently, for a Markov chain $X$ of order $\ell$,

$$\ell = 0 \iff \lim_{n \to \infty} \widehat{GDL}_n(l) \geq \bar{P}(L_l) > 0, \quad l = 0, 1, \ldots, B;$$

otherwise

$$\ell = \max_{0 \leq l \leq B} \left\{ l : \lim_{n \to \infty} \widehat{GDL}_n(l) = 0 \right\} + 1.$$

Finally, let us define the Markov chain order estimator based on the information contained in the vector $\widehat{GDL}_n = \left( \widehat{GDL}_n(0), \ldots, \widehat{GDL}_n(B) \right)$.
Definition 3.3. Given a fixed number $B \in \mathbb{N}$, let us define the set $S = \{0, 1\}^{B+1}$ and the application $T : S \to \mathbb{N} \cup \{-1\}$ defined by

$$T(s_0, \ldots, s_B) = \begin{cases} -1, & \text{if } s_i = 1 \text{ for } i = 0, \ldots, B, \\ \max_{0 \leq i \leq B} \{ i : s_i = 0 \}, & \text{otherwise.} \end{cases}$$
Definition 3.4. Let $X^n = \{X_i\}_{i=1}^n$ be a sample from the Markov chain $X$ of order $\ell$, $0 \leq \ell < B \in \mathbb{N}$ and $\{\widehat{GDL}_n(i)\}_{i=0}^B$ as above. We define the order estimator $\hat{\ell}_{GDL}(X^n)$ as

$$\hat{\ell}_{GDL}(X^n) = T(\hat{s}_n) + 1$$

with $\hat{s}_n \in S$ defined by

$$\hat{s}_n = \operatorname*{argmin}_{s \in S} \left\{ \sum_{i=0}^{B} \left( \widehat{GDL}_n(i) - s(i) \right)^2 \right\},$$

where $s(i)$ is the projection onto the $i$-th coordinate.

By (18) and (19) it is clear that the order estimator converges almost surely to the true value, i.e.,

$$P\!\left( \lim_{n \to \infty} \hat{\ell}_{GDL}(X^n) = \ell \right) = 1.$$
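Under our reading of Definitions 3.2-3.4, the whole pipeline can be sketched as follows (our code, reusing `block_counts` and `chi2_statistic` from the earlier sketches; SciPy's `chi2.sf` plays the role of the survival function $\bar{P}$, and the coordinatewise 0.5 threshold is exactly the argmin over $S = \{0,1\}^{B+1}$, since the squared distance decomposes coordinate by coordinate).

```python
import math
from scipy.stats import chi2

def gdl(sample, l, alphabet):
    """GDL_n(l): survival probability P(V >= weighted average of LDLs),
    where V ~ chi^2 with (m-1)^2 degrees of freedom."""
    n = len(sample)
    scale = 2.0 * math.log(math.log(n))
    blocks = block_counts(sample, l).items() if l > 0 else [((), n)]  # N(.) = n if l = 0
    acc = sum((Na / n) * chi2_statistic(sample, a, alphabet) / scale
              for a, Na in blocks)
    return chi2.sf(acc, df=(len(alphabet) - 1) ** 2)

def gdl_order(sample, B, alphabet):
    """T(s_hat) + 1: s_hat is the point of {0,1}^(B+1) closest to the GDL vector."""
    g = [gdl(sample, l, alphabet) for l in range(B + 1)]
    s = [1 if v >= 0.5 else 0 for v in g]     # coordinatewise nearest 0/1 value
    zeros = [i for i, si in enumerate(s) if si == 0]
    return max(zeros) + 1 if zeros else 0     # T = -1 when every s_i = 1
```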
4. Numerical Simulation

In what follows we shall compare the non-asymptotic performance, mainly for small samples, of some of the most used Markov chain order estimators. Consider the notation $N(a_1^{\ell})$, as defined in (7), and denote

$$\hat{L}(k) = \prod_{a_1^{k+1}} \left( \frac{N(a_1^{k+1})}{N(a_1^{k})} \right)^{N(a_1^{k+1})}.$$
The estimators of the Markov chain order are defined under the hypothesis that there exists a known $B$ such that $0 \leq \ell \leq B$.
The best known order estimators are

$$\hat{\ell}_{AIC} = \operatorname{argmin}\{AIC(k) ;\ k = 0, 1, \ldots, B\},$$
$$\hat{\ell}_{BIC} = \operatorname{argmin}\{BIC(k) ;\ k = 0, 1, \ldots, B\},$$
$$\hat{\ell}_{EDC} = \operatorname{argmin}\{EDC(k) ;\ k = 0, 1, \ldots, B\},$$

where

$$AIC(k) = -2 \log \hat{L}(k) + 2 |E|^{k} (|E| - 1),$$
$$BIC(k) = -2 \log \hat{L}(k) + |E|^{k} (|E| - 1) \log(n),$$
$$EDC(k) = -2 \log \hat{L}(k) + 2 |E|^{k+1} \log\log(n).$$

By a simple observation, for large enough $n$, we verify that

$$AIC(k) \leq EDC(k) \leq BIC(k).$$
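For completeness, a sketch of the three penalized criteria above (our code; `block_counts` is the helper defined earlier, and the $\ell = 0$ convention $N(\cdot) = n$ is applied to the denominator):

```python
import math

def log_likelihood(sample, k):
    """log L(k) = sum over (k+1)-blocks of N(a_1^{k+1}) log(N(a_1^{k+1}) / N(a_1^k))."""
    Nk1 = block_counts(sample, k + 1)
    Nk = block_counts(sample, k) if k > 0 else {(): len(sample)}
    return sum(c * math.log(c / Nk[a[:-1]]) for a, c in Nk1.items())

def penalized_order(sample, B, m, criterion="BIC"):
    n = len(sample)
    penalty = {
        "AIC": lambda k: 2 * m ** k * (m - 1),
        "BIC": lambda k: m ** k * (m - 1) * math.log(n),
        "EDC": lambda k: 2 * m ** (k + 1) * math.log(math.log(n)),
    }[criterion]
    scores = [-2 * log_likelihood(sample, k) + penalty(k) for k in range(B + 1)]
    return scores.index(min(scores))   # argmin over k = 0, ..., B
```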
Clearly, for a given sample, the order estimator GDL, as well as AIC, BIC and EDC, contains much of the information concerning the sample's relative dependency. Nevertheless, numerical simulations as well as theoretical considerations anticipate a great deal of variability for small samples.
The following numerical simulation, based on an algorithm due to Raftery [16], starts with the generation of a Markov chain transition matrix $Q = (q_{i_1 i_2 \ldots i_{\ell}; i_{\ell+1}})$ with entries

$$q_{i_1 i_2 \ldots i_{\ell}; i_{\ell+1}} = \sum_{t=1}^{\ell} \lambda_t\, R(i_t, i_{\ell+1}), \quad i_1^{\ell+1} \in E^{\ell+1}, \quad (20)$$

where the matrix $R$ satisfies

$$R(i, j) \geq 0, \quad 1 \leq i, j \leq m, \qquad \sum_{j=1}^{m} R(i, j) = 1, \quad 1 \leq i \leq m,$$

and the positive numbers

$$\{\lambda_i\}_{i=1}^{\ell}, \qquad \sum_{i=1}^{\ell} \lambda_i = 1,$$

are arbitrarily chosen in advance.
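A sketch of Raftery's construction (20) and of the replication step (our code; $R_4$ and the uniform $\lambda$ reproduce one of the settings used below, while the seed and the initial block are arbitrary illustrative choices):

```python
import itertools
import random

def mtd_kernel(R, lam):
    """q_{i_1...i_l; j} = sum_t lam[t] R(i_t, j), states indexed from 0."""
    m, l = len(R), len(lam)
    return {ctx: [sum(lam[t] * R[ctx[t]][j] for t in range(l)) for j in range(m)]
            for ctx in itertools.product(range(m), repeat=l)}

def sample_chain(Q, m, l, n, seed=0):
    rng = random.Random(seed)
    x = [rng.randrange(m) for _ in range(l)]          # arbitrary initial l-block
    while len(x) < n:
        x.append(rng.choices(range(m), weights=Q[tuple(x[-l:])])[0])
    return x

R4 = [[0.05, 0.05, 0.90], [0.05, 0.90, 0.05], [0.90, 0.05, 0.05]]
Q = mtd_kernel(R4, lam=[1/3, 1/3, 1/3])
x = sample_chain(Q, m=3, l=3, n=2000)                 # one replication
```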
Once the matrix $Q = (q_{i_1 i_2 \ldots i_{\ell}; i_{\ell+1}})$ is obtained, two hundred replications of the Markov chain sample of size $n$, state space $E$ and transition matrix $Q$ are generated to compare the performance of GDL against the standard, well known and already established order estimators just mentioned above.
Katz [14] obtained the asymptotic distribution of $\hat{\ell}_{AIC}$ and proved its inconsistency, showing the existence of a positive probability of overestimating the order; see also [18]. Besides that, Csiszár and Shields [6] and Zhao et al. [20] proved strong consistency of the estimators $\hat{\ell}_{BIC}$ and $\hat{\ell}_{EDC}$, respectively.
It is quite intuitive that the random information regarding the order of a Markov chain is spread over an exponentially growing set $\Theta$ of empirical distributions with $|\Theta| = m^{B+1}$, where $B$ is the maximum admissible order. It seems reasonable to think that a minimal viable sample, i.e. a sample able to retrieve enough information to estimate the chain order, should have size $n \approx O(m^{B+1})$. Based on this, we have chosen the sample sizes for each case.

Finally, after applying all estimators to each one of the replicated samples, the final results are registered in the tables below.
Case I: Markov chain examples with $\ell = 0$, $|E| = 3$.

Firstly, we choose the matrices $\{R_1, R_2, R_3\}$ to produce samples with sizes $500 \leq n \leq 2000$, originated from Markov chains of order $\ell = 0$ with quite different probability distributions given by:

$$R_1 = \begin{pmatrix} 0.33 & 0.335 & 0.335 \\ 0.33 & 0.335 & 0.335 \\ 0.33 & 0.335 & 0.335 \end{pmatrix}, \quad R_2 = \begin{pmatrix} 0.05 & 0.475 & 0.475 \\ 0.05 & 0.475 & 0.475 \\ 0.05 & 0.475 & 0.475 \end{pmatrix}, \quad R_3 = \begin{pmatrix} 0.05 & 0.05 & 0.90 \\ 0.05 & 0.05 & 0.90 \\ 0.05 & 0.05 & 0.90 \end{pmatrix}.$$
Table 1: Rates of fitness for case |E| = 3, ℓ = 0, n ∈ {500, 1000, 1500} and distribution given by R_1.

             n = 500                     n = 1000                     n = 1500
  k    AIC    BIC   EDC   GDL      AIC   BIC   EDC   GDL       AIC    BIC   EDC   GDL
  0    75.5%  100%  100%  99%      80%   100%  100%  99.5%     71.5%  100%  100%  99%
  1    24.5%  -     -     1%       18%   -     -     0.5%      22.5%  -     -     1%
  2    -      -     -     -        2%    -     -     -         6%     -     -     -
  3    -      -     -     -        -     -     -     -         -      -     -     -
  4    -      -     -     -        -     -     -     -         -      -     -     -

Table 2: Rates of fitness for case |E| = 3, ℓ = 0, n ∈ {1000, 1500, 2000} and distribution given by R_2.

             n = 1000                    n = 1500                     n = 2000
  k    AIC    BIC   EDC   GDL      AIC    BIC   EDC   GDL      AIC   BIC   EDC   GDL
  0    63.5%  100%  100%  99%      63%    100%  100%  99%      59%   100%  100%  99%
  1    29%    -     -     1%       34.5%  -     -     1%       37%   -     -     1%
  2    7.5%   -     -     -        2.5%   -     -     -        4%    -     -     -
  3    -      -     -     -        -      -     -     -        -     -     -     -
  4    -      -     -     -        -      -     -     -        -     -     -     -

Table 3: Rates of fitness for case |E| = 3, ℓ = 0, n ∈ {1000, 1500, 2000} and distribution given by R_3.

             n = 1000                    n = 1500                      n = 2000
  k    AIC   BIC   EDC   GDL       AIC    BIC   EDC    GDL      AIC    BIC   EDC   GDL
  0    43%   100%  100%  98%       47%    100%  99.5%  96%      46%    100%  100%  97%
  1    53%   -     -     2%        51.5%  -     0.5%   4%       50.5%  -     -     2%
  2    4%    -     -     -         1.5%   -     -      -        3.5%   -     -     1%
  3    -     -     -     -         -      -     -      -        -      -     -     -
  4    -     -     -     -         -      -     -      -        -      -     -     -
Notice that, for the fixed sample sizes $n \in \{500, 1000, 1500, 2000\}$, the order estimator $\hat{\ell}_{AIC}$ steadily overestimates the real order $\ell = 0$, with the excess depending on the probability distribution of the Markov chain. Differently, the order estimators $\hat{\ell}_{BIC}$, $\hat{\ell}_{EDC}$ and $\hat{\ell}_{GDL}$ show consistent performance, mostly obtaining the right order, free from the influence of the sample size and the generating matrix. The apparent efficiency of $\hat{\ell}_{BIC}$ and $\hat{\ell}_{EDC}$ in this case ($\ell = 0$) is a consequence of the great tendency of these estimators to underestimate the order.
Case II: Markov chain examples with $\ell = 3$, $|E| = 3$ and $\ell \in \{2, 3, 0\}$, $|E| = 4$.

Secondly, we choose the matrices $\{R_4, R_5\}$ to produce samples with sizes $n \in \{500, 1000, 1500, 2000\}$, originated from Markov chains with $|E| = 3$ of order $\ell = 3$.

$$R_4 = \begin{pmatrix} 0.05 & 0.05 & 0.90 \\ 0.05 & 0.90 & 0.05 \\ 0.90 & 0.05 & 0.05 \end{pmatrix}, \quad R_5 = \begin{pmatrix} 0.475 & 0.475 & 0.05 \\ 0.475 & 0.05 & 0.475 \\ 0.05 & 0.475 & 0.475 \end{pmatrix}.$$
Table 4: Rates of fitness for case |E| = 3, ℓ = 3, n ∈ {1000, 1500, 2000}, distribution given by R_4 and λ_i = 1/3, i = 1, 2, 3.

             n = 1000                       n = 1500                       n = 2000
  k    AIC   BIC    EDC    GDL      AIC   BIC    EDC    GDL      AIC   BIC   EDC    GDL
  0    -     -      -      -        -     -      -      -        -     -     -      -
  1    -     -      -      -        -     -      -      -        -     -     -      -
  2    -     99.5%  88.5%  41%      -     76.5%  16.5%  5%       -     17%   0.5%   1%
  3    100%  0.5%   11.5%  59%      100%  23.5%  83.5%  95%      100%  83%   99.5%  99%
  4    -     -      -      -        -     -      -      -        -     -     -      -

Table 5: Rates of fitness for case |E| = 3, ℓ = 3, n ∈ {1000, 1500, 2500}, distribution given by R_5 and λ_i = 1/3, i = 1, 2, 3.

             n = 1000                        n = 1500                        n = 2500
  k    AIC    BIC    EDC    GDL      AIC   BIC    EDC    GDL       AIC    BIC   EDC    GDL
  0    -      0.5%   -      -        -     -      -      -         -      -     -      -
  1    -      92.5%  69.5%  6.5%     -     54.5%  19.5%  1%        -      -     -      -
  2    16.5%  7%     30.5%  92%      2%    45.5%  80.5%  80.5%     -      100%  98.5%  8.5%
  3    83.5%  -      -      1.5%     98%   -      -      18.5%     100%   -     1.5%   91.5%
  4    -      -      -      -        -     -      -      -         -      -     -      -
For $|E| = 3$, $\ell = 3$, the estimator $\hat{\ell}_{AIC}$ overestimates the order to a lesser extent than in the previous case, while $\hat{\ell}_{BIC}$ and $\hat{\ell}_{EDC}$, overweighted by their respective penalty terms, underestimate the order more than was to be expected. Concerning $\hat{\ell}_{GDL}$, it rapidly converges to the right order as the sample size $n$ grows.
For $|E| = 4$, the greater complexity of a Markov chain of order $\ell = 3$ imposes the use of a larger sample size to accomplish some reliability. Finally, we choose the matrices $\{R_6, R_7\}$ to produce samples with size $n = 5000$, originated from Markov chains of order $\ell \in \{2, 3, 0\}$, as in the previous cases.

$$R_6 = \begin{pmatrix} 0.05 & 0.05 & 0.05 & 0.85 \\ 0.05 & 0.05 & 0.85 & 0.05 \\ 0.05 & 0.85 & 0.05 & 0.05 \\ 0.85 & 0.05 & 0.05 & 0.05 \end{pmatrix}, \quad R_7 = \begin{pmatrix} 0.05 & 0.05 & 0.05 & 0.85 \\ 0.05 & 0.05 & 0.05 & 0.85 \\ 0.05 & 0.05 & 0.05 & 0.85 \\ 0.05 & 0.05 & 0.05 & 0.85 \end{pmatrix}.$$
Table 6: Rates of fitness for case |E| = 4, ℓ ∈ {2, 3, 0}, n = 5000 and distributions given by R_6, R_7 and λ_i = 1/ℓ, i = 1, ..., ℓ, if ℓ > 0.

         R_6, λ_i = 1/2, ℓ = 2        R_6, λ_i = 1/3, ℓ = 3        R_7, ℓ = 0
  k    AIC   BIC   EDC   GDL      AIC   BIC   EDC   GDL      AIC   BIC   EDC   GDL
  0    -     -     -     -        -     -     -     -        85%   100%  100%  100%
  1    -     -     -     -        -     -     -     -        15%   -     -     -
  2    100%  100%  100%  100%     -     99%   -     4%       -     -     -     -
  3    -     -     -     -        100%  1%    100%  96%      -     -     -     -
  4    -     -     -     -        -     -     -     -        -     -     -     -
  5    -     -     -     -        -     -     -     -        -     -     -     -
  6    -     -     -     -        -     -     -     -        -     -     -     -
For $|E| = 4$ and $\ell = 0$, apparently $\hat{\ell}_{AIC}$ keeps overestimating the order to some degree, while $\hat{\ell}_{BIC}$, as in the example with $\ell = 3$, severely underestimates the order, presumably due to its excessively weighted penalty term. On the contrary, $\hat{\ell}_{EDC}$ and $\hat{\ell}_{GDL}$ behave quite well in the same setting.
Conclusion

The pioneering research started with the contributions of Bartlett [4], Hoel [13], Good [12], Anderson and Goodman [3] and Billingsley [5], among others, who developed hypothesis tests for the estimation of the order of a given Markov chain.

Later on these procedures were adapted and improved with the use of penalty functions [19, 14] together with other tools created in the realm of model selection [1, 17]. Since then, there has been a considerable number of subsequent contributions on this subject, several of them consisting in the enhancement of the already existing techniques [6, 20].

In these notes we propose a new Markov chain order estimator based on a different idea, which makes it behave in quite a different form. This estimator is strongly consistent and more efficient than AIC (which is inconsistent), outperforming the well established and consistent BIC and EDC, mainly on relatively small samples.
References

[1] Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control 19, 716-723.

[2] Ali, S. and Silvey, S. (1966). A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B (Methodological) 28, 131-142.

[3] Anderson, T. W. and Goodman, L. A. (1957). Statistical inference about Markov chains. The Annals of Mathematical Statistics 28, 89-110.

[4] Bartlett, M. S. (1951). The frequency goodness of fit test for probability chains. Proceedings of the Cambridge Philosophical Society.

[5] Billingsley, P. (1961). Statistical methods in Markov chains. The Annals of Mathematical Statistics 32, 12-40.

[6] Csiszár, I. and Shields, P. C. (2000). The consistency of the BIC Markov order estimator. The Annals of Statistics 28, 1601-1619.

[7] Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar. 2, 299-318.

[8] Csiszár, I. and Shields, P. C. (2004). Information Theory and Statistics: A Tutorial. Now Publishers Inc.

[9] Dacunha-Castelle, D., Duflo, M. and McHale, D. (1986). Probability and Statistics, vol. II. Springer.

[10] Doob, J. L. (1966). Stochastic Processes (Wiley Publications in Statistics). John Wiley & Sons Inc.

[11] Dorea, C. C. Y. (2008). Optimal penalty term for EDC Markov chain order estimator. Annales de l'Institut de Statistique de l'Université de Paris (l'ISUP) 52, 15-26.

[12] Good, I. J. (1955). The likelihood ratio test for Markoff chains. Biometrika 42, 531-533.

[13] Hoel, P. G. (1954). A test for Markoff chains. Biometrika 41, 430-433.

[14] Katz, R. W. (1981). On some criteria for estimating the order of a Markov chain. Technometrics 23, 243-249.

[15] Pardo, L. (2005). Statistical Inference Based on Divergence Measures. Chapman and Hall/CRC.

[16] Raftery, A. E. (1985). A model for high-order Markov chains. Journal of the Royal Statistical Society, Series B.

[17] Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6, 461-464.

[18] Shibata, R. (1976). Selection of the order of an autoregressive model by Akaike's information criterion. Biometrika 63, 117-126.

[19] Tong, H. (1975). Determination of the order of a Markov chain by Akaike's information criterion. Journal of Applied Probability 12, 488-497.

[20] Zhao, L., Dorea, C. and Gonçalves, C. (2001). On determination of the order of a Markov chain. Statistical Inference for Stochastic Processes 4, 273-282.
