Preliminaries
$$D_I(h, g) = \int h(x)\, \log \frac{h(x)}{g(x)}\, d\lambda(x).$$
The other usual distance between two densities is the χ² distance, defined as
$$D_{\chi^2}(h, g) = \int_{A(h)\cup A(g)} \frac{(h - g)^2}{g}\, d\lambda.$$
This metric assumes values in [0, ∞]. For a countable space this reduces to
$$D_{\chi^2}(h, g) = \sum_{x \in A(h)\cup A(g)} \frac{(h(x) - g(x))^2}{g(x)}.$$
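As a quick numerical illustration of these two definitions, here is a minimal Python sketch for probability mass functions on a common finite support (the example vectors are arbitrary and not taken from the paper):

```python
import numpy as np

def kl_divergence(h, g):
    """D_I(h, g) = sum_x h(x) log(h(x)/g(x)), with the convention 0 log 0 = 0."""
    h, g = np.asarray(h, dtype=float), np.asarray(g, dtype=float)
    mask = h > 0
    return float(np.sum(h[mask] * np.log(h[mask] / g[mask])))

def chi2_distance(h, g):
    """D_chi2(h, g) = sum_x (h(x) - g(x))^2 / g(x)."""
    h, g = np.asarray(h, dtype=float), np.asarray(g, dtype=float)
    return float(np.sum((h - g) ** 2 / g))

h = np.array([0.5, 0.3, 0.2])   # two arbitrary pmfs on the same support
g = np.array([0.4, 0.4, 0.2])
print(kl_divergence(h, g), chi2_distance(h, g))
```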
This measure was defined by Pearson; for some history, see [9]. The χ² distance does not have all the properties of a metric but, like the KL divergence, the χ² distance between product measures can be bounded in terms of the distances between their marginals; see [10]. Some inequalities for the Kullback-Leibler divergence and the χ² distance are developed in [11] and [12]. Section 2 presents the statistical model and model selection concepts. Section 3 presents a relation between the KL divergence and the χ² distance. Section 4 presents the main problem, introduces a new bound for the KL measure and illustrates it with two examples. Section 5 presents a simulation study in the framework of logistic regression and of some classical models for model selection. In Section 6 we present an application of the new bound to construct a two-sided bound for Shannon's entropy.
2

2.1 Kullback-Leibler Risk

2.2 Model Selection
Model selection is the task of choosing a model with the correct inductive bias, which in practice means selecting a family of densities in an attempt to create a model of optimal complexity for the given data. Suppose we have a collection of data, and let M denote a class of rival models for these data. Each model G ∈ M is considered as a set of probability distribution functions for the data. In this framework we do not impose that one of the candidate models G in M is a correct model. A fundamental assumption in classical hypothesis testing is that h belongs to a parametric family of densities, i.e. h ∈ G. To illustrate model selection, let Y be a random variable from an unknown density h(·). A model is assumed as a possible explanation of Y, represented by (g) = {g(y; β), β ∈ B} = (g_β(·))_{β∈B}. This function is known but its parameter β ∈ B is unknown. The aim is to ascertain whether (g) can be viewed as a family containing h(·) or as having a member which is a good approximation to h(·). The log-likelihood loss of g_β relative to h(·) for an observation Y is log[h(Y)/g_β(Y)]. The expectation of this loss under h(·), or risk, is the KL divergence between g_β and h(·):
$$KL(h, g_\beta) = E_h\left[\log \frac{h(Y)}{g_\beta(Y)}\right].$$
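In practice this risk can be approximated by a Monte Carlo average of the loss over a sample from h. The sketch below uses a synthetic example where h is known (a standard normal truth and a shifted normal candidate, for which KL(h, g_β) = β²/2 is available in closed form as a check); none of these choices come from the paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=100_000)        # sample from the true density h = N(0, 1)

beta = 0.5                                    # candidate g_beta = N(beta, 1)
loss = norm.logpdf(y, 0.0, 1.0) - norm.logpdf(y, beta, 1.0)   # log h(Y)/g_beta(Y)

print(loss.mean())    # Monte Carlo estimate of KL(h, g_beta)
print(beta ** 2 / 2)  # exact value for two unit-variance normals
```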
3 Relation Between KL Divergence and χ² Distance
The following theorem gives an upper bound for the KL divergence based on the χ² distance; see [11].
Theorem 3.1 The KL divergence, D_I(h, g), and the χ² distance, D_{χ²}(h, g), satisfy
$$D_I(h, g) \le \log[D_{\chi^2}(h, g) + 1].$$
In particular D_I(h, g) ≤ D_{χ²}(h, g), with equality if and only if h = g.
Proof: Since log is a concave function, we have
$$E_h\left[\log \frac{h}{g}\right] \le \log E_h\left[\frac{h}{g}\right],$$
then
$$D_I(h, g) \le \log \int \frac{h}{g}\, h\, d\lambda.$$
But
$$\int \frac{(h - g)^2}{g}\, d\lambda = \int \frac{h^2}{g}\, d\lambda - 1,$$
thus
$$D_I(h, g) \le \log[D_{\chi^2}(h, g) + 1] \le D_{\chi^2}(h, g). \qquad (1)$$
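A quick numerical check of these inequalities on an arbitrary pair of probability mass functions (a sketch, not an example from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
h = rng.dirichlet(np.ones(6))      # two arbitrary pmfs on six points
g = rng.dirichlet(np.ones(6))

kl   = np.sum(h * np.log(h / g))
chi2 = np.sum((h - g) ** 2 / g)

# Theorem 3.1: D_I(h, g) <= log[D_chi2(h, g) + 1] <= D_chi2(h, g)
print(kl, np.log(chi2 + 1.0), chi2)
assert kl <= np.log(chi2 + 1.0) <= chi2
```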
4 Main Problem
One problem with the KL criterion is that its value has no intrinsic meaning. Investigators commonly display big numbers, only the last digits of which are used to decide which is the smallest. If the specific structure of the models is of interest, because it tells us something about the explanation of the observed phenomena, it may be interesting to measure how far from the truth each model is. But this may not be possible. To do this we can quantify the difference of risks between two models; see [15] and [17]. The Kullback-Leibler risk takes values between 0 and +∞, whereas we would like to work with values between 0 and 1 for risks. We may relate the KL values to relative errors. We make errors by evaluating the probability of an event E using a distribution g, P_g(E), rather than using the true distribution h, P_h(E). We may evaluate the relative error RE(P_g(E), P_h(E)) = (P_h(E))^{-1}(P_g(E) - P_h(E)). To obtain a simple formula relating KL to the error on P_h(E), we consider the case P_h(E) = 0.5 and g/h constant on E and E^c; in that case we find RE(P_g(E), P_h(E)) ≈ √(2 KL(h, g)), the approximation being valid for a small KL value (a small numerical check is given after the display below). For KL values of 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1} we find that RE(P_g(E), P_h(E)) is equal to 0.014, 0.045, 0.14 and 0.44, errors that we may qualify as negligible, small, moderate and large, respectively. Constructing a set of reasonable rival models is of interest in theory and in applications; see [18] and [19]. As another insight, there is an established relationship between the total variation, Hellinger, Kullback-Leibler and χ² metrics, which can be stated as follows:
$$0 \le \sup_{A}\left|\int_A dP - \int_A dP'\right| \le \left[\tfrac{1}{2}\, KL(P, P')\right]^{1/2} \le \left[\tfrac{1}{2}\, \log\!\left(\int \frac{(P - P')^2}{P'}\, d\lambda + 1\right)\right]^{1/2}.$$
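A small numerical check of the relative-error mapping described above (a sketch: in the special case P_h(E) = 0.5 with g/h constant on E and E^c, the exact relative error solves −½ log(1 − RE²) = KL, while √(2 KL) is the small-KL approximation quoted in the text):

```python
import numpy as np

for kl in [1e-4, 1e-3, 1e-2, 1e-1]:
    re_exact  = np.sqrt(1.0 - np.exp(-2.0 * kl))   # solves -0.5*log(1 - RE^2) = KL
    re_approx = np.sqrt(2.0 * kl)                  # approximation for small KL
    print(f"KL = {kl:g}: exact RE = {re_exact:.3f}, sqrt(2 KL) = {re_approx:.3f}")
```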
4.1
Theorem 4.1 Let h and g be two densities and suppose that 0 < h(·) < 1. Then
$$KL(h, g) \le \int_{A(h)} \left[\left(\frac{h}{g}\right)^{h} - 1\right] d\lambda \le D_{\chi^2}(h, g),$$
with equality for h(·) = g(·). This bound is better than the bound introduced in (1).
Proof: We recall the known inequality log x ≤ x − 1 for x > 0. With x = (h/g)^h this gives
$$h \log \frac{h}{g} = \log\left(\frac{h}{g}\right)^{h} \le \left(\frac{h}{g}\right)^{h} - 1.$$
Integrating over A(h),
$$KL(h, g) = \int_{A(h)} h \log \frac{h}{g}\, d\lambda \le \int_{A(h)} \left[\left(\frac{h}{g}\right)^{h} - 1\right] d\lambda.$$
We need to show that this bound is better than the bound which is given in (1). In fact we need to show that
$$\int_{A(h)} \left[\left(\frac{h}{g}\right)^{h} - 1\right] d\lambda \le D_{\chi^2}(h, g).$$
Since 0 < h < 1, we have y^h ≤ 1 + h(y − 1) for every y > 0; taking y = h/g gives
$$\left(\frac{h}{g}\right)^{h} - 1 \le h\left(\frac{h}{g} - 1\right) = \frac{h^2}{g} - h.$$
Integrating over A(h) implies that
$$KL(h, g) \le \int_{A(h)} \left[\left(\frac{h}{g}\right)^{h} - 1\right] d\lambda \le \int_{A(h)} \left(\frac{h^2}{g} - h\right) d\lambda = D_{\chi^2}(h, g).$$
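As a quick sanity check of the chain in Theorem 4.1, the following sketch evaluates the three quantities for an arbitrary pair of probability mass functions, so that 0 < h(x) < 1 holds automatically (this is an illustration, not the paper's example):

```python
import numpy as np

rng = np.random.default_rng(2)
h = rng.dirichlet(np.ones(8))            # arbitrary pmf, so 0 < h(x) < 1
g = rng.dirichlet(np.ones(8))

kl        = np.sum(h * np.log(h / g))
new_bound = np.sum((h / g) ** h - 1.0)   # proposed bound: sum of (h/g)^h - 1 over A(h)
chi2      = np.sum((h - g) ** 2 / g)

print(kl, new_bound, chi2)
assert kl <= new_bound <= chi2           # KL <= new bound <= D_chi2
```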
As an example, consider the exponential density
$$h(x; \lambda) = \lambda \exp\{-\lambda x\}, \qquad x > 0,$$
with true parameter λ_t, and take the rival g(x) = h(x; λ_c). Table 1 reports KL(h, g) together with the proposed bound ∫_{A(h)}[(h/g)^h − 1] dλ and the two χ²-based bounds. For example the value of KL(h, g) for (λ_t, λ_c) = (0.9, 0.3), (0.9, 0.6), (0.9, 0.8) is equal to 0.6317, 0.0928, 0.0083 respectively, which are of order 10^{-1}, 10^{-2} and 10^{-3}. The corresponding values of the proposed bound have the same order as the KL loss, but the order of the χ² bound is greater than that of the KL loss. So, under the conditions mentioned in Theorem 4.1, the new bound is better than the χ² bound. This indicates that the new bound is more precise than the known χ² bound for model choice. Only in three situations, when the rival model has parameters which are far from the parameters of the true model, is the first χ² bound better than the proposed bound.
Table 1- Comparison between KL(h, g), the proposed bound and the χ² bounds for pairs (λ_t, λ_c).

(λ_t, λ_c)   KL(h, g)   ∫[(h/g)^h − 1] dλ   log{∫(h²/g − h) dλ + 1}   ∫(h²/g − h) dλ
(0.9, 0.1)   1.3092     1.5498              1.5612                    3.7647
(0.9, 0.3)   0.4319     0.6317              0.5878 (not satisfied)    0.8000
(0.9, 0.6)   0.0721     0.0928              0.1187                    0.1250
(0.9, 0.8)   0.0067     0.0083              0.0124                    0.0125
(0.9, 1.1)   0.0213     0.0260              0.0506                    0.0519
(0.9, 1.2)   0.0457     0.0548              0.1178                    0.1250
(0.9, 1.3)   0.0767     0.0918              0.2344                    0.2642
(0.9, 1.4)   0.1137     0.1359              0.3691                    0.4464
(0.5, 0.3)   0.1236     0.1289              0.1744                    0.1905
(0.5, 0.5)   0.0000     0.0000              0.0000                    0.0000
(0.5, 0.7)   0.0635     0.0707              0.1744                    0.1905
(0.5, 0.9)   0.2122     0.2358              1.0217                    1.7778
(0.5, 0.1)   0.8309     1.0558              1.0217 (not satisfied)    1.7778
(0.5, 0.2)   0.3163     0.3817              0.4463                    0.5625
(0.5, 0.3)   0.1105     0.1289              0.1744                    0.1905
(0.5, 0.4)   0.0231     0.0263              0.7138                    1.0417
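The rows of Table 1 can be reproduced by numerical integration. The sketch below assumes, as the KL values suggest, exponential densities h(x) = λ_t e^{−λ_t x} and g(x) = λ_c e^{−λ_c x}:

```python
import numpy as np
from scipy.integrate import quad

def criteria(lam_t, lam_c):
    h = lambda x: lam_t * np.exp(-lam_t * x)
    g = lambda x: lam_c * np.exp(-lam_c * x)

    kl   = np.log(lam_t / lam_c) + lam_c / lam_t - 1.0                 # closed form for exponentials
    newb = quad(lambda x: (h(x) / g(x)) ** h(x) - 1.0, 0, np.inf)[0]   # proposed bound
    chi2 = quad(lambda x: h(x) ** 2 / g(x) - h(x), 0, np.inf)[0]       # D_chi2(h, g)
    return kl, newb, np.log(chi2 + 1.0), chi2

# the (0.9, 0.3) row of Table 1: approximately 0.4319, 0.6317, 0.5878, 0.8000
print(criteria(0.9, 0.3))
```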
Figure 1: Comparison between the KL, the new bound and the χ² bound. [Four panels, each plotting the criteria KL, New bound and Chi2 bound against Index.]
5 Simulation study

5.1 Logistic Model
We consider a simulation study where we have to select between different logistic regression models. We consider an i.i.d. sample of size n of triples (Y_i, x_{1i}, x_{2i}), i = 1, 2, ..., n, from the following distribution. The conditional distribution of Y_i given (x_{1i}, x_{2i}) was logistic with
$$\mathrm{logit}[f_{Y|X}(1|x_{1i}, x_{2i})] = 1.5 + x_{1i} + 3 x_{2i},$$
where f_{Y|X}(1|x_{1i}, x_{2i}) = P_h(Y = 1|x_{1i}, x_{2i}); the marginal distribution of (x_{1i}, x_{2i}) was bivariate normal with zero expectation and variance equal to the identity matrix. We consider h as the true model and consider two non-nested families of parametric densities G ∈ M and F ∈ M, G = (g_β(·))_{β∈B} and F = (f_γ(·))_{γ∈Γ}, as rival models, where
$$G = \{g(\cdot, \beta): \mathbb{R} \to \mathbb{R}^{+};\ \beta \in B \subset \mathbb{R}^{d}\} = (g_\beta(\cdot))_{\beta \in B}.$$
The reduced (rival) model specifies the linear predictor
$$\sum_{k=1}^{2} \gamma_k x_{1ik} + \gamma_3 x_{2i},$$
where the x_{1ik} were dummy variables indicating in which category x_{1i} fell; the categories were defined using terciles of the observed distribution of x_1, and this was represented by two dummy variables, x_{1i1} indicating whether x_{1i} fell in the first tercile or not and x_{1i2} indicating whether x_{1i} fell in the second tercile or not.
Using the reduced model, we can use the KL divergence for conditional models. The precise estimates of KL(h, f_{γ_0}), ∫_{A(h)}[(h/f_{γ_0})^h − 1] dλ, log(∫(h²/f_{γ_0} − h) dλ + 1) and ∫(h²/f_{γ_0} − h) dλ satisfy
$$KL(h, f_{\gamma_0}) \le \int_{A(h)} \left[\left(\frac{h}{f_{\gamma_0}}\right)^{h} - 1\right] d\lambda \le \log\left(\int \left(\frac{h^2}{f_{\gamma_0}} - h\right) d\lambda + 1\right) \le \int \left(\frac{h^2}{f_{\gamma_0}} - h\right) d\lambda.$$
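A minimal simulation sketch of this setup (the use of scikit-learn to fit the reduced model and the Monte Carlo averaging over the covariates are implementation choices for illustration, not the paper's code):

```python
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20_000

# covariates and true conditional model: logit P(Y=1 | x1, x2) = 1.5 + x1 + 3*x2
x1, x2 = rng.normal(size=n), rng.normal(size=n)
p_true = expit(1.5 + x1 + 3.0 * x2)
y = rng.binomial(1, p_true)

# reduced (rival) model: two tercile dummies for x1, plus x2
t1, t2 = np.quantile(x1, [1 / 3, 2 / 3])
D = np.column_stack([(x1 <= t1).astype(float),
                     ((x1 > t1) & (x1 <= t2)).astype(float),
                     x2])
p_red = LogisticRegression(C=1e6).fit(D, y).predict_proba(D)[:, 1]   # ~unpenalised MLE

# Bernoulli criteria for one covariate value, then averaged over the sample
def bern_terms(p, q):
    ph, pg = np.array([p, 1.0 - p]), np.array([q, 1.0 - q])
    kl   = np.sum(ph * np.log(ph / pg))
    newb = np.sum((ph / pg) ** ph - 1.0)        # proposed bound (Theorem 4.1)
    chi2 = np.sum((ph - pg) ** 2 / pg)
    return kl, newb, chi2

vals = np.array([bern_terms(pt, pr) for pt, pr in zip(p_true, p_red)])
print(vals.mean(axis=0))   # averaged [KL, new bound, D_chi2] for the conditional models
```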
5.2 Classical Models
Table 2- Results of simulation for model selection based on the proposed bound for the gamma distribution as the rival model.

Parameters                                    (1.9, 1.2)  (2.3, 1.4)  (2.5, 1.6)  (3, 3)   (3, 4)
KL(W_t, W_r)                                  0.3941      0.5090      0.4771      0.1453   0.0000
KL(W_t, G_r)                                  1.4343      1.6873      1.8751      2.3355   2.1982
∫{(h_{W_t}/g_{G_r})^{h_{W_t}} − 1} dλ         3.9264      3.9352      3.9296      3.9554   4.0552
log{∫(h²_{W_t}/g_{G_r} − h_{W_t}) dλ + 1}     6.7181      4.5223      10.2651     9.4141   6.5211
In Table 2 we denote the true density and the rival model by h_{W_t} and g_{G_r}, respectively. The variability of the χ² criterion is greater than the variability of the new bound, and the upper bound based on the proposed bound is always smaller than the one based on the known χ² bound.
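A sketch of how one such column could be computed by numerical integration; the Weibull and gamma parameters below are arbitrary placeholders, not the ones behind Table 2:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import weibull_min, gamma

h = weibull_min(c=1.5, scale=1.0)   # assumed true Weibull density h_{W_t}
g = gamma(a=2.0, scale=0.5)         # assumed gamma rival g_{G_r}

kl   = quad(lambda x: h.pdf(x) * np.log(h.pdf(x) / g.pdf(x)), 0, np.inf)[0]
newb = quad(lambda x: (h.pdf(x) / g.pdf(x)) ** h.pdf(x) - 1.0, 0, np.inf)[0]
chi2 = quad(lambda x: h.pdf(x) ** 2 / g.pdf(x) - h.pdf(x), 0, np.inf)[0]

print(kl, newb, np.log(chi2 + 1.0), chi2)
```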
Figure 2: Results of simulation based on the proposed bound for Weibull and lognormal distributions in the non-nested case. [Panel plots the criteria KL, New bound and Chi2 bound against Index.]

6 Application
Example 6.1: Let h(x), g(x), for x ∈ Ω, be two probability mass functions (instead of density functions h and g, respectively), where g(x) = 1/Card(Ω) and Card(Ω) is finite. Taking into account that
$$KL(h, g) = \sum_{x} h(x) \log \frac{h(x)}{1/\mathrm{Card}(\Omega)},$$
which reduces to
$$KL(h, g) = \log \mathrm{Card}(\Omega) - \sum_{x} h(x) \log \frac{1}{h(x)},$$
where Σ_x h(x) log(1/h(x)) is known as Shannon's entropy. In addition,
$$D_{\chi^2}(h, g) = \mathrm{Card}(\Omega) \sum_{x} h^2(x) - 1,$$
and the term appearing in the proposed bound becomes
$$\left(\frac{h(x)}{g(x)}\right)^{h(x)} - 1 = [\mathrm{Card}(\Omega)\, h(x)]^{h(x)} - 1.$$
We claim that
$$\sum_{x} \left\{[\mathrm{Card}(\Omega)\, h(x)]^{h(x)} - 1\right\} \le \mathrm{Card}(\Omega) \sum_{x} h^2(x) - 1. \qquad (2)$$
Because of the condition 0 < h(x) < 1 and the inequality y^h ≤ 1 + h(y − 1) for y > 0, we obtain [Card(Ω)h(x)]^{h(x)} − 1 ≤ Card(Ω)h²(x) − h(x); on the other hand Σ_x h(x) = 1, which implies that (2) holds.
6.1

Using the bound
$$KL(h, g) \le \int_{A(h)} \left[\left(\frac{h}{g}\right)^{h} - 1\right] d\lambda$$
and the Shannon entropy
$$H(X) = \sum_{x} h(x) \log \frac{1}{h(x)},$$
we can bound H(X) as
$$\log|\Omega| - \sum_{x} \left\{[|\Omega|\, h(x)]^{h(x)} - 1\right\} \le H(X) \le \log|\Omega| - \frac{1}{2}\left(\sum_{x} |h(x) - g(x)|\right)^{2},$$
where |Ω| = Card(Ω) and g is the uniform distribution on Ω.
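A quick numerical check of this two-sided bound for an arbitrary probability mass function (a sketch; g is the uniform distribution on Ω):

```python
import numpy as np

rng = np.random.default_rng(3)
h = rng.dirichlet(np.ones(5))        # arbitrary pmf on |Omega| = 5 points
N = h.size                           # Card(Omega)
g = np.full(N, 1.0 / N)              # uniform reference distribution

H     = -np.sum(h * np.log(h))                          # Shannon entropy
lower = np.log(N) - np.sum((N * h) ** h - 1.0)          # log|Omega| - sum{[|Omega| h(x)]^h(x) - 1}
upper = np.log(N) - 0.5 * np.sum(np.abs(h - g)) ** 2    # log|Omega| - (1/2)(sum|h(x) - g(x)|)^2

print(lower, H, upper)
assert lower <= H <= upper
```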
Discussion
References
[1] I. Csiszár, A note on Jensen's inequality, Studia Sci. Math. Hungar., 1 (1966), 185-188.

[2] J. H. B. Kemperman, On the optimal rate of transmitting information, Annals of Mathematical Statistics, 40 (1969), 2156-2177.

[3] M. S. Pinsker, Information and Information Stability of Random Variables and Processes, A. Feinstein, tr. and ed., Holden-Day, San Francisco, 1964.

[4] S. Kullback, A lower bound for discrimination information in terms of variation, IEEE Trans. Inf. Theory, 13 (1967), 126-127.

[5] S. Kullback, Correction to a lower bound for discrimination information in terms of variation, IEEE Trans. Inf. Theory, 16 (1970), 652.

[6] I. Vajda, Note on discrimination information and variation, IEEE Trans. Inf. Theory, 16 (1970), 771-773.

[7] F. Topsøe, Bounds for entropy and divergence of distributions over a two-element set, Journal of Inequalities in Pure and Applied Mathematics, 2 (2001), Article 25.

[8] S. Kullback, R. Leibler, On information and sufficiency, Annals of Mathematical Statistics, 22 (1951), 79-86.

[9] F. Liese, I. Vajda, Convex Statistical Distances, B. G. Teubner Verlagsgesellschaft, Leipzig, 1987.

[10] R. D. Reiss, Approximate Distributions of Order Statistics, Springer-Verlag, New York, 1989.

[11] S. S. Dragomir, V. Gluscevic, Some Inequalities for the Kullback-Leibler and χ²-Distances in Information Theory and Applications, Tamsui Oxford Journal of Mathematical Sciences, 17, 2 (2001), 97-111.

[12] S. S. Dragomir, Some Properties for the Exponential of the Kullback-Leibler Divergence, Tamsui Oxford Journal of Mathematical Sciences, 24 (2008), 141-151.

[13] T. M. Cover, J. A. Thomas, Elements of Information Theory, John Wiley and Sons, New York, 1991.

[14] H. White, Maximum Likelihood Estimation of Misspecified Models, Econometrica, 50, 1 (1982), 1-26.

[15] D. Commenges, A. Sayyareh, L. Letenneur, J. Guedj, A. Bar-Hen, Estimating a Difference of Kullback-Leibler Risks Using a Normalized Difference of AIC, The Annals of Applied Statistics, 2, 3 (2008), 1123-1142.

[16] A. Sayyareh, R. Obeidi, A. Bar-Hen, Empirical comparison of some model selection criteria, Communications in Statistics - Simulation and Computation, 40 (2011), 72-86.

[17] Q. H. Vuong, Likelihood ratio tests for model selection and non-nested hypotheses, Econometrica, 57, 2 (1989), 307-333.

[18] H. Shimodaira, An application of multiple comparison techniques to model selection, Annals of the Institute of Statistical Mathematics, 50 (1998), 1-13.

[19] H. Shimodaira, Multiple comparisons of log-likelihoods and combining non-nested models with application to phylogenetic tree selection, Communications in Statistics, 30 (2001), 1751-1772.
Received: April, 2011