
Entropy and goodness of fit: a new relationship

Ramon Lacayo
Facultad de Ciencias Empresariales y Económicas
Universidad del Magdalena
Carrera 32 No.22-08
Santa Marta, Magdalena
Colombia
Cell phone:+(57)3007232957
e-address: lacayo.ramon@gmail.com
March 15, 2011
Abstract
In this paper we propose using entropy as a measure of goodness of fit. The suggested methodology consists in applying the probability integral transformation to the original data in accordance with an assumed distribution, and then computing an approximate value of the entropy of the transformed data. Except for an arbitrary constant, this value turns out to be the sample range of a uniformly distributed sample. Comparison of this value with the known theoretical values of the range of the uniform distribution serves as the basis for acceptance or rejection of the distributional assumption.
Keywords: Entropy, normality test, probability integral transform, sample range, uniform distribution.
1 Introduction
It is often the case that a researcher finds himself pondering whether a given data set has been drawn from a population with a particular distribution. Perhaps among the earliest and most enduring goodness of fit tests we can find is Pearson's chi-square test, which is based on comparing the frequency distribution of certain outcomes in the sample with those of a theoretical distribution. Another historically important test with current relevance is that of Kolmogorov-Smirnov, which compares the cumulative distribution function (cdf) of the reference distribution and the empirical cdf of the sample. Real and perceived shortcomings of these tests led to the development of other tests, whose own shortcomings, again real or perceived, led to further developments in the field. A good number of these tests, named by tradition after their creators and now routinely performed by statistical packages, were developed to test for normality. For obvious reasons, which include mainly the central limit theorem and, perhaps as a close and related second, the assumption of normality of errors in the classical normal linear regression model (Gujarati [2]), testing for normality is widespread. Without trying to be exhaustive, besides the two tests already mentioned, to this group belong, among others, the tests for univariate normality associated with the following names: Shapiro-Wilk, Shapiro-Francia, Anderson-Darling, Lilliefors, D'Agostino and Jarque-Bera. There are also graphical means to assess normality besides the use of the humble histogram, one of them being the Q-Q plot. Typically, a test compares the sample with the reference distribution in some concrete way, and each test performs as expected only under specific assumptions and conditions. For example, the Jarque-Bera test for normality is asymptotic and is based on the fact that both the asymmetry and the excess kurtosis of a normal distribution are zero.
Judging by the number of existing normality tests, the need for yet another one is not readily apparent. However, we find that entropy, being a unique measure of a distribution, is a natural basis for an accurate and easy to implement goodness of fit test. That is the goal of this paper. Entropy measures the degree of uncertainty of the outcomes of an experiment and in that capacity it can be naturally associated with a random variable (rv) which, in turn, serves to quantify those outcomes (Pugachov [4]). Here we are exclusively interested in continuous variates, whose entropy is sometimes called differential entropy to distinguish it from Shannon's information entropy and from physical entropy. For lack of weighty reasons, we stay away from qualifications and call it simply entropy.
Aside from this Section 1, Introduction, the paper contains two more sections. Section 2, Preliminaries, contains definitions and some background information. In Section 3, The Test, we explain the basic idea behind the proposed test and the results based on samples of sizes n = 100 and n = 30. These samples are mainly drawn on Excel© spreadsheets, although we also used Mathematica© in the study. A few concluding remarks are also given at the end.
2 Preliminaries
The probability integral transform theorem (see Mood et al. [3]) states that if the continuous random variable $X$ has cdf $F_X(x)$, then the rv $F_X(X)$ has a uniform distribution on $(0, 1)$. Symbolically,
$$F_X(X) \sim \mathrm{Uniform}[0, 1]. \qquad (1)$$
The pdf of a uniform variate $U$ on $[0, c]$ is
$$f_U(x) = \frac{1}{c}, \qquad 0 \le x \le c, \qquad (2)$$
and is zero outside that interval.
Given a sample of size $n$, $X_1, X_2, \ldots, X_n$, the sample range is $R = X_{\max} - X_{\min}$. For a uniform distribution on $[0, c]$, the cdf of the range is ([6], volume 1)
$$F_R(r) = n\left(\frac{r}{c}\right)^{n-1} - (n-1)\left(\frac{r}{c}\right)^{n}. \qquad (3)$$
Obviously, $F_R(r) = 0$ for $r < 0$ and $F_R(r) = 1$ for $r \ge c$.
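For completeness, here is a brief sketch of the standard derivation of (3), using the usual order-statistics expression for the cdf of the range and splitting the integral at $x = c - r$:
$$F_R(r) = n\int_0^{c}\bigl[F_U(x+r) - F_U(x)\bigr]^{\,n-1} f_U(x)\,dx = n\,\frac{c-r}{c}\left(\frac{r}{c}\right)^{n-1} + \left(\frac{r}{c}\right)^{n},$$
which rearranges to (3).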
Simple differentiation of (3) yields the range pdf in the form
$$f_R(r) = n(n-1)(c-r)\,r^{n-2}/c^{n}, \qquad (4)$$
from which the mean
$$\mu_R = \frac{c(n-1)}{n+1} \qquad (5)$$
and the variance
$$\sigma_R^2 = \frac{2c^2(n-1)}{(n+1)^2(n+2)} \qquad (6)$$
can be found without difficulty.
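For instance, the mean follows from a direct integration against (4):
$$\mu_R = \int_0^c r\,f_R(r)\,dr = \frac{n(n-1)}{c^{n}}\int_0^c \bigl(c\,r^{n-1} - r^{n}\bigr)\,dr = n(n-1)\,c\left(\frac{1}{n} - \frac{1}{n+1}\right) = \frac{c(n-1)}{n+1},$$
and the variance is obtained in the same way from $E[R^2]$.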
It is interesting to note that as $n \to \infty$, $\mu_R \to c$ and $\sigma_R^2 \to 0$. More precisely, the maximum of $f_R(r)$ occurs at $r = c(n-2)/(n-1)$; in the limit the location of this maximum tends to $c$ while the value of $f_R$ there tends to infinity. This means that the sequence defined by (4) behaves, in the limit, as a delta function concentrated at $c$: $f_R(r) \to \delta(r - c)$ as $n \to \infty$ (Vladimirov [5]).
For later reference we include the following table with information about the two samples employed. The last column refers to the fifth percentile of the distribution of $R$, more graphically known as the 5% critical value of the left tail. The column labeled $Q_2$ contains the median. Both rows are calculated for $c = 2$.

    n     mu_R     Q_2      sigma_R^2   sigma_R    R_0.05
    100   1.9604   1.9665   0.00076     0.027568   1.90688
    30    1.871    1.8894   0.007544    0.0868     1.7028        (7)
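The entries above can be checked with a few lines of code. The following is a minimal Python sketch (the bisection routine, names, and tolerance are our own illustrative choices, not part of the original Excel/Mathematica computations): it evaluates (5) and (6) directly and obtains the median and the 5% critical value by numerically inverting (3).

    def range_cdf(r, n, c):
        # cdf (3) of the range of a Uniform[0, c] sample of size n
        return n * (r / c) ** (n - 1) - (n - 1) * (r / c) ** n

    def range_quantile(q, n, c, tol=1e-10):
        # invert the (monotone) cdf by bisection on [0, c]
        lo, hi = 0.0, c
        while hi - lo > tol:
            mid = (lo + hi) / 2.0
            if range_cdf(mid, n, c) < q:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    c = 2.0
    for n in (100, 30):
        mean = c * (n - 1) / (n + 1)                             # equation (5)
        var = 2 * c ** 2 * (n - 1) / ((n + 1) ** 2 * (n + 2))    # equation (6)
        median = range_quantile(0.5, n, c)                       # Q2
        crit = range_quantile(0.05, n, c)                        # 5% left-tail critical value
        print(n, round(mean, 4), round(median, 4), round(var, 6),
              round(var ** 0.5, 6), round(crit, 5))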
For a continuous random variable $X$ with pdf $f(x)$, entropy is defined as
$$H[X] = -\int_{-\infty}^{\infty} f(x)\,\log f(x)\,dx. \qquad (8)$$
For the uniform case in (2), the entropy is simply
$$H[U] = \frac{\log c}{c}\int_0^c dx = \log c. \qquad (9)$$
This result will be used below.
Since our examples deal with the normal distribution, for reference we write the cdf of a standard normal rv $Z$,
$$\Phi(z) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{z} e^{-t^2/2}\,dt = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{z}{\sqrt{2}}\right)\right], \qquad (10)$$
where $\operatorname{erf}(\cdot)$ is the error function (Abramowitz [1]). The latter expression is particularly useful in view of the relationships of the error function with other special functions, e.g., the complementary error function, the confluent hypergeometric function of the first kind (Kummer's function) and others. For positive argument, one may obtain a very good approximation of the error function by the expression (see http://homepages.physik.uni-muenchen.de/~Winitzki/erf-approx.pdf)
$$\operatorname{erf}(x) \approx \left[1 - \exp\!\left(-x^2\,\frac{4/\pi + a x^2}{1 + a x^2}\right)\right]^{1/2}, \qquad a \approx 0.14.$$
This is important when the software employed does not incorporate the function $\Phi(\cdot)$.
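As an illustration, a minimal Python check of this approximation against the built-in error function (the test points are arbitrary):

    import math

    def erf_approx(x, a=0.14):
        # [1 - exp(-x^2 (4/pi + a*x^2) / (1 + a*x^2))]^(1/2), for x >= 0
        t = x * x
        return math.sqrt(1.0 - math.exp(-t * (4.0 / math.pi + a * t) / (1.0 + a * t)))

    for x in (0.1, 0.5, 1.0, 2.0):
        print(f"x = {x}: approximation = {erf_approx(x):.6f}, math.erf = {math.erf(x):.6f}")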
3 The Test
Suppose we have a sample $X_1, \ldots, X_n$, with no known problems of autocorrelation or heteroskedasticity, which we want to test for normality. The sample mean and standard deviation are $\bar{X}$ and $S$, respectively. Let us say that the assumption of normality is our null hypothesis.
To implement our test we proceed as follows. We first standardize the sample by the usual procedure: $z_i = (X_i - \bar{X})/S$, $i = 1, \ldots, n$. As a result, under the null hypothesis, we obtain a sample with mean 0 and unit standard deviation: $z_1, z_2, \ldots, z_n$.
Remark 1 This transformation obviously changes the distribution of the sample in a statistical sense, that is, in case we view the sample as a collection of random variables. But it does not affect in any meaningful way the property that we are probing in this specific sample, since we regard it as a single realization of $X_1, \ldots, X_n$. In particular, it does not alter at all two of the crucial measures of normality, namely, asymmetry and excess kurtosis. This means, among other things, that the Jarque-Bera statistic is the same for both the original and the transformed samples. Note that, in a certain sense, we are treating this set of values as a population.
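A small Python sketch of this invariance (the sample size, seed, and the plain 1/n moment convention are illustrative choices): sample skewness and excess kurtosis, the ingredients of the Jarque-Bera statistic, come out identical before and after standardization.

    import random

    def skew_and_excess_kurtosis(xs):
        n = len(xs)
        m = sum(xs) / n
        m2 = sum((x - m) ** 2 for x in xs) / n
        m3 = sum((x - m) ** 3 for x in xs) / n
        m4 = sum((x - m) ** 4 for x in xs) / n
        return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

    random.seed(2)
    xs = [random.gauss(5.0, 2.0) for _ in range(200)]
    mean = sum(xs) / len(xs)
    s = (sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)) ** 0.5
    zs = [(x - mean) / s for x in xs]
    print(skew_and_excess_kurtosis(xs))
    print(skew_and_excess_kurtosis(zs))  # identical up to floating-point rounding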
Using (10) we next apply the transformation (1) to the standardized sample, but in order to avoid a zero value of the entropy, as follows from (9), we multiply the normal cdf by a positive constant $c$. As a result we obtain a uniform random variate $U$ on $(0, c)$, that is,
$$U = c\,\Phi(z) \sim \mathrm{Uniform}[0, c], \qquad (11)$$
with pdf given by (2). In essence, we obtain a partition $x_1, x_2, \ldots, x_n$ of the interval $[0, c]$.
It is clear that, under the null hypothesis and in a manner similar to the calculation of (9), the Riemann integral in (9) can be approximated by $\frac{\log c}{c}\sum_i \Delta x_i$, where the $\Delta x_i$ are the spacings of the ordered transformed values. But the constant in this expression is arbitrary, and the sum $(x_2 - x_1) + (x_3 - x_2) + \cdots + (x_n - x_{n-1})$ telescopes to $x_n - x_1 = x_{\max} - x_{\min} = R$, the range.
As a consequence, we have reduced the study of normality to the study of the range $R$ of the uniform distribution on $[0, c]$, whose properties were already derived.
In particular, the pdf of $R$ is highly skewed to the left even for moderate values of the sample size, and becomes more so as $n$ increases. This is clearly illustrated, besides the discussion on the delta function, by the values for the mean and median shown in (7). In view of this, devising a test for the null hypothesis based on a two-sided confidence interval is impractical. For this reason we use a one-sided interval, of confidence $1 - \alpha$, as a rule for deciding whether the null hypothesis must be rejected:
Reject the null hypothesis of normality if $R < R_\alpha$,
where $\alpha$ is the type I error chosen for the test. In (7) we give critical values $R_\alpha$ for $n = 100$ and $n = 30$ corresponding to $\alpha = 0.05$. Similar values can easily be found from (3) for any $\alpha$ and sample size by numerically solving the equation $F_R(r) = \alpha$.
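To make the procedure concrete, here is a minimal end-to-end Python sketch of the test as described above (the function names, the choice $c = 2$, the $n - 1$ denominator in the standard deviation, and the bisection tolerance are our own illustrative choices):

    import math
    import random

    def range_cdf(r, n, c):
        # cdf (3) of the range of a Uniform[0, c] sample of size n
        return n * (r / c) ** (n - 1) - (n - 1) * (r / c) ** n

    def critical_value(alpha, n, c, tol=1e-10):
        # left-tail critical value: solve F_R(r) = alpha by bisection
        lo, hi = 0.0, c
        while hi - lo > tol:
            mid = (lo + hi) / 2.0
            if range_cdf(mid, n, c) < alpha:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    def entropy_range_test(sample, alpha=0.05, c=2.0):
        n = len(sample)
        mean = sum(sample) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
        # standardize, then apply the scaled probability integral transform (11)
        u = [c * 0.5 * (1.0 + math.erf(((x - mean) / sd) / math.sqrt(2.0))) for x in sample]
        r = max(u) - min(u)               # sample range of the transformed data
        r_alpha = critical_value(alpha, n, c)
        return r, r_alpha, r < r_alpha    # True in the last slot means: reject normality

    # illustrative use on a simulated normal sample
    random.seed(0)
    data = [random.gauss(10.0, 3.0) for _ in range(100)]
    r, r_alpha, reject = entropy_range_test(data)
    print(f"R = {r:.4f}, critical value = {r_alpha:.4f}, reject normality: {reject}")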
In simulations done with Excel on normal samples of the mentioned sizes, the test above performed as expected, with an average type I error of 5%. We also did simulations on samples drawn from chi-square, log-normal, and a few other distributions. We find the test a very satisfactory one, and were pleased with the fact that it managed to reject obvious non-normal cases. We analyzed no borderline cases this time. Note that, unlike the so-called back-of-the-envelope normality tests involving the range, like the $6\sigma$ test, our test is a full one, with a precise decision rule. It is clear that the test is easy to understand and simple to implement on a spreadsheet, and it may be adapted about as easily for goodness of fit tests for distributions other than the normal. This, however, will be the subject of another work.
References
[1] Abramowitz, M.; Stegun, I. Handbook of Mathematical Functions. Dover, 1965.
[2] Gujarati, D. Basic Econometrics. McGraw-Hill, 2004.
[3] Mood, A.; Graybill, F.; Boes, D. Introduction to the Theory of Statistics. McGraw-Hill, 1974.
[4] Pugachov, V.S. Theory of Probability and Mathematical Statistics (in Russian). FizMatLit, 2002.
[5] Vladimirov, V.S. Generalized Functions in Mathematical Physics. Nauka, 1979.
[6] Weisstein, E. The CRC Concise Encyclopedia of Mathematics. CRC Press, 1999.