Winfried Stute
University of Giessen
Contents

1 Introduction 1
   1.1 Counting 1
   1.2 Diracs and Indicators 7
   1.3 Empirical Distributions for Real-Valued Data 13
   1.4 Some Targets 17
   1.5 The Lorenz Curve 31
   1.6 Gini's Index 34
   1.7 The ROC-Curve 39
   1.8 The Mean Residual Lifetime Function 42
   1.9 The Total Time on Test Transform 44
   1.10 The Product Integration Formula 45
   1.11 Sampling Designs and Weighting 51

2 The … [chapter title lost]
   2.1 …
   2.2 …
   2.3 …

3 Univariate Empiricals: The IID Case 77
   3.1 Basic Facts 77
   3.2 Finite-Dimensional Distributions 92
   3.3 Order Statistics 97
   3.4 Some Selected Boundary Crossing Probabilities 106

4 U-Statistics 113
   4.1 Introduction 113
   4.2 The Hajek Projection Lemma 115
   4.3 Projection of U-Statistics 117
   4.4 …
   4.5 …

5 Statistical Functionals 133
   5.1 Empirical Equations 133
   5.2 Anova Decomposition of Statistical Functionals 136
   5.3 The Jackknife Estimate of Variance 138
   5.4 The Jackknife Estimate of Bias 140

6 Stochastic Inequalities 145
   6.1 The D-K-W Bound 145
   6.2 Binomial Tail Bounds 148
   6.3 Oscillations of Empirical Processes 152
   6.4 Exponential Bounds for Sums of Independent Random Variables 157

7 Invariance Principles 161
   7.1 Continuity of Stochastic Processes 161
   7.2 Gaussian Processes 163
   7.3 Brownian Motion 165
   7.4 Donsker's Invariance Principles 167
   7.5 More Invariance Principles 172
   7.6 Parameter Empirical Processes (Regression) 176

[Titles of several subsequent chapters lost in extraction; recoverable entries:]

… Multivariate Case 225
   Introduction 225
   Identification of Defective Survival Functions 228
   The Multivariate Kaplan-Meier Estimator 231
   Efficiency of the Estimator 235
   Simulation Results 242

13 Conditional U-Statistics 301
   13.1 Introduction and Main Result 301
   13.2 Discrimination 303
   13.3 Simulation Study 307
Chapter 1

Introduction

1.1 Counting
Empirical distributions have been designed to provide fundamental mathematical and graphical tools in the context of data analysis.
To begin with, suppose that n customers are asked to name their favorite car
from a list a, b, c, d, e. Denote with Xj the response of the j-th customer.
A quantity of general interest is the number of customers in favor of a,
say. This number is the outcome of a procedure which consists of n steps
and which adds a one to the previous value whenever one obtains another
positive vote for a: counting.
To formalize things a little bit, let S = {a, b, c, d, e} be the sample space of all possible outcomes. Set A = {a}, a set consisting of only one element. Let 1_A be the indicator function associated with A, i.e., the function defined on S attaining the value 1 on A and 0 otherwise:

1_A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A. \end{cases}   (1.1.1)
In our example, 1_A(X_j) equals 1 if the vote of the j-th customer is in favor of a and 0 otherwise. The absolute number of customers voting for a of course equals

\sum_{j=1}^n 1_A(X_j).   (1.1.2)
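The counting scheme in (1.1.2) can be sketched in a few lines of Python; the vote data below are made-up illustration values, not taken from the text.

```python
# Counting via indicator functions: the number of votes for "a" is the
# sum (1.1.2) of 1_A(X_j) over all customers, with A = {"a"}.
votes = ["a", "c", "a", "b", "a", "e", "d", "a", "b", "a"]

def indicator(A):
    """Return the indicator function 1_A on the sample space."""
    return lambda x: 1 if x in A else 0

one_A = indicator({"a"})
count_a = sum(one_A(x) for x in votes)   # sum_{j=1}^n 1_A(X_j) -> 5 here
```

The same function works for any other set B, C, ... of outcomes, exactly as the text notes.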
Needless to say, (1.1.2) also applies to other sets B, C, ... and other X's. As another example, consider the number of days per year at a specific location …

The relative number of data falling into A, the empirical measure of A, equals

\mu_n(A) = \frac{1}{n}\sum_{j=1}^n 1_A(X_j).   (1.1.3)

[Figure: a sample of eight points, two of which fall into the set A, so that \mu_n(A) = 2/8.]

For pairwise disjoint sets C_1, \ldots, C_m with C_i \cap C_j = \emptyset for i \ne j and \bigcup_{i=1}^m C_i = S, additivity yields

\sum_{i=1}^m \mu_n(C_i) = \mu_n\Big(\bigcup_{i=1}^m C_i\Big) = \mu_n(S) = 1,

and similarly \sum_{i=1}^m p_i = 1. For real-valued data and A = (-\infty, t] we obtain the empirical distribution function

F_n(t) = \mu_n\big((-\infty, t]\big) = \frac{1}{n}\sum_{j=1}^n 1_{(-\infty,t]}(X_j),   (1.1.4)
the relative number of data being less than or equal to t. For two points s and t we either have s \le t or t < s. In the first case (-\infty, s] is contained in (-\infty, t], so that by monotonicity of \mu_n

F_n(s) \le F_n(t) \qquad \text{for } s \le t.

In terms of the data,

F_n(t) = \frac{1}{n}\sum_{j=1}^n 1_{\{X_j \le t\}} \quad\text{and}\quad 1 - F_n(t) = \frac{1}{n}\sum_{j=1}^n 1_{\{X_j > t\}}, \qquad t \in \mathbb{R}.
Here X_{jl} denotes the l-th component of X_j. For ease of notation we write X_j \le t iff the inequality holds coordinatewise. Then F_n becomes

F_n(t) = \frac{1}{n}\sum_{j=1}^n 1_{\{X_j \le t\}}.
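A minimal sketch of F_n covering both the univariate and the coordinatewise multivariate case; the data values below are made up for illustration.

```python
# Empirical d.f.: F_n(t) = (1/n) * #{j : X_j <= t}. In the multivariate
# case "X_j <= t" is read coordinatewise (tuples of equal length).
def ecdf(data):
    n = len(data)

    def le(x, t):
        if isinstance(x, tuple):
            return all(xi <= ti for xi, ti in zip(x, t))
        return x <= t

    def Fn(t):
        return sum(1 for x in data if le(x, t)) / n

    return Fn

Fn = ecdf([1.0, 2.0, 2.0, 5.0])          # univariate sample
F2 = ecdf([(1, 1), (2, 3), (3, 2), (4, 4)])  # bivariate sample
```

For instance, Fn(2.0) counts three of the four points, giving 0.75.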
[Figure: scatter plot of bivariate data in the (x_1, x_2)-plane; the marked set is the half-plane 4x_1 + 2x_2 \le 6.]
Also the sample space S may be quite general. Later in this monograph, most of our analysis will first focus on S = \mathbb{R}, i.e., on real-valued data, with the class of A's being the extended intervals (-\infty, t]. After that we shall study multivariate problems. Our final extension will be concerned with the important case when the available information comes through functions over time or space. This will lead us to the study of so-called functional data.
1.2 Diracs and Indicators

The indicator function is intimately related with a so-called point mass (or Dirac measure). For each x \in S, define

\delta_x(A) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A. \end{cases}   (1.2.1)

For example, if x \notin A_1 but x \in A_2, then

\delta_x(A_1) = 0 \quad\text{and}\quad \delta_x(A_2) = 1.   (1.2.2)
Note that for \delta_x the point x is kept fixed while A varies over a class of sets, whereas for the indicator function the set A is fixed and x varies in the sample space. It is easy to see that each \delta_x is a (probability) measure. In fact,

0 \le \delta_x(A) \le 1 \quad\text{with}\quad \delta_x(S) = 1

and

\delta_x\Big(\bigcup_i A_i\Big) = \sum_i \delta_x(A_i)

for any mutually disjoint subsets A_i of S. One may check this equality separately for the two cases that x is contained in none of the A_i's and, alternatively, that x is contained in precisely one of them. In the first case both sides equal zero, while in the latter we have one on both sides.
Within this framework, (1.1.3) becomes

\mu_n(A) = \frac{1}{n}\sum_{j=1}^n \delta_{X_j}(A)   (1.2.3)

or, for short,

\mu_n = \frac{1}{n}\sum_{j=1}^n \delta_{X_j}.   (1.2.4)
Moreover,

\delta_x(A) = \int 1_A \, d\delta_x = 1_A(x).   (1.2.5)

The first equation just expresses the fact that the integral of an indicator function always equals the measure of the underlying set. If we put \varphi = 1_A, the second equation may be rewritten to become

\int \varphi \, d\delta_x = \varphi(x).   (1.2.6)
The last equality is simple but fundamental and may be easily extended to other \varphi's, not necessarily of indicator type. Since \mu_n is a linear combination of the point masses \delta_{X_j}, 1 \le j \le n, we thus obtain

\int \varphi \, d\mu_n = \frac{1}{n}\sum_{j=1}^n \varphi(X_j).   (1.2.7)

For example,

\int x \, \mu_n(dx) = \frac{1}{n}\sum_{j=1}^n X_j \equiv \bar X_n,

the sample mean.
Similarly,

\int x^p \, \mu_n(dx) = \frac{1}{n}\sum_{j=1}^n X_j^p,

the p-th sample moment, while for \varphi_\theta(x) = \exp[i\theta x] we obtain the empirical characteristic function

\varphi_n(\theta) = \int \varphi_\theta \, d\mu_n = \frac{1}{n}\sum_{j=1}^n \exp[i\theta X_j],

and for \varphi_\theta^0(x) = \exp[\theta x] the empirical moment generating function

\varphi_n^0(\theta) = \int \varphi_\theta^0 \, d\mu_n = \frac{1}{n}\sum_{j=1}^n \exp[\theta X_j].

Finally,

\int 1_{(-\infty,t]} \, d\mu_n = F_n(t).
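All of these are instances of the single rule (1.2.7): an empirical integral is the sample average of \varphi(X_j). A small sketch, with made-up data values:

```python
import cmath

# Empirical integrals (1.2.7): for any phi, the integral w.r.t. mu_n is
# the average of phi over the data. Moments and the empirical
# characteristic function are special cases.
data = [0.5, 1.5, 2.0, 4.0]
n = len(data)

def emp_integral(phi):
    return sum(phi(x) for x in data) / n

mean = emp_integral(lambda x: x)               # sample mean
second_moment = emp_integral(lambda x: x ** 2) # second sample moment

def char_fn(theta):
    # empirical characteristic function (1/n) sum_j exp(i*theta*X_j)
    return emp_integral(lambda x: cmath.exp(1j * theta * x))
```

At theta = 0 the empirical characteristic function equals 1, reflecting that mu_n has total mass one.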
To present an example for bivariate data X_j = (X_{j1}, X_{j2})^T, 1 \le j \le n, the choice of

\varphi(x_1, x_2) = x_1 x_2

leads to

\int \varphi \, d\mu_n = \frac{1}{n}\sum_{j=1}^n X_{j1} X_{j2},

the sample mixed moment. Since \mu_n and F_n determine each other, we shall use the notations

\int \varphi \, d\mu_n \equiv \int \varphi \, dF_n

interchangeably.
For a deeper analysis, the class of quantities which somehow can be computed from the available data through \mu_n or F_n, respectively, needs to be enlarged. Going back to (1.2.3), it is tempting to look also at products of \mu_n. For example, the product measure \mu_n \otimes \mu_n is uniquely determined through its values on rectangles A_1 \times A_2 and is given by

\mu_n \otimes \mu_n(A_1 \times A_2) = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \delta_{X_i}(A_1)\,\delta_{X_j}(A_2) = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n 1_{\{X_i \in A_1,\, X_j \in A_2\}}.

In other words,

\mu_n \otimes \mu_n = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \delta_{(X_i, X_j)}.

Since (1.2.6) holds true for a general Dirac, we get, for an arbitrary function \varphi = \varphi(x, y) of two variables:

\int \varphi \, d(\mu_n \otimes \mu_n) = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \varphi(X_i, X_j).

A closely related quantity, which omits the diagonal terms i = j, is the U-statistic

U_n = \frac{1}{n(n-1)}\sum_{\substack{i,j=1 \\ i \ne j}}^n \varphi(X_i, X_j).
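The two double sums differ only in whether the diagonal pairs i = j are included. A sketch with a made-up sample and the kernel \varphi(x, y) = (x - y)^2/2:

```python
# V-statistic: average of phi over all n^2 pairs (i = j included).
# U-statistic: average over the n*(n-1) off-diagonal pairs only.
data = [1.0, 3.0, 5.0]
n = len(data)
phi = lambda x, y: (x - y) ** 2 / 2.0

Vn = sum(phi(x, y) for x in data for y in data) / n ** 2
Un = sum(phi(data[i], data[j])
         for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
```

For this kernel the diagonal terms vanish, so Vn = ((n-1)/n) * Un; here Un equals 4.0, the sample variance of the data, anticipating the identity below.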
Actually,

\frac{1}{2n(n-1)}\sum_{i \ne j} (X_i - X_j)^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \frac{1}{n(n-1)}\sum_{i \ne j} X_i X_j
= \frac{1}{n-1}\sum_{i=1}^n X_i^2 - \frac{n}{n-1}\,\bar X_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X_n)^2,

so that the U-statistic with kernel \varphi(x, y) = (x - y)^2/2 equals the sample variance.
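The chain of identities above can be checked numerically; the data values are arbitrary made-up numbers.

```python
# Check: (1/(2n(n-1))) * sum_{i != j} (X_i - X_j)^2
#      = (1/(n-1)) * sum_i (X_i - Xbar)^2  (the sample variance).
xs = [2.0, 4.0, 4.0, 7.0, 9.0]
n = len(xs)
xbar = sum(xs) / n

# the i = j terms are zero, so summing over all pairs is harmless
lhs = sum((xi - xj) ** 2 for xi in xs for xj in xs) / (2 * n * (n - 1))
rhs = sum((x - xbar) ** 2 for x in xs) / (n - 1)
```

Both sides equal 7.7 for this sample, agreeing to floating-point precision.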
Our discussion may have already indicated that point measures or Diracs are just the cornerstones of a hierarchy of more or less sophisticated statistical tools. After a moment of thought this is not surprising, because in a statistical framework the relevant information is often obtained only through a set of finitely many (discrete) data, so that it is only natural to put the weights there.
So far we have restricted ourselves to equal weighting. As it will turn out, there are many data situations in which we are led to masses which are not equal. For further discussion, we rewrite (1.2.4) as

\mu_n = \sum_{j=1}^n \frac{1}{n}\,\delta_{X_j}.   (1.2.8)

More generally, we may attach a weight W_{jn} to the j-th observation:

\mu_n = \sum_{j=1}^n W_{jn}\,\delta_{X_j},   (1.2.9)

so that

\int \varphi \, d\mu_n = \sum_{j=1}^n W_{jn}\,\varphi(X_j).   (1.2.10)
The weights may also depend on an argument x:

\mu_n(x) = \sum_{j=1}^n W_{jn}(x)\,\delta_{X_j}.   (1.2.11)

The x's need not necessarily be taken from the same S but may vary in another set. More clearly, we have

\mu_n(x, A) = \sum_{j=1}^n W_{jn}(x)\,1_{\{X_j \in A\}}   (1.2.12)

and

\int \varphi(y)\,\mu_n(x, dy) = \sum_{j=1}^n W_{jn}(x)\,\varphi(X_j).   (1.2.13)

Adopting terminology from probability theory, we call \mu_n(x, A) an empirical kernel. The integrals (1.2.13) computed w.r.t. these \mu_n are functions in x. As one may expect, these functions are candidates for estimators of functions rather than parameters like the mean or variance of a population. The kernel \mu_n(x, A) is called an empirical probability kernel iff for all x

W_{jn}(x) \ge 0 \quad\text{and}\quad \sum_{j=1}^n W_{jn}(x) = 1.

In some situations the sum of weights may only be less than or equal to one:

0 \le \sum_{j=1}^n W_{jn}(x) \le 1.
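One concrete (hypothetical) choice of such weights, sketched below, normalizes Gaussian kernel values so that W_jn(x) >= 0 and the weights sum to one for every x — i.e. an empirical probability kernel. The data and bandwidth are made-up illustration values.

```python
import math

# Empirical probability kernel (1.2.12) with normalized Gaussian weights
# W_jn(x) proportional to exp(-(x - X_j)^2 / (2 h^2)).
data = [1.0, 2.0, 4.0]
h = 1.0   # bandwidth (an assumption of this sketch)

def weights(x):
    raw = [math.exp(-(x - xj) ** 2 / (2 * h * h)) for xj in data]
    s = sum(raw)
    return [r / s for r in raw]

def kernel_integral(x, phi):
    # sum_j W_jn(x) * phi(X_j), a function of x as in (1.2.13)
    return sum(w * phi(xj) for w, xj in zip(weights(x), data))

w = weights(2.0)
```

By construction the weights at any x are nonnegative and sum to one, so the integral of the constant function 1 is exactly 1.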
1.3 Empirical Distributions for Real-Valued Data

The jump F_n\{x\} = F_n(x) - F_n(x-) vanishes for all x outside the data set. When x = X_j for some j, then

F_n\{X_j\} = \frac{d_j}{n},

where d_j is the number of data with the same value as X_j. When no ties are present, then d_j = 1. In this case, F_n has exactly n jumps with jump size 1/n. Hence for large n, F_n is a function with many discontinuities but small jump heights.

[Figure: a typical empirical distribution function; the vertical axis runs from 0.0 to 1.0.]
In a particular data situation, X_j may denote the result of the j-th experiment or the response of the j-th person in a medical or socio-demographic survey. As another possibility, the index j may refer to time. For example, in a financial context, X_j may denote the price of a stock on day j at the closing of the session. In each case, and contrary to X_{i:n}, the index j needs to be chosen before the experiment is run.
Example 1.3.2. In this example we record the times (in seconds) in the 100m final (men) at the Olympic Games 2004 in Athens. X_j corresponds to the time of athlete number j in the alphabetic order as outlined below:

Collins     X_1 = 10.00
Crawford    X_2 =  9.89
Gatlin      X_3 =  9.85
Greene      X_4 =  9.87
Obikwelu    X_5 =  9.86
Powell      X_6 =  9.94
Thompson    X_7 = 10.10
Zakari      X_8 =  DNF

From this we see that X_{1:n} = X_3 = 9.85 with R_3 = 1, i.e., the winner of the gold medal was Gatlin.
By the definition of ranks, we have

X_j = X_{R_j:n}.   (1.3.1)

Conversely, when there are no ties,

R_j = n\,F_n(X_j).   (1.3.2)

This means, to compute the ranks we need to evaluate F_n at very special t's, namely the (original) data. Secondly, when we compute F_n at the ordered data, we obtain a fixed value:

F_n(X_{i:n}) = \frac{i}{n}.   (1.3.3)
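The rank formula (1.3.2) can be tried on the seven finishing times from Example 1.3.2 (Zakari's DNF is omitted, so n = 7 in this sketch):

```python
# Ranks via the empirical d.f.: R_j = n * F_n(X_j) when there are no ties.
times = [10.00, 9.89, 9.85, 9.87, 9.86, 9.94, 10.10]
n = len(times)

Fn = lambda t: sum(1 for x in times if x <= t) / n
ranks = [round(n * Fn(x)) for x in times]   # R_1, ..., R_n

winner_index = ranks.index(1)               # position of the smallest time
```

The winner (rank 1) is the third athlete in alphabetic order, Gatlin, matching the text.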
Conversely, the order statistics may be recovered through the empirical quantile function:

X_{i:n} = F_n^{-1}\Big(\frac{i}{n}\Big).   (1.3.4)

More generally, F_n^{-1}(u) = X_{i:n} for \frac{i-1}{n} < u \le \frac{i}{n} and 1 \le i \le n, whence

F_n\big(F_n^{-1}(u)\big) \ge u \qquad \text{for } 0 < u \le 1.   (1.3.5)

The interquartile range

F_n^{-1}(3/4) - F_n^{-1}(1/4)

is a convenient means to measure the spread of the central part of the data.
1.4 Some Targets

So far we have introduced and discussed several elementary quantities related to F_n. In this section we formalize their mathematical background. This enables us to express and view our statistics as estimators of various targets. In particular, we provide the mathematical framework for the X_j's. Again we first restrict ourselves to real-valued X's. In mathematical terms, the X_j are random variables defined on a probability space (\Omega, \mathcal{A}, P). For the time being we assume that all X's have the same distribution function (d.f.)

P(\omega : X_j(\omega) \le t) \equiv F(t) \qquad (X_j \sim F).
We are thus dealing with two distributions: one (G = F) describing the whole but unknown population, and the other one (G = F_n), revealing some though limited information about it through the sample X. With this in mind we may consider all quantities discussed in the previous sections also for G = F. A functional T, which maps G into a real number or a vector T(G), is called a statistical functional. A simple example of a linear functional is defined, for a given \varphi, as

T(G) = \int \varphi(x)\,G(dx),

provided the integral exists. The functional T may also be evaluated at measures which are not necessarily probability measures. T is called linear because for any two G_1, G_2 for which T is well-defined and any real numbers a_1 and a_2 we have

T(a_1 G_1 + a_2 G_2) = a_1 T(G_1) + a_2 T(G_2).

While

T(F) = \int \varphi(x)\,F(dx) = E[\varphi(X)], \qquad X \sim F,

we have

T(F_n) = E[\varphi(X^*)], \qquad X^* \sim F_n.
This (very) simple example already reveals that we have more or less two options to represent quantities of interest:

(i) a representation in terms of random variables,
(ii) a representation in terms of distributions.

Neither representation is superior to the other, but each has its own value. A representation of expectation and variance, e.g., through variables still exhibits that originally these quantities were attached to variables describing some particular random phenomena. On the other hand, the representation through F gives us a clue how to estimate the target, namely just by replacing F with F_n (plug-in method).
The V-statistic

V_n = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \varphi(X_i, X_j) = \int\!\!\int \varphi(x, y)\,F_n(dx)F_n(dy)

is the plug-in estimator of

T(G) = \int\!\!\int \varphi(x, y)\,G(dx)G(dy)

at G = F_n. Similarly, the k-th moment

E X^k = \int x^k\,F(dx)

has the empirical estimator

\int x^k\,F_n(dx) = \frac{1}{n}\sum_{j=1}^n X_j^k.
Note that if we consider the collection \{\varphi_t\} of all \varphi_t's, we obtain a stochastic process or random function indexed by t \in \mathbb{R}. Further examples of empirical integrals are

\int \exp[\theta x]\,F_n(dx) = \frac{1}{n}\sum_{j=1}^n \exp[\theta X_j]

and

\int \exp[i\theta x]\,F_n(dx) = \frac{1}{n}\sum_{j=1}^n \exp[i\theta X_j].

An important example with a more complicated structure is the cumulative hazard function

\Lambda(t) = \int \frac{1_{\{x \le t\}}}{1 - F(x-)}\,F(dx) = \int_{(-\infty,t]} \frac{F(dx)}{1 - F(x-)}.
In this case, the integrand consists of two components, the numerator being a function only of x, while the denominator depends on x through F. The cumulative hazard function plays an outstanding role in survival analysis and reliability, when the data are lifetimes (failure times) of individuals or technical units and are therefore nonnegative. Hence in this case it is sufficient to extend the integral over the positive real line. For a continuous F, the left-hand limit F(x-) coincides with F(x). At this point it is not clear why one should not use F(x) instead of F(x-). For a preliminary answer we compute the empirical cumulative hazard function estimator, the so-called Nelson-Aalen estimator:

\Lambda_n(t) = \int_0^t \frac{F_n(dx)}{1 - F_n(x-)} = \sum_{i=1}^n \frac{1_{\{X_i \le t\}}}{\sum_{j=1}^n 1_{\{X_j \ge X_i\}}}.
When there are no ties we may compute \Lambda_n also by summation along the order statistics. This yields

\Lambda_n(t) = \sum_{i=1}^n \frac{1_{\{X_{i:n} \le t\}}}{n - i + 1}.

We thus see that, like F_n, also \Lambda_n jumps at the data, the weights 1/(n-i+1) assigned to the i-th order statistic, however, being strictly increasing as we move to the right. While the total mass under (the standard) F_n equals one, the Nelson-Aalen estimator has total mass

\sum_{i=1}^n \frac{1}{n - i + 1} = \sum_{i=1}^n \frac{1}{i} \sim \ln n \quad\text{as } n \to \infty.   (1.4.2)
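The order-statistics form of the Nelson-Aalen estimator and the total-mass identity (1.4.2) can be sketched as follows; the lifetimes are made-up untied values.

```python
# Nelson-Aalen for untied data: Lambda_n(t) = sum over X_{i:n} <= t of
# 1/(n - i + 1); its total mass is the harmonic number 1 + 1/2 + ... + 1/n.
lifetimes = sorted([0.7, 1.2, 2.5, 3.1, 4.8])
n = len(lifetimes)

def nelson_aalen(t):
    # enumerate gives i = 0..n-1, so the risk-set size is n - i
    return sum(1.0 / (n - i) for i, x in enumerate(lifetimes) if x <= t)

total_mass = nelson_aalen(max(lifetimes))
harmonic = sum(1.0 / i for i in range(1, n + 1))
```

With n = 5 the total mass is 1/5 + 1/4 + 1/3 + 1/2 + 1, exactly the harmonic sum in (1.4.2).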
Example 1.4.6. Though the empirical distribution function is a nonparametric estimator, in the sense that for its computation as well as for its properties no parametric model assumption is required, our view turns out to be useful also in parametric statistics. For this, let M = \{f(\cdot, \theta) : \theta \in \Theta\} be a family of densities parametrized by a set \Theta of finite-dimensional vectors. In the case of independent observations from M, the log-likelihood function equals

\sum_{i=1}^n \ln f(X_i, \theta) = n \int \ln f(x, \theta)\,F_n(dx).   (1.4.3)

We thus see that log-likelihood functions are empirical integrals parametrized by \theta. Under some smoothness conditions the maximizer of (1.4.3), i.e., the Maximum Likelihood Estimator (MLE), satisfies the empirical equation associated with the score function \frac{\partial}{\partial\theta} \ln f(x, \theta).

Example 1.4.7. Suppose that F is symmetric around zero and that \psi is antisymmetric, i.e., \psi(-x) = -\psi(x). Then

\int \psi(x)\,F(dx) = 0.
Important examples for \psi are

\psi(x) = \text{sign}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0 \end{cases}

and its shifts \text{sign}(x - t), t \in \mathbb{R}.

Example 1.4.8. For a given score function J, consider

\int J(F(x))\,G(dx).

If F = G and F is continuous, the last integral becomes \int_0^1 J(u)\,du, a known value.
Its empirical counterpart is

\int J(F_n(x))\,G_m(dx),   (1.4.4)

an empirical integral with \varphi(x) = J(F_n(x)). Here, F_n and G_m are the empirical d.f.'s of the two samples X_1, \ldots, X_n and Y_1, \ldots, Y_m from (the unknown) F and G, respectively. Applying our general formula (1.2.7), we get

\int J(F_n(x))\,G_m(dx) = \frac{1}{m}\sum_{i=1}^m J(F_n(Y_i)).   (1.4.5)

Note that since Y_i is from the second sample and F_n is from the first, n F_n(Y_i) is not the rank of Y_i. Rather it provides us with the location (or position) of Y_i w.r.t. the first sample. E.g., when we put J(u) = u, then (1.4.4) equals

\frac{1}{nm}\sum_{i=1}^m\sum_{j=1}^n 1_{\{X_j \le Y_i\}}.

Since, for F = G with F continuous,

\int F(x)\,G(dx) = \int F(x)\,F(dx) = \int_0^1 u\,du = \frac{1}{2},

a known value, we may expect that (1.4.5) for J(u) = u is at least close to 1/2 when F = G. At the same time this integral, namely (1.4.5), should hopefully be far away from 1/2 if F \ne G. In general, the interesting question will then be how to choose J in order to get large power against certain alternatives to F = G. We see from Examples 1.4.7 and 1.4.8 that empirical integrals may also be utilized for tests, in that a null hypothesis such as symmetry or equality in distribution is rejected if the empirical term deviates too much from a pre-specified value.
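A quick numerical sketch of the two-sample statistic with J(u) = u; the two samples are made-up values chosen to interlace, mimicking F = G.

```python
# (1/(nm)) * sum_i sum_j 1{X_j <= Y_i}: under F = G the target value is 1/2.
xs = [1.0, 4.0, 6.0, 9.0]   # sample from F
ys = [2.0, 5.0, 7.0]        # sample from G
n, m = len(xs), len(ys)

stat = sum(1 for y in ys for x in xs if x <= y) / (n * m)
```

For these interlacing samples the statistic equals exactly 1/2; samples that are shifted apart would push it toward 0 or 1.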
Many of the concepts discussed so far have obvious extensions to the k-variate case. As we know, a convenient class of sets which determines a distribution is the family of quadrants

Q = (-\infty, t], \qquad t \in \mathbb{R}^k.
New questions may come up now which have no counterpart in the univariate case.

Example 1.4.9. We may be interested in the dependence structure of the X-coordinates. A simple measure of association between the coordinates of a bivariate vector X = (X_1, X_2)^T, say, is the correlation

\text{Corr}\,X = \frac{\text{Cov}(X_1, X_2)}{\sqrt{\text{Var}\,X_1\,\text{Var}\,X_2}}.

We have already seen in Section 1.2 that variances are simple (quadratic) functionals of a distribution function. Denoting with F the joint distribution of X_1 and X_2, also the covariance allows for a simple representation as a (statistical) functional of F:

\text{Cov}(X_1, X_2) = \frac{1}{2}\int\!\!\int (x_1 - y_1)(x_2 - y_2)\,F(dx_1, dx_2)F(dy_1, dy_2).

Example 1.4.10. Much work, e.g., has been devoted to testing for independence of X_1 and X_2. Write F_1 and F_2 for the marginal distributions of X_1 and X_2, respectively. Independence is tantamount to

F(t_1, t_2) - F_1(t_1)F_2(t_2) = 0 \qquad \text{for all } t_1, t_2.   (1.4.6)
If F has a density f, then

f(x) = \lim_{h \downarrow 0} \frac{F(x+h) - F(x)}{h},   (1.4.7)

and for a continuous F the cumulative hazard function becomes

\Lambda(t) = \int_{-\infty}^t \frac{f(x)}{1 - F(x)}\,dx.   (1.4.8)

Hence

1 - F(t) = \exp[-\Lambda(t)].

The function

\lambda(x) = \frac{f(x)}{1 - F(x)}

is called the hazard function. It describes the local risk of failure given survival up to x:

\lambda(x) = \lim_{h \downarrow 0} \frac{P(x < X \le x + h \mid x < X)}{h} = \lim_{h \downarrow 0} \frac{F(x+h) - F(x)}{h\,(1 - F(x))},
where

P(A \mid B) = \frac{P(A \cap B)}{P(B)}

denotes a conditional probability. In the discrete case, the hazard function becomes

\lambda(i) = \frac{P(X = i + 1)}{P(i < X)}.

Hence \lambda(i) equals the probability that, given a person has survived i time units, death will occur in the next period. Table 1.4.1 presents the hazard functions of the German male and female population computed for some selected decades in the last century. The pertaining plot 1.4.2 reveals that the hazard function of a human population is likely to have the shape of a bathtub. If we compare the local risks for German women in the 1910s and 1980s, it becomes apparent that significant improvements as to the rate of mortality have been obtained for young children (in particular babies) and elderly women exceeding 60 years of age.
[Figure 1.4.2: Plot of two selected hazard functions (1901/10 and 1980/82; 1000·λ(i) against age in years) for the female population in Germany.]
Example 1.4.12. Another important local quantity is the so-called regression function. For this, recall that we already discussed some measures of association for the two components of an observational vector (X_1, X_2)^T. The purpose then was to quantify the degree of dependence between X_1 and X_2. In regression analysis one is interested in the kind of dependence between X_1 and X_2. This is of great practical importance when X_2, e.g., is not observable and needs to be predicted on the basis of X_1. Hence X_1 is sometimes called the independent variable (dose, input) while X_2 is the dependent variable (response, output). To distinguish between the different roles played by X_1 and X_2, we prefer the notation X and Y for X_1 and X_2, respectively. If Y has a finite expectation, a result in probability theory guarantees the existence of the conditional expectation of Y given X.
Table 1.4.1: 1000 · λ(i)

Males
i      1901/10   1924/26   1932/34   1949/51   1960/62   1970/72   1980/82
0      202,34    115,38    85,35     61,77     35,33     26,00     13,07
1      39,88     16,19     9,26      4,16      2,31      1,55      0,92
2      14,92     6,36      4,50      2,46      1,40      1,00      0,63
5      5,28      2,42      2,32      1,21      0,80      0,73      0,44
10     2,44      1,42      1,33      0,70      0,45      0,47      0,29
15     2,77      1,94      1,57      1,04      0,75      0,79      0,52
20     5,04      4,27      2,83      1,88      1,85      2,00      1,54
25     5,13      4,39      2,97      2,23      1,69      1,61      1,28
30     5,56      4,05      3,24      2,28      1,70      1,70      1,32
35     6,97      4,25      3,94      2,76      2,09      2,10      1,69
40     9,22      5,35      4,82      3,52      2,95      3,20      2,78
45     12,44     7,23      6,58      5,16      4,43      4,75      4,44
50     16,93     10,30     9,39      8,50      7,39      7,71      7,33
55     23,57     15,48     14,18     12,75     12,97     12,06     10,97
60     32,60     23,62     21,72     18,91     22,04     20,44     18,25
65     47,06     36,92     34,04     29,06     34,33     34,59     28,34
70     69,36     58,08     54,01     45,79     50,87     55,92     45,76
75     106,40    93,91     87,40     75,08     78,85     84,15     75,30
80     157,87    141,96    136,68    121,37    122,97    122,86    115,52
85     231,60    212,85    207,69    190,15    188,02    180,95    169,05
90     320,02    284,69    287,73    282,56    279,21    259,70    227,77

Females
i      1901/10   1924/26   1932/34   1949/51   1960/62   1970/72   1980/82
0      170,48    93,92     68,39     49,09     27,78     19,84     10,37
1      38,47     14,93     8,23      3,60      2,01      1,31      0,84
2      14,63     5,74      3,98      2,15      1,08      0,80      0,50
5      5,31      2,19      2,15      0,99      0,56      0,50      0,29
10     2,56      1,20      1,14      0,47      0,28      0,28      0,20
15     3,02      1,81      1,30      0,68      0,40      0,45      0,33
20     4,22      3,32      2,27      1,15      0,62      0,65      0,44
25     5,37      3,94      2,70      1,35      0,73      0,63      0,51
30     5,97      4,14      3,01      1,65      0,99      0,77      0,66
35     6,86      4,52      3,48      1,99      1,38      1,16      0,96
40     7,71      5,31      4,22      2,55      2,01      1,78      1,42
45     8,54      6,44      5,46      3,68      2,99      2,82      2,23
50     11,26     8,86      7,91      5,46      4,45      4,56      3,50
55     16,19     12,73     11,53     8,13      6,72      5,38      5,30
60     24,73     19,47     17,46     12,91     10,85     9,88      8,69
65     39,60     31,55     28,53     22,24     18,62     17,11     13,39
70     62,06     51,98     47,61     39,11     32,85     30,19     22,77
75     98,31     85,29     80,33     68,11     59,61     54,29     43,11
80     146,50    133,71    126,51    114,02    103,31    94,43     76,44
85     217,39    198,37    193,66    173,62    166,26    155,88    132,62
90     295,66    263,08    273,64    259,16    248,21    234,20    206,54
The regression function is this conditional expectation,

m(x) = E(Y \mid X = x).   (1.4.9)

If (X, Y) has a joint density f, then

m(x) = \frac{\int y f(x, y)\,dy}{f_1(x)},   (1.4.11)

where f_1 is the (marginal) density of X. Equation (1.4.11) exhibits the local flavor of m but also points out that in a real-world situation, m is unknown and depends on unknown quantities like f and f_1. We now introduce the cumulative process pertaining to m. Again we restrict ourselves to the bivariate case. Let F_1 be the d.f. of X and set

I(t) = E[Y 1_{\{X \le t\}}].

When Y \equiv 1, we obtain I(t) = F_1(t). In the general case, we get, using well-known properties of conditional expectations:

I(t) = E[E(Y 1_{\{X \le t\}} \mid X)] = E[1_{\{X \le t\}} E(Y \mid X)] = E[1_{\{X \le t\}} m(X)] = \int_{(-\infty,t]} m(x)\,F_1(dx).
Its empirical counterpart is

I_n(t) = \frac{1}{n}\sum_{j=1}^n Y_j 1_{\{X_j \le t\}}.

1.5 The Lorenz Curve

Definition 1.5.1. Let F be any distribution function on the real line with quantile function F^{-1}. Assume that \mu = \int x\,F(dx) exists and does not vanish. Then the Lorenz function of F is

L_F(p) = L(p) = \frac{1}{\mu}\int_0^p F^{-1}(u)\,du, \qquad 0 \le p \le 1.
Lemma 1.5.3. Assume that \mu > 0. Then the Lorenz function L is convex.

Proof. We have to show that for p_1, p_2 from the unit interval and any 0 \le c \le 1

c L(p_1) + (1 - c) L(p_2) \ge L(c p_1 + (1 - c) p_2).   (1.5.1)

The left-hand side equals, for 0 \le p_1 \le p_2 \le 1:

\frac{1}{\mu}\Big[\int_0^{p_1} F^{-1}(u)\,du + (1 - c)\int_{p_1}^{p_2} F^{-1}(u)\,du\Big],

while

L(c p_1 + (1 - c) p_2) = \frac{1}{\mu}\Big[\int_0^{p_1} F^{-1}(u)\,du + \int_{p_1}^{c p_1 + (1-c) p_2} F^{-1}(u)\,du\Big].

It therefore suffices to show that

(1 - c)\int_{p_1}^{p_2} F^{-1}(u)\,du \ge \int_{p_1}^{c p_1 + (1-c) p_2} F^{-1}(u)\,du.

Substituting u = c p_1 + (1 - c) v, the integral on the right becomes

(1 - c)\int_{p_1}^{p_2} F^{-1}(c p_1 + (1 - c) v)\,dv,

which is less than or equal to the left-hand side, since F^{-1} is nondecreasing and c p_1 + (1 - c) v \le v for v \ge p_1.
For example, for the uniform distribution on [a, b] with 0 \le a < b one computes

L(p) = \frac{2ap + (b - a)p^2}{a + b}.
The empirical Lorenz curve is obtained by replacing F with F_n:

L_n(p) = \frac{1}{\mu_n}\int_0^p F_n^{-1}(u)\,du,

where \mu_n = n^{-1}\sum_{i=1}^n X_i. Since F_n^{-1} is constant between two successive \frac{i-1}{n} and \frac{i}{n}, we have for \frac{k-1}{n} < p \le \frac{k}{n}:

L_n(p) = \frac{1}{\mu_n}\Big[\frac{1}{n}\sum_{i=1}^{k-1} X_{i:n} + \Big(p - \frac{k-1}{n}\Big) X_{k:n}\Big].

We see that L_n(p) may also be expressed through sums of order statistics.
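The piecewise-linear form of L_n can be sketched directly; the income values below are made up for illustration.

```python
import math

# Empirical Lorenz curve: for (k-1)/n < p <= k/n,
#   L_n(p) = (1/mu_n) * [ (1/n) * sum_{i<k} X_{i:n} + (p - (k-1)/n) * X_{k:n} ].
incomes = [1.0, 2.0, 3.0, 10.0]
n = len(incomes)
xs = sorted(incomes)
mu_n = sum(xs) / n

def lorenz(p):
    if p <= 0:
        return 0.0
    k = min(n, math.ceil(p * n))   # index with (k-1)/n < p <= k/n
    partial = sum(xs[: k - 1]) / n + (p - (k - 1) / n) * xs[k - 1]
    return partial / mu_n
```

For instance, lorenz(0.5) = 3/16: the poorer half of this sample owns 3 of the 16 income units; lorenz(1.0) = 1 by construction.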
1.6 Gini's Index

The Gini index equals twice the area A between the diagonal and the Lorenz curve:

G = 2A = 2\Big[\frac{1}{2} - \int_0^1 L(p)\,dp\Big] = 1 - 2\int_0^1 L(p)\,dp.

An equivalent representation is

G = \frac{\Delta}{2\mu}, \qquad\text{where}\qquad \Delta = \int\!\!\int |x - y|\,F(dx)F(dy).
[Table: Gini index (GiniIndex, in percent) and world rank (Rang) by country (Land), country names in German, three columns of the original layout interleaved beyond reliable reconstruction. The ranking runs from Namibia (70,70) and Südafrika (65,00) at the high end, through e.g. Vereinigte Staaten (rank 42, 45,00) and Deutschland (rank 124, 27,00), down to Schweden (rank 136, 23,00).]
Proof. We have

\Delta = 2\int\!\!\int_{\{x \le y\}} (y - x)\,F(dx)F(dy) = 2\Big[\int y F(y)\,F(dy) - \int\!\!\int_{\{x \le y\}} x\,F(dx)F(dy)\Big].

By the quantile transformation,

\int x F(x)\,F(dx) = \int_0^1 F^{-1}(u)\,u\,du = \int_0^1 \int_0^u dv\,F^{-1}(u)\,du = \int_0^1 \int_v^1 F^{-1}(u)\,du\,dv,

while

\int\!\!\int_{\{x \le y\}} x\,F(dx)F(dy) = \int_0^1 \int_0^v F^{-1}(u)\,du\,dv.

Since \int_0^1\int_v^1 F^{-1}(u)\,du\,dv = \mu - \int_0^1\int_0^v F^{-1}(u)\,du\,dv, we come up with

\frac{\Delta}{\mu} = 2 - \frac{4}{\mu}\int_0^1 \int_0^v F^{-1}(u)\,du\,dv,

whence

\frac{\Delta}{2\mu} = 1 - 2\int_0^1 L(v)\,dv = G.
The empirical Gini index of course equals

G_n = 1 - 2\int_0^1 L_n(p)\,dp.

Since L_n is a piecewise linear function interpolating the values S_{k-1}/S_n and S_k/S_n at the grid points, where S_k = \sum_{i=1}^k X_{i:n}, the trapezoidal rule yields

G_n = 1 - 2\sum_{k=1}^n \frac{1}{n}\cdot\frac{1}{2}\Big(\frac{S_{k-1}}{S_n} + \frac{S_k}{S_n}\Big) = 1 - \frac{1}{n}\sum_{k=1}^n \frac{S_k + S_{k-1}}{S_n}.
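The partial-sum formula for G_n can be checked against the mean-difference form \Delta/(2\mu) computed over all n^2 pairs; the income data are made-up values.

```python
# Empirical Gini index via partial sums S_k = X_{1:n} + ... + X_{k:n}:
#   G_n = 1 - (1/n) * sum_k (S_{k-1} + S_k) / S_n,
# compared with Delta_n / (2 * mean), Delta_n the mean absolute difference.
incomes = [1.0, 2.0, 3.0, 10.0]
n = len(incomes)
xs = sorted(incomes)

S = [0.0]
for x in xs:
    S.append(S[-1] + x)        # S[k] = S_k, with S[0] = 0

gini = 1.0 - sum(S[k - 1] + S[k] for k in range(1, n + 1)) / (n * S[n])

mean = S[n] / n
delta = sum(abs(xi - xj) for xi in xs for xj in xs) / n ** 2
gini_delta = delta / (2 * mean)
```

For this sample both formulas give 0.4375; the agreement is exact, not just approximate, since the trapezoidal integral of the piecewise linear L_n is computed without error.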
For the gap Z(p) = p - L(p) between the diagonal and the Lorenz curve we have

\int_0^1 Z(p)\,dp = \frac{1}{2} - \int_0^1 L(p)\,dp = \frac{G}{2}.

Moreover, Z attains its maximum at p = F(\mu).

Proof. We have to show that Z(F(\mu)) \ge Z(p) for all 0 \le p \le 1. We only discuss the case F(\mu) \le p, the other being dealt with in a similar way. Now,

Z(F(\mu)) - Z(p) = F(\mu) - p - [L(F(\mu)) - L(p)] = F(\mu) - p + \frac{1}{\mu}\int_{F(\mu)}^p F^{-1}(u)\,du.   (1.6.1)

Since F^{-1}(u) \ge \mu for u > F(\mu), the last integral is at least \mu(p - F(\mu)), so that (1.6.1) is nonnegative.
Furthermore, Z(F(\mu)) admits a neat representation in terms of the mean absolute deviation:

Z(F(\mu)) = \frac{1}{2\mu}\int |x - \mu|\,F(dx).

Proof. We have

\frac{1}{2\mu}\int |x - \mu|\,F(dx) = \frac{1}{2\mu}\Big[\int_{\{x > \mu\}} (x - \mu)\,F(dx) + \int_{\{x \le \mu\}} (\mu - x)\,F(dx)\Big].

Since \int (x - \mu)\,F(dx) = 0, both integrals are equal, whence

\frac{1}{2\mu}\int |x - \mu|\,F(dx) = \frac{1}{\mu}\int_{\{x \le \mu\}} (\mu - x)\,F(dx) = F(\mu) - \frac{1}{\mu}\int_{\{x \le \mu\}} x\,F(dx) = F(\mu) - \frac{1}{\mu}\int_0^{F(\mu)} F^{-1}(u)\,du.   (1.6.2)

The last expression equals F(\mu) - L(F(\mu)) = Z(F(\mu)).
1.7 The ROC-Curve

Suppose a binary test result Y \in \{0, 1\} is used to diagnose a true status D \in \{0, 1\}:

         Y = 0             Y = 1
D = 0    True negative     False positive
D = 1    False negative    True positive
We denote with

FPF = P(Y = 1 \mid D = 0)

and

TPF = P(Y = 1 \mid D = 1)

the false positive and true positive fraction, respectively. The ideal test yields FPF = 0 and TPF = 1. For a useless test which is unable to discriminate between D = 0 and D = 1 we have FPF = TPF. In a biomedical context, TPF and 1 - FPF are called sensitivity and specificity, i.e.,

sensitivity = P(Y = 1 \mid D = 1)
specificity = P(Y = 0 \mid D = 0).

In engineering the terms hit rate (TPF) and false alarm rate (FPF) are quite common. In statistics, when the goal is to test the null hypothesis D = 0, FPF is called the significance level and TPF the power. The quantity \pi = P(D = 1) is called the population prevalence of disease. Finally, the misclassification probability equals

P(Y \ne D) = P(D = 1)P(Y = 0 \mid D = 1) + P(D = 0)P(Y = 1 \mid D = 0) = \pi(1 - TPF) + (1 - \pi)FPF.
In applications to real data,

1 - TPF = P(Y = 0 \mid D = 1)

should be small, since a negative diagnosis for a patient who has in fact fallen sick may have fatal consequences. On the other hand, a positive diagnosis of a negative status may only result in some less serious inconveniences. We finally introduce the predictive values

Positive predictive value = PPV = P(D = 1 \mid Y = 1)
Negative predictive value = NPV = P(D = 0 \mid Y = 0).

A perfect test has PPV = 1 = NPV, while a useless test yields

P(D = 1 \mid Y = 1) = P(D = 1) \quad\text{and}\quad P(D = 0 \mid Y = 0) = P(D = 0).
In practice the test is often based on a real-valued marker Y, with a positive diagnosis if Y exceeds a threshold c:

FPF(c) = P(Y > c \mid D = 0)   (1.7.1)
TPF(c) = P(Y > c \mid D = 1).   (1.7.2)

The problem now is one of choosing the threshold c yielding a good, if not optimal, test. To start with, we consider the curve consisting of all values from (1.7.1) and (1.7.2) with -\infty < c < \infty.

Definition 1.7.1. The ROC curve consists of all tuples (FPF(c), TPF(c)) with -\infty < c < \infty.

Since FPF and TPF are both survival functions, we have

\lim_{c \to \infty} TPF(c) = 0 = \lim_{c \to \infty} FPF(c)

and

\lim_{c \to -\infty} TPF(c) = 1 = \lim_{c \to -\infty} FPF(c).
The empirical AUC (area under the ROC curve), based on samples Y_{01}, \ldots, Y_{0m} from the D = 0 group and Y_{11}, \ldots, Y_{1n} from the D = 1 group, equals

\widehat{AUC} = \frac{1}{nm}\sum_{i=1}^n\sum_{j=1}^m 1_{\{Y_{1i} > Y_{0j}\}},

a two-sample U-statistic. As for the Lorenz curve, we could consider the partial AUC, namely

pAUC(s) = \int_0^s ROC(t)\,dt, \qquad 0 \le s \le 1.
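The empirical AUC is simply the fraction of (diseased, healthy) pairs where the diseased score wins; the marker scores below are made-up values.

```python
# Empirical AUC as a two-sample U-statistic: fraction of pairs with a
# D = 1 score strictly exceeding a D = 0 score.
healthy = [0.1, 0.4, 0.35, 0.8]    # Y_0j, D = 0
diseased = [0.9, 0.6, 0.85]        # Y_1i, D = 1

auc = sum(1 for y1 in diseased for y0 in healthy if y1 > y0) \
      / (len(diseased) * len(healthy))
```

Here 11 of the 12 pairs are correctly ordered, so the estimate is 11/12; a useless marker would give a value near 1/2.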
1.8 The Mean Residual Lifetime Function

For a nonnegative X with d.f. F, we have

\mu = EX = \int_0^\infty (1 - F(x))\,dx.

The mean residual lifetime function is defined for all x with F(x) < 1:

e(x) = \frac{\int_x^\infty (1 - F(y))\,dy}{1 - F(x)}.

Note that e(0) = \mu. The function e has been primarily used in survival analysis and reliability theory. It is the mean remaining lifelength of those individuals who survive beyond x.

Example 1.8.1. For the exponential d.f. F with 1 - F(x) = \exp(-\lambda x), x \ge 0, we have e(x) = \frac{1}{\lambda}.
There is an intimate relation between the Lorenz function and the mean residual lifetime function.

Lemma 1.8.2. We have

L(F(x)) = 1 - \frac{1}{\mu}(1 - F(x))[e(x) + x].

Proof. First,

(1 - F(x))[e(x) + x] = x(1 - F(x)) + \int_x^\infty (1 - F(y))\,dy = E[X 1_{\{x < X\}}] = EX - E[X 1_{\{X \le x\}}] = \mu - \int_0^{F(x)} F^{-1}(t)\,dt = \mu\big(1 - L(F(x))\big).
The next lemma presents a formula connecting F and e.

Lemma 1.8.3. We have

1 - F(x) = \frac{e(0)}{e(x)}\exp\Big[-\int_0^x \frac{dt}{e(t)}\Big].

Proof. Set h(t) = \int_t^\infty (1 - F(z))\,dz. Since

\frac{1}{e(t)} = \frac{1 - F(t)}{\int_t^\infty (1 - F(z))\,dz} = -\frac{h'(t)}{h(t)},

we obtain

\int_0^x \frac{dt}{e(t)} = \ln h(0) - \ln h(x)

and therefore

\exp\Big[-\int_0^x \frac{dt}{e(t)}\Big] = \frac{h(x)}{h(0)}.

We conclude, noting that h(0) = \mu = e(0):

e(0)\exp\Big[-\int_0^x \frac{dt}{e(t)}\Big] = h(x) = \int_x^\infty (1 - F(z))\,dz = (1 - F(x))\,e(x).
1.9 The Total Time on Test Transform

The total time on test (TTT) transform of F is

H^{-1}(t) = \int_0^{F^{-1}(t)} (1 - F(x))\,dx, \qquad 0 < t < 1.

In survival analysis, the TTT-transform and its sample version are examples of processes (plots) which aim at analyzing the distributional properties of a lifetime variable subject to falling below (or above) a given threshold. By now, there is a huge literature on the empirical TTT-transform. In Barlow et al. (1972), it appeared in the estimation process of a failure rate under order restrictions. Barlow and Proschan (1969) studied tests based on TTT when the data are incomplete. Langberg et al. (1980) showed how the TTT-transform can be used to characterize lifetime distributions; see also Klefsjo (1983a, b). More recently, Haupt and Schabe (1997) applied the TTT concept to construct bathtub-shaped hazard functions.

Example 1.9.1. For the exponential d.f. with parameter \lambda we have

H^{-1}(t) = \frac{t}{\lambda}.
Again, there are some interesting connections with other functions. One of these is obtained by writing, via Fubini's theorem,

\int_0^{F^{-1}(t)} (1 - F(x))\,dx = \int_0^{F^{-1}(t)} \int 1_{\{x < y\}}\,F(dy)\,dx.

The empirical version of H^{-1} is obtained if one replaces F and F^{-1} by their empirical counterparts:

H_n^{-1}(t) = \int_0^{F_n^{-1}(t)} (1 - F_n(x))\,dx.

At t = k/n we get

H_n^{-1}\Big(\frac{k}{n}\Big) = \int_0^{X_{k:n}} (1 - F_n(x))\,dx = \frac{1}{n}\Big[\sum_{i=1}^k X_{i:n} + (n - k)X_{k:n}\Big].
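The identity at t = k/n can be verified numerically by integrating the piecewise-constant function 1 - F_n directly; the lifetimes are made-up untied values.

```python
# Empirical TTT transform at t = k/n:
#   H_n^{-1}(k/n) = (1/n) * [ sum_{i<=k} X_{i:n} + (n - k) * X_{k:n} ].
lifetimes = sorted([0.5, 1.0, 2.0, 4.0])
n = len(lifetimes)

def ttt(k):
    xk = lifetimes[k - 1]
    return (sum(lifetimes[:k]) + (n - k) * xk) / n

def integral_one_minus_Fn(upper):
    # piecewise-constant integral of 1 - F_n from 0 to upper (no ties)
    total, prev = 0.0, 0.0
    for i, x in enumerate(lifetimes):
        if x > upper:
            break
        total += (x - prev) * (1 - i / n)
        prev = x
    tail_level = 1 - sum(1 for x in lifetimes if x <= upper) / n
    return total + max(0.0, upper - prev) * tail_level
```

Both routes give, e.g., H_n^{-1}(2/4) = 0.875 for this sample.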
1.10 The Product Integration Formula

It became clear that there is an intimate relationship between these functions, and each may serve as a tool to model various aspects of a parent population. In this section we study the connection between the d.f. F and the cumulative hazard function \Lambda. As our final result we shall come up with the famous product integration formula. It will offer a possibility to represent empirical d.f.'s through products rather than sums. Important applications of this approach are in survival analysis.
Let us start with an arbitrary not necessarily random sequence of real numbers, say S0 , S1 , S2 , . . . In most cases (Si )i is adapted to a ltration (Fi )i .
For a given x define the sequence
$$T_n = x \prod_{i=1}^n (1 + \Delta S_i), \qquad n = 0, 1, \ldots,$$
where $\Delta S_i = S_i - S_{i-1}$, together with the discrete integral
$$\int_{[0,n]} T\,dS = \sum_{i=1}^n T_{i-1}\,\Delta S_i = \sum_{i=1}^n T_{i-1}(S_i - S_{i-1}).$$
Lemma 1.10.1. We have
$$T_n = x + \int_{[0,n]} T\,dS. \tag{1.10.1}$$
Proof. Obviously, the assertion is true for n = 0 since both sides equal x. The general case is proved by induction on n. Given that the assertion is true for n, we have
$$T_{n+1} = T_n(1 + \Delta S_{n+1}) = T_n + T_n\,\Delta S_{n+1} = x + \int_{[0,n]} T\,dS + T_n\,\Delta S_{n+1} = x + \int_{[0,n+1]} T\,dS.$$
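A quick numerical check of Lemma 1.10.1 (the sequence and starting value below are arbitrary):

```python
def discrete_exponential(x, dS):
    """T_n = x * prod_{i<=n} (1 + dS_i) together with the discrete integral
    sum_{i<=n} T_{i-1} dS_i from (1.10.1)."""
    T = [x]                                   # T_0 = x
    for d in dS:
        T.append(T[-1] * (1.0 + d))           # product form of T_n
    integral = sum(T[i] * dS[i] for i in range(len(dS)))  # sum T_{i-1} dS_i
    return T[-1], x + integral

dS = [0.1, -0.25, 0.3, 0.05]
prod_form, integral_form = discrete_exponential(2.0, dS)
```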
Equation (1.10.1) is an example of an Integral Equation. For x = 1, we call the sequence
$$T_n \equiv \mathrm{Exp}_n(S) = \prod_{i=1}^n (1 + \Delta S_i) = 1 + \int_{[0,n]} T\,dS \tag{1.10.2}$$
the (discrete) exponential of S. More generally, for a sequence $(h_i)_i$ put
$$\int_{[0,n]} h\,dS = \sum_{i=1}^n h_i\,\Delta S_i = \sum_{i=1}^n h_i (S_i - S_{i-1}).$$
Lemma 1.10.2. The sequence
$$T_n = \mathrm{Exp}_n\Big(\int h\,dS\Big) = \prod_{i=1}^n (1 + h_i\,\Delta S_i)$$
satisfies
$$T_n = 1 + \int_{[0,n]} T h\,dS.$$
Note that Lemma 1.10.1 is a special case of 1.10.2: just put $h_n \equiv 1$. Now, if S is a function of bounded variation, we may let the number of grid points tend to infinity. Again, put $t_0 = -\infty$ and keep $t_n = t$ fixed. Then the limit of (1.10.2) exists and we obtain the analogue of (1.10.2) in continuous time:
$$T_t = 1 + \int_{(-\infty,t]} T(x-)\,S(dx). \tag{1.10.3}$$
Now recall the cumulative hazard function $\Lambda$ with
$$\Lambda(dx) = \frac{F(dx)}{1 - F(x-)}.$$
Conclude
$$F(t) = \int_{(-\infty,t]} (1 - F(x-))\,\Lambda(dx)$$
and
$$1 - F(t) = 1 - \int_{(-\infty,t]} (1 - F(x-))\,\Lambda(dx). \tag{1.10.4}$$
Write
$$S^c(t) = S(t) - \sum_{x \le t} \Delta S(x) = S(t) - \sum_{x \le t} S\{x\}.$$
The function $S^c$ is called the continuous part of S. Note that S admits at most countably many discontinuities.
Theorem 1.10.5. Let F be any d.f. with associated cumulative hazard function $\Lambda$. Then we have
$$1 - F(t) = \exp[-\Lambda^c(t)] \prod_{x \le t} [1 - \Lambda\{x\}]. \tag{1.10.6}$$
For the proof of Theorem 1.10.5 we need the following result for exponentials.
Lemma 1.10.6. Let $(S_i^1)_i$ and $(S_i^2)_i$ be two sequences. Then we have
$$\mathrm{Exp}_n(S^1)\,\mathrm{Exp}_n(S^2) = \mathrm{Exp}_n(S^1 + S^2 + [S^1, S^2]). \tag{1.10.7}$$
Here
$$[S^1, S^2]_n = S_0^1 S_0^2 + \sum_{k=1}^n \Delta S_k^1\,\Delta S_k^2.$$
Write $\Lambda(t) = \sum_{x \le t} \Lambda\{x\} + \Lambda^c(t) = \Lambda^d(t) + \Lambda^c(t)$, where $\Lambda^d$ stands for the discrete part. As before, consider a grid $t_0 = -\infty < t_1 < \ldots < t_n = t$. If we let n tend to infinity such that the grid gets finer and finer, we get
$$\mathrm{Exp}_n(\Lambda^c) = \prod_{i=1}^n [1 + \Delta\Lambda^c(t_i)] = \exp\Big\{\sum_{i=1}^n \ln[1 + \Delta\Lambda^c(t_i)]\Big\} = \exp\Big\{\sum_{i=1}^n \Delta\Lambda^c(t_i) + o(1)\Big\} \to \exp[\Lambda^c(t)]. \tag{1.10.8}$$
This argument shows that the exponential of a continuous monotone bounded function f equals $t \mapsto \exp[f(t)]$. For the discrete part, however, we get
$$\mathrm{Exp}_n(\Lambda^d) \to \prod_{x \le t} [1 + \Lambda\{x\}]. \tag{1.10.9}$$
Now replace $\Lambda$ with $-\Lambda$. Then the product of (1.10.8) and (1.10.9) converges to the right-hand side of (1.10.6). To prove the Theorem, we apply Lemma 1.10.6 with $S^1 = -\Lambda^c$ and $S^2 = -\Lambda^d$:
$$\prod_{i=1}^n \big[1 + \Delta S_i^1 + \Delta S_i^2 + \Delta S_i^1\,\Delta S_i^2\big] = \exp\Big\{\sum_{i=1}^n \ln\big(1 + \Delta S_i^1 + \Delta S_i^2 + \Delta S_i^1\,\Delta S_i^2\big)\Big\} = \exp\Big\{\sum_{i=1}^n \ln\big(1 + \Delta S_i^1 + \Delta S_i^2\big) + o(1)\Big\}.$$
The last equation is obtained from a Taylor-expansion and the fact that
$$\sum_{i=1}^n \Delta S_i^1\,\Delta S_i^2 \to \int_{(-\infty,t]} \Lambda\{x\}\,\Lambda^c(dx) = 0.$$
In the following we apply the last theorem to the empirical d.f. $F_n$. For the sake of simplicity we assume that the order statistics $X_{1:n} < \ldots < X_{n:n}$ are pairwise distinct. This is always the case (with probability one) if the underlying d.f. F is continuous. Now, since $F_n$ only jumps at the data, we have
$$\Lambda_n\{x\} = \frac{F_n\{x\}}{1 - F_n(x-)} = \begin{cases} \frac{1}{n-i+1} & \text{if } x = X_{i:n} \\ 0 & \text{elsewhere} \end{cases}.$$
Hence Theorem 1.10.5 yields (note that $\Lambda_n^c \equiv 0$)
$$1 - F_n(t) = \prod_{X_{i:n} \le t} \Big[1 - \frac{1}{n-i+1}\Big] = \prod_{X_{i:n} \le t} \frac{n-i}{n-i+1}, \tag{1.10.10}$$
and, for the left limits,
$$1 - F_n(t-) = \prod_{X_{i:n} < t} \frac{n-i}{n-i+1}.$$
In particular, for $t = X_{j:n}$ the jump of $F_n$ becomes
$$F_n\{X_{j:n}\} = \prod_{i=1}^{j-1} \frac{n-i}{n-i+1}\Big[1 - \frac{n-j}{n-j+1}\Big] = \frac{n-1}{n}\,\frac{n-2}{n-1} \cdots \frac{n-j+1}{n-j+2}\,\frac{1}{n-j+1} = \frac{1}{n},$$
as expected.
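Formula (1.10.10) is easy to test against the direct definition of $F_n$; a small sketch (names are ours), assuming distinct observations:

```python
import numpy as np

def survival_product_limit(x, t):
    """1 - F_n(t) as the product over X_{i:n} <= t of (n-i)/(n-i+1),
    cf. (1.10.10); observations are assumed pairwise distinct."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = xs.size
    surv = 1.0
    for i, xi in enumerate(xs, start=1):
        if xi <= t:
            surv *= (n - i) / (n - i + 1)
    return surv

data = [3.1, 0.5, 2.2, 4.0, 1.7]
t = 2.5
direct = 1.0 - np.mean(np.asarray(data) <= t)   # 1 - F_n(t) directly
```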
If we check the proof of Theorem 1.10.5 carefully, we see that the following arguments were essential:
• $1 - F$ is the exponential of $-\Lambda$.
• For taking logarithms we needed $1 + \Delta S_i^1 + \Delta S_i^2 + \Delta S_i^1 \Delta S_i^2 > 0$. This was true, however, since $\Delta S_i^1$ and $\Delta S_i^1 \Delta S_i^2$ became small, while $\Delta S_i^2 \ge -\Lambda\{x\}$ and $1 - \Lambda\{x\} > 0$.
• For taking limits it was important that $\Lambda$ is finite over $(-\infty, t]$ for each finite t.
We already knew before that $1 - F$ is the exponential of $-\Lambda$. Equation (1.10.6) is important since it reveals a possibility to explicitly represent $1 - F$ in terms of the continuous and discrete parts of $\Lambda$.
1.11 Sampling Designs and Weighting
In the previous sections we have seen that empirical d.f.s may serve as a universal tool to express (and analyze) various statistics which at first sight do not seem to have much in common. For this kind of discussion no particular assumptions like independence of the data were necessary.
In this section we outline several possible data situations which we will study in future chapters. The classical field of empirical distributions (with standard weights) deals with independent identically distributed (i.i.d.) random observations. Therefore, in forthcoming chapters, we shall first study this case in greater detail. The theory and its applications are particularly rich when the data are real-valued. For the multivariate case the fact that the data cannot be ordered causes new challenging questions. In the context of time series analysis the data feature some dependence.
An important class of statistics are weighted sums of order statistics,
$$\sum_{i=1}^n W_{in} X_{i:n} = \frac{1}{n} \sum_{i=1}^n g\Big(\frac{i}{n+1}\Big) X_{i:n} \approx \int_0^1 g(u) F^{-1}(u)\,du.$$
In the above version $i/n$ was replaced by $i/(n+1)$ since from time to time one considers functions g only defined on the open unit interval (0,1) such that g(u) tends to infinity when $u \to 0$ or $u \to 1$.
The choice of $g \equiv 1$ leads to the sample mean. If, for $0 < a < \frac{1}{2}$, we set
$$g(x) = \frac{1}{1-2a}\,1_{[a,1-a]}(x),$$
we obtain (up to rounding of the indices) the trimmed mean
$$\frac{1}{(n+1)(1-2a)+1} \sum_{i=(n+1)a}^{(n+1)(1-a)} X_{i:n},$$
which discards the smallest and largest order statistics.
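A minimal sketch of the trimmed mean with the index convention above (function name and rounding choices are ours):

```python
import numpy as np

def trimmed_mean(x, a):
    """Average the order statistics with ranks (n+1)a .. (n+1)(1-a),
    i.e. discard roughly the fraction a of smallest and largest values."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = xs.size
    lo = int(np.ceil((n + 1) * a))          # first retained rank (1-based)
    hi = int(np.floor((n + 1) * (1 - a)))   # last retained rank
    return xs[lo - 1:hi].mean()

# Two gross outliers barely move the 10%-trimmed mean.
data = [9.0, 1.0, 4.0, 100.0, 5.0, 6.0, 3.0, 2.0, 7.0, -50.0]
tm = trimmed_mean(data, 0.1)                # mean of ranks 2..9
```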
Note that $f_n$ very much depends on the choice of the cells $I_j$. A slight change of the $I_j$'s may result in a dramatic change of $f_n$ if many data are located close to the boundary of a cell.
Example 1.11.3. This example is similar to the previous one but from multivariate statistics. Assume we observe bivariate data $(X_i, Y_i)$, $1 \le i \le n$. One may be interested in the mean $m(x) = E[Y|X = x]$ of Y given that X attains a specified value x, i.e., in the regression of Y at x. Since there may be no $X_i$ with $X_i = x$, we may open a window with center x and length 2h and average the $Y_i$'s for which the $X_i$'s fall into $[x-h, x+h]$. This leads to
$$m_n(x) = \frac{\sum_{i=1}^n Y_i 1_{[x-h,x+h]}(X_i)}{\sum_{i=1}^n 1_{[x-h,x+h]}(X_i)} = \sum_{i=1}^n W_{in}(x) Y_i,$$
where
$$W_{in} = W_{in}(x) = \frac{1_{[x-h,x+h]}(X_i)}{\sum_{j=1}^n 1_{[x-h,x+h]}(X_j)}$$
is the weight of $Y_i$, depending on x, h and the locations of all $X_j$. For small h > 0 it may happen that the denominator of $W_{in}$ equals zero so that $W_{in}$ is not well-defined. For large h's we face the risk that we lose the local flavor of the data. So the question is how to choose a good h. Also note that each window is located around x and no pre-specified choice of cells as for histograms is required.
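The window estimate $m_n(x)$ can be sketched in a few lines (names are ours); it returns nan when the window contains no design points, mirroring the ill-defined case discussed above.

```python
import numpy as np

def window_regression(x_obs, y_obs, x, h):
    """m_n(x): average of the Y_i with X_i in [x-h, x+h], written as the
    weighted sum sum_i W_in(x) Y_i with the indicator weights from above."""
    x_obs = np.asarray(x_obs, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    inside = np.abs(x_obs - x) <= h           # 1_[x-h, x+h](X_i)
    if not inside.any():                      # denominator of W_in is zero
        return float("nan")
    weights = inside / inside.sum()           # W_in(x)
    return float(np.sum(weights * y_obs))

xs = [0.1, 0.2, 0.8, 0.9, 0.5]
ys = [1.0, 2.0, 10.0, 12.0, 5.0]
m = window_regression(xs, ys, 0.15, 0.1)      # averages the first two Y's
```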
form, i.e., rather than Yi the time Ci spent in the study is observed.
[Figure: lifetimes of five patients, measured from the time of surgery until T = the end of the study; observed are $Y_1$, $Y_2$ and the censored times $C_3$, $C_4$, $C_5$.]
In our example Y1 and Y2 are observable. For cases 3 and 5 we observe the
times spent until the end of the study, while patient 4 dropped out of the
study before the end.
Though some data are censored, they provide important information in that
it is known that a patient survived at least for some time. The question then
becomes one of how to properly weight each datum.
In the following example the situation is even worse since some of the data
may be completely lost.
Example 1.11.5. In retrospective AIDS data analysis (e.g., transfusion associated AIDS) one may be interested in the incubation period $X_i = W_i - S_i$, where $S_i$ is the (calendar) time of infection and $W_i$ is the time of diagnosis. Typically, there is a time T when the study has to be terminated. A case is only included if $W_i \le T$ or, equivalently, $X_i \le T - S_i = Y_i$. In this case both $X_i$ and $Y_i$ are observed. If $X_i > Y_i$, the data are unknown and the case is not included. One says that $X_i$ is truncated by $Y_i$ from the right. Consequently the sample size n is unknown. Again the question is how to weight the available data.
Example 1.11.6. Sometimes it may be known that the observations $X_1, \ldots, X_n$ come from a distribution function F satisfying some constraint, e.g.,
$$EX = \int x\,F(dx) = 0.$$
Chapter 2
The Single Event Process
$$\mathrm{Var}(S_t) = F(t)(1 - F(t)), \qquad t \in \mathbb{R}. \tag{2.1.2}$$
The variance attains its maximum when $F(t) = \frac{1}{2}$, i.e., at the median. As we move to the left or to the right, the variance declines. It vanishes whenever F(t) = 0 or F(t) = 1, i.e., outside the support of F.
In the next lemma we present the covariance structure of $S \equiv \{S_t : t \in \mathbb{R}\}$.
Lemma 2.1.2. For $s \le t$ we have
$$\mathrm{Cov}(S_s, S_t) = F(s)[1 - F(t)]. \tag{2.1.3}$$
Proof. We have
$$\mathrm{Cov}(S_s, S_t) = E(S_s S_t) - E(S_s)E(S_t) = F(s) - F(s)F(t),$$
whence the result. For the last equality note that
$$S_s S_t = 1_{\{X \le s\}} 1_{\{X \le t\}} = 1_{\{X \le s,\, X \le t\}} = 1_{\{X \le s\}}.$$
For a general pair s, t of real numbers the covariance equals
$$\mathrm{Cov}(S_s, S_t) = F(s \wedge t) - F(s)F(t), \tag{2.1.4}$$
where $s \wedge t = \min(s, t)$.
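Since $S_s$ and $S_t$ are simple functions of X, (2.1.4) can be verified exactly for a discrete distribution by enumerating the probability mass function; the three-point distribution below is arbitrary.

```python
# Exact check of Cov(S_s, S_t) = F(min(s,t)) - F(s) F(t) for a discrete X.
pmf = {1.0: 0.2, 2.0: 0.5, 3.0: 0.3}

def F(t):
    return sum(p for v, p in pmf.items() if v <= t)

def cov_S(s, t):
    e_st = sum(p for v, p in pmf.items() if v <= s and v <= t)  # E[S_s S_t]
    return e_st - F(s) * F(t)

s, t = 1.5, 2.5
lhs = cov_S(s, t)                      # 0.2 - 0.2*0.7 = 0.06
rhs = F(min(s, t)) - F(s) * F(t)
```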
For arbitrary measurable functions $\varphi$, $\varphi_1$ and $\varphi_2$ we similarly get the following elementary but basic equations.
Lemma 2.1.3. Provided all integrals exist, we have
$$E[\varphi(X)] = \int \varphi\,dF,$$
$$\mathrm{Var}[\varphi(X)] = \int \varphi^2\,dF - \Big(\int \varphi\,dF\Big)^2,$$
$$\mathrm{Cov}[\varphi_1(X), \varphi_2(X)] = \int \varphi_1\varphi_2\,dF - \int \varphi_1\,dF \int \varphi_2\,dF.$$
If we set
$$\langle \varphi_1, \varphi_2 \rangle \equiv \int \varphi_1\varphi_2\,dF,$$
then $\varphi_1$ and $\varphi_2$ are orthogonal if
$$\int \varphi_1\varphi_2\,dF = 0.$$
If, furthermore, they are also centered, i.e.,
$$\int \varphi_1\,dF = 0 = \int \varphi_2\,dF,$$
then
$$\mathrm{Cov}(\varphi_1(X), \varphi_2(X)) = 0. \tag{2.1.5}$$
(2.1.5)
(2.1.5) implies, that also the correlation between 1 (X) and 2 (X) equals
zero.
Orthogonality is obtained for all F if 1 2 0. E.g., if i = 1Bi for i = 1, 2
and B1 B2 = , then
1B1 1B2 = 1B1 B2 = 1 = 0.
Other s may be orthogonal only w.r.t a special F but not w.r.t others.
Having studied some elementary properties of the Single Event Process, we now begin to analyze the probabilistic structure of the whole process. The distribution of S is uniquely determined by its finite-dimensional distributions (fidis). For this we fix any finite collection of t's, say
$$t_1 < t_2 < \ldots < t_k < t_{k+1}.$$
To study the distribution of $(S_{t_1}, S_{t_2}, \ldots, S_{t_{k+1}})$ we note that, e.g.,
$$\frac{P(X \le t_{k+1}, X > t_k, \ldots, X > t_1)}{P(X > t_k, \ldots, X > t_1)} = \frac{P(X \le t_{k+1}, X > t_k)}{P(X > t_k)} = P(S_{t_{k+1}} = 1 \mid S_{t_k} = 0).$$
Similar results hold for other possible past values of S. So we may conclude
the following lemma.
Lemma 2.1.4. The Single Event Process is a Markov process with transition probabilities, for s < t,
$$P(S_t = 1 \mid S_s = 1) = 1, \qquad P(S_t = 0 \mid S_s = 1) = 0,$$
$$P(S_t = 1 \mid S_s = 0) = \frac{F(t) - F(s)}{1 - F(s)}, \qquad P(S_t = 0 \mid S_s = 0) = \frac{1 - F(t)}{1 - F(s)}.$$
$$\frac{F(t) - F(s)}{1 - F(s)}. \tag{2.1.6}$$
$$X_{-\infty} = E[\varphi(X)] = \int \varphi\,dF.$$
This changes our view of things a bit in that the former target $\int \varphi\,dF$ now becomes only the starting point of a whole family $X_t$ of quantities which focus on the true but possibly unobserved value of X rather than a population parameter. As t increases, $X_t$ incorporates the available information so that one hopefully comes up with a better updated predictor of $\varphi(X)$. The last remark is intended to point out the dynamic character of $X_t$.
In the next lemma we present an explicit expression for $X_t$.
Lemma 2.1.5. We have, for all t with F(t) < 1,
$$X_t = \varphi(X)\,1_{\{X \le t\}} + 1_{\{X > t\}}\,\frac{\int_{(t,\infty)} \varphi(u)\,F(du)}{1 - F(t)}.$$
Proof. Note that both summands are measurable w.r.t. $\mathcal{F}_t$. Actually, the second term is just $1 - S_t$ multiplied with a deterministic factor. As to the first, if $X \le t$, we know the precise value of X since $\mathcal{F}_t$ contains all $\{X \le u\}$ with $u \le t$. Knowing, for each $u \le t$, whether $\{X \le u\}$ or $\{X > u\}$ occurred, is tantamount to knowing the value of X. Finally, using the Markov-property of S, it remains to show that $X_t$ has the same expectation as $\varphi(X)$ on both $\{X \le t\}$ and $\{X > t\}$. But this is obvious.
From Lemma 2.1.5,
$$\varphi(X) - X_t = 1_{\{X > t\}}\Big[\varphi(X) - \frac{\int_{(t,\infty)} \varphi(u)\,F(du)}{1 - F(t)}\Big],$$
whence
$$E[\varphi(X) - X_t]^2 = \int_{\{X > t\}} \Big[\varphi(X) - \frac{\int_{(t,\infty)} \varphi\,dF}{1 - F(t)}\Big]^2 dP \le \int_{\{X > t\}} \varphi^2(X)\,dP.$$
If, e.g., $\int |x|\,\varphi^2(x)\,F(dx) < \infty$, then for t > 0
$$\int_{\{X > t\}} \varphi^2(X)\,dP \le \frac{1}{t} \int |X|\,\varphi^2(X)\,dP = O(t^{-1}).$$
{X>t}
1 F (s)
1 F (t)
1 F (s)
.
1 F (t)
1{X>t}
1 F (t)
65
1{Xt}
,
F (t)
2.2 Distribution-Free Transformations
In nonparametric statistics no particular assumption about the underlying distribution function F will be imposed. As a computational drawback, critical regions of test statistics or distributions of estimators may depend on F and could therefore be difficult to obtain. While with high-speed computers at hand Monte Carlo techniques may nowadays be employed to overcome these problems, beginning in the 1940s an increasing scepticism about the appropriateness of small parametric models led to an enormous interest in statistical methodology which was distribution-free under broad model assumptions. As we shall see later, the ideas elaborated then are useful in our context and will also help to motivate more advanced new statistical approaches.
Generally, a distribution-free transformation aims at transforming a variable
(an observation) X to another one which has a known distribution. In this
section only real-valued X will be considered. The most important transformation is the one which comes up with a uniformly distributed variable.
Definition 2.2.1. The uniform distribution (on the unit interval [0,1]) is given through its d.f.
$$F_U(t) = \begin{cases} 0 & \text{for } t < 0 \\ t & \text{for } 0 \le t \le 1 \\ 1 & \text{for } 1 < t \end{cases}$$
Hence, a random variable from this distribution can only take on values in
(0, 1) (with probability one). Write U U [0, 1].
The other important reference distribution is the exponential distribution.
Definition 2.2.2. The exponential distribution with parameter $\lambda > 0$ is given through
$$1 - F(t) = \begin{cases} 1 & \text{for } t < 0 \\ \exp[-\lambda t] & \text{for } t \ge 0 \end{cases} \tag{2.2.1}$$
We shall write $X \sim \mathrm{Exp}(\lambda)$ whenever X has d.f. (2.2.1). Note that such an X is nonnegative with probability one. The hazard function pertaining to this F equals the constant function $\lambda(x) \equiv \lambda$. The following lemma reveals a way how to generate a variable X with d.f. F.
Lemma 2.2.3. Let U be uniform on [0,1]. Then
$$X := F^{-1}(U) \sim F. \tag{2.2.2}$$
Lemma 2.2.4. Conversely, any X with d.f. F can be represented as $X = F^{-1}(U)$ with a uniform U, possibly on an enlarged probability space.
Proof. Denote with $A = \{a\}$ the set of atoms of F, i.e., the set of points such that $F\{a\} > 0$. There are at most countably many atoms, if there are any. Let V be independent of X and uniformly distributed on [0,1]. Set
$$U = \begin{cases} F(X) & \text{if } X \notin A \\ F(a-) + F\{a\}\,V & \text{if } X = a \in A \end{cases}.$$
Check that $U \sim U[0,1]$ and $X = F^{-1}(U)$.
[Figure 2.2.1: the quantile transformation, mapping uniform values $U_1, \ldots, U_5$ on the vertical axis to $X_i = F^{-1}(U_i)$ on the horizontal axis.]
For most distributions, the quantile function does not allow for a simple analytic expression, so that Lemma 2.2.3 is of limited value if one really wants to generate X through U and $F^{-1}$. For the exponential distribution, however,
$$u = F(t) = 1 - \exp(-\lambda t)$$
immediately yields
$$F^{-1}(u) = -\frac{1}{\lambda}\ln(1-u),$$
so that
$$X := -\frac{1}{\lambda}\ln(1-U) \sim \mathrm{Exp}(\lambda). \tag{2.2.3}$$
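(2.2.3) is the standard way to simulate exponential data; the sketch below (function names are ours) checks the inversion by composing it with the d.f.

```python
import math

def exp_quantile(u, lam):
    """F^{-1}(u) = -(1/lam) * ln(1 - u) for Exp(lam), cf. (2.2.3)."""
    return -math.log(1.0 - u) / lam

def exp_cdf(t, lam):
    return 1.0 - math.exp(-lam * t) if t >= 0.0 else 0.0

lam = 0.7
# F(F^{-1}(u)) = u must hold for every u in (0, 1).
roundtrip = [exp_cdf(exp_quantile(u, lam), lam) for u in (0.1, 0.5, 0.9, 0.99)]
```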
Lemma 2.2.4 and (2.2.3) now lead to the desired distribution-free transformations. Recall
$$\Lambda_F(t) = \int_{(-\infty,t]} \frac{F(dx)}{1 - F(x-)},$$
which for a continuous F becomes
$$\Lambda_F(t) = \int_{(-\infty,t]} \frac{F(dx)}{1 - F(x)} = -\ln[1 - F(t)]. \tag{2.2.4}$$
by (2.2.3).
Lemma 2.2.4 has some interesting consequences for the representation of the Single Event Process. Actually, since $X = F^{-1}(U)$, we have
$$S(t) \equiv S_t = 1_{\{X \le t\}} = 1_{\{U \le F(t)\}}. \tag{2.2.5}$$
If we denote with $S^X$ and $S^U$ the S-process for X and U, then (2.2.5) states that
$$S^X(t) = S^U(F(t)). \tag{2.2.6}$$
The process $S^U$ will be restricted to [0,1] since for t outside the unit interval $S^U$ is constant (= 0 resp. 1) and therefore uninteresting.
Equation (2.2.6) shows that the study of $S^X$ can be traced back to that of $S^U$. Actually, $S^X$ equals $S^U$ after a proper time transformation through F. When we observe X, only the left-hand side of (2.2.6) is known; neither U nor F is available. Nevertheless, (2.2.6) turns out to be a basic equality for understanding and studying many distribution-free procedures in statistics. It is responsible for the special role played by the uniform distribution in nonparametric statistics.
Therefore our next section will be devoted to the study of U and the associated $S^U$.
2.3
The uniform density equals
$$f(t) = \begin{cases} 1 & \text{for } 0 \le t \le 1 \\ 0 & \text{elsewhere} \end{cases},$$
and the cumulative hazard function becomes
$$\Lambda(t) = \int_0^t \frac{dx}{1-x} = -\ln(1-t), \qquad 0 \le t < 1.$$
The moments of $U \sim U[0,1]$ are
$$EU^k = \int_0^1 x^k\,dx = \frac{1}{k+1}.$$
In particular, the mean equals 1/2, and for the variance we obtain
$$\mathrm{Var}\,U = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}.$$
When we study $S^U$ and later the uniform empirical d.f. $F_n^U$, we also have to study the space $L_2(F_U)$. In particular, we are looking for systems of orthonormal functions spanning relevant subspaces of $L_2(F_U)$.
Lemma 2.3.1. Consider the functions
$$\varphi_j(t) = \sqrt{2}\,\sin(j\pi t), \qquad 0 \le t \le 1,\ j \in \mathbb{N}.$$
These functions are orthonormal in $L_2(F_U)$:
$$\int_0^1 \varphi_j^2(t)\,dt = 2\Big[\frac{t}{2} - \frac{\cos(j\pi t)\sin(j\pi t)}{2j\pi}\Big]_0^1 = 1,$$
while
$$\int_0^1 \sin(j\pi t)\sin(k\pi t)\,dt = \frac{\sin[(j-k)\pi t]}{2(j-k)\pi}\Big|_0^1 - \frac{\sin[(j+k)\pi t]}{2(j+k)\pi}\Big|_0^1 = 0 \quad \text{for } j \ne k.$$
Since the functions $\varphi_j$ vanish at zero and one, they are not candidates for a basis of the whole of $L_2(F_U)$. If, however, we restrict ourselves to the subspace of all functions in $L_2(F_U)$ vanishing at the boundary of [0,1], these functions are indeed candidates. Coming back to $S^U$, we have $S^U(0) = 0$ but $S^U(1) = 1$. To obtain a process which also vanishes at t = 1, we set
$$\bar{S}^1(t) = S^U(t) - t, \qquad 0 \le t \le 1.$$
The upper bar stands for uniform. This process is centered and vanishes at t = 0 and t = 1. It has the same covariance structure as $S^U$, see (2.1.4), namely
$$\mathrm{Cov}(\bar{S}^1(s), \bar{S}^1(t)) = s \wedge t - st, \qquad 0 \le s, t \le 1.$$
As a function of t, it equals
$$\bar{S}^1(t) = \begin{cases} -t & \text{for } t < U \\ 1 - t & \text{for } U \le t \le 1 \end{cases}.$$
In $L_2$, $\bar{S}^1$ admits the orthonormal representation
$$\bar{S}^1 = \sum_{j=1}^\infty \langle \bar{S}^1, \varphi_j \rangle\,\varphi_j. \tag{2.3.1}$$
The coefficients can be computed explicitly:
$$\langle \bar{S}^1, \varphi_j \rangle = \int_0^1 \bar{S}^1(t)\varphi_j(t)\,dt = -\int_0^U t\varphi_j(t)\,dt + \int_U^1 (1-t)\varphi_j(t)\,dt = -\int_0^1 t\varphi_j(t)\,dt + \int_U^1 \varphi_j(t)\,dt = \frac{\sqrt{2}\cos(j\pi U)}{j\pi} = \frac{\sqrt{2}}{j\pi}\,\psi_j(U),$$
with
$$\psi_j(x) = \cos(j\pi x), \qquad 0 \le x \le 1.$$
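The closed form of the coefficients can be cross-checked by direct numerical integration for a fixed value of U (the midpoint rule and the tolerances below are our choices):

```python
import math

def coefficient_numeric(U, j, steps=200_000):
    """Midpoint-rule approximation of <Sbar1, phi_j> =
    integral_0^1 (1_{U<=t} - t) * sqrt(2) * sin(j*pi*t) dt."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        s_bar = (1.0 if U <= t else 0.0) - t       # the process Sbar1(t)
        total += s_bar * math.sqrt(2.0) * math.sin(j * math.pi * t) * h
    return total

U, j = 0.3, 2
closed_form = math.sqrt(2.0) * math.cos(j * math.pi * U) / (j * math.pi)
```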
The functions $\psi_1, \psi_2, \ldots$ satisfy certain properties which make the orthonormal representation (2.3.1) interesting for statistical applications.
Lemma 2.3.2. We have, for each $j \ge 1$,
$$E\psi_j(U) = \int_0^1 \psi_j(x)\,dx = 0$$
and, for $i \ne j$,
$$E[\psi_i(U)\psi_j(U)] = \int_0^1 \psi_i(x)\psi_j(x)\,dx = 0.$$
Proof. The first statement follows from integration, while the second is a direct consequence of
$$\int_0^1 \cos(i\pi x)\cos(j\pi x)\,dx = \frac{\sin[(i-j)\pi x]}{2(i-j)\pi}\Big|_0^1 + \frac{\sin[(i+j)\pi x]}{2(i+j)\pi}\Big|_0^1 = 0.$$
In particular, the coefficients
$$\langle \bar{S}^1, \varphi_j \rangle = \frac{\sqrt{2}}{j\pi}\,\psi_j(U) \tag{2.3.2}$$
are centered and pairwise uncorrelated, with
$$E\langle \bar{S}^1, \varphi_j \rangle^2 = \frac{2}{j^2\pi^2}\int_0^1 \psi_j^2(x)\,dx = \frac{2}{j^2\pi^2}\Big[\frac{\cos(j\pi t)\sin(j\pi t)}{2j\pi} + \frac{t}{2}\Big]_0^1 = \frac{1}{j^2\pi^2}.$$
Hence, the variances decrease at the rate $j^{-2}$. We may approximate $\bar{S}^1$ by the series in (2.3.1) truncated at some integer k:
$$\bar{S}^1_k = \sum_{j=1}^k \langle \bar{S}^1, \varphi_j \rangle\,\varphi_j.$$
The expected squared $L_2$-error of this approximation equals
$$E\int_0^1 [\bar{S}^1(t) - \bar{S}^1_k(t)]^2\,dt = \sum_{j=k+1}^\infty E\langle \bar{S}^1, \varphi_j \rangle^2 = \sum_{j=k+1}^\infty \frac{1}{j^2\pi^2} = O\Big(\frac{1}{k}\Big).$$
For k = 0 we get
$$E\int_0^1 (\bar{S}^1(t))^2\,dt = \sum_{j=1}^\infty E\langle \bar{S}^1, \varphi_j \rangle^2 = \sum_{j=1}^\infty \frac{1}{j^2\pi^2} = \frac{1}{6}.$$
Moreover, the $\varphi_j$ are the eigenfunctions of the covariance kernel $s \wedge t - st$:
$$\int_0^1 (s \wedge t - st)\,\varphi_j(s)\,ds = \frac{1}{j^2\pi^2}\,\varphi_j(t), \qquad j = 1, 2, \ldots, \quad 0 \le s, t \le 1.$$
Therefore representation (2.3.1) constitutes the principal component decomposition of the Single Event Process.
It is also interesting to determine the distribution of the (uncorrelated) $\psi_j(U)$.
Lemma 2.3.3. $\psi_1(U), \psi_2(U), \ldots$ all have the same distribution. Their Fourier-transform equals the Bessel-function of order zero, i.e.,
$$E\big[e^{it\cos(j\pi U)}\big] = J_0(t) := \sum_{k=0}^\infty \frac{(-1)^k t^{2k}}{(k!)^2\,4^k}.$$
Proof. Set
$$I_1 := \int_0^1 \cos[t\cos(j\pi x)]\,dx \quad \text{and} \quad I_2 := \int_0^1 \sin[t\cos(j\pi x)]\,dx.$$
We first show
$$I_1 = \int_0^1 \cos[t\cos(\pi x)]\,dx, \tag{2.3.3}$$
$$I_2 = \frac{1}{j}\sum_{k=1}^j (-1)^{k-1}\int_0^1 \sin[t\cos(\pi x)]\,dx, \tag{2.3.4}$$
and
$$\int_0^1 \sin[t\cos(\pi x)]\,dx = 0, \tag{2.3.5}$$
so that $I_2 = 0$. For (2.3.3) and (2.3.4), split the range of integration:
$$\int_0^1 \cos[t\cos(j\pi u)]\,du = \sum_{k=1}^j \int_{(k-1)/j}^{k/j} \cos[t\cos(j\pi u)]\,du. \tag{2.3.6}$$
Substituting $u = (y + (k-1))/j$, we get
$$\int_{(k-1)/j}^{k/j} \cos[t\cos(j\pi u)]\,du = \frac{1}{j}\int_0^1 \cos[t\cos(\pi y + (k-1)\pi)]\,dy = \frac{1}{j}\int_0^1 \cos\big[(-1)^{k-1}t\cos(\pi y)\big]\,dy = \frac{1}{j}\int_0^1 \cos[t\cos(\pi x)]\,dx.$$
In the same way,
$$\frac{1}{j}\int_0^1 \sin\big[(-1)^{k-1}t\cos(\pi y)\big]\,dy = \frac{(-1)^{k-1}}{j}\int_0^1 \sin[t\cos(\pi x)]\,dx. \tag{2.3.7}$$
Summation over $k = 1, \ldots, j$ yields (2.3.3) and (2.3.4).
Conclude that, substituting $u = \cos(\pi x)$,
$$\int_0^1 \sin[t\cos(\pi x)]\,dx = \frac{1}{\pi}\int_{-1}^1 \sin(tu)(1-u^2)^{-1/2}\,du = 0,$$
where the last equation follows from the fact that the integrand is odd. Finally,
$$\int_0^1 \cos[t\cos(\pi x)]\,dx = \frac{1}{\pi}\int_{-1}^1 \cos(tu)(1-u^2)^{-1/2}\,du = \frac{2}{\pi}\int_0^1 \cos(tu)(1-u^2)^{-1/2}\,du = \frac{2}{\pi}\sum_{k=0}^\infty \frac{(-1)^k t^{2k}}{(2k)!}\int_0^1 \frac{u^{2k}\,du}{(1-u^2)^{1/2}}.$$
The first equation uses the substitution $u = \cos(\pi x)$ again, while the second uses the fact that the integrand is even. For the last equation, just apply the series expansion of the cosine function. To compute the last integral, apply the substitution once more to get
$$\int_0^1 \frac{u^{2k}\,du}{(1-u^2)^{1/2}} = \int_0^{\pi/2} \cos^{2k}(x)\,dx = \frac{2k-1}{2k}\int_0^{\pi/2} \cos^{2(k-1)}(x)\,dx = \ldots = \frac{\pi}{2}\,\frac{(2k)!}{4^k (k!)^2}.$$
Summarizing,
$$E\big[e^{it\cos(j\pi U)}\big] = \int_0^1 \cos[t\cos(j\pi x)]\,dx + i\int_0^1 \sin[t\cos(j\pi x)]\,dx = \int_0^1 \cos[t\cos(\pi x)]\,dx,$$
which, by the above computations, equals $J_0(t)$.
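The identity of Lemma 2.3.3 can also be verified numerically: for several j the integral $\int_0^1 \cos[t\cos(j\pi x)]\,dx$ should agree with the Bessel series (quadrature scheme and truncation below are our choices).

```python
import math

def char_numeric(t, j, steps=20_000):
    """Midpoint rule for integral_0^1 cos[t cos(j*pi*x)] dx, the real part
    of E[exp(it cos(j*pi*U))]; the imaginary part vanishes."""
    h = 1.0 / steps
    return sum(math.cos(t * math.cos(j * math.pi * (k + 0.5) * h)) * h
               for k in range(steps))

def bessel_j0(t, terms=25):
    """Partial sum of J_0(t) = sum_k (-1)^k t^{2k} / ((k!)^2 4^k)."""
    return sum((-1) ** k * t ** (2 * k) / (math.factorial(k) ** 2 * 4 ** k)
               for k in range(terms))

t = 1.5
values = [char_numeric(t, j) for j in (1, 2, 3)]   # independent of j
```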
Chapter 3
Univariate Empiricals:
The IID Case
3.1
Basic Facts
Since the empirical d.f.
$$F_n(t) = \frac{1}{n}\sum_{i=1}^n 1_{\{X_i \le t\}}$$
is a sample mean of bounded (and hence P-integrable) i.i.d. random variables, the Strong Law of Large Numbers (SLLN) may be applied to get, for each fixed $t \in \mathbb{R}$:
$$\lim_{n\to\infty} F_n(t) = F(t) \quad \text{with probability one.} \tag{3.1.5}$$
In Figure 3.1.1 we have added the true d.f. F to the empirical d.f. F100
already depicted in Figure 1.3.1. These data were independently generated
from an Exp(1)-distribution. It becomes clear that apart from small deviations F100 follows the graph of F very closely.
Finally, the de Moivre-Laplace version of the Central Limit Theorem (CLT) yields for each $t \in \mathbb{R}$, as $n \to \infty$, an approximation for the distribution of a properly standardized $F_n(t)$:
$$n^{1/2}[F_n(t) - F(t)] \xrightarrow{\mathcal{L}} \mathcal{N}(0, \sigma_t^2), \tag{3.1.6}$$
where, according to (3.1.3),
$$\sigma^2 = \sigma_t^2 = F(t)(1 - F(t)).$$
The assertions (3.1.2) and (3.1.3) in Lemma 3.1.2, as well as (3.1.5) and (3.1.6), allow for straightforward extensions to general empirical integrals. To apply the SLLN and CLT, appropriate moment conditions are required which are automatically fulfilled for indicator functions. See also Lemma 2.1.3 for sample size n = 1.
Lemma 3.1.3. Assume that $X_1, \ldots, X_n$ are i.i.d. from a d.f. F, and let $\varphi, \varphi_1, \varphi_2 : \mathbb{R} \to \mathbb{R}$ be arbitrary (Borel measurable) functions. Provided that all integrals on the right-hand side exist, we have:
$$E\Big[\int \varphi\,dF_n\Big] = \int \varphi\,dF \tag{3.1.7}$$
$$\mathrm{Var}\Big[\int \varphi\,dF_n\Big] = n^{-1}\Big[\int \varphi^2\,dF - \Big(\int \varphi\,dF\Big)^2\Big] \tag{3.1.8}$$
$$n^{1/2}\Big[\int \varphi\,dF_n - \int \varphi\,dF\Big] \xrightarrow{\mathcal{L}} \mathcal{N}(0, \sigma^2) \tag{3.1.11}$$
with
$$\sigma^2 = \int \varphi^2\,dF - \Big(\int \varphi\,dF\Big)^2.$$
All statements are classical results from probability theory properly translated into the language of empiricals.
Equation (3.1.4) reveals the covariance structure of $F_n$ when considered as a stochastic process. To get rid of the factor $n^{-1}$ on the right-hand side we need to introduce the standardized process
$$\alpha_n(t) := n^{1/2}[F_n(t) - F(t)], \qquad t \in \mathbb{R}.$$
This process is the so-called empirical process. For sample size n = 1 and for the uniform case it was introduced and studied in Section 2.3. In terms of $\alpha_n$ we immediately get from the preceding results
Lemma 3.1.4. Assume that $X_1, \ldots, X_n$ are i.i.d. from F. Then we have:
$$E[\alpha_n(t)] = 0 \quad \text{for each } t \in \mathbb{R} \tag{3.1.12}$$
$$\mathrm{Var}[\alpha_n(t)] = F(t)(1 - F(t)) \tag{3.1.13}$$
$$\mathrm{Cov}(\alpha_n(s), \alpha_n(t)) = F(s \wedge t) - F(s)F(t) \tag{3.1.14}$$
Note that $\lim_{t\to-\infty} F_n(t) = 0 = \lim_{t\to-\infty} F(t)$ and
$$\lim_{t\to\infty} F_n(t) = 1 = \lim_{t\to\infty} F(t). \tag{3.1.15}$$
Hence
$$\lim_{t\to-\infty} \alpha_n(t) = 0 = \lim_{t\to\infty} \alpha_n(t). \tag{3.1.16}$$
The uniform empirical d.f. is
$$\bar{F}_n(u) = \frac{1}{n}\sum_{i=1}^n 1_{\{U_i \le u\}}, \qquad 0 \le u \le 1. \tag{3.1.17}$$
With $X_i = F^{-1}(U_i)$, $1 \le i \le n$, we have
$$F_n(t) = \bar{F}_n(F(t)), \qquad t \in \mathbb{R}. \tag{3.1.18}$$
Proof. The proof follows from (3.1.17) and the definition of quantile functions. See also Figure 2.2.1.
In Chapter 1 we introduced the rank of an observation as a means to describe
its position within the sample. Let
R = (R1 , . . . , Rn )
denote the vector of ranks. The following result presents a well known
fact, namely that under some weak assumptions R is distribution-free.
Lemma 3.1.6. Let R be the vector of ranks for an i.i.d. sequence of observations from a continuous d.f. F. Then the distribution of R is distribution-free, i.e., does not depend on F.
Proof. Under continuity,
$$F(F^{-1}(u)) = u \quad \text{for all } 0 < u < 1.$$
Moreover, $X_i = F^{-1}(U_i)$ for all $1 \le i \le n$. Altogether, this gives us for each $1 \le i \le n$:
$$R_i = nF_n(X_i) = n\bar{F}_n(F(X_i)) = n\bar{F}_n(F(F^{-1}(U_i))) = n\bar{F}_n(U_i),$$
the second equality following from (3.1.17). We see that F has dropped out. The proof is complete.
Lemma 3.1.6 does not specify the distribution of the vector of ranks. It only guarantees that it suffices to consider uniformly distributed observations. Due to continuity, there will be no ties among the data and each rank is well-defined. The vector R attains its values in the set of all permutations of the integers $1, \ldots, n$. The event $\{R = r\}$ corresponds to a unique ordering of the U's. For independent U's each of the n! possible orderings has equal probability, namely
$$\int \cdots \int_{0 < u_1 < \ldots < u_n < 1} 1\,du_1 \ldots du_n = \frac{1}{n!}.$$
Proof. The proof of this result presents another nice application of (3.1.17). First we have
$$D_n = \sup_{t \in \mathbb{R}} |\bar{F}_n(F(t)) - F(t)|. \tag{3.1.19}$$
Fix $\varepsilon > 0$ and choose finitely many points $0 = u_0 < u_1 < \ldots < u_k = 1$ such that
$$u_{i+1} - u_i \le \varepsilon, \qquad 0 \le i < k. \tag{3.1.20}$$
Then use the monotonicity of $\bar{F}_n$ and the identity function to get, for any u between $u_i$ and $u_{i+1}$:
$$\bar{F}_n(u) - u \le \bar{F}_n(u_{i+1}) - u_i \le |\bar{F}_n(u_{i+1}) - u_{i+1}| + \varepsilon$$
and similarly
$$-|\bar{F}_n(u_i) - u_i| - \varepsilon \le \bar{F}_n(u) - u.$$
In conclusion,
$$\sup_{0 \le u \le 1} |\bar{F}_n(u) - u| \le \max_{0 \le i \le k} |\bar{F}_n(u_i) - u_i| + \varepsilon.$$
The maximum is taken over finitely many points. Hence, by (3.1.5), the max tends to zero with probability one. We obtain
$$\limsup_{n\to\infty} D_n \le \varepsilon \quad \text{P-almost surely.}$$
Since $\varepsilon > 0$ was arbitrary, $\lim_{n\to\infty} D_n = 0$ with probability one. (3.1.21)
Next observe that, because $F_n$ and F are right-hand continuous and have left-hand limits,
$$D_n = \sup_{t \in \mathbb{Q}} |F_n(t) - F(t)|,$$
with $\mathbb{Q}$ denoting the set of all rational numbers. Hence $D_n$, being the supremum of countably many random variables $|F_n(t) - F(t)|$, is again measurable. In particular, the event
$$\Omega_0 = \{\lim_{n\to\infty} D_n = 0\}$$
[Figures 3.1.2-3.1.5: panels "Sample 1 with n = 50" and "Sample 2 with n = 50" for two simulated samples.]
Figures 3.1.4 and 3.1.5 show the plots of $\alpha_n = n^{1/2}(F_n - F)$ for the same set of data. Under the microscope, the roughness of the sample paths becomes apparent. Also, the paths do not explode, so that multiplication with the standardizing factor $n^{1/2}$ seems to keep everything in balance. Mathematically, at least for a fixed t, this is justified by the CLT guaranteeing that the distribution of $\alpha_n(t)$ has a nondegenerate limit.
The next lemma presents an algorithm to compute $D_n$ in finitely many steps.
Lemma 3.1.9. For a continuous F, we have
$$D_n = \max_{1 \le i \le n} \max\Big\{\frac{i}{n} - F(X_{i:n}),\ F(X_{i:n}) - \frac{i-1}{n}\Big\}.$$
Proof. On each interval $[X_{i:n}, X_{i+1:n})$ the deviation is extremal at the endpoints; in particular
$$F(X_{i:n}) - \frac{i-1}{n} = \lim_{t \uparrow X_{i:n}} [F(t) - F_n(t)] \le D_n.$$
Write $D_n = \max(D_n^+, D_n^-)$, where
$$D_n^+ = \sup_{t \in \mathbb{R}} [F_n(t) - F(t)] = \max_{1 \le i \le n}\Big[\frac{i}{n} - F(X_{i:n})\Big]$$
and
$$D_n^- = \sup_{t \in \mathbb{R}} [F(t) - F_n(t)] = \max_{1 \le i \le n}\Big[F(X_{i:n}) - \frac{i-1}{n}\Big]$$
denote the two one-sided deviations between $F_n$ and F. Also $D_n^+$ and $D_n^-$ are distribution-free as long as F is continuous. Moreover, both have the same distribution. This is most easily seen by introducing the variables $U_i = F(X_i)$ and $\tilde{U}_i = 1 - U_i$, $1 \le i \le n$. Under continuity of F, the $\tilde{U}_i$'s are also i.i.d. from $F_U$. Moreover, in an obvious notation, $D_n^+ = \tilde{D}_n^-$, whence $D_n^+ = D_n^-$ in distribution.
Our final results are concerned with the consistency of empirical quantiles. So far we have only studied the consistency of empirical integrals, with special emphasis on $F_n(t)$. Under appropriate integrability conditions on $\varphi$, consistency was a straightforward consequence of the SLLN. For quantiles the situation is different. They are defined through $F_n^{-1}$ and $F^{-1}$, to which a priori no SLLN applies.
We first study uniform quantiles $\bar{F}_n^{-1}(u)$. Lemma 3.1.10 shows that the K-S distance between $\bar{F}_n^{-1}$ and the identity function equals the K-S distance between $\bar{F}_n$ and the identity function.
Lemma 3.1.10. We have
$$\bar{D}_n := \sup_{0 \le u \le 1} |\bar{F}_n(u) - u| = \sup_{0 < u < 1} |\bar{F}_n^{-1}(u) - u| =: \tilde{D}_n.$$
Proof. For $\frac{i-1}{n} < u \le \frac{i}{n}$ we have $\bar{F}_n^{-1}(u) = U_{i:n}$, so that
$$U_{i:n} - \frac{i-1}{n} = \lim_{u \downarrow (i-1)/n}\big[\bar{F}_n^{-1}(u) - u\big] \quad \text{and} \quad U_{i:n} - \frac{i}{n} = \bar{F}_n^{-1}\Big(\frac{i}{n}\Big) - \frac{i}{n}$$
are the extreme deviations of $\bar{F}_n^{-1}$ on this interval. Since $\bar{F}_n(U_{i:n}) = \frac{i}{n}$ and $\lim_{t \uparrow U_{i:n}} \bar{F}_n(t) = \frac{i-1}{n}$, both extremes coincide, up to sign, with deviations of $\bar{F}_n$, whence $\tilde{D}_n \le \bar{D}_n$. Conversely, $\frac{i-1}{n} < u \le \frac{i}{n}$ implies $\bar{F}_n^{-1}(u) = U_{i:n}$ and therefore
$$-\tilde{D}_n \le U_{i:n} - \frac{i}{n} \le \bar{F}_n^{-1}(u) - u \le U_{i:n} - \frac{i-1}{n} \le \tilde{D}_n,$$
from which $\bar{D}_n \le \tilde{D}_n$.
An application of the Glivenko-Cantelli Theorem to a uniform sample together with Lemma 3.1.10 immediately yields, with probability one,
$$\sup_{0 < u < 1} |\bar{F}_n^{-1}(u) - u| = \tilde{D}_n \to 0. \tag{3.1.22}$$
For a general F , things are less obvious. The next lemma gives a positive
answer as to pointwise consistency.
Lemma 3.1.11. Assume that F 1 is continuous at 0 < u < 1. Then
lim Fn1 (u) = F 1 (u) with probability one.
Uniform consistency of $F_n^{-1}$ can in general only hold on compact intervals $[u_0, u_1] \subset (0,1)$. Indeed, when $F^{-1}$ is unbounded,
$$\sup_{0 < u < 1} |F_n^{-1}(u) - F^{-1}(u)| = \infty,$$
since $X_{1:n} \le F_n^{-1}(u) \le X_{n:n}$ and $F^{-1}$ is unbounded. This simple counterexample shows that compactness arguments are indeed crucial to obtain uniform convergence, and that the uniform convergence valid for $F_n$ may not hold for other estimators. Another example is the cumulative hazard function. We have noted in (2.2.4) that for a continuous F
$$\Lambda_F(t) = -\ln[1 - F(t)]$$
and therefore
$$\Lambda_F(t) \to \infty \quad \text{as } t \to b_F,$$
where $b_F$ is the smallest upper bound for the support of F, possibly infinite.
The Nelson-Aalen estimator
$$\Lambda_n(t) = \int_{(-\infty,t]} \frac{F_n(dx)}{1 - F_n(x-)},$$
on the other hand, is finite for every t and can therefore not follow the explosion of $\Lambda_F$ in the right tail.
One may argue that the bad fit of $\Lambda_n$ is caused by t's in the far right tails, namely $t \ge X_{n:n}$. This is true only to a certain extent, since typically a bad fit of an estimating function is not abrupt but takes place gradually. See Figure 1.4.1. In other words, $\Lambda_n(t)$ is not a reliable estimator of $\Lambda(t)$ already for those t's ranging between the extreme order statistics $X_{n:n} > X_{n-1:n} > X_{n-2:n} > \ldots$.
3.2
Finite-Dimensional Distributions
In the last section we have seen that for i.i.d. random variables with distribution $\mu$, we have
$$P(n\mu_n(A) = k) = \binom{n}{k}\,\mu(A)^k (1 - \mu(A))^{n-k}, \qquad k = 0, 1, \ldots, n.$$
Since the number of observations n is fixed, the event $n\mu_n(A) = k$ automatically implies $n\mu_n(\bar{A}) = n - k$. In other words, the event on the left-hand side equals
$$\{n\mu_n(A) = k,\ n\mu_n(\bar{A}) = n - k\}.$$
The collection $\{A, \bar{A}\}$ forms a simple partition of the sample space S. We now discuss the extension to more than two sets. Let $\{A_1, A_2, \ldots, A_m\}$ be a (measurable) partition of the sample space, i.e.,
$$A_i \cap A_j = \emptyset \ \text{for } i \ne j, \quad i, j = 1, \ldots, m, \quad \text{and} \quad \bigcup_{i=1}^m A_i = S.$$
This is the setting of the classical $\chi^2$-test, with statistic
$$\chi^2 = \sum_{i=1}^m \frac{n[\mu_n(A_i) - \mu_0(A_i)]^2}{\mu_0(A_i)}. \tag{3.2.1}$$
For increasing sets $C_1 \subset \ldots \subset C_m$ and counts $0 \le k_1 \le \ldots \le k_m \le n$ one obtains, with $A_i = C_i \setminus C_{i-1}$ and $n_i = k_i - k_{i-1}$ (where $C_0 = \emptyset$, $k_0 = 0$, $C_{m+1} = S$, $k_{m+1} = n$),
$$P(n\mu_n(C_i) = k_i \ \text{for } 1 \le i \le m) = \binom{n}{n_1, n_2, \ldots, n_{m+1}} \prod_{i=1}^{m+1} [\mu(A_i)]^{n_i}.$$
Though Lemmas 3.2.1 and 3.2.2 provide us with exact distributions the
formulas are not tractable for moderate and large sample sizes. Moreover,
for a deeper forthcoming analysis they are also not very insightful to detect
hidden structures. In particular, if we want to study the dynamic behavior
of empirical and related processes it is useful to restate the last lemma in
terms of conditional probabilities.
Lemma 3.2.3. In the situation of Lemma 3.2.2 we have
$$P(n\mu_n(C_m) = k_m \mid n\mu_n(C_i) = k_i \ \text{for } 1 \le i \le m-1) = P(n\mu_n(C_m) = k_m \mid n\mu_n(C_{m-1}) = k_{m-1})$$
$$= \binom{n - k_{m-1}}{k_m - k_{m-1}} \Big[\frac{\mu(C_m \setminus C_{m-1})}{1 - \mu(C_{m-1})}\Big]^{k_m - k_{m-1}} \Big[1 - \frac{\mu(C_m \setminus C_{m-1})}{1 - \mu(C_{m-1})}\Big]^{n - k_m}.$$
Proof. Write the conditional probability as a ratio of two probabilities and apply the last lemma.
Lemma 3.2.3 states that $n\mu_n$ has a Markov-property when it is evaluated on increasing sets. Furthermore, the transition probability equals a (shifted) binomial distribution with new parameters $n - k_{m-1}$ and $\mu(C_m \setminus C_{m-1})/(1 - \mu(C_{m-1}))$. More precisely, given $n\mu_n(C_{m-1}) = k_{m-1}$, $n\mu_n(C_m)$ has the same distribution as $k_{m-1} + (n - k_{m-1})\mu_{n-k_{m-1}}(C_m \setminus C_{m-1})$, where the data defining the last empirical distribution are i.i.d. from $\mu$ restricted (and re-normalized) to the complement of $C_{m-1}$.
For real-valued data, the most important $C_i$'s are again the extended intervals $(-\infty, t_i]$. Because of their importance, the preceding results are summarized for these C's in the following theorem. It constitutes the extension of Lemma 2.1.4 to sample size n > 1.
Theorem 3.2.4. Assume that $X_1, \ldots, X_n$ are i.i.d. from F. Then $F_n$ is a Markov-process such that for $t_1 < t_2 < \ldots < t_m$ and $0 \le k_1 \le k_2 \le \ldots \le k_m \le n$:
$$P(nF_n(t_m) = k_m \mid nF_n(t_{m-1}) = k_{m-1}) = \binom{n - k_{m-1}}{k_m - k_{m-1}} \Big[\frac{F(t_m) - F(t_{m-1})}{1 - F(t_{m-1})}\Big]^{k_m - k_{m-1}} \Big[1 - \frac{F(t_m) - F(t_{m-1})}{1 - F(t_{m-1})}\Big]^{n - k_m}.$$
Conditionally on $nF_n(t_0) = k$, the process $nF_n(t)$ on $t \ge t_0$ has the same distribution as the process
$$t \mapsto k + (n-k)\,\bar{F}_{n-k}\Big(\frac{F(t) - F(t_0)}{1 - F(t_0)}\Big).$$
Similarly, for the process in reverse time, we have that conditionally on $nF_n(t_0) = k$, $nF_n(t)$ on $t \le t_0$ has the same distribution as the process
$$t \mapsto k\,\bar{F}_k\Big(\frac{F(t)}{F(t_0)}\Big).$$
Informally speaking, we see that conditionally on $F_n(t_0) = k/n$, the process $nF_n$ on $t \ge t_0$ resp. $t \le t_0$ equals in distribution a uniform empirical d.f. with sample size and time transformation properly adjusted. These statements are (hopefully) easier to remember than those in Theorems 3.2.4 and 3.2.5. They readily enable us to compute conditional expectations. For example, in the forward case, let $\mathcal{F}_t = \sigma(F_n(s) : s \le t)$. Then the first statement in Theorem 3.2.6 implies for $t_0 \le t$:
$$E[nF_n(t) \mid \mathcal{F}_{t_0}] = nF_n(t_0) + (n - nF_n(t_0))\,\frac{F(t) - F(t_0)}{1 - F(t_0)}, \tag{3.2.2}$$
while in the reverse case, for $t \le t_0$,
$$E[nF_n(t) \mid \sigma(F_n(s) : s \ge t_0)] = nF_n(t_0)\,\frac{F(t)}{F(t_0)}. \tag{3.2.3}$$
In particular, the process $t \mapsto \frac{1 - F_n(t)}{1 - F(t)}$ is a forward martingale, while the process $t \mapsto \frac{F_n(t)}{F(t)}$, $t \in \mathbb{R}$, is a martingale in reverse time. (3.2.6)
Since $\mu_n(A_m) = 1 - \mu_n(A_1) - \ldots - \mu_n(A_{m-1})$, the limit distribution of $(\mu_n(A_1), \ldots, \mu_n(A_m))$ when properly standardized will be degenerate. Therefore we restrict ourselves to the first m − 1 coordinates.
Lemma 3.2.9. Let $A_1, \ldots, A_m$ be a partition of the sample space. Then, as $n \to \infty$,
$$n^{1/2}\big[\mu_n(A_1) - \mu(A_1), \ldots, \mu_n(A_{m-1}) - \mu(A_{m-1})\big]^T \xrightarrow{\mathcal{L}} \mathcal{N}_{m-1}(0, \Sigma),$$
where $\Sigma$ is the $(m-1) \times (m-1)$ matrix
$$\Sigma = \begin{pmatrix} p_1(1-p_1) & -p_1 p_2 & \ldots & -p_1 p_{m-1} \\ -p_2 p_1 & p_2(1-p_2) & & \vdots \\ \vdots & & \ddots & \vdots \\ -p_{m-1} p_1 & \ldots & \ldots & p_{m-1}(1-p_{m-1}) \end{pmatrix}$$
and $p_i = \mu(A_i)$.
Proof. The assertion follows from the multivariate CLT. For the covariance
matrix, recall (2.1.6).
We only mention the well-known fact that Lemma 3.2.9 governs the limit distribution of $\chi^2$ from (3.2.1). Actually, setting $\hat{p}_i = \mu_n(A_i)$, we get
$$\chi^2 = n(\hat{p}_1 - p_1, \ldots, \hat{p}_{m-1} - p_{m-1})\,I\,(\hat{p}_1 - p_1, \ldots, \hat{p}_{m-1} - p_{m-1})^T$$
with
$$I = \begin{pmatrix} \frac{1}{p_1} + \frac{1}{p_m} & \frac{1}{p_m} & \ldots & \frac{1}{p_m} \\ \frac{1}{p_m} & \frac{1}{p_2} + \frac{1}{p_m} & & \vdots \\ \vdots & & \ddots & \vdots \\ \frac{1}{p_m} & \ldots & \ldots & \frac{1}{p_{m-1}} + \frac{1}{p_m} \end{pmatrix}.$$
Hence
$$\chi^2 \xrightarrow{\mathcal{L}} Z,$$
where Z has a $\chi^2$-distribution with m − 1 degrees of freedom.
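The quadratic-form representation of $\chi^2$ can be checked numerically (cell probabilities and frequencies below are arbitrary, subject to summing to one):

```python
import numpy as np

p = np.array([0.2, 0.3, 0.1, 0.4])          # true cell probabilities, m = 4
p_hat = np.array([0.25, 0.25, 0.15, 0.35])  # observed relative frequencies
n = 50

chi2_direct = n * np.sum((p_hat - p) ** 2 / p)

x = (p_hat - p)[:-1]                        # first m-1 coordinates
I = np.diag(1.0 / p[:-1]) + np.ones((3, 3)) / p[-1]
chi2_quadratic = n * x @ I @ x
```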
3.3 Order Statistics
In this section we collect some basic distributional properties of order statistics. Special emphasis is given to order statistics from the uniform and the exponential distribution. Order statistics are important for at least three reasons:
• Special quantiles like the median, upper and lower quartiles or, more generally, linear combinations of order statistics are important (robust) competitors to estimators based on empirical moments (like sample means or sample variances).
• As Lemma 3.1.9 has shown, properly transformed order statistics may appear in distribution-free statistics. To determine the distribution of these statistics, a detailed study of order statistics will be indispensable.
• In some data situations, e.g., in life testing, n items are put on test and monitored until the r-th failure. In such a situation only $X_{1:n}, \ldots, X_{r:n}$ are observable and any statistical inference therefore needs to be based on them.
Our first result deals with the distribution of a single X_{i:n}. For example, when i = n, we get

    P(X_{n:n} ≤ x) = P(X_i ≤ x for all i = 1, …, n) = F^n(x),  by independence.

For i = 1, we obtain

    P(X_{1:n} ≤ x) = 1 − P(X_{1:n} > x) = 1 − P(X_i > x for all i = 1, …, n) = 1 − [1 − F(x)]^n.

These examples are two special cases of the following more general result.

Lemma 3.3.1. Let X_1, …, X_n be a sample of independent observations from a d.f. F. Then

    G_i(x) = P(X_{i:n} ≤ x) = Σ_{k=i}^{n} (n choose k) F^k(x) (1 − F(x))^{n−k},   1 ≤ i ≤ n.
Proof. Use the equivalence {X_{i:n} ≤ x} = {nF_n(x) ≥ i} and (3.1.1).
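Lemma 3.3.1 lends itself to a quick numerical sanity check. The following sketch (function names are ours, F uniform on [0,1]) compares the binomial-tail formula with a seeded Monte Carlo estimate of P(U_{i:n} ≤ x):

```python
import math
import random

def G(i, n, x):
    # binomial-tail formula of Lemma 3.3.1 for F = uniform on [0,1]
    return sum(math.comb(n, k) * x**k * (1 - x)**(n - k) for k in range(i, n + 1))

def mc_prob(i, n, x, reps=200_000, seed=1):
    # Monte Carlo estimate of P(U_{i:n} <= x)
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        u = sorted(rng.random() for _ in range(n))
        if u[i - 1] <= x:
            hits += 1
    return hits / reps

n, i, x = 5, 3, 0.6
print(abs(mc_prob(i, n, x) - G(i, n, x)) < 0.01)  # True
```

With 200,000 replications the Monte Carlo error is far below the 0.01 tolerance.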
The function G_i only gives limited information about the distributional character of order statistics. Therefore we next determine the joint distribution of all X_{i:n}'s. Since by construction X_{1:n} ≤ X_{2:n} ≤ … ≤ X_{n:n}, this distribution is supported by the set K of all n-vectors x = (x_1, …, x_n) satisfying x_1 ≤ x_2 ≤ … ≤ x_n. The class of rectangles ∏_{i=1}^{n} (a_i, b_i] with a_1 ≤ b_1 ≤ … ≤ a_n ≤ b_n uniquely determines a distribution on K. From the i.i.d. property of the X_i's, we obtain

    P(a_i < X_{i:n} ≤ b_i for 1 ≤ i ≤ n) = n! ∏_{i=1}^{n} [F(b_i) − F(a_i)].
When F admits a Lebesgue density f, the X_{i:n}'s also have a Lebesgue density supported by K.

Lemma 3.3.2. Assume that X_1, …, X_n are independent from a density f. Then (X_{1:n}, …, X_{n:n}) has the Lebesgue density

    (x_1, …, x_n) ↦ n! ∏_{i=1}^{n} f(x_i)  for x_1 ≤ x_2 ≤ … ≤ x_n,  and 0 elsewhere.
In particular, let E_{1:n} ≤ … ≤ E_{n:n} be the order statistics of a sample from the standard exponential distribution, and set Y_i = (n − i + 1)(E_{i:n} − E_{i−1:n}), 1 ≤ i ≤ n, with E_{0:n} ≡ 0. Then

    E_{r:n} = Σ_{i=1}^{r} (E_{i:n} − E_{i−1:n}) = Σ_{i=1}^{r} Y_i/(n − i + 1),   1 ≤ r ≤ n.        (3.3.1)

The interesting fact about this representation is that the Y_i's have a simple distributional structure. See Rényi (1953).

Lemma 3.3.3. The random variables Y_1, …, Y_n are i.i.d. from Exp(1).

Proof. We have

    (Y_1, …, Y_n)^T = A (E_{1:n}, …, E_{n:n})^T,

where

    A = [  n        0       0     …     0
         −(n−1)   n−1      0     …     0
           0     −(n−2)   n−2    …     0
           ⋮               ⋱      ⋱     ⋮
           0       …      −2      2     0
           0       …              −1    1 ].

The map x ↦ Ax is one-to-one from {0 < x_1 < … < x_n} onto {y_i > 0, 1 ≤ i ≤ n} with det A = n!, and Σ_{i=1}^{n} y_i = Σ_{i=1}^{n} x_i. Hence, by Lemma 3.3.2, the density of (Y_1, …, Y_n) equals

    n! exp(−Σ_{i=1}^{n} x_i)/n! = exp(−Σ_{i=1}^{n} y_i),   y_1, …, y_n > 0,

the density of n independent Exp(1) variables.

Recall that E = −ln(1 − U) is standard exponential when U is uniform, so that in distribution E_{i:n} = −ln(1 − U_{i:n}). Hence, by (3.3.1),

    Q_r := [(1 − U_{r:n})/(1 − U_{r−1:n})]^{n−r+1} = exp(−Y_r).        (3.3.2)

We may therefore conclude from Lemma 3.3.3 the following Corollary. See Malmquist (1950).

Corollary 3.3.4. The random variables Q_1, …, Q_n are i.i.d. from F_U.
There is another interesting variant of Lemma 3.3.3 which brings the cumulative hazard function Λ_F back into play.

Corollary 3.3.5. Assume that X_1, …, X_n are i.i.d. from a continuous F. Then the variables

    Z_i = (n − i + 1) [Λ_F(X_{i:n}) − Λ_F(X_{i−1:n})],   1 ≤ i ≤ n,

are i.i.d. from Exp(1).

Note also that, by (3.3.1), jointly in r and in distribution,

    1 − U_{r:n} = exp[−Σ_{i=1}^{r} Y_i/(n − i + 1)].        (3.3.3)
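The Rényi representation behind Lemma 3.3.3 is easy to probe numerically. The sketch below (our own code, seeded) forms the normalized spacings Y_i = (n − i + 1)(E_{i:n} − E_{i−1:n}) from exponential samples and checks that each has sample mean close to 1, as Exp(1) variables must:

```python
import random

rng = random.Random(7)
n, reps = 6, 20_000
y_sum = [0.0] * n
for _ in range(reps):
    e = sorted(rng.expovariate(1.0) for _ in range(n))
    prev = 0.0
    for i in range(n):
        # Y_{i+1} = (n - i)(E_{i+1:n} - E_{i:n})  (0-based i)
        y_sum[i] += (n - i) * (e[i] - prev)
        prev = e[i]

means = [s / reps for s in y_sum]
print(all(abs(m - 1.0) < 0.05 for m in means))  # True
```

With 20,000 replications the standard error of each mean is about 0.007, well inside the tolerance.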
101
Since the Y s are independent, the partial sums are Markovian. So are
U1:n , . . . , Un:n . Moreover, for k + 1 j n, we have
[ j
]
Yi
Uj:n = 1 exp
ni+1
i=1
[ k
]
[
]
j
Yi
Yi
= 1 exp
exp
,
ni+1
ni+1
i=1
i=k+1
whence
Uj:n Uk:n
1 Uk:n
= 1 exp
[
i=k+1
Yi
ni+1
Yi
= 1 exp
(n k) (i k) + 1
i=k+1
[ jk
]
Yi+k
= 1 exp
nki+1
(3.3.4)
]
i=1
= Ujk:nk .
From this we immediately get the following lemma.
Lemma 3.3.6. For each n 1, U1:n , . . . , Un:n is a Markov sequence such
that for 0 y1 . . . yk 1:
)
(
Ui:n yk
, k + 1 i n|U1:n = y1 , . . . , Uk:n = yk
L
1 yk
= L(U1:nk , . . . , Unk:nk ).
Here again L stands for distribution.
The following lemma presents a reverse Markov property for the U_{i:n}'s.

Lemma 3.3.7. For each n ≥ 1 and all 0 < y_{k+1} ≤ … ≤ y_n < 1 we have

    L(U_{i:n}, 1 ≤ i ≤ k | U_{k+1:n} = y_{k+1}, …, U_{n:n} = y_n) = L(y_{k+1} U_{i:k}, 1 ≤ i ≤ k).

Proof. Put Ū_i = 1 − U_i. The Ū_i's are again i.i.d. from F_U. Since U_{i:n} = 1 − Ū_{n−i+1:n}, we get from the preceding lemma, for any a_1 ≤ b_1 ≤ … ≤ a_k ≤ b_k ≤ y_{k+1}:

    P(a_i < U_{i:n} ≤ b_i for 1 ≤ i ≤ k | U_{k+1:n} = y_{k+1}, …, U_{n:n} = y_n)
      = P(1 − b_i ≤ Ū_{n−i+1:n} < 1 − a_i for 1 ≤ i ≤ k | Ū_{n−k:n} = 1 − y_{k+1}, …, Ū_{1:n} = 1 − y_n)
      = P(y_{k+1} − b_i ≤ y_{k+1} Ū_{k−i+1:k} < y_{k+1} − a_i for 1 ≤ i ≤ k)
      = P(a_i < y_{k+1} U_{i:k} ≤ b_i for 1 ≤ i ≤ k).

Our final result in this direction is the following

Lemma 3.3.8. For 1 < r < n and 0 < x < 1, given U_{r:n} = x, the random vectors (U_{1:n}, …, U_{r−1:n}) and (U_{r+1:n}, …, U_{n:n}) are independent and distributed as (xU_{1:r−1}, …, xU_{r−1:r−1}) and (x + (1 − x)U_{1:n−r}, …, x + (1 − x)U_{n−r:n−r}), respectively.
When F has a derivative f = F′, then G_i also has a Lebesgue density, which may be immediately derived from the formula in Lemma 3.3.1. In the following we shall rather apply the Markov structure of the U_{i:n}'s to get some experience with a recursion technique, which also applies in a conditional setting.

First,

    G_n(x) = P(U_{n:n} ≤ x) = x^n  for 0 ≤ x ≤ 1.

Furthermore, for 1 ≤ i < n and 0 ≤ x ≤ 1, we have

    G_i(x) = P(U_{i:n} ≤ x) = P(U_{i+1:n} ≤ x) + ∫_x^1 P(U_{i:n} ≤ x | U_{i+1:n} = y) G_{i+1}(dy)
           = G_{i+1}(x) + x^i ∫_x^1 y^{−i} G_{i+1}(dy),        (3.3.5)

since, by Lemma 3.3.7, given U_{i+1:n} = y the variable U_{i:n} is distributed as yU_{i:i}, so that P(U_{i:n} ≤ x | U_{i+1:n} = y) = (x/y)^i.

This recursion will be the key tool to compute the density g_i of U_{i:n}.
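Before exploiting it, the recursion (3.3.5) can be checked numerically against the closed form of Lemma 3.3.1 (code and helper names are ours; the Stieltjes integral is approximated by a midpoint sum over a fine grid):

```python
import math

def G(i, n, x):
    # P(U_{i:n} <= x), Lemma 3.3.1 with uniform F
    return sum(math.comb(n, k) * x**k * (1 - x)**(n - k) for k in range(i, n + 1))

n, i, x = 7, 3, 0.4
m = 5_000
total = 0.0
for j in range(m):
    a = x + (1 - x) * j / m
    b = x + (1 - x) * (j + 1) / m
    ymid = 0.5 * (a + b)
    # midpoint Riemann-Stieltjes sum for  int_x^1 y^(-i) G_{i+1}(dy)
    total += ymid ** (-i) * (G(i + 1, n, b) - G(i + 1, n, a))

lhs = G(i, n, x)
rhs = G(i + 1, n, x) + x**i * total
print(abs(lhs - rhs) < 1e-4)  # True
```

The midpoint-rule error is of smaller order than the stated tolerance.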
Recall the Beta function

    B(a, b) = ∫_0^1 t^{a−1} (1 − t)^{b−1} dt = Γ(a)Γ(b)/Γ(a + b),   a, b > 0,

with Γ(a) = (a − 1)! for a ∈ N. Starting from g_n(x) = n x^{n−1} = x^{n−1}/B(n, 1) and differentiating (3.3.5), we obtain by backward induction

    g_i(x) = i x^{i−1} ∫_x^1 y^{−i} y^i (1 − y)^{n−i−1} dy / B(i + 1, n − i)
           = i x^{i−1} ∫_x^1 (1 − y)^{n−i−1} dy / B(i + 1, n − i)
           = x^{i−1} (1 − x)^{n−i}/B(i, n − i + 1),

i.e., U_{i:n} has a Beta(i, n − i + 1)-distribution.
For the moments we obtain

    E U_{i:n}^j = B(j + i, n − i + 1)/B(i, n − i + 1).        (3.3.6)

In particular,

    μ_{i,n} := E U_{i:n} = i/(n + 1)  and  E U_{i:n}² = i(i + 1)/[(n + 1)(n + 2)].

From this,

    Var U_{i:n} = μ_{i,n}(1 − μ_{i,n})/(n + 2).

These formulae may also be easily obtained by using (3.3.3). For example, since

    E[exp(−λE)] = 1/(1 + λ)   for λ > 0

when E is standard exponential, we get

    E[1 − U_{r:n}] = ∏_{i=1}^{r} [1/(1 + 1/(n − i + 1))] = ∏_{i=1}^{r} (n − i + 1)/(n − i + 2)
                   = (n − r + 1)/(n + 1) = 1 − r/(n + 1),

whence

    E U_{r:n} = r/(n + 1).
Furthermore, putting j = k + 1 in (3.3.4), we get

    U_{k+1:n} = e^{−Y_{k+1}/(n−k)} U_{k:n} + [1 − e^{−Y_{k+1}/(n−k)}],

so that

    E[U_{k+1:n} | U_{k:n}] = [(n − k)/(n − k + 1)] U_{k:n} + 1/(n − k + 1).

The reverse Markov property of the U_{i:n}'s gives rise to a reverse martingale. The proof is similar to that of Lemma 3.3.12 and therefore omitted.

Lemma 3.3.13. The sequence

    (n + 1)U_{i:n}/i,   1 ≤ i ≤ n,

is a reverse martingale.
The rectangle rule for uniform order statistics may be verified through the representation by sums of exponentials. In the increment variables the probability in question becomes

    ∫_{a_1}^{b_1} ∫_{a_2−x_1}^{b_2−x_1} ⋯ ∫_{a_{n+1}−x_1−⋯−x_n}^{b_{n+1}−x_1−⋯−x_n} exp[−(x_1 + ⋯ + x_{n+1})] dx_{n+1} ⋯ dx_1.

Substituting y_{n+1} = x_{n+1} + (x_1 + … + x_n) one obtains for the last integral ∫_{a_{n+1}}^{b_{n+1}} e^{−y_{n+1}} dy_{n+1}, which no longer depends on x_1, …, x_n. Iterating this decoupling and using

    ∫_0^∞ x^n exp(−x) dx = n!,

one ends up with n! ∏_{i=1}^{n} (b_i − a_i), as desired.
The results presented in this section are fundamental tools for proving various properties of empirical d.f.'s. In the following section they will be used to get some exact boundary crossing probabilities for F_n.

3.4  Some Selected Boundary Crossing Probabilities

In this section we study

    sup_{t: 0<F(t)} F_n(t)/F(t).        (3.4.1)

By the SLLN, for each fixed t with F(t) > 0,

    F_n(t)/F(t) → 1  with probability one.

For uniformly distributed observations the supremum in (3.4.1) equals max_{1≤k≤n} k/(nU_{k:n}), so that its distribution is determined by

    G_n(λ) := P( min_{1≤k≤n} nU_{k:n}/k > λ ) = 1 − λ,   0 ≤ λ ≤ 1.        (3.4.2)
To verify (3.4.2), recall that U_{n:n} has the Lebesgue density ny^{n−1} on 0 < y < 1. Thus, for n > 1, we may apply Lemma 3.3.7 and use the validity of (3.4.2) for n − 1 to get

    G_n(λ) = ∫_λ^1 G_{n−1}( λ(n − 1)/(ny) ) ny^{n−1} dy
           = ∫_λ^1 [1 − λ(n − 1)/(ny)] ny^{n−1} dy = 1 − λ.
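The distribution-free identity (3.4.2) is striking enough to deserve a simulation check (our own sketch, seeded):

```python
import random

def gmin(n, lam, reps=100_000, seed=3):
    # Monte Carlo estimate of P(min_k n U_{k:n}/k > lam)
    rng = random.Random(seed)
    count = 0
    for _ in range(reps):
        u = sorted(rng.random() for _ in range(n))
        if min(n * u[k] / (k + 1) for k in range(n)) > lam:
            count += 1
    return count / reps

n, lam = 8, 0.4
print(abs(gmin(n, lam) - (1 - lam)) < 0.01)  # True
```

Note that the answer 1 − λ does not depend on the sample size n, in line with (3.4.2).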
If we set λ^{−1} = 1 + ε with ε ≥ 0, (3.4.2) yields

    P(F_n(t) ≤ (1 + ε)F(t) for all t) ≥ 1 − (1 + ε)^{−1} = ε/(1 + ε).        (3.4.3)

This inequality is also true when F(t) = 0, since then with probability one all data exceed t so that F_n(t) = 0. Since for uniformly distributed data F(t) = t is a linear function, the bound (3.4.3) is sometimes called a linear bound (on the F-scale) for F_n.
Next we investigate lower bounds corresponding to (3.4.3). Again it suffices to consider the case of uniformly distributed U's.

For 0 ≤ s ≤ n and a, b > 0 such that a + sb < 1, put

    P_{ns}(a, b) = P(U_{k:n} ≤ a + (k − 1)b for 1 ≤ k ≤ s, U_{s+1:n} > a + sb),

with U_{n+1:n} ≡ 1.

Lemma 3.4.2 (Dempster). We have

    P_{ns}(a, b) = (n choose s) a (a + sb)^{s−1} (1 − a − sb)^{n−s}.

Proof. The assertion is true for n = 1. Also, for a general n, it holds for s = 0. For n ≥ 2 and s ≥ 1 we obtain upon conditioning on U_{1:n}:

    P_{ns}(a, b) = ∫_0^a n(1 − y)^{n−1} P_{n−1,s−1}( (a + b − y)/(1 − y), b/(1 − y) ) dy.
)
max nUk:n /k >
1kn
n1
(
s=0
)
[
]
n ( )s
(1 + s) ns
s1
(1 + s)
1
s
n
n
s=0
)
max nUk:n /k >
1kn
( )
,
n n
s=0
[
]
n1
(n) ( )s
1 + s) ns
s1
(1 + s)
1
=
.
n
n
s
=
n1
Pns
s=0
1kn1
so that
(
1
P inf Fn (t)/t <
t>U1:n
(
=P
)
max nUk+1:n /k > .
1kn1
Conditioning on U_{1:n} and applying the same first-crossing decomposition to the remaining sample yields

    P( max_{1≤k≤n−1} nU_{k+1:n}/k > λ )
      = (1 − λ/n)^n + ∫_0^{λ/n} n(1 − y)^{n−1} Σ_{s=0}^{r} P_{n−1,s}( (λ − ny)/(n(1 − y)), λ/(n(1 − y)) ) dy,

where the first term accounts for {U_{1:n} > λ/n} and r is the largest integer < n/λ. Applying Dempster's formula we get the following Corollary.

Corollary 3.4.4 (Chang). For 0 < λ < n,

    P( max_{1≤k≤n−1} nU_{k+1:n}/k > λ )
      = (1 − λ/n)^n + Σ_{i=1}^{r+1} (n choose i) (λ/n)^i (1 − iλ/n)^{n−i} (i − 1)^{i−1}.

Letting n → ∞ we obtain

    P( max_{1≤k≤n−1} nU_{k+1:n}/k > λ ) → e^{−λ} + Σ_{k=1}^{∞} (λe^{−λ})^k (k − 1)^{k−1}/k! .
Recall

    D_n^+ = sup_{t∈R} [F_n(t) − F(t)],

whose distribution does not depend on F when F is continuous.

Theorem 3.4.5 (Birnbaum-Tingey). For all 0 < x ≤ 1 and n ≥ 1 we have

    P(D_n^+ ≤ x) = 1 − x Σ_{j=0}^{⟨n(1−x)⟩} (n choose j) (1 − x − j/n)^{n−j} (x + j/n)^{j−1},        (3.4.4)

where ⟨·⟩ denotes the integer part.
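The Birnbaum–Tingey formula is exact for every n, which makes it easy to test against simulation. A sketch (our own code, seeded; F uniform, so D_n^+ = max_k(k/n − U_{k:n})):

```python
import math
import random

def bt(n, x):
    # right-hand side of (3.4.4); the 1e-9 guards the float floor
    top = int(n * (1 - x) + 1e-9)
    s = sum(math.comb(n, j) * (1 - x - j / n) ** (n - j) * (x + j / n) ** (j - 1)
            for j in range(top + 1))
    return 1 - x * s

def mc(n, x, reps=100_000, seed=11):
    rng = random.Random(seed)
    c = 0
    for _ in range(reps):
        u = sorted(rng.random() for _ in range(n))
        dplus = max((k + 1) / n - u[k] for k in range(n))
        if dplus <= x:
            c += 1
    return c / reps

n, x = 10, 0.3
print(abs(mc(n, x) - bt(n, x)) < 0.01)  # True
```

For n = 1 the formula reduces to P(D_1^+ ≤ x) = x, since D_1^+ = 1 − U.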
Proof. We may assume F = F_U, so that P(D_n^+ < x) = P(U_{j:n} > j/n − x for all j). For 0 ≤ K ≤ n set

    J(x, n, K) = n! ∫_{1−x}^{1} ∫_{1−x−1/n}^{y_n} ⋯ ∫_{1−x−(n−K)/n}^{y_{K+1}} ∫_0^{y_K} ⋯ ∫_0^{y_2} dy_1 ⋯ dy_n,

so that J(x, n, K) = P(D_n^+ < x) whenever the omitted constraints y_j > j/n − x, j < K, are void. By induction we obtain

    ∫_0^{y_2} ⋯ ∫_0^{y_K} dy_1 ⋯ dy_{K−1} = y_K^{K−1}/(K − 1)! .

Hence, integrating over y_K,

    J(x, n, K) = J(x, n, K + 1) − (1/K!) (1 − x − (n − K)/n)^K I(x, n, K + 1),

where

    I(x, n, K + 1) = n! ∫_{1−x}^{1} ∫_{1−x−1/n}^{y_n} ⋯ ∫_{1−x−(n−K−1)/n}^{y_{K+2}} dy_{K+1} ⋯ dy_n,

whence

    J(x, n, K) = J(x, n, n) − Σ_{i=K}^{n−1} (1/i!) (1 − x − (n − i)/n)^i I(x, n, i + 1).

Note that

    J(x, n, n) = 1 − (1 − x)^n,

while for the I's we obtain through induction on i:

    I(x, n, i + 1) = n! x (x + (n − i)/n)^{n−i−1}/(n − i)!,   i = K, …, n − 1.

Inserting these identities and substituting j = n − i, we arrive at (3.4.4).
Chapter 4

U-Statistics

4.1  Introduction

Throughout this chapter we study statistics built from a kernel h of k arguments, namely full sums of the form

    Σ_{i_1=1}^{n} ⋯ Σ_{i_k=1}^{n} h(X_{i_1}, …, X_{i_k})

and, more importantly, the associated U-statistic

    U_n = [(n − k)!/n!] Σ h(X_{i_1}, …, X_{i_k}),

where the last sum extends over all k-tuples of pairwise distinct indices i_1, …, i_k.
For two linear statistics

    L_{jn} = Σ_{i=1}^{n} φ_j(X_i),   j = 1, 2,

we obtain

    a_1 L_{1n} + a_2 L_{2n} = Σ_{i=1}^{n} (a_1 φ_1 + a_2 φ_2)(X_i),

which is a linear statistic with score function a_1 φ_1 + a_2 φ_2. In other words, the linear statistics form a subspace of all statistics. We enlarge this subspace to the subspace of all sums, where the transformations of the data may differ from i to i:

    L = Σ_{i=1}^{n} φ_i(X_i).        (4.1.1)
In the following section we present a famous result due to Hájek, who was able to compute the projection of a general square-integrable statistic onto the class (4.1.1). Then we shall apply Hájek's projection lemma to U-statistics.
4.2  The Hájek Projection Lemma
The Projection Lemma which we will discuss now was established by Hájek (1968) and then applied to achieve asymptotic normality of linear rank statistics. Much earlier, Hoeffding (1948) had already used projection techniques to linearize U-statistics.

The original Hájek projection lemma only assumes independence of the underlying variables. Equality of the distributions is not required.
Lemma 4.2.1 (Hájek). Let X_1, …, X_n be independent random variables and let S = S(X_1, …, X_n) be a square-integrable function (statistic) of X_1, …, X_n. Set

    Ŝ = Σ_{i=1}^{n} E(S|X_i) − (n − 1)ES.

Then:

(i) EŜ = ES.

(ii) E(S − Ŝ)² = Var(S) − Var(Ŝ).

(iii) For any L of the form (4.1.1), one has

    E(S − L)² = E(S − Ŝ)² + E(Ŝ − L)²,

i.e., Ŝ minimizes the distance between S and the class (4.1.1).

Proof. For (iii), write L = Σ_i φ_i(X_i) and note

    E[(S − Ŝ)(Ŝ − L)] = Σ_{i=1}^{n} E[(S − Ŝ)(E(S|X_i) − φ_i(X_i))]
        = Σ_{i=1}^{n} E[(E(S|X_i) − φ_i(X_i)) E(S − Ŝ|X_i)].

Because of independence,

    E(E(S|X_j)|X_i) = E(S)        for i ≠ j,
    E(E(S|X_j)|X_i) = E(S|X_i)   for i = j.
It follows that E(Ŝ|X_i) = E(S|X_i), whence

    E[(S − Ŝ)(Ŝ − L)] = 0,

which proves (iii). Assertions (i) and (ii) follow by direct computation.
(4.2.1)
i=1
E(S|Xi , F) (n 1)E(S|F).
i=1
Then
(i) E(S|F)
= E(S|F)
2 |F] = Var(S|F) Var(S|F)
(ii) E[(S S)
(iii) For any admissable L, one has
2 |F] + E[(S L)2 |F],
E[(S L)2 |F] = E[(S S)
i.e., S minimizes the left-hand side.
Proof. Similarly as before, for admissible L,

    E[(S − Ŝ)(Ŝ − L)|F] = Σ_{i=1}^{n} E[(S − Ŝ)(E(S|X_i, F) − Z_i)|F]
        = Σ_{i=1}^{n} E{ (E(S|X_i, F) − Z_i) E[S − Ŝ|X_i, F] | F }.

Because of (conditional) independence,

    E(E(S|X_j, F)|X_i, F) = E(S|F)          for i ≠ j,
    E(E(S|X_j, F)|X_i, F) = E(S|X_i, F)    for i = j.

It follows that

    E(Ŝ|X_i, F) = E(S|X_i, F),

whence

    E[(S − Ŝ)(Ŝ − L)|F] = 0.

Conclude (iii).
Equality (ii) is likely to be applied in the following way. Assume that as n → ∞ the right-hand side converges to zero in probability or P-a.s. Then so does the left-hand side. In case we may apply an integral convergence theorem, we infer S − Ŝ → 0 in L². Coming back to the right-hand side, the σ-fields may depend on n. In case they are increasing or decreasing, martingale arguments might be useful. Conditioning on F is always useful when S contains awkward F-measurable components.
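Claims (i) and (ii) of Lemma 4.2.1 can be verified exactly on a tiny discrete example by full enumeration. In the sketch below (our own construction) S = X_1 X_2 X_3 for i.i.d. Bernoulli(p) variables, so E(S|X_i) = X_i p²:

```python
from itertools import product

p, n = 0.6, 3
theta = p ** n                       # ES for S = X1*X2*X3

def weight(x):
    k = sum(x)
    return p ** k * (1 - p) ** (n - k)

def S(x):
    return x[0] * x[1] * x[2]

def Shat(x):
    # Hajek projection: sum_i E(S|X_i) - (n-1)ES, with E(S|X_i=xi) = xi*p^(n-1)
    return p ** (n - 1) * sum(x) - (n - 1) * theta

pts = list(product([0, 1], repeat=n))
ES2 = sum(weight(x) * S(x) ** 2 for x in pts)
ESh = sum(weight(x) * Shat(x) for x in pts)
ESh2 = sum(weight(x) * Shat(x) ** 2 for x in pts)
Ediff = sum(weight(x) * (S(x) - Shat(x)) ** 2 for x in pts)
varS, varSh = ES2 - theta ** 2, ESh2 - ESh ** 2
print(abs(ESh - theta) < 1e-12, abs(Ediff - (varS - varSh)) < 1e-12)  # True True
```

The two assertions are exactly (i) and (ii) of the lemma.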
4.3  Projection of U-Statistics

As before, consider

    U_n = [(n − k)!/n!] Σ h(X_{i_1}, …, X_{i_k}),

the sum extending over all k-tuples of pairwise distinct indices. Set θ = E h(X_1, …, X_k) and

    h_j(x) = ∫_{R^{k−1}} h(x_1, …, x_{j−1}, x, x_{j+1}, …, x_k) F(dx_1) ⋯ F(dx_{j−1}) F(dx_{j+1}) ⋯ F(dx_k).
The Hájek projection of U_n then admits the explicit form

    Û_n = θ + n^{−1} Σ_{i=1}^{n} Σ_{j=1}^{k} [h_j(X_i) − θ].

Indeed,

    E(U_n|X_i) = [(n − k)/n] θ + n^{−1} Σ_{j=1}^{k} h_j(X_i),

where the two contributions come from summation over all tuples not containing i resp. containing i at position j. It follows that

    Û_n = Σ_{i=1}^{n} E(U_n|X_i) − (n − 1)θ
        = (n − k)θ + n^{−1} Σ_{i=1}^{n} Σ_{j=1}^{k} h_j(X_i) − (n − 1)θ
        = θ + n^{−1} Σ_{i=1}^{n} Σ_{j=1}^{k} [h_j(X_i) − θ].

Setting

    h̃(x) = Σ_{j=1}^{k} [h_j(x) − θ],

we obtain

    Û_n − θ = n^{−1} Σ_{i=1}^{n} h̃(X_i).

The random variables h̃(X_1), …, h̃(X_n) are i.i.d. and centered. So

    n^{1/2}(Û_n − θ) → N(0, σ²)  in distribution        (4.3.1)

with

    σ² = ∫ h̃²(x) F(dx).
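A small simulation illustrates why the projection is the right object: for the (assumed, symmetric) kernel h(x, y) = (x − y)²/2 with uniform F we have θ = 1/12 and h_1(x) = (x² − x + 1/3)/2, and the remainder U_n − Û_n has much smaller variance than U_n itself (our own sketch, seeded):

```python
import random

def kernel(x, y):
    return 0.5 * (x - y) ** 2

def h1(x):
    # h1(x) = E h(x, Y), Y ~ Uniform(0,1); theta = E h = 1/12
    return 0.5 * (x * x - x + 1.0 / 3.0)

theta = 1.0 / 12.0
rng = random.Random(5)
n, reps = 50, 2000
u_vals, r_vals = [], []
for _ in range(reps):
    xs = [rng.random() for _ in range(n)]
    un = sum(kernel(xs[i], xs[j]) for i in range(n) for j in range(i + 1, n))
    un /= n * (n - 1) / 2
    uhat = theta + (2.0 / n) * sum(h1(x) - theta for x in xs)  # Hajek projection
    u_vals.append(un)
    r_vals.append(un - uhat)

def var(v):
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / len(v)

mean_u = sum(u_vals) / reps
print(abs(mean_u - theta) < 0.003, var(r_vals) / var(u_vals) < 0.2)  # True True
```

The variance ratio is of order 1/n, reflecting that the remainder is a degenerate U-statistic.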
Note that U_n is unchanged if h is replaced by its symmetrization H, since

    U_n = [(n − k)!/n!] Σ H(X_{i_1}, …, X_{i_k}).

4.4  The Variance of U-Statistics

For index sets ι_1, ι_2 ⊆ {1, …, k} with |ι_1| = |ι_2| = r, denote by I(ι_1, ι_2) the covariance

    I(ι_1, ι_2) = ∫⋯∫_{R^{2k−r}} h(x_1, …, x_k) h(y_1, …, y_k) F(dx_1) ⋯ F(dx_{2k−r}) − θ²,

where the two kernel arguments share the r coordinates indexed by ι_1 resp. ι_2. Counting tuples with a given overlap, one gets

    Var(U_n) = [(n − k)!/n!]² Σ_{r=1}^{k} [(n − r)!/(n − 2k + r)!] Σ_{|ι_1|=r=|ι_2|} I(ι_1, ι_2).        (4.4.1)

For symmetric h, I(ι_1, ι_2) =: I_r only depends on r, and (4.4.1) simplifies to

    Var(U_n) = (n choose k)^{−1} Σ_{r=1}^{k} (k choose r)(n − k choose k − r) I_r.        (4.4.2)
The term with r = 1 dominates and is linked to the projection of the previous section: for tuples having exactly one index in common, say i_l = j_s,

    k^{−1} Σ_{j=1}^{k} E[ h̃(X_{i_j}) | X_{j_1}, …, X_{j_k} ] = k^{−1} h̃(X_{j_s}),        (4.4.4)

with σ² = ∫ h̃²(x) F(dx) as before.

4.5  U-Statistic Processes

As before, consider a U-statistic of degree m,

    U_n = [(n − m)!/n!] Σ h(X_{i_1}, …, X_{i_m}).
Specializing to m = 2, we study in this section the (two-parameter) U-statistic process

    U_n(u, v) = [1/(n(n − 1))] Σ_{1≤i≠j≤n} h(X_i, X_j) 1{X_i ≤ u, X_j ≤ v}.

Write

    F_n(x) = n^{−1} Σ_{i=1}^{n} 1{X_i ≤ x},   x ∈ R,

the empirical d.f. of the sample. Then U_n(u, v) becomes (assuming no ties)

    U_n(u, v) = [n/(n − 1)] ∫_{−∞}^{u} ∫_{−∞}^{v} h(x, y) 1{x≠y} F_n(dx) F_n(dy).
Now, a standard method to analyze the (large sample) distributional behavior of U_n is to write U_n as

    U_n = Û_n + R_n,

in which Û_n is the Hájek projection of U_n and the remainder R_n = U_n − Û_n is a degenerate U-statistic that is asymptotically negligible when compared with Û_n. In fact, provided that h has a finite pth moment and zero mean,

    n^{1/2} Û_n → N(0, σ²)  in distribution

and

    E|R_n|^p ≤ C n^{−p}.
This pointwise approach is insufficient for handling U_n(u, v) [resp. R_n(u, v)] as a process in (u, v). Particularly, the (pointwise) Hájek approach does not yield bounds for sup_{u,v} |R_n(u, v)|. Such bounds are, however, extremely useful in applications. For example, in survival analysis, U-statistic processes or variants of them appear in the context of estimating the lifetime distribution F and the cumulative hazard function Λ when the data are censored or truncated [cf. Lo and Singh (1986), Lo, Mack and Wang (1989), Major and Rejto (1988) and Chao and Lo (1988)]. In Lo and Singh (1986) the analysis of the remainder term incorporated known global and local properties of empirical processes. In Lo, Mack and Wang (1989) the error bounds were improved by applying a sharp moment bound for degenerate U-statistics due to Dehling, Denker and Philipp (1987). In Major and Rejto (1988) a bound for sup_u |R_n(u, 1)| of large deviation type due to Major (1988) was applied, which required h to be bounded. In all these papers estimation of F (resp. Λ) could only be carried through on intervals strictly contained in the support of the distribution of the observed data; similarly in Chao and Lo (1988) for truncated data situations. This general drawback mainly arose from the lack of a sharp bound for sup_{u,v} |R_n(u, v)| when the kernel h is not necessarily bounded.
Classes of degenerate U-statistics have also been studied, from a different point of view, by Nolan and Pollard (1987). In their Theorem 6 they derive an upper bound for the mean of the supremum by first decoupling the U-process of interest and then using a chaining argument conditionally on the observations. Now, by Hölder, a more efficient inequality would be one relating the pth order mean of the supremum to the pth order mean of the envelope function, p ≥ 2. At least this is a typical feature of many other maximal inequalities. We also refer to de la Peña (1992) and the literature cited there. In these papers the main emphasis is on relating the maximum of interest to the maximum of a decoupled process. No explicit bounds for a degenerate U-statistic process are derived that are comparable to ours. Note, however, that in applications the leading (Hájek) part is well understood, and it is the degenerate part that creates the more serious problems.

In this section we shall employ martingale methods to provide a maximal bound satisfying the above requirements. As a consequence we are able to improve the a.s. representations of the product-limit estimators of F for censored and truncated data as discussed above; see Stute (1993, 1994).
The leading terms involve the partial projections u ↦ Σ_j h(s, X_j) 1{X_j ≤ v}, s fixed. More precisely, set

    I_n(u, v) = Σ_{1≤i<j≤n} h(X_i, X_j) 1{X_i ≤ u, X_j ≤ v}.

Theorem 4.5.1. We have

    I_n(u, v) = Σ_{1≤i<j≤n} [ ∫_{−∞}^{u} h(x, X_j) 1{X_j ≤ v} F(dx) + ∫_{−∞}^{v} h(X_i, y) 1{X_i ≤ u} F(dy)
                − ∫_{−∞}^{u} ∫_{−∞}^{v} h(x, y) F(dx)F(dy) ] + R_n(u, v),

where, for each u_0, v_0 and p ≥ 2, with a constant C = C(p),

    [ E sup_{u≤u_0, v≤v_0} |R_n(u, v)|^p ]^{1/p} ≤ C n [ ∫_{−∞}^{u_0} ∫_{−∞}^{v_0} |h(x, y)|^p F(dx)F(dy) ]^{1/p}.        (4.5.1)
An analogous expansion then holds for ∫_{−∞}^{u} ∫_{−∞}^{v} h(x, y) 1{x≠y} F_n(dx) F_n(dy). Inequality (4.5.1) together with the Markov inequality yields, with probability 1,

    sup_{u≤u_0, v≤v_0} |R_n(u, v)| = o(n^{1+1/p} (ln n)^δ),        (4.5.2)

whenever δp > 1.
When the integration in (4.5.1) is restricted to smaller regions, the right-hand side also becomes small, to the effect that the bound in (4.5.2) may be replaced by smaller ones. The last remarks also apply to the results that follow.

Interestingly enough, (4.5.2) may be improved a lot. This is due to the fact that, according to Berk (1966), a sequence of normalized U-statistics is a reverse-time martingale. Utilizing this, we get the following result.

Theorem 4.5.3. Under the assumptions of Theorem 4.5.1, with probability 1,

    sup_{u≤u_0, v≤v_0} |R_n(u, v)| = o(n (ln n)^δ)

whenever δp > 1. For bounded h's, we may therefore take any δ > 0.

With some extra work the logarithmic factor may be pushed down so as to get a bounded LIL. The necessary methodology may be found, for a fixed U-statistic rather than a process, in a notable paper by Dehling, Denker and Philipp (1986). After truncation, they applied their moment inequality, at stage n, with a p = p_n depending on n such that p_n → ∞ slowly, to the effect that for a bounded LIL the moment inequality serves the same purpose as an exponential bound (personal communication by M. Denker). Since this method is well established now, we need not dwell on this here again.
In the next theorem we are concerned with a two-sample situation. Let X_1, …, X_n be i.i.d. with common d.f. F and let, independently of the X's, Y_1, …, Y_m be another i.i.d. sequence with common d.f. G. We shall derive a representation of the process

    nmU_{nm}(u, v) = Σ_{i=1}^{n} Σ_{j=1}^{m} h(X_i, Y_j) 1{X_i ≤ u, Y_j ≤ v}.

One obtains

    nmU_{nm}(u, v) = Σ_{i=1}^{n} Σ_{j=1}^{m} [ ∫_{−∞}^{u} h(x, Y_j) 1{Y_j ≤ v} F(dx) + ∫_{−∞}^{v} h(X_i, y) 1{X_i ≤ u} G(dy)
                    − ∫_{−∞}^{u} ∫_{−∞}^{v} h(x, y) F(dx)G(dy) ] + R_{nm}(u, v),

where R_{nm} satisfies a bound of the type (4.5.1), with the h-integral now taken w.r.t. F ⊗ G over (−∞, u_0] × (−∞, v_0], whenever p > 1.
Another variant (resp. extension) of Theorem 4.5.1, which is extremely useful in applications, comes up when, in addition to X_i, there is a Y_i paired with X_i. Typically X_i is correlated with Y_i. We may then form

    I_n(u, v) = Σ_{1≤i<j≤n} h(X_i, Y_j) 1{X_i ≤ u, Y_j ≤ v}.

Denoting by G the d.f. of the Y's, one obtains

    I_n(u, v) = Σ_{1≤i<j≤n} [ ∫_{−∞}^{u} h(x, Y_j) 1{Y_j ≤ v} F(dx) + ∫_{−∞}^{v} h(X_i, y) 1{X_i ≤ u} G(dy)
                − ∫_{−∞}^{u} ∫_{−∞}^{v} h(x, y) F(dx)G(dy) ] + R_n(u, v),

where R_n satisfies (4.5.1) and the h-integral in the bound is taken w.r.t. F ⊗ G. The assertion of Theorem 4.5.3 also extends to the present case.
Remark 4.5.7. The results of this section may be extended to U-statistic processes of degree m > 2, but proofs become more complicated and the notation even more cumbersome. As far as applications are concerned, however, the case m = 2 is by far the most important one.

We end this section by presenting five examples to which the theorems may be applied. For these we remark that in the formulation of the previous results, the point infinity could also be included in the parameter set. What matters is that the parameter sets of the coordinate spaces need to be linearly ordered.
Example 4.5.8 (Censored data). In the random censorship model the actually observed data are Z_i = min(X_i, C_i) and δ_i = 1{X_i ≤ C_i}, where X_i is the variable of interest (the lifetime), which is at risk of being censored by C_i, the censoring variable. For estimation of the cumulative hazard function of X, a crucial role is played by the (one-parameter) process

    I_n(u) = Σ_{1≤i<j≤n} [ δ_i 1{Z_j > Z_i}/(1 − H)²(Z_i) ] 1{Z_i ≤ u},   u ∈ R,

H denoting the d.f. of the Z's. This is an I_n-process of the above type with paired data (Z_i, δ_i), and, therefore,

    I_n(u) = Σ_{1≤i<j≤n} [ 1{Z_j > Y_i}/(1 − H)²(Y_i) ] 1{Y_i ≤ u} = I_n(u, ∞)

with Y_i the relevant paired coordinate.
Further examples are the two-sample statistics

    T_{nm} = Σ_{i=1}^{n} Σ_{j=1}^{m} h(X_i, Y_j)

and trimmed U-statistics

    n(n − 1)U_n^0(α, β) = Σ_{i=1}^{[nα]} Σ_{j=1}^{[nβ]} h(X_{i:n}, X_{j:n}).

Since

    n(n − 1)U_n^0(α, β) = Σ_{i=1}^{n} Σ_{j=1}^{n} h(X_i, X_j) 1{X_i ≤ F_n^{−1}(α), X_j ≤ F_n^{−1}(β)},

we see that U_n^0(α, β) is related to U_n(F_n^{−1}(α), F_n^{−1}(β)), neglecting the sum over i = j for a moment. Observe that u = F_n^{−1}(α) and v = F_n^{−1}(β) are random in this case. The (uniform) representation of U_n in terms of a (simple) sum of independent random processes together with their tightness in the two-dimensional Skorokhod space allows for a simple analysis of U_n^0(α, β), not just for a fixed (α, β) [as done in Gijbels, Janssen and Veraverbeke (1988)], but as a process in (α, β). Details are omitted.
U-statistic processes also occur in the analysis of linear rank statistics. We only mention the possibility of representing a linear signed rank statistic (up to an error term) as a sum of i.i.d. random processes [cf. Sen (1981), Theorem 5.4.2].

Example 4.5.12 (Linear signed rank statistics). For a sample X_1, …, X_n and a proper score function φ, it is required to represent the corresponding double sum as a sum of i.i.d. random processes.
Chapter 5

Statistical Functionals

5.1  Empirical Equations

Empirical equations have been addressed briefly in Section 1.4. For a more detailed discussion, consider a parametric family M = {f_θ : θ ∈ Θ} of densities w.r.t. a measure μ. Assume that the true distribution function F belongs to M, i.e., F = F_{θ_0} with density f_{θ_0} for some (unknown) θ_0 ∈ Θ. Set

    K(θ, θ_0) = ∫ ln[f_{θ_0}(x)/f_θ(x)] F(dx) = ∫ ln[f_{θ_0}(x)/f_θ(x)] f_{θ_0}(x) μ(dx),

the Kullback–Leibler distance between f_θ and f_{θ_0}. Since K(θ, θ_0) ≥ 0, with equality iff f_θ = f_{θ_0} μ-a.e., the true θ_0 maximizes

    T_F : θ ↦ ∫ ln f_θ(x) F(dx),

which of course depends on F. Denote with T(F) the maximizer of T_F, i.e.,

    T(F) = arg max_θ T_F(θ).
Given a sample from F, we may compute F_n and consider the mapping T_{F_n} instead of T_F:

    T_{F_n}(θ) = ∫ ln f_θ(x) F_n(dx) = n^{−1} Σ_{i=1}^{n} ln f_θ(X_i).

Under smoothness, the maximizer T(F_n) solves

    Σ_{i=1}^{n} φ(X_i, θ) = 0        (5.1.1)

with

    φ(x, θ) = ∂ ln f_θ(x)/∂θ.

Equation (5.1.1) is an example of an empirical equation. The same principle also applies to other φ's. See below. In such a situation the resulting estimator is called an M- (or maximum likelihood type) estimator, and the associated functional attaches to a d.f. G the solution θ of

    ∫ φ(x, θ) G(dx) = 0.
Another example is the median of the pairwise means

    { (X_i + X_j)/2 : 1 ≤ i, j ≤ n }.

As a further example, consider minimum distance estimation based on

    d²(G, F_θ) = ∫ [G(x) − F_θ(x)]² G(dx).

The parameter of interest is the θ minimizing d²(F, F_θ). Under smoothness this minimizer satisfies

    ∫ φ(x, θ, F) F(dx) = 0

with

    φ(x, θ, G) = [G(x) − F_θ(x)] (∂/∂θ) F_θ(x).

More examples are discussed in Stute (1986), Stoch. Proc. Appl. 22, 223–244.
5.2  Anova Decomposition of Statistical Functionals

Let S = S(X_1, …, X_n) be square-integrable and, for U ⊆ {1, …, n}, set S_U = E(S|X_i, i ∈ U). Put

    Y_T = Σ_{U⊆T} (−1)^{|T\U|} S_U.

Then

    S = Σ_{T⊆{1,…,n}} Y_T,

since

    Σ_{T⊆{1,…,n}} Y_T = Σ_{T⊆{1,…,n}} Σ_{U⊆T} (−1)^{|T\U|} S_U
        = Σ_{U⊆{1,…,n}} S_U Σ_{k=0}^{n−|U|} (−1)^k (n−|U| choose k).

The sum over k equals zero unless |U| = n, in which case it is one. Moreover, the Y_T's are centered for T ≠ ∅:

    E(Y_T) = Σ_{U⊆T} (−1)^{|T\U|} E(S) = E(S) Σ_{k=0}^{r} (−1)^k (r choose k) = 0,

where r = |T| ≥ 1.
Similarly, for T_1, T_2 ⊆ {1, …, n},

    E(Y_{T_2}|X_i, i ∈ T_1) = Σ_{U⊆T_2} (−1)^{|T_2\U|} E(S|X_i, i ∈ U ∩ T_1)
        = Σ_{W⊆T_1∩T_2} E(S|X_i, i ∈ W) Σ_{V⊆T_2\T_1} (−1)^{|T_2|−|W|−|V|},

which vanishes unless T_2 ⊆ T_1, in which case it equals Y_{T_2}. In particular, the Y_T's are mutually uncorrelated.
Next put

    W_k = Σ_{T⊆{1,…,k}} Y_T,   1 ≤ k ≤ n.

Then

    W_k = Σ_{T⊆{1,…,k}} Σ_{U⊆T} (−1)^{|T\U|} S_U
        = Σ_{U⊆{1,…,k}} S_U Σ_{r=0}^{k−|U|} (−1)^r (k−|U| choose r)
        = E(S|X_i, i = 1, …, k),

a martingale in k. In particular,

    Σ_{|T|≤1} Y_T = Σ_{i=1}^{n} E(S|X_i) − (n − 1)ES,

the Hájek projection of S.
5.3  The Jackknife Estimate of Variance

As in the previous section, put

    Y_T = Σ_{U⊆T} (−1)^{|T\U|} S_U.

Then

    S = Σ_{T⊆{1,…,n}} Y_T

and, since the Y_T's are centered (T ≠ ∅) and uncorrelated,

    Var(S) = Σ_{∅≠T⊆{1,…,n}} Var(Y_T).

In this section we study estimation of Var(S) in the case when the X's are i.i.d. and S is symmetric. Then Var(Y_T) only depends on the cardinality of T. Write σ_k² = Var(Y_T) when |T| = k. Hence

    s_n² := Var(S) = Σ_{k=1}^{n} (n choose k) σ_k².
Denote by S_{(i)} the statistic computed from the sample with X_i deleted and by S_{(·)} = n^{−1} Σ_{i=1}^{n} S_{(i)} their average. The jackknife estimate of s²_{n−1} = Var(S(X_1, …, X_{n−1})) is

    ŝ²_{n−1} := Σ_{i=1}^{n} [S_{(i)} − S_{(·)}]².

To compute its expectation, write

    S_{(i)} − S_{(j)} = Σ'' Y_T − Σ' Y_T,

where Σ' denotes summation over all T ⊆ {1, …, i−1, i+1, …, n} containing j, and Σ'' denotes summation over all T ⊆ {1, …, j−1, j+1, …, n} containing i. Since thus all T appearing on the right-hand side are different, the corresponding Y's are uncorrelated. By symmetry, we thus get

    E(S_{(i)} − S_{(j)})² = 2 Σ_{k=1}^{n−1} (n−2 choose k−1) σ_k².

Now,

    ŝ²_{n−1} = (2n)^{−1} Σ_{i≠j} (S_{(i)} − S_{(j)})²,

whence

    E ŝ²_{n−1} = (n − 1) Σ_{k=1}^{n−1} (n−2 choose k−1) σ_k².

Recall

    s²_{n−1} = Σ_{k=1}^{n−1} (n−1 choose k) σ_k².

We find that

    E ŝ²_{n−1} − s²_{n−1} = Σ_{k=2}^{n−1} (k − 1) (n−1 choose k) σ_k² ≥ 0,

i.e., the jackknife estimate tends to overestimate the variance.
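For a linear S the bias above vanishes (σ_k² = 0 for k ≥ 2), and for the sample mean the jackknife estimate even admits a closed form that can be checked exactly (our own sketch):

```python
data = [2.0, 3.5, 1.0, 4.0, 2.5, 5.0]
n = len(data)
total = sum(data)
mean = total / n

# leave-one-out means S_(i) and their average S_(.)
loo = [(total - x) / (n - 1) for x in data]
loo_mean = sum(loo) / n
s2_jack = sum((v - loo_mean) ** 2 for v in loo)

# closed form for the sample mean: sum (X_i - mean)^2 / (n-1)^2
s2_closed = sum((x - mean) ** 2 for x in data) / (n - 1) ** 2
print(abs(s2_jack - s2_closed) < 1e-12)  # True
```

Note that E[Σ(X_i − X̄_n)²/(n−1)²] = σ²/(n−1) = Var(X̄_{n−1}), so the estimate is unbiased here.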
Equality holds iff σ_k² = 0 for k ≥ 2, i.e., iff

    S = ES + Σ_{i=1}^{n} Y_{{i}} = Σ_{i=1}^{n} E(S|X_i) − (n − 1)E(S),

i.e., S coincides with its Hájek projection. Such S's are called linear. They are called quadratic iff σ_k² = 0 for k ≥ 3. Such an S admits a representation

    S = E(S) + Σ_{i=1}^{n} Y_{{i}} + Σ_{i<j} Y_{{i,j}}.

In terms of the pseudo-values Ŝ_i := nS − (n − 1)S_{(i)}, with S̄ := n^{−1} Σ_{i=1}^{n} Ŝ_i,

    Σ_{i=1}^{n} (Ŝ_i − S̄)² = (n − 1)² Σ_{i=1}^{n} (S_{(i)} − S_{(·)})².

Example 5.3.1. For the sample mean, Ŝ_i = X_i and S̄ = X̄_n, so that the jackknife variance estimate becomes Σ_{i=1}^{n} (X_i − X̄_n)²/(n − 1)².
For the sample median with n = 2m, on the other hand, the jackknife variance estimate equals (n/4)[X_{m+1:n} − X_{m:n}]².

5.4  The Jackknife Estimate of Bias

Let S_n ≡ S = S(X_1, …, X_n) be an integrable function of i.i.d. random variables X_1, …, X_n ~ F. Assume that S is designed to estimate a population parameter θ(F). Consider the bias

    bias := E_F(S) − θ(F).

Quenouille's estimate of bias is

    BIAS := (n − 1)[S_{(·)} − S] = [(n − 1)/n] Σ_{i=1}^{n} S_{(i)} − (n − 1)S.
Assume that

    E_F(S) = θ(F) + a_1/n + a_2/n² + a_3/n³ + O(n^{−4}).

Then

    E(BIAS) = (n − 1)[E S_{(·)} − E S]
            = (n − 1)[ a_1(1/(n−1) − 1/n) + a_2(1/(n−1)² − 1/n²) ] + O(n^{−3})
            = a_1/n + O(n^{−2}),

whence the assertion: the bias-corrected estimate S − BIAS has bias of order O(n^{−2}).
Example 5.4.2. Take θ(F) = σ²(F), and let S = n^{−1} Σ_{i=1}^{n} (X_i − X̄_n)². A simple calculation shows

    BIAS = −[1/(n(n − 1))] Σ_{i=1}^{n} (X_i − X̄_n)²,

yielding

    S − BIAS = [1/(n − 1)] Σ_{i=1}^{n} (X_i − X̄_n)²,

the familiar unbiased estimate of σ²(F).
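The identity in Example 5.4.2 is purely algebraic and thus holds for any data set, which the following sketch (our own code) confirms:

```python
def s_biased(xs):
    # biased variance estimate n^{-1} sum (x - mean)^2
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

data = [1.2, 0.7, 3.1, 2.2, 4.0, 0.3, 2.8]
n = len(data)
S = s_biased(data)
loo = [s_biased(data[:i] + data[i + 1:]) for i in range(n)]   # S_(i)
bias_hat = (n - 1) * (sum(loo) / n - S)                       # Quenouille BIAS

m = sum(data) / n
unbiased = sum((x - m) ** 2 for x in data) / (n - 1)
print(abs((S - bias_hat) - unbiased) < 1e-12)  # True
```

The jackknife thus removes the 1/n-bias of the variance estimate exactly, not merely to first order.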
One may also apply the jackknife bias correction to the variance estimate

    ŝ²_{n−1} = Σ_{i=1}^{n} [S_{(i)} − S_{(·)}]² ≡ W.

Clearly,

    W = Σ_{i=1}^{n} S_{(i)}² − n^{−1} Σ_{i,j=1}^{n} S_{(i)} S_{(j)}.

Denote with S_{ir} the value of S taken at the sample deleted by X_i and X_r. Then

    W_{(·)} = n^{−1} Σ_{r=1}^{n} [ Σ_{i≠r} S_{ir}² − (n − 1)^{−1} Σ_{i,j≠r} S_{ir} S_{jr} ],

so that by symmetry

    E W_{(·)} = (n − 2) E S_{12}² − (n − 2) E S_{13} S_{23}.

For the bias-corrected

    W̃ = W − BIAS_W = nW − (n − 1)W_{(·)},

we find that

    E W̃ = n(n − 1)[E S_{(1)}² − E S_{(1)} S_{(2)}] − (n − 1)(n − 2)[E S_{13}² − E S_{13} S_{23}].

But

    E S_{(1)} S_{(2)} = E S_{(1)}² − ½ E(S_{(1)} − S_{(2)})²

and

    E S_{13} S_{23} = E S_{13}² − ½ E(S_{13} − S_{23})².

Conclude that

    E W̃ = ½ n(n − 1) E(S_{(1)} − S_{(2)})² − ½ (n − 1)(n − 2) E(S_{13} − S_{23})²
         = n(n − 1) Σ_k (n−2 choose k−1) σ²_{k,n−1} − (n − 1)(n − 2) Σ_k (n−3 choose k−1) σ²_{k,n−2},

where σ²_{k,m} denotes the variance of Y_T for |T| = k at sample size m. To make this argument work one also needs some smoothness of the variance as a function of n.
Quenouille's BIAS may also be motivated as follows. In terms of the pseudo-values Ŝ_i = nS − (n − 1)S_{(i)},

    BIAS = S − n^{−1} Σ_{i=1}^{n} Ŝ_i,

and

    θ(F) − ES(F_n) = Σ_{k=0}^{∞} [ES(F_{n+k+1}) − ES(F_{n+k})].

The increment ES(F_n) − ES(F_{n−1}) may be estimated by S − S_{(·)}. Taking the last term also for estimating ES(F_{n+k}) − ES(F_{n+k−1}) for k > 0, and summing up, we finally arrive at BIAS.
Chapter 6

Stochastic Inequalities

6.1  The D-K-W Bound

This section deals with the most famous bound on the deviation between F_n and F, the Dvoretzky-Kiefer-Wolfowitz (1956) exponential bound for the upper tails of D_n^+.

Theorem 6.1.1. There exists a universal constant c > 0 so that for all x ≥ 0 and n ≥ 1

    P(n^{1/2} D_n^+ > x) ≤ c exp(−2x²).        (6.1.1)
Proof. In view of (3.1.17), it suffices to study the case F = F_U. The original proof of Dvoretzky-Kiefer-Wolfowitz, which is presented here, rests on Theorem 3.4.5. Since the left-hand side of (6.1.1) equals zero for x ≥ √n, we may assume x < √n. From (3.4.4),

    P(n^{1/2} D_n^+ > x) = x n^{1/2} Σ_{j=⟨x√n⟩+1}^{n} Q_n(j, x),        (6.1.2)

where

    Q_n(j, x) = (n choose j) (j − x√n)^j (n − j + x√n)^{n−j−1} n^{−n}.

The term with j = n contributes (1 − x/√n)^n ≤ exp(−2x²), so that it remains to bound the sum over j ≤ n − 1.
For the remaining terms, differentiation with respect to x gives

    (∂/∂x) ln Q_n(j, x) = −xn²/[(j − x√n)(n − j + x√n)] − √n/(n − j + x√n).

Since (j − x√n)(n − j + x√n) = n²/4 − (n/2 − j + x√n)², we get

    −xn²/[(j − x√n)(n − j + x√n)] ≤ −4x − (16x/n²)(n/2 − j + x√n)².

Integration therefore leads to

    Q_n(j, x) ≤ Q_n(j, 0) exp[ −2x² − (8x²/n²)(n/2 − j + 2x√n/3)² + 4x⁴/(9n) ]        (6.1.3)

and, integrating from 1 rather than 0,

    Q_n(j, x) ≤ c_1 Q_n(j, 1) exp[ −2x² − (8x²/n²)(n/2 − j + 2x√n/3)² + 4x⁴/(9n) ]        (6.1.4)

with a universal constant c_1 > 0.
To bound Q_n(j, 0) and Q_n(j, 1), we first consider the case |j − n/2| ≤ n/4. From Stirling's formula

    k! ∼ √(2πk) (k/e)^k

we may infer that

    Q_n(j, 0) ≤ c_2 n^{−3/2}

for some generic constant c_2. Denote with Σ' the sum over those j in (6.1.2) satisfying |j − n/2| ≤ n/4, and let Σ'' be the sum over the remaining ones. From (6.1.3) we obtain

    Σ' Q_n(j, x) ≤ c_2 n^{−3/2} exp(−2x²) Σ_j exp[ −8x² (j/n − 1/2 + 2x/(3√n))² ]
        ≤ 2c_2 n^{−3/2} exp(−2x²) Σ_{j=0}^{∞} exp[−8x² j²/n²]
        ≤ 2c_2 n^{−3/2} exp(−2x²) [ 1 + n ∫_0^∞ exp(−8x²t²) dt ]
        ≤ c_3 x^{−1} n^{−1/2} exp(−2x²),

so that the contribution of Σ' to (6.1.2) is at most c_3 exp(−2x²).
For the remaining j's, use (6.1.4). For 2x√n/3 ≤ n/8, i.e. x ≤ 3√n/16, the second term in the exponent of (6.1.4) does not exceed −x²/8 on the range of Σ'', while the last term is less than or equal to (4/9)(3/16)² x² = x²/64, so that

    Q_n(j, x) ≤ c_1 Q_n(j, 1) exp[−2x² − c_4 x²] ≤ c_5 x^{−1} Q_n(j, 1) exp(−2x²).

Since, by (6.1.2) applied at x = 1, √n Σ_j Q_n(j, 1) ≤ 1, the contribution of Σ'' to (6.1.2) is again of order exp(−2x²). This completes the proof.
6.2  Binomial Tail Bounds

By the CLT,

    P( n^{1/2} [F_n(t) − F(t)]/√(F(t)(1 − F(t))) ≥ x ) → (2π)^{−1/2} ∫_x^∞ e^{−u²/2} du = P(ξ ≥ x),        (6.2.1)

where ξ ~ N(0, 1). Mills' ratio states that for all x > 0

    (2π)^{−1/2} (1/x − 1/x³) exp(−x²/2) ≤ P(ξ ≥ x) ≤ (2π)^{−1/2} (1/x) exp(−x²/2),        (6.2.2)

i.e., neglecting the factors 1/x and 1/x³, the upper tails of a standard normal distribution decrease exponentially fast. In this section we show that a similar bound also holds, for finite n, for a standardized binomial variable.
First, the Chebychev inequality leads to

    P( n^{1/2} [F_n(t) − F(t)]/√(F(t)(1 − F(t))) ≥ x ) ≤ x^{−2},

which is far worse than what we may expect from (6.2.1) and (6.2.2). A much sharper bound is obtained if we incorporate the moment generating function of a binomial random variable.
Lemma 6.2.1. Let η ~ Bin(n, p), 0 < p < 1. Then the moment generating function of η equals

    M(z) = E[exp(zη)] = [(1 − p) + p exp z]^n.

Proof. Use the fact that η equals, in distribution, the sum of n independent Bernoulli random variables with parameter p, whose moment generating function equals (1 − p) + p exp z.
To bound P(η ≥ np + λ) for λ > 0, note that the Markov inequality yields for each z ≥ 0:

    P(η ≥ np + λ) ≤ E[exp(z(η − np − λ))] = M(z) exp[−z(np + λ)].

Minimizing over z ≥ 0 we thus get the bound

    P(η ≥ np + λ) ≤ exp[−f(np + λ)],

where f(u) = sup_{z≥0} [zu − ln M(z)]. The function f is called the Chernoff-function of M. Clearly, f is nonnegative and nondecreasing.
Lemma 6.2.3. We have

    f(u) ≡ f(u, n, p) = {  0                                                  for u ≤ 0,
                           u ln[u/(np)] + (n − u) ln[(n − u)/(n(1 − p))]      for 0 < u < n,
                           −n ln p                                            for u = n,
                           +∞                                                 for n < u.  }

In particular, for u = np + λ,

    f(np + λ, n, p) = np [1 + λ/(np)] ln[1 + λ/(np)] + n(1 − p) [1 − λ/(n(1 − p))] ln[1 − λ/(n(1 − p))].
To derive handier bounds, recall the expansions, for |x| < 1,

    (1 + x) ln(1 + x) = x + x²/(1·2) − x³/(2·3) + x⁴/(3·4) − …

and

    (1 − x) ln(1 − x) = −x + x²/(1·2) + x³/(2·3) + x⁴/(3·4) + … .

With x_1 = λ/(np) and x_2 = λ/(n(1 − p)) we thus get

    f(np + λ, n, p) = λ²/(2np(1 − p)) + np[ −x_1³/(2·3) + x_1⁴/(3·4) − … ] + n(1 − p)[ x_2³/(2·3) + x_2⁴/(3·4) + … ]
        ≥ [λ²/(2np(1 − p))] ψ( λ/(np) ).
The function ψ is defined, for |x| < 1, as

    ψ(x) = 1 − x/3 + x²/6 − … + (−1)^k 2x^k/[(k + 2)(k + 1)] + … .

Check that

    ψ(x) = 2h(1 + x)/x²        (6.2.3)

with

    h(x) = x(ln x − 1) + 1.        (6.2.4)

In terms of (6.2.3) and (6.2.4), ψ is defined for all positive x. Putting

    g_1(λ) = f(np + λ, n, p)  and  g_2(λ) = [λ²/(2np(1 − p))] ψ( λ/(np) ),

we easily find that

    g_1(0) = g_2(0) = 0  and  ∂g_1/∂λ ≥ ∂g_2/∂λ.

The second inequality is equivalent to

    (1 − p) ln[1 − λ/(n(1 − p))] + p ln[1 + λ/(np)] ≤ 0,

which easily follows from the concavity of x ↦ ln(1 + x). We thus obtain

    g_1(λ) ≥ g_2(λ).
Lemma 6.2.4. Let η ~ Bin(n, p). Then we have for all λ > 0:

(i)

    P(η ≥ np + λ) ≤ exp[ −(λ²/(2np(1 − p))) ψ(λ/(np)) ] = exp[ −(np/(1 − p)) h(1 + λ/(np)) ].

(ii)

    P(η ≤ np − λ) ≤ exp[ −(λ²/(2np(1 − p))) ψ(λ/(n(1 − p))) ].
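Bound (i) of Lemma 6.2.4 can be compared with the exact binomial tail; the sketch below (our own code, with a concrete parameter choice) checks that the exact probability is indeed dominated:

```python
import math

def binom_tail(n, p, k0):
    # exact P(eta >= k0) for eta ~ Bin(n, p)
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k0, n + 1))

def h(x):
    return x * (math.log(x) - 1) + 1

n, p = 40, 0.3
k0 = 18                      # threshold np + lam with np = 12, lam = 6
lam = k0 - n * p
bound = math.exp(-(n * p / (1 - p)) * h(1 + lam / (n * p)))
exact = binom_tail(n, p, k0)
print(exact <= bound)  # True
```

Here the bound evaluates to about 0.16 while the exact tail is considerably smaller, as the Chernoff method only captures the exponential rate.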
Improved bounds are obtained if we use sharper lower bounds for f(u, n, p). For example, when 1 − p ≥ p and therefore p ≤ 1/2, we get for u = np + λ

    f(u, n, p) ≥ λ²/(2np(1 − p)).        (6.2.5)

In general, since ψ(x) ≥ 1 − x/3,

    f(np + λ, n, p) ≥ λ²/(2np(1 − p)) − (np/6)(λ/(np))³ = [λ²/(2np(1 − p))] [1 − λ(1 − p)/(3np)]        (6.2.6)
        ≥ λ²(1 − ε)/(2np(1 − p)),        (6.2.7)

the last step provided that λ(1 − p) ≤ 3εnp.

For the standardized F_n(t), we obtain for all x ≥ 0 and a given threshold 0 < ε < 1:

    P( n^{1/2}[F_n(t) − F(t)]/√(F(t)(1 − F(t))) ≥ x ) ≤ {  exp[−x²/2]          if p = F(t) ≤ 1/2,
                                                          exp[−x²(1 − ε)/2]   for a general p = F(t),  }

the latter provided that x ≤ 3ε √(np)/(1 − p)^{3/2}.
6.3  Oscillations of Empirical Processes

Let α_n denote the uniform empirical process, and define its oscillation modulus by

    ω_n(a) = sup_{|t−s|≤a} |α_n(t) − α_n(s)|.
Fix a grid 0 = t_0 < t_1 < … < t_m = a, put A_i = {α_n(t_i) > x} for i < m and A_m = {α_n(a) > x′} for some x′ ≤ x to be chosen. Then

    P( sup_{0≤i≤m} α_n(t_i) > x ) ≤ P( ⋃_{i=0}^{m} A_i )

and

    P( ⋃_{i=0}^{m} A_i ) = P( ⋃ A_i ∩ A_m ) + P( ⋃ A_i ∩ A_m^c )
        ≤ P(A_m) + Σ_{i=0}^{m−1} P( A_m^c ∩ A_i ∩ A_0^c ∩ … ∩ A_{i−1}^c ).

Condition on α_n(t_j) = x_j. Write

    P(A_m^c, α_n(t_j) = x_j for 0 ≤ j ≤ i)
      = P(A_m^c | α_n(t_j) = x_j for 0 ≤ j ≤ i) P(α_n(t_j) = x_j for 0 ≤ j ≤ i)
      = P(A_m^c | α_n(t_i) = x_i) P(α_n(t_j) = x_j for 0 ≤ j ≤ i),

by the Markov property. Suppose we can show that P(A_m^c | α_n(t_i) = x_i) ≤ c < 1 whenever x_i > x. Then

    P( ⋃_{i=0}^{m} A_i ) ≤ P(A_m) + c Σ_{i=0}^{m−1} P( A_i ∩ A_0^c ∩ … ∩ A_{i−1}^c ) ≤ P(A_m) + c P( ⋃_{i=0}^{m} A_i ),

so that

    P( max_{0≤i≤m} α_n(t_i) > x ) ≤ [1/(1 − c)] P(α_n(a) > x′).

Since the right-hand side does not depend on the chosen grid, the bound also holds when we let the mesh tend to zero. Since α_n is continuous from the right and has left-hand limits, we finally arrive at

    P( sup_{0≤t≤a} α_n(t) > x ) ≤ [1/(1 − c)] P(α_n(a) > x′).        (6.3.1)

Arguments similar to those which led to the maximal inequality (6.3.1) are well-known and have been successfully applied also outside empirical process theory.
We now investigate conditions on x′ ≤ x guaranteeing

    P(A_m^c | α_n(t_i) = x_i) ≤ c

for each x_i > x. Now, with N := nF_n(t_i), conditionally on α_n(t_i) = x_i the observations exceeding t_i behave like a uniform sample of size n − N on (t_i, 1), so that

    P(α_n(t_m) ≤ x′ | α_n(t_i) = x_i) = P( (n − N) F_{n−N}( (t_m − t_i)/(1 − t_i) ) ≤ √n x′ + n t_m − N ).

The binomial variable on the left has conditional mean (n − N)(t_m − t_i)/(1 − t_i) and variance at most na, while, since x_i > x and (1 − t_m)/(1 − t_i) ≥ 1 − a, the threshold falls short of this mean by at least

    √n [ x_i (1 − t_m)/(1 − t_i) − x′ ] ≥ √n [ x(1 − a) − x′ ].

Choosing x′ = x(1 − ε) with a < ε/2, this gap is at least √n x ε/2, and Chebychev's inequality gives

    P(A_m^c | α_n(t_i) = x_i) ≤ na/(n x²ε²/4) = 4a/(x²ε²) ≤ 1/2,

where the last inequality holds provided that 8a ≤ x²ε².
Lemma 6.3.1. Assume that

(i) a < ε/2,

(ii) 8a ≤ x²ε².

Then

    P( sup_{0≤t≤a} α_n(t) > x ) ≤ 2 P( α_n(a) > x(1 − ε) ).

A similar bound holds for the lower tails, so that summarizing we get under (i)–(ii)

    P( sup_{0≤t≤a} |α_n(t)| > x ) ≤ 2 P( |α_n(a)| > x(1 − ε) ).        (6.3.2)
Lemma 6.3.2. Assume (i)–(ii) together with a condition (iii) guaranteeing that (6.2.7) applies to the relevant tails. Then

    P( ω_n(a) > s√a ) ≤ C a^{−1} exp[ −s²(1 − ε)^5/2 ],

with C = 64√2.
Proof. With R blocks of length 1/R,

    ω_n(a) ≤ max_{0≤i≤R−1} sup_{0≤t≤a} | α_n(i/R + t) − α_n(i/R) |
           + 2 max_{0≤i≤R−1} sup_{0≤τ≤1/R} | α_n(i/R + τ) − α_n(i/R) |.

Since α_n has stationary increments, we obtain

    P( ω_n(a) > x ) ≤ P( max_{0≤i≤R−1} sup_{0≤t≤a} | α_n(i/R + t) − α_n(i/R) | > x/(1 + ε) )
        + P( 2 max_{0≤i≤R−1} sup_{0≤τ≤1/R} | α_n(i/R + τ) − α_n(i/R) | > xε/(1 + ε) )
        ≤ R P( sup_{0≤t≤a} |α_n(t)| > x/(1 + ε) ) + R P( sup_{0≤t≤1/R} |α_n(t)| > xε/(2(1 + ε)) ).

It follows from (i)–(ii) and the choice of R that (6.3.2) is applicable to each of the above probabilities. Hence

    P( ω_n(a) > x ) ≤ 2R [ P( |α_n(a)| > x(1 − ε)/(1 + ε) ) + P( |α_n(1/R)| > xε(1 − ε)/(2(1 + ε)) ) ].

Under (iii), we may apply (6.2.7) and thus get the final result.
Theorem 6.3.3. For each ε > 0 and every δ > 0 there exists a small a > 0 such that for all n ≥ n_0(ε, δ)

    P( ω_n(a) ≥ ε ) ≤ δ.        (6.3.3)

Proof. Apply Lemma 6.3.2 with the lemma's ε set to 1/2 and s = a^{−1/4}. Conditions (i)–(iii) from Lemma 6.3.2 are then all satisfied for a > 0 sufficiently small and n sufficiently large. Moreover, we may choose a > 0 so small that also a^{1/4} ≤ ε and C a^{−1} exp[−a^{−1/2}/64] ≤ δ hold. This completes the proof.
We also mention that Theorem 6.3.3 may be readily applied to bound the oscillation modulus of empirical processes from a general $F$ satisfying some weak smoothness assumptions. If, for example, $F$ is Lipschitz:

$$|F(x) - F(y)| \le M|x - y|, \qquad (6.3.4)$$

then

$$\bar\omega_n(a) \le \omega_n(Ma), \qquad (6.3.5)$$

where

$$\bar\omega_n(a) = \sup_{|t-s| \le a} |\bar\alpha_n(t) - \bar\alpha_n(s)| \qquad (6.3.6)$$

is the oscillation modulus pertaining to a general empirical process. Condition (6.3.4) is satisfied if $F$ has a bounded Lebesgue density.

We may also let $a = a_n$ tend to zero as $n \to \infty$. A rough upper bound in this regime, for $n^{-1} \le a_n \to 0$, is

$$\omega_n(a_n) = O\Big(\sqrt{a_n \ln a_n^{-1}}\Big).$$

Improvements of these results may be found in Stute (1982). Extensions to a general $F$ satisfying (6.3.4) utilize (6.3.5).
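The rate above can be illustrated by simulation. The following minimal NumPy sketch (sample size, grid resolution and seed are illustrative choices, not part of the text) approximates $\omega_n(a_n)$ of the uniform empirical process on a grid and compares it with $\sqrt{a_n \ln a_n^{-1}}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def oscillation_modulus(u, a, grid=2000):
    """Grid approximation of omega_n(a) = sup_{|t-s|<=a} |alpha_n(t) - alpha_n(s)|
    for the uniform empirical process alpha_n(t) = sqrt(n) * (F_n(t) - t)."""
    n = len(u)
    t = np.linspace(0.0, 1.0, grid + 1)
    Fn = np.searchsorted(np.sort(u), t, side="right") / n
    alpha = np.sqrt(n) * (Fn - t)
    k = max(1, int(a * grid))  # number of grid steps within distance a
    return max(float(np.max(np.abs(alpha[j:] - alpha[:-j]))) for j in range(1, k + 1))

n = 20000
a = n ** (-0.5)  # a_n -> 0 as n grows
u = rng.uniform(size=n)
ratio = oscillation_modulus(u, a) / np.sqrt(a * np.log(1.0 / a))
print(round(ratio, 2))
```

The ratio stays of order one, in accordance with the $O(\sqrt{a_n \ln a_n^{-1}})$ bound.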
6.4

Hoeffding's Inequality

Let $X_1, \ldots, X_n$ be independent random variables and put

$$S := \sum_{i=1}^n X_i, \qquad \bar S := S/n$$

and

$$\mu := E(\bar S) = n^{-1}\sum_{i=1}^n E(X_i).$$

By Markov's inequality applied to $e^{hS}$, for each $h > 0$,

$$P(\bar S - \mu \ge \eta) \le e^{-hn\eta - hE(S)} \prod_{i=1}^n E e^{hX_i}. \qquad (6.4.1)$$
Lemma. Let $X$ be a random variable with $a \le X \le b$. Then, for each $h$,

$$E e^{hX} \le \frac{b - E(X)}{b-a}\, e^{ha} + \frac{E(X) - a}{b-a}\, e^{hb}.$$

Proof. We have, by the convexity of the exponential function,

$$e^{hX} = e^{h\frac{b-X}{b-a}a + h\frac{X-a}{b-a}b} \le \frac{b-X}{b-a}\, e^{ha} + \frac{X-a}{b-a}\, e^{hb},$$

from which the assertion follows upon taking expectations.
For variables with $0 \le X_i \le 1$ this yields, for $0 < \mu < 1$ and $0 < \eta < 1 - \mu$,

$$P(\bar S \ge \mu + \eta) \le e^{-n\eta^2 g(\mu)} \le e^{-2n\eta^2},$$

where

$$g(\mu) = \begin{cases} \dfrac{1}{1-2\mu}\,\ln\dfrac{1-\mu}{\mu} & \text{for } 0 < \mu < \tfrac12 \\[2mm] \dfrac{1}{2\mu(1-\mu)} & \text{for } \tfrac12 \le \mu < 1. \end{cases}$$
Proof. By the above lemma, with $a = 0$ and $b = 1$,

$$\prod_{i=1}^n E e^{hX_i} \le \prod_{i=1}^n (1 - \mu_i + \mu_i e^h), \qquad \mu_i := E(X_i).$$

Now use the fact that the geometric mean is always less than or equal to the arithmetic mean. It follows that

$$\prod_{i=1}^n (1 - \mu_i + \mu_i e^h) \le \Big[\frac{1}{n}\sum_{i=1}^n (1 - \mu_i + \mu_i e^h)\Big]^n = \big[1 - \mu + \mu e^h\big]^n.$$

The resulting bound is minimized for

$$e^h = \frac{(1-\mu)(\mu + \eta)}{\mu(1 - \mu - \eta)}.$$
Inserting this $h$ we obtain the first inequality. For the second we note that the first inequality may be written as

$$P(\bar S \ge \mu + \eta) \le e^{-n\eta^2 G(\mu,\eta)}$$

with

$$G(\mu, \eta) = \frac{\mu+\eta}{\eta^2}\,\ln\Big(\frac{\mu+\eta}{\mu}\Big) + \frac{1-\mu-\eta}{\eta^2}\,\ln\Big(\frac{1-\mu-\eta}{1-\mu}\Big).$$

Now check

$$g(\mu) = \inf_{0 \le \eta \le 1-\mu} G(\mu, \eta).$$

Finally, $g(\mu) \ge 2$.
Hoeffding's inequality may be extended to the case when the bounds vary with $i$.

Theorem (Hoeffding). Let $X_1, \ldots, X_n$ be independent with $a_i \le X_i \le b_i$. Then

$$P(\bar S - \mu \ge \eta) \le \exp\Big[-\frac{2n^2\eta^2}{\sum_{i=1}^n (b_i - a_i)^2}\Big].$$

Proof. Omitted.
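The bound with varying ranges is easy to check numerically. A minimal sketch (Bernoulli variables and the particular $n$, $\eta$ are arbitrary illustrative choices), comparing a Monte Carlo tail estimate against the Hoeffding bound:

```python
import numpy as np

rng = np.random.default_rng(1)

def hoeffding_bound(n, eta, a, b):
    """P(S_bar - mu >= eta) <= exp(-2 n^2 eta^2 / sum (b_i - a_i)^2)
    for independent X_i with a_i <= X_i <= b_i."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.exp(-2.0 * n**2 * eta**2 / np.sum((b - a) ** 2)))

# Monte Carlo tail for Bernoulli(1/2) variables (a_i = 0, b_i = 1)
n, eta, reps = 50, 0.2, 20000
x = rng.integers(0, 2, size=(reps, n))
tail = float(np.mean(x.mean(axis=1) - 0.5 >= eta))
bound = hoeffding_bound(n, eta, np.zeros(n), np.ones(n))
print(tail <= bound, round(bound, 4))
```

The empirical tail lies well below the bound $e^{-4} \approx 0.018$, as it must.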
Chapter 7
Invariance Principles
7.1
For any function $f$ defined on $I = [0,1]$, the degree of smoothness may be measured through the oscillation modulus

$$\omega(f, \delta) = \sup_{|t_1 - t_2| \le \delta} |f(t_1) - f(t_2)|, \qquad (7.1.1)$$

while Lipschitz-continuity

$$|f(t_1) - f(t_2)| \le C|t_1 - t_2|$$

yields $\omega(f, \delta) \le C\delta$. A function $f$ is continuous on $I$ if and only if

$$\lim_{\delta \downarrow 0} \omega(f, \delta) = 0. \qquad (7.1.2)$$

Since

$$\omega(f, \delta_1) \le \omega(f, \delta_2) \quad\text{if } \delta_1 \le \delta_2,$$

(7.1.2) is equivalent to

$$\lim_{n\to\infty} \omega(f, \delta_n) = 0 \quad\text{for any sequence } \delta_n \downarrow 0. \qquad (7.1.3)$$
7.2
Gaussian Processes
Lemma 3.2.8 asserts that the fidis (finite-dimensional distributions) of $\alpha_n$ have normal limits. Processes whose fidis are normal not only in the limit are of special importance since they are candidates for limit processes.

Definition 7.2.1. A stochastic process $(S_t)_{0\le t\le 1}$ is called a Gaussian Process if all fidis are normal distributions.
Normal distributions are uniquely determined through
m(t) = ESt the mean function
and
K(s, t) = Cov(Ss , St ) the covariance function.
If $m(t) = 0$ for all $t$, $(S_t)_t$ is called a centered Gaussian Process. Recall that $K$ is symmetric:

$$K(s,t) = K(t,s)$$

and nonnegative definite: for any $t_1, \ldots, t_k$ and $\lambda_1, \ldots, \lambda_k \in \mathbb{R}$:

$$\sum_{i=1}^k \sum_{j=1}^k \lambda_i\, K(t_i, t_j)\, \lambda_j \ge 0.$$
It can be shown that for any K satisfying these two properties there exists
a centered Gaussian process (St )t such that
Cov(Ss , St ) = K(s, t).
We now introduce the most famous Gaussian Process: the Brownian Motion. It belongs to the covariance function

$$K(s,t) = \min(s,t).$$

Clearly, $K$ is symmetric. For $0 \le t_1 < t_2 < \ldots < t_k$ let $X_1, \ldots, X_k$ be independent normal random variables such that

$$X_1 \sim N(0, t_1),\ X_2 \sim N(0, t_2 - t_1),\ \ldots,\ X_k \sim N(0, t_k - t_{k-1}).$$

Put

$$S_1 = X_1,\quad S_2 = X_1 + X_2,\ \ldots,\ S_k = X_1 + \ldots + X_k.$$

For $i \le j$,

$$E[S_i S_j] = E\Big[S_i\Big(S_i + \sum_{k=i+1}^{j} X_k\Big)\Big] = E S_i^2 = t_i = \min(t_i, t_j).$$

Hence $K(t_i, t_j)$ is a covariance function and therefore nonnegative definite.
Theorem 7.2.2. Let $B = (B_t)_{0\le t\le 1}$ be a Gaussian Process with

$$EB_t = 0 \quad\text{and}\quad \mathrm{Cov}(B_s, B_t) = \min(s,t).$$

Then we have:

(i) The process $B$ exists

(ii) $B$ has continuous sample paths

(iii) $B_0 = 0$ w.p. 1

(iv) $B$ has independent increments with $B_t - B_s \sim N(0, t-s)$ for $0 \le s \le t \le 1$

(v) $E\big[(B_t - B_s)^2\big] = t - s$ for $0 \le s \le t \le 1$.
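The construction preceding Theorem 7.2.2 translates directly into a simulation. The following sketch (grid points and replication count are arbitrary choices) builds $B$ from independent Gaussian increments and checks (v) and the covariance numerically:

```python
import numpy as np

rng = np.random.default_rng(2)

# Brownian motion at t = 0.25, 0.5, 0.75, 1.0 via independent increments,
# X_i ~ N(0, t_i - t_{i-1}), S_i = X_1 + ... + X_i
k, reps = 4, 100000
t = np.array([0.25, 0.5, 0.75, 1.0])
dt = np.diff(np.concatenate(([0.0], t)))
X = rng.normal(0.0, np.sqrt(dt), size=(reps, k))
B = np.cumsum(X, axis=1)

# (v): E[(B_t - B_s)^2] = t - s, here t = 1.0, s = 0.5
msq = float(np.mean((B[:, 3] - B[:, 1]) ** 2))
# covariance: Cov(B_s, B_t) = min(s, t), here s = 0.25, t = 0.75
cov = float(np.mean(B[:, 0] * B[:, 2]))
print(round(msq, 2), round(cov, 2))
```

Both Monte Carlo averages match $t - s = 0.5$ and $\min(s,t) = 0.25$ up to sampling error.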
7.3
Brownian Motion
Lemma 7.3.1. Fix $c > 0$. Then

$$\tilde B_t = c^{-1/2}\, B_{ct}, \qquad t \ge 0,$$

is a Brownian Motion.

Proof. $(\tilde B_t)_t$ is a centered Gaussian process with covariance

$$\mathrm{Cov}(\tilde B_s, \tilde B_t) = c^{-1}\,\mathrm{Cov}(B_{cs}, B_{ct}) = c^{-1}\, cs = s \quad\text{for } 0 \le s \le t.$$
Lemma 7.3.2. Fix $0 < t_0$. Then

$$\tilde B_t = B_{t_0+t} - B_{t_0}, \qquad t \ge 0,$$

is a Brownian Motion.

Proof. $\tilde B$ is a centered Gaussian process with covariance

$$\mathrm{Cov}(\tilde B_s, \tilde B_t) = E\big[(B_{t_0+s} - B_{t_0})(B_{t_0+t} - B_{t_0})\big] = t_0 + s - t_0 - t_0 + t_0 = s \quad\text{for } 0 \le s \le t.$$
The following process is of central importance in the theory of Empirical Processes.

Lemma 7.3.3. Let $B = (B_t)_{0\le t\le 1}$ be a Brownian Motion on the unit interval. Put

$$B_t^0 = B_t - tB_1, \qquad 0 \le t \le 1.$$

Then $(B_t^0)_{0\le t\le 1}$ is a centered Gaussian process with covariance

$$\mathrm{Cov}(B_s^0, B_t^0) = s - st \quad\text{for all } 0 \le s \le t \le 1.$$

Proof.

$$\mathrm{Cov}(B_s^0, B_t^0) = E\big[(B_s - sB_1)(B_t - tB_1)\big] = s - st - st + st = s - st.$$
Remark 7.3.4. Note that $\alpha_n$ and $B^0$ are both centered and have the same covariance structure. Also

$$\alpha_n(0) = B_0^0 = 0 = \alpha_n(1) = B_1^0.$$

Therefore $B^0$ is called a Brownian Bridge.
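Lemma 7.3.3 is again easy to verify by simulation. A minimal sketch (grid size, replication count and the pair $(s,t)$ are illustrative choices) constructs $B^0_t = B_t - tB_1$ and estimates its covariance:

```python
import numpy as np

rng = np.random.default_rng(3)

# Brownian bridge B^0_t = B_t - t * B_1 built from a Brownian motion B
reps, m = 50000, 50
dt = 1.0 / m
B = np.cumsum(rng.normal(0.0, np.sqrt(dt), size=(reps, m)), axis=1)
t_grid = np.arange(1, m + 1) * dt
B0 = B - t_grid * B[:, -1:]

s, t = 0.3, 0.7
i, j = int(s * m) - 1, int(t * m) - 1
cov = float(np.mean(B0[:, i] * B0[:, j]))
print(round(cov, 2))  # should be close to s - s*t = 0.09
```

The estimate agrees with $s - st$, and by construction $B^0_1 = 0$ on every path.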
7.4
The classical Central Limit Theorem asserts that, for i.i.d. $X_1, X_2, \ldots$ with finite variance,

$$\lim_{n\to\infty} P\Big((n\sigma^2)^{-1/2}\sum_{i=1}^n (X_i - EX_1) \le x\Big) = \Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^x e^{-y^2/2}\, dy,$$

where $\sigma^2 = \mathrm{Var}\, X_1$. In other words, if $G_n$ denotes the d.f. of $(n\sigma^2)^{-1/2}\sum_{i=1}^n (X_i - EX_1)$, then $G_n$ converges pointwise to $\Phi$.

For processes, we say that $S_n \to S$ in distribution if and only if

$$\lim_{n\to\infty} P(S_n \in A) = P(S \in A) \quad\text{for all } A \subset \mathcal{X} \text{ such that } P(S \in \partial A) = 0. \qquad (7.4.2)$$
Then we have: $S_n \to S$ in distribution.
In many applications, the limit process will be Gaussian, and the convergence of the fidis follows from an application of the multivariate CLT. One possibility to check (7.4.2) is to show that (7.1.3) holds uniformly in $n$, with the same $K$, $a$ and $b$.
The following Theorem constitutes the counterpart of Theorem 7.4.2 for the
case D[0, 1].
Theorem 7.4.3. Let $(S_n(t))_{0\le t\le 1}$ and $(S(t))_{0\le t\le 1}$ be stochastic processes in $D[0,1]$ such that

$S$ has continuous sample paths, (7.4.3)

the fidis of $S_n$ converge to the fidis of $S$, (7.4.4)

and, for each $\varepsilon > 0$,

$$\lim_{a\downarrow 0}\ \limsup_{n\to\infty}\ P\big(\omega(S_n, a) \ge \varepsilon\big) = 0. \qquad (7.4.5)$$

Then: $S_n \to S$ in distribution.
The last Theorem covers an important special case, namely that the limit $S$ is continuous though for each $n \ge 1$ the process $S_n$ may have jumps. This is exactly the case for the uniform empirical process $\alpha_n$. In view of Lemma 3.2.8 the candidate for the limit is the Brownian Bridge $S = B^0$. While this process has continuous sample paths, $\alpha_n$ has $n$ discontinuities. Note, however, that the jump size $n^{-1/2}$ tends to zero so that, as $n \to \infty$,

the number of jumps tends to infinity

the jump sizes tend to zero

the limit process is continuous.
In many cases the continuity of $S$ is obtained through an application of Theorem 7.1.1, while the convergence of the fidis will again follow from an application of the multivariate CLT. Verification of (7.4.5) typically requires more work.

In the literature the convergence of $S_n$ to some limit $S$ is often called an Invariance Principle. The reason for this becomes clear if one recalls the behavior of properly standardized partial sums: the associated partial sum process satisfies

$$S_n \to B \quad\text{in distribution},$$

where $B$ is the Brownian Motion.
Theorem 7.4.5 (Donsker). Let $U_1, \ldots, U_n$ be i.i.d. from $U[0,1]$. Denote with $\alpha_n$ the uniform empirical process. Then

$$\alpha_n \xrightarrow{\mathcal{L}} B^0,$$

where $B^0$ is the Brownian Bridge.

Proof. The assertion follows from Theorem 7.4.3 upon applying Lemma 3.2.8 and Theorem 6.3.3.
There is an important technique which shows how useful invariance principles may be. This is based on the so-called Continuous Mapping Theorem (CMT).
Consider

$$D_n = \sup_{t\in\mathbb{R}} \big(F_n(t) - F(t)\big) \quad\text{and}\quad \bar D_n = \sup_{t\in\mathbb{R}} |F_n(t) - F(t)|,$$

as well as their standardized versions

$$D_n = n^{1/2}\sup_{t\in\mathbb{R}} \big(F_n(t) - F(t)\big) \quad\text{and}\quad \bar D_n = n^{1/2}\sup_{t\in\mathbb{R}} |F_n(t) - F(t)|.$$

For a continuous $F$, the Continuous Mapping Theorem yields

$$D_n \xrightarrow{\mathcal{L}} \sup_{0\le t\le 1} B^0(t). \qquad (7.4.6)$$

Similarly,

$$\bar D_n \xrightarrow{\mathcal{L}} \sup_{0\le t\le 1} |B^0(t)|. \qquad (7.4.7)$$

We see that three convergence results easily follow from one basic result: the invariance principle for the underlying process $\alpha_n$.

Another example is the Cramér-von Mises statistic

$$\mathrm{CvM}_n = n\int_0^1 \big[F_n(t) - t\big]^2\, dt = \int_0^1 \alpha_n^2(t)\, dt \xrightarrow{\mathcal{L}} \int_0^1 [B^0(t)]^2\, dt.$$
The distributions of the limits in (7.4.6) and (7.4.7) are known and tabulated.
Lemma 7.4.7. For $x \ge 0$ we have

$$P\Big(\sup_{0\le t\le 1} B^0(t) \ge x\Big) = \exp[-2x^2]$$

$$P\Big(\sup_{0\le t\le 1} |B^0(t)| \ge x\Big) = 2\sum_{k=1}^\infty (-1)^{k+1} \exp[-2k^2x^2] \equiv K(x). \qquad (7.4.8)$$
Equation (7.4.8) is the basis for the classical Kolmogorov-Smirnov Goodness-of-Fit Test. The null hypothesis to be tested is

$$H_0: F = F_0,$$

where $F_0$ is a fully specified distribution (simple hypothesis). Consider

$$\bar D_n := n^{1/2}\sup_{t\in\mathbb{R}} |F_n(t) - F_0(t)|.$$

For a given significance level $0 < \alpha < 1$, choose $c > 0$ so that $K(c) = \alpha$. Reject $H_0$ iff $\bar D_n \ge c$. From (7.4.8) we get that asymptotically the error of the first kind equals $\alpha$.

The same procedure works for $\mathrm{CvM}_n$. In this case the critical value $c$ needs to be taken from the distribution of $\int_0^1 [B^0(t)]^2\, dt$, which is tabulated.
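The test is easy to implement. The following sketch (sample size and seed are illustrative; the series is truncated at a fixed number of terms) computes $\bar D_n$ for a uniform null and evaluates the limiting tail probability $K(x)$ from (7.4.8):

```python
import numpy as np

rng = np.random.default_rng(4)

def kolmogorov_tail(x, terms=100):
    """K(x) = P(sup |B^0(t)| >= x) = 2 * sum_{k>=1} (-1)^{k+1} exp(-2 k^2 x^2)."""
    k = np.arange(1, terms + 1)
    return float(2.0 * np.sum((-1.0) ** (k + 1) * np.exp(-2.0 * k**2 * x**2)))

def ks_statistic(sample, F0):
    """D_n-bar = sqrt(n) * sup_t |F_n(t) - F0(t)| for a continuous F0."""
    x = np.sort(sample)
    n = len(x)
    u = F0(x)
    d_plus = np.max(np.arange(1, n + 1) / n - u)
    d_minus = np.max(u - np.arange(0, n) / n)
    return float(np.sqrt(n) * max(d_plus, d_minus))

sample = rng.uniform(size=500)        # data generated under H0
Dn = ks_statistic(sample, lambda x: x)
p_value = kolmogorov_tail(Dn)         # asymptotic p-value; reject iff p <= alpha
print(round(Dn, 3), round(p_value, 3))
```

Under $H_0$ the p-value is approximately uniformly distributed; under an alternative, $\bar D_n$ diverges and the p-value collapses to zero.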
7.5
where at this moment we leave it open how the standardizing factor looks like. To determine critical values one needs to know or at least approximate the distribution of $D_{nm}$ under $H_0$. Now, under $H_0$ and after transforming the data to the unit interval,

$$D_{nm} = \sup_{0\le t\le 1} |\tilde F_n(t) - \tilde G_m(t)|.$$

Put

$$\alpha_n(t) = n^{1/2}\big[\tilde F_n(t) - t\big], \qquad \beta_m(t) = m^{1/2}\big[\tilde G_m(t) - t\big],$$

and $N = n + m$. Then

$$N^{1/2}\big[\tilde F_n(t) - \tilde G_m(t)\big] = N^{1/2}\big[n^{-1/2}\alpha_n(t) - m^{-1/2}\beta_m(t)\big].$$

Assuming that

$$\frac{n}{N} \to \lambda \quad\text{for some } 0 < \lambda < 1,$$

we get from Donsker that

$$N^{1/2}\big[\tilde F_n - \tilde G_m\big] \xrightarrow{\mathcal{L}} \lambda^{-1/2} B^0 - (1-\lambda)^{-1/2} B^1 \equiv \bar B,$$

where $B^0$ and $B^1$ are two independent Brownian Bridges. For $s \le t$,

$$K(s,t) = E\big[\bar B(s)\bar B(t)\big] = \lambda^{-1} s(1-t) + (1-\lambda)^{-1} s(1-t) = \Big[\frac{1}{\lambda} + \frac{1}{1-\lambda}\Big]\, s(1-t) = \frac{s(1-t)}{\lambda(1-\lambda)}.$$

This gives us a clue how to choose the standardizing factor in the definition of $D_{nm}$. Put

$$\gamma_N = \sqrt{\lambda_N(1-\lambda_N)}\, N^{1/2}\, \big[\tilde F_n - \tilde G_m\big], \qquad \lambda_N = \frac{n}{N}.$$

Then

$$\gamma_N \to \tilde B \quad\text{in distribution},$$

where $\tilde B$ is a Brownian Bridge. Hence $H_0$ is rejected at level $\alpha$ iff

$$D_{nm} = \sqrt{\frac{nm}{N}}\ \sup_{t\in\mathbb{R}} |F_n(t) - G_m(t)| \ge c,$$

where $c$ is the $(1-\alpha)$-quantile of $K$ in Lemma 7.4.7.
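The standardized two-sample statistic takes only a few lines. A minimal sketch (distributions, sample sizes and seed are illustrative choices) that also shows the behavior under a shift alternative:

```python
import numpy as np

rng = np.random.default_rng(5)

def two_sample_ks(x, y):
    """D_{nm} = sqrt(n*m/N) * sup_t |F_n(t) - G_m(t)|, with N = n + m."""
    n, m = len(x), len(y)
    pooled = np.sort(np.concatenate([x, y]))
    Fn = np.searchsorted(np.sort(x), pooled, side="right") / n
    Gm = np.searchsorted(np.sort(y), pooled, side="right") / m
    return float(np.sqrt(n * m / (n + m)) * np.max(np.abs(Fn - Gm)))

x = rng.normal(0.0, 1.0, size=300)
y_null = rng.normal(0.0, 1.0, size=200)   # H0: F = G
y_alt = rng.normal(1.5, 1.0, size=200)    # shifted alternative
print(round(two_sample_ks(x, y_null), 2), round(two_sample_ks(x, y_alt), 2))
```

Under $H_0$ the statistic is asymptotically distributed as $\sup|B^0|$; under the alternative it diverges.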
In our next example we use empirical process theory to design tests for symmetry of $F$ at zero:

$$H_0: F(x) + F(-x) = 1 \quad\text{for all } x \in \mathbb{R}. \qquad (7.5.1)$$

Under $H_0$ the associated empirical process satisfies

$$\tilde\alpha_n \xrightarrow{\mathcal{L}} \tilde B, \qquad\text{with}\qquad \tilde B(t) = B^0(t) + B^0(1-t), \quad 0 \le t \le 1.$$

Note that

$$\tilde B(t) = R(2t-1), \qquad 0 \le t \le 1.$$

Since the process

$$\{\tilde\alpha_n(t): 0 \le t \le 1\} \to \{R(2t-1): 0 \le t \le 1\},$$

the CMT implies

$$\sup_{0\le t\le 1} |\tilde\alpha_n(t)| \xrightarrow{\mathcal{L}} \sup_{0\le t\le 1} |R(2t-1)|.$$
7.6
So far invariance principles were discussed for processes where the parameter set was identical with the scale of the observations $X_1, \ldots, X_n$. In this section we study a typical situation in parametric statistics. The example is taken from regression. The observations $(X_i, Y_i)$, $1 \le i \le n$, are independent from the same unknown d.f. $F$. The $X$'s take their values in $\mathbb{R}^d$, while the $Y$'s are real-valued. If $E|Y_i| < \infty$, each $Y_i$ may be decomposed into

$$Y_i = m(X_i) + \varepsilon_i,$$

where $E[\varepsilon_i | X_i] = 0$. The function $m$ is the regression function of $Y$ w.r.t. $X$ and is also unknown. Very often one assumes that $m$ belongs to a parametric family $\mathcal{M} = \{m(\cdot, \theta): \theta \in \Theta\}$, $\Theta \subset \mathbb{R}^p$, of functions, i.e., one assumes that the true $m$ satisfies

$$m(x) = m(x, \theta_0) \quad\text{for some } \theta_0 \in \Theta \text{ and all } x \in \mathbb{R}^d. \qquad (7.6.1)$$

The least squares method estimates the parameter through

$$\theta_n = \arg\min_\theta\ n^{-1}\sum_{i=1}^n \big[Y_i - m(X_i, \theta)\big]^2.$$
The population criterion is

$$E\big[Y - m(X,\theta)\big]^2 = EY^2 - \int m^2(x)\, F_0(dx) + \int \big[m(x) - m(x,\theta)\big]^2\, F_0(dx), \qquad (7.6.2)$$

where $F_0$ denotes the distribution of $X$, while

$$D^2(\theta) = \int \big[m(x) - m(x,\theta)\big]^2\, F_0(dx). \qquad (7.6.3)$$

Since $EY^2 - \int m^2(x)\, F_0(dx)$ does not depend on $\theta$, (7.6.2) and (7.6.3) only differ by a constant. Hence they have the same minimizer, say $\theta_0$. In other words,

$$\theta_0 = \arg\min_\theta \int \big[m(x) - m(x,\theta)\big]^2\, F_0(dx).$$

Write

$$m'(x,\theta) = \frac{\partial m(x,\theta)}{\partial\theta} \quad\text{and}\quad \frac{\partial^2 m(x,\theta)}{\partial\theta_i\,\partial\theta_j} \equiv m_{ij}(x,\theta) \equiv (M(x,\theta))_{ij}, \qquad 1 \le i,j \le p. \qquad (7.6.4)$$
As we shall see, second order differentiability is indispensable for the understanding of $\theta_n$ if $H_0$ is not satisfied. All vectors will be column vectors, and $T$ will denote their transpose. Recall $\theta_0$. A crucial role in our analysis will be played by the $p \times p$ matrix $A$, the second derivative of $\tfrac12 D^2(\theta)$ at $\theta_0$. One obtains the i.i.d. representation

$$n^{1/2}(\theta_n - \theta_0) = n^{-1/2} A^{-1} \sum_{i=1}^n \big[Y_i - m(X_i, \theta_0)\big]\, m'(X_i, \theta_0) + o_P(1). \qquad (7.6.6)$$

The summands on the right-hand side are centered with, under independence of $X_i$ and $\varepsilon_i$, covariance matrix

$$\Sigma = \mathrm{Var}(\varepsilon)\, E\big[m'(X, \theta_0)[m'(X, \theta_0)]^T\big] \qquad (7.6.7)$$
$$\qquad + E\big[(m(X) - m(X, \theta_0))^2\, m'(X, \theta_0)[m'(X, \theta_0)]^T\big]. \qquad (7.6.8)$$

The second term vanishes under $H_0$.

Equation (7.6.6) immediately yields the asymptotic (multivariate) normality of $n^{1/2}(\theta_n - \theta_0)$:

$$n^{1/2}(\theta_n - \theta_0) \to N(0, A^{-1}\Sigma A^{-1}) \quad\text{in distribution}.$$
Under $H_0$ the covariance simplifies to

$$A^{-1}\Sigma A^{-1} = \mathrm{Var}(\varepsilon)\, A^{-1}\, E\big[m'(X,\theta_0)[m'(X,\theta_0)]^T\big]\, A^{-1}.$$

For the derivation of (7.6.6), start from the normal equations

$$0 = \sum_{i=1}^n \big(Y_i - m(X_i, \theta_n)\big)\, m'(X_i, \theta_n). \qquad (7.6.9)$$

Expand the right-hand side around $\theta_0$, with intermediate points $\bar\theta_n, \tilde\theta_n$ tending to $\theta_0$ along with $\theta_n$. By the assumed continuity of $M$ as a function of $\theta$ and a uniform version of the SLLN, equation (7.6.9) may be rewritten as

$$n^{-1/2}\sum_{i=1}^n \big(Y_i - m(X_i,\theta_0)\big)\, m'(X_i,\theta_0) + o_P(1) = [A + o_P(1)]\; n^{1/2}(\theta_n - \theta_0).$$
Note that because $\theta_0$ is the minimizer of $\int [m(x) - m(x,\theta)]^2\, F_0(dx)$ lying in the interior of $\Theta$, the variables

$$\big[m(X_i, \theta_0) - m(X_i)\big]\, m'(X_i, \theta_0), \qquad 1 \le i \le n,$$

are centered. Solving for $n^{1/2}(\theta_n - \theta_0)$, we arrive at

$$n^{1/2}(\theta_n - \theta_0) = n^{-1/2} A^{-1} \sum_{i=1}^n \big[Y_i - m(X_i, \theta_0)\big]\, m'(X_i, \theta_0) + o_P(1), \qquad (7.6.10)$$

which is (7.6.6).
Our discussion exhibits, for the first time, that an assumed model may be incorrectly specified. Therefore $\theta_n$ needs to be studied under both $m \in \mathcal{M}$ and $m \notin \mathcal{M}$.

To derive the i.i.d. representation we used arguments which will appear, in modified form, also in other situations. First, regularity (or smoothness) of the model is important: Taylor's expansion provides a local linearization at $\theta_0$. This $\theta_0$ is the minimizer of the squared distance

$$D^2(\theta) = \int \big[m(x) - m(x,\theta)\big]^2\, F_0(dx)$$

between $m$ and $\mathcal{M}$. In this way $m(\cdot, \theta_0)$ is the projection of $m$ onto $\mathcal{M}$.
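The projection interpretation of $\theta_0$ under misspecification can be illustrated numerically. In the following sketch (the linear family, the true quadratic $m$ and all constants are illustrative choices) the family is $m(x,\theta) = \theta x$ while the true regression function is $m(x) = x^2$, so for $X \sim U[0,1]$ the projection parameter is $\theta_0 = EX^3/EX^2 = 3/4$:

```python
import numpy as np

rng = np.random.default_rng(6)

# Misspecified model: true m(x) = x^2, family m(x, theta) = theta * x.
# theta_0 minimizes int (m(x) - theta*x)^2 F0(dx) = E[(X^2 - theta*X)^2],
# i.e. theta_0 = E[X^3]/E[X^2] = (1/4)/(1/3) = 0.75 for X ~ U[0,1].
n = 200000
x = rng.uniform(size=n)
y = x**2 + rng.normal(0.0, 0.1, size=n)   # Y = m(X) + eps, E[eps|X] = 0
theta_hat = float(np.sum(x * y) / np.sum(x**2))  # least squares in the family
print(round(theta_hat, 2))
```

The least squares estimator converges to the projection parameter $3/4$, not to any "true" $\theta$, since none exists in the family.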
Chapter 8
Empirical Measures: A
Dynamic Approach
8.1
In many applied fields researchers are interested in the time $X$ elapsed until a certain event occurs. To give only a few examples, $X$ could be

the survival time of a patient, i.e., the time from surgery to death

the disease-free survival time, e.g., the time elapsed from a first chemotherapy until reappearance of metastases

the appearance of the first customer in a service station

the time a bond defaults

the breakdown of a technical unit.

Since $X$ very often is a random variable, we may associate with it the process

$$S_t = 1_{\{X \le t\}}, \qquad t \in \mathbb{R}.$$
8.2
A sequence $(S_i)_{0\le i\le k}$ adapted to a filtration $(\mathcal{F}_i)_{0\le i\le k}$ is called a martingale if

$$E[S_{i+1} \mid \mathcal{F}_i] = S_i \quad P\text{-almost surely for all } i = 0, \ldots, k-1.$$

Being a martingale thus means that the best predictor of $S_{i+1}$ is just the previous value $S_i$.

Of course it would be very naive to believe that martingales are all around and the last observation would always be an appropriate (and simple) choice for a predictor. Though, as the next important result will show us, we are at least very close to such a situation.
Lemma 8.2.2 (Doob-Meyer). Let $(S_i)_{0\le i\le k}$ be adapted to $(\mathcal{F}_i)_{0\le i\le k}$. Then, for a given initial value $a$, there is a unique decomposition (almost surely) of $S_i$ into

$$S_i = A_i + M_i, \qquad i = 0, 1, \ldots, k$$

such that

$(M_i)_i$ is a martingale

$(A_i)_i$ is predictable

$A_0 = a$.

For existence, put $M_0 = S_0 - a$ and, for $i \ge 1$,

$$A_i = A_{i-1} + E[S_i - S_{i-1} \mid \mathcal{F}_{i-1}], \qquad M_i = S_i - A_i.$$

Clearly, by induction,

$$S_i = A_i + M_i, \qquad i = 0, 1, \ldots$$

As to uniqueness, if $S_i = \tilde A_i + \tilde M_i$ is another such decomposition, then

$$E\big[A_i - \tilde A_i \mid \mathcal{F}_{i-1}\big] = E\big[\tilde M_i - M_i \mid \mathcal{F}_{i-1}\big].$$

Since $A_i$ and $\tilde A_i$ are measurable w.r.t. $\mathcal{F}_{i-1}$ and $M_i$ and $\tilde M_i$ are martingales, the last equation yields

$$A_i - \tilde A_i = \tilde M_{i-1} - M_{i-1} = A_{i-1} - \tilde A_{i-1}. \qquad (8.2.1)$$
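The construction in the Doob-Meyer lemma is elementary to simulate. A minimal sketch (Bernoulli increments, the parameters $p$, $k$ and the seed are illustrative choices): for $S_i = X_1 + \ldots + X_i$ with i.i.d. Bernoulli($p$) increments, $E[S_i - S_{i-1} \mid \mathcal{F}_{i-1}] = p$, so the compensator is $A_i = ip$ (taking $a = 0$) and $M_i = S_i - ip$ is a martingale:

```python
import numpy as np

rng = np.random.default_rng(14)

# Doob-Meyer for a Bernoulli random walk: compensator A_i = i*p,
# martingale part M_i = S_i - i*p with E M_i = 0 for every i.
p, k, reps = 0.3, 20, 200000
X = (rng.uniform(size=(reps, k)) < p).astype(float)
S = np.cumsum(X, axis=1)
M = S - p * np.arange(1, k + 1)
max_abs_mean = float(np.abs(M.mean(axis=0)).max())
print(round(max_abs_mean, 3))
```

All time-wise means of $M$ stay near zero, as the martingale property demands.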
8.3

We now compute the Doob-Meyer decomposition of the single-event process observed on a grid $t_1 < t_2 < \ldots$, i.e., of $S_i = 1_{\{X \le t_i\}}$. The conditional increments equal

$$E[S_i - S_{i-1} \mid \mathcal{F}_{i-1}] = 1_{\{X > t_{i-1}\}}\; \frac{F(t_i) - F(t_{i-1})}{1 - F(t_{i-1})}.$$

Proof. Write

$$S_i = 1_{\{X\le t_i\}} = 1_{\{X\le t_i\}}\, 1_{\{X\le t_{i-1}\}} + 1_{\{X\le t_i\}}\, 1_{\{X > t_{i-1}\}} = 1_{\{X\le t_{i-1}\}} + 1_{\{t_{i-1} < X \le t_i\}}.$$

The first summand is measurable w.r.t. $\mathcal{F}_{i-1}$. For the second summand we get

$$E\big[1_{\{t_{i-1} < X \le t_i\}} \mid \mathcal{F}_{i-1}\big] = 1_{\{t_{i-1} < X\}}\; \frac{\int_{\{t_{i-1} < X\}} 1_{\{t_{i-1} < X \le t_i\}}\, dP}{P(t_{i-1} < X)} = 1_{\{X > t_{i-1}\}}\; \frac{F(t_i) - F(t_{i-1})}{1 - F(t_{i-1})}.$$

Hence the compensator and the martingale part of $(S_i)_i$ (with $a = 0$) are

$$A_k = \sum_{i=1}^k 1_{\{X > t_{i-1}\}}\; \frac{F(t_i) - F(t_{i-1})}{1 - F(t_{i-1})}, \qquad M_k = 1_{\{X \le t_k\}} - A_k.$$
From this we may get an idea how $M$ and $A$ look like when all processes are considered in continuous time. Just let the grid $\{t_i\}$ of points get finer and finer. If we keep $t_k = t$ fixed, then $A_k$ converges to

$$A_t = \int_{(-\infty,t]} \frac{1_{\{x \le X\}}}{1 - F(x-)}\; F(dx),$$

so that

$$M_t = 1_{\{X \le t\}} - \int_{(-\infty,t]} \frac{1_{\{x \le X\}}}{1 - F(x-)}\; F(dx).$$

Remark 8.3.3. Note that $(S_t)_t$ and $(A_t)_t$ are nondecreasing. The martingale $(M_t)_t$ is trend-free and satisfies

$$E(M_t) = 0 \quad\text{for all } t \in \mathbb{R}.$$
The measure

$$d\Lambda = \frac{dF}{1 - F_-}, \qquad\text{i.e.}\qquad \Lambda(t) = \int_{(-\infty,t]} \frac{F(dx)}{1 - F(x-)},$$

is called the cumulative hazard function of $F$. For a continuous $F$,

$$\Lambda(t) = \int_{(-\infty,t]} \frac{F(dx)}{1 - F(x)} = -\ln(1 - F(t)),$$

whence

$$1 - F(t) = \exp(-\Lambda(t)). \qquad (8.3.1)$$
When $F$ has a Lebesgue density $f$, then

$$\Lambda(t) = \int_{-\infty}^t \frac{f(x)}{1 - F(x)}\, dx.$$

The function

$$\lambda(x) = \frac{f(x)}{1 - F(x)}, \qquad x \in \mathbb{R},$$

is called the hazard rate of $F$, and the compensator becomes

$$A_t = \int_{(-\infty,t]} 1_{\{X \ge x\}}\; \lambda(x)\, dx.$$

The hazard rate admits the interpretation

$$\lambda(x) = \lim_{\Delta\downarrow 0} \frac{P(x < X \le x + \Delta \mid X > x)}{\Delta}.$$

Hence, for small $\Delta$, $\lambda(x)\Delta$ approximates the conditional probability of a default shortly after $x$ given survival up to $x$. For $F = U[0,1]$, for example,

$$\lambda(x) = \frac{1}{1-x} \quad\text{on } 0 < x < 1.$$

Given that $X$ has exceeded $x$, the intensity for a default in the near future increases to infinity at a hyperbolic rate.
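The single-event martingale can be checked by simulation. In the uniform case the compensator is $\int_0^{X\wedge t} (1-x)^{-1}\,dx = -\ln(1 - X\wedge t)$, and by Remark 8.3.3 we must have $EM_t = 0$; for continuous $F$ the variance of $M_t$ equals $F(t)$. A minimal sketch (sample size, $t$ and seed are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)

# M_t = 1{X<=t} - integral_0^{X /\ t} dx/(1-x) for X ~ U[0,1];
# E M_t = 0 and, for continuous F, Var(M_t) = F(t).
reps, t = 200000, 0.6
X = rng.uniform(size=reps)
compensator = -np.log(1.0 - np.minimum(X, t))
M = (X <= t).astype(float) - compensator
print(round(float(np.mean(M)), 3), round(float(np.var(M)), 3))
```

The mean is near $0$ and the variance near $F(0.6) = 0.6$, anticipating the variance formula of the next section.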
8.4

For a sample $X_1, \ldots, X_n$ from $F$, averaging the individual martingales yields

$$M_n(t) = F_n(t) - \int_{(-\infty,t]} \frac{1 - F_n(x-)}{1 - F(x-)}\; F(dx). \qquad (8.4.1)$$

The process $M_n$ is a martingale with covariance

$$\mathrm{Cov}(M_n(s), M_n(t)) = \frac{1}{n}\,\sigma^2(s) \quad\text{for } s \le t,$$

with

$$\sigma^2(s) = F(s) - \int_{(-\infty,s]} \frac{F\{x\}}{1 - F(x-)}\; F(dx).$$
Proof. It suffices to consider $n = 1$ and to compute $EM^2(s)$:

$$EM^2(s) = E 1^2_{\{X\le s\}} - 2E\Big[1_{\{X\le s\}}\int_{(-\infty,s]} \frac{1_{\{x\le X\}}}{1 - F(x-)}\, F(dx)\Big] + E\Big[\Big(\int_{(-\infty,s]} \frac{1_{\{x\le X\}}}{1 - F(x-)}\, F(dx)\Big)^2\Big].$$

For the middle term,

$$E\Big[1_{\{X\le s\}}\int_{(-\infty,s]} \frac{1_{\{x\le X\}}}{1 - F(x-)}\, F(dx)\Big] = \int_{(-\infty,s]} \frac{E 1_{\{x \le X \le s\}}}{1 - F(x-)}\, F(dx) = \int_{(-\infty,s]} \frac{F(s) - F(x-)}{1 - F(x-)}\, F(dx).$$

Finally,

$$E\Big[\Big(\int_{(-\infty,s]} \frac{1_{\{x\le X\}}}{1 - F(x-)}\, F(dx)\Big)^2\Big] = \int_{(-\infty,s]}\int_{(-\infty,s]} \frac{E\big[1_{\{x\le X\}} 1_{\{y\le X\}}\big]}{(1 - F(x-))(1 - F(y-))}\, F(dx)F(dy)$$
$$= \int_{(-\infty,s]}\int_{(-\infty,s]} \frac{1 - F(\max(x,y)-)}{(1 - F(x-))(1 - F(y-))}\, F(dx)F(dy).$$

The double integral is split into two pieces: $y \le x \le s$ and $x < y \le s$. This leads to

$$\int\!\!\int 1_{\{y\le x\}}\, \frac{1 - F(x-)}{(1 - F(x-))(1 - F(y-))}\, F(dx)F(dy) = \int\!\!\int \frac{1_{\{y\le x\}}}{1 - F(y-)}\, F(dx)F(dy) = \int_{(-\infty,s]} \frac{F(s) - F(y-)}{1 - F(y-)}\, F(dy)$$

and

$$\int\!\!\int 1_{\{x<y\le s\}}\, \frac{1 - F(y-)}{(1 - F(x-))(1 - F(y-))}\, F(dx)F(dy) = \int\!\!\int \frac{1_{\{x<y\le s\}}}{1 - F(x-)}\, F(dy)F(dx) = \int_{(-\infty,s]} \frac{F(s) - F(x)}{1 - F(x-)}\, F(dx).$$

Collecting terms, we obtain

$$\sigma^2(s) = \int_{(-\infty,s]} \Big[1 - \frac{F\{x\}}{1 - F(x-)}\Big]\, F(dx) = \int_{(-\infty,s]} \big[1 - \Lambda\{x\}\big]\, F(dx).$$

Since

$$F\{x\} = P(X = x) \le P(X \ge x) = 1 - F(x-),$$

we get $\Lambda\{x\} \le 1$. Conclude that $\sigma^2$ is nondecreasing.
As a final comment, if $F$ is continuous, then

$$M_n(x) = \tilde M_n(F(x))$$

with

$$\tilde M_n(t) = \tilde F_n(t) - \int_{[0,t]} \frac{1 - \tilde F_n(u-)}{1 - u}\, du, \qquad 0 \le t \le 1,$$

where $\tilde F_n$ is the empirical d.f. of the transformed (uniform) data. In differential notation, (8.4.1) reads

$$dM_n = dF_n - \frac{1 - F_{n-}}{1 - F_-}\, dF, \qquad (8.4.2)$$

so that

$$dM_n = d(F_n - F) + \frac{F_{n-} - F_-}{1 - F_-}\, dF.$$
8.5

Mean and variance are two basic concepts in elementary probability theory and statistics. In a dynamic framework we have seen so far that means should be updated at each time $t$ so as to incorporate the currently available information. This led to the Doob-Meyer Decomposition, in which the martingale part could be viewed as a noise process while the mean, i.e., the predictable part, is captured by the compensator. Extending these ideas to the variance, we need to also include expected variations over time.

Definition 8.5.1. Let $S_0, S_1, \ldots, S_k$ be a finite sequence of square-integrable random variables adapted to the filtration $\mathcal{F}_0 = \{\emptyset, \Omega\} \subset \mathcal{F}_1 \subset \ldots \subset \mathcal{F}_k$. Then we call

$$\langle S \rangle_i \equiv \langle S, S \rangle_i = S_0^2 + \sum_{j=1}^i E\big[(\Delta S_j)^2 \mid \mathcal{F}_{j-1}\big], \qquad i = 0, 1, \ldots, k,$$

the predictable quadratic variation of $(S_i)_i$.
Next we compute the predictable quadratic variation of the martingale

$$M_k = 1_{\{X\le t_k\}} - \sum_{i=1}^k 1_{\{X > t_{i-1}\}}\, \frac{F(t_i) - F(t_{i-1})}{1 - F(t_{i-1})}.$$

First,

$$\Delta M_k = M_k - M_{k-1} = 1_{\{t_{k-1} < X \le t_k\}} - 1_{\{X > t_{k-1}\}}\, \frac{F(t_k) - F(t_{k-1})}{1 - F(t_{k-1})},$$

and therefore

$$(\Delta M_k)^2 = 1_{\{t_{k-1} < X \le t_k\}} + 1_{\{X > t_{k-1}\}}\, \frac{[F(t_k) - F(t_{k-1})]^2}{[1 - F(t_{k-1})]^2} - 2\cdot 1_{\{t_{k-1} < X \le t_k\}}\, \frac{F(t_k) - F(t_{k-1})}{1 - F(t_{k-1})}.$$

Conclude that

$$E\big[(\Delta M_k)^2 \mid \mathcal{F}_{k-1}\big] = 1_{\{t_{k-1} < X\}}\, \frac{F(t_k) - F(t_{k-1})}{1 - F(t_{k-1})} + 1_{\{t_{k-1} < X\}}\, \frac{[F(t_k) - F(t_{k-1})]^2}{[1 - F(t_{k-1})]^2} - 2\cdot 1_{\{t_{k-1} < X\}}\, \frac{[F(t_k) - F(t_{k-1})]^2}{[1 - F(t_{k-1})]^2},$$

whence, with $t_0 = -\infty$,

$$\langle M \rangle_n = \sum_{k=1}^n 1_{\{t_{k-1} < X\}}\, \frac{F(t_k) - F(t_{k-1})}{1 - F(t_{k-1})}\, \Big[1 - \frac{F(t_k) - F(t_{k-1})}{1 - F(t_{k-1})}\Big].$$

In continuous time this becomes

$$\langle M \rangle_t = \int_{(-\infty,t]} \frac{1_{\{x \le X\}}\, (1 - F(x))}{[1 - F(x-)]^2}\; F(dx).$$
For sample size $n \ge 1$, the martingale $M_n$ from (8.4.1) has the predictable quadratic variation

$$\langle M_n \rangle_t = \frac{1}{n} \int_{(-\infty,t]} \frac{(1 - F_{n-})(1 - F)}{(1 - F_-)^2}\, dF.$$

In particular,

$$EM_n^2(t) = E\langle M_n \rangle_t = \frac{1}{n}\,\sigma^2(t) = \frac{1}{n}\int_{(-\infty,t]} \frac{1 - F}{1 - F_-}\, dF,$$

in accordance with the covariance formula of Section 8.4.
8.6

By (8.4.2), the empirical d.f. admits the representation

$$F_n(t) - F(t) = M_n(t) - \int_{(-\infty,t]} \frac{F_{n-} - F_-}{1 - F_-}\, dF. \qquad (8.6.1)$$

For $\alpha_n = n^{1/2}(F_n - F)$ and $F = U[0,1]$, the limit process $B^0$ should therefore satisfy, heuristically,

$$dB^0_t = -\frac{B^0_t}{1-t}\, dt + dB.$$

Indeed,

$$B^0(t) = (1-t)\int_{[0,t]} \frac{1}{1-s}\, B(ds), \qquad 0 \le t \le 1,$$

solves this equation.
Going back to finite sample size, the following result therefore is not surprising.

Theorem 8.6.1. Under $F = U[0,1]$,

$$F_n(t) - t = (1-t)\int_{[0,t]} \frac{M_n(ds)}{1-s}. \qquad (8.6.2)$$

Proof. Since both sides are averages over the sample, it suffices to consider $n = 1$. By (8.4.2), the right-hand side equals

$$(1-t)\Big[\int_{[0,t]} \frac{F_1(ds)}{1-s} - \int_{[0,t]} \frac{1 - F_1(s-)}{(1-s)^2}\, ds\Big].$$

If $X_1 \le t$,

$$\int_{[0,t]} \frac{1_{\{X_1 \ge s\}}}{(1-s)^2}\, ds = \frac{1}{1-s}\Big|_0^{X_1} = \frac{1}{1-X_1} - 1 \qquad\text{and}\qquad \int_{[0,t]} \frac{F_1(ds)}{1-s} = \frac{1}{1-X_1},$$

so that the right-hand side becomes

$$(1-t)\Big[\frac{1}{1-X_1} - \frac{1}{1-X_1} + 1\Big] = 1 - t = 1_{\{X_1\le t\}} - t.$$

If $X_1 > t$, we obtain

$$(1-t)\Big[0 - \frac{1}{1-t} + 1\Big] = -t = 1_{\{X_1\le t\}} - t.$$

In other words, under $F = U[0,1]$,

$$\frac{F_n(t) - t}{1-t} = \int_{[0,t]} \frac{M_n(ds)}{1-s}, \qquad 0 \le t < 1.$$
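Identity (8.6.2) is exact for every sample, not just asymptotically, so it can be verified to machine precision. A minimal sketch (sample size, $t$ and seed are illustrative choices) computes $\int_{[0,t]} M_n(ds)/(1-s)$ exactly, using that $1 - F_n(s-)$ is piecewise constant between order statistics:

```python
import numpy as np

rng = np.random.default_rng(8)

def martingale_integral(x, t):
    """Exact value of int_{[0,t]} M_n(ds)/(1-s), where
    M_n(t) = F_n(t) - int_0^t (1 - F_n(s-))/(1-s) ds for a uniform sample x."""
    n = len(x)
    xs = np.sort(x)
    atoms = float(np.sum(1.0 / (n * (1.0 - xs[xs <= t]))))  # jump part of M_n
    knots = np.concatenate(([0.0], xs[xs <= t], [t]))
    drift = 0.0
    for k, (a, b) in enumerate(zip(knots[:-1], knots[1:])):
        surv = 1.0 - k / n                   # 1 - F_n(s-) on (a, b]
        drift += surv * (1.0 / (1.0 - b) - 1.0 / (1.0 - a))
    return atoms - drift

n, t = 50, 0.7
x = rng.uniform(size=n)
lhs = float(np.mean(x <= t)) - t
rhs = (1.0 - t) * martingale_integral(x, t)
print(abs(lhs - rhs) < 1e-9)
```

The two sides agree up to floating-point error, for every realization of the sample.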
8.7

Stochastic Exponentials

Let $A$ and $B$ be two right-continuous functions of bounded variation, with continuous parts $A^c$ and $B^c$, and suppose that $Z$ satisfies

$$Z_t = \int_{(-\infty,t]} \frac{1 - Z(s-)}{1 - B\{s\}}\; [A(ds) - B(ds)].$$

Then

$$1 - Z(t) = \frac{\prod_{s\le t}\big(1 - A\{s\}\big)\, \exp(-A^c(t))}{\prod_{s\le t}\big(1 - B\{s\}\big)\, \exp(-B^c(t))}.$$

Proof. The function $1 - Z_t$ is the exponential of the process $(C_t)_t$ given by

$$dC_t = \frac{dB_t - dA_t}{1 - B\{t\}}.$$
Decompose $B = B^c + B^d$ into its continuous and discrete parts, and similarly for $A$. Over a fine grid $t_1 < \ldots < t_n$ of $(-\infty, t]$, the exponential of $C$ is approximated by

$$\prod_{i=1}^n \Big[1 + \frac{\Delta B_i - \Delta A_i}{1 - B\{t_i\}}\Big] = \exp\Big[\sum_{i=1}^n \ln\Big(1 + \frac{B\{t_i\} - A\{t_i\}}{1 - B\{t_i\}} + \frac{\Delta B^c_i - \Delta A^c_i}{1 - B\{t_i\}} + o(1)\Big)\Big]$$
$$\to \prod_{s\le t}\Big[1 + \frac{B\{s\} - A\{s\}}{1 - B\{s\}}\Big]\, \exp\big[B^c_t - A^c_t\big].$$

Since $1 + (B\{s\} - A\{s\})/(1 - B\{s\}) = (1 - A\{s\})/(1 - B\{s\})$, this is the asserted product.
We now apply this in the uniform case. By Theorem 8.6.1,

$$\frac{F_n(t) - t}{1-t} = \int_{[0,t]} \frac{M_n(ds)}{1-s},$$

so that

$$\gamma_n(t) := \frac{1 - F_n(t)}{1-t}, \qquad 0 \le t < 1,$$

satisfies

$$\gamma_n(t) = 1 - \int_{[0,t]} \frac{M_n(ds)}{1-s} = 1 - \int_{[0,t]} \frac{1 - F_n(s-)}{1-s}\; \frac{M_n(ds)}{1 - F_n(s-)} = 1 - \int_{[0,t]} \gamma_n(s-)\, M_n^0(ds), \qquad (8.7.1)$$

where

$$M_n^0(t) = \int_{[0,t]} \frac{1_{\{s\le X_{n:n}\}}}{1 - F_n(s-)}\; M_n(ds). \qquad (8.7.2)$$

For $t \le X_{n:n}$, (8.7.1) remains the same. For $t > X_{n:n}$ we have $\gamma_n(t) = 0$. Hence the right hand side of (8.7.1) becomes

$$1 - \int_0^{X_{n:n}} \frac{M_n(ds)}{1-s} = 1 - \frac{F_n(X_{n:n}) - X_{n:n}}{1 - X_{n:n}} = 0.$$

In other words, $\gamma_n(t) = (1 - F_n(t))/(1-t)$, $0 \le t < 1$, is the exponential of $-M_n^0$.
By (8.4.2),

$$\frac{dM_n}{1 - F_{n-}} = \frac{dF_n}{1 - F_{n-}} - \frac{dF}{1 - F_-} = d\Lambda_n - d\Lambda,$$

so that (8.7.1) may be rewritten as

$$\gamma_n(t) = 1 - \int_{[0,t]} \gamma_n(s-)\, \big[\Lambda_n(ds) - \Lambda(ds)\big] \qquad (8.7.3)$$

for all $0 \le t < 1$. Obviously, this equation also holds true for $t = 1$. Equation (8.7.3) provides us with a so-called linear upper bound for $1 - F_n$, i.e., a bound of the form $1 - F_n(s) \le C(1-s)$. Such bounds are very useful in proofs if one needs to bound $1 - F_n$ from above.
Chapter 9
Introduction To Survival
Analysis
9.1
[Figure: staggered entries of five patients between the start and the end of the study. Patients 1 and 2 die after times $Y_1$ and $Y_2$; patients 3 and 5 are still under observation at the end of the study ($C_3$, $C_5$); patient 4 is lost to follow-up at $C_4$.]
We see that entries into the study are staggered. Patients 1 and 2 died after staying $Y_1$ resp. $Y_2$ time units in the study, while patients 3 and 5 were still alive at the end. Finally, patient 4 dropped out due to follow-up losses. In the last three cases, rather than $Y_i$, we observe $C_i$, the time spent under observation until the end of the study, follow-up loss, or some other reason. Summarizing, rather than $Y_i$, we observe a variable $Z_i$ which constitutes the minimum of $Y_i$ and a so-called censoring variable $C_i$, together with the information indicating which of $Y_i$ or $C_i$ was actually observed:

$$Z_i = \min(Y_i, C_i), \qquad \delta_i = 1_{\{Y_i \le C_i\}}.$$
To derive such an estimator we shall make heavy use of the approach outlined in the first chapter. For this, recall the cumulative hazard function of $F$, given through

$$d\Lambda = \frac{dF}{1 - F_-}.$$

Then clearly

$$d\Lambda = \frac{(1 - G_-)\, dF}{(1 - G_-)(1 - F_-)} = \frac{(1 - G_-)\, dF}{1 - H_-}, \qquad (9.1.1)$$

where

$$1 - H(z) = (1 - F(z))(1 - G(z)) = P(Y > z)\, P(C > z) = P(Y > z,\, C > z) = P(Z > z).$$

Here the last equality follows from the assumed independence of $Y$ and $C$. Recall that $Z_1, \ldots, Z_n$ are observable. Hence $H$ can be nonparametrically estimated by the empirical d.f. of the $Z_i$'s:

$$H_n(z) = \frac{1}{n}\sum_{i=1}^n 1_{\{Z_i \le z\}}, \qquad z \in \mathbb{R}.$$
The numerator of (9.1.1) is connected with the sub-distribution function

$$H^1(z) := P(Z \le z,\, \delta = 1) = \int_{(-\infty,z]} (1 - G(y-))\, F(dy).$$

Also $H^1$ can be nonparametrically estimated through an empirical sub-distribution:

$$H_n^1(z) = \frac{1}{n}\sum_{i=1}^n 1_{\{Z_i \le z,\, \delta_i = 1\}}.$$

Replacing $H$ and $H^1$ in (9.1.1) with their empirical counterparts yields the estimator

$$\Lambda_n(z) = \int_{(-\infty,z]} \frac{H_n^1(dy)}{1 - H_n(y-)} = \sum_{i:\, Z_i \le z} \frac{\delta_i}{n - R_i + 1},$$

where $R_i$ denotes the rank of $Z_i$ among $Z_1, \ldots, Z_n$. The associated Kaplan-Meier estimator $F_n$ of $F$ is defined through

$$1 - F_n(t) = \prod_{z \le t} \big[1 - \Lambda_n\{z\}\big]. \qquad (9.1.2)$$
In the case of no ties, (9.1.2) becomes

$$1 - F_n(t) = \prod_{Z_i \le t,\, \delta_i = 1} \frac{n - R_i}{n - R_i + 1} = \prod_{Z_i \le t} \Big[1 - \frac{\delta_i}{n - R_i + 1}\Big].$$

In terms of the order statistics $Z_{1:n} \le \ldots \le Z_{n:n}$ and their concomitant labels $\delta_{[i:n]}$,

$$1 - F_n(t) = \prod_{Z_{i:n} \le t} \Big[1 - \frac{\delta_{[i:n]}}{n - i + 1}\Big]. \qquad (9.1.3)$$
When all data are uncensored, all $\delta$'s equal one and the Kaplan-Meier weight $W_{in}$, the mass attached to $Z_{i:n}$, collapses to $1/n$. Hence, when no censorship is present, the Kaplan-Meier estimator becomes the ordinary empirical d.f. In the general case, the weights $W_{in}$ depend on the location of the $\delta$-labels and are therefore random.
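Formula (9.1.3) is directly implementable. A minimal sketch (exponential lifetimes and censoring times, sample size and seed are illustrative choices) assuming no ties among the observations:

```python
import numpy as np

rng = np.random.default_rng(9)

def kaplan_meier(z, delta, t):
    """Kaplan-Meier estimate of F(t) via the product (9.1.3):
    1 - F_n(t) = prod over ordered Z_{i:n} <= t of (1 - delta_[i:n]/(n-i+1)),
    assuming no ties among the observations."""
    n = len(z)
    order = np.argsort(z)
    zs, ds = z[order], delta[order]
    idx = np.arange(n)[zs <= t]           # 0-based positions, so n-i+1 = n-idx
    factors = 1.0 - ds[zs <= t] / (n - idx)
    return float(1.0 - np.prod(factors))

# Censored data: Y ~ Exp(mean 1), C ~ Exp(mean 2), Z = min(Y, C)
n, t = 5000, 1.0
Y = rng.exponential(1.0, size=n)
C = rng.exponential(2.0, size=n)
Z = np.minimum(Y, C)
D = (Y <= C).astype(float)
print(round(kaplan_meier(Z, D, t), 2), round(1.0 - np.exp(-t), 2))
```

With all labels set to one the product telescopes to the empirical d.f., exactly as stated above; with censoring, the estimate is consistent for $F(t) = 1 - e^{-t}$.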
When $F$ has discontinuities, the data may have ties. In this general case the Kaplan-Meier estimator is computed as follows. Since $1 - F_n$ is the exponential of $\Lambda_n$, we have

$$1 - F_n(t) = 1 - \int_{(-\infty,t]} (1 - F_n(s-))\, \Lambda_n(ds). \qquad (9.1.4)$$

Since $H_n^1$ attributes positive mass only to the uncensored data, the same is true for $F_n$. Now, let $t_1 < \ldots < t_k$ be the pairwise distinct values among $Z_1, \ldots, Z_n$. From (9.1.4) we obtain a recursive formula for the masses at $t_i$. For $t_1$ we obtain

$$F_n\{t_1\} = H_n^1\{t_1\} = \frac{k_1}{n},$$

where $k_1$ is the number of $Z$-data with $Z_i = t_1$ and $\delta_i = 1$. Conclude that

$$1 - F_n(t_1) = 1 - F_n\{t_1\} = 1 - \frac{k_1}{n} = \frac{n - k_1}{n} = \frac{n-1}{n}\cdot\frac{n-2}{n-1}\cdots\frac{n - k_1}{n - k_1 + 1} = \prod_{i=1}^{k_1}\Big[1 - \frac{\delta_{[i:n]}}{n - i + 1}\Big].$$

Practically, we have done the same for the $k_1$ smallest uncensored $Z_i$ as for the empirical d.f. when no censorship was present. It could be that also some censored variables take on the value $t_1$. They do not contribute to the last product. If we continue in this way we find out that also in the case of ties the formula

$$1 - F_n(t) = \prod_{Z_{i:n} \le t}\Big[1 - \frac{\delta_{[i:n]}}{n - i + 1}\Big] \qquad (9.1.5)$$

remains valid.
Consequently, Kaplan-Meier integrals are finite sums:

$$\int \varphi\, dF_n = \sum_{j=1}^k W_{jn}\, \varphi(t_j).$$

When no censorship is present, this reduces to

$$\int \varphi\, dF_n = \frac{1}{n}\sum_{i=1}^n \varphi(X_i).$$

By the quantile transformation, Kaplan-Meier integrals may be reduced to integrals w.r.t. the Kaplan-Meier estimator $\tilde F_n$ of uniform data:

$$\int \varphi\, dF_n = \int \varphi(F^{-1})\, d\tilde F_n.$$
Proof. Elementary.

When we apply the previous lemma to the pairs $(Y_i, C_i)$, we obtain a sample $(U_i, \tilde U_i)$ with the same censorship structure as the original sample. In the $U$-sample no ties are present with probability one. For a general integrand $\varphi$, we have

$$\varphi(Y_i) = \varphi(F^{-1}(U_i)), \quad\text{if } \delta_i = \tilde\delta_i = 1.$$

The Kaplan-Meier weights remain unchanged. As a consequence,

$$\int \varphi\, dF_n = \sum_{i=1}^n W_{in}\, \varphi\big(F^{-1}(\tilde U_{i:n})\big).$$
9.2

Recall the relevant distribution functions in the censored data model:

$$H^1(t) = P(Z \le t,\, \delta = 1) = \int_{(-\infty,t]} (1 - G_-)\, dF.$$

Finally, set

$$H^0(t) = P(Z \le t,\, \delta = 0) = \int_{(-\infty,t]} (1 - F)\, dG.$$

Their empirical counterparts are

$$H_n(t) = \frac{1}{n}\sum_{i=1}^n 1_{\{Z_i\le t\}}, \qquad H_n^1(t) = \frac{1}{n}\sum_{i=1}^n 1_{\{Z_i\le t,\,\delta_i=1\}}, \qquad H_n^0(t) = \frac{1}{n}\sum_{i=1}^n 1_{\{Z_i\le t,\,\delta_i=0\}}.$$

In this notation,

$$\Lambda = \int \frac{dH^1}{1 - H_-}, \qquad \Lambda_n = \int \frac{dH_n^1}{1 - H_{n-}}.$$

The natural filtration of the observed data is

$$\mathcal{F}_t = \sigma\big(1_{\{Z\le s\}},\, 1_{\{Z\le s,\,\delta=1\}}:\, s \le t\big).$$
As in Section 8.3, a discrete-time computation on a grid yields conditional increments of the form $1_{\{Z > t_{k-1}\}}\,[H^1(t_k) - H^1(t_{k-1})]/[1 - H(t_{k-1})]$. From this we get the Doob-Meyer decomposition in discrete time and, finally, by going to the limit, in continuous time.

Theorem 9.2.1. Consider the process $S_t = 1_{\{Z\le t,\,\delta=1\}}$ with the adapted filtration $\mathcal{F}_t = \sigma(1_{\{Z\le s\}}, 1_{\{Z\le s,\,\delta=1\}},\, s \le t)$. Then $(S_t)_t$ admits the innovation martingale

$$M_t = 1_{\{Z\le t,\,\delta=1\}} - \int_{(-\infty,t]} \frac{1_{\{x\le Z\}}}{1 - H(x-)}\; H^1(dx) = 1_{\{Z\le t,\,\delta=1\}} - \int_{(-\infty,t]} \frac{1_{\{x\le Z\}}}{1 - F(x-)}\; F(dx).$$
Corollary 9.2.2. The process $H_n^1$ is adapted to $\mathcal{F}_t = \sigma(1_{\{Z_i\le s\}}, 1_{\{Z_i\le s,\,\delta_i=1\}}:\, i = 1, \ldots, n,\ s \le t)$ and has the innovation martingale

$$M_n^1(t) = H_n^1(t) - \int_{(-\infty,t]} \frac{1 - H_n(x-)}{1 - F(x-)}\; F(dx). \qquad (9.2.1)$$

In the same way, $H_n^0$ has the innovation martingale

$$M_n^0(t) = H_n^0(t) - \int_{(-\infty,t]} \frac{1 - H_n(x-)}{1 - H(x-)}\; H^0(dx) = H_n^0(t) - \int_{(-\infty,t]} \frac{1 - H_n(x-)}{1 - H(x-)}\; (1 - F(x))\, G(dx)$$
$$= H_n^0(t) - \int_{(-\infty,t]} \frac{1 - H_n(x-)}{1 - G(x-)}\; G(dx).$$
In differential notation, (9.2.1) reads

$$dM_n^1 = dH_n^1 - \frac{1 - H_{n-}}{1 - F_-}\, dF,$$

whence

$$\frac{dM_n^1}{1 - H_{n-}} = d\Lambda_n - d\Lambda. \qquad (9.2.2)$$

Integration leads to

$$(\Lambda_n - \Lambda)(t) = \int_{(-\infty,t]} \frac{dM_n^1}{1 - H_{n-}} \quad\text{on } t \le Z_{n:n}. \qquad (9.2.3)$$
Next apply the theorem on stochastic exponentials with $A = \Lambda_n$ and $B = \Lambda$: the process

$$Z_t = \int_{(-\infty,t]} \frac{1 - Z(s-)}{1 - \Lambda\{s\}}\; [\Lambda_n(ds) - \Lambda(ds)]$$

satisfies

$$1 - Z(t) = \frac{\prod_{s\le t}(1 - \Lambda_n\{s\})}{\prod_{s\le t}(1 - \Lambda\{s\})\, \exp(-\Lambda^c(t))} = \frac{1 - F_n(t)}{1 - F(t)}.$$

Hence, for $F(t) < 1$,

$$\frac{1 - F_n(t)}{1 - F(t)} = 1 - \int_{(-\infty,t]} \frac{1 - Z(s-)}{1 - \Lambda\{s\}}\; [\Lambda_n(ds) - \Lambda(ds)] = 1 - \int_{(-\infty,t]} \frac{1 - F_n(s-)}{(1 - F(s-))(1 - \Lambda\{s\})}\; [\Lambda_n(ds) - \Lambda(ds)].$$

Since

$$(1 - F(s-))(1 - \Lambda\{s\}) = (1 - F(s-))\Big(1 - \frac{F\{s\}}{1 - F(s-)}\Big) = 1 - F(s-) - F\{s\} = 1 - F(s),$$

we obtain

$$\frac{1 - F_n(t)}{1 - F(t)} = 1 - \int_{(-\infty,t]} \frac{1 - F_n(s-)}{1 - F(s)}\; [\Lambda_n(ds) - \Lambda(ds)]. \qquad (9.2.4)$$
Equations (9.2.2) and (9.2.4) will enable us to further study the Kaplan-Meier estimator. First, from (9.2.4),

$$\frac{F_n(t) - F(t)}{1 - F(t)} = \int_{(-\infty,t]} \frac{1 - F_n(s-)}{1 - F(s)}\; [\Lambda_n(ds) - \Lambda(ds)].$$

Theorem 9.2.4. On $t \le Z_{n:n}$,

$$\frac{F_n(t) - F(t)}{1 - F(t)} = \int_{(-\infty,t]} \frac{1 - F_n(s-)}{(1 - F(s))(1 - H_n(s-))}\; M_n^1(ds). \qquad (9.2.5)$$

Theorem 9.2.4 constitutes an extension of Theorem 8.6.1 to the Kaplan-Meier estimator. Note that in the classical case $1 - F_{n-} = 1 - H_{n-}$, so that the (predictable) random ratio $(1 - F_{n-})/(1 - H_{n-})$ cancels out. Therefore, with no censorship present, the restriction $t \le U_{n:n}$ was not necessary. For the ratio of $1 - F_n$ and $1 - F$ a similar argument yields, for $t \le Z_{n:n}$,

$$\frac{1 - F_n(t)}{1 - F(t)} = 1 - \int_{(-\infty,t]} \frac{1 - F_n(s-)}{(1 - F(s-))\,(1 - H_n(s-))(1 - \Lambda\{s\})}\; M_n^1(ds). \qquad (9.2.6)$$
Theorem 9.2.5. The process

$$\gamma_n(t) = \frac{1 - F_n(t)}{1 - F(t)}$$

equals, on $t \le Z_{n:n}$, the exponential of the process $\tilde M_n$ defined through

$$d\tilde M_n = \frac{dM_n^1}{(1 - H_{n-})(1 - \Lambda\{s\})}.$$

Similarly to Section 8.4, one computes the variance function of the innovation martingale: with

$$\sigma_1^2(s) = E\Big[1_{\{Z\le s,\,\delta=1\}} - \int_{(-\infty,s]} \frac{1_{\{x\le Z\}}}{1 - F(x-)}\, F(dx)\Big]^2$$
$$= H^1(s) - 2E\int_{(-\infty,s]} \frac{1_{\{x\le Z\le s,\,\delta=1\}}}{1 - F(x-)}\, F(dx) + E\Big[\Big(\int_{(-\infty,s]} \frac{1_{\{x\le Z\}}}{1 - F(x-)}\, F(dx)\Big)^2\Big],$$

one obtains, after some computation,

$$\sigma_1^2(s) = H^1(s) - \int_{(-\infty,s]} \frac{H^1\{x\}}{1 - F(x-)}\, F(dx). \qquad (9.2.7)$$

Theorem 9.2.6. The innovation martingale $M_n^1$ has the covariance function

$$\mathrm{Cov}\big(M_n^1(s),\, M_n^1(t)\big) = \frac{1}{n}\,\sigma_1^2(s) \quad\text{for } s \le t,$$

with $\sigma_1^2$ as in (9.2.7).

For a continuous $H^1$, we have $\sigma_1^2 = H^1$. When there is no censorship, (9.2.7) and Lemma 1.4.2 coincide.
To determine the limit distribution of $F_n$, note that since the weights $W_{in}$ are random, the CLT for sums of independent random variables is not directly applicable. In such a situation, a representation in terms of stochastic integrals, see (9.2.5), may be helpful.

In the following we sketch the arguments. Starting with (9.2.5), the so-called Kaplan-Meier process equals

$$\bar\alpha_n(t) = n^{1/2}\big(F_n(t) - F(t)\big) = (1 - F(t))\int_{(-\infty,t]} \frac{1 - F_n(s-)}{(1 - F(s))(1 - H_n(s-))}\; \bar M_n(ds),$$

where $\bar M_n = n^{1/2} M_n^1$.
Replacing the random quantities $1 - F_n(s-)$ and $1 - H_n(s-)$ by their limits, we obtain the approximation

$$\bar\alpha_n(t) \approx (1 - F(t))\int_{(-\infty,t]} \frac{\bar M_n(ds)}{(1 - F(s))(1 - H(s-))} = (1 - F(t))\int_{(-\infty,t]} \frac{\bar M_n(ds)}{(1 - F(s))(1 - F(s-))(1 - G(s-))}.$$

Since $\bar M_n$ converges to a centered Gaussian martingale $\tilde B$ with variance function $\sigma_1^2$, this suggests the limit process

$$\hat B(\cdot) = (1 - F(\cdot))\int_{(-\infty,\cdot]} \frac{\tilde B(ds)}{(1 - F(s))(1 - G(s-))},$$

a centered Gaussian process with covariance, for $s \le t$,

$$\mathrm{Cov}\big(\hat B(s), \hat B(t)\big) = (1 - F(s))(1 - F(t))\int_{(-\infty,s]} \frac{1}{(1 - F)^2(1 - \Lambda\{\cdot\})^2}\,\Big[dH^1 - \frac{H^1\{\cdot\}}{1 - F_-}\, dF\Big]$$
$$= (1 - F(s))(1 - F(t))\int_{(-\infty,s]} \frac{d\Lambda}{1 - H_-}. \qquad (9.2.8)$$

This is the Breslow-Crowley result on the weak convergence of the Kaplan-Meier process. As a consequence,

$$P\Big(\sup_{t\le t_0} |\bar\alpha_n(t)| \le x\Big) \to P\Big(\sup_{t\le t_0} |\hat B(t)| \le x\Big).$$

Proof. This assertion follows from the Breslow-Crowley result upon applying the continuous mapping theorem.
9.3

In this section we demonstrate how the results of the previous section may be applied in a statistical context. Again, assume that we observe $(Z_i, \delta_i)$, $1 \le i \le n$. We want to construct a confidence band for the unknown d.f. $F$, i.e., for each $t$ we aim at constructing an interval which covers $F(t)$ with prescribed probability, uniformly in $t \le t_0$. Put

$$C(t) = \int_{(-\infty,t]} \frac{d\Lambda}{1 - H_-} = \int_{(-\infty,t]} \frac{dH^1}{(1 - H_-)^2}$$

and

$$K(t) = \frac{C(t)}{1 + C(t)}, \qquad\text{i.e.}\qquad C(t) = \frac{K(t)}{1 - K(t)}.$$
Then, by (9.2.8), $\hat B = (1 - F)\, B\circ C$, where $B$ is a standard Brownian Motion. Using the representation $B(u) = (1+u)\, B^0(u/(1+u))$ of $B$ in terms of a Brownian Bridge $B^0$,

$$\hat B = (1 - F)\, B\circ C = \frac{1 - F}{1 - K}\, B^0\circ K.$$

Applying this to $\bar\alpha_n$, we get

$$\frac{1 - K}{1 - F}\, \bar\alpha_n \to B^0\circ K \quad\text{in distribution}.$$

From the Continuous Mapping Theorem,

$$P\Big(\sup_{t\le t_0} \frac{(1 - K(t))\,|\bar\alpha_n(t)|}{1 - F(t)} \le x\Big) \to P\Big(\sup_{t\le t_0} |B^0\circ K(t)| \le x\Big) = P\Big(\sup_{0\le u\le K(t_0)} |B^0(u)| \le x\Big).$$

For selected values of $K(t_0)$ and $x$, the last probabilities are tabulated. Choosing $x$ such that

$$P\Big(\sup_{0\le u\le K(t_0)} |B^0(u)| \le x\Big) = c,$$

one obtains an asymptotic band with coverage probability $c$.
Summarizing, in practice the unknown $K$ is replaced by

$$K_n = \frac{C_n}{1 + C_n} \qquad\text{with}\qquad C_n(t) = \int_{(-\infty,t]} \frac{dH_n^1}{(1 - H_{n-})^2}. \qquad (9.3.1)$$
Alternatively, one may work with the Brownian Motion scale directly:

$$P\Big(\sup_{t\le t_0} \frac{|\bar\alpha_n(t)|}{1 - F(t)} \le x\Big) \to P\Big(\sup_{t\le t_0} |B(C(t))| \le x\Big) = P\Big(\sup_{0\le u\le C(t_0)} |B(u)| \le x\Big) = P\Big(\sup_{0\le u\le 1} |B(uC(t_0))| \le x\Big).$$

By the scaling property of Brownian Motion this yields

$$P\Big(\sup_{t\le t_0} \frac{|\bar\alpha_n(t)|}{1 - F_n(t)} \le x\, C_n^{1/2}(t_0)\Big) \to P\Big(\sup_{0\le u\le 1} |B(u)| \le x\Big).$$

The resulting confidence band is of the form

$$F_n(t) \pm n^{-1/2}\, x\, C_n^{1/2}(t_0)\, (1 - F_n(t)), \qquad t \le t_0.$$
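The empirical variance process $C_n$ of (9.3.1) is a simple sum over the uncensored observations, since $dH_n^1$ places mass $\delta_i/n$ at $Z_i$ and $n(1 - H_n(Z_i-))$ equals the number at risk. A minimal sketch (exponential data, sample size, $t_0$ and seed are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(10)

def cn_process(z, delta, t0):
    """C_n(t0) = integral over (-inf, t0] of dH_n^1 / (1 - H_n(x-))^2:
    each uncensored z_i <= t0 contributes n / (number at risk at z_i)^2."""
    n = len(z)
    order = np.argsort(z)
    zs, ds = z[order], delta[order]
    at_risk = n - np.arange(n)            # n * (1 - H_n(z_i -))
    mask = zs <= t0
    return float(np.sum(ds[mask] * n / at_risk[mask] ** 2))

n, t0 = 4000, 1.0
Y = rng.exponential(1.0, size=n)
C = rng.exponential(2.0, size=n)
Z, D = np.minimum(Y, C), (Y <= C).astype(float)
Cn = cn_process(Z, D, t0)
# here 1 - H(s) = exp(-1.5 s) and dH^1 = exp(-1.5 s) ds, so
# C(t0) = int_0^{t0} exp(1.5 s) ds = (exp(1.5 t0) - 1)/1.5
C_true = (np.exp(1.5 * t0) - 1.0) / 1.5
print(round(Cn, 2), round(float(C_true), 2))
```

Plugging $C_n(t_0)$ into the band $F_n(t) \pm n^{-1/2}\,x\,C_n^{1/2}(t_0)(1 - F_n(t))$ gives the data-driven band.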
9.4

In the two-sample problem we observe two independent censored samples, $(Z_i^1, \delta_i^1)$, $1 \le i \le n_1$, generated from $(F_1, G_1)$, and $(Z_i^2, \delta_i^2)$, $1 \le i \le n_2$, generated from $(F_2, G_2)$, and we wish to test $H_0: F_1 = F_2$. Weighted statistics of the type considered below are based on two measures $\gamma$ and $\delta$ satisfying

$$\int W\, d\gamma = \int W\, d\delta \quad\text{under } H_0.$$

Focus on $\gamma$ and $\delta$ first. Putting

$$d\gamma = \frac{(1 - G_{1-})(1 - G_{2-})(1 - F_{2-})}{1 - \bar H_-}\, dF_1 \qquad\text{and}\qquad d\delta = \frac{(1 - G_{1-})(1 - G_{2-})(1 - F_{1-})}{1 - \bar H_-}\, dF_2,$$

where

$$\bar H = \frac{n_1}{n_1 + n_2}\, H_1 + \frac{n_2}{n_1 + n_2}\, H_2,$$

we get that

$$I = \int W(\bar H(x))\, [\gamma(dx) - \delta(dx)]$$

vanishes under the null hypothesis. Note that in $I$ the weight function becomes $W\circ\bar H$ rather than $W$, with $W$ being defined on the unit interval. The function $\bar H$ may be viewed as the d.f. of the pooled sample $Z_1^1, \ldots, Z_{n_1}^1, Z_1^2, \ldots, Z_{n_2}^2$. The measures $\gamma$ and $\delta$ are closely connected with the cumulative hazard functions. Actually, we have

$$d\gamma = \frac{1 - H_{2-}}{1 - \bar H_-}\, dH_1^1 \qquad\text{and}\qquad d\delta = \frac{1 - H_{1-}}{1 - \bar H_-}\, dH_2^1.$$

The final test statistic is just the empirical analog of $I$. This leads to

$$\hat I = \int W(\hat H(x))\,\Big[\frac{1 - \hat H_2(x-)}{1 - \hat H(x-)}\; \hat H_1^1(dx) - \frac{1 - \hat H_1(x-)}{1 - \hat H(x-)}\; \hat H_2^1(dx)\Big].$$
Here

$$\hat H(x) = \frac{1}{n_1 + n_2}\Big[\sum_{i=1}^{n_1} 1_{\{Z_i^1\le x\}} + \sum_{i=1}^{n_2} 1_{\{Z_i^2\le x\}}\Big].$$

The functions $\hat H_1$ and $\hat H_2$ are the empirical d.f.s of the $Z$-subsamples:

$$\hat H_1(x) = \frac{1}{n_1}\sum_{i=1}^{n_1} 1_{\{Z_i^1\le x\}} \qquad\text{and}\qquad \hat H_2(x) = \frac{1}{n_2}\sum_{i=1}^{n_2} 1_{\{Z_i^2\le x\}}.$$

Conclude that

$$\hat H = \frac{n_1}{n_1 + n_2}\, \hat H_1 + \frac{n_2}{n_1 + n_2}\, \hat H_2,$$

and henceforth

$$1 - \hat H = \frac{n_1}{n_1 + n_2}\,(1 - \hat H_1) + \frac{n_2}{n_1 + n_2}\,(1 - \hat H_2).$$

Set

$$\lambda_1 = \frac{n_1}{n_1 + n_2}, \qquad \lambda_2 = \frac{n_2}{n_1 + n_2}.$$
Then we obtain

$$\hat I = \int W\circ\hat H\ \frac{1}{1 - \hat H_-}\,\Big[(1 - \hat H_2(\cdot-))\, d\hat H_1^1 - (1 - \hat H_1(\cdot-))\, d\hat H_2^1\Big].$$
Denote the data in the pooled $Z$-sample with $T_1 < T_2 < \ldots < T_N$, where $N = n_1 + n_2$. Let $\delta_{(k)}$ be the label of $T_k$. If we introduce

$$\eta_k = \begin{cases} 1 & \text{if } T_k \text{ comes from the first sample} \\ 0 & \text{if } T_k \text{ comes from the second sample}, \end{cases}$$

then we easily express $\hat I$ as a sum:

$$\hat I = \frac{N}{n_1 n_2}\sum_{k=1}^N W\Big(\frac{k}{N}\Big)\, \delta_{(k)}\,\Big[\eta_k - \frac{n_{k1}}{N - k + 1}\Big],$$

with

$$n_{k1} = \sum_{j=1}^N 1_{\{T_j \ge T_k\}}\, \eta_j$$

being the number of data in the pooled sample which exceed (or equal) the $k$-th order statistic and are from the first sample.

Two famous tests:

$W \equiv 1$: Mantel-Haenszel Test.
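With $W \equiv 1$ the sum above is, up to the factor $N/(n_1 n_2)$, the familiar observed-minus-expected (log-rank) quantity: each uncensored pooled time contributes $\eta_k$ minus the fraction of sample-1 subjects still at risk. A minimal sketch (exponential data with a hazard difference; sample sizes and seed are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(11)

def mantel_haenszel_sum(z1, d1, z2, d2):
    """W = 1 version of the sum: over uncensored pooled times T_k,
    add eta_k - n_{k1}/(N - k + 1), with n_{k1} the at-risk count from sample 1."""
    z = np.concatenate([z1, z2])
    d = np.concatenate([d1, d2])
    g = np.concatenate([np.ones(len(z1)), np.zeros(len(z2))])  # sample-1 label
    order = np.argsort(z)
    d, g = d[order], g[order]
    N = len(z)
    at_risk_total = N - np.arange(N)
    at_risk_1 = np.cumsum(g[::-1])[::-1]      # suffix sums: at risk from sample 1
    ev = d == 1
    return float(np.sum(g[ev] - at_risk_1[ev] / at_risk_total[ev]))

n1 = n2 = 400
Y1 = rng.exponential(1.0, size=n1); C1 = rng.exponential(2.0, size=n1)
Y2 = rng.exponential(0.5, size=n2); C2 = rng.exponential(2.0, size=n2)  # higher hazard
stat = mantel_haenszel_sum(np.minimum(Y1, C1), (Y1 <= C1).astype(int),
                           np.minimum(Y2, C2), (Y2 <= C2).astype(int))
print(round(stat, 1))
```

Here sample 1 has the lower hazard, so it experiences fewer deaths than expected under $H_0$ and the statistic is strongly negative; swapping the samples flips its sign exactly.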
9.5

In this section we study likelihood methods. Suppose first that the $Y$'s are i.i.d. with density $f$ and hazard rate

$$\lambda(x) = \frac{f(x)}{1 - F(x)}.$$

Conclude that

$$f(Y) = \lambda(Y)(1 - F(Y)) = \lambda(Y)\exp\Big[-\int_{-\infty}^{Y} \lambda(t)\, dt\Big] = \lambda(Y)\exp\Big[-\int \frac{1_{\{t\le Y\}}}{1 - F(t)}\, F(dt)\Big].$$

Hence the likelihood of the sample equals

$$L_n(\lambda) = \prod_{i=1}^n f(Y_i) = \prod_{i=1}^n \lambda(Y_i)\,\exp\Big[-\int \frac{\sum_{i=1}^n 1_{\{t\le Y_i\}}}{1 - F(t)}\, F(dt)\Big].$$
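For a constant hazard $\lambda(t) \equiv \theta$ (exponential lifetimes) the likelihood above becomes $L_n(\theta) = \prod \theta\, e^{-\theta Y_i}$, maximized at $\theta_n = n/\sum Y_i$. A minimal sketch (the true $\theta$, sample size, grid and seed are illustrative choices) confirms the closed form against a grid search:

```python
import numpy as np

rng = np.random.default_rng(12)

# Constant hazard lambda(t) = theta: log L_n(theta) = n log(theta) - theta * sum(Y)
n, theta_true = 2000, 2.0
Y = rng.exponential(1.0 / theta_true, size=n)

def log_lik(theta):
    return n * np.log(theta) - theta * np.sum(Y)

theta_hat = float(n / np.sum(Y))                 # closed-form MLE
grid = np.linspace(0.5, 5.0, 2001)
theta_grid = float(grid[np.argmax([log_lik(th) for th in grid])])
print(round(theta_hat, 2), round(theta_grid, 2))
```

Both the closed form and the grid maximizer recover the true hazard $\theta = 2$ up to sampling error.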
So far we assumed in this section that the $Y$'s are i.i.d. and observable. As we shall see, in the presence of covariates the assumption that the $Y$'s are identically distributed and come from a homogeneous population very often cannot be justified. In other words, the hazard function may vary with $i$ for $1 \le i \le n$. With the same arguments as before, we obtain for the likelihood-function

$$L_n(\lambda) = \prod_{i=1}^n \lambda_i(Y_i)\,\exp\Big[-\int_{-\infty}^{Y_i} \lambda_i(t)\, dt\Big]. \qquad (9.5.1)$$
In the presence of covariates $X_i$, the full likelihood equals

$$\tilde L_n(\lambda) = \prod_{i=1}^n f(X_i, Y_i),$$

where the joint density factors into the density $f_1$ of $X_i$ and the conditional density of $Y_i$ given $X_i$:

$$\tilde L_n(\lambda) = \prod_{i=1}^n \lambda_i(Y_i)\,\exp\Big[-\int_{-\infty}^{Y_i} \lambda_i(t)\, dt\Big]\, f_1(X_i).$$

If the density $f_1$ does not depend on $\lambda$, then $\tilde L_n$ and $L_n$ coincide up to a factor not depending on $\lambda$. In this case their maximizers coincide. In the general case, $f_1$ will depend on $\lambda$ and the maximizers of $L_n$ and $\tilde L_n$ are distinct. The common name of $L_n$ is also Partial Likelihood.
As another popular model we mention the following example.
Example 9.5.3. Now the individual risk acts as an additive component:

$$\lambda_i(x) = \lambda(x) + \beta_i(x).$$

A possible choice is $\beta_i(x) = x - X_i$ for $x \ge X_i$.
So far our discussion focused on the likelihood approach. Interestingly enough the compensator appeared in the exponent of $L_n$. A direct analysis, which also found a lot of interest, is that of the martingale itself. Our approach yields the compensator

$$\sum_{i=1}^n \int_{(-\infty,t]} 1_{\{x\le Y_i\}}\, \lambda_i(x)\, dx,$$

and hence the martingale

$$M_n(t) := F_n(t) - n^{-1}\sum_{i=1}^n \int_{(-\infty,t]} 1_{\{x\le Y_i\}}\, \lambda_i(x)\, dx. \qquad (9.5.2)$$

Under censorship, use

$$\int_{(-\infty,t]} \frac{1_{\{x\le Z_i\}}}{1 - F_i(x-)}\, F_i(dx) = \int_{(-\infty,t]} 1_{\{x\le Z_i\}}\, \lambda_i(x)\, dx$$

to obtain

$$M_n^1(t) = H_n^1(t) - n^{-1}\sum_{i=1}^n \int_{(-\infty,t]} 1_{\{x\le Z_i\}}\, \lambda_i(x)\, dx. \qquad (9.5.3)$$
Chapter 10

It is worthwhile recalling the scenario which gave rise to data censored from the right. After entry into a study an individual is observed over a period of time until a certain event can be observed. Due to an early end of the study or due to follow-up losses censorship may then occur. To determine the exact value of $Z_i$, however, it is necessary to monitor the history of a patient without gaps. In a real life situation things may differ, however, in that patients appear for a check only at times $T_0 < T_1 < T_2 < \ldots$ and monitoring is only possible then. Such a situation gives rise to the following sampling design.
Example 10.1.1. Let $T_0 < T_1 < \ldots$ be an increasing sequence of monitoring times. Suppose that at time $T_j$ it became known that a default took place at time $R$ between $T_{j-1}$ and $T_j$. The associated indicators

$$\delta_i = \begin{cases} 1 & \text{if } T_{i-1} < R \le T_i \\ 0 & \text{else} \end{cases}$$

together with the monitoring times $T_i$ constitute the available information in the form $(T_k, \delta_k)$, $k = 1, 2, \ldots$ In the above example, $\delta_j = 1$ while all other labels vanish. At time $t$, only those $(T_k, \delta_k)$ with $T_k \le t$ are available.
[Figure: monitoring times $T_0 < T_1 < \ldots < T_5$ on the time axis, with the default time $R$ falling between two consecutive monitoring times.]
Another design of this kind is double censoring: rather than $Y$, one observes

$$Z = \begin{cases} U & \text{if } Y < U \\ Y & \text{if } U \le Y \le V \\ V & \text{if } V < Y \end{cases}$$

together with its label

$$\delta = \begin{cases} 1 & \text{if } Y < U \\ 2 & \text{if } U \le Y \le V \\ 3 & \text{if } V < Y. \end{cases}$$

[Figure: three cases with infection times, incubation periods $Y_1, Y_2, Y_3$ and diagnosis times $w_1, w_2, w_3$.]
Case two has not been diagnosed at time $c$, so it also cannot be included. More formally, put

$$\sigma = \text{Time of Infection}, \qquad w = \text{Time of Diagnosis},$$

and let $Y = w - \sigma$ be the Incubation Period. A case is included if and only if $w \le c$ or, equivalently, iff

$$Y \le U \equiv c - \sigma. \qquad (10.1.1)$$
[Figure: the observable region in the $(\sigma, Y)$-plane determined by (10.1.1).]
10.2

The Single-Event-Process

$$S_t = 1_{\{Y\le t\}}$$

is the simplest example of a counting process

$$N_t = \sum_i 1_{\{T_i\le t\}}.$$

It can be shown that every $(N_t)_t$ admits a compensator and thus an innovation martingale. In the examples discussed so far the compensator admitted an intensity of the form

$$x \mapsto 1_{\{x\le Y\}}\,\lambda(x) \quad\text{resp.}\quad x \mapsto 1_{\{x\le Z\}}\,\lambda(x),$$

where

$$d\Lambda = \frac{dF}{1 - F_-} = \lambda\, dx.$$

In the general case we write $Y(x)$ for the (predictable) at-risk process, so that

$$M_t := N_t - \int_{(-\infty,t]} \lambda(x)\, Y(x)\, dx \qquad (10.2.1)$$

is a martingale.
In the case of the Single-Event process, $Y(x) = 1_{\{x\le Y\}}$ and $\lambda(x)$ is the hazard function of $F$.

In this section we study the problem of estimating the function

$$\Lambda(t) = \int_{(-\infty,t]} \lambda(x)\, dx.$$

First work on this problem was due to Nelson (1969) and Aalen (1975). Therefore the resulting estimator is called the Nelson-Aalen Estimator.

To start with, rewrite (10.2.1) in terms of differentials:

$$dM_t = dN_t - \lambda\, Y\, dt \qquad (10.2.2)$$

or

$$\lambda\, Y\, dt = dN_t - dM_t, \qquad (10.2.3)$$

and therefore, when $Y > 0$,

$$\Lambda(t) = \int_{(-\infty,t]} Y^{-1}(x)\, N(dx) - \int_{(-\infty,t]} Y^{-1}(x)\, M(dx).$$

Dropping the centered martingale part suggests the estimator

$$\hat\Lambda(t) = \int_{(-\infty,t]} Y^{-1}(x)\, N(dx) = \sum_{T_i\le t} Y^{-1}(T_i).$$
If $Y(x)$ may attain the value zero we need to introduce the function $J(t) = 1_{\{Y(t)>0\}}$ and consider the equation

$$J(t)\,\lambda(t)\, dt = \frac{J(t)}{Y(t)}\, dN_t - \frac{J(t)}{Y(t)}\, dM_t.$$

The estimator

$$\hat\Lambda(t) = \int_{(-\infty,t]} \frac{J(x)}{Y(x)}\, N(dx)$$

then targets

$$\tilde\Lambda(t) = \int_{(-\infty,t]} J(x)\,\lambda(x)\, dx.$$

We have

$$E\hat\Lambda(t) - E\tilde\Lambda(t) = 0.$$

Therefore, in general, $\hat\Lambda(t)$ is biased for $\Lambda(t)$. The bias becomes small when $P(Y(x) = 0)$ is small for $x \le t$.
When we have independent replications of (N_t)_t the procedure needs to be
slightly modified. In this case the arguments for (10.2.2) and (10.2.3) now
yield

    [ Σ_{i=1}^n Y_i(t) ] λ(t) dt = d[ Σ_{i=1}^n N_{it} ] − d[ Σ_{i=1}^n M_{it} ]

and finally

    Λ̂(t) = ∫_{(−∞,t]} J(x) / [ Σ_{i=1}^n Y_i(x) ] d[ Σ_{i=1}^n N_{ix} ].

Conclude that, for censored data (Z_i, δ_i),

    Λ̂(t) = ∫_{(−∞,t]} J(x) / [1 − H_n(x−)] H_n^1(dx).

The correction through J(x) is only necessary for x > max(Z_i : 1 ≤ i ≤ n).
But H_n^1 has no mass there, so that after all

    Λ̂(t) = ∫_{(−∞,t]} H_n^1(dx) / [1 − H_n(x−)].
10.3
In this section we apply the concepts developed so far to the testing of
hypotheses. For this, imagine we observe counting processes N1, N2, ..., Nk
with k fixed. Each N_j has a compensator, i.e.,

    M_j(t) = N_j(t) − ∫_{(−∞,t]} λ_j(x) Y_j(x) dx

is a martingale. Denote with

    Λ_j(t) = ∫_{(−∞,t]} λ_j(x) dx

the cumulative hazard function associated with λ_j. One question which has
been studied in detail in the literature is how to test the hypothesis

    H0 : λ1 = ... = λk = λ0

with λ0 unspecified.
In the following we assume that N1, ..., Nk are independent. The first step
in our analysis is to form the Grand Sum Process

    N• = Σ_{j=1}^k N_j.

Similarly,

    Y• = Σ_{j=1}^k Y_j.

Summation of the individual martingales yields

    dM• = dN• − Σ_{j=1}^k λ_j Y_j dt,

which under H0 equals dN• − λ0 Y• dt.
We have

    dN_j = λ_j Y_j dt + dM_j,

respectively

    dN_j / Y_j = λ_j dt + dM_j / Y_j                              (10.3.1)

and, under H0,

    dN• / Y• = λ0 dt + dM• / Y•.                                  (10.3.2)
Under the null hypothesis, the drift parts in (10.3.1) and (10.3.2) coincide,
so that dN_j/Y_j and dN•/Y• only differ in their martingale parts. The
difference between these two processes can be weighted appropriately and
then integrated. We thus come up with the linear statistic

    Z_j(t) = ∫_{(−∞,t]} K_j(x) J_j(x) [ N_j(dx)/Y_j(x) − N•(dx)/Y•(x) ].
A common choice is K_j(x) J_j(x) = K(x) Y_j(x) for some weight function K,
under which

    Z_j(t) = ∫_{(−∞,t]} K(x) N_j(dx) − ∫_{(−∞,t]} K(x) [Y_j(x)/Y•(x)] N•(dx)

           = ∫_{(−∞,t]} K(x) M_j(dx) − Σ_{l=1}^k ∫_{(−∞,t]} K(x) [Y_j(x)/Y•(x)] M_l(dx)

           = Σ_{l=1}^k ∫_{(−∞,t]} K(x) [ δ_{lj} − Y_j(x)/Y•(x) ] M_l(dx),

with δ_{lj} the Kronecker delta. In the censored k-sample setting with
sample sizes n_j and pooled size N, this may be expressed through the
sample d.f.'s:

    Z_j(t) = n_j ∫_{(−∞,t]} K(x) H^1_{n_j}(dx)
             − Σ_{l=1}^k (n_j n_l / N) ∫_{(−∞,t]} K(x)
               [ (1 − H_{n_j}(x−)) / (1 − H̄(x−)) ] H^1_{n_l}(dx),

where H_{n_j} is the empirical d.f. of the j-th sample, H^1_{n_l} the
sub-d.f. of its uncensored observations, and H̄ the pooled empirical d.f.
For the moment computations under H0, note that dH_l^1 = (1 − H_l−) dΛ0
and hence

    Σ_{l=1}^k (n_l/N) dH_l^1 = (1 − H̄−) dΛ0.

Conclude

    Σ_{l=1}^k (n_l/N) ∫_{(−∞,t]} (1 − H_j−) dH_l^1
        = ∫_{(−∞,t]} (1 − H_j−)(1 − H̄−) dΛ0
        = ∫_{(−∞,t]} (1 − H̄−) dH_j^1.
10.4

For uncensored data Y1, ..., Yn with hazard function λ, the likelihood may
be written as

    L_n(λ) = Π_{i=1}^n λ(Y_i) exp( −∫_{(−∞,Y_i]} λ(t) dt ).

Taking logarithms,

    ln L_n(λ) = Σ_{i=1}^n [ ln λ(Y_i) − ∫ 1{t ≤ Y_i} λ(t) dt ].   (10.4.1)

For censored data (Z_i, δ_i) this becomes

    ln L_n(λ) = Σ_{i=1}^n [ δ_i ln λ(Z_i) − ∫_{(−∞,Z_i]} λ(t) dt ].
Of more interest is the case when λ depends on covariates. The best studied
case is

    λ(x) → λ_i(x) = λ(x) exp[β^t X_i].                            (10.4.3)

Since

    λ_i(x) / λ_j(x) = exp[ β^t (X_i − X_j) ]

does not depend on x, this model is called the Cox Proportional Hazards
Model. The partial Log-Likelihood Function becomes

    ln L_n(β) = Σ_{i=1}^n [ δ_i ln λ(Z_i) + δ_i β^t X_i
                − exp(β^t X_i) ∫_{(−∞,Z_i]} λ(t) dt ].            (10.4.4)

If the baseline function λ does not depend on a finite-dimensional
parameter but may be any function, we arrive at a semiparametric model. In
this case maximization takes place w.r.t. β and λ.

The following presents a detailed analysis of the Cox model. For a
reference, see Tsiatis (1981). Assume that

    λ_i(x) ≡ λ(x) exp[β^t X_i]

and that

    Y_i and C_i are independent conditionally on X_i.             (10.4.5)
Because of the conditional independence,

    H^1(y|x) = E[ 1{Z ≤ y, Y ≤ C} | X = x ]
             = ∫_{(−∞,y]} [1 − G(z−|x)] F(dz|x)
             = exp(β^t x) ∫_{(−∞,y]} [1 − H(z−|x)] λ(z) dz.       (10.4.6)

Integrating w.r.t. the density c of X,

    H^1(y) = ∫ H^1(y|x) c(x) dx
           = ∫ ∫_{(−∞,y]} exp(β^t x) [1 − H(z−|x)] λ(z) c(x) dz dx.   (10.4.7)

For a given function φ, set

    E_φ(y) = E[ φ(X) 1{Z ≥ y} ] = ∫ φ(x) [1 − H(y−|x)] c(x) dx.

Finally, put

    E_φ^1(y) = E[ φ(X) 1{Z ≤ y, δ = 1} ] = E[ φ(X) H^1(y|X) ]
             = ∫ φ(x) H^1(y|x) c(x) dx.
By (10.4.6),

    E_φ^1(y) = ∫_{(−∞,y]} E_{φ φ0}(z) λ(z) dz.

Proof. As before.

If we take for φ the function

    φ0(x) = exp(β^t x),

we get two important relations.

Corollary 10.4.3.

    λ(y) = [H^1]'(y) / E_{φ0}(y)                                  (10.4.8)

and

    Λ(y) = ∫_{(−∞,y]} H^1(dz) / E_{φ0}(z),                        (10.4.9)

which is an easy consequence of Lemma 10.4.1, Lemma 10.4.2 and the
definitions of the involved functions.
It's time to go back to the partial log-likelihood function (10.4.4). Write
X_i = (X_{i1}, ..., X_{ip})^t.
If we take the partial derivative of (10.4.4) w.r.t. β_j, 1 ≤ j ≤ p, we
obtain

    ∂ ln L_n / ∂β_j = Σ_{i=1}^n [ δ_i X_{ij} − X_{ij} exp(β^t X_i)
                      ∫_{(−∞,Z_i]} λ(t) dt ].                     (10.4.10)

With φ(x) = x_j this may be written as

    ∂ ln L_n / ∂β_j = Σ_{i=1}^n [ δ_i φ(X_i) − φ(X_i) exp(β^t X_i)
                      ∫_{(−∞,Z_i]} λ(t) dt ].                     (10.4.11)
The right-hand side also makes sense for β's other than the true parameter.
Therefore, from now on, we denote the true parameter with β0. From (10.4.8)
and (10.4.9) we get, by Fubini,

    E[ φ(X_i) φ0(X_i) ∫_{(−∞,Z_i]} λ(t) dt ]
        = ∫ λ(t) E[ φ(X_i) φ0(X_i) 1{t ≤ Z_i} ] dt
        = ∫ λ(t) E_{φ φ0}(t) dt = E_φ^1(∞)
        = E[ φ(X) 1{δ = 1} ].

Conclude

Lemma 10.4.4. At β = β0 we have

    E[ ∂ ln L_n / ∂β_j ] = 0  for 1 ≤ j ≤ p.

By (10.4.9), the inner integral in (10.4.11) may also be written as

    ∫_{(−∞,Z_i]} λ(t) dt = ∫_{(−∞,Z_i]} H^1(dt) / E_{φ0}(t).
Replace the unknown quantities with their empirical counterparts:

    H_n^1(y) = n^{-1} Σ_{i=1}^n δ_i 1{Z_i ≤ y}

and

    Ê_{φ0}(y) = n^{-1} Σ_{i=1}^n exp(β^t X_i) 1{Z_i ≥ y}.

Plugging these into (10.4.11), the resulting score function becomes

    l_j(β) = Σ_{i=1}^n δ_i X_{ij}
             − Σ_{i=1}^n X_{ij} exp(β^t X_i) Σ_{k=1}^n δ_k 1{Z_k ≤ Z_i} / [n Ê_{φ0}(Z_k)]

           = Σ_{i=1}^n δ_i X_{ij}
             − Σ_{k=1}^n δ_k [ Σ_{i=1}^n X_{ij} exp(β^t X_i) 1{Z_i ≥ Z_k} ]
               / [ Σ_{i=1}^n exp(β^t X_i) 1{Z_i ≥ Z_k} ].

We need to solve

    l_j(β) = 0  for 1 ≤ j ≤ p.

Denote with β̂ = (β̂1, ..., β̂p)^t its solution.
The rest of this section is a little technical. The lemmas will be needed
to prove consistency of β̂.
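To illustrate the score equation numerically, here is a minimal sketch (not
from the text) for a single covariate (p = 1). The function names are my
own; distinct event times are assumed. Since the partial log-likelihood is
concave, the score is decreasing in β, so bisection between a bracketing
pair solves l(β) = 0.

```python
import numpy as np

def cox_score(beta, z, delta, x):
    """Partial-likelihood score l(beta) for one covariate."""
    z, delta, x = map(np.asarray, (z, delta, x))
    w = np.exp(beta * x)
    s = np.sum(delta * x)
    for k in np.where(delta == 1)[0]:
        risk = z >= z[k]                       # risk set at Z_k
        s -= np.sum(x[risk] * w[risk]) / np.sum(w[risk])
    return s

def cox_fit(z, delta, x, lo=-10.0, hi=10.0):
    """Solve l(beta) = 0 by bisection; l is decreasing in beta.

    Assumes the root is bracketed by [lo, hi], which holds for
    typical data sets with non-degenerate covariates."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if cox_score(mid, z, delta, x) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

In practice one would use Newton-Raphson on the full p-dimensional score;
the bisection is only meant to make the structure of l_j(β) transparent.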
Lemma 10.4.5. With probability one, we have

    sup_{y∈R} |H_n^1(y) − H^1(y)| → 0,                            (10.4.12)

    sup_{y∈R} |Ê_φ(y) − E_φ(y)| → 0                               (10.4.13)

and

    sup_{y∈R} |Ê_φ^1(y) − E_φ^1(y)| → 0.                          (10.4.14)

Here Ê_φ and Ê_φ^1 are the estimators of E_φ and E_φ^1, respectively.

Proof. Follows from the Strong Law of Large Numbers and its extension to
uniform convergence.

Lemma 10.4.6. For each β, with probability one,

    lim_{n→∞} n^{-1} l_j(β) = E_{φ_j}^1(∞) − ∫ [ E_{φ_j φ}(y) / E_φ(y) ] H^1(dy),

where φ(x) = exp(β^t x) and φ_j(x) = x_j.

Recall that the right-hand side in Lemma 10.4.6 equals zero for β = β0.
Theorem 10.4.7. (Tsiatis). With probability one,

    lim_{n→∞} β̂ = β0.

The cumulative baseline hazard Λ may now be estimated by

    Λ̂(t) = ∫_{(−∞,t]} dH_n^1 / Ê_{φ0}
         = Σ_{i=1}^n δ_i 1{Z_i ≤ t} / [ Σ_{k=1}^n exp(β̂^t X_k) 1{Z_k ≥ Z_i} ],

where Ê_{φ0} is evaluated at β̂. Again, with probability one,

    lim_{n→∞} Λ̂(t) = Λ(t).

Finally, the conditional survival function satisfies

    1 − F(t|x) = exp( −∫_{(−∞,t]} exp(β^t x) λ(s) ds )
               = exp( −exp(β^t x) ∫_{(−∞,t]} λ(s) ds )

and may be estimated upon replacing β and Λ with their estimators.
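The baseline estimator Λ̂ just displayed (often called the Breslow
estimator) can be sketched as follows; the function name is my own, and a
single covariate is used for simplicity. With β = 0 it reduces to the
Nelson-Aalen estimator of Section 10.2.

```python
import numpy as np

def breslow(t, beta, z, delta, x):
    """Cumulative baseline hazard estimate at time t.

    Each uncensored Z_i <= t contributes 1 over the sum of
    exp(beta * X_k) across the subjects still at risk at Z_i.
    """
    z, delta, x = map(np.asarray, (z, delta, x))
    w = np.exp(beta * x)
    total = 0.0
    for i in np.where((delta == 1) & (z <= t))[0]:
        total += 1.0 / np.sum(w[z >= z[i]])
    return total
```

For β = 0 and three uncensored times 1 < 2 < 3, the value at t = 3 is
1/3 + 1/2 + 1 = 11/6, as for Nelson-Aalen.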
10.5
Right-Truncation

Under right-truncation, a pair (Y, Z) is only observable conditionally on
the event {Y ≤ Z}.                                                (10.5.1)

The number of observable data thus equals

    Σ_{i=1}^n 1{Y_i ≤ Z_i}.                                       (10.5.2)
By induction one arrives at a product representation (10.5.3) involving the
probabilities P(t_{j+1} ≤ Y ≤ Z ≤ t_j, Y < t_j), 0 ≤ j ≤ k − 1.
Hence the sum in (10.5.4) is the F-integral of a certain step function
which, as the grid gets finer and finer, tends to the limit

    ∫_{[t,∞)} 1{Y ≤ u < Z} F(du) / F(u).

When F and G have no atoms in common, then (10.5.4) tends to zero. Without
further mentioning we assume that this holds.

Lemma 10.5.1. The process

    t → 1{t ≤ Y ≤ Z}

has the innovation martingale (in reverse time)

    M_t = 1{t ≤ Y ≤ Z} − ∫_{[t,∞)} 1{Y ≤ u < Z} F(du) / F(u).
In terms of the observed data, set

    H_n^1(t) = n^{-1} Σ_{i=1}^n 1{t ≤ Y_i ≤ Z_i}

and

    M_n(t) = H_n^1(t) − ∫_{[t,∞)} [ C_n(u+) / F(u) ] F(du),

where

    C_n(x) = n^{-1} Σ_{i=1}^n 1{Y_i ≤ x ≤ Z_i}.

Denote with H_n the bivariate empirical d.f. of the (Y_i, Z_i),

    H_n(y, z) = n^{-1} Σ_{i=1}^n 1{Y_i ≤ y, Z_i ≤ z},

which estimates

    H(y, z) = P(Y ≤ y, Z ≤ z | Y ≤ Z)
            = α^{-1} ∫∫ 1{u ≤ y, v ≤ z, u ≤ v} F(du) G(dv),

with α = P(Y ≤ Z). Putting z = ∞, we get

    F*(y) = α^{-1} ∫_{(−∞,y]} [1 − G(u−)] F(du).                  (10.5.5)
Thus the unknown α is contained in both C and F*. There is some hope that,
by taking appropriate ratios, we may arrive at estimable quantities in
which α has cancelled out. Now, rewrite (10.5.5) to get

    α dF* = (1 − G−) dF

and therefore

    dF = α dF* / (1 − G−).

Introduce

    C(x) = P(Y ≤ x ≤ Z | Y ≤ Z) = α^{-1} F(x) [1 − G(x−)].

Then

    dF / F = dF* / C.                                             (10.5.6)
Equation (10.5.6) is the required identifiability equation. Both terms on
the right-hand side are estimable. For example, if F has a density f, then

    ∫_{[t,∞)} dF / F = ∫_t^∞ [ f(x) / F(x) ] dx = −ln F(t)

and therefore

    F(t) = exp( −∫_{[t,∞)} dF / F ).                              (10.5.7)
We need (10.5.7) for the general case, i.e., also when F has atoms. The
product integration formula for survival functions is obtained by
multiplying the variable of interest with minus one. So, write

    F(t) = P(Y ≤ t) = P(−Y ≥ −t) = P(X ≥ s),

where X = −Y and s = −t. Furthermore,

    P(X ≥ s) = lim_{u↑s} P(X > u).

Let, for a moment, G denote the d.f. of X, and let Λ denote the pertaining
cumulative hazard function, i.e.

    Λ(x) = ∫_{(−∞,x]} dG / (1 − G−).

Then

    1 − G(u−) = P(X ≥ u) = P(Y ≤ −u) = F(−u)

and therefore

    Λ(x) = ∫_{(−∞,x]} G(du) / F(−u)
         = E[ 1{X ≤ x} / F(−X) ] = E[ 1{Y ≥ −x} / F(Y) ].

In terms of the original Y this yields the reverse-time hazard function

    Λ̃(x) = ∫_{[x,∞)} F(dy) / F(y).                                (10.5.8)

In particular, Λ̃{x} = F{x} / F(x).
Conclude that, by the product integration formula applied in reverse time,

    F(t) = exp[ −Λ̃c(t) ] Π_{y > t} [ 1 − Λ̃{y} ],

with Λ̃c the continuous part of

    Λ̃(t) = ∫_{[t,∞)} F(dy) / F(y).                                (10.5.9)
By (10.5.6),

    Λ̃(t) = ∫_{[t,∞)} F(dy) / F(y) = ∫_{[t,∞)} F*(dy) / C(y).

Both F* and C are estimable from the truncated data:

    F_n*(y) = n^{-1} Σ_{i=1}^n 1{Y_i ≤ y}

and

    C_n(y) = n^{-1} Σ_{i=1}^n 1{Y_i ≤ y ≤ Z_i}.

For Λ̃_n we set

    Λ̃_n(t) = ∫_{[t,∞)} F_n*(dy) / C_n(y)
           = Σ_{i=1}^n 1{t ≤ Y_i} / Σ_{j=1}^n 1{Y_j ≤ Y_i ≤ Z_j},

leading to the estimator

    F_n(t) = Π_{y > t} [ 1 − Λ̃_n{y} ].

If the Y_i are pairwise distinct, this equals

    F_n(t) = Π_{Y_i > t} [ 1 − 1/(n C_n(Y_i)) ].                  (10.5.10)
Integration leads to

    ∫_{(t,∞)} F(y−) Λ̃(dy) = 1 − F(t)                              (10.5.11)

and, on the empirical level,

    ∫_{(t,∞)} F_n(y−) Λ̃_n(dy) = 1 − F_n(t).                       (10.5.12)

Similarly, the empirical analog of α dF* = (1 − G−) dF reads

    α_n dF_n* = (1 − G_n−) dF_n.
Chapter 11

11.1
Introduction

In the multivariate case, rather than the lifetimes X_i themselves, only

    Z_i = min(X_i, Y_i),   1 ≤ i ≤ m,

and

    δ_i = 1{X_i ≤ Y_i},

the censoring status or label of the datum, are available. Given a number
of independent replications of

    Z = (Z_1, ..., Z_m)  and  δ = (δ_1, ..., δ_m),

it is then the goal of statistical inference to estimate the unknown
distribution function (d.f.) F.

While for m = 1, i.e., for real-valued survival data, the analysis already
started, in a nonparametric framework, with the landmark paper of Kaplan
and Meier (1958), the multivariate case has only been studied since the mid
1980s. An early reference is Dabrowska (1988), whose estimator, however,
was not a bona fide estimator but could give negative (empirical)
probability masses to certain events. Van der Laan (1995) derived implicit
estimators which were asymptotically efficient under strong assumptions
like complete availability of the censoring variables. Gill et al. (1995)
showed that the estimators obtained until then were efficient under
complete independence and continuity but inefficient otherwise. As a
conclusion, the problem of finding efficient bona fide estimators for F
under censorship in a general scenario is still open when m ≥ 2. It is the
goal of this chapter to provide such an estimator, i.e., to extend the
Kaplan-Meier (KM) estimator to the multivariate case. The outline of this
chapter is as follows. In the rest of this section we discuss some
identifiability issues. In Section 11.2 we further motivate our approach,
while in Section 11.3 we derive the multivariate KM-estimator. Section 11.4
provides its efficiency, while in Section 11.5 we report on some simulation
results.
In the rest of this section we first show why the multivariate case is so
much different from the univariate situation. For this, we remind the
reader how the univariate KM-estimator is usually derived. When m = 1, the
survival function

    F̄(x) = P(X ≥ x)

satisfies the trivial equation

    F̄(x) = 1 − F(x−),

where

    F(x−) = lim_{y↑x} F(y).

With H the d.f. of Z and H^1 the sub-d.f. of the uncensored data, we have

    dH^1 = Ḡ dF   and   1 − H = (1 − F)(1 − G),                   (11.1.1)

where Ḡ(y) = P(Y ≥ y). Moreover,

    F̄(x) = 1 − F(x−) = 1 − ∫_{[0,x)} [ F̄(y) / F̄(y) ] F(dy)
          = 1 − ∫_{[0,x)} F̄(y) Λ(dy),                             (11.1.2)

where dΛ = dF/F̄ is the hazard measure associated with F. In view of
(11.1.1), (11.1.2) becomes

    F̄(x) = 1 − ∫_{[0,x)} F̄(y) [ H^1(dy) / H̄(y) ].                 (11.1.3)
Identifiability requires Ḡ(y) > 0 for all y < x. Actually, suppose that
Ḡ(y) = 0 for all y in a left neighborhood (x − ε, x) of x but dF > 0 there.
This implies that with probability one all Y's are less than or equal to
x − ε. Hence possible X's which exceed x − ε are necessarily censored. As a
conclusion, F cannot be fully reconstructed from a set of censored data.
Stute and Wang (1993) contains a detailed discussion of this issue. To
avoid this here we assume throughout, without further mentioning, that the
support of F is included in that of G.

There are more important facts about (11.1.2) and (11.1.3). First, equation
(11.1.2) is an inhomogeneous Volterra integral equation such that, given

    dΛ = dH^1 / H̄,

its solution admits the product representation

    F̄(x) = Π_{0 ≤ t < x} [ 1 − Λ{t} ]                             (11.1.4)

in the discrete case, and a product integral in general.
Consider, in the bivariate case, survival functions of product form

    F̄_a(x1, x2) = exp( −∫_0^{x1} a_1(t1) dt1 ) exp( −∫_0^{x2} a_2(t2) dt2 ).
In the bivariate censored case we have

    dH^11 = Ḡ dF   and   Ḡ F̄ = H̄.

Conclude that

    F̄(x1, x2) = ∫∫_{[x1,∞)×[x2,∞)} F̄(t1, t2) [ H^11(dt1, dt2) / H̄(t1, t2) ].   (11.1.5)
256
Denition 11.1.2. For any bivariate vector X and 0 < < 1 let
{
X
with probability 1
X =
,
x with probability
where x is a vector, possibly (, ), which exceeds all (x1 , x2 ) (componentwise) in the support of F .
Note that in the discussion below as well as in the estimation process the
choice of x is not important. It only serves as an intellectual tool for
introducing a defective distribution on R2 with survival function
{
(1 )F (x1 , x2 ) + if (x1 , x2 ) in R2
.
F (x1 , x2 ) =
if x = (x1 , x2 ) = x
Of course
F (x) = lim F (x) uniformly in x = (x1 , x2 ) R2 .
0
(11.1.7)
(11.1.10)
In the following section we shall show that for a given > 0, F is the only
solution of (11.1.9) satisfying q (0, 0) = 1 and q (x ) = . In other words,
for defective distributions, the survival function is reconstructable from its
hazard function . Also we need to nd a one-to-one relation between F
and serving the same purpose as (11.1.4) in the univaritate case. Finally,
recall (11.1.7) to get the original target F back from the F s.
11.2

Iterating the hazard equation leads to the Neumann series representation

    qε(x) = ε [ 1 + Σ_{i=1}^∞ ∫⋯∫ 1{t_i ≥ t_{i−1} ≥ ... ≥ t_1 ≥ x, t_i ≠ x∞}
                P(dt_i) ⋯ P(dt_1) ],

that is,

    qε(x) = [ 1 + Σ_{i=1}^∞ ∫⋯∫ 1{t_i ≥ ... ≥ t_1 ≥ x, t_i ≠ x∞} P(dt_i) ⋯ P(dt_1) ]
            / [ 1 + Σ_{i=1}^∞ ∫⋯∫ 1{t_i ≥ ... ≥ t_1 ≥ 0, t_i ≠ x∞} P(dt_i) ⋯ P(dt_1) ].   (11.2.2)
Proof. It is easy to check that the qε from (11.2.2) is nonnegative and
satisfies qε(0, 0) = 1. Also, the series in the numerator vanishes at
x = x∞, so that qε(x∞) = ε. We shall see from the proof below that both
series converge. Actually, the positive ε has been introduced only for one
reason, namely to enforce convergence in the Neumann representation.
Finally, check that (11.2.2) is indeed a solution of (11.2.1). Hence it
remains to prove that qε from (11.2.2) is the only solution of (11.2.1).
So, let q be any other nonnegative solution of (11.2.1) satisfying the two
constraints q(0, 0) = 1 and q(x∞) = ε. By (11.2.1), it is necessarily
non-increasing in x and bounded from above by 1. If we iterate (11.2.1)
once, we obtain

    q(x) = q(x∞)
           + ∫∫ 1{t_2 ≥ t_1 ≥ x, t_1 ≠ x∞} q(t_2) P(dt_2) P(dt_1)
           + ∫∫ 1{t_2 ≥ t_1 ≥ x, t_2 = x∞} q(t_2) P(dt_2) P(dt_1).
Iterating (11.2.1) k − 1 times yields

    q(x) = q(x∞) [ 1 + Σ_{i=1}^{k−1} ∫⋯∫ 1{t_i ≥ ... ≥ t_1 ≥ x, t_i ≠ x∞}
                   P(dt_i) ⋯ P(dt_1) ] + c_k
         =: ε [ 1 + Σ_{i=1}^{k−1} b_i ] + c_k.

Since q is bounded by one,

    ε [ 1 + Σ_{i=1}^{k−1} b_i ] ≤ q(x),

so that the series Σ b_i converges. For the remainder term,

    0 ≤ c_k ≤ q(0) ∫⋯∫ 1{t_k ≥ t_{k−1} ≥ ... ≥ x, t_{k−1} ≠ x∞} P(dt_k) ⋯ P(dt_1)
            ≤ q(0) b_{k−1} [ P((R+)²) + 1 ].

Therefore c_k → 0 as k → ∞, and thus

    q(x) = ε [ 1 + Σ_{i=1}^∞ b_i ].

Setting x = (0, 0), we obtain

    1 = ε [ 1 + Σ_{i=1}^∞ ∫⋯∫ 1{t_i ≥ ... ≥ t_1 ≥ 0, t_i ≠ x∞} P(dt_i) ⋯ P(dt_1) ],

which identifies ε and proves (11.2.2).
A relevant choice of P is

    dP0 = (1 − ε) dH^11 / [ (1 − ε) H̄ + ε ]  on R+ × R+.

Since

    Ḡ F̄ = H̄   and   dH^11 = Ḡ dF,

we have

    dP0 = (1 − ε) Ḡ dF / [ (1 − ε) Ḡ F̄ + ε ]
        ≤ (1 − ε) dF / [ (1 − ε) F̄ + ε ] = dΛε,

so that the monotonicity of (11.2.2) in P yields

    q0(x) ≤ qε(x)  for all x.
To obtain a lower bound we need to strengthen the identifiability condition
(11.1.7) a little bit, namely to

    supp F ⊂ supp G  strictly.

In technical terms this means that for each x in the support of F we have

    Ḡ(x) ≥ γ > 0                                                  (11.2.4)

for an appropriate (small) 0 < γ < 1. This means, for a vector
X = (X1, X2) taking its value in the extreme north-eastern corner, that
there is a little though positive chance of not being censored. Under
(11.2.4), we have, for 0 < ε < γ < 1,

    dP0 ≥ (1 − ε) dF / [ (1 − ε) F̄ + ε/Ḡ ]
        ≥ (1 − ε) dF / [ (1 − ε) F̄ + ε/γ ]
        ≥ (1 − ε/γ) dF / [ (1 − ε/γ) F̄ + ε/γ ].                   (11.2.5)
11.3

In this section we derive and study the multivariate extension of the
Kaplan-Meier estimator. Again, we restrict ourselves to dimension two. We
start with equation (11.1.5) and replace the unknown H^11 and H̄ by their
empirical counterparts

    H_n^11(t1, t2) = n^{-1} Σ_{j=1}^n δ_{1j} δ_{2j} 1{Z_{1j} ≤ t1, Z_{2j} ≤ t2},

    H̄_n(t1, t2) = n^{-1} Σ_{j=1}^n 1{Z_{1j} ≥ t1, Z_{2j} ≥ t2}.

The empirical analog of the hazard equation then reads

    qε(x1, x2) = 1/(n+1) + ∫∫_{[x1,∞)×[x2,∞)} qε(t1, t2)
                 H_n^11(dt1, dt2) / (H̄_n + 1/n)(t1, t2).          (11.3.1)
Denote with

    F̄_n = qε(P_n0)

the unique nonnegative solution of (11.3.1) satisfying qε(0, 0) = 1, where
dP_n0 = dH_n^11 / (H̄_n + 1/n). F̄_n will be the bivariate analog of the
univariate Kaplan-Meier estimator. Extensions to higher dimension are
straightforward at the expense of some more notation. As to computational
aspects, one possibility would be the Neumann representation (11.2.2) with
dP = dP_n0. We prefer to solve (11.3.1) directly. Forgetting the leading
summand 1/(n+1) for a moment, equation (11.3.1) says that F̄_n is the
eigenfunction (with eigenvalue one) of an appropriate (empirical) operator
equation.

To begin with the computation, since H_n^11 is a discrete measure supported
by certain (Z_{1i}, Z_{2i}), 1 ≤ i ≤ n, so is F̄_n. Let p_i denote the mass
given to (Z_{1i}, Z_{2i}) under F̄_n. These are the quantities we are
interested in because they uniquely determine F̄_n. Conclude that

    F̄_n(Z_{1i}, Z_{2i}) = Σ_{j=1}^{n+1} a_{ij} p_j + 1/(n+1),
where a_{ij} = 1{Z_{1j} ≥ Z_{1i}, Z_{2j} ≥ Z_{2i}} and

    b_i = δ_{1i} δ_{2i} / [ n H̄_n(Z_{1i}, Z_{2i}) + 1 ].

Equation (11.3.1) then becomes

    Σ_{j=1}^{n+1} a_{ij} p_j = Σ_{k=1}^{n+1} a_{ik} b_k Σ_{l=1}^{n+1} a_{kl} p_l
                             = Σ_{l=1}^{n+1} [ Σ_{k=1}^{n+1} a_{ik} b_k a_{kl} ] p_l.   (11.3.2)
After ordering the data according to the first components, we have

    a_{ij} = 0        if j < i
           = 1        if j = i                                    (11.3.3)
           = 1 or 0   if j > i,

assuming for the time being that no ties are present. An ordering of the
second components would lead to the same numerical results. Now, under
(11.3.3), the matrix A becomes an invertible upper triangular matrix, and
equation (11.3.2) boils down to

    p = BAp.                                                      (11.3.4)
Of course, now p_j is the mass given to (Z_{1j:n}, Z_{[2j:n]}), the second
Z being the concomitant of the first. Furthermore, upon setting
δ_j = δ_{1j} δ_{2j}, we have

    BA = (( b_i a_{ij} ))_{1 ≤ i,j ≤ n+1}

with

    b_i a_{ij} = δ_i a_{ij} / Σ_{k=i}^{n+1} a_{ik},   1 ≤ i ≤ n + 1.

The p_i are normalized through

    Σ_{i=1}^{n+1} p_i = 1.

Componentwise, (11.3.4) reads

    p_i = Σ_{j=1}^{n+1} b_i a_{ij} p_j = b_i p_i + b_i Σ_{j=i+1}^{n+1} a_{ij} p_j,

i.e.,

    p_i = c_i Σ_{j=i+1}^{n+1} a_{ij} p_j,                         (11.3.5)
with

    c_i = b_i / (1 − b_i),   1 ≤ i ≤ n,

being well-defined because of b_i < 1. Since all b_i and a_{ij} are
nonnegative, so are the p_i's. Furthermore, by backward recursion based on
(11.3.5), all p_i's are unique up to a common factor. Hence, by dividing
each p_i through Σ_{j=1}^{n+1} p_j, we obtain the admissible solution.
Solving (11.3.5) backwards, we obtain

    p_i = c_i p_{n+1} [ 1 + Σ_{j=i+1}^{n+1} a_{ij} c_j
          + Σ_{j=i+1}^{n+1} Σ_{k=j+1}^{n+1} a_{ij} a_{jk} c_j c_k + ... ]
        =: c_i d_i p_{n+1},

say. Hence

    Σ_{i=1}^{n+1} p_i = p_{n+1} [ 1 + Σ_{i=1}^n c_i d_i ] = 1,

so that

    p_{n+1} = [ 1 + Σ_{i=1}^n c_i d_i ]^{-1}.
(11.3.5) then applies to compute the other p_i's from the set of c's and
a's, i.e., from the data. As mentioned earlier, in the univariate case, the
Kaplan-Meier estimator is always defective and thus unprotected against
loss of mass when the δ of the largest Z equals zero. Hence it is
interesting to look at our approach also in dimension one. In this case
a_{ij} = 1 for j ≥ i, after ordering. Conclude that

    b_i = δ_i / (n + 2 − i),   1 ≤ i ≤ n + 1,

and (11.3.5) becomes

    p_i = c_i Σ_{j=i+1}^{n+1} p_j,   1 ≤ i ≤ n.

Solving this recursion yields

    p_i = c_i Π_{j=i+1}^n (1 + c_j) p_{n+1},

with

    p_{n+1} = [ Π_{j=1}^n (1 + c_j) ]^{-1},

and hence

    p_i = b_i Π_{j=1}^{i−1} [ 1 − b_j ].
These are exactly the Kaplan-Meier weights given to the sample (Z_i, δ_i),
1 ≤ i ≤ n, enriched by x∞. The choice of x∞ is not important when we
restrict our estimator to the data. Furthermore, as was shown by Stute and
Wang (1993) and Stute (1995), the fundamental large sample results, the
Strong Law of Large Numbers and the Central Limit Theorem, continue to hold
for KM-integrals and are not affected by the possible (negligible)
defectiveness of the estimator. Similarly for the multivariate case, where
(before normalization) the mass 1/(n+1) given to x∞ is positive but small
as n gets larger.
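In dimension one, the closed form p_i = b_i Π_{j<i} (1 − b_j) with
b_i = δ_i/(n + 2 − i) can be sketched directly; the function name is my
own, and ties among the Z_i are assumed absent. Note that, as in the text,
the sample is enriched by the phantom point x∞ (index n + 1), which carries
the remaining mass, so the masses sum to one by construction.

```python
import numpy as np

def km_weights(z, delta):
    """Masses p_i = b_i * prod_{j<i} (1 - b_j), b_i = delta_i/(n+2-i),
    for the ordered sample enriched by x_inf (index n+1, delta = 1)."""
    z = np.asarray(z, float)
    delta = np.asarray(delta, int)
    order = np.argsort(z)
    d = np.append(delta[order], 1)             # x_inf counts as uncensored
    n = len(z)
    b = d / (n + 2 - np.arange(1, n + 2))      # b_i = delta_i/(n+2-i)
    surv = np.cumprod(1 - b)
    p = b * np.append(1.0, surv[:-1])          # prefix products over j < i
    return p                                    # p[-1] is the mass at x_inf
```

Since b_{n+1} = 1, the total mass 1 − Π (1 − b_i) equals one exactly.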
11.4

While in the previous section the focus was on the construction of a
nonparametric estimator of a multivariate survival function and some of its
numerical properties, this section is devoted to its statistical behavior.
For this, let P be again any finite measure on the Euclidean space enriched
by x∞. Recall that

    q(x) = [ 1 + Σ_{i=1}^∞ ∫⋯∫ 1{t_i ≥ ... ≥ t_1 ≥ x, t_i ≠ x∞} P(dt_i) ⋯ P(dt_1) ]
           / [ 1 + Σ_{i=1}^∞ ∫⋯∫ 1{t_i ≥ ... ≥ t_1 ≥ 0, t_i ≠ x∞} P(dt_i) ⋯ P(dt_1) ].   (11.4.2)
Since q(x) is uniquely determined through P, we may view q(x) as the value
of a statistical functional q(x) = T_x(P) evaluated at P. Equation (11.4.2)
exhibits that T_x is a ratio of two infinite order U-functionals. Similar
to Serfling (1980) one can show that T_x is Gateaux-differentiable. The
left-hand side of (11.4.1) yields not only a function in x but also a
distribution which is again denoted with q. By taking increments in
(11.4.1) we obtain, for any rectangle I,

    q(I) = ∫ 1_I(t) q(t) P(dt).                                   (11.4.3)

By linearity,

    ∫ φ dq = ∫ φ(t) q(t) P(dt)                                    (11.4.4)

for any elementary function φ. Finally, by going to the limit, (11.4.4)
holds also for any bounded continuous function φ, and finally for any
bounded measurable φ. The boundedness is needed to guarantee, for any
finite P, the existence of the integrals. Relevant q obtained and discussed
so far are the ones associated with
    dP = dΛε,

    dP = dP0 = (1 − ε) dH^11 / [ (1 − ε) H̄ + ε ]

and

    dP = dP_n0 = (1 − 1/(n+1)) dH_n^11 / [ (1 − 1/(n+1)) H̄_n + 1/(n+1) ]
               = dH_n^11 / (H̄_n + 1/n).
The true but unknown hazard measure dP = dF/F̄ does not belong to this
collection since in many cases it puts infinite mass on the Euclidean
space. To also include integrals w.r.t. the last P, we assume that the
function φ is such that F̄ is bounded away from zero on the support of φ.
This guarantees that all integrals and moments to appear below exist.
Important examples of such φ's are the indicators of the rectangles
[0, x1] × [0, x2] with x1 ≤ T1 and x2 ≤ T2 such that F̄(T1, T2) > 0. Of
course, for such a φ,

    ∫ φ dF = F(x1, x2).

Recall that for sample size n the mass sitting at x∞ equals 1/(n+1). With
this ε = ε_n, we have

    F̄_n(x) = T_x(P_n0).
Furthermore, set

    F̄_n0(x) = T_x(P0),

a non-random survival function. To study the efficiency of our estimator,
decompose

    dF̄_n − dF̄_n0 = [ F̄_n − F̄_n0 ] dP_n0 + F̄_n0 [ dP_n0 − dP0 ] =: I + II.

Conclude that

    II = F̄_n0 [ dH_n^11 / (H̄_n + 1/n) − dH^11 / (H̄ + 1/n) ]
       = F̄_n0 / [ (H̄_n + 1/n)(H̄ + 1/n) ]
         × [ H̄ dH_n^11 − H̄_n dH^11 + (1/n)(dH_n^11 − dH^11) ].
Standard empirical process theory thus yields, in view of Theorem 11.2.2,

    n^{1/2} II = n^{1/2} (F̄/H̄²) [ H̄ dH_n^11 − H̄_n dH^11 ] + oP(1)
              = n^{1/2} { (F̄/H̄)(dH_n^11 − dH^11)
                + (F̄/H̄²)[H̄ − H̄_n] dH^11 } + oP(1).               (11.4.5)

Conclude that n^{1/2} II admits a linear representation

    n^{1/2} II = ∫ γ dᾱ_n + oP(1),

with ᾱ_n the empirical process of the data and γ a function involving 1/Ḡ.
A similar expansion will also be relevant for I. Actually, since both F̄_n
and F̄_n0 pertain to the same ε, namely ε = 1/(n+1), we obtain from (11.2.2)
an expansion of F̄_n(t) − F̄_n0(t) in which, in each summand of the series,
exactly one P0 is replaced with P_n0 − P0. The linear expansion of the
right-hand side is obtained through the Hajek projection; see Serfling
(1980). More precisely, define Λ̄_n through dΛ̄_n = dᾱ_n / F̄. Then we have,
up to an oP(1) term,

    n^{1/2} [ F̄_n(t) − F̄_n0(t) ]
      = n^{1/2} Σ_{i=1}^∞ Σ_{k=1}^i ∫⋯∫ 1{t_i ≥ ... ≥ t_1 ≥ t, t_i ≠ x∞}
                 P0(dt_i) ⋯ Λ̄_n(dt_k) ⋯ P0(dt_1)
      = n^{1/2} Σ_{k=1}^∞ ∫⋯∫ 1{t_k ≥ ... ≥ t_1 ≥ t, t_k ≠ x∞}
                 { 1 + Σ_{i=k+1}^∞ ∫⋯∫ 1{t_i ≥ ... ≥ t_{k+1} ≥ t_k, t_i ≠ x∞}
                   P0(dt_i) ⋯ P0(dt_{k+1}) } Λ̄_n(dt_k) ⋯ P0(dt_1)
      = n^{1/2} Σ_{k=1}^∞ ∫⋯∫ F̄(t_k) 1{t_k ≥ ... ≥ t_1 ≥ t, t_k ≠ x∞}
                 Λ̄_n(dt_k) [ F(dt_{k−1})/F̄(t_{k−1}) ] ⋯ [ F(dt_1)/F̄(t_1) ].
Integrating a function φ w.r.t. this expansion yields

    I = Σ_{k=1}^∞ ∫⋯∫ [ φ(t)/F̄(t) ] F̄(t_k) 1{t_k ≥ ... ≥ t_1 ≥ t, t_k ≠ x∞}
        Λ̄_n(dt_k) [ F(dt_{k−1})/F̄(t_{k−1}) ] ⋯ [ F(dt_1)/F̄(t_1) ] F(dt).

Altogether,

    n^{1/2} [ ∫ φ dF̄_n − ∫ φ dF̄_n0 ] = ∫ ψ dᾱ_n + oP(1)           (11.4.6)

with

    ψ(s) = φ(s) + Σ_{k=1}^∞ ∫⋯∫ φ(t_1) 1{s ≥ t_k ≥ ... ≥ t_1}
           [ F(dt_k)/F̄(t_k) ] ⋯ [ F(dt_1)/F̄(t_1) ].               (11.4.7)
The process ᾱ_n is driven by i.i.d. summands. For a single datum, the
corresponding increment β1 satisfies

    ∫ ψ dβ1 = [ δ1 δ2 / Ḡ(Z1, Z2) ] ψ(Z1, Z2)
              − ∫∫ 1{Z1 ≥ z1, Z2 ≥ z2} [ ψ(z1, z2) / H̄(z1, z2) ]
                Ḡ(z1, z2) F(dz1, dz2).

It follows that

    E[ ∫ ψ dβ1 | X1, X2 ] = ψ(X1, X2)
              − ∫∫ 1{X1 ≥ z1, X2 ≥ z2} [ ψ(z1, z2) / F̄(z1, z2) ] F(dz1, dz2).

If we plug (11.4.7) into the last expression, the terms in the series
cancel out and the conditional expectation drops down to

    E[ ∫ ψ dβ1 | X1, X2 ] = φ(X1, X2).                            (11.4.8)
11.5
Simulation Results

[Figure: simulated bivariate Kaplan-Meier estimate ("Fehat") compared with
the true survival function ("Fbar"), each plotted as a surface over
(x1, x2)]
Bibliography

[1] Dabrowska, D. M., Kaplan-Meier estimate on the plane, Ann. Statist.,
16, 1475-1489 (1988).

[2] Gill, R. D., van der Laan, M. J. and Wellner, J. A., Inefficient
estimators of the bivariate survival function for three models, Ann. Inst.
H. Poincaré Probab. Statist., 31, 545-597 (1995).

[3] Prentice, R. L., Moodie, F. and Wu, J., Hazard-based nonparametric
survivor function estimation, J. R. Stat. Soc. Ser. B. Stat. Methodol.,
66, 305-319 (2004).

[4] Serfling, R. J., Approximation theorems of mathematical statistics,
Wiley, New York (1980).

[5] Stute, W., The central limit theorem under random censorship, Ann.
Statist., 23, 422-439 (1995).

[6] Stute, W. and Wang, J. L., The strong law under random censorship,
Ann. Statist., 21, 1591-1607 (1993).

[7] van der Laan, M. J., Efficient estimation in the bivariate censoring
model and repairing NPMLE, Ann. Statist., 24, 596-627 (1996).

[8] van der Vaart, A., On differentiable functions, Ann. Statist., 19,
178-204 (1991).
Chapter 12

Nonparametric Curve
Estimation

12.1

For a distribution function F with density f, we have

    f(x) ≈ [ F(x + a/2) − F(x − a/2) ] / a

for small a. Hence we could consider the right-hand side, for a fixed
bandwidth a > 0, and estimate the ratio through

    f_n(x) = [ F_n(x + a/2) − F_n(x − a/2) ] / a.

Its expectation equals

    E f_n(x) = [ F(x + a/2) − F(x − a/2) ] / a,
and

    Var f_n(x) = (na²)^{-1} [ F(x + a/2) − F(x − a/2) ]
                 × [ 1 − F(x + a/2) + F(x − a/2) ]
               ≈ f(x) / (na).
The formula for E f_n(x) reveals that in general f_n(x) is a biased
estimator of f(x). Put

    Bias f_n(x) := E f_n(x) − f(x).

If f is twice continuously differentiable, then Taylor's formula yields

    Bias f_n(x) = (a²/24) f''(x) + O(a³).

The mean squared error therefore becomes, up to higher order terms,

    (a⁴/576) [f''(x)]² + f(x)/(na) = c1 a⁴ + c2/(na).

Minimization w.r.t. a yields the optimal bandwidth

    a = [ c2 / (4 c1 n) ]^{1/5} = c n^{-1/5}.
As to the stochastic part, write

    √(n a_n) [ f_n(x) − E f_n(x) ] = a_n^{-1/2} [ α_n(x + a_n/2) − α_n(x − a_n/2) ],

with α_n the empirical process of the sample. Results on the oscillation of
empirical processes show that this term is of the order

    O( √(ln a_n^{-1}) ),

so that for a_n ~ n^{-1/5} the stochastic error of f_n(x) is of the order
O( √(ln n) n^{-2/5} ). The analytic error (bias) equals, as we have seen
before, O(a_n²) = O(n^{-2/5}).
In this section we extend f_n to a more general class of estimators. Put

    K(y) = 1{−1/2 ≤ y < 1/2},   y ∈ R.                            (12.1.1)

Then we have

    f_n(x) = a^{-1} ∫ K( (x − y)/a ) F_n(dy).                     (12.1.2)

Note that

    ∫ K(y) dy = 1,

i.e., K is a Lebesgue density. Equation (12.1.2) also applies to other
densities K as well. If we replace the indicator by a differentiable K,
also the resulting f_n will be differentiable, allowing for estimation of
higher order derivatives of F. The function K is called the kernel and the
resulting f_n is a kernel estimator.

As for the naive kernel (12.1.1), the optimal choice of a_n depends on
unknown quantities. Introducing the function

    K_a(y) = a^{-1} K(y/a),   y ∈ R,

we may write f_n as the convolution f_n = K_a ∗ F_n.
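The estimator (12.1.2) can be sketched as follows; the function name and
the Gaussian-kernel usage below are my own, not from the text. The default
kernel is the naive box kernel (12.1.1).

```python
import numpy as np

def kde(x, data, a, kernel=None):
    """Kernel density estimate f_n(x) = (1/(n a)) sum K((x - X_i)/a).

    With the default box kernel K = 1_{[-1/2, 1/2)} this is exactly
    the estimator of (12.1.2)."""
    data = np.asarray(data, float)
    if kernel is None:
        kernel = lambda u: ((-0.5 <= u) & (u < 0.5)).astype(float)
    u = (x - data) / a
    return float(kernel(u).sum() / (len(data) * a))
```

Passing a smooth density, e.g.
`kernel=lambda u: np.exp(-u**2/2)/np.sqrt(2*np.pi)`, yields a
differentiable estimate, in line with the remark above.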
12.2

In regression analysis one studies the function m(x) = E[Y | X = x]. If x
is an atom of X, i.e., P(X = x) > 0, then

    m(x) = [ P(X = x) ]^{-1} ∫_{{X=x}} Y dP.                      (12.2.3)

By the SLLN,

    n^{-1} Σ_{i=1}^n Y_i 1{X_i = x} → ∫_{{X=x}} Y dP

and

    n^{-1} Σ_{i=1}^n 1{X_i = x} → P(X = x)

with probability one. Thus, defining

    m_n(x) = Σ_{i=1}^n Y_i 1{X_i = x} / Σ_{i=1}^n 1{X_i = x}
           = Σ_{i=1}^n Y_i 1_{x}(X_i) / Σ_{i=1}^n 1_{x}(X_i),

we get

    m_n(x) → m(x)  with probability one.
If x is not an X-atom, (12.2.3) is no longer true. Nadaraya and Watson
proposed (when d = 1) to replace the one point set {x} with an interval
(x − a_n, x + a_n]:

    m_n(x) = Σ_{i=1}^n Y_i 1{x − a_n < X_i ≤ x + a_n}
             / Σ_{i=1}^n 1{x − a_n < X_i ≤ x + a_n}               (12.2.4)

(= 0 if the denominator is zero). Here a_n > 0 is again a bandwidth (or
window) tending to zero as n gets large. It will play the same role as the
bandwidth in the kernel density estimator. Similarly, also here, we are
likely to lose the local information contained in the data and increase the
bias if we choose a_n too large (oversmooth), and have too much variance if
a_n is too small (undersmooth). Choice of a proper bandwidth will therefore
be an important issue also here.
Notice that m_n of (12.2.4) may be written as

    m_n(x) = Σ_{i=1}^n Y_i K( (x − X_i)/a_n ) / Σ_{i=1}^n K( (x − X_i)/a_n )   (12.2.5)

with

    K(y) = 1_{[−1,1)}(y).

Needless to say, (12.2.5) may be extended to more general kernels K. Note
also that, unlike the density case, we need not require ∫ K(u) du = 1 (the
case K ≡ 0 is always ruled out). Also, because m may also take on negative
values, there are no psychological (or emotional) objections to kernels
attaining negative values.
If x is not an atom of X, more sophisticated arguments will be needed for
showing, e.g., consistency of m_n(x). Here is a heuristic argument which
uses the fact that m(X) = E[Y | X]. Only the means of the numerator and the
denominator are considered. We have

    E[ (n a_n)^{-1} Σ_{i=1}^n Y_i K( (x − X_i)/a_n ) ]
        = E[ a_n^{-1} Y K( (x − X)/a_n ) ]
        = E{ E[ ... | X ] }
        = E{ a_n^{-1} K( (x − X)/a_n ) E[Y | X] }
        = E{ a_n^{-1} m(X) K( (x − X)/a_n ) }.

Assuming that X has a density f, the last expectation becomes

    a_n^{-1} ∫ m(y) f(y) K( (x − y)/a_n ) dy → m(x) f(x) ∫ K(u) du

as a_n → 0, under appropriate regularity conditions on f and m. The
denominator of (12.2.5) formally equals the numerator if we set Y_i ≡ 1.
Thus the expectation of the denominator is likely to converge to
f(x) ∫ K(u) du. Provided that f(x) > 0, the ratio therefore in the limit
becomes m(x), as desired.
Originally, m_n was motivated upon assuming that (X, Y) has a (bivariate)
density g. In this case

    m(x) = ∫ y g(x, y) dy / f(x),                                 (12.2.6)

where

    f(x) = ∫ g(x, y) dy

is the marginal density of X. Denote with

    H_n(u, v) = n^{-1} Σ_{i=1}^n 1{X_i ≤ u, Y_i ≤ v},   u, v ∈ R,

the bivariate empirical d.f., and estimate g through the product kernel
estimator

    g_n(x, y) = a_n^{-2} n^{-1} Σ_{i=1}^n K( (x − X_i)/a_n ) K( (y − Y_i)/a_n ).

Then, provided that ∫ K(w) dw = 1 and ∫ w K(w) dw = 0,

    ∫ y g_n(x, y) dy
      = a_n^{-2} ∫∫∫ y K( (x − u)/a_n ) K( (y − v)/a_n ) Hn(du, dv) dy
      = a_n^{-2} ∫∫ [ ∫ y K( (y − v)/a_n ) dy ] K( (x − u)/a_n ) Hn(du, dv)
      = a_n^{-1} ∫∫ [ ∫ (a_n w + v) K(w) dw ] K( (x − u)/a_n ) Hn(du, dv)
      = a_n^{-1} ∫∫ v K( (x − u)/a_n ) Hn(du, dv)
      = a_n^{-1} n^{-1} Σ_{i=1}^n Y_i K( (x − X_i)/a_n ).

If we estimate f through

    f_n(x) = a_n^{-1} n^{-1} Σ_{i=1}^n K( (x − X_i)/a_n ),

respectively, then

    m_n(x) = ∫ y g_n(x, y) dy / f_n(x)

coincides with (12.2.5).
With the weights

    W_{in}(x) = K( (x − X_i)/a_n ) / Σ_{j=1}^n K( (x − X_j)/a_n )

we have

    m_n(x) = Σ_{i=1}^n W_{in}(x) Y_i.                             (12.2.7)
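The weight representation (12.2.7) translates directly into code. Below is
a minimal sketch (function name my own), using the kernel K = 1_{[−1,1)}
from (12.2.5) as default and returning 0 when all weights vanish, as
stipulated after (12.2.4).

```python
import numpy as np

def nadaraya_watson(x, X, Y, a, kernel=None):
    """Nadaraya-Watson estimate m_n(x): weighted average of the Y_i
    with weights W_in(x) proportional to K((x - X_i)/a)."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    if kernel is None:
        kernel = lambda u: ((-1.0 <= u) & (u < 1.0)).astype(float)
    k = kernel((x - X) / a)
    s = k.sum()
    return float(k @ Y / s) if s > 0 else 0.0
```

Any other kernel, including ones taking negative values, may be passed, in
line with the remarks above.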
12.3

In this section we study general weighted regression estimators of the form

    m_n(x) = Σ_{i=1}^n W_{in}(x) Y_i

and formulate conditions on the weights under which m_n(X) is consistent.
The conditions are (for every nonnegative Borel-function f):

(i)   E[ Σ_{i=1}^n |W_{in}(X)| f(X_i) ] ≤ C E f(X) for some C < ∞;

(ii)  P( Σ_{i=1}^n |W_{in}(X)| ≤ D ) = 1 for some D < ∞;

(iii) Σ_{i=1}^n |W_{in}(X)| 1{‖X_i − X‖ > ε} → 0 in probability, for every ε > 0;

(iv)  Σ_{i=1}^n W_{in}(X) → 1 in probability;

(v)   max_{1≤i≤n} |W_{in}(X)| → 0 in probability.

Theorem 12.3.2. Assume (i) - (v). Then, for E|Y|^r < ∞, r ≥ 1,

    m_n(X) → m(X)  in L^r.

The proof relies on the following lemmas.

Lemma 12.3.3. Under (i) - (iii), for every Borel-function f with
E|f(X)|^r < ∞,

    E Σ_{i=1}^n |W_{in}(X)| |f(X_i) − f(X)|^r → 0.

Proof. Given δ > 0, choose a continuous h with compact support such that
E|f(X) − h(X)|^r ≤ δ. By (i),

    E[ Σ_{i=1}^n |W_{in}(X)| |f(X_i) − h(X_i)|^r ] ≤ C E|f(X) − h(X)|^r ≤ Cδ.

From (ii),

    E[ Σ_{i=1}^n |W_{in}(X)| |f(X) − h(X)|^r ] ≤ D E|f(X) − h(X)|^r ≤ Dδ.
Altogether this shows that we need to prove the lemma only for continuous f
with compact support. Set M = ‖f‖∞. Since f is uniformly continuous, given
ε > 0, we may find some δ > 0 such that ‖x − x1‖ ≤ δ implies
|f(x) − f(x1)|^r ≤ ε. By (ii),

    E[ Σ_{i=1}^n |W_{in}(X)| |f(X_i) − f(X)|^r ]
        ≤ (2M)^r E{ Σ_{i=1}^n |W_{in}(X)| 1{‖X_i − X‖ > δ} } + Dε.

Use (ii) and (iii) to conclude that the last expectation converges to zero.
This completes the proof of the lemma.
Lemma 12.3.4. Under (i) - (iii), let {W_n} be a sequence of nonnegative
weights such that for nonnegative constants M_n and N_n

    P( M_n ≤ Σ_{i=1}^n W_{in}(X) ≤ N_n ) → 1.

Then, for every nonnegative Borel-function f with E f(X) < ∞,

    lim inf_n E Σ_{i=1}^n W_{in}(X) f(X_i) ≥ (lim inf_n M_n) E f(X)

and

    lim sup_n E Σ_{i=1}^n W_{in}(X) f(X_i) ≤ (lim sup_n N_n) E f(X).

Proof. Set

    A_n = { M_n ≤ Σ_{i=1}^n W_{in}(X) ≤ N_n }.

On A_n,

    M_n f(X) ≤ Σ_{i=1}^n W_{in}(X) f(X) ≤ N_n f(X),

and therefore

    lim inf_n E Σ_{i=1}^n W_{in}(X) f(X) ≥ (lim inf_n M_n) E f(X).

Apply Lemma 12.3.3 to prove the lim inf part of the lemma. A similar
argument yields the lim sup part.
Lemma 12.3.5. Under (i) - (iii), assume that for some constants M_n and N_n

    P( M_n ≤ Σ_{i=1}^n W_{in}²(X) ≤ N_n ) → 1.

Then

    lim inf_n E Σ_{i=1}^n W_{in}²(X) f(X_i) ≥ (lim inf_n M_n) E f(X)

and

    lim sup_n E Σ_{i=1}^n W_{in}²(X) f(X_i) ≤ (lim sup_n N_n) E f(X).

Proof. Apply the last lemma to {W_n²}, with C and D replaced by CD and D²,
respectively.
Lemma 12.3.6. Under (i) - (iii), for every Borel-function f with
E|f(X)| < ∞,

    E Σ_{i=1}^n |W_{in}(X)| |f(X_i) − f(X)| → 0.

Proof. For bounded f this follows from Lemma 12.3.3. For an arbitrary f, we
may choose an M > 0 such that P(|f(X)| > M) ≤ ε, a given small positive
constant. Set f̃ = (f ∧ M) ∨ (−M). Since

    { f̃(X_i) ≠ f(X_i) } ⊂ { |f(X_i)| > M },

the general case follows upon letting M → ∞.
Lemma 12.3.7. Under (ii) and (iv), for every f with E|f(X)| < ∞,

    E[ |Σ_{i=1}^n W_{in}(X) − 1|^r f(X) ] → 0.                    (12.3.2)

Proof. By (ii),

    |Σ_{i=1}^n W_{in}(X) − 1|^r ≤ (1 + D)^r  with probability one.

Thus, by (iv) and dominated convergence, (12.3.2) holds.
Proof of Theorem 12.3.2. Write

    Σ_{i=1}^n W_{in}(X) Y_i − m(X)
        = [ Σ_{i=1}^n W_{in}(X) m(X_i) − m(X) ] + Σ_{i=1}^n W_{in}(X) Z_i,

where

    Z_i = Y_i − E[Y_i | X_i] = Y_i − m(X_i).

The term in brackets converges to zero in L^r by Lemma 12.3.7. For the
second sum consider r = 2 first. It follows from the very definition of m
and the independence of the data that

    E[ Σ_{i=1}^n W_{in}(x) Z_i ]² = Σ_{i=1}^n E[ W_{in}²(x) E(Z_i² | X_i) ]
                                  = Σ_{i=1}^n E[ W_{in}²(x) h(X_i) ],

where h(x) = Var(Y | X = x). Hence

    E[ Σ_{i=1}^n W_{in}(X) Z_i ]² = E[ Σ_{i=1}^n W_{in}²(X) h(X_i) ].

By Lemma 12.3.5, the limsup of the right-hand side is less than or equal to

    (lim sup_n N_n) E h(X),

for any N_n with

    P( Σ_{i=1}^n W_{in}²(X) ≤ N_n ) → 1.                          (12.3.4)

By (ii) and (v),

    Σ_{i=1}^n W_{in}²(X) ≤ D max_{1≤i≤n} |W_{in}(X)| → 0  in probability,

i.e., for N_n we may take N_n ≡ η, where η > 0 is any small number.
Expression (12.3.4) thus becomes η E h(X). Since η is arbitrary, this
proves the theorem.
Remark 12.3.8. Observing that

    W_{in}(X) = W_{in}(X; X_1, ..., X_n)

is a measurable function of the i.i.d. random vectors X, X_1, ..., X_n, we
get

    E[ |W_{in}(X)| f(X_i) ] = E[ |W_{in}(X_i; X_1, ..., X, ..., X_n)| f(X) ],

so that (i) amounts to the requirement that

    E[ { Σ_{i=1}^n |W_{in}(X_i; X_1, ..., X, ..., X_n)| } f(X) ] ≤ C E f(X)   (i)*

holds.                                                            (12.3.5)
Remark 12.3.9. Let g be a Borel-measurable function such that g(Y) ∈ L^r.
Replacing Y_i by g(Y_i), we obtain

    m_n(x) = Σ_{i=1}^n W_{in}(x) g(Y_i).

For example, with g = 1{· ≤ t},

    m_n(t; x) = Σ_{i=1}^n W_{in}(x) 1{Y_i ≤ t}

and

    m(x) = m(t; x) = P(Y ≤ t | X = x),

the empirical and the true conditional d.f. at t given X = x. Similarly,

    σ_n²(x) = Σ_{i=1}^n W_{in}(x) Y_i² − [ Σ_{i=1}^n W_{in}(x) Y_i ]²

estimates the conditional variance.
12.4

Let μ denote the distribution of X and, for ε > 0, let S_ε(x) denote the
open ε-ball with center x. For x ∈ supp(μ), we have μ(S_ε(x)) > 0. In
fact, from μ(S_ε(x)) = 0 we obtain μ(A) = 1, where A = supp(μ) \ S_ε(x) is
a closed set strictly contained in supp(μ), a contradiction. Now, from the
SLLN,

    μ_n(S_ε(x)) → μ(S_ε(x)) ≡ a > 0   P-a.s.

Hence,

    Σ_{i=1}^n 1{X_i ∈ S_ε(x)} ≥ k_n

eventually with probability one. Conclude d_{k_n:n} < ε. This proves the
lemma. In the following, replace x by X, so that d_{k_n:n} becomes
d_{k_n:n}(X).

Lemma 12.4.5. Under k_n/n → 0 as n → ∞,

    d_{k_n:n}(X) → 0  with probability one.

Proof. Set

    A = { (ω, x) ∈ Ω × R^d : d_{k_n:n}(x) → 0 }.

Then A ∈ A ⊗ B^d. From Lemma 12.4.4,

    P(A_x) = 1  for each x ∈ supp(μ).

By independence of X and the training sample,

    P( d_{k_n:n}(X) → 0 ) = ∫_{supp(μ)} P(A_x) μ(dx) = 1.
We now introduce the k_n-NN weights. For x ∈ R^d, set

    W_{in}(x) = 1/k_n   if X_i is among the k_n NN of x,
                0       otherwise.

Lemma 12.4.6. Under k_n/n → 0, we have for each ε > 0

    Σ_{i=1}^n W_{in}(X) 1{‖X_i − X‖ > ε} → 0  with probability one.
Proof. Set

    A_n := { d_{k_n:n}(X) > ε }.

Lemma 12.4.5 yields P(A_n) → 0 as n → ∞. Furthermore, on the complement of
A_n the sum vanishes, which proves the lemma.

Since

    Σ_{i=1}^n W_{in}(x) = 1

and

    max_{1≤i≤n} W_{in}(x) ≤ 1/k_n → 0   (provided that k_n → ∞),

also conditions (ii), (iv) and (v) are satisfied. It remains to verify
condition (i) resp. (i)*. Notice that the left-hand side of (i)* equals
k_n^{-1} times the number of i's for which x is among the k_n-NN of X_i.
We rst treat the case kn = 1.
Lemma 12.4.7 (Bickel and Breiman). For each d ≥ 1 and any norm ‖·‖, there exists γ(d) < ∞ such that it is possible to write R^d as the union of γ(d) disjoint cones C_1, …, C_{γ(d)} with 0 as their common peak such that if x, y ∈ C_j (x, y ≠ 0), then ‖x − y‖ < max(‖x‖, ‖y‖), j = 1, …, γ(d).

Proof. Set

S(0, 1) = {x ∈ R^d : ‖x‖ = 1},

the unit sphere in R^d. Since S(0, 1) is compact we can find disjoint sets C̄_1, …, C̄_{γ(d)}, each of diameter less than one, such that

∪_{j=1}^{γ(d)} C̄_j = S(0, 1).
Let

C_j = {λx : x ∈ C̄_j, λ ≥ 0},   1 ≤ j ≤ γ(d).

Suppose x = λx̄, y = μȳ with x̄, ȳ ∈ C̄_j. W.l.o.g., assume λ ≤ μ. Then,

‖x − y‖ = ‖λx̄ − μȳ‖ ≤ λ‖x̄ − ȳ‖ + (μ − λ)‖ȳ‖ < λ + (μ − λ) = μ = ‖y‖,

since λ > 0, ‖x̄ − ȳ‖ < 1 and ‖ȳ‖ = 1.
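The geometric fact behind Lemma 12.4.7 can be checked numerically. The following is a minimal sketch, not part of the text: for d = 2 with the Euclidean norm, eight 45°-cones suffice, since their traces on the unit circle have diameter 2 sin(22.5°) < 1. The script draws random pairs of nonzero points and verifies ‖x − y‖ < max(‖x‖, ‖y‖) whenever both lie in the same cone.

```python
import math
import random

def norm(v):
    return math.hypot(v[0], v[1])

def cone_index(v, n_cones=8):
    """Index of the 45-degree angular cone (peak at 0) containing v != 0."""
    return int((math.atan2(v[1], v[0]) % (2 * math.pi)) // (2 * math.pi / n_cones))

random.seed(0)
ok = True
for _ in range(10000):
    # two random nonzero points of the plane, in polar coordinates
    r1, r2 = random.uniform(0.1, 10), random.uniform(0.1, 10)
    t1, t2 = random.uniform(0, 2 * math.pi), random.uniform(0, 2 * math.pi)
    x = (r1 * math.cos(t1), r1 * math.sin(t1))
    y = (r2 * math.cos(t2), r2 * math.sin(t2))
    if cone_index(x) == cone_index(y):
        # the Bickel-Breiman property within a common cone
        if not norm((x[0] - y[0], x[1] - y[1])) < max(norm(x), norm(y)):
            ok = False
print(ok)
```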
Altogether, we have shown that the k_n-NN weights satisfy the assumptions of Theorem 12.3.2. Hence we may conclude

Theorem 12.4.10. Under k_n/n → 0 and k_n → ∞, the k_n-NN regression estimate

m_n(X) = Σ_{i=1}^n W_in(X) Y_i

fulfills

m_n(X) → m(X) in L_r.

Notice that Theorem 12.4.10 holds under no assumption on the underlying distribution of (X, Y) (up to E|Y|^r < ∞). Therefore we say that m_n(X) is universally consistent.
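To make the estimator of Theorem 12.4.10 concrete, here is a minimal sketch of the k_n-NN regression estimate for real-valued X. The toy model m(x) = x² with Gaussian noise and all parameter choices are illustrative assumptions, not taken from the text.

```python
import random

def knn_regress(x, sample, k):
    """k_n-NN regression estimate m_n(x) = sum_i W_in(x) Y_i, where
    W_in(x) = 1/k for the k sample points X_i nearest to x, else 0."""
    neighbors = sorted(sample, key=lambda xy: abs(xy[0] - x))[:k]
    return sum(y for _, y in neighbors) / k

# Toy check on m(x) = x^2 plus small noise; note k/n -> 0 and k -> infinity
random.seed(1)
n, k = 5000, 50
sample = [(x, x * x + random.gauss(0, 0.1))
          for x in (random.uniform(0, 1) for _ in range(n))]
est = knn_regress(0.5, sample, k)
print(abs(est - 0.25))
```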
12.5 Nonparametric Classification
Suppose that Y is some variable attaining only finitely many values, say {1, …, m}. Rather than Y, we observe some random vector X, which typically is correlated with Y. The problem of classification is one of specifying the value of Y on the basis of X.
In other words, we are looking for a function (or rule) g of X such that g(X) may serve as a substitute for Y. Denote with

P(g(X) ≠ Y)

the probability of misclassification. g_0 is optimal if the probability of misclassification is minimal:

inf_g P(g(X) ≠ Y) ≡ L* = P(g_0(X) ≠ Y).
But

P(g(X) = Y) = Σ_{i=1}^m P(g(X) = i, Y = i) = Σ_{i=1}^m ∫_{{g(X)=i}} P(Y = i | X) dP

= Σ_{i=1}^m ∫_{{x : g(x)=i}} p_i(x) μ(dx)   (12.5.1)

≤ Σ_{i=1}^m ∫_{{x : g(x)=i}} max_j p_j(x) μ(dx) = ∫ max_j p_j(x) μ(dx),   (12.5.2)

and therefore

L* = 1 − E( max_{1≤j≤m} p_j(X) ), the Bayes Risk.

The unknown p_i are estimated through

p_in(x) = Σ_{j=1}^n W_jn(x) 1{Y_j = i},   1 ≤ i ≤ m,   (12.5.3)

and the empirical rule is ĝ_n(x) = arg max_{1≤i≤m} p_in(x). Write e_n(x) = max_{1≤i≤m} |p_in(x) − p_i(x)|; under the assumptions of Theorem 12.3.2,

E e_n(X) → 0.   (12.5.4)
Now, by definition of L*,

L* ≤ P(ĝ_n(X) ≠ Y | X_1, Y_1, …, X_n, Y_n)   P-a.s.

Integrate out to get

L* ≤ P(ĝ_n(X) ≠ Y).

Furthermore,

P(ĝ_n(X) = Y | X, X_1, Y_1, …, X_n, Y_n) = Σ_{i=1}^m 1{ĝ_n(X)=i} p_i(X)

≥ Σ_{i=1}^m 1{ĝ_n(X)=i} p_in(X) − e_n(X)

= max_{1≤i≤m} p_in(X) − e_n(X)

≥ max_{1≤i≤m} p_i(X) − 2 e_n(X),

whence

P(ĝ_n(X) = Y) ≥ E[ max_{1≤i≤m} p_i(X) ] − 2 E e_n(X) = 1 − L* − 2 E e_n(X).

From (12.5.4) we may infer

lim sup_{n→∞} P(ĝ_n(X) ≠ Y) ≤ L*,

as desired.
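The plug-in rule ĝ_n = arg max_i p_in just analyzed can be sketched with k-NN weights. The following toy example (two classes, 10% label noise, all constants my own illustrative choices) shows the rule recovering the majority label on each side of the decision boundary.

```python
import random

def knn_classify(x, sample, k, labels):
    """k_n-NN rule: g_n(x) = arg max_i p_in(x), where p_in(x) is the
    fraction of the k nearest neighbors of x carrying label i."""
    neighbors = sorted(sample, key=lambda xy: abs(xy[0] - x))[:k]
    p = {i: sum(1 for _, y in neighbors if y == i) / k for i in labels}
    return max(labels, key=lambda i: p[i])

# Two classes: Y = 1 on [0, 0.5), Y = 2 on [0.5, 1], flipped with prob. 0.1
random.seed(2)
n, k = 2000, 25
sample = [(x, (1 if x < 0.5 else 2) if random.random() > 0.1
              else (2 if x < 0.5 else 1))
          for x in (random.uniform(0, 1) for _ in range(n))]
g_left = knn_classify(0.25, sample, k, [1, 2])
g_right = knn_classify(0.75, sample, k, [1, 2])
print(g_left, g_right)
```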
Remark 12.5.3. The consistency in Bayes risk of the k_n-NN rule was first proved (under further regularity conditions) by Fix and Hodges (1951).
Remark 12.5.4. Much attention has also been given to the case k ≡ 1, i.e. Y is classified as the value of that Y_i for which X_i is the nearest neighbor of X. It may be shown that

L = lim_{n→∞} P(g_n(X) ≠ Y)

exists, with

L* ≤ L ≤ L* ( 2 − (m/(m−1)) L* ).
294
An important issue in pattern classication is that of estimating the unknown probability of misclassication. Ln is the overall probability of misclassication, i.e., it constitutes the mean risk averaged over all outcomes of
X and Y and the training sample.
A more informative quantity which represents the probability of misclassication given the particular sample X, (Xi , Yi ), 1 i n, on hand, is the
conditional probability
P(
gn (X) = Y |X, X1 , Y1 , . . . , Xn , Yn ) = Ln1 .
Another quantity which is of interest when only the training sample is known
but X has not been sampled so far, is
P(
gn (X) = Y |X1 , Y1 , . . . , Xn , Yn ) = Ln2 .
Note that both L_n1 and L_n2 are random. Recall that in the proof of Theorem 12.5.2 we have seen that

1 − L_n1 = Σ_{i=1}^m 1{ĝ_n(X)=i} p_i(X)   (12.5.5)

≥ max_{1≤i≤m} p_i(X) − 2 e_n(X),

so that in the mean

1 − E L_n1 ≥ 1 − L* − 2 ∫ e_n(x) μ(dx).   (12.5.6)

Furthermore,

1 − L_n2 = Σ_{i=1}^m ∫_{{ĝ_n=i}} p_i(x) μ(dx).

Replacing the unknown p_i by p_in yields the estimate

1 − L̂_n2 = Σ_{i=1}^m ∫_{{ĝ_n=i}} p_in(x) μ(dx) = ∫ max_{1≤i≤m} p_in(x) μ(dx),

with

|L̂_n2 − L_n2| ≤ Σ_{i=1}^m ∫_{{ĝ_n=i}} |p_in(x) − p_i(x)| μ(dx) ≤ ∫ e_n(x) μ(dx).

The last term converges to zero in the mean and hence in probability.
In the literature it is often reported that L̂_n2 underestimates L_n2. We now introduce two variants of L̂_n2:
L̂^CV_n2, the cross-validated estimate of L_n2: for this, compute p_{i,n−1} as before, but this time for (X_1, Y_1), …, (X_{j−1}, Y_{j−1}), (X_{j+1}, Y_{j+1}), …, (X_n, Y_n), i.e., for the whole sample with (X_j, Y_j) deleted. Write p^(j)_{i,n−1} for this p and put

1 − L̂^CV_n2 = n⁻¹ Σ_{j=1}^n max_{1≤i≤m} p^(j)_{i,n−1}(X_j).
L̂^H_n2, the holdout estimate of L_n2: split off the last k observations, compute p_{i,n−k} and the rule ĝ_{n−k} from (X_1, Y_1), …, (X_{n−k}, Y_{n−k}) only, and put

1 − L̂^H_n2 = k⁻¹ Σ_{j=n−k+1}^n max_{1≤i≤m} p_{i,n−k}(X_j)

and

L̂^H_n2 = k⁻¹ Σ_{j=n−k+1}^n 1{ĝ_{n−k}(X_j) ≠ Y_j}.
12.6

Consider an i.i.d. sequence X_1, X_2, … from a density f. For a given bandwidth a > 0, set

f_n(x) = (1/(na)) Σ_{i=1}^n K((x − X_i)/a) = (1/a) ∫ K((x − y)/a) F_n(dy).
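A minimal sketch of the estimator f_n follows; the Gaussian choice of K, the standard normal data and the bandwidth are illustrative assumptions of mine.

```python
import math
import random

def kde(x, data, a):
    """Kernel density estimate f_n(x) = (1/(n*a)) * sum_i K((x - X_i)/a),
    here with the Gaussian kernel K."""
    K = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(K((x - xi) / a) for xi in data) / (len(data) * a)

random.seed(4)
data = [random.gauss(0, 1) for _ in range(20000)]
est = kde(0.0, data, a=0.2)
true = 1 / math.sqrt(2 * math.pi)   # standard normal density at 0
print(abs(est - true))
```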
Set

F̂_n(t) = ∫_{−∞}^t f_n(x) dx,   t ∈ R,

and, for a given function φ,

I = ∫ φ(x) F̂_n(dx) = ∫ φ(x) f_n(x) dx = ∫ [ (1/a) ∫ φ(x) K((x − y)/a) dx ] F_n(dy) ≡ ∫ φ_a(y) F_n(dy).

We have

E(I) = ∫ φ_a(y) F(dy) ≡ α_a

and

n Var(I) = ∫ [φ_a − α_a]² F(dy) ≡ σ²(a),

provided φ_a is square-integrable.
In the following lemma we formulate some conditions on f and φ under which the above integrals converge to those with φ in place of φ_a. For this, note that

α_a = ∫ φ(x) [ (1/a) ∫ f(y) K((x − y)/a) dy ] dx.

Under some mild conditions on K the inner integrals converge to f(x) for all x up to a Lebesgue null set. To show

α_a → ∫ φ dF   as a → 0,   (12.6.1)

we need to justify the assumptions in the Lebesgue dominated convergence theorem. Whenever K(x) = T(‖x‖), T nonincreasing, by Wheeden and Zygmund, p. 156,

sup_{a>0} |f ∗ K_a(x)| ≤ c f*(x),   (12.6.2)

where

g*(x) = sup_Q (1/λ(Q)) ∫_Q |g(y)| dy

denotes the Hardy–Littlewood maximal function of g, Q is an interval centered at x and λ(Q) is the Lebesgue measure of Q. This definition extends to the multivariate case, in which the supremum is taken over all cubes with center x and edges parallel to the coordinate axes.
As a conclusion, (12.6.2) implies (12.6.1) provided that φ f* is Lebesgue-integrable.
As to

∫ [φ_a − α_a]² F(dy) → ∫ [φ − α]² F(dy),   (12.6.3)

we have

∫ [φ_a² − φ²] dF ≤ ∫ |φ_a − φ| |φ_a + φ| dF ≤ ‖φ_a − φ‖₂ ( ‖φ_a‖₂ + ‖φ‖₂ ),

with norms taken in L₂(F). By the Cauchy–Schwarz inequality,

∫ φ_a²(y) F(dy) ≤ ∫ [ (1/a) ∫ φ²(x) K((x − y)/a) dx ] F(dy) = ∫ φ²(x) (f ∗ K_a)(x) dx ≤ c ∫ φ²(x) f*(x) dx < ∞,

and

∫ (φ_a − φ)² dF ≤ ∫ [ (1/a) ∫ [φ(x) − φ(y)]² K((x − y)/a) dx ] F(dy) = ∫ [ (1/a) ∫ [φ(x) − φ(y)]² f(y) K((x − y)/a) dy ] dx.

One can show that

(1/a) ∫ [φ(x) − φ(y)]² f(y) K((x − y)/a) dy → 0

for all x up to a Lebesgue-null set, since φ f* ∈ L¹(R) and φ² f* ∈ L¹(R). Application of the dominated convergence theorem follows as before. In summary, we get

Lemma 12.6.1. Provided φ² f* ∈ L¹(R) and φ f* ∈ L¹(R), under standard regularity assumptions on K,

E(I) → ∫ φ dF

and

n Var(I) → ∫ [φ − ∫ φ dF]² dF ≡ σ²(0)

as a → 0.
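Lemma 12.6.1 can be illustrated in a case where φ_a has a closed form; the concrete choices below (Gaussian K, φ(x) = x², F standard normal) are mine, not from the text. For the Gaussian kernel, φ_a(y) = E[φ(y + aZ)] with Z standard normal, so for φ(x) = x² one gets φ_a(y) = y² + a² and hence E(I) = E X² + a² → ∫ φ dF = 1 as a → 0.

```python
import random

# For the Gaussian kernel K and phi(x) = x^2, a direct computation gives
# phi_a(y) = E[(y + a*Z)^2] = y^2 + a^2, Z standard normal.
def phi_a(y, a):
    return y * y + a * a

random.seed(5)
n, a = 100000, 0.1
xs = [random.gauss(0, 1) for _ in range(n)]
I = sum(phi_a(x, a) for x in xs) / n   # I = integral of phi_a against F_n
target = 1.0                           # integral of phi dF = E[X^2] for N(0,1)
print(abs(I - (target + a * a)))
```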
Now, consider the stochastic process indexed by a. For fixed a > 0,

n^{1/2} ∫ φ_a(y) [F_n(dy) − F(dy)] → N(0, σ²(a))   (12.6.4)

in distribution.
Setting φ_0(y) = φ(y), (12.6.4) also holds for a = 0 when φ ∈ L₂(F). If we let a depend on n in such a way that a = a_n → 0 as n → ∞, then

n^{1/2} ∫ φ_{a_n}(y) [F_n(dy) − F(dy)] → N(0, σ²(0))

in distribution.
Under an appropriate smoothness assumption on f, for K even,

∫ φ_a(y) F(dy) = ∫ φ(x) F(dx) + (a²/2) ∫∫ z² K(z) φ(x) f″(x) dz dx + O(a³).

We see that the bias ∫ φ_{a_n} dF − ∫ φ dF is of order a_n², so that it is negligible in the above central limit theorem provided n^{1/2} a_n² → 0.
Chapter 13

Conditional U-Statistics

13.1

Stone (1977) investigated general weighted regression estimates of the form

m_n(x) = Σ_{i=1}^n W_in(x) Y_i.

He could show that under very broad assumptions on the weights, for p ≥ 1,

E[|m_n(X) − m(X)|^p] → 0.   (13.1.1)

In particular, he was able to verify these assumptions for NN-type estimators. Window estimators were dealt with independently by Devroye and Wagner (1980) and Spiegelman and Sacks (1980). Devroye (1981) considered L_p-convergence at a point:

E[|m_n(x_0) − m(x_0)|^p] → 0.   (13.1.2)
There are several situations in which one is not only interested in the relation between a single X and the corresponding Y, but in the dependence structure of a few X's and a function of the associated Y's. To be precise, let h = h(Y_1, …, Y_k) be a p-times integrable function of Y_1, …, Y_k, p ≥ 1. h is called a kernel, and k is its degree. The Y's need not be real but may be random vectors in any Euclidean space. Put, for x = (x_1, …, x_k),

m(x) = E[h(Y_1, …, Y_k) | X_1 = x_1, …, X_k = x_k].

The problem of estimating m was studied in Stute (1991). As estimators we considered generalizations of the Nadaraya–Watson estimator, namely

m_n(x) = Σ_β h(Y_{i_1}, …, Y_{i_k}) Π_{j=1}^k K[(x_j − X_{i_j})/b_n] / Σ_β Π_{j=1}^k K[(x_j − X_{i_j})/b_n],   (13.1.3)

where the sums extend over all multi-indices β = (i_1, …, i_k) of pairwise distinct integers 1 ≤ i_j ≤ n. In compact notation,

m_n(x) = Σ_β W_β(x) h(Y_β).   (13.1.4)

Statistics of the type (13.1.4) may be called conditional (or local) U-statistics, because in spirit they are similar to Hoeffding's (1948) U-statistic, extending the sample mean.
The analog of (13.1.1) for conditional U-statistics has been obtained in Stute (1994). In the present paper we consider L_p-convergence at a point x_0, say, of a window estimator, i.e., of the form (13.1.3) with K = 1_{[−1,1]^d}. Extensions to more general kernels will be dealt with in Remark 13.1.2 below. The examples in Stute (1991) could also be mentioned here, but we prefer to discuss another example, which may be of independent interest. See Sections 13.1 and 13.2 for details.
In what follows ‖·‖ denotes the sup-norm on a Euclidean space. For the window estimator,

W_β(x) = Π_{j=1}^k 1{‖X_{i_j} − x_j‖ ≤ b_n} / Σ_γ Π_{j=1}^k 1{‖X_{γ_j} − x_j‖ ≤ b_n},
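A minimal sketch of the window-type conditional U-statistic for degree k = 2 follows; the toy regression data, window width and evaluation point are illustrative assumptions of mine.

```python
import random
from itertools import permutations

def cond_u_window(x, sample, h, b):
    """Window-type conditional U-statistic of degree k = 2:
    weighted average of h(Y_i, Y_j) over pairs (i, j), i != j,
    with X_i within b of x1 and X_j within b of x2."""
    x1, x2 = x
    num = den = 0.0
    for (xi, yi), (xj, yj) in permutations(sample, 2):
        w = (abs(xi - x1) <= b) * (abs(xj - x2) <= b)
        num += w * h(yi, yj)
        den += w
    return num / den if den > 0 else float("nan")

# kernel h(y1, y2) = 1{y1 <= y2}; with Y = X + small noise, the target
# m((0.2, 0.8)) = P(Y1 <= Y2 | X1 = 0.2, X2 = 0.8) should be near 1
random.seed(6)
sample = [(x, x + random.gauss(0, 0.05))
          for x in (random.uniform(0, 1) for _ in range(300))]
est = cond_u_window((0.2, 0.8), sample,
                    lambda y1, y2: 1.0 if y1 <= y2 else 0.0, b=0.1)
print(est)
```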
as n → ∞, provided

lim sup μ({x : ‖x − x_i‖ ≤ b r_2}) / μ({x : ‖x − x_i‖ ≤ b r_1}) < ∞   (13.1.5)

for 1 ≤ i ≤ k. In particular, (13.1.5) is satisfied if x_i is a μ-atom or the restriction of μ to some neighborhood of x_i is dominated by Lebesgue-measure.
13.2 Discrimination

where

p_j(x) = P(Y = j | X = x),   1 ≤ j ≤ M,

and

L* = P(g_0(X) ≠ Y) = 1 − E[ max_{1≤j≤M} p_j(X) ].

The p_j are estimated by

m_n^j(x) = Σ_{i=1}^n W_in(x) 1{Y_i = j}.

Put

ĝ_n(x) = arg max_{1≤j≤M} m_n^j(x).
The sets A_j = {(y_1, …, y_k) : h(y_1, …, y_k) = j}, 1 ≤ j ≤ M, then form a partition of the feature space. h(Y_1^0, …, Y_k^0) = j then is tantamount to (Y_1^0, …, Y_k^0) ∈ A_j, which is to be considered as one of M concurrent situations of interest. Take, for example, k = 2 and let the Y's be real-valued. We may then wonder whether Y_1^0 ≤ Y_2^0 or not. Putting

h(y_1, y_2) = 1 if y_1 ≤ y_2 and h(y_1, y_2) = 0 if y_1 > y_2,

we arrive at a discrimination problem for h(Y_1^0, Y_2^0). Given (X_1^0, X_2^0) and a training sample (X_i, Y_i), 1 ≤ i ≤ n, a decision has to be made as to whether h = 1 or 0. This example will be considered in the simulation study of Section 13.3. Others are mentioned in Stute (1994).
It is easily seen that now the Bayes-rule equals

g_0(x) = arg max_{1≤j≤M} m^j(x),

where

m^j(x) = P( h(Y_1^0, …, Y_k^0) = j | X_1^0 = x_1, …, X_k^0 = x_k ).

The Bayes-risk becomes

L* = 1 − E[ max_{1≤j≤M} m^j(X) ],

with estimates

m_n^j(x) = Σ_β W_β(x) 1{h(Y_β) = j}.

One can show that under mild assumptions on the weights one obtains Bayes-risk consistency:

P(g_n^0(X) ≠ h(Y)) → L*.

In this section we discuss the limit behavior of the probability of error given a particular training sample at hand:

L_n := P( g_n^0(X) ≠ h(Y) | X_1, Y_1, …, X_n, Y_n ).

Theorem 13.2.1. Assume b_n → 0 and n b_n^d → ∞. Then, as n → ∞,

L_n → L*   in the mean.

For the proof one writes

1 − L_n = Σ_{j=1}^M ∫_{{g_n^0 = j}} m^j(x) μ(dx_1) … μ(dx_k)

and compares it with

Σ_{j=1}^M ∫_{{g_n^0 = j}} m_n^j(x) μ(dx_1) … μ(dx_k) = ∫ max_{1≤j≤M} m_n^j(x) μ(dx_1) … μ(dx_k) → 1 − L*

in the mean.
In the most general situation, μ also will be unknown. We may then replace μ in the definition of L̂_n by the empirical measure μ_n of X_1, …, X_n. So we come up with the following estimate of 1 − L_n:

1 − L̂_n = n^{−k} Σ_{1≤i_1,…,i_k≤n} max_{1≤j≤M} m_n^j(X_{i_1}, …, X_{i_k}).

Two modifications are suggested in order to reduce the bias of L̂_n:

1. Summation should only be extended over all β = (i_1, …, i_k) with pairwise distinct i_r's.

2. Given such a multi-index β, m_n^j should be replaced by m_n^{jβ}, computed from the sample (X_1, Y_1), …, (X_n, Y_n) with (X_{i_r}, Y_{i_r}), 1 ≤ r ≤ k, deleted.

Hence we obtain

1 − L̃_n := (1 / (n(n−1) … (n−k+1))) Σ_β max_{1≤j≤M} m_n^{jβ}(X_β).
13.3 Simulation Study
In our simulation study we consider data generated from the linear regression model Y = βX + ε, in which X is uniformly distributed on the unit interval and ε is standard normal, both independent of each other. For various β's, training samples (X_i, Y_i), 1 ≤ i ≤ n, were drawn. Given two input variables X_1^0 and X_2^0 it is required to make a decision as to whether Y_1^0 exceeds Y_2^0 or not. See Example 13.2.3 with k = 2. The rule g_n^0 was applied for various bandwidths. Finally, the jackknife-corrected L̃_n was computed for each training sample. Table 13.3.2 below presents the estimated mean and variance based on 1000 replicates. For the sake of comparison the true values of L* are listed in Table 13.3.1. It becomes apparent that L* attains its maximum for β = 0 and decreases as |β| increases. This should also be intuitively clear, since for small |β| the input variables contain less information on the Y's than for large |β|'s. In the extreme, when β = 0, Y does not depend on X at all, so that X_1^0 and X_2^0 become uninformative as to our multi-input classification problem. By symmetry, L* = 0.5.
Next we present the summary statistics for sample size n = 10 (20) for some selected values of b_n and β.
To measure the performance of the rule g_n^0, it is also instructive to compare L* and L̃_n with the true error rate L_n. This is done for n = 10 and β = 1. (See Table 13.3.3.)
The true error rate is always bigger than L* (by definition of L*). It is pretty well approximated by L̃_n for 0.6 ≤ b_n ≤ 0.8. Also, L_n seems less vulnerable to an improper choice of b_n.
In Table 13.3.4 we list, for n = 10 and various β's, the bandwidths leading to an L̃_n which in the mean are closest to L*.
We see that b_n decreases as β increases. Hence for large β's efficient discrimination between the Y's is based on smaller neighborhoods of the input variables. After a moment of thought this observation becomes quite natural, since for these values of β, Y is strongly correlated with X, so that small neighborhoods contain enough information on the Y's. On the other hand, small neighborhoods have the advantage to properly divide the training sample into disjoint parts such that each part contains information which is only relevant for the corresponding pair (X^0, Y^0).
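The setup just described can be sketched in a few lines; the constants below (sample size, bandwidth, number of Monte Carlo replicates) are my own illustrative choices and much smaller than in the study itself.

```python
import random

def g_n0(x, sample, b):
    """Window discrimination rule for h(y1, y2) = 1{y1 <= y2}, M = 2:
    decide class 1 iff the window estimate of m^1(x) is at least 1/2."""
    x1, x2 = x
    num = den = 0
    for i, (xi, yi) in enumerate(sample):
        for j, (xj, yj) in enumerate(sample):
            if i != j and abs(xi - x1) <= b and abs(xj - x2) <= b:
                num += 1 if yi <= yj else 0
                den += 1
    if den == 0:
        return random.choice([0, 1])   # empty window: guess
    return 1 if num / den >= 0.5 else 0

def error_rate(beta, n=10, b=0.7, reps=200):
    """Monte Carlo estimate of L_n = P(g_n^0(X) != h(Y)) for Y = beta*X + eps."""
    errors = 0
    for _ in range(reps):
        sample = [(x, beta * x + random.gauss(0, 1))
                  for x in (random.uniform(0, 1) for _ in range(n))]
        x1, x2 = random.uniform(0, 1), random.uniform(0, 1)
        y1 = beta * x1 + random.gauss(0, 1)
        y2 = beta * x2 + random.gauss(0, 1)
        truth = 1 if y1 <= y2 else 0
        errors += g_n0((x1, x2), sample, b) != truth
    return errors / reps

random.seed(7)
err = error_rate(beta=1.0)
print(err)
```

For β = 1 and n = 10, Table 13.3.3 reports error rates around 0.45, and the sketch should land in that vicinity up to Monte Carlo noise.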
β      L*
0.0    0.500
0.1    0.491
0.2    0.481
0.3    0.472
0.4    0.463
0.5    0.453
0.6    0.444
0.7    0.435
0.8    0.426
0.9    0.417
1.0    0.408
1.3    0.383
1.5    0.366
1.8    0.343
2.0    0.328

Table 13.3.1: True values of L* for various β.
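The entries of Table 13.3.1 can be checked numerically. In this model Y_i^0 = βX_i^0 + ε_i, so Y_1^0 ≤ Y_2^0 iff ε_1 − ε_2 ≤ β(X_2^0 − X_1^0); since ε_1 − ε_2 is N(0, 2), one finds m^1(x) = Φ(β(x_2 − x_1)/√2) and hence L* = E[Φ(−β|X_1^0 − X_2^0|/√2)], where |X_1^0 − X_2^0| has density 2(1 − t) on [0, 1]. This derivation is ours, not quoted from the text; the resulting one-dimensional integral is evaluated by Simpson's rule.

```python
import math

def bayes_risk(beta, m=2000):
    """L*(beta) = integral over [0,1] of 2(1-t) * Phi(-beta*t/sqrt(2)) dt,
    evaluated with Simpson's rule (m even)."""
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    g = lambda t: 2.0 * (1.0 - t) * Phi(-beta * t / math.sqrt(2.0))
    h = 1.0 / m
    s = g(0.0) + g(1.0)
    for i in range(1, m):
        s += (4 if i % 2 else 2) * g(i * h)
    return s * h / 3.0

# compare with the tabulated values 0.500, 0.408, 0.328
print(round(bayes_risk(0.0), 3), round(bayes_risk(1.0), 3), round(bayes_risk(2.0), 3))
```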
Estimated Mean of L̃_n:
0.48 (0.48)   0.46 (0.47)   0.44 (0.46)   0.42 (0.45)
0.47 (0.48)   0.45 (0.47)   0.43 (0.46)   0.41 (0.44)
0.46 (0.47)   0.44 (0.46)   0.42 (0.44)   0.40 (0.43)
0.46   0.44   0.42   0.40   0.37   0.34
0.46   0.43   0.40   0.37   0.34   0.31

Estimated Variance of L̃_n:
0.0004 (0.00009)   0.0006 (0.0002)   0.0010 (0.0003)   0.0020 (0.0005)
0.0005 (0.0001)    0.0010 (0.0003)   0.0020 (0.0006)   0.0020 (0.0009)
0.0006 (0.0003)    0.0010 (0.0006)   0.0020 (0.001)    0.0030 (0.001)
0.0008   0.0017   0.0025   0.0030   0.0040   0.0040
0.009    0.0018   0.0026   0.0036   0.0040   0.0052

Table 13.3.2: Estimated mean and variance of L̃_n for selected values of b_n and β; n = 10, with values for n = 20 in parentheses where available.
b_n    L*       Estimated Mean of L̃_n    Estimated Mean of L_n
0.8    0.408    0.46                      0.46
0.7    0.408    0.44                      0.45
0.6    0.408    0.42                      0.45
0.5    0.408    0.40                      0.45

Table 13.3.3: Comparison of L*, L̃_n and L_n for n = 10 and β = 1.
β      b_n
0.1    0.8
0.5    0.7
1.0    0.5
1.5    0.4
2.0    0.3

Table 13.3.4: Bandwidths b_n leading to an L̃_n closest in the mean to L*, for n = 10.