
EMPIRICAL DISTRIBUTIONS

Winfried Stute
University of Giessen

Contents

1 Introduction
1.1 Counting
1.2 Diracs and Indicators
1.3 Empirical Distributions for Real-Valued Data
1.4 Some Targets
1.5 The Lorenz Curve
1.6 Gini's Index
1.7 The ROC-Curve
1.8 The Mean Residual Lifetime Function
1.9 The Total Time on Test Transform
1.10 The Product Integration Formula
1.11 Sampling Designs and Weighting

2 The Single Event Process
2.1 The Basic Process
2.2 Distribution-Free Transformations
2.3 The Uniform Case

3 Univariate Empiricals: The IID Case
3.1 Basic Facts
3.2 Finite-Dimensional Distributions
3.3 Order Statistics
3.4 Some Selected Boundary Crossing Probabilities

4 U-Statistics
4.1 Introduction
4.2 The Hajek Projection Lemma
4.3 Projection of U-Statistics
4.4 The Variance of a U-Statistic
4.5 U-Processes: A Martingale Approach

5 Statistical Functionals
5.1 Empirical Equations
5.2 Anova Decomposition of Statistical Functionals
5.3 The Jackknife Estimate of Variance
5.4 The Jackknife Estimate of Bias

6 Stochastic Inequalities
6.1 The D-K-W Bound
6.2 Binomial Tail Bounds
6.3 Oscillations of Empirical Processes
6.4 Exponential Bounds for Sums of Independent Random Variables

7 Invariance Principles
7.1 Continuity of Stochastic Processes
7.2 Gaussian Processes
7.3 Brownian Motion
7.4 Donsker's Invariance Principles
7.5 More Invariance Principles
7.6 Parameter Empirical Processes (Regression)

8 Empirical Measures: A Dynamic Approach
8.1 The Single-Event Process
8.2 Martingales and the Doob-Meyer Decomposition
8.3 The Doob-Meyer Decomposition of the Single-Event Process
8.4 The Empirical Distribution Function
8.5 The Predictable Quadratic Variation
8.6 Some Stochastic Differential Equations
8.7 Stochastic Exponentials

9 Introduction To Survival Analysis
9.1 Right Censorship: The Kaplan-Meier Estimator
9.2 Martingale Structures under Censorship
9.3 Confidence Bands for F under Right Censorship
9.4 Rank Tests for Censored Data
9.5 Parametric Modelling in Survival Analysis

10 Time To Event Data
10.1 Sampling Designs in Survival Analysis
10.2 Nonparametric Estimation in Counting Processes
10.3 Nonparametric Testing in Counting Processes
10.4 Maximum Likelihood Procedures
10.5 Right-Truncation

11 The Multivariate Case
11.1 Introduction
11.2 Identification of Defective Survival Functions
11.3 The Multivariate Kaplan-Meier Estimator
11.4 Efficiency of the Estimator
11.5 Simulation Results

12 Nonparametric Curve Estimation
12.1 Nonparametric Density Estimation
12.2 Nonparametric Regression: Stochastic Design
12.3 Consistent Nonparametric Regression
12.4 Nearest-Neighbor Regression Estimators
12.5 Nonparametric Classification
12.6 Smoothed Empirical Integrals

13 Conditional U-Statistics
13.1 Introduction and Main Result
13.2 Discrimination
13.3 Simulation Study


Chapter 1

Introduction

1.1 Counting

Empirical distributions have been designed to provide fundamental mathematical and graphical tools in the context of data analysis. To begin with, suppose that n customers are asked to name their favorite car from a list a, b, c, d, e. Denote by X_j the response of the j-th customer. A quantity of general interest is the number of customers in favor of a, say. This number is the outcome of a procedure which consists of n steps and which adds one to the previous value whenever another positive vote for a is obtained: counting.
To formalize things a little bit, let S = {a, b, c, d, e} be the sample space of all possible outcomes. Set A = {a}, a set consisting of only one element. Let 1_A be the indicator function associated with A, i.e., the function defined on S attaining the value 1 on A and 0 otherwise:

$$1_A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A \end{cases} \qquad (1.1.1)$$
In our example, 1_A(X_j) equals 1 if the vote of the j-th customer is in favor of a and 0 otherwise. The absolute number of customers voting for a of course equals

$$\sum_{j=1}^{n} 1_A(X_j). \qquad (1.1.2)$$
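The counting scheme in (1.1.2) can be sketched in a few lines of code; the vote list below is invented for illustration.

```python
# A minimal sketch of (1.1.2): the number of votes for "a" equals the sum of
# the indicator values 1_A(X_j) over the sample. The responses are invented.
votes = ["a", "b", "a", "c", "e", "a", "d", "b"]  # responses X_1, ..., X_n

def indicator(A):
    """Return the indicator function 1_A of the set A."""
    return lambda x: 1 if x in A else 0

one_A = indicator({"a"})
count_a = sum(one_A(x) for x in votes)  # sum over j of 1_A(X_j)
print(count_a)  # 3
```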

Needless to say, (1.1.2) also applies to other sets B, C, ... and other X's. As another example, consider the number of days per year, at a specific location, on which the maximum temperature exceeds 30°C. The set of interest then is the interval A = [30, ∞), together with its complement Ā = [−273, 30). Quantity (1.1.2) may then be compared with similar results from previous years and is helpful in analyzing possible changes in the local climate. In other situations, the X's may take on multivariate values as well. We may be interested in the income of a person (per year) together with the amount of money spent on travelling. When we split the income scale into finitely many groups A_1, ..., A_k representing classes from very poor to very rich, and likewise partition the scale for travel expenses into groups B_1, ..., B_m, we may form new groups A_r × B_s, 1 ≤ r ≤ k, 1 ≤ s ≤ m, with

$$1_{A_r \times B_s}(X_j) = 1$$

just indicating that the j-th person belongs to the r-th group as far as income and to the s-th group as far as travel expenses are concerned.
More generally, we may split a sample space S into finitely many mutually disjoint sets (groups, cells) C_1, ..., C_m, i.e.,

$$C_i \cap C_j = \emptyset \quad \text{for } i \neq j.$$

We call π = {C_1, ..., C_m} a partition of S if $\bigcup_{i=1}^{m} C_i = S$.

E.g., if one is interested in the age distribution of a population, we may put

$$C_1 = [0, 9], \; C_2 = [10, 19], \ldots, C_{10} = [90, 99], \; C_{11} = [100, \infty).$$

A person aged 25 thus is assigned to C_3. In this example, if one has the impression that the grouping is too rough so that relevant information is lost, one may feel free to choose a finer partition. In other situations, the relevant sets, say B_1, ..., B_m, may be increasing: B_1 ⊂ B_2 ⊂ ... ⊂ B_m. When S is the real line and each B_i = (−∞, t_i] is an extended interval, then the B_i's are increasing if and only if t_1 ≤ t_2 ≤ ... ≤ t_m. A (finite) sequence of B's is called decreasing iff B_1 ⊃ ... ⊃ B_m. If a sequence of B's is increasing or decreasing, we call the B's linearly ordered. Note that in the case of extended intervals, we may always arrange the B's in such a way that they are increasing and hence linearly ordered. Finally, to each sequence B_1 ⊂ ... ⊂ B_m we may attach the natural partition π = {C_1, ..., C_{m+1}} defined by

$$C_1 = B_1, \; C_2 = B_2 \setminus B_1, \ldots, C_m = B_m \setminus B_{m-1}, \; C_{m+1} = \bar{B}_m.$$

We thus see that absolute frequencies (1.1.2) may be computed in general situations, whether the data are on a nominal, ordinal or metric scale, or whether they are univariate or multivariate.
Obviously, (1.1.2) may become very large with n. A quantity which is better suited to discriminate between the fractions of A, B, ... in the whole population is the so-called relative frequency. For a set (or group) A this becomes

$$\mu_n(A) = \frac{1}{n} \sum_{j=1}^{n} 1_A(X_j). \qquad (1.1.3)$$

This quantity is called the empirical distribution (measure) of A.

Figure 1.1.1: Empirical measure of a set (in the depicted example, μ_n(A) = 2/8)


If π = {C_1, ..., C_m} is a partition of the sample space S, then

$$\sum_{i=1}^{m} \mu_n(C_i) = \mu_n\Big(\bigcup_{i=1}^{m} C_i\Big) = \mu_n(S) = 1.$$

Hence the empirical frequencies (μ_n(C_1), ..., μ_n(C_m)) attain their values in the set of m-tuples (p_1, ..., p_m) with

$$0 \le p_i \le 1 \quad \text{and} \quad \sum_{i=1}^{m} p_i = 1.$$

If the sets B_1, ..., B_m are increasing, then each X_j contained in B_i is also contained in B_{i+1}. Hence

$$0 \le \mu_n(B_1) \le \mu_n(B_2) \le \ldots \le \mu_n(B_m) \le 1.$$


For data on a nominal scale we may represent, e.g., group frequencies through the following diagram:

Figure 1.1.2: Empirical frequencies for data on a nominal scale

When the data are on an ordinal scale, a graphical representation should take care of this fact in that the best group, e.g., could be placed at the left end. Below we present the empirical measures (or fractions) of final grades of the master students at the Mathematical Institute of the University of Giessen in one of the previous years.

Figure 1.1.3: Empirical frequencies for data on an ordinal scale


When the data are on a metric scale, grouping of the data into finitely many sets may lead to a considerable loss of information. In such a situation we may consider a much richer class of sets. For example, let {A} be the class of all extended intervals (−∞, t] with t ∈ R. We then come up with

$$F_n(t) \equiv \mu_n(-\infty, t] = \frac{1}{n} \sum_{j=1}^{n} 1_{\{X_j \le t\}}, \qquad (1.1.4)$$

the empirical distribution function of X_1, ..., X_n. Thus F_n(t) denotes the relative number of data less than or equal to t. For two reals s and t we either have s ≤ t or t < s. In the first case (−∞, s] is contained in (−∞, t], so that by monotonicity of μ_n

$$F_n(s) \le F_n(t) \quad \text{for } s \le t,$$

i.e., F_n is non-decreasing. If we let s → −∞ and t → ∞, we obtain in the limit

$$F_n(-\infty) = 0 \quad \text{and} \quad F_n(\infty) = 1. \qquad (1.1.5)$$

The first statement just says that among real data there are no X_j's which are less than or equal to −∞, while, for the second, all are less than or equal to +∞. Actually, we already have F_n(x) = 0 for all x below the smallest datum and F_n(x) = 1 for all x at or above the largest datum.
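Definition (1.1.4) translates directly into code. The following sketch, with invented data, also illustrates that F_n vanishes below the smallest datum and equals one from the largest datum on.

```python
# Sketch of the empirical distribution function (1.1.4): F_n(t) is the
# fraction of observations less than or equal to t. The data are invented.
def F_n(data, t):
    return sum(1 for x in data if x <= t) / len(data)

data = [3.1, 1.4, 2.7, 1.4, 5.0]
print(F_n(data, 2.7))  # 0.6  (three of five values are <= 2.7)
print(F_n(data, 0.0))  # 0.0  (below the smallest datum)
print(F_n(data, 5.0))  # 1.0  (at the largest datum)
```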
Sometimes applicants of statistical methodology prefer to look at

$$1 - F_n(t) = \frac{1}{n} \sum_{j=1}^{n} 1_{\{X_j > t\}}, \qquad t \in \mathbb{R},$$

which is non-increasing in t. For example, in a medical context, the X_j's might represent the times elapsed from surgery until death. Then 1 − F_n(t) equals the fraction of patients who survived at least t time units. Therefore, in this context, 1 − F_n is called the empirical survival function.
For multivariate data X_1, ..., X_n in the Euclidean space R^k we have many possibilities to choose among sets A. For example, we could take for A so-called quadrants

$$A_t = \prod_{l=1}^{k} (-\infty, t_l] = \{x = (x_1, \ldots, x_k)^T : x_l \le t_l \text{ for all } 1 \le l \le k\},$$

where t = (t_1, ..., t_k)^T is a given vector determining the quadrant. Here and in the following, the superscript T denotes transposition, i.e., if not stated otherwise, t etc. denote column vectors. The multivariate empirical distribution function is then defined as

$$F_n(t) = \mu_n(A_t) = \frac{1}{n} \sum_{j=1}^{n} 1_{\{X_{jl} \le t_l \text{ for all } 1 \le l \le k\}}.$$

Here X_{jl} denotes the l-th component of X_j. For ease of notation we write X_j ≤ t iff the inequality holds coordinatewise. Then F_n becomes

$$F_n(t) = \frac{1}{n} \sum_{j=1}^{n} 1_{\{X_j \le t\}}.$$

Other possible choices for A are halfspaces of the form

$$A = \{x : \langle x, \beta \rangle > c\},$$

where β ∈ R^k and c ∈ R are fixed or may vary as t before, and

$$\langle x, \beta \rangle = \sum_{i=1}^{k} x_i \beta_i$$

denotes the scalar product in R^k.

Figure 1.1.4: A data set split into two pieces by the line 4x_1 + 2x_2 = 6

In general, the choice of A will depend on the kind of application one has in mind. Therefore, in the beginning, we leave it open what the A's are.

Also the sample space S may be quite general. Later in this monograph, most of our analysis will first focus on S = R, i.e., on real-valued data, with the class of A's being the extended intervals (−∞, t]. After that we shall study multivariate problems. Our final extension will be concerned with the important case when the available information comes through functions over time or space. This will lead us to the study of so-called functional data.

1.2 Diracs and Indicators

The indicator function is intimately related with a so-called point mass (or Dirac measure). For each x ∈ S, define

$$\delta_x(A) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A \end{cases} \qquad (1.2.1)$$

Figure 1.2.1: Evaluation of a Dirac measure (here δ_x(A_1) = 0 and δ_x(A_2) = 1)


If we compare (1.2.1) with (1.1.1), we immediately obtain

$$\delta_x(A) = 1_A(x). \qquad (1.2.2)$$

Note that for δ_x the point x is kept fixed with A varying among a class of sets, while for the indicator function, the set A is fixed with x varying in the sample space. It is easy to see that each δ_x is a (probability) measure. In fact,

$$0 \le \delta_x(A) \le 1 \quad \text{with} \quad \delta_x(S) = 1$$

and

$$\delta_x\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} \delta_x(A_i)$$


for any mutually disjoint subsets A_i of S. One may check this equality separately for the two cases that x is contained in none of the A_i's and, alternatively, that x is contained in precisely one of them. In the first case both sides equal zero, while in the latter we have one on both sides. Within this framework, (1.1.3) becomes

$$\mu_n(A) = \frac{1}{n} \sum_{j=1}^{n} \delta_{X_j}(A) \qquad (1.2.3)$$

or, omitting sets,

$$\mu_n = \frac{1}{n} \sum_{j=1}^{n} \delta_{X_j}. \qquad (1.2.4)$$

In other words, the empirical measure is a sum of equally weighted Diracs.

For many statistical procedures, counting data is not sufficient. Rather we would like to extend the scope of possible applications of empirical distributions by introducing empirical integrals. Starting again with a point mass δ_x, we have

$$\delta_x(A) = \int 1_A \, d\delta_x = 1_A(x). \qquad (1.2.5)$$

The first equation just expresses the fact that the integral of an indicator function always equals the measure of the underlying set. If we put φ = 1_A, the second equation may be rewritten to become

$$\int \varphi \, d\delta_x = \varphi(x). \qquad (1.2.6)$$

The last equality is simple but fundamental and may be easily extended to other φ's, not necessarily of indicator type. Since μ_n is a linear combination of the point masses δ_{X_j}, 1 ≤ j ≤ n, we thus obtain

$$\int \varphi \, d\mu_n = \frac{1}{n} \sum_{j=1}^{n} \varphi(X_j), \qquad (1.2.7)$$

i.e., an empirical integral may be most easily evaluated by just averaging the φ-values at the data.
When the data are real-valued, the choice of φ(x) = x yields a representation of the sample mean as an empirical integral:

$$\int x \, \mu_n(dx) = \frac{1}{n} \sum_{j=1}^{n} X_j \equiv \bar{X}_n.$$


Similarly, if we set φ(x) = x^p, we obtain

$$\int x^p \, \mu_n(dx) = \frac{1}{n} \sum_{j=1}^{n} X_j^p,$$

the empirical p-th moment.


The above identities easily extend to complex-valued functions. E.g., if we set

$$\varphi_t(x) = \exp[itx],$$

we obtain the so-called empirical characteristic function (Fourier transform):

$$\hat{\varphi}_n(t) \equiv \int \varphi_t \, d\mu_n = \frac{1}{n} \sum_{j=1}^{n} \exp[itX_j].$$

Here i denotes the complex unit satisfying i² = −1. When we take the real-valued version

$$\varphi_\lambda^0(x) = \exp[-\lambda x],$$

we come up with the empirical Laplace transform:

$$\hat{\varphi}_n^0(\lambda) \equiv \int \varphi_\lambda^0 \, d\mu_n = \frac{1}{n} \sum_{j=1}^{n} \exp[-\lambda X_j].$$

Needless to say, if we set φ = 1_{(−∞,t]}, then

$$\int \varphi \, d\mu_n = F_n(t).$$

To present an example for bivariate data X_j = (X_{j1}, X_{j2})^T, 1 ≤ j ≤ n, the choice of

$$\varphi(x_1, x_2) = x_1 x_2$$

leads to

$$\int \varphi \, d\mu_n = \frac{1}{n} \sum_{j=1}^{n} X_{j1} X_{j2},$$

an empirical mixed moment.


This small list of examples already indicates that many traditional quantities known from elementary statistical analysis may be expressed in terms of μ_n.
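The recipe (1.2.7) behind all of these examples can be sketched in one routine; the data below are invented.

```python
# Sketch of (1.2.7): an empirical integral is just the average of the
# phi-values at the data. The same routine yields the sample mean, the p-th
# moment, and (with a complex-valued phi) the empirical characteristic function.
import cmath

def empirical_integral(phi, data):
    return sum(phi(x) for x in data) / len(data)

data = [1.0, 2.0, 3.0, 4.0]  # invented observations
mean = empirical_integral(lambda x: x, data)        # sample mean
m2 = empirical_integral(lambda x: x ** 2, data)     # empirical 2nd moment
cf = empirical_integral(lambda x: cmath.exp(1j * 0.5 * x), data)  # char. fn at t = 0.5
print(mean, m2)  # 2.5 7.5
```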


When the data are real-valued or multivariate, μ_n is uniquely determined through F_n. We therefore prefer to replace μ_n with F_n and write

$$\int \varphi \, d\mu_n \equiv \int \varphi \, dF_n.$$
For a deeper analysis, the class of quantities which somehow can be computed from the available data through μ_n or F_n, respectively, needs to be enlarged. Going back to (1.2.3), it is tempting to look also at products of μ_n. For example, the product measure μ_n ⊗ μ_n is uniquely determined through its values on rectangles A_1 × A_2 and is given by

$$\mu_n \otimes \mu_n(A_1 \times A_2) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \delta_{X_i}(A_1)\,\delta_{X_j}(A_2) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} 1_{\{X_i \in A_1,\, X_j \in A_2\}}.$$

This expression shows that μ_n ⊗ μ_n is a new (artificial) measure on the sample space S × S giving mass n^{-2} to each of the points (X_i, X_j), 1 ≤ i, j ≤ n. The representation corresponding to (1.2.4) becomes

$$\mu_n \otimes \mu_n = n^{-2} \sum_{i=1}^{n} \sum_{j=1}^{n} \delta_{(X_i, X_j)}.$$

Since (1.2.6) holds true for a general Dirac, we get, for an arbitrary function φ = φ(x, y) of two variables,

$$\int \varphi \, d(\mu_n \otimes \mu_n) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \varphi(X_i, X_j),$$

a so-called V-statistic. A closely related version is

$$U_n = \frac{1}{n(n-1)} \sum_{\substack{i,j=1 \\ i \ne j}}^{n} \varphi(X_i, X_j).$$

These statistics are called U-statistics (of degree two).


For example, if the X_j's are real-valued and if we set φ(x, y) = ½(x − y)², we obtain the sample variance

$$s_n^2 = \frac{1}{n-1} \sum_{j=1}^{n} (X_j - \bar{X}_n)^2.$$


Actually,

$$\frac{1}{2n(n-1)} \sum_{\substack{i,j=1 \\ i \ne j}}^{n} (X_i - X_j)^2 = \frac{1}{2n(n-1)} \sum_{i=1}^{n} \sum_{j=1}^{n} (X_i - X_j)^2$$
$$= \frac{1}{n-1} \sum_{i=1}^{n} X_i^2 - \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j=1}^{n} X_i X_j = \frac{1}{n-1} \sum_{i=1}^{n} X_i^2 - \frac{n}{n-1} \bar{X}_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2.$$
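The identity above is easy to verify numerically; the observations below are invented, and the check uses nothing beyond the formulas in the text.

```python
# Numerical check: the U-statistic of degree two with kernel
# phi(x, y) = (x - y)^2 / 2 equals the sample variance with divisor n - 1.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # invented observations
n = len(data)

u_stat = sum((x - y) ** 2 / 2
             for i, x in enumerate(data)
             for j, y in enumerate(data) if i != j) / (n * (n - 1))

mean = sum(data) / n
s2 = sum((x - mean) ** 2 for x in data) / (n - 1)
print(abs(u_stat - s2) < 1e-12)  # True: both equal the sample variance
```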

Our discussion may have already indicated that point measures or Diracs are just the cornerstones of a hierarchy of more or less sophisticated statistical tools. After a moment of thought this is not surprising, because in a statistical framework the relevant information is often obtained only through a set of finitely many (discrete) data, so that it is only natural to put the weights there.

So far we have restricted ourselves to equal weighting. As it will turn out, there will be many data situations in which we are led to masses which are not equal. For further discussion, we rewrite (1.2.4) as

$$\mu_n = \sum_{j=1}^{n} \frac{1}{n} \, \delta_{X_j}. \qquad (1.2.8)$$

From a purely mathematical point of view this is trivial if not ridiculous. We prefer (1.2.8) over (1.2.4) whenever we want to point out that 1/n is a weight attached to X_j. When the weight W_{jn} given to X_j depends also on j, the resulting (generalized) empirical distribution becomes

$$\mu_n = \sum_{j=1}^{n} W_{jn} \, \delta_{X_j}. \qquad (1.2.9)$$

In such a situation it will not be possible to keep the weights separated from the X_j's, so that (1.2.8) is the right way to look at μ_n. The empirical integrals w.r.t. μ_n in (1.2.9) equal

$$\int \varphi \, d\mu_n = \sum_{j=1}^{n} W_{jn} \, \varphi(X_j). \qquad (1.2.10)$$


An introductory discussion of examples where it may be necessary to replace the standard weights 1/n by more sophisticated W_{jn}'s will appear in Section 1.5. Generally speaking, a proper choice of W_{jn} will depend on the problem one has in mind as well as on the structure of the data. In many situations the W_{jn} will depend on the observations and are therefore random. Though the W_{jn}'s may be very complicated, there will be principles which may guide us how to find and analyze them. It is the aim of this monograph to develop such principles and to study the resulting empirical quantities.

Our final extension of an empirical measure will incorporate weights depending also on a variable x, say. In other words, the weights W_{jn} are functions rather than constants, and the associated empirical distribution becomes

$$\mu_n = \sum_{j=1}^{n} W_{jn}(x) \, \delta_{X_j}. \qquad (1.2.11)$$

The x's need not necessarily be taken from the same S but may vary in another set. More precisely, we have

$$\mu_n(x, A) = \sum_{j=1}^{n} W_{jn}(x) \, 1_{\{X_j \in A\}} \qquad (1.2.12)$$

and

$$\int \varphi(y) \, \mu_n(x, dy) = \sum_{j=1}^{n} W_{jn}(x) \, \varphi(X_j). \qquad (1.2.13)$$

Adopting terminology from probability theory, we call μ_n(x, A) an empirical kernel. The integrals (1.2.13) computed w.r.t. these μ_n are functions in x. As one may expect, these functions are candidates for estimators of functions rather than of parameters like the mean or variance of a population. The kernel μ_n(x, A) is called an empirical probability kernel iff for all x

$$W_{jn}(x) \ge 0 \quad \text{and} \quad \sum_{j=1}^{n} W_{jn}(x) = 1.$$

In some situations the sum of weights may be less than or equal to one:

$$0 \le \sum_{j=1}^{n} W_{jn}(x) \le 1,$$


in which case we call μ_n an empirical sub-probability kernel or just an empirical sub-distribution. When some of the weights are negative, μ_n is called a signed empirical distribution (resp. kernel).
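To make the notion of an empirical probability kernel concrete, here is a hypothetical sketch in which the weights W_{jn}(x) are built from a Gaussian kernel and normalized to sum to one for every x. The names K and h, and the data, are illustrative and not from the text.

```python
import math

# Hypothetical empirical probability kernel in the sense of (1.2.12):
# Gaussian weights, normalized so that sum_j W_jn(x) = 1 for each x.
def weights(x, data, h=1.0):
    K = lambda u: math.exp(-u * u / 2.0)   # Gaussian kernel (illustrative choice)
    raw = [K((x - xj) / h) for xj in data]
    s = sum(raw)
    return [r / s for r in raw]

data = [0.0, 1.0, 2.5, 4.0]                # invented observations
w = weights(1.2, data)
mu_xA = sum(wj for wj, xj in zip(w, data) if xj <= 2.0)  # mu_n(x, (-inf, 2])
print(abs(sum(w) - 1.0) < 1e-12, 0.0 <= mu_xA <= 1.0)    # True True
```

Since the weights are non-negative and sum to one, μ_n(x, ·) is a probability measure for each x, exactly as required of an empirical probability kernel.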

1.3 Empirical Distributions for Real-Valued Data

In this section we present a preliminary discussion of F_n with standard weights 1/n, when the observations are real-valued. In this case we may arrange the data X_1, ..., X_n in increasing order. We then come up with the so-called order statistics

$$X_{1:n} \le \ldots \le X_{n:n}.$$

Sometimes the smallest and the largest datum, X_{1:n} and X_{n:n}, are called extreme order statistics. When the data are pairwise distinct, i.e., if

$$X_{1:n} < \ldots < X_{n:n},$$

there are no ties.
From the definition of F_n it follows that

- F_n is non-decreasing,
- F_n(x) = 0 for all x < X_{1:n} and F_n(x) = 1 for all x ≥ X_{n:n},
- F_n is constant between successive (distinct) order statistics,
- F_n is continuous from the right, i.e., for each x ∈ R: $F_n(x) = \lim_{y \downarrow x} F_n(y)$,
- F_n has left-hand limits: $F_n(x-) = \lim_{y \uparrow x} F_n(y)$ exists,
- F_n has discontinuities only at the order statistics X_{j:n}.


The jump size

$$F_n\{x\} = F_n(x) - F_n(x-)$$

vanishes for all x outside the data set. When x = X_j for some j, then

$$F_n\{X_j\} = \frac{d_j}{n},$$

where d_j is the number of data with the same value as X_j. When no ties are present, then d_j = 1. In this case, F_n has exactly n jumps with jump size 1/n. Hence for large n, F_n is a function with many discontinuities but small jump heights.

Figure 1.3.1: Example of an empirical distribution function for n = 100 observations
Note also that for the computation of F_n it is enough to know the set of ordered data. To recover the value of X_j from the order statistics we also need the position of X_j within the sample. This leads us to the rank of X_j.

Definition 1.3.1. Assume that there are no ties. If X_j = X_{i:n}, then the index i is called the rank R_j of X_j.


In a particular data situation X_j may denote the result of the j-th experiment or the response of the j-th person in a medical or socio-demographic survey. As another possibility the index j may refer to time. For example, in a financial context, X_j may denote the price of a stock on day j at the closing of the session. In each case, and contrary to X_{i:n}, the index j needs to be chosen before the experiment is run.

Example 1.3.2. In this example we record the times (in seconds) in the 100m final (men) at the Olympic Games 2004 in Athens. X_j corresponds to the time of athlete number j in the alphabetic order as outlined below:

Collins    X_1 = 10.00
Crawford   X_2 =  9.89
Gatlin     X_3 =  9.85
Greene     X_4 =  9.87
Obikwelu   X_5 =  9.86
Powell     X_6 =  9.94
Thompson   X_7 = 10.10
Zakari     X_8 = DNF

From this we see that X_{1:n} = X_3 = 9.85 with R_3 = 1, i.e., the winner of the gold medal was Gatlin. By the definition of ranks, we have

$$X_j = X_{R_j:n}. \qquad (1.3.1)$$

Hence the original data X_1, ..., X_n can be reconstructed from the set of order statistics and the ranks. In other words, ranks and order statistics contain the same information as the original (unordered) sample. The empirical distribution function provides a convenient tool to connect and express these quantities. First, we have

$$R_j = n F_n(X_j). \qquad (1.3.2)$$

This means, to compute the ranks we need to evaluate F_n at very special t's, namely the (original) data. Secondly, when we compute F_n at the ordered data, we obtain a fixed value:

$$\frac{i}{n} = F_n(X_{i:n}). \qquad (1.3.3)$$
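Relation (1.3.2) can be checked directly on the first four times of Example 1.3.2 (Collins, Crawford, Gatlin, Greene), which are free of ties.

```python
# Check of (1.3.2): with no ties, R_j = n * F_n(X_j). The data are the first
# four times of Example 1.3.2.
data = [10.00, 9.89, 9.85, 9.87]
n = len(data)

def F_n(t):
    return sum(1 for x in data if x <= t) / n

ranks = [round(n * F_n(xj)) for xj in data]
print(ranks)  # [4, 3, 1, 2]: Gatlin's time 9.85 has rank 1
```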

Conversely, if we fix an arbitrary 0 < u ≤ 1, then we find an x ∈ R such that

$$F_n(x) = u$$

only when u equals i/n for some 1 ≤ i ≤ n. Therefore, to get some kind of inverse function for F_n, we have to adopt a definition which at first sight looks a little strange.

Definition 1.3.3. Let F_n be the standard empirical distribution function. Then the associated empirical quantile function F_n^{-1} is defined on the interval 0 < u ≤ 1 and is given by

$$F_n^{-1}(u) = \inf\{t : F_n(t) \ge u\}.$$

This definition also applies to F_n's with other weights W_{jn}(x). For standard weights, we may represent the order statistics in terms of F_n^{-1}, namely

$$X_{i:n} = F_n^{-1}\Big(\frac{i}{n}\Big). \qquad (1.3.4)$$

Also the following properties of F_n and F_n^{-1} are readily checked:

$$F_n^{-1}(u) = X_{i:n} \quad \text{for } \frac{i-1}{n} < u \le \frac{i}{n} \text{ and } 1 \le i \le n \qquad (1.3.5)$$

and

$$F_n\big(F_n^{-1}(u)\big) \ge u \quad \text{for } 0 < u \le 1, \qquad (1.3.6)$$

with equality in (1.3.6) only when u = i/n, 1 ≤ i ≤ n. Note that, by (1.3.5), F_n^{-1} is left-hand continuous.

The u-quantile F_n^{-1}(u) divides the sample into the lower 100·u percent and the upper 100·(1 − u) percent of the data. F_n^{-1}(1/2) is the sample (or empirical) median, a popular parameter for the central location of a data set. Other quantiles which have found some interest are the lower and upper quartiles, F_n^{-1}(1/4) and F_n^{-1}(3/4). The interquartile range

$$F_n^{-1}(3/4) - F_n^{-1}(1/4) \qquad (1.3.7)$$

is a convenient means to measure the spread of the central part of the data.
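Definition 1.3.3 together with (1.3.5) suggests a direct implementation via the order statistics; the data below are invented.

```python
import math

# Sketch of Definition 1.3.3: F_n^{-1}(u) = inf{t : F_n(t) >= u}. With
# standard weights this equals X_{i:n} for (i-1)/n < u <= i/n, as in (1.3.5).
def quantile(data, u):
    assert 0.0 < u <= 1.0
    xs = sorted(data)             # order statistics X_{1:n} <= ... <= X_{n:n}
    i = math.ceil(len(xs) * u)    # smallest i with i/n >= u
    return xs[i - 1]

data = [3.0, 1.0, 4.0, 1.5, 9.0, 2.6, 5.0, 3.5]    # invented observations
median = quantile(data, 0.5)                       # sample median F_n^{-1}(1/2)
iqr = quantile(data, 0.75) - quantile(data, 0.25)  # interquartile range (1.3.7)
print(median, iqr)  # 3.0 2.5
```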

1.4 Some Targets

So far we have introduced and discussed several elementary quantities related to F_n. In this section we formalize their mathematical background. This enables us to express and view our statistics as estimators of various targets. In particular, we provide the mathematical framework for the X_j's. Again we first restrict ourselves to real-valued X's. In mathematical terms, the X_j are random variables defined on a probability space (Ω, 𝒜, P). For the time being we assume that all X's have the same distribution function (d.f.)

$$P(\omega : X_j(\omega) \le t) \equiv F(t) \qquad (X_j \sim F).$$

The random variables X_j are assumed to be measurable w.r.t. the σ-algebra 𝒜, guaranteeing that the events {X_j ≤ t} belong to 𝒜. The quantity F(t) equals the probability that in a future draw from the same population, the attained value does not exceed t. Typically, F is unknown and needs to be estimated from a set of data, often being the only source of available information.
The function F has the following properties, shared by each d.f. G:

(i) G is non-decreasing
(ii) G is right-hand continuous with left-hand limits
(iii) $\lim_{t \to -\infty} G(t) = 0$ and $\lim_{t \to \infty} G(t) = 1$

For G = F_n, conditions (i)-(iii) have been discussed in Section 1.3. For other G's like G = F, (i)-(iii) follow from elementary properties of the underlying probability measure P. The notion of a quantile function also extends from F_n to arbitrary d.f.'s G:

$$G^{-1}(u) = \inf\{t \in \mathbb{R} : G(t) \ge u\}, \qquad 0 < u \le 1. \qquad (1.4.1)$$

F^{-1} constitutes the true quantile function. If G(t) < 1 for all t ∈ R, then G^{-1}(1) = ∞. Generally, G^{-1}(1) equals the smallest upper bound for the support of G. Inequality (1.3.6) holds for a general G. Note that G(G^{-1}(u)) = u if G is continuous.
At this point of the discussion it is already useful to look at F and F_n as two points on the string of all distributions, one (G = F) representing the whole but unknown population and the other one (G = F_n) revealing some, though limited, information about it through X. With this in mind we may consider all quantities discussed in the previous sections also for G = F.
A functional T, which maps G into a real number or a vector T(G), is called a statistical functional. A simple example of a linear functional is defined, for a given φ, as

$$T(G) = \int \varphi(x) \, G(dx),$$

provided the integral exists. The functional T may also be evaluated at measures which are not necessarily probability measures. T is called linear because for any two G_1, G_2 for which T is well-defined and any real numbers a_1 and a_2 we have

$$T(a_1 G_1 + a_2 G_2) = a_1 T(G_1) + a_2 T(G_2).$$

While

$$T(F) = \int \varphi(x) \, F(dx)$$

denotes the theoretical quantity of interest,

$$T(F_n) = \int \varphi(x) \, F_n(dx)$$

equals the estimator obtained by replacing F with F_n. Furthermore, by transformation of integrals,

$$T(F) = \mathbb{E}[\varphi(X)], \qquad X \sim F,$$

while

$$T(F_n) = \mathbb{E}[\varphi(X^*)], \qquad X^* \sim F_n.$$

This (very) simple example already reveals the possibility that we have more or less two options to represent quantities of interest:

(i) a representation in terms of random variables
(ii) a representation in terms of distributions


No representation is superior to the other, but each has its own value. A representation of expectation and variance, e.g., through variables still exhibits that originally these quantities were attached to variables describing some particular random phenomena. On the other hand, the representation through F gives us a clue how to estimate the target, namely just by replacing F with F_n (plug-in method).
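The plug-in idea can be illustrated with simulated data (an assumption for this sketch, not part of the text): for X ~ Exp(1) and φ(x) = e^{-x}, the target is T(F) = E e^{-X} = 1/2, and T(F_n) is the corresponding sample average.

```python
import math
import random

# Illustration of the plug-in method with simulated (invented) data:
# target  T(F)  = integral of exp(-x) dF = E e^{-X} = 1/2 for X ~ Exp(1);
# estimate T(F_n) = average of exp(-X_j) over the sample.
random.seed(0)
data = [random.expovariate(1.0) for _ in range(10_000)]

T_Fn = sum(math.exp(-x) for x in data) / len(data)  # plug-in estimate T(F_n)
T_F = 0.5                                           # true target T(F)
print(abs(T_Fn - T_F) < 0.05)  # True: the estimate is close to 1/2
```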
The V-statistic

$$V_n = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \varphi(X_i, X_j) = \iint \varphi(x, y) \, F_n(dx) \, F_n(dy)$$

equals T(F_n), where now

$$T(G) = \iint \varphi(x, y) \, G(dx) \, G(dy)$$

is a so-called V-functional of degree two. The function φ is called the kernel associated with T.
We now list some targets which may be of general interest. This list of course is not complete and is intended only to give a rough idea of what could be the subject of research. In each case, we assume without further mentioning that all integrals exist at G = F. In the following we assume that X ∼ F.
Example 1.4.1. The choice of φ(x) = x^k leads to the k-th moment of X,

$$\mathbb{E}X^k = \int x^k \, F(dx).$$

The empirical estimator becomes

$$\int x^k \, F_n(dx) = \frac{1}{n} \sum_{j=1}^{n} X_j^k.$$

Example 1.4.2. If we set φ_t(x) = 1_{{x ≤ t}}, we are led to

$$\int \varphi_t(x) \, F(dx) = F(t),$$

respectively

$$\int \varphi_t(x) \, F_n(dx) = F_n(t).$$

Note that if we consider the collection {φ_t} of all φ_t's, we obtain a stochastic process or random function indexed by t ∈ R.


Example 1.4.3. The family of functions φ_λ(x) = exp[−λx] leads to the Laplace transform of X (resp. F):

$$\int \exp[-\lambda x] \, F(dx) = \mathbb{E}e^{-\lambda X},$$

along with its estimator

$$\int \exp[-\lambda x] \, F_n(dx) = \frac{1}{n} \sum_{j=1}^{n} \exp[-\lambda X_j],$$

the empirical Laplace transform.


Example 1.4.4. The complex-valued version φ_t(x) = exp[itx] leads to the Fourier transform of X (resp. F):

$$\int \exp[itx] \, F(dx) = \mathbb{E}e^{itX},$$

along with its estimator

$$\int \exp[itx] \, F_n(dx) = \frac{1}{n} \sum_{j=1}^{n} \exp[itX_j].$$

Here again, i is the complex unit. The Fourier transform uniquely determines the distribution of a random variable X. Hence the empirical version is a candidate to detect distributional features of X.
Example 1.4.5. Another quantity which uniquely determines the distribution of a random variable X is its cumulative hazard function

$$\Lambda_F(t) = \int_{(-\infty, t]} \frac{F(dx)}{1 - F(x-)} = \int \frac{1_{\{x \le t\}}}{1 - F(x-)} \, F(dx).$$

In this case, the function φ consists of two components, the numerator being a function only of x, while the denominator depends on x through F. The cumulative hazard function plays an outstanding role in survival analysis and reliability, when the data are lifetimes (failure times) of individuals or technical units and are therefore nonnegative. Hence in this case it is sufficient to extend the integral over the positive real line. For a continuous F, the left-hand limit F(x−) coincides with F(x). At this point it is not clear why one should not use F(x) instead of F(x−). For a preliminary answer we compute the empirical cumulative hazard function estimator, the so-called Nelson-Aalen estimator:

$$\Lambda_n(t) = \int_0^t \frac{F_n(dx)}{1 - F_n(x-)} = \sum_{i=1}^{n} \frac{1_{\{X_i \le t\}}}{\sum_{j=1}^{n} 1_{\{X_j \ge X_i\}}}.$$

Note that the choice of F_n(x−) in the denominator is responsible for X_j ≥ X_i (and not X_j > X_i). Because of X_i ≥ X_i, the sum is at least 1, so that the ratio is well-defined. If, as in Example 1.4.2, we let t vary in some interval, we obtain a stochastic process in t.

When there are no ties we may compute Λ_n also by summation along the order statistics. This yields

Λ_n(t) = Σ_{i=1}^n 1{X_{i:n} ≤ t}/(n − i + 1).

We thus see that, like F_n, Λ_n jumps at the data, the weights 1/(n − i + 1) assigned to the i-th order statistic, however, being strictly increasing as we move to the right.

While the total mass under (the standard) F_n equals one, the Nelson-Aalen estimator has total mass

Σ_{i=1}^n 1/(n − i + 1) = Σ_{i=1}^n 1/i ∼ ln n  as n → ∞.    (1.4.2)

This total cumulative mass is attained for all t ≥ X_{n:n} so that Λ_n is a constant function there.
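In the untied case the order-statistics formula can be coded directly; a minimal sketch (the sample is an arbitrary illustration) that also checks the total-mass identity (1.4.2):

```python
# Nelson-Aalen estimator without ties:
# Lambda_n(t) = sum over order statistics X_{i:n} <= t of 1/(n - i + 1).

def nelson_aalen(t, xs):
    xs_sorted = sorted(xs)
    n = len(xs)
    # enumerate starts at i = 0, so the weight 1/(n - i) equals 1/(n - i + 1)
    # in the 1-based indexing of the text.
    return sum(1.0 / (n - i) for i, x in enumerate(xs_sorted) if x <= t)

xs = [2.0, 0.5, 1.5, 3.0]
total_mass = nelson_aalen(max(xs), xs)   # equals 1 + 1/2 + ... + 1/n, cf. (1.4.2)
```

The first jump has height 1/n and the last has height 1, reflecting the unreliable behavior of Λ_n in the extreme right tail discussed below.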


Figure 1.4.1: Example of a Nelson-Aalen estimator for n = 100 observations

In the example underlying Figure 1.4.1, we feature some effects which are typical for the Nelson-Aalen estimator. For 0 ≤ t ≤ 3 the function is close to a smooth function; here it is the straight line y = t. In the extreme right tails, there are fewer observations with weights becoming larger and larger. Consequently there is no averaging effect and Λ_n is not a reliable estimate of Λ_F.

Example 1.4.6. Though the empirical distribution function is a nonparametric estimator, in the sense that neither its computation nor its properties require a parametric model assumption, our view turns out to be useful also in parametric statistics. For this, let M = {f(·, θ) : θ ∈ Θ} be a family of densities parametrized by a set Θ of finite-dimensional vectors. In the case of independent observations from M, i.e., the true density equals f(·, θ_0) for some unknown θ_0 ∈ Θ, the log-likelihood function equals

Σ_{i=1}^n ln f(X_i, θ),

which after normalization becomes

∫ ln f(x, θ) F_n(dx).    (1.4.3)

We thus see that log-likelihood functions are empirical integrals parametrized by θ ∈ Θ. Under some smoothness conditions the maximizer of (1.4.3), i.e., the Maximum Likelihood Estimator (MLE), satisfies

∫ φ(x, θ) F_n(dx) = 0,

where

φ(x, θ) = (∂/∂θ) ln f(x, θ).
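To see the empirical equation in action, take the exponential family f(x, θ) = θe^{−θx} (an illustrative choice of model, not one discussed in the text). Then φ(x, θ) = 1/θ − x, and the equation ∫ φ(x, θ) F_n(dx) = 0 has the explicit solution θ̂ = 1/x̄:

```python
# MLE via the empirical equation, sketched for f(x, theta) = theta*exp(-theta*x).
# The score is phi(x, theta) = 1/theta - x, so
# int phi(x, theta) F_n(dx) = 1/theta - mean(xs) = 0  =>  theta_hat = 1/mean(xs).

def score_integral(theta, xs):
    return sum(1.0 / theta - x for x in xs) / len(xs)

xs = [0.8, 1.6, 0.4, 1.2]          # illustrative data
theta_hat = 1.0 / (sum(xs) / len(xs))
```

For models without a closed-form root, the same empirical equation would be solved numerically in θ.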

Example 1.4.7. Sometimes the target is known to take on a specific value, given some proper assumptions hold. For example, assume F is symmetric at zero, i.e., X = −X in distribution when X ∼ F, and φ is odd:

φ(−x) = −φ(x).

Then

∫ φ(x) F(dx) = 0.

Important examples for φ are

φ(x) = sign(x) = { 1 if x > 0; 0 if x = 0; −1 if x < 0 },

the sign function, or

φ_t(x) = sin(tx),  t ∈ R.

Example 1.4.8. As another example, consider a two-sample situation, in which the first sample is from F and the second is from G. For a score function J set

∫ J(F(x)) G(dx).

If F = G and F is continuous, the last integral becomes ∫_0^1 J(u) du, a known value.

The empirical version becomes

∫ J(F_n(x)) G_m(dx),    (1.4.4)

an empirical integral with φ(x) = J(F_n(x)). Here, F_n and G_m are the empirical d.f.s of the two samples X_1, ..., X_n and Y_1, ..., Y_m from (the unknown) F and G, respectively. Applying our general formula (1.2.7), we get

∫ J(F_n(x)) G_m(dx) = (1/m) Σ_{i=1}^m J(F_n(Y_i)).    (1.4.5)

Note that since Y_i is from the second sample and F_n is from the first, nF_n(Y_i) is not the rank of Y_i. Rather it provides us with the location (or position) of Y_i w.r.t. the first sample. E.g., when we put J(u) = u, then (1.4.4) equals

(1/(nm)) Σ_{i=1}^m Σ_{j=1}^n 1{X_j ≤ Y_i}.

This integral is related to the Wilcoxon two-sample rank statistic. At the same time it also constitutes an example of a two-sample U-statistic.
Since under F = G (continuous)

∫ F(x) G(dx) = ∫ F(x) F(dx) = ∫_0^1 u du = 1/2,

a known value, we may expect that (1.4.5) for J(u) = u is at least close to 1/2 when F = G. At the same time this integral, namely (1.4.5), should hopefully be far away from 1/2 if F ≠ G. In general, the interesting question will then be how to choose J in order to get large power on certain alternatives to F = G.
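For J(u) = u, the statistic (1.4.5) is precisely the double sum above; a minimal sketch with two illustrative samples:

```python
# Wilcoxon-type two-sample statistic, i.e. (1.4.5) with J(u) = u:
# (1/(n*m)) * sum_{i,j} 1{X_j <= Y_i}; close to 1/2 when F = G.

def wilcoxon_statistic(xs, ys):
    n, m = len(xs), len(ys)
    return sum(1.0 for y in ys for x in xs if x <= y) / (n * m)

xs = [1.0, 3.0, 5.0]   # first sample, from F
ys = [2.0, 4.0, 6.0]   # second sample, from G
stat = wilcoxon_statistic(xs, ys)
```

Values far from 1/2 speak against the hypothesis F = G.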
We see from Examples 1.4.7 and 1.4.8 that empirical integrals may also be utilized for tests, in that a null hypothesis such as symmetry or equality in distribution is rejected if the empirical term deviates too much from a pre-specified value.
Many of the concepts discussed so far have obvious extensions to the k-variate case. As we know, a convenient class of sets which determines a distribution is the family of quadrants

Q = (−∞, t],

where −∞ = (−∞, ..., −∞)^T and t = (t_1, ..., t_k)^T. The pertaining empirical d.f. was defined as

F_n(t) = (1/n) Σ_{j=1}^n 1{X_j ≤ t},  t ∈ R^k.

New questions may come up now which have no counterpart in the univariate
case.
Example 1.4.9. We may be interested in the dependence structure of the X-coordinates. A simple measure of association between the coordinates of a bivariate vector X = (X_1, X_2)^T, say, is the correlation

Corr X = Cov(X_1, X_2) / √(Var X_1 · Var X_2).

We have already seen in Section 1.2 that variances are simple (quadratic) functionals of a distribution function. Denoting with F the joint distribution of X_1 and X_2, also the covariance allows for a simple representation as a (statistical) functional of F:

Cov(X_1, X_2) = (1/2) ∫∫ (x_1 − y_1)(x_2 − y_2) F(dx_1, dx_2) F(dy_1, dy_2).
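The double-integral representation can be verified numerically by plugging F_n into both integrals; a sketch (data illustrative) comparing it with the usual sample covariance:

```python
# Empirical version of Cov(X1, X2) = (1/2) * double integral of
# (x1 - y1)(x2 - y2) F(dx) F(dy): plug in F_n twice and compare with the
# sample covariance normalized by n.

pairs = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0)]
n = len(pairs)

double_int = sum((x1 - y1) * (x2 - y2)
                 for (x1, x2) in pairs for (y1, y2) in pairs) / (2 * n * n)

mx = sum(x for x, _ in pairs) / n
my = sum(y for _, y in pairs) / n
sample_cov = sum((x - mx) * (y - my) for x, y in pairs) / n
```

Both computations give the same number, since (1/2)E(X_1 − Y_1)(X_2 − Y_2) for independent copies equals the covariance.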
Example 1.4.10. Much work, e.g., has been devoted to testing for independence of X_1 and X_2. Write F_1 and F_2 for the marginal distributions of X_1 and X_2, respectively. Independence is tantamount to

F(t_1, t_2) − F_1(t_1)F_2(t_2) = 0    (1.4.6)

for all t_1, t_2 ∈ R. A test of (1.4.6) may then be based on its empirical analogue

F_n(t_1, t_2) − F_{1n}(t_1)F_{2n}(t_2).
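A crude test statistic is the maximal absolute deviation of this empirical analogue over the observed points; a minimal sketch (data illustrative — the sample is deliberately perfectly dependent):

```python
# Empirical independence discrepancy:
# D_n = max over data points of |F_n(t1, t2) - F_1n(t1) * F_2n(t2)|.

pairs = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
n = len(pairs)

def F_joint(t1, t2):
    return sum(1.0 for x, y in pairs if x <= t1 and y <= t2) / n

def F1(t1):
    return sum(1.0 for x, _ in pairs if x <= t1) / n

def F2(t2):
    return sum(1.0 for _, y in pairs if y <= t2) / n

D = max(abs(F_joint(x, y) - F1(x) * F2(y)) for x, y in pairs)
```

A large value of D speaks against independence of the two coordinates.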
Summarizing, the purpose of this section so far was to demonstrate that quantities or structures which may be expressed through the true distribution function have natural sample analogues in terms of empirical d.f.s. The empirical versions then serve as estimators of the theoretical quantities or may lead to statistics for testing hypotheses about unknown parameters or structures.
In most of our examples the entities of interest as well as their estimators could be written in an explicit form. The log-likelihood function was an exception in that the estimator was implicitly defined as a solution of an empirical equation.

So far we focused on empirical masses attached to fixed sets A, B, ... For extended intervals (−∞, t], F_n(t) represents the cumulative masses attached to all X_j's less than or equal to t. Therefore F_n is also called the cumulative empirical d.f. Similarly, F is the theoretical cumulative d.f. representing the whole population. As such, both F_n and F have a global feature. At the same time there are targets which characterize the distribution of a population in a local way.
For this, assume again that the X_j's are real-valued from a distribution function F. If F is differentiable with derivative

f(x) = lim_{h→0} [F(x + h) − F(x)]/h,    (1.4.7)

the (Lebesgue) density of F, then for small h > 0 we have

F(x + h) − F(x) ≈ h f(x).    (1.4.8)

These terms exhibit the local structure of F in a neighborhood of x ∈ R.


Practitioners of statistical methodology often prefer a density as a representative of a distribution, recovering masses as areas under the density. There are other important quantities which are of local type and which correspond to cumulative functions.
Example 1.4.11. If F has a density f, then the cumulative hazard function Λ_F equals

Λ_F(t) = ∫_0^t f(x) dx/(1 − F(x)) = −ln(1 − F(t)).

Hence

1 − F(t) = exp[−Λ(t)].

The function

λ(x) = f(x)/(1 − F(x))

is the hazard function of F. We may use (1.4.7) to represent λ(x) as

λ(x) = lim_{h→0} [F(x + h) − F(x)]/[h(1 − F(x))] = lim_{h→0} P(x < X ≤ x + h | x < X)/h,

where

P(A|B) = P(A ∩ B)/P(B)

is the conditional probability of an event A given an event B (with positive probability). Hence, similarly to (1.4.8), we get

P(x < X ≤ x + h | x < X) ≈ h λ(x).

In other words, the probability that, e.g., a technical unit may fail between x and x + h, given that it worked until x, is proportional to λ(x). The quantity x may be interpreted as the current age of the system.

If X denotes a discrete lifetime attaining the values i = 0, 1, 2, ..., the hazard function is defined as

λ(i) = P(i < X ≤ i + 1 | i < X) = P(X = i + 1)/P(i < X).

Hence λ(i) equals the probability that, given a person has survived i time units, death will occur in the next period.

Table 1.4.1 presents the hazard functions of the German male and female population computed for some selected decades in the last century.

The pertaining plot 1.4.2 reveals that the hazard function of a human population is likely to have the shape of a bathtub. If we compare the local risks for German women in the 1910s and 1980s it becomes apparent that significant improvements as to the rate of mortality have been obtained for young children (in particular babies) and elderly women exceeding 60 years of age.

Figure 1.4.2: Plot of two selected hazard functions (1901/10 and 1980/82) for the female population in Germany; 1000·λ(i) against age in years
Example 1.4.12. Another important local quantity is the so-called regression function. For this, recall that we already discussed some measures of association for the two components of an observational vector (X_1, X_2)^T. The purpose then was to quantify the degree of dependence between X_1 and X_2. In regression analysis one is interested in the kind of dependence between X_1 and X_2. This is of great practical importance when X_2, e.g., is not observable and needs to be predicted on the basis of X_1. Hence X_1 is sometimes called the independent variable (dose, input) while X_2 is the dependent variable (response, output). To distinguish between the different roles played by X_1 and X_2, we prefer the notation X and Y for X_1 and X_2, respectively.
Table 1.4.1: 1000·λ(i)

Males
 i    1901/10  1924/26  1932/34  1949/51  1960/62  1970/72  1980/82
 0     202,34   115,38    85,35    61,77    35,33    26,00    13,07
 1      39,88    16,19     9,26     4,16     2,31     1,55     0,92
 2      14,92     6,36     4,50     2,46     1,40     1,00     0,63
 5       5,28     2,42     2,32     1,21     0,80     0,73     0,44
10       2,44     1,42     1,33     0,70     0,45     0,47     0,29
15       2,77     1,94     1,57     1,04     0,75     0,79     0,52
20       5,04     4,27     2,83     1,88     1,85     2,00     1,54
25       5,13     4,39     2,97     2,23     1,69     1,61     1,28
30       5,56     4,05     3,24     2,28     1,70     1,70     1,32
35       6,97     4,25     3,94     2,76     2,09     2,10     1,69
40       9,22     5,35     4,82     3,52     2,95     3,20     2,78
45      12,44     7,23     6,58     5,16     4,43     4,75     4,44
50      16,93    10,30     9,39     8,50     7,39     7,71     7,33
55      23,57    15,48    14,18    12,75    12,97    12,06    10,97
60      32,60    23,62    21,72    18,91    22,04    20,44    18,25
65      47,06    36,92    34,04    29,06    34,33    34,59    28,34
70      69,36    58,08    54,01    45,79    50,87    55,92    45,76
75     106,40    93,91    87,40    75,08    78,85    84,15    75,30
80     157,87   141,96   136,68   121,37   122,97   122,86   115,52
85     231,60   212,85   207,69   190,15   188,02   180,95   169,05
90     320,02   284,69   287,73   282,56   279,21   259,70   227,77

Females
 i    1901/10  1924/26  1932/34  1949/51  1960/62  1970/72  1980/82
 0     170,48    93,92    68,39    49,09    27,78    19,84    10,37
 1      38,47    14,93     8,23     3,60     2,01     1,31     0,84
 2      14,63     5,74     3,98     2,15     1,08     0,80     0,50
 5       5,31     2,19     2,15     0,99     0,56     0,50     0,29
10       2,56     1,20     1,14     0,47     0,28     0,28     0,20
15       3,02     1,81     1,30     0,68     0,40     0,45     0,33
20       4,22     3,32     2,27     1,15     0,62     0,65     0,44
25       5,37     3,94     2,70     1,35     0,73     0,63     0,51
30       5,97     4,14     3,01     1,65     0,99     0,77     0,66
35       6,86     4,52     3,48     1,99     1,38     1,16     0,96
40       7,71     5,31     4,22     2,55     2,01     1,78     1,42
45       8,54     6,44     5,46     3,68     2,99     2,82     2,23
50      11,26     8,86     7,91     5,46     4,45     4,56     3,50
55      16,19    12,73    11,53     8,13     6,72     5,38     5,30
60      24,73    19,47    17,46    12,91    10,85     9,88     8,69
65      39,60    31,55    28,53    22,24    18,62    17,11    13,39
70      62,06    51,98    47,61    39,11    32,85    30,19    22,77
75      98,31    85,29    80,33    68,11    59,61    54,29    43,11
80     146,50   133,71   126,51   114,02   103,31    94,43    76,44
85     217,39   198,37   193,66   173,62   166,26   155,88   132,62
90     295,66   263,08   273,64   259,16   248,21   234,20   206,54

Source: Statistisches Bundesamt.

If Y has a finite expectation, a result in probability theory guarantees the existence of a function m = m(x) such that

Y = m(X) + ε,    (1.4.9)

where ε is orthogonal to X, i.e., the conditional expectation of ε, the noise variable, given X equals zero:

E(ε|X) = 0.    (1.4.10)

Given X = x (and not Y), the optimal predictor of Y then equals m(x). In probability theory the notation

m(x) = E[Y | X = x]

is very common and intuitive. It points out that m(x) equals the mean output when the input equals x. Unless m is a constant function, the expected response therefore depends on the value attained by X. Unfortunately, theory only provides us with the existence of m. If (X, Y) admits a bivariate Lebesgue density f = f(x, y), we have

m(x) = ∫ y f(x, y) dy / f_1(x),    (1.4.11)

where f_1 is the (marginal) density of X. Equation (1.4.11) exhibits the local flavor of m but also points out that in a real-world situation, m is unknown and depends on unknown quantities like f and f_1.
We now introduce the cumulative process pertaining to m. Again we restrict ourselves to the bivariate case. Let F_1 be the d.f. of X and set

I(t) = E[Y 1{X ≤ t}].

When Y ≡ 1, we obtain I(t) = F_1(t). In the general case, we get, using well-known properties of conditional expectations:

I(t) = E[E(Y 1{X ≤ t} | X)] = E[1{X ≤ t} E(Y | X)] = E[1{X ≤ t} m(X)] = ∫_{(−∞,t]} m(x) F_1(dx).

The function I is the integrated (or cumulative) regression function, where integration takes place w.r.t. the (marginal) distribution of X. Needless to say, this concept has an obvious extension to multivariate X's. If a sample (X_j, Y_j), 1 ≤ j ≤ n, from the same distribution as (X, Y) is available, the empirical estimator of I equals

I_n(t) = (1/n) Σ_{j=1}^n Y_j 1{X_j ≤ t}.

Again, when all Y_j ≡ 1, we obtain the empirical d.f. of the X_j's. In the general case, I_n is an example of a so-called marked point process with jumps at X_j, 1 ≤ j ≤ n, and marks (or jump heights) Y_j/n. These marks may be random and take on positive as well as negative values. The choice of the indicator 1{X ≤ t} has been made only to make I comparable with F. Practitioners may feel free to replace this indicator, e.g., by 1{X > t}, which is more popular for lifetime data.
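The estimator I_n is a one-line plug-in; a minimal sketch (the pairs are arbitrary, with one negative mark to illustrate that marks need not be positive):

```python
# Empirical integrated regression function (a marked point process):
# I_n(t) = (1/n) * sum_j Y_j * 1{X_j <= t}.

def I_n(t, sample):
    return sum(y for x, y in sample if x <= t) / len(sample)

sample = [(0.5, 2.0), (1.5, -1.0), (2.5, 4.0)]   # pairs (X_j, Y_j)
```

Setting all Y_j = 1 recovers the empirical d.f. of the X_j's.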
A preliminary consequence of our discussion is that it seems possible to propose natural estimators if the target has a global, respectively cumulative, character. For local quantities like densities, hazard or regression functions things seem to be different. Consider the problem of estimating a density f. The empirical d.f. F_n is constant between two successive data points and therefore has derivative zero there. On the other hand, F_n is not even continuous at the data. Any proposal to smooth F_n and then to compute the resulting derivative brings us exactly to the heart of the problem:

How to smooth?

From a mathematical point of view, our goal is to invert the integration operator through differentiation. As we have seen, however, the inverse operator is not applicable to the estimator F_n, a feature which is quite common in so-called ill-posed problems.

1.5 The Lorenz Curve

In the following sections we discuss several important quantities which may be expressed through F and F^{-1}. First we investigate the so-called Lorenz curve.

Definition 1.5.1. Let F be any distribution function on the real line with quantile function F^{-1}. Assume that μ = ∫ x F(dx) exists and does not vanish. Then the Lorenz function L is defined through

L_F(p) = L(p) = (1/μ) ∫_0^p F^{-1}(u) du,  0 ≤ p ≤ 1.

Since by transformation of integrals

μ = ∫_0^1 F^{-1}(u) du,

we have L(1) = 1. Clearly, L(0) = 0. Hence the associated Lorenz curve (p, L(p)), 0 ≤ p ≤ 1, connects the points (0, 0) and (1, 1). It is scale-free.

Remark 1.5.2. If the random variable X with d.f. F is nonnegative, so is F^{-1}(u). Consequently, in this situation, L is a non-decreasing function.

Lemma 1.5.3. Assume that μ > 0. Then the Lorenz function L is convex.

Proof. We have to show that for p_1, p_2 from the unit interval and any 0 ≤ c ≤ 1

c L(p_1) + (1 − c) L(p_2) ≥ L(c p_1 + (1 − c) p_2).    (1.5.1)

The left-hand side equals, for 0 ≤ p_1 ≤ p_2 ≤ 1:

(1/μ) [ ∫_0^{p_1} F^{-1}(u) du + (1 − c) ∫_{p_1}^{p_2} F^{-1}(u) du ].

For the right-hand side of (1.5.1) we obtain

L(c p_1 + (1 − c) p_2) = (1/μ) [ ∫_0^{p_1} F^{-1}(u) du + ∫_{p_1}^{c p_1 + (1−c) p_2} F^{-1}(u) du ].

Hence it remains to show that

(1 − c) ∫_{p_1}^{p_2} F^{-1}(u) du ≥ ∫_{p_1}^{c p_1 + (1−c) p_2} F^{-1}(u) du.

The integral on the right-hand side equals, however,

(1 − c) ∫_{p_1}^{p_2} F^{-1}(c p_1 + (1 − c) u) du.

The conclusion now follows from the monotonicity of F^{-1} and

u ≥ c p_1 + (1 − c) u  for p_1 ≤ u ≤ p_2.

Example 1.5.4. For the exponential distribution with parameter λ > 0 we have

L(p) = p + (1 − p) ln(1 − p).

Example 1.5.5. For the uniform distribution on the interval [a, b] we have

L(p) = [2ap + (b − a)p²]/(a + b).

In particular, for a = 0 and b = 1, we obtain L(p) = p².


Corollary 1.5.6. The Lorenz curve is always below the straight line connecting (0, 0) and (1, 1).

We may view L as a functional T evaluated at p and F:

L_F(p) = T(p, F).

To compute the empirical counterpart, we have to replace F by F_n, i.e., the empirical Lorenz function equals

L_n(p) = T(p, F_n) = (1/μ_n) ∫_0^p F_n^{-1}(u) du,

where μ_n = n^{-1} Σ_{i=1}^n X_i is the sample mean.

Since F_n^{-1} is constant between two successive values (i−1)/n and i/n, we have for (k−1)/n < p ≤ k/n:

L_n(p) = (1/μ_n) [ (1/n) Σ_{i=1}^{k−1} X_{i:n} + (p − (k−1)/n) X_{k:n} ].

We see that L_n(p) may also be expressed through sums of order statistics.
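The order-statistics formula for L_n translates directly into code; a minimal sketch (data illustrative):

```python
# Empirical Lorenz function: for (k-1)/n < p <= k/n,
# L_n(p) = (1/mu_n) * [ (1/n)*sum_{i<k} X_{i:n} + (p - (k-1)/n) * X_{k:n} ].
import math

def lorenz(p, xs):
    xs_sorted = sorted(xs)
    n = len(xs)
    mu_n = sum(xs) / n
    if p == 0:
        return 0.0
    k = math.ceil(p * n)                    # (k-1)/n < p <= k/n
    partial = sum(xs_sorted[:k - 1]) / n
    return (partial + (p - (k - 1) / n) * xs_sorted[k - 1]) / mu_n

xs = [1.0, 4.0, 9.0, 16.0, 25.0]
```

At the grid points p = k/n the value reduces to the ratio of partial sums S_k/S_n used in the next section.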


Example 1.5.7. If X_{1:n} = ... = X_{n−1:n} = 0 and X_{n:n} = K > 0, then

L_n(p) = 0                 for 0 ≤ p ≤ (n−1)/n,
L_n(p) = n(p − 1 + 1/n)    for (n−1)/n < p ≤ 1.

If we interpret the X's as the wealth owned by n members of a population, this describes a situation where all is owned by a single member. In this extreme situation, L_n is flat up to (n−1)/n. Hence the difference between L_n and the 45° line is very large. This observation leads us to an index which measures the distribution of wealth in an economy.

1.6 Gini's Index

In this section, F denotes a continuous d.f. supported by the positive real line. Let A denote the area between the 45°-line and the Lorenz curve, and let B be the area below the Lorenz curve. Gini's index, see Gini (1936), is then defined as

G := A/(A + B).

Since A + B = 1/2, we obtain

G = 2A = 2 [ 1/2 − ∫_0^1 L(p) dp ] = 1 − 2 ∫_0^1 L(p) dp.

We always have 0 ≤ G ≤ 1. G is close to one if L(p) is close to zero. In view of Example 1.5.7, a large G indicates that the wealth of a population is concentrated in a few hands.
Lemma 1.6.1. Assume that F is a continuous d.f. supported by the positive real line. Then we have

G = Δ/(2μ),

where

Δ = ∫_0^∞ ∫_0^∞ |x − y| F(dx) F(dy).

In terms of random variables, we have

Δ = E|X − Y|, where X, Y ∼ F are independent.

Figure 1.6.1: Gini Index — a ranking of some 135 countries by Gini index (columns: country, rank, index), from Namibia (70,70) at the top down to Schweden (23,00)

Proof. We have

Δ = 2 ∫∫_{x ≤ y} (y − x) F(dx) F(dy) = 2 [ ∫ y F(y) F(dy) − ∫ ∫_0^y x F(dx) F(dy) ].

Applying Fubini's Theorem the last expression becomes

2 ∫ y F(y) F(dy) − 2 ∫ x (1 − F(x)) F(dx) = 2 ∫ (2x F(x) − x) F(dx) = 2 [ 2 ∫ x F(x) F(dx) − μ ].

Since, by transformation of integrals and under continuity of F,

∫ x F(x) F(dx) = ∫_0^1 F^{-1}(u) u du = ∫_0^1 F^{-1}(u) ∫_0^u dv du = ∫_0^1 ∫_v^1 F^{-1}(u) du dv = μ − ∫_0^1 ∫_0^v F^{-1}(u) du dv,

we come up with

Δ = 2μ − 4 ∫_0^1 ∫_0^v F^{-1}(u) du dv,

whence

Δ/(2μ) = 1 − 2 ∫_0^1 L(v) dv = G.


The empirical Gini index of course equals

G_n = 1 − 2 ∫_0^1 L_n(p) dp.


Since L_n is a piecewise linear function interpolating the values S_{k−1}/S_n at (k−1)/n and S_k/S_n at k/n, where S_k = Σ_{i=1}^k X_{i:n}, we obtain

G_n = 1 − 2 Σ_{k=1}^n [ k/n − (k−1)/n ] · (1/2) [ S_{k−1}/S_n + S_k/S_n ]
    = 1 − (1/n) Σ_{k=1}^n (S_k + S_{k−1})/S_n.

Example 1.6.2. Assume that X_1 = 16, X_2 = 1, X_3 = 4, X_4 = 25 and X_5 = 9. Ordering gives X_{1:5} = 1, X_{2:5} = 4, X_{3:5} = 9, X_{4:5} = 16, X_{5:5} = 25. Hence L_n interpolates the values 0, 1/55, 5/55, 14/55, 30/55 and 1, yielding G_5 = 24/55 ≈ 43.64%.

The statistical analysis of G_n requires some knowledge of the distributional theory of sums of order statistics. Alternatively we could use the equivalent representation in terms of U-statistics.
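The partial-sum formula for G_n reproduces Example 1.6.2; a minimal sketch:

```python
# Empirical Gini index: G_n = 1 - (1/n) * sum_k (S_k + S_{k-1}) / S_n,
# with S_k the partial sums of the order statistics (S_0 = 0).

def gini(xs):
    xs_sorted = sorted(xs)
    n = len(xs)
    S = [0.0]
    for x in xs_sorted:
        S.append(S[-1] + x)
    return 1.0 - sum(S[k] + S[k - 1] for k in range(1, n + 1)) / (n * S[n])

g5 = gini([16.0, 1.0, 4.0, 25.0, 9.0])   # Example 1.6.2: 24/55
```

The data of Example 1.6.2 indeed give 24/55 ≈ 0.4364.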
Rather than looking at the area between the 45° line and the Lorenz curve, we could take into account the discrepancy

Z(p) = p − L(p).

Obviously, Z(p) ≥ 0 and

∫_0^1 Z(p) dp = 1/2 − ∫_0^1 L(p) dp = G/2.

Since L is convex, the function Z is concave. Furthermore,

Z(0) = 0 and Z(1) = 1 − L(1) = 0.

Since L and hence Z is continuous, Z admits a maximizer, say p_0:

Z(p_0) = max_{0 ≤ p ≤ 1} Z(p).

It turns out that p_0 admits an interesting interpretation.

Lemma 1.6.3. We have

p_0 = F(μ).

In other words, p_0 equals the proportion of the population obtaining less than or equal to the mean.


Proof. We have to show that Z(F(μ)) ≥ Z(p) for all 0 ≤ p ≤ 1. We only discuss the case F(μ) ≥ p, the other being dealt with in a similar way. Now,

Z(F(μ)) − Z(p) = F(μ) − p − [L(F(μ)) − L(p)] = F(μ) − p − (1/μ) ∫_p^{F(μ)} F^{-1}(u) du.    (1.6.1)

The integral on the right-hand side equals, by transformation of integrals,

∫ 1{F^{-1}(p) ≤ x ≤ μ} x F(dx),

so that (1.6.1) becomes

(1/μ) ∫ (μ − x) 1{F^{-1}(p) ≤ x ≤ μ} F(dx).

Obviously the integral is nonnegative.

It is interesting to investigate the maximal deviation between L and the 45°-line, i.e., of

Z(p_0) = p_0 − L(p_0).

Lemma 1.6.4. We have

Z(p_0) = τ/(2μ)

with

τ = ∫ |x − μ| F(dx),

the mean absolute deviation.

Proof. We have

τ/(2μ) = (1/(2μ)) ∫ |x − μ| F(dx) = (1/(2μ)) ∫_μ^∞ (x − μ) F(dx) + (1/(2μ)) ∫_0^μ (μ − x) F(dx).

Since ∫ (x − μ) F(dx) = 0, the two integrals on the right-hand side are equal, whence

τ/(2μ) = (1/μ) ∫_0^μ (μ − x) F(dx) = F(μ) − (1/μ) ∫_0^μ x F(dx) = F(μ) − (1/μ) ∫_0^{F(μ)} F^{-1}(u) du.    (1.6.2)

Recalling that p_0 = F(μ), we obtain that (1.6.2) equals Z(p_0).

1.7 The ROC-Curve

The ROC curve (Receiver Operating Characteristic curve) was originally proposed to analyze the accuracy of a medical diagnostic test. To begin with, assume that a variable D may take on the two values zero and one indicating the true but unknown status of a patient:

D = { 1 for disease; 0 for non-disease }.

Let Y be the result of a diagnostic test. Suppose that

Y = { 1: positive for disease; 0: negative for disease }.

The result of the test may be classified as follows:

          D = 0            D = 1
Y = 0     True negative    False negative
Y = 1     False positive   True positive

We denote with

FPF = P(Y = 1|D = 0)


and

TPF = P(Y = 1|D = 1)

the false positive and true positive fraction, respectively. The ideal test yields

FPF = 0 and TPF = 1.

For a useless test which is unable to discriminate between D = 0 and D = 1 we have FPF = TPF. In a biomedical context TPF and 1 − FPF are called sensitivity and specificity, i.e.,

sensitivity = P(Y = 1|D = 1)
specificity = P(Y = 0|D = 0).

In engineering the terms hit rate (TPF) and false alarm rate (FPF) are quite common. In statistics, when the goal is to test the null hypothesis D = 0, FPF is called the significance level and TPF the power. The quantity π = P(D = 1) is called the population prevalence of disease. Finally, the misclassification probability equals

P(Y ≠ D) = P(D = 1)P(Y = 0|D = 1) + P(D = 0)P(Y = 1|D = 0) = π(1 − TPF) + (1 − π)FPF.

In applications to real data

1 − TPF = P(Y = 0|D = 1)

should be small, since usually a negative diagnosis of a patient who has fallen sick may result in a fatal incidence. On the other hand, a positive diagnosis of a negative status may only result in some less serious inconveniences. We finally introduce the predictive values:

Positive predictive value = PPV = P(D = 1|Y = 1)
Negative predictive value = NPV = P(D = 0|Y = 0)

A perfect test has

PPV = 1 = NPV,

while a useless test yields

P(D = 1|Y = 1) = P(D = 1) and P(D = 0|Y = 0) = P(D = 0).


A straightforward application of Bayes' rule gives us PPV and NPV in a general situation:

PPV = πTPF / [πTPF + (1 − π)FPF]
NPV = (1 − π)(1 − FPF) / [(1 − π)(1 − FPF) + π(1 − TPF)]

In terms of TPF and FPF, the test is perfect if

TPF = 1 and FPF = 0.    (1.7.1)

In applications, the variable Y is not dichotomous (Y = 0 or 1) but a score statistic Y = S(X) of some available data vector X. We can dichotomize by comparing Y with a threshold c. In the context of these notes we diagnose the case to be positive if Y > c and negative if Y ≤ c. It follows that

FPF(c) = P(Y > c|D = 0)
TPF(c) = P(Y > c|D = 1)    (1.7.2)

The problem now is one of choosing the threshold c yielding a good if not optimal test. To start with, we consider the curve consisting of all values from (1.7.2) with −∞ < c < ∞.

Definition 1.7.1. The ROC curve consists of all tuples (FPF(c), TPF(c)) with −∞ < c < ∞.
Since FPF and TPF are both survival functions, we have

lim_{c→∞} TPF(c) = 0,  lim_{c→∞} FPF(c) = 0

and

lim_{c→−∞} TPF(c) = 1,  lim_{c→−∞} FPF(c) = 1.

If we plot TPF(c) against FPF(c), we obtain the ROC function. It is an increasing function on the unit interval with ROC(0) = 0 and ROC(1) = 1. If we denote with

F(y) = P(Y ≤ y|D = 1) and G(y) = P(Y ≤ y|D = 0)

the two relevant d.f.s and with F̄ and Ḡ their survival functions, then by construction

ROC(t) = F̄(Ḡ^{-1}(t)),  0 < t < 1.


In applications people became interested in, as for the Lorenz curve, the area under ROC, i.e.,

AUC = ∫_0^1 ROC(t) dt.

Transformation of integrals leads to

AUC = ∫ F̄(y) G(dy) = P(Y_1 > Y_0),

where Y_1 and Y_0 are independent with Y_1 ∼ F and Y_0 ∼ G. Given two samples Y_{11}, ..., Y_{1n} and Y_{01}, ..., Y_{0m} from F resp. G, the estimator of AUC becomes

AUC^ = (1/(nm)) Σ_{i=1}^n Σ_{j=1}^m 1{Y_{1i} > Y_{0j}},

a two-sample U-statistic.
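The AUC estimator is the same kind of double sum as the Wilcoxon statistic of Example 1.4.8; a minimal sketch (score values illustrative):

```python
# Empirical AUC: (1/(n*m)) * sum_i sum_j 1{Y1_i > Y0_j},
# the estimated probability that a diseased score exceeds a healthy one.

def auc_hat(y1, y0):
    return sum(1.0 for a in y1 for b in y0 if a > b) / (len(y1) * len(y0))

y1 = [2.0, 3.0, 5.0]   # scores of diseased subjects (D = 1)
y0 = [1.0, 2.5, 4.0]   # scores of healthy subjects (D = 0)
a_hat = auc_hat(y1, y0)
```

A value near 1/2 indicates a useless score; a value near 1 indicates a nearly perfect one.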
As for the Lorenz curve we could consider the partial AUC, namely

pAUC(s) = ∫_0^s ROC(t) dt,  0 ≤ s ≤ 1.

This is a functional in s, F and G. If we replace F and G by their empirical versions, we obtain a stochastic process.
Remark 1.7.2. We mention that some authors base the ROC analysis on F and G and not on F̄ and Ḡ. Also a lot of ROC analysis has been done in a parametric framework; see Hsieh and Turnbull (1996), Ann. Statist. 24, 25-40.

1.8 The Mean Residual Lifetime Function

Let X be a positive random variable with d.f. F and finite mean

μ = EX = ∫_0^∞ (1 − F(x)) dx.

The mean residual lifetime function is defined for all x with F(x) < 1:

e(x) = E(X − x | X > x) = ∫_x^∞ F̄(y) dy / F̄(x).

Note that e(0) = μ. The function e has been primarily used in survival analysis and reliability theory. It is the mean remaining lifelength of those individuals who survive beyond x.

Example 1.8.1. For the exponential d.f. with F̄(x) = exp(−λx), x ≥ 0, we have e(x) = 1/λ.
There is an intimate relation between the Lorenz function and the mean residual lifetime function.

Lemma 1.8.2. We have

L(F(x)) = 1 − (1/μ) F̄(x)[e(x) + x].

Proof. Note that

F̄(x)[e(x) + x] = x F̄(x) + ∫_x^∞ F̄(y) dy = E X 1{x < X} = μ − E X 1{X ≤ x} = μ − ∫_0^{F(x)} F^{-1}(t) dt = μ (1 − L(F(x))).

The next lemma presents a formula connecting F and e.

Lemma 1.8.3. We have

F̄(x) = (e(0)/e(x)) exp[ −∫_0^x dt/e(t) ].

Proof. Put h(t) = ∫_t^∞ F̄(z) dz. Since

1/e(t) = F̄(t) / ∫_t^∞ F̄(z) dz = −h′(t)/h(t),

we obtain

∫_0^x dt/e(t) = −ln h(x) + ln h(0)

and therefore

exp[ −∫_0^x dt/e(t) ] = h(x)/h(0).

Since h(0) = μ = e(0), we conclude

e(0) exp[ −∫_0^x dt/e(t) ] = h(x) = ∫_x^∞ F̄(z) dz = F̄(x) e(x).

1.9 The Total Time on Test Transform

Again, let X be a nonnegative random variable with d.f. F. The total time on test transform (TTT-transform) is defined as

H^{-1}(t) = ∫_0^{F^{-1}(t)} F̄(x) dx,  0 < t < 1.

In survival analysis, the TTT-transform and its sample version are examples of processes (plots) which aim at analyzing the distributional properties of a lifetime variable subject to falling below (or above) a given threshold. By now, there is a huge literature on the empirical TTT-transform. In Barlow et al. (1972), it appeared in the estimation process of a failure rate under order restrictions. Barlow and Proschan (1969) studied tests based on TTT when the data are incomplete. Langberg et al. (1980) showed how the TTT-transform can be used to characterize lifetime distributions; see also Klefsjo (1983a, b). More recently, Haupt and Schabe (1997) applied the TTT concept to construct bathtub-shaped hazard functions.

Example 1.9.1. For the exponential d.f. with parameter λ we have

H^{-1}(t) = t/λ.

Again, there are some interesting connections with other functions. One of these examples is the following.


Lemma 1.9.2. We have, for 0 < t < 1,

H^{-1}(t) = μ L(t) + F^{-1}(t)(1 − t).

Proof. By definition,

H^{-1}(t) = ∫_0^{F^{-1}(t)} F̄(x) dx = ∫_0^{F^{-1}(t)} ∫ 1{x < y} F(dy) dx = ∫ min(F^{-1}(t), y) F(dy) = ∫_{(−∞, F^{-1}(t)]} y F(dy) + F^{-1}(t)(1 − t) = μ L(t) + F^{-1}(t)(1 − t).


The empirical version of H^{-1} is obtained if one replaces F and F^{-1} by their empirical counterparts:

H_n^{-1}(t) = ∫_0^{F_n^{-1}(t)} (1 − F_n(x)) dx.

At t = k/n we get

H_n^{-1}(k/n) = ∫_0^{X_{k:n}} (1 − F_n(x)) dx = (1/n) { Σ_{i=1}^k X_{i:n} + (n − k) X_{k:n} }.

This also explains the name for H^{-1}.
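The formula at t = k/n makes the empirical TTT-transform a one-liner; a sketch (lifetimes illustrative):

```python
# Empirical TTT-transform at t = k/n:
# H_n^{-1}(k/n) = (1/n) * ( sum_{i<=k} X_{i:n} + (n - k) * X_{k:n} ),
# the total time on test accumulated up to the k-th failure, divided by n.

def ttt(k, xs):
    xs_sorted = sorted(xs)
    n = len(xs)
    return (sum(xs_sorted[:k]) + (n - k) * xs_sorted[k - 1]) / n

xs = [3.0, 1.0, 2.0, 4.0]
```

At k = n the expression reduces to the sample mean, mirroring H^{-1}(1−) = μ.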

1.10 The Product Integration Formula

To summarize the results of the last five sections, we have introduced four important functionals of F, namely the

- Lorenz curve (including Gini's index)
- ROC curve
- mean residual lifetime function
- TTT transform

It became clear that there is an intimate relationship between these functions and each may serve as a tool to model various aspects of a parent population. In this section we study the connection between the d.f. F and the cumulative hazard function Λ. As our final result we shall come up with the famous product integration formula. It will offer a possibility to represent empirical d.f.s through products rather than sums. Important applications of this approach are in survival analysis.
Let us start with an arbitrary, not necessarily random, sequence of real numbers, say S_0, S_1, S_2, .... In most cases (S_i)_i is adapted to a filtration (F_i)_i. For a given x define the sequence

    T_n = x ∏_{i=1}^n (1 + ΔS_i),   n = 0, 1, ...,

where ΔS_i = S_i − S_{i−1}. The stochastic integral of T w.r.t. S is defined as

    ∫_{[0,n]} T_− dS = Σ_{i=1}^n T_{i−1} ΔS_i = Σ_{i=1}^n T_{i−1}(S_i − S_{i−1}).

Lemma 1.10.1. We have, for n = 0, 1, ...,

    T_n = x + ∫_{[0,n]} T_− dS.                                    (1.10.1)

Proof. Obviously, the assertion is true for n = 0 since both sides equal x. The general case is proved by induction on n. Given that the assertion is true for n, we have

    T_{n+1} = T_n (1 + ΔS_{n+1}) = T_n + T_n ΔS_{n+1}
            = x + ∫_{[0,n]} T_− dS + T_n ΔS_{n+1} = x + ∫_{[0,n+1]} T_− dS.   ∎

Equation (1.10.1) is an example of an integral equation. For x = 1, we call the sequence

    T_n ≡ Exp_n(S) = ∏_{i=1}^n (1 + ΔS_i) = 1 + ∫_{[0,n]} T_− dS          (1.10.2)

the Stochastic Exponential of S.
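Lemma 1.10.1 and the product form (1.10.2) can be checked numerically for an arbitrary sequence. A small sketch (our own assumptions: `dS` holds the increments ΔS_i, and S_0 is taken such that T_0 = x):

```python
import numpy as np

rng = np.random.default_rng(1)
dS = rng.normal(scale=0.3, size=10)     # increments Delta S_i, i = 1..10 (arbitrary)

x = 2.0
# T_0 = x, T_n = x * prod_{i<=n} (1 + Delta S_i)
T = x * np.concatenate(([1.0], np.cumprod(1.0 + dS)))

# check the integral equation (1.10.1): T_n = x + sum_{i=1}^n T_{i-1} * Delta S_i
for n in range(len(T)):
    assert abs(T[n] - (x + np.sum(T[:n] * dS[:n]))) < 1e-9
```

The induction step of the proof is exactly what the loop verifies term by term.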


The above setting allows for a simple but important extension. For this, let (h_n)_n be a predictable sequence. From (S_n)_n and (h_n)_n we may construct a new sequence, namely

    ∫_{[0,n]} h dS = Σ_{i=1}^n h_i ΔS_i = Σ_{i=1}^n h_i (S_i − S_{i−1}).

The associated stochastic exponential equals

    T_n = Exp_n( ∫_{[0,n]} h dS ) = ∏_{i=1}^n (1 + h_i ΔS_i).

Lemma 1.10.2. The sequence (T_n)_n satisfies

    T_n = 1 + ∫_{[0,n]} T_− h dS.

Proof. The proof is similar to that of Lemma 1.10.1.   ∎

Note that Lemma 1.10.1 is a special case of Lemma 1.10.2: just put h_n ≡ 1. Now, if S is a function of bounded variation, we may let the number of grid points tend to infinity. Again, put t_0 = −∞ and keep t_n = t fixed. Then the limit of (1.10.2) exists and we obtain the analogue of (1.10.2) in continuous time:

    T_t = 1 + ∫_{(−∞,t]} T(x−) S(dx).

In some situations S ≡ 0 on the negative real line. In this case we have

    T_t = 1 + ∫_{[0,t]} T(x−) S(dx).                               (1.10.3)

Example 1.10.3. Recall the cumulative hazard function Λ satisfying

    dΛ = dF / (1 − F_−),

where F_−(x) = F(x−). Conclude

    F(t) = ∫_{(−∞,t]} (1 − F_−) dΛ

and

    1 − F(t) = 1 − ∫_{(−∞,t]} (1 − F_−) dΛ.                        (1.10.4)

We thus see that the survival function 1 − F satisfies (1.10.3) with S = −Λ.


Solving (1.10.4) means that we aim at representing 1 − F in terms of Λ. For a continuous F, Example 1.4.11 presents the solution, namely

    1 − F(t) = exp(−Λ(t)).

For applications in statistics we need the solution of (1.10.4), however, in the general case. This will be necessary mainly for two reasons: In many situations it will be simpler to estimate Λ, say by Λ_n, than F. This Λ_n will be discrete, so that a general representation of the corresponding F_n would be helpful to construct the associated survival function and hence an estimator of F.
To solve (1.10.4) in the general case, we need some notation.
Definition 1.10.4. Let S be a nondecreasing function. Put

    S^c(t) = S(t) − Σ_{x ≤ t} ΔS(x) ≡ S(t) − Σ_{x ≤ t} S{x}.

The function S^c is called the continuous part of S. Note that S admits at most countably many discontinuities.
Theorem 1.10.5. Let F be any d.f. with associated cumulative hazard function Λ. Then we have

    1 − F(t) = exp[−Λ^c(t)] ∏_{x ≤ t} [1 − Λ{x}].                  (1.10.5)

If F is continuous, so is Λ. Therefore the product is empty and equals 1, while Λ^c = Λ. For a purely discrete Λ, (1.10.5) becomes

    1 − F(t) = ∏_{x ≤ t} [1 − Λ{x}].                               (1.10.6)


For the proof of Theorem 1.10.5 we need the following result for exponentials.

Lemma 1.10.6. Let (S_i^1)_i and (S_i^2)_i be two sequences. Then we have

    Exp_n(S^1) Exp_n(S^2) = Exp_n(S^1 + S^2 + [S^1, S^2]).         (1.10.7)

Here

    [S^1, S^2]_n = S_0^1 S_0^2 + Σ_{k=1}^n ΔS_k^1 ΔS_k^2

denotes the quadratic covariation of S^1 and S^2.


Proof of Theorem 1.10.5. By definition,

    Λ(t) = Λ^c(t) + Σ_{x ≤ t} Λ{x} ≡ Λ^c(t) + Λ^d(t),

where d stands for the discrete part. As before, consider a grid t_0 = −∞ < t_1 < ... < t_n = t. If we let n tend to infinity such that the grid gets finer and finer, we get

    Exp_n(Λ^c) ≡ ∏_{i=1}^n [1 + ΔΛ^c(t_i)]
              = exp{ Σ_{i=1}^n ln[1 + ΔΛ^c(t_i)] }
              = exp{ Σ_{i=1}^n ΔΛ^c(t_i) + o(1) } → exp[Λ^c(t)].   (1.10.8)

This argument shows that the exponential of a continuous monotone bounded function f equals t ↦ exp[f(t)]. For the discrete part, however, we get

    Exp_n(Λ^d) → ∏_{x ≤ t} [1 + Λ{x}].                             (1.10.9)

Now replace Λ with −Λ. Then the product of (1.10.8) and (1.10.9) converges to the right-hand side of (1.10.5). To prove the theorem, we apply Lemma 1.10.6 with S^1 = −Λ^c and S^2 = −Λ^d. By (1.10.7) the product of (1.10.8) and (1.10.9) equals

    ∏_{i=1}^n [1 + ΔS_i^1 + ΔS_i^2 + ΔS_i^1 ΔS_i^2]
        = exp{ Σ_{i=1}^n ln(1 + ΔS_i^1 + ΔS_i^2 + ΔS_i^1 ΔS_i^2) }
        = exp{ Σ_{i=1}^n ln(1 + ΔS_i^1 + ΔS_i^2) + o(1) }.

The last equation is obtained from a Taylor expansion and the fact that

    Σ_{i=1}^n ΔS_i^1 ΔS_i^2 → ∫_{(−∞,t]} Λ{x} Λ^c(dx) = 0.

Finally, S = −Λ = −Λ^c − Λ^d, so that by (1.10.4)

    ∏_{i=1}^n (1 + ΔS_i^1 + ΔS_i^2) → Exp_t[−Λ] = 1 − F(t).   ∎

In the following we apply the last theorem to the empirical d.f. F_n. For the sake of simplicity we assume that the order statistics X_{1:n} < ... < X_{n:n} are pairwise distinct. This is always the case (with probability one) if the underlying d.f. F is continuous. Now, since F_n only jumps at the data, we have

    Λ_n{x} = F_n{x} / (1 − F_n(x−)) = { 1/(n−i+1)   if x = X_{i:n}
                                      { 0           elsewhere.

Hence Theorem 1.10.5 yields

    1 − F_n(t) = ∏_{X_{i:n} ≤ t} [1 − 1/(n−i+1)] = ∏_{X_{i:n} ≤ t} (n−i)/(n−i+1).   (1.10.10)

Equation (1.10.10) is called the Product-Limit representation of F_n. For the mass at t, we get

    F_n{t} = (1 − F_n(t−)) − (1 − F_n(t)) = ∏_{X_{i:n} < t} (n−i)/(n−i+1) − ∏_{X_{i:n} ≤ t} (n−i)/(n−i+1).

This difference equals zero if t ≠ X_{j:n} for all 1 ≤ j ≤ n. For t = X_{j:n}, however, we get

    F_n{t} = ∏_{i=1}^{j−1} (n−i)/(n−i+1) − ∏_{i=1}^{j} (n−i)/(n−i+1)
           = ∏_{i=1}^{j−1} (n−i)/(n−i+1) · [1 − (n−j)/(n−j+1)]
           = (n−1)/n · (n−2)/(n−1) ··· (n−j+1)/(n−j+2) · 1/(n−j+1) = 1/n,

as expected.
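A quick numerical confirmation of (1.10.10): for distinct order statistics the product ∏_{X_{i:n} ≤ t} (n−i)/(n−i+1) telescopes to 1 − F_n(t). A minimal sketch with simulated (hypothetical) data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(size=8))        # order statistics X_{1:n} < ... < X_{n:n}
n = len(x)

for t in np.linspace(0.0, 1.0, 50):
    k = int(np.sum(x <= t))             # number of data points <= t, so F_n(t) = k/n
    i = np.arange(1, k + 1)
    prod = np.prod((n - i) / (n - i + 1))   # right-hand side of (1.10.10)
    assert abs(prod - (1.0 - k / n)) < 1e-12
```

The empty product (k = 0) correctly gives 1 − F_n(t) = 1.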
If we check the proof of Theorem 1.10.5 carefully, we see that the following arguments were essential:

  • 1 − F is the exponential of −Λ.
  • For taking logarithms we needed 1 + ΔS_i^1 + ΔS_i^2 + ΔS_i^1 ΔS_i^2 > 0. This was true, however, since ΔS_i^1 and ΔS_i^1 ΔS_i^2 became small, while ΔS_i^2 → −Λ{x} and 1 − Λ{x} > 0.
  • For taking limits it was important that Λ is finite over (−∞, t] for each finite t.

We already knew before that 1 − F is the exponential of −Λ. Equation (1.10.5) is important since it reveals a possibility to explicitly represent 1 − F in terms of the continuous and discrete parts of Λ.

1.11    Sampling Designs and Weighting

In the previous sections we have seen that empirical d.f.'s may serve as a universal tool to express (and analyze) various statistics which at first sight do not seem to have much in common. For this kind of discussion no particular assumptions like independence of the data were necessary.

In this section we outline several possible data situations which we will study in future chapters. The classical field of empirical distributions (with standard weights) deals with independent identically distributed (i.i.d.) random observations. Therefore, in forthcoming chapters, we shall first study this case in greater detail. The theory and its applications are particularly rich when the data are real-valued. For the multivariate case the fact that the data cannot be ordered causes new challenging questions. In the context of time series analysis the data feature some dependence.


Empirical distributions are helpful to analyze this dependence structure in both the frequency and the time domain. Quite often one has a particular model for the autoregressive function in mind, i.e., one assumes that

    X_i = m(X_{i−1}, ..., X_{i−p}, θ) + ε_i

satisfies some (parametric) dynamic model, in which the new variable is an unknown function of the previous ones, subject to some noise ε_i. In such a situation empirical d.f.'s are also useful to analyze the noise variables ε_i and the goodness-of-fit of the model in the context of a residual analysis.

In the following we discuss some data examples which show that we need to be open to replace the standard weights n^{-1} by other ones.
Example 1.11.1. This is a situation frequently appearing in Robust Statistics. Recalling that the sample mean n^{-1} Σ_{j=1}^n X_j is an estimator for the expectation of a parent distribution F, in order to protect against outliers the sample mean may be replaced by a weighted average of the data in which the influence of extreme order statistics is trimmed. More precisely, let X_{1:n} ≤ ... ≤ X_{n:n} again denote the set of data X_1, ..., X_n arranged in increasing order. Let W_{in} be generated by a function g defined on the unit interval (0,1):

    W_{in} = (1/n) g(i/(n+1)),   1 ≤ i ≤ n.

Then the statistic

    T_n := Σ_{i=1}^n W_{in} X_{i:n} = (1/n) Σ_{i=1}^n g(i/(n+1)) X_{i:n}

is called an L (= linear)-statistic. T_n is the empirical version of

    T(F) = ∫_0^1 g(u) F^{-1}(u) du.

In the above version i/n was replaced by i/(n+1) since from time to time one considers functions g only defined on the open unit interval (0,1) such that g(u) tends to infinity when u → 0 or u → 1.

The choice of g ≡ 1 leads to the sample mean. If, for 0 < a < 1/2, we set

    g(x) = (1 − 2a)^{-1} 1_{[a, 1−a]}(x),

we obtain

    T_n = (n(1 − 2a))^{-1} Σ_{i=⌈(n+1)a⌉}^{⌊(n+1)(1−a)⌋} X_{i:n},

which is closely related to the a-trimmed mean

    (⌊(n+1)(1 − 2a)⌋ + 1)^{-1} Σ_{i=⌈(n+1)a⌉}^{⌊(n+1)(1−a)⌋} X_{i:n}.
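As an illustration, here is a small Python sketch of the L-statistic T_n = n^{-1} Σ g(i/(n+1)) X_{i:n}; the helper name `l_statistic` and the Cauchy sample are our own choices, not from the text. With g ≡ 1 it reproduces the sample mean, and with the trimming weight it discards the extreme order statistics:

```python
import numpy as np

def l_statistic(x, g):
    """L-statistic (1/n) * sum_i g(i/(n+1)) * X_{i:n} for a weight function g on (0,1)."""
    xs = np.sort(x)
    n = len(xs)
    u = np.arange(1, n + 1) / (n + 1)
    return float(np.mean(g(u) * xs))

rng = np.random.default_rng(3)
x = rng.standard_cauchy(size=1001)      # heavy tails: the plain mean is unreliable

# g == 1 recovers the sample mean
assert abs(l_statistic(x, lambda u: np.ones_like(u)) - x.mean()) < 1e-8

# trimming weight g = 1_{[a, 1-a]} / (1 - 2a): extreme order statistics get weight 0
a = 0.1
trimmed = l_statistic(x, lambda u: ((u >= a) & (u <= 1 - a)) / (1 - 2 * a))
```

For Cauchy data the trimmed statistic stays stable while the sample mean does not.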

Example 1.11.2. Let X_1, X_2, ..., X_n be a sample of real-valued data. E.g., X_i may be the disease-free survival time of patient i after surgery. One may then split a time interval into disjoint sub-intervals I_1, ..., I_k with lengths h_1, ..., h_k, respectively. Put, for x ∈ I_j, 1 ≤ j ≤ k,

    f_n(x) = (n h_j)^{-1} Σ_{i=1}^n 1_{I_j}(X_i).

This function f_n is called a histogram. Note that f_n(x) ≥ 0 and ∫ f_n(x) dx = 1. Hence f_n is an (empirical) density. The weight of X_i depends on the location of x, i.e., on the cell I_j(x) = I_j which contains x:

    W_{in}(x) = (n h_j)^{-1} 1_{{X_i ∈ I_j}}.

Figure 1.11.1: Example of a histogram based on n = 100 observations, with h_j = 1/2


Note that f_n very much depends on the choice of the cells I_j. A slight change of the I_j's may result in a dramatic change of f_n if many data are located close to the boundary of a cell.

Example 1.11.3. This example is similar to the previous one but from multivariate statistics. Assume we observe bivariate data (X_i, Y_i), 1 ≤ i ≤ n. One may be interested in the mean m(x) = E[Y | X = x] of Y given that X attains a specified value x, i.e., in the regression of Y at x. Since there may be no X_i with X_i = x, we may open a window with center x and length 2h and average the Y_i's for which the X_i's fall into [x − h, x + h]. This leads to

    m_n(x) = Σ_{i=1}^n Y_i 1_{[x−h,x+h]}(X_i) / Σ_{i=1}^n 1_{[x−h,x+h]}(X_i) = Σ_{i=1}^n W_{in}(x) Y_i,

where

    W_{in} = W_{in}(x) = 1_{[x−h,x+h]}(X_i) / Σ_{j=1}^n 1_{[x−h,x+h]}(X_j)

is the weight of Y_i, depending on x, h and the locations of all X_j. For small h > 0 it may happen that the denominator of W_{in} equals zero, so that W_{in} is not well-defined. For large h's we face the risk that we lose the local flavor of the data. So the question is how to choose a good h. Also note that each window is located around x and no pre-specified choice of cells as for histograms is required.
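A sketch of the window estimator m_n(x) (the data are hypothetical; returning `None` mirrors the remark that m_n is undefined when the window is empty):

```python
import numpy as np

def window_regression(x0, X, Y, h):
    """Local average m_n(x0) of the Y_i with X_i in [x0-h, x0+h]; None if the window is empty."""
    w = np.abs(X - x0) <= h
    if not w.any():
        return None                     # denominator of W_in is zero: m_n undefined here
    return float(Y[w].mean())

rng = np.random.default_rng(4)
X = rng.uniform(-2.0, 2.0, size=500)
Y = np.sin(X) + 0.1 * rng.normal(size=500)     # true regression function m(x) = sin(x)

m_hat = window_regression(0.5, X, Y, h=0.2)
assert abs(m_hat - np.sin(0.5)) < 0.15         # close to m(0.5) for this sample
```

Shrinking h reduces the bias but increases the chance of an empty (or nearly empty) window, which is exactly the bandwidth trade-off described above.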

Example 1.11.4. This example describes an important situation where the available data are incomplete. It is taken from survival analysis. The following figure describes a situation where patients enter a study at the time of surgery, say. Then one may be interested in the disease-free survival time. Due to unexpected losses, or since the study has to be terminated at time T, not all survival times Y_i are available. Rather, some only exist in censored form, i.e., rather than Y_i the time C_i spent in the study is observed.

Figure 1.11.2: Data censored from the right. Patients 1-5 enter at the time of surgery; for patients 1 and 2 the survival times Y_1, Y_2 are observed, while for patients 3, 4 and 5 only the censored times C_3, C_4, C_5 are available, the study ending at T.

In our example Y1 and Y2 are observable. For cases 3 and 5 we observe the
times spent until the end of the study, while patient 4 dropped out of the
study before the end.
Though some data are censored, they provide important information in that
it is known that a patient survived at least for some time. The question then
becomes one of how to properly weight each datum.
In the following example the situation is even worse since some of the data
may be completely lost.
Example 1.11.5. In retrospective AIDS data analysis (e.g., transfusion-associated AIDS) one may be interested in the incubation period X_i = W_i − S_i, where S_i is the (calendar) time of infection and W_i is the time of diagnosis. Typically, there is a time T when the study has to be terminated. A case is only included if W_i ≤ T, i.e., X_i ≤ T − S_i =: Y_i. In this case both X_i and Y_i are observed. If X_i > Y_i, the data are unknown and the case is not included. One says that X_i is truncated by Y_i from the right. Consequently, the sample size n is unknown. Again the question is how to weight the available data.
Example 1.11.6. Sometimes it may be known that the observations X_1, ..., X_n come from a distribution function F satisfying some constraint, e.g.,

    EX = ∫ x F(dx) = 0.

In regression or time series analysis, the dependent variables are decomposed into an input (resp. prognostic) part and noise ε, where by construction ε has expectation zero. In a residual analysis, when X_i is the i-th residual, it is usually not guaranteed that ∫ x F_n(dx) = 0, so that one may wish to replace F_n with some F̃_n satisfying ∫ x F̃_n(dx) = 0. In most cases F̃_n will also have masses at X_1, ..., X_n, but different from n^{-1}. So-called Empirical Likelihood methods aim at finding and analyzing (modified) empirical distribution functions satisfying such constraints.
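To make this concrete, here is a hedged sketch of one standard empirical-likelihood form of such a reweighting: weights p_i = 1/(n(1 + λX_i)), with λ chosen by bisection so that Σ p_i X_i = 0. The function name and the simulated residuals are our own illustration, and zero must lie strictly between min X_i and max X_i:

```python
import numpy as np

def el_weights(x, tol=1e-12):
    """Weights p_i = 1/(n(1 + lam*x_i)), with lam chosen so that sum(p_i * x_i) = 0.

    Requires min(x) < 0 < max(x); all weights then stay positive."""
    n = len(x)
    g = lambda lam: np.sum(x / (1.0 + lam * x))       # strictly decreasing in lam
    lo = -1.0 / x.max() + 1e-10                       # g(lo) > 0
    hi = -1.0 / x.min() - 1e-10                       # g(hi) < 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    lam = 0.5 * (lo + hi)
    return 1.0 / (n * (1.0 + lam * x))

rng = np.random.default_rng(5)
x = np.append(rng.normal(loc=0.3, size=18), [-1.5, 1.5])   # residuals, sample mean != 0
p = el_weights(x)
assert abs(p.sum() - 1.0) < 1e-8          # a proper distribution on the data points
assert abs(np.sum(p * x)) < 1e-8          # ... satisfying the mean-zero constraint
```

Note that Σ p_i = 1 holds automatically at the solving λ, since Σ 1/(1+λx_i) = n − λ Σ x_i/(1+λx_i) = n.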
Example 1.11.7. In the situation to be discussed now, the unknown parameter of interest is not F or any of its functionals but the population size. Assume one is interested in the number n of unknown species to be investigated. To get some initial information, one has to observe (and hopefully detect) a (random) number of species within a pre-specified period of time. We see that, as in Example 1.5.5, the statistical inference on the total number n of unknown species may present a challenging problem.
Example 1.11.8. Our final example is concerned with possible structural changes. Though many times it is assumed that a sample has been drawn from the same population under identical conditions, in the real world there is no guarantee that this is the case. E.g., given a set of observations X_1, ..., X_n, there may be an unknown index k with 1 < k ≤ n such that for 1 ≤ j ≤ k the distribution of X_j is different from that of X_j for k < j ≤ n. The unknown k is then called a changepoint.
Conclusions. The main goal of this introductory chapter was first to introduce the important concept of a Dirac. Appropriately weighted, this led us to general empirical distributions (kernels). Associated empirical integrals then form a flexible and rich class of elementary statistics. For univariate data order statistics are obtained after ordering the data. Ranks describe the position of an observation in the full sample or the relative position of one sample w.r.t. another. At the end it has become clear that quantities computable from empirical d.f.'s are just statistical functionals T = T(G) evaluated at G = F_n, while the target is the same term evaluated at the unknown parent d.f. G = F.

In such a situation T(F) is the parameter of interest describing some characteristic of the parent population. The examples discussed in Sections 1.4-1.10 then showed that very often we are not only interested in a parameter. Rather, terms like T(x, G), T(p, G) or T(t, G) are functions and not parameters. The discussion in Section 1.10 revealed that the function of interest may be characterized as being the (unique) solution of an integral equation. Adopting the notation from calculus, the associated T may henceforth be called a statistical operator.

Things are changing if we are interested in local quantities rather than functionals of the cumulative d.f. F. Statistical methodology then faces so-called ill-posed problems requiring some smoothing (or regularization) to make the inversion of operators feasible.

Sometimes data may be incomplete or missing, which requires a new weighting. In other cases, reweighting will be necessary in order to fulfill some constraints. As our example on unknown species or the problem from changepoint analysis have revealed, there are situations when the target is not a function of F but some other parameter of interest, like the size of a population.

So far, no distributional assumptions on the observations X_1, ..., X_n, like independence, were made. Before we do that, it is necessary to have a closer look at the situation when the sample size equals n = 1.


Chapter 2

The Single Event Process


2.1    The Basic Process

The whole chapter is devoted to a detailed analysis of the empirical d.f. on the real line when the sample size equals n = 1.
Definition 2.1.1. Let X be a real-valued random variable defined on a probability space (Ω, A, P). Then we call the stochastic process

    S_t = 1_{{X ≤ t}},   t ∈ R,

the Single Event Process.

This process equals F_1. Hence, all the properties established for a general F_n in Section 1.3 apply also here. In particular, S_t equals zero for t < X and one for t ≥ X. The jump size at X equals one. Let F denote the d.f. of X. For each fixed t, the variable S_t is a 0-1 or Bernoulli variable with expectation

    E(S_t) = P(X ≤ t) = F(t)                                       (2.1.1)

and variance

    Var(S_t) = F(t)(1 − F(t)).                                     (2.1.2)

The variance attains its maximum when F(t) = 1/2, i.e., at the median. As we move to the left or to the right, the variance declines. It vanishes whenever F(t) = 0 or F(t) = 1, i.e., outside the support of F.

In the next lemma we present the covariance structure of S ≡ {S_t : t ∈ R}.

Lemma 2.1.2. For s ≤ t we have

    Cov(S_s, S_t) = F(s)(1 − F(t)).                                (2.1.3)

Proof. We have

    Cov(S_s, S_t) = E(S_s S_t) − E(S_s) E(S_t) = F(s) − F(s) F(t),

whence the result. For the last equality note that

    S_s S_t = 1_{{X ≤ s}} 1_{{X ≤ t}} = 1_{{X ≤ s, X ≤ t}} = 1_{{X ≤ s}}.   ∎

For a general pair s, t of real numbers the covariance equals

    Cov(S_s, S_t) = F(s ∧ t) − F(s) F(t),                          (2.1.4)

where s ∧ t = min(s, t).
For arbitrary measurable functions φ, φ_1 and φ_2 we similarly get the following elementary but basic equations.

Lemma 2.1.3. Provided all integrals exist, we have

    E(φ(X)) = ∫ φ dF,
    Cov(φ_1(X), φ_2(X)) = ∫ φ_1 φ_2 dF − ∫ φ_1 dF ∫ φ_2 dF.

By the Cauchy-Schwarz inequality

    ( ∫ φ_1 φ_2 dF )² ≤ ∫ φ_1² dF ∫ φ_2² dF.

Hence, the integral ∫ φ_1 φ_2 dF exists whenever φ_1, φ_2 ∈ L_2(F), the space of F-square-integrable functions on the real line. If we set

    ⟨φ_1, φ_2⟩ ≡ ∫ φ_1 φ_2 dF,

we obtain a scalar product on the space L_2(F). The functions φ_1, φ_2 are called orthogonal if and only if

    ∫ φ_1 φ_2 dF = 0.

If, furthermore, they are also centered, i.e.,

    ∫ φ_1 dF = 0 = ∫ φ_2 dF,

then

    Cov(φ_1(X), φ_2(X)) = 0.                                       (2.1.5)

(2.1.5) implies that also the correlation between φ_1(X) and φ_2(X) equals zero.

Orthogonality is obtained for all F if φ_1 φ_2 ≡ 0. E.g., if φ_i = 1_{B_i} for i = 1, 2 and B_1 ∩ B_2 = ∅, then

    1_{B_1} 1_{B_2} = 1_{B_1 ∩ B_2} = 1_∅ = 0.

Other φ's may be orthogonal only w.r.t. a special F but not w.r.t. others.
Having studied some elementary properties of the Single Event Process, we now begin to analyze the probabilistic structure of the whole process. The distribution of S is uniquely determined by its finite-dimensional distributions (fidis). For this we fix any finite collection of t's, say t_1 < t_2 < ... < t_k < t_{k+1}. To study the distribution of (S_{t_1}, S_{t_2}, ..., S_{t_{k+1}}) we note that, e.g.,

    P(S_{t_{k+1}} = 1 | S_{t_1} = 0, ..., S_{t_k} = 0)
        = P(X ≤ t_{k+1}, X > t_k, ..., X > t_1) / P(X > t_k, ..., X > t_1)
        = P(X ≤ t_{k+1}, X > t_k) / P(X > t_k) = P(S_{t_{k+1}} = 1 | S_{t_k} = 0).

Similar results hold for other possible past values of S. So we may conclude the following lemma.

Lemma 2.1.4. The Single Event Process is a Markov process with transition probabilities, for s < t,

    P(S_t = 1 | S_s = 1) = 1,
    P(S_t = 0 | S_s = 1) = 0,
    P(S_t = 1 | S_s = 0) = (F(t) − F(s)) / (1 − F(s)),
    P(S_t = 0 | S_s = 0) = (1 − F(t)) / (1 − F(s)).


We take this opportunity to also introduce the pertaining increments. Again, consider finitely many t_0 < t_1 < t_2 < ... < t_{k+1} in increasing order. Then the associated increments (of the S-process) are given by

    S_{t_0}, S_{t_1} − S_{t_0}, ..., S_{t_{k+1}} − S_{t_k}.

Note that

    ΔS_{t_i} ≡ S_{t_i} − S_{t_{i−1}} = 1_{{t_{i−1} < X ≤ t_i}}

and the sets B_i = (t_{i−1}, t_i] are mutually disjoint. Hence, the increments of the Single Event Process are orthogonal, with expectations

    F(t_i) − F(t_{i−1}) ≡ ΔF_{t_i}.

Conclude that

    Cov(ΔS_{t_i}, ΔS_{t_j}) = −ΔF_{t_i} ΔF_{t_j}   for i ≠ j.      (2.1.6)

Empirical d.f.'s are traditionally encountered as estimators of unknown d.f.'s F. The value F_n(t) may then be taken as a substitute for the probability F(t) that in a future draw from the same population the value of X does not exceed t. When one applies empirical d.f.'s in this context one assumes that, apart from the previously obtained data X_1, ..., X_n, no further information on X is available.

Very often, however, the situation is different. To give a simple example, assume that X represents the disease-free survival time of a patient after surgery. Denote with 0 the time of surgery. Then at time 0 < s < t some information on X may already be available. For example, if X ≤ s, then necessarily X ≤ t, while if X > s the patient is still at risk. Given this information the quantity of interest is no longer F(t) but

    P(X ≤ t | X > s) = (F(t) − F(s)) / (1 − F(s)).

See Lemma 2.1.4. If rather than in the event {X ≤ t} we are interested in any function φ(X) of X and want to encounter the information up to s, we have to introduce the natural filtration

    F_s = σ({X ≤ u} : u ≤ s).

Then (F_s)_s is increasing in s with S being adapted to F, i.e., for each s the variable S_s is measurable w.r.t. F_s. Actually, this is the smallest filtration


making S an adapted process. In a real-world situation, at time s, there may also be covariables available, so that in this case the available information is summarized in a filtration G_s with G_s ⊇ F_s. In this chapter, however, we stick to F_s.

In the following, we fix an integrable function φ(X) of X. Set

    X_t = E[φ(X) | F_t],

a martingale. If we set F_{−∞} = {∅, Ω}, then

    X_{−∞} = E[φ(X)] = ∫ φ dF,

the target we discussed before. For other t's, X_t represents a predictor of φ(X) given the information up to time t. This comment aims at changing our view of things a bit, in that the former target ∫ φ dF now becomes only the starting point of a whole family X_t of quantities which focus on the true but possibly unobserved value of X rather than a population parameter. As t increases, X_t incorporates the available information, so that one hopefully comes up with a better updated predictor of φ(X). The last remark is intended to point out the dynamic character of X_t.
In the next lemma we present an explicit expression for X_t.

Lemma 2.1.5. We have, for all t with F(t) < 1,

    X_t = φ(X) 1_{{X ≤ t}} + 1_{{X > t}} ∫_{(t,∞)} φ(u) F(du) / (1 − F(t)).

Proof. Note that both summands are measurable w.r.t. F_t. Actually, the second term is just 1 − S_t multiplied with a deterministic factor. As to the first, if X ≤ t, we know the precise value of X, since F_t contains all {X ≤ u} with u ≤ t. Knowing, for each u ≤ t, whether {X ≤ u} or {X > u} occurred, is tantamount to knowing the value of X. Finally, using the Markov property of S, it remains to show that X_t has the same expectation as φ(X) on both {X ≤ t} and {X > t}. But this is obvious.   ∎

The prediction error φ(X) − X_t equals

    φ(X) − X_t = φ(X) 1_{{X>t}} − 1_{{X>t}} ∫_{(t,∞)} φ(u) F(du) / (1 − F(t))
               = 1_{{X>t}} [ φ(X) − ∫_{(t,∞)} φ(u) F(du) / (1 − F(t)) ],

whence

    E[φ(X) − X_t]² = ∫_{{X>t}} [ φ(X) − ∫_{(t,∞)} φ dF / (1 − F(t)) ]² dP ≤ ∫_{{X>t}} φ²(X) dP.

As t → ∞, this upper bound tends to zero. Also rates of convergence are easily available. E.g., if ∫ |X| φ²(X) dP < ∞, then (for t > 0)

    ∫_{{X>t}} φ²(X) dP ≤ (1/t) ∫_{{X>t}} |X| φ²(X) dP = O(t^{-1}).

If, in Lemma 2.1.5, we set φ(X) = 1_{{X>s}} for t < s, we obtain

    E[1_{{X>s}} | F_t] = 1_{{X>s}} 1_{{X≤t}} + 1_{{X>t}} (1 − F(s)) / (1 − F(t))
                       = 1_{{X>t}} (1 − F(s)) / (1 − F(t)).

From this we immediately obtain

Lemma 2.1.6. The process

    t ↦ 1_{{X > t}} / (1 − F(t))

is a (forward) martingale w.r.t. the natural filtration. Each variable has expectation one.


With similar arguments one can show that the process

    t ↦ 1_{{X ≤ t}} / F(t)

is a martingale in reverse time. It is interesting to look at X_t if the variable to be integrated is not a deterministic function φ(X) of X, but any (integrable) random variable Y. If m denotes the regression function of Y w.r.t. X, we obtain

    X_t ≡ E[Y | F_t] = E{ E[Y | X] | F_t } = E[m(X) | F_t],

i.e., Lemma 2.1.5 applies with φ = m. Cf. the function I in Section 1.4.

2.2    Distribution-Free Transformations

In nonparametric statistics no particular assumption about the underlying distribution function F will be imposed. As a computational drawback, critical regions of test statistics or distributions of estimators may depend on F and could therefore be difficult to obtain. While with high-speed computers at hand Monte Carlo techniques may nowadays be employed to overcome these problems, beginning in the 1940s an increasing scepticism about the appropriateness of small parametric models led to an enormous interest in statistical methodology which was "distribution-free" under broad model assumptions. As we shall see later, the ideas elaborated then are useful in our context and will also help to motivate more advanced new statistical approaches.

Generally, a distribution-free transformation aims at transforming a variable (an observation) X into another one which has a known distribution. In this section only real-valued X will be considered. The most important transformation is the one which comes up with a uniformly distributed variable.
Definition 2.2.1. The uniform distribution (on the unit interval [0,1]) is given through its d.f.

    F_U(t) = { 0   for t < 0
             { t   for 0 ≤ t ≤ 1
             { 1   for 1 < t.


Hence, a random variable from this distribution can only take on values in (0,1) (with probability one). Write U ∼ U[0,1].

The other important reference distribution is the exponential distribution.

Definition 2.2.2. The exponential distribution with parameter λ > 0 is given through

    1 − F(t) = { 1          for t < 0
               { exp[−λt]   for t ≥ 0.                             (2.2.1)

We shall write X ∼ Exp(λ) whenever X has d.f. (2.2.1). Note that such an X is nonnegative with probability one. The hazard function pertaining to this F equals the constant function λ(x) ≡ λ. The following lemma reveals a way how to generate a variable X with d.f. F.
Lemma 2.2.3. Let U be uniform on [0,1]. Then

    X := F^{-1}(U) ∼ F.                                            (2.2.2)

Proof. For each real t we have from the definition of F^{-1} that

    {F^{-1}(U) ≤ t} = {U ≤ F(t)},

whence

    P(X ≤ t) = P(U ≤ F(t)) = F(t).   ∎
While Lemma 2.2.3 states that a random variable X ∼ F equals F^{-1}(U) in distribution, the following lemma shows that any random variable X ∼ F may be written as

    X = F^{-1}(U)   with probability one

and not only in distribution, provided the underlying probability space (Ω, A, P) is rich enough to carry appropriate random variables independent of X. This will be assumed throughout without further mentioning.

Lemma 2.2.4. Assume that X ∼ F. Then, for an appropriate uniform U, we have

    X = F^{-1}(U) with probability one.

Proof. Denote with A = {a} the set of atoms of F, i.e., the set of points such that F{a} > 0. There are at most countably many atoms, if there are any. Let V be independent of X and uniformly distributed on [0,1]. Set

    U = { F(X)              if X ∉ A
        { F(a−) + F{a} V    if X = a ∈ A.

Check that U ∼ U[0,1] and X = F^{-1}(U).   ∎
Figure 2.2.1: X-raindrops as the U-clouds hit the F-mountain


What Lemma 2.2.4 points out and the corresponding Figure 2.2.1 depicts is the following: we may interpret the observed X-data as raindrops falling to the earth. The location of the drops depends on the height of the U-clouds. Once these are blown to the right they hit the F-mountain (graph) in different locations. These are in charge of the actual values of the X's. The experimenter or observer only has access to the X-world but not to the U-sky. Also the shape of the mountain (shape of F) remains unknown. More or less such a picture, appropriately modified, applies to other situations in statistics, i.e., the goal is to analyze what is available to get some idea of what is behind the stage. We also see why we need the infimum in the definition (1.4.1) of F^{-1}. It guarantees that once a U-cloud (like U_4, U_5) has started on a level which does not yield a point to drop down, the gate (hole) between F(x−) and F(x) in the F-graph needs to be closed. This is exactly what the infimum does.


For most distributions, the quantile function does not allow for a simple analytic expression, so that Lemma 2.2.3 is of limited value if one really wants to generate X through U and F^{-1}. For the exponential distribution, however,

    u = F(t) = 1 − exp(−λt)

immediately yields

    F^{-1}(u) = −λ^{-1} ln(1 − u),

so that

    X := −λ^{-1} ln(1 − U) ∼ Exp(λ).                               (2.2.3)

Lemma 2.2.4 and (2.2.3) now lead to the desired distribution-free transformations. Recall

    Λ_F(t) = ∫_{(−∞,t]} F(dx) / (1 − F(x−)),

the cumulative hazard function of F.

Lemma 2.2.5. Assume that X ∼ F and F is continuous. Then

  (i)  F(X) ∼ F_U

and

  (ii) Λ_F(X) ∼ Exp(1).

Proof. By Lemma 2.2.4 we have X = F^{-1}(U). By continuity of F,

    F ∘ F^{-1}(u) = u   for all 0 < u < 1.

Hence

    F(X) = U,

which proves (i). For the second statement, we obtain under continuity

    Λ_F(t) = ∫_{(−∞,t]} F(dx) / (1 − F(x)) = −ln[1 − F(t)].        (2.2.4)

Since X = F^{-1}(U), it follows that

    Λ_F(X) = −ln[1 − F(X)] = −ln[1 − U] ∼ Exp(1),

by (2.2.3).   ∎
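Both transformations can be checked by simulation. A sketch using the Weibull-type d.f. F(t) = 1 − exp(−t²) (our choice for illustration): F(X) should behave like a uniform variable, and Λ_F(X) = −ln[1 − F(X)] like an Exp(1) variable:

```python
import numpy as np

rng = np.random.default_rng(7)
U = rng.uniform(size=200_000)
X = np.sqrt(-np.log(1.0 - U))     # X ~ F with F(t) = 1 - exp(-t^2), t >= 0

V = 1.0 - np.exp(-X**2)           # F(X): should be uniform on [0,1]
E = -np.log(1.0 - V)              # Lambda_F(X): should be Exp(1)

assert abs(V.mean() - 0.5) < 0.005        # uniform mean 1/2
assert abs(V.var() - 1.0 / 12.0) < 0.005  # uniform variance 1/12
assert abs(E.mean() - 1.0) < 0.02         # Exp(1) mean
assert abs(E.var() - 1.0) < 0.03          # Exp(1) variance
```

Here E equals X² exactly, which is indeed −ln(1 − U) ∼ Exp(1), as the lemma predicts.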

Lemma 2.2.4 has some interesting consequences for the representation of the Single Event Process. Actually, since X = F^{-1}(U), we have

    S(t) ≡ S_t = 1_{{X ≤ t}} = 1_{{U ≤ F(t)}}.                     (2.2.5)

If we denote with S^X and S^U the S-process for X and U, then (2.2.5) states that

    S^X(t) = S^U(F(t)).                                            (2.2.6)

The process S^U will be restricted to [0,1], since for t outside the unit interval S^U is constant (= 0 resp. 1) and therefore uninteresting.

Equation (2.2.6) shows that the study of S^X can be traced back to that of S^U. Actually, S^X equals S^U after a proper time transformation through F. When we observe X, only the left-hand side of (2.2.6) is known. Neither U nor F is available. Still, (2.2.6) turns out to be a basic equality for understanding and studying many distribution-free procedures in statistics. It is responsible for the special role played by the uniform distribution in nonparametric statistics. Therefore our next section will be devoted to the study of U and the associated S^U.

2.3    The Uniform Case

In the previous section we have studied the connection between X and U, a uniform variable. The density of U equals

    f^U(t) = { 1   for 0 ≤ t ≤ 1
             { 0   elsewhere.

f^U is symmetric about 1/2. The associated cumulative hazard function equals

    Λ^U(t) = ∫_0^t (1 − x)^{-1} dx = −ln(1 − t),   0 ≤ t < 1.


It tends to +∞ as t → 1. The k-th moment equals

    ∫_0^1 t^k dt = 1/(k+1).

In particular, the mean equals 1/2, and for the variance we obtain

    Var U = 1/3 − 1/4 = 1/12.

When we study S^U and later the uniform empirical d.f. F_n^U, we also have to study the space L_2(F^U). In particular, we are looking for systems of orthonormal functions spanning relevant subspaces of L_2(F^U).

Lemma 2.3.1. Consider the functions

    φ_j(t) = √2 sin(jπt),   0 ≤ t ≤ 1, j ∈ N.

Then {φ_j} is an orthonormal system of functions in L_2(F^U).


Proof. For each j ∈ N we have

    ∫_0^1 φ_j²(t) dt = 2 [ t/2 − cos(jπt) sin(jπt)/(2jπ) ]_0^1 = 1,

while

    ∫_0^1 sin(jπt) sin(kπt) dt = [ sin[(j−k)πt]/(2(j−k)π) ]_0^1 − [ sin[(j+k)πt]/(2(j+k)π) ]_0^1
                               = 0   for j ≠ k.   ∎

Since the functions φ_j vanish at zero and one, they are not candidates for a basis in L_2(F^U). If, however, we restrict ourselves to the subspace of all functions in L_2(F^U) vanishing at the boundary of [0,1], these functions are indeed candidates. Coming back to S^U, we have S^U(0) = 0 but S^U(1) = 1. To obtain a process which also vanishes at t = 1, we set

    ᾱ_1(t) = S_t^U − E(S_t^U) = 1_{{U ≤ t}} − t,   0 ≤ t ≤ 1.


The upper bar stands for "uniform". This process is centered and vanishes at $t = 0$ and $t = 1$. It has the same covariance structure as $S^U$, see (2.1.4), namely
$$ \operatorname{Cov}(\bar\alpha_1(s), \bar\alpha_1(t)) = s \wedge t - st, \qquad 0 \le s, t \le 1. $$

As a function of $t$, it equals
$$ \bar\alpha_1(t) = \begin{cases} -t & \text{for } 0 \le t < U \\ 1-t & \text{for } U \le t \le 1. \end{cases} $$
The sample path thus has the shape of a sawtooth.


Figure 2.3.1: A sawtooth-type path of $\bar\alpha_1$

As a centered version of the Single Event Process, the process $\bar\alpha_1$ will play a dominant role in our analysis. It is called the (uniform) Empirical Process (for sample size one). The extension to larger $n$ will be introduced and discussed in Chapter 3.
We will show later that the system $\{\varphi_j\}$ is a complete system of orthonormal functions in the subspace of all functions in $L_2(F^U)$ vanishing at zero and one. From this we get

$$ \bar\alpha_1 = \sum_{j=1}^{\infty} \langle \bar\alpha_1, \varphi_j \rangle\, \varphi_j, \qquad (2.3.1) $$


the Fourier representation of $\bar\alpha_1$. The Fourier coefficients $\langle \bar\alpha_1, \varphi_j \rangle$ equal
$$ \langle \bar\alpha_1, \varphi_j \rangle = \int_0^1 \bar\alpha_1(t)\varphi_j(t)\,dt = -\int_0^U t\,\varphi_j(t)\,dt + \int_U^1 (1-t)\varphi_j(t)\,dt $$
$$ = -\int_0^1 t\,\varphi_j(t)\,dt + \int_U^1 \varphi_j(t)\,dt = \sqrt{2}\,\frac{\cos(j\pi U)}{j\pi} = \frac{\sqrt{2}}{j\pi}\,\psi_j(U), $$
with
$$ \psi_j(x) = \cos(j\pi x), \qquad 0 \le x \le 1. $$

The functions $\psi_1, \psi_2, \ldots$ satisfy certain properties which make the orthonormal representation (2.3.1) interesting for statistical applications.
Lemma 2.3.2. We have, for each $j \ge 1$,
$$ \mathbb{E}\,\psi_j(U) = \int_0^1 \psi_j(x)\,dx = 0 $$
and, for $i \ne j$,
$$ \mathbb{E}[\psi_i(U)\psi_j(U)] = \int_0^1 \psi_i(x)\psi_j(x)\,dx = 0. $$

Proof. The first statement follows from direct integration, while the second is a direct consequence of
$$ \int_0^1 \cos(i\pi x)\cos(j\pi x)\,dx = \left[\frac{\sin[(i-j)\pi x]}{2(i-j)\pi} + \frac{\sin[(i+j)\pi x]}{2(i+j)\pi}\right]_0^1 = 0. \qquad\Box $$


We see from Lemma 2.3.2 that the Fourier coefficients
$$ \langle \bar\alpha_1, \varphi_j \rangle = \frac{\sqrt{2}}{j\pi}\,\psi_j(U) \qquad (2.3.2) $$


are uncorrelated and centered. For the variance we get
$$ \operatorname{Var}(\langle \bar\alpha_1, \varphi_j \rangle) = \frac{2}{j^2\pi^2}\int_0^1 \psi_j^2(x)\,dx = \frac{2}{j^2\pi^2}\left[\frac{\cos(j\pi t)\sin(j\pi t)}{2j\pi} + \frac{t}{2}\right]_0^1 = \frac{1}{j^2\pi^2}. $$
Hence, the variances decrease at the rate $j^{-2}$. We may approximate $\bar\alpha_1$ by the series in (2.3.1) truncated at some integer $k$:
$$ \bar\alpha_1^k = \sum_{j=1}^{k} \langle \bar\alpha_1, \varphi_j \rangle\, \varphi_j. $$

For the difference $\bar\alpha_1 - \bar\alpha_1^k$ we get, by the orthonormality of the $\varphi_j$'s and Parseval's identity,
$$ \int_0^1 [\bar\alpha_1(t) - \bar\alpha_1^k(t)]^2\,dt = \sum_{j=k+1}^{\infty} \langle \bar\alpha_1, \varphi_j \rangle^2. $$
Its expectation equals
$$ \sum_{j=k+1}^{\infty} \frac{1}{j^2\pi^2} = O\!\left(\frac{1}{k}\right). $$
For $k = 0$ we get
$$ \int_0^1 \bar\alpha_1^2(t)\,dt = \sum_{j=1}^{\infty} \langle \bar\alpha_1, \varphi_j \rangle^2. $$
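The truncation error can be checked numerically for one fixed realization of $U$. The sketch below (our own naming; $U = 0.3$ is an arbitrary illustrative choice) compares the quadrature value of $\int_0^1 [\bar\alpha_1 - \bar\alpha_1^k]^2\,dt$ with the Parseval tail $\|\bar\alpha_1\|_2^2 - \sum_{j \le k} \langle\bar\alpha_1,\varphi_j\rangle^2$, where $\|\bar\alpha_1\|_2^2 = U^3/3 + (1-U)^3/3$ by direct integration of the sawtooth:

```python
import math

U = 0.3  # a fixed realization of the uniform variable (illustrative choice)

def alpha1(t):
    # Sawtooth path: alpha_1-bar(t) = 1{U <= t} - t
    return (1.0 if U <= t else 0.0) - t

def coeff(j):
    # Fourier coefficient <alpha_1-bar, phi_j> = sqrt(2) cos(j pi U) / (j pi), see (2.3.2)
    return math.sqrt(2.0) * math.cos(j * math.pi * U) / (j * math.pi)

def phi(j, t):
    return math.sqrt(2.0) * math.sin(j * math.pi * t)

def l2_error(k, n=20000):
    # Midpoint quadrature of the squared difference to the k-term partial sum
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * h
        approx = sum(coeff(j) * phi(j, t) for j in range(1, k + 1))
        total += (alpha1(t) - approx) ** 2 * h
    return total

def tail(k):
    # Parseval tail: ||alpha_1-bar||^2 - sum of the first k squared coefficients
    full = U ** 3 / 3.0 + (1.0 - U) ** 3 / 3.0
    return full - sum(coeff(j) ** 2 for j in range(1, k + 1))

errors = {k: l2_error(k) for k in (1, 5, 20)}
```

The errors agree with the tail sums and shrink at the predicted rate $O(1/k)$.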

The decomposition (2.3.1) of $\bar\alpha_1$ into orthonormal functions and uncorrelated coefficients has its counterpart in Multivariate Statistics, where we deal with finite-dimensional problems. There the components are called principal components. The vectors forming the orthonormal basis are eigenvectors, and the variances of the coefficients are the eigenvalues of a certain matrix. Here it is similar. Without much further work we shall see later that the functions $\varphi_j$ form an eigenbasis for our subspace of $L_2(F^U)$, and the eigenvalues
$$ \lambda_j = \frac{1}{j^2\pi^2}, \qquad j = 1, 2, \ldots $$
are associated with the covariance kernel
$$ K(s,t) = s \wedge t - st, \qquad 0 \le s, t \le 1. $$


Therefore representation (2.3.1) constitutes the principal component decomposition of the Single Event Process.
It is also interesting to determine the distribution of the (uncorrelated) $\psi_j(U)$.
Lemma 2.3.3. $\psi_1(U), \psi_2(U), \ldots$ all have the same distribution. Their Fourier transform equals the Bessel function of order zero, i.e.,
$$ \mathbb{E}\left[e^{it\cos(j\pi U)}\right] = J_0(t) := \sum_{k=0}^{\infty} \frac{(-1)^k t^{2k}}{(k!)^2 4^k}. $$
Proof. Set
$$ I_1 := \int_0^1 \cos[t\cos(j\pi x)]\,dx $$
and
$$ I_2 := \int_0^1 \sin[t\cos(j\pi x)]\,dx. $$

We first show that
$$ I_1 = \int_0^1 \cos[t\cos(\pi x)]\,dx, \qquad (2.3.3) $$
$$ I_2 = \frac{1}{j}\sum_{k=1}^{j} (-1)^{k-1} \int_0^1 \sin[t\cos(\pi x)]\,dx, \qquad (2.3.4) $$
$$ \int_0^1 \sin[t\cos(\pi x)]\,dx = 0, \qquad (2.3.5) $$
and
$$ \int_0^1 \cos[t\cos(\pi x)]\,dx = J_0(t). \qquad (2.3.6) $$

As to $I_1$, set $u = jx$ and obtain
$$ I_1 = \frac{1}{j}\int_0^j \cos[t\cos(\pi u)]\,du = \frac{1}{j}\sum_{k=1}^{j}\int_{k-1}^{k} \cos[t\cos(\pi u)]\,du. $$


Setting $y = u - (k-1)$, the $k$-th summand becomes
$$ \int_{k-1}^{k} \cos[t\cos(\pi u)]\,du = \int_0^1 \cos[t\cos(\pi(y + k - 1))]\,dy = \int_0^1 \cos\left[(-1)^{k-1}\, t\cos(\pi y)\right]dy = \int_0^1 \cos[t\cos(\pi x)]\,dx. $$

This proves (2.3.3). Similarly, we obtain for $I_2$:
$$ I_2 = \frac{1}{j}\sum_{k=1}^{j}\int_0^1 \sin\left[(-1)^{k-1}\, t\cos(\pi y)\right]dy = \frac{1}{j}\sum_{k=1}^{j} (-1)^{k-1}\int_0^1 \sin[t\cos(\pi x)]\,dx. $$

For (2.3.5), set $u = \cos(\pi x)$, whence
$$ dx = -[\pi\sin(\pi x)]^{-1}\,du = -[\pi(1-u^2)^{1/2}]^{-1}\,du. \qquad (2.3.7) $$
Conclude that
$$ \int_0^1 \sin[t\cos(\pi x)]\,dx = \frac{1}{\pi}\int_{-1}^{1} \sin(tu)(1-u^2)^{-1/2}\,du = 0, $$
where the last equation follows from the fact that the integrand is odd.
Finally,
$$ \int_0^1 \cos[t\cos(\pi x)]\,dx = \frac{1}{\pi}\int_{-1}^{1}\cos(tu)(1-u^2)^{-1/2}\,du = \frac{2}{\pi}\int_0^1 \cos(tu)(1-u^2)^{-1/2}\,du = \frac{2}{\pi}\sum_{k=0}^{\infty}\frac{(-1)^k t^{2k}}{(2k)!}\int_0^1 \frac{u^{2k}\,du}{(1-u^2)^{1/2}}. $$
The first equation uses (2.3.7) again, while the second uses the fact that the integrand is even. For the last equation, just apply the series expansion of the cosine function. To compute the remaining integral, apply (2.3.7) again to get
$$ \int_0^1 \frac{u^{2k}\,du}{(1-u^2)^{1/2}} = \pi\int_0^{1/2}\cos^{2k}(\pi x)\,dx = \pi\,\frac{2k-1}{2k}\int_0^{1/2}\cos^{2(k-1)}(\pi x)\,dx. $$


By recursion, we therefore obtain
$$ \pi\int_0^{1/2}\cos^{2k}(\pi x)\,dx = \frac{\pi}{2}\cdot\frac{1\cdot 3\cdot 5\cdots(2k-1)}{2\cdot 4\cdot 6\cdots 2k} = \frac{\pi}{2}\cdot\frac{(2k)!}{2^k k!\; 2^k k!} = \frac{\pi}{2}\cdot\frac{(2k)!}{4^k (k!)^2}. $$

Summarizing,
$$ \mathbb{E}\left[e^{it\cos(j\pi U)}\right] = \int_0^1 \cos[t\cos(j\pi x)]\,dx + i\int_0^1 \sin[t\cos(j\pi x)]\,dx = \int_0^1 \cos[t\cos(\pi x)]\,dx, $$
by (2.3.3)-(2.3.5), which equals $J_0(t)$ by (2.3.6). This proves the lemma. $\Box$
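Lemma 2.3.3 can be verified numerically: for any $j$, the characteristic function of $\cos(j\pi U)$, computed by quadrature over $[0,1]$, matches the power series of $J_0$. A sketch (our own helper names):

```python
import math

def J0(t, terms=40):
    # Bessel function of order zero via its power series from Lemma 2.3.3
    return sum((-1) ** k * t ** (2 * k) / (math.factorial(k) ** 2 * 4 ** k)
               for k in range(terms))

def char_fn(t, j, n=20000):
    # Real and imaginary parts of E[exp(i t cos(j pi U))], U uniform on (0,1),
    # approximated by a midpoint rule
    h = 1.0 / n
    re = h * sum(math.cos(t * math.cos(j * math.pi * (i + 0.5) * h)) for i in range(n))
    im = h * sum(math.sin(t * math.cos(j * math.pi * (i + 0.5) * h)) for i in range(n))
    return re, im
```

The real part is the same for every $j$ (as the proof of (2.3.3) shows) and the imaginary part vanishes, reflecting (2.3.5).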

Chapter 3

Univariate Empiricals: The IID Case

3.1  Basic Facts

So far we have discussed only those properties of empirical d.f.'s which do not require any distributional assumptions on the underlying observations. Clearly, to obtain, e.g., the distribution of $F_n(t)$ or of any other empirical integral $\int \varphi\,dF_n$, the distributional structure of $X_1, \ldots, X_n$ turns out to be crucial. The situation which is best understood is the case of independent identically distributed (i.i.d.) random variables $X_1, \ldots, X_n$. As always, let $F$ denote the unknown d.f. of each $X_i$.
This chapter is devoted to various problems and questions which may arise in connection with empiricals of i.i.d. real-valued data. In the first section we collect some basic properties of $F_n$ which are straightforward consequences of classical results in probability theory.
Recall the empirical distribution
$$ \mu_n(A) = \frac{1}{n}\sum_{i=1}^{n} 1_{\{X_i \in A\}}. $$
In the i.i.d. case, $n\mu_n(A)$ is a sum of $n$ i.i.d. Bernoulli random variables with success parameter $p = \mu(A)$. Hence the following lemma is straightforward.
Lemma 3.1.1. Assume that $X_1, \ldots, X_n$ are i.i.d. from some distribution $\mu$. For any (measurable) set $A$ we have, for $k = 0, 1, \ldots, n$,
$$ \mathbb{P}(n\mu_n(A) = k) = \binom{n}{k}\,\mu(A)^k(1 - \mu(A))^{n-k}, $$
i.e., $n\mu_n(A) \sim \operatorname{Bin}(n, \mu(A))$, the binomial distribution with parameters $n$ and $p = \mu(A)$.
Note that this result holds true for i.i.d. $X_i$'s which take their values in any sample space, not necessarily the real line, and for any (measurable) set $A$. For extended intervals $A = (-\infty, t]$, this lemma takes on a special form which, for the sake of reference, is listed as part of the following lemma.
Lemma 3.1.2. Assume $X_1, \ldots, X_n$ are i.i.d. from $F$. We then have
$$ \mathbb{P}(nF_n(t) = k) = \binom{n}{k} F^k(t)(1-F(t))^{n-k}, \qquad k = 0, 1, \ldots, n \qquad (3.1.1) $$
$$ \mathbb{E}[F_n(t)] = F(t) \qquad (3.1.2) $$
$$ \operatorname{Var}[F_n(t)] = \frac{1}{n}\, F(t)(1-F(t)) \qquad (3.1.3) $$
$$ \operatorname{Cov}(F_n(s), F_n(t)) = \frac{1}{n}\, F(s)(1-F(t)), \qquad s \le t. \qquad (3.1.4) $$
Since $F_n$ is a normalized sum of independent Single Event Processes, (3.1.2)-(3.1.4) easily follow from (2.1.2), Lemma 2.1.2 and the fact that the variance of a sum of independent random variables equals the sum of their variances. Assertion (3.1.2) states that $F_n(t)$ is an unbiased estimator of $F(t)$.
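The moment formulas (3.1.2) and (3.1.3) can be confirmed exactly from the binomial law (3.1.1) of $nF_n(t)$. A small sketch (our own helper names), computing the mean and variance of $F_n(t)$ directly from the $\operatorname{Bin}(n, p)$ probabilities with $p = F(t)$:

```python
import math

def binom_pmf(n, k, p):
    # P(Bin(n, p) = k)
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def ecdf_moments(n, p):
    # Mean and variance of F_n(t) from the law of nF_n(t) ~ Bin(n, p), p = F(t)
    mean = sum((k / n) * binom_pmf(n, k, p) for k in range(n + 1))
    var = sum((k / n - mean) ** 2 * binom_pmf(n, k, p) for k in range(n + 1))
    return mean, var

m, v = ecdf_moments(50, 0.3)
```

For $n = 50$ and $F(t) = 0.3$ this reproduces $\mathbb{E}[F_n(t)] = 0.3$ and $\operatorname{Var}[F_n(t)] = 0.3 \cdot 0.7 / 50$ up to rounding.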
From (3.1.2) and (3.1.3) we obtain that, as $n \to \infty$,
$$ F_n(t) \to F(t) \quad \text{in } L_2(\Omega, \mathcal{A}, \mathbb{P}), $$
the space of square-integrable functions on $(\Omega, \mathcal{A}, \mathbb{P})$ endowed with the $L_2$-norm
$$ \|\xi\|_2 = \left[\int \xi^2\,d\mathbb{P}\right]^{1/2}, \qquad \xi \in L_2(\Omega, \mathcal{A}, \mathbb{P}). $$
Recalling that
$$ F_n(t) = \frac{1}{n}\sum_{i=1}^{n} 1_{\{X_i \le t\}} $$

is a sample mean of bounded (and hence $\mathbb{P}$-integrable) i.i.d. random variables, the Strong Law of Large Numbers (SLLN) may be applied to get, for each fixed $t \in \mathbb{R}$:
$$ \lim_{n\to\infty} F_n(t) = F(t) \quad \text{with probability one.} \qquad (3.1.5) $$


In other words, $F_n(t)$ is a strongly consistent estimator of $F(t)$.

Figure 3.1.1: Example of $F_{100}$ and its target $F$

In Figure 3.1.1 we have added the true d.f. $F$ to the empirical d.f. $F_{100}$ already depicted in Figure 1.3.1. These data were independently generated from an Exp(1)-distribution. It becomes clear that, apart from small deviations, $F_{100}$ follows the graph of $F$ very closely.
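For concreteness, $F_n$ can be evaluated as a right-continuous step function. A minimal sketch (our own helper, not from the text), counting observations $\le t$ by binary search over the sorted sample:

```python
import bisect

def ecdf(data):
    # Return F_n as a right-continuous step function: F_n(t) = #{i : X_i <= t} / n
    xs = sorted(data)
    n = len(xs)
    return lambda t: bisect.bisect_right(xs, t) / n

Fn = ecdf([2.0, 0.5, 1.5, 0.5])
```

Note that `bisect_right` counts ties correctly, so atoms of the sample produce jumps of height larger than $1/n$, as they should.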
Finally, the de Moivre-Laplace version of the Central Limit Theorem (CLT) yields, for each $t \in \mathbb{R}$, as $n \to \infty$, an approximation for the distribution of a properly standardized $F_n(t)$:
$$ n^{1/2}[F_n(t) - F(t)] \xrightarrow{\mathcal{L}} N(0, \sigma_t^2), \qquad (3.1.6) $$
where $\xrightarrow{\mathcal{L}}$ denotes convergence in law (or distribution), and $N(0, \sigma^2)$ is the normal distribution with expectation zero and variance $\sigma^2$. According to (3.1.3),
$$ \sigma^2 = \sigma_t^2 = F(t)(1 - F(t)). $$
The assertions (3.1.2)-(3.1.3) in Lemma 3.1.2, as well as (3.1.5) and (3.1.6), allow for straightforward extensions to general empirical integrals. To apply the SLLN and CLT, appropriate moment conditions are required, which are automatically fulfilled for indicator functions. See also Lemma 2.1.3 for sample size $n = 1$.
Lemma 3.1.3. Assume that $X_1, \ldots, X_n$ are i.i.d. from a d.f. $F$, and let $\varphi, \varphi_1, \varphi_2 : \mathbb{R} \to \mathbb{R}$ be arbitrary (Borel measurable) functions. Provided that all integrals on the right-hand side exist, we have:
$$ \mathbb{E}\left[\int \varphi\,dF_n\right] = \int \varphi\,dF \qquad (3.1.7) $$
$$ \operatorname{Var}\left[\int \varphi\,dF_n\right] = n^{-1}\left[\int \varphi^2\,dF - \left(\int \varphi\,dF\right)^2\right] \qquad (3.1.8) $$
$$ \operatorname{Cov}\left[\int \varphi_1\,dF_n, \int \varphi_2\,dF_n\right] = n^{-1}\left[\int \varphi_1\varphi_2\,dF - \int \varphi_1\,dF \int \varphi_2\,dF\right] \qquad (3.1.9) $$
$$ \lim_{n\to\infty} \int \varphi\,dF_n = \int \varphi\,dF \quad \text{with probability one} \qquad (3.1.10) $$
$$ n^{1/2}\left[\int \varphi\,dF_n - \int \varphi\,dF\right] \xrightarrow{\mathcal{L}} N(0, \sigma^2) \qquad (3.1.11) $$
with
$$ \sigma^2 = \int \varphi^2\,dF - \left(\int \varphi\,dF\right)^2. $$
All statements are classical results from probability theory properly translated into the language of empiricals.
Equation (3.1.4) reveals the covariance structure of $F_n$ when considered as a stochastic process. To get rid of the factor $n^{-1}$ on the right-hand side, we need to introduce the standardized process
$$ \alpha_n(t) := n^{1/2}[F_n(t) - F(t)], \qquad t \in \mathbb{R}. $$
This process is the so-called empirical process. For sample size $n = 1$ and for the uniform case it was introduced and studied in Section 2.3. In terms of $\alpha_n$ we immediately get from the preceding results:
Lemma 3.1.4. Assume that $X_1, \ldots, X_n$ are i.i.d. from $F$. Then we have:
$$ \mathbb{E}[\alpha_n(t)] = 0 \quad \text{for each } t \in \mathbb{R} \qquad (3.1.12) $$
$$ \operatorname{Cov}[\alpha_n(s), \alpha_n(t)] = F(s)(1 - F(t)) \quad \text{for } s \le t \qquad (3.1.13) $$
$$ \alpha_n(t) \xrightarrow{\mathcal{L}} N(0, F(t)(1 - F(t))) \quad \text{as } n \to \infty. \qquad (3.1.14) $$

Since both $F_n$ and $F$ are distribution functions, we have
$$ \lim_{t\to-\infty} F_n(t) = 0 = \lim_{t\to-\infty} F(t) \quad \text{and} \quad \lim_{t\to\infty} F_n(t) = 1 = \lim_{t\to\infty} F(t). $$
Hence
$$ \lim_{t\to-\infty} \alpha_n(t) = 0 = \lim_{t\to\infty} \alpha_n(t). \qquad (3.1.15) $$
By (3.1.15), we may continuously extend $\alpha_n$ to $\pm\infty$ through
$$ \alpha_n(-\infty) = 0 = \alpha_n(\infty). \qquad (3.1.16) $$

Since, by (3.1.16), $\alpha_n$ vanishes at $\pm\infty$, i.e., on both sides of a "river" called the real line, the empirical process is sometimes characterized as being of bridge type.
We now come back to the distribution-free transformations introduced in Section 2.2. Under independence of $X_1, \ldots, X_n$, we may apply (2.2.2) to each $X_i$ to obtain
$$ X_i = F^{-1}(U_i) \quad \text{for } i = 1, \ldots, n, $$
for an independent sample $U_1, \ldots, U_n$ from $F_U$. As before, we have
$$ X_i \le t \quad \text{if and only if} \quad U_i \le F(t). $$
Conclude that
$$ F_n(t) = \tilde F_n(F(t)), \qquad (3.1.17) $$
where
$$ \tilde F_n(u) = \frac{1}{n}\sum_{i=1}^{n} 1_{\{U_i \le u\}}, \qquad 0 \le u \le 1, $$
is the empirical d.f. of the uniform sample $U_1, \ldots, U_n$. The representation (3.1.17) constitutes the extension of (2.2.6) to sample size $n$. It allows for a reduction to the uniform case. In some instances this turns out to be quite useful. The main advantages here are:
(i) $F_U$ is continuous;
(ii) $F_U$ has compact support $[0, 1]$.
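The representation (3.1.17) is easy to illustrate by simulation: generate uniforms, push them through the quantile transform of a target $F$ (here Exp(1), so $F^{-1}(u) = -\ln(1-u)$ is a concrete choice of ours), and compare the two empirical d.f.'s. A sketch under our own naming:

```python
import bisect
import math
import random

random.seed(1)
n = 200
U = [random.random() for _ in range(n)]
X = [-math.log(1.0 - u) for u in U]  # X_i = F^{-1}(U_i) for F(t) = 1 - exp(-t)

Us, Xs = sorted(U), sorted(X)

def ecdf(sorted_xs, t):
    # F_n(t) = #{i : X_i <= t} / n
    return bisect.bisect_right(sorted_xs, t) / len(sorted_xs)

# Representation (3.1.17): F_n(t) = tilde F_n(F(t)) at a few test points
gaps = [abs(ecdf(Xs, t) - ecdf(Us, 1.0 - math.exp(-t))) for t in (0.1, 0.5, 1.0, 2.0)]
```

Since $F$ is continuous and strictly increasing here, $X_i \le t$ holds exactly when $U_i \le F(t)$, so the two curves agree at every $t$.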


Discontinuities or atoms of $F$ may, from time to time, create some problems. With (3.1.17), these effects are absorbed by the time transformation $F$. The original empirical d.f. $F_n$ and its theoretical counterpart $F$ are defined on the real line, which is not compact. Sometimes, however, it is useful to have a smallest and a largest $t$. This would require compactification of the real line and a continuous extension of $F_n$ and $F$. See (3.1.15) and (3.1.16). To some readers not familiar with these topological aspects, a transformation of the time scale is a welcome way out of the dilemma. Moreover, we shall see soon that such transformations are extremely important to also understand the distributional properties of several other statistics based on $F_n$ and $\alpha_n$.
As to the empirical process, (3.1.17) yields
$$ \alpha_n(t) = \tilde\alpha_n(F(t)), \qquad t \in \mathbb{R}, $$
a composition of the deterministic function $F$ and the random process $\tilde\alpha_n$ defined on the (compact) unit interval.
In our next lemma we derive a representation of the original $X$-order statistics in terms of uniform order statistics.
Lemma 3.1.5. For $0 < u < 1$ we have
$$ F_n^{-1}(u) = F^{-1}(\tilde F_n^{-1}(u)). \qquad (3.1.18) $$
In particular, for $u = i/n$, we get
$$ X_{i:n} = F^{-1}(U_{i:n}), \qquad 1 \le i \le n. $$
Proof. The proof follows from (3.1.17) and the definition of quantile functions. See also Figure 2.2.1. $\Box$

In Chapter 1 we introduced the rank of an observation as a means to describe its position within the sample. Let
$$ R = (R_1, \ldots, R_n) $$
denote the vector of ranks. The following result presents a well-known fact, namely that under some weak assumptions $R$ is distribution-free.
Lemma 3.1.6. Let $R$ be the vector of ranks for an i.i.d. sequence of observations from a continuous d.f. $F$. Then $R$ is distribution-free, i.e., its distribution does not depend on $F$.
Proof. Under continuity,
$$ F(F^{-1}(u)) = u \quad \text{for all } 0 < u < 1. $$
Moreover, $X_i = F^{-1}(U_i)$ for all $1 \le i \le n$. Altogether, this gives us for each $1 \le i \le n$:
$$ R_i = nF_n(X_i) = n\tilde F_n(F(X_i)) = n\tilde F_n(F(F^{-1}(U_i))) = n\tilde F_n(U_i), $$
the second equality following from (3.1.17). We see that $F$ has dropped out. The proof is complete. $\Box$

Lemma 3.1.6 does not specify the distribution of the vector of ranks. It only guarantees that it suffices to consider uniformly distributed observations. Due to continuity, there will be no ties among the data and each rank is well-defined. The vector $R$ attains its values in the set of all permutations of the integers $1, \ldots, n$. The event $\{R = r\}$ corresponds to a unique ordering of the $U$'s. For independent $U$'s each of the $n!$ possible orderings has equal probability, namely
$$ \int\cdots\int_{0 < u_1 < \ldots < u_n < 1} 1\,du_1 \ldots du_n = \frac{1}{n!}. $$
Hence, we obtain the following lemma.
Lemma 3.1.7. For i.i.d. observations from a continuous d.f. $F$, the vector of ranks is uniformly distributed on the set of all permutations of $1, \ldots, n$ (Laplace model):
$$ \mathbb{P}(R = r) = \frac{1}{n!}. $$
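Lemma 3.1.7 is easily checked by simulation. The sketch below (our own code, not from the text) computes ranks via $R_i = nF_n(X_i)$ and verifies that, for $n = 3$, all $3! = 6$ rank vectors occur with frequency close to $1/6$:

```python
import random
from collections import Counter

random.seed(0)

def ranks(xs):
    # R_i = n F_n(X_i) = #{j : X_j <= X_i}; continuous data, so no ties occur
    return tuple(sum(1 for y in xs if y <= x) for x in xs)

trials = 60000
counts = Counter(ranks([random.random() for _ in range(3)]) for _ in range(trials))
freqs = {r: c / trials for r, c in counts.items()}
```

Each of the six rank vectors appears roughly $10000$ times, matching the Laplace model.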
Figure 3.1.1 suggests that the (pointwise) consistency of $F_n$, see (3.1.5), may also hold uniformly in $t$. This is ascertained in the following famous result.
Theorem 3.1.8 (Glivenko-Cantelli). Assume that $X_1, \ldots, X_n$ are i.i.d. from an arbitrary (!) d.f. $F$. Then, as $n \to \infty$,
$$ D_n := \sup_{t\in\mathbb{R}} |F_n(t) - F(t)| \to 0 \quad \text{with probability one.} $$


Proof. The proof of this result presents another nice application of (3.1.17). First we have
$$ D_n = \sup_{t\in\mathbb{R}} |\tilde F_n(F(t)) - F(t)|. \qquad (3.1.19) $$
Setting $u = F(t)$, we obtain
$$ D_n \le \sup_{0\le u\le 1} |\tilde F_n(u) - u| =: \tilde D_n. \qquad (3.1.20) $$
Hence, it suffices to bound $\tilde D_n$. As to this, fix some $\varepsilon > 0$ and choose a finite grid $0 = u_0 < u_1 < u_2 < \ldots < u_k = 1$ such that
$$ \max_{0\le i<k} (u_{i+1} - u_i) \le \varepsilon. $$

Then use the monotonicity of $\tilde F_n$ and of the identity function to get, for any $u$ between $u_i$ and $u_{i+1}$:
$$ \tilde F_n(u) - u \le \tilde F_n(u_{i+1}) - u_i \le |\tilde F_n(u_{i+1}) - u_{i+1}| + \varepsilon $$
and similarly
$$ -|\tilde F_n(u_i) - u_i| - \varepsilon \le \tilde F_n(u) - u. $$
In conclusion,
$$ \sup_{0\le u\le 1} |\tilde F_n(u) - u| \le \max_{0\le i\le k} |\tilde F_n(u_i) - u_i| + \varepsilon. $$
The maximum is taken over finitely many points. Hence, by (3.1.5), the max tends to zero with probability one. We obtain
$$ \limsup_{n\to\infty} \tilde D_n \le \varepsilon \quad \mathbb{P}\text{-almost surely.} $$
Now choose $\varepsilon = 1/m$ with $m \in \mathbb{N}$ and let $m \to \infty$ to complete the proof. $\Box$


Though the last proof is simple, it is instructive to discuss the various issues related to Dn . First, the symbol Dn stands for distance or discrepancy between Fn and F . In a hypothesis testing framework, when F is
replaced with a hypothetical distribution, the resulting test is called the
Kolmogorov-Smirnov test. Thus in our context the quantity Dn will be
henceforth called the K-S distance. Other distances between Fn and F
will be studied in detail later. Equation (3.1.19) is interesting in itself. For
a continuous F , the time transformation u = F (t) is surjective so that in
(3.1.20) we indeed have equality:
n.
Dn = D

3.1. BASIC FACTS

85

Again, as with rank statistics, F has dropped out so that


Dn is distribution-free when F is continuous.

(3.1.21)

Next observe that, because $F_n$ and $F$ are right-continuous with left-hand limits,
$$ D_n = \sup_{t\in\mathbb{Q}} |F_n(t) - F(t)|, $$
with $\mathbb{Q}$ denoting the set of all rational numbers. Hence $D_n$, being the supremum of countably many random variables $|F_n(t) - F(t)|$, is again measurable. In particular, the event
$$ \Omega_0 = \{\lim_{n\to\infty} D_n = 0\} $$
is measurable and $\mathbb{P}(\Omega_0)$ is well-defined.


The next comment is concerned with the technical part. As remarked earlier, (3.1.17) was the key step for the transformation to the unit interval.
The compactness of this interval was essential to guarantee the existence of
a nite -grid. The monotonicity of Fn together with the monotonicity
and uniform continuity of the identity function u u were then used to
obtain uniform convergence from pointwise convergence. Note that pointwise convergence at the boundary points u = 0 and u = 1 are given for free,
since there both functions equal zero or one, respectively. We mention that
in detail since much later, when we deal with general (weighted) empiricals,
convergence at boundary points may and will cause special diculties and
nothing will be given for granted anymore.
The interested reader is asked to prove Glivenko-Cantelli Theorem without
using (3.1.17). Find a proper grid of the whole real line and allow for
innitely many discontinuities of F .
In Figures 3.1.2 and 3.1.3 we consider plots of the function $F_n - F$ under different simulated scenarios. For the true distribution we took the uniform and the Exp(1)-distribution. Sample sizes were $n = 50$ and $n = 100$. For $n = 100$, $F_n - F$ is close to zero, indicating that the approximation of $F$ through $F_n$ is good already for moderate sample sizes. On the other hand, since the graph of $F_n - F$ looks like a (mathematical) worm, the plots are unable to reveal any characteristic features of the deviation between $F_n$ and $F$. In such a situation it is helpful to put $F_n - F$ under the microscope. In mathematical terms, this means that we have to multiply $F_n - F$ by a factor which gets large as $n$ increases. For the empirical process this factor is $n^{1/2}$.

Figure 3.1.2: Glivenko-Cantelli: Uniform scenario (panels: Samples 1-2 with n = 50, Samples 3-4 with n = 100)

Figure 3.1.3: Glivenko-Cantelli: Exponential scenario (panels: Samples 1-2 with n = 50, Samples 3-4 with n = 100)

Figure 3.1.4: Empirical process: Uniform scenario (panels: Samples 1-2 with n = 50, Samples 3-4 with n = 100)

Figure 3.1.5: Empirical process: Exponential scenario (panels: Samples 1-2 with n = 50, Samples 3-4 with n = 100)


Figures 3.1.4 and 3.1.5 show the plots of $\alpha_n = n^{1/2}(F_n - F)$ for the same sets of data. Under the microscope, the roughness of the sample paths becomes apparent. Also, the paths do not explode, so that multiplication by the standardizing factor $n^{1/2}$ seems to keep everything in balance. Mathematically, at least for a fixed $t$, this is justified by the CLT, guaranteeing that the distribution of $\alpha_n(t)$ has a nondegenerate limit.
The next lemma presents an algorithm to compute $D_n$ in finitely many steps.
Lemma 3.1.9. For a continuous $F$, we have
$$ D_n = D_n^* := \max_{1\le i\le n}\left\{\max\left[F(X_{i:n}) - \frac{i-1}{n},\ \frac{i}{n} - F(X_{i:n})\right]\right\}. $$
Proof. For $X_{i:n} \le t < X_{i+1:n}$ and $1 \le i < n$ we have, by monotonicity,
$$ F_n(t) - F(t) \le F_n(X_{i:n}) - F(X_{i:n}) = \frac{i}{n} - F(X_{i:n}) \le D_n^* $$
as well as
$$ F_n(t) - F(t) \ge F_n(X_{i:n}) - F(X_{i+1:n}) \ge -D_n^*. $$
For $t < X_{1:n}$,
$$ -D_n^* \le -F(X_{1:n}) \le F_n(t) - F(t) \le 0 \le D_n^*, $$
while for $t \ge X_{n:n}$,
$$ -D_n^* \le 0 \le F_n(t) - F(t) \le 1 - F(X_{n:n}) \le D_n^*. $$
Summarizing, we obtain $D_n \le D_n^*$. Conversely,
$$ \frac{i}{n} - F(X_{i:n}) = F_n(X_{i:n}) - F(X_{i:n}) \le D_n $$
and, by continuity of $F$,
$$ F(X_{i:n}) - \frac{i-1}{n} = \lim_{t\uparrow X_{i:n}} [F(t) - F_n(t)] \le D_n. $$
Hence $D_n^* \le D_n$, and the proof is complete. $\Box$
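Lemma 3.1.9 is precisely what makes $D_n$ computable in practice; a direct transcription (our own helper name, not from the text):

```python
def ks_distance(data, F):
    # Lemma 3.1.9: D_n = max_i max( i/n - F(X_{i:n}), F(X_{i:n}) - (i-1)/n )
    xs = sorted(data)
    n = len(xs)
    return max(max((i + 1) / n - F(x), F(x) - i / n) for i, x in enumerate(xs))

# Uniform target on [0, 1]: F(t) = t
d = ks_distance([0.1, 0.4, 0.6, 0.9], lambda t: t)
```

For this particular sample every order statistic deviates by the same amount, so the K-S distance is $0.15$; only the $2n$ candidate values at the jump points need to be inspected.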


Lemma 3.1.9 implies that
$$ D_n = \max(D_n^+, D_n^-), $$
where
$$ D_n^+ = \sup_{t\in\mathbb{R}}[F_n(t) - F(t)] = \max_{1\le i\le n}\left[\frac{i}{n} - F(X_{i:n})\right] $$
and
$$ D_n^- = \sup_{t\in\mathbb{R}}[F(t) - F_n(t)] = \max_{1\le i\le n}\left[F(X_{i:n}) - \frac{i-1}{n}\right] $$
denote the two one-sided deviations between $F_n$ and $F$. Also $D_n^+$ and $D_n^-$ are distribution-free as long as $F$ is continuous. Moreover, both have the same distribution. This is most easily seen by introducing the variables $U_i = F(X_i)$ and $\bar U_i = 1 - U_i$, $1 \le i \le n$. Under continuity of $F$, the $\bar U_i$'s are also i.i.d. from $F_U$. Moreover, in an obvious notation, $D_n^+ = \bar D_n^-$, whence $D_n^+ = D_n^-$ in distribution.
Our final results are concerned with the consistency of empirical quantiles. So far we have only studied the consistency of empirical integrals, with special emphasis on $F_n(t)$. Under appropriate integrability conditions on $\varphi$, consistency was a straightforward consequence of the SLLN. For quantiles the situation is different. They are defined through $F_n^{-1}$ and $F^{-1}$, to which a priori no SLLN applies.
We first study uniform quantiles $\tilde F_n^{-1}(u)$. Lemma 3.1.10 shows that the K-S distance between $\tilde F_n^{-1}$ and the identity function equals the K-S distance between $\tilde F_n$ and the identity function.
Lemma 3.1.10. We have
$$ \tilde D_n := \sup_{0\le u\le 1} |\tilde F_n(u) - u| = \sup_{0<u<1} |\tilde F_n^{-1}(u) - u| =: \bar D_n. $$

Proof. Because of Lemma 3.1.9, we have
$$ \tilde D_n = \max_{1\le i\le n}\left\{\max\left[U_{i:n} - \frac{i-1}{n},\ \frac{i}{n} - U_{i:n}\right]\right\}. $$
Since
$$ U_{i:n} - \frac{i-1}{n} = \lim_{u\downarrow \frac{i-1}{n}} [\tilde F_n^{-1}(u) - u] \quad \text{and} \quad \frac{i}{n} - U_{i:n} = \frac{i}{n} - \tilde F_n^{-1}\!\left(\frac{i}{n}\right), $$
we obtain
$$ \tilde D_n \le \bar D_n. $$
Conversely,
$$ \frac{i-1}{n} < u \le \frac{i}{n} \quad \text{implies} \quad \tilde F_n^{-1}(u) = U_{i:n} $$
and therefore
$$ U_{i:n} - \frac{i}{n} \le \tilde F_n^{-1}(u) - u \le U_{i:n} - \frac{i-1}{n} \le \tilde D_n, $$
from which $\bar D_n \le \tilde D_n$. $\Box$

An application of the Glivenko-Cantelli Theorem to a uniform sample together with Lemma 3.1.10 immediately yields, with probability one,
$$ \sup_{0<u<1} |\tilde F_n^{-1}(u) - u| = \tilde D_n \to 0. \qquad (3.1.22) $$
For a general $F$, things are less obvious. The next lemma gives a positive answer as to pointwise consistency.
Lemma 3.1.11. Assume that $F^{-1}$ is continuous at $0 < u < 1$. Then
$$ \lim_{n\to\infty} F_n^{-1}(u) = F^{-1}(u) \quad \text{with probability one.} $$
Proof. The proof is an immediate consequence of (3.1.18) and (3.1.22). $\Box$



It is easy to see that $F^{-1}$ is continuous at $u$ iff
$$ F(t) > u \quad \text{for each } t > F^{-1}(u), $$
i.e., there is no non-degenerate interval $[F^{-1}(u), a]$ on which $F$ is constant. Put another way, $F^{-1}$ is discontinuous at $u$ iff $F$ attains the value $u$ on a non-degenerate interval. Conclude that $F^{-1}$ has at most countably many discontinuities, so that in particular Lemma 3.1.11 holds for Lebesgue-almost all $0 < u < 1$. Hence the $U_i$'s appearing in the representation $X_i = F^{-1}(U_i)$ are with probability one continuity points of $F^{-1}$.


If $F^{-1}$ is continuous on some compact subinterval $[u_0, u_1]$ of $(0,1)$, then a modification of the proof of Theorem 3.1.8 gives uniform convergence on $[u_0, u_1]$:
$$ \sup_{u_0 \le u \le u_1} |F_n^{-1}(u) - F^{-1}(u)| \to 0 \quad \text{with probability one.} $$
An extension to the whole (open!) interval $(0,1)$ creates some problems, since compactness of the parameter set is lost and no simple $\varepsilon$-grid approach is available. Actually, suppose that $F$ has unbounded support, like a normal distribution. Then
$$ \sup_{0<u<1} |F_n^{-1}(u) - F^{-1}(u)| = \infty, $$
since $X_{1:n} \le F_n^{-1}(u) \le X_{n:n}$ and $F^{-1}$ is unbounded. This simple counterexample shows that compactness arguments are indeed crucial for obtaining uniform convergence, and that the uniform convergence valid for $F_n$ may not hold for other estimators. Another example is the cumulative hazard function. We have noted in (2.2.4) that for a continuous $F$
$$ \Lambda_F(t) = -\ln[1 - F(t)] $$
and therefore
$$ \Lambda_F(t) \to \infty \quad \text{as } t \uparrow b_F, $$
where $b_F$ is the smallest upper bound for the support of $F$, possibly infinite. The Nelson-Aalen estimator
$$ \Lambda_n(t) = \int_{(-\infty,t]} \frac{F_n(dx)}{1 - F_n(x-)} $$
is a bounded function, so that necessarily
$$ \sup_t |\Lambda_n(t) - \Lambda_F(t)| = \infty. $$

One may argue that the bad fit of $\Lambda_n$ is caused by $t$'s in the far right tails, namely $t \ge X_{n:n}$. This is true only to a certain extent, since typically a bad fit of an estimating function is not abrupt but takes place gradually. See Figure 1.4.1. In other words, $\Lambda_n(t)$ is not a reliable estimator of $\Lambda_F(t)$ already for those $t$'s ranging between the extreme order statistics $X_{n:n} > X_{n-1:n} > X_{n-2:n} > \ldots$


3.2  Finite-Dimensional Distributions

In the last section we have seen that for i.i.d. random variables with distribution $\mu$, we have
$$ \mathbb{P}(n\mu_n(A) = k) = \binom{n}{k}\mu(A)^k(1-\mu(A))^{n-k}, \qquad k = 0, 1, \ldots, n. $$
Since the number of observations $n$ is fixed, the event $n\mu_n(A) = k$ automatically implies $n\mu_n(\bar A) = n - k$. In other words, the event on the left-hand side equals
$$ \{n\mu_n(A) = k,\ n\mu_n(\bar A) = n - k\}. $$
The collection $\{A, \bar A\}$ forms a simple partition of the sample space $S$. We now discuss the extension to more than two sets. Let $\{A_1, A_2, \ldots, A_m\}$ be a (measurable) partition of the sample space, i.e.,
$$ A_i \cap A_j = \emptyset \ \text{ for } i \ne j, \quad i, j = 1, \ldots, m, \quad \text{and} \quad \bigcup_{i=1}^{m} A_i = S. $$

Such a situation occurs in the $\chi^2$-test, where one compares the frequencies $\mu_n(A_i)$ with hypothetical cell probabilities $\mu_0(A_i)$, $1 \le i \le m$, through the pertaining weighted Euclidean distance:
$$ \chi^2 = n\sum_{i=1}^{m} \frac{[\mu_n(A_i) - \mu_0(A_i)]^2}{\mu_0(A_i)}. \qquad (3.2.1) $$
The joint distribution of these frequencies equals a multinomial distribution.


Lemma 3.2.1. For i.i.d. observations from a distribution $\mu$ we have
$$ \mathbb{P}(n\mu_n(A_i) = k_i \text{ for } 1 \le i \le m) = \binom{n}{k_1, \ldots, k_m}\prod_{i=1}^{m} [\mu(A_i)]^{k_i}, $$
for $k_i = 0, 1, \ldots, n$ such that $\sum_{i=1}^{m} k_i = n$.
This lemma may be applied to also obtain the joint distribution of empirical measures of sets which are not disjoint but increasing.
Lemma 3.2.2. For sets $C_1 \subset C_2 \subset \ldots \subset C_m$ and integers $0 \le k_1 \le k_2 \le \ldots \le k_m \le n$ we have
$$ \mathbb{P}(n\mu_n(C_i) = k_i \text{ for } 1 \le i \le m) = \binom{n}{n_1, n_2, \ldots, n_{m+1}}\prod_{i=1}^{m+1} [\mu(A_i)]^{n_i}, $$


where $n_1 = k_1$, $n_i = k_i - k_{i-1}$ for $2 \le i \le m$, $n_{m+1} = n - k_m$, and $A_1 = C_1$, $A_i = C_i \setminus C_{i-1}$ for $2 \le i \le m$, $A_{m+1} = S \setminus C_m$.
Proof. The proof is an immediate consequence of the previous lemma. $\Box$

Though Lemmas 3.2.1 and 3.2.2 provide us with exact distributions, the formulas are not tractable for moderate and large sample sizes. Moreover, for a deeper forthcoming analysis they are also not very insightful for detecting hidden structures. In particular, if we want to study the dynamic behavior of empirical and related processes, it is useful to restate the last lemma in terms of conditional probabilities.
Lemma 3.2.3. In the situation of Lemma 3.2.2 we have
$$ \mathbb{P}(n\mu_n(C_m) = k_m \mid n\mu_n(C_i) = k_i \text{ for } 1 \le i \le m-1) = \mathbb{P}(n\mu_n(C_m) = k_m \mid n\mu_n(C_{m-1}) = k_{m-1}) $$
$$ = \binom{n - k_{m-1}}{k_m - k_{m-1}}\left[\frac{\mu(C_m \setminus C_{m-1})}{1 - \mu(C_{m-1})}\right]^{k_m - k_{m-1}}\left[1 - \frac{\mu(C_m \setminus C_{m-1})}{1 - \mu(C_{m-1})}\right]^{n - k_m}. $$
Proof. Write the conditional probability as a ratio of two probabilities and apply the last lemma. $\Box$

Lemma 3.2.3 states that $n\mu_n$ has a Markov property when it is evaluated on increasing sets. Furthermore, the transition probability equals a (shifted) binomial distribution with new parameters $n - k_{m-1}$ and $\mu(\cdot \mid \bar C_{m-1})$. More precisely, given $n\mu_n(C_{m-1}) = k_{m-1}$, $n\mu_n(C_m)$ has the same distribution as $k_{m-1} + (n - k_{m-1})\mu_{n-k_{m-1}}(C_m \setminus C_{m-1})$, where the data defining the last empirical distribution are i.i.d. from $\mu$ restricted (and re-normalized) to $\bar C_{m-1}$.
For real-valued data, the most important $C_i$'s are again the extended intervals $(-\infty, t_i]$. Because of their importance, the preceding results are summarized for these $C$'s in the following theorem. It constitutes the extension of Lemma 2.1.4 to sample size $n > 1$.
Theorem 3.2.4. Assume that $X_1, \ldots, X_n$ are i.i.d. from $F$. Then $F_n$ is a Markov process such that for $t_1 < t_2 < \ldots < t_m$ and $0 \le k_1 \le k_2 \le \ldots \le k_m \le n$:
$$ \mathbb{P}(nF_n(t_m) = k_m \mid nF_n(t_{m-1}) = k_{m-1}) $$
$$ = \binom{n - k_{m-1}}{k_m - k_{m-1}}\left[\frac{F(t_m) - F(t_{m-1})}{1 - F(t_{m-1})}\right]^{k_m - k_{m-1}}\left[1 - \frac{F(t_m) - F(t_{m-1})}{1 - F(t_{m-1})}\right]^{n - k_m}. $$


More or less, considering an increasing sequence of $t_i$'s is often only a matter of convenience or taste. From a mathematical point of view we could also consider a decreasing sequence of $t_i$'s. In such a situation we have:
Theorem 3.2.5. Assume that $X_1, \ldots, X_n$ are i.i.d. from $F$. Then $F_n$ is a Markov process in reverse time such that for $t_1 < t_2 < \ldots < t_m$ and $0 \le k_1 \le k_2 \le \ldots \le k_m \le n$:
$$ \mathbb{P}(nF_n(t_1) = k_1 \mid nF_n(t_2) = k_2, \ldots, nF_n(t_m) = k_m) = \mathbb{P}(nF_n(t_1) = k_1 \mid nF_n(t_2) = k_2) $$
$$ = \binom{k_2}{k_1}\left(\frac{F(t_1)}{F(t_2)}\right)^{k_1}\left(1 - \frac{F(t_1)}{F(t_2)}\right)^{k_2 - k_1}. $$
Later on we shall view the processes $F_n$ and $\alpha_n$ as random elements in the space $D$ of all functions which are right-continuous and have limits from the left, equipped with the $\sigma$-field generated by all $F_n(t)$, $t \in \mathbb{R}$. Its distribution is uniquely determined through its finite-dimensional distributions. We may then combine Theorems 3.2.4 and 3.2.5 with the uniform representation (3.1.17) to come up with the following result.
Theorem 3.2.6. For i.i.d. data, $nF_n$ is a Markov process such that, conditionally on $nF_n(t_0) = k$, $nF_n(t)$ on $t \ge t_0$ has the same distribution as the process
$$ t \mapsto k + (n-k)\,\tilde F_{n-k}\!\left[\frac{F(t) - F(t_0)}{1 - F(t_0)}\right]. $$
Similarly, for the process in reverse time, we have that, conditionally on $nF_n(t_0) = k$, $nF_n(t)$ on $t \le t_0$ has the same distribution as the process
$$ t \mapsto k\,\tilde F_k\!\left(\frac{F(t)}{F(t_0)}\right). $$
Informally speaking, we see that conditionally on $nF_n(t_0) = k$, the process $nF_n$ on $t \ge t_0$ resp. $t \le t_0$ equals in distribution a uniform empirical d.f. with sample size and time transformation properly adjusted. These statements are (hopefully) easier to remember than those in Theorems 3.2.4 and 3.2.5.
They readily enable us to compute conditional expectations. For example, in the forward case, let $\mathcal{F}_t = \sigma(F_n(s) : s \le t)$. Then the first statement in Theorem 3.2.6 implies for $t_0 \le t$:
$$ \mathbb{E}[nF_n(t) \mid \mathcal{F}_{t_0}] = nF_n(t_0) + (n - nF_n(t_0))\,\frac{F(t) - F(t_0)}{1 - F(t_0)}, \qquad (3.2.2) $$
while in the backward case, with $\mathcal{G}_t = \sigma(F_n(s) : t \le s)$:
$$ \mathbb{E}[nF_n(t) \mid \mathcal{G}_{t_0}] = nF_n(t_0)\,\frac{F(t)}{F(t_0)}. \qquad (3.2.3) $$

From (3.2.2) and (3.2.3) we immediately get the following result.
Corollary 3.2.7. Let $X_1, \ldots, X_n$ be i.i.d. from $F$. Then
the process $t \mapsto \dfrac{1 - F_n(t)}{1 - F(t)}$ is a martingale w.r.t. $(\mathcal{F}_t)_t$ (on $F(t) < 1$);
the process $t \mapsto \dfrac{F_n(t)}{F(t)}$ is a reverse martingale w.r.t. $(\mathcal{G}_t)_t$ (on $F(t) > 0$).
The forward case also readily follows from Lemma 2.1.6.


Theorem 3.2.6 also yields explicit formulas for conditional variances:
$$ \operatorname{Var}\{nF_n(t) \mid \mathcal{F}_{t_0}\} = (n - nF_n(t_0))\,\frac{F(t) - F(t_0)}{1 - F(t_0)}\left[1 - \frac{F(t) - F(t_0)}{1 - F(t_0)}\right] \qquad (3.2.4) $$
$$ \operatorname{Var}\{nF_n(t) \mid \mathcal{G}_{t_0}\} = nF_n(t_0)\,\frac{F(t)}{F(t_0)}\left[1 - \frac{F(t)}{F(t_0)}\right]. \qquad (3.2.5) $$
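The conditional mean formula (3.2.2) can be checked by brute-force conditioning in the uniform case $F(t) = t$: simulate many samples, collect those with a given count at $t_0$, and average the count at $t_1$. A sketch (our own setup; the particular values of $n$, $t_0$, $t_1$ are arbitrary):

```python
import random

random.seed(2)
n, t0, t1 = 10, 0.4, 0.7   # uniform case: F(t) = t
trials = 40000
by_count = {}
for _ in range(trials):
    xs = [random.random() for _ in range(n)]
    k = sum(1 for x in xs if x <= t0)   # nF_n(t0)
    m = sum(1 for x in xs if x <= t1)   # nF_n(t1)
    by_count.setdefault(k, []).append(m)

k = 4  # condition on the event nF_n(0.4) = 4
observed = sum(by_count[k]) / len(by_count[k])
predicted = k + (n - k) * (t1 - t0) / (1.0 - t0)   # formula (3.2.2)
```

Here the predicted conditional mean is $4 + 6 \cdot 0.3/0.6 = 7$, and the Monte Carlo average agrees up to sampling error.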
The last statements in this section deal with the limit distribution of $\alpha_n$ at finitely many points. With $N_m(0, \Sigma)$ we denote the $m$-variate normal distribution with expectation $0 \in \mathbb{R}^m$ and $m \times m$ covariance matrix $\Sigma$.
Lemma 3.2.8. Recall
$$ \alpha_n(t) = n^{1/2}[F_n(t) - F(t)], \qquad t \in \mathbb{R}, $$
the empirical process. Then, for each $t_1 \le t_2 \le \ldots \le t_m$, we get
$$ [\alpha_n(t_1), \ldots, \alpha_n(t_m)]^T \xrightarrow{\mathcal{L}} N_m(0, \Sigma) \quad \text{as } n \to \infty $$
with
$$ \Sigma = (\sigma_{ij})_{1\le i,j\le m} \quad \text{and} \quad \sigma_{ij} = F(t_i \wedge t_j) - F(t_i)F(t_j). \qquad (3.2.6) $$

This lemma is an immediate consequence of the multivariate CLT. For (3.2.6), see also (2.1.4).
Next we formulate the asymptotic result corresponding to disjoint sets $A_1, \ldots, A_m$ as considered in Lemma 3.2.1. Since $\mu_n(A_m) = 1 - \mu_n(A_1) - \ldots - \mu_n(A_{m-1})$, the limit distribution of $(\mu_n(A_1), \ldots, \mu_n(A_m))$, when properly standardized, will be degenerate. Therefore we restrict ourselves to the first $m - 1$ coordinates.
Lemma 3.2.9. Let $A_1, \ldots, A_m$ be a partition of the sample space. Then, as $n \to \infty$,
$$ n^{1/2}[\mu_n(A_1) - \mu(A_1), \ldots, \mu_n(A_{m-1}) - \mu(A_{m-1})]^T \xrightarrow{\mathcal{L}} N_{m-1}(0, \Sigma), $$
where $\Sigma$ is the $(m-1) \times (m-1)$ matrix
$$ \Sigma = \begin{pmatrix} p_1(1-p_1) & -p_1 p_2 & \cdots & -p_1 p_{m-1} \\ -p_2 p_1 & \ddots & & \vdots \\ \vdots & & \ddots & \vdots \\ -p_{m-1} p_1 & \cdots & & p_{m-1}(1-p_{m-1}) \end{pmatrix} $$
and $p_i = \mu(A_i)$.
Proof. The assertion follows from the multivariate CLT. For the covariance matrix, recall (2.1.6). $\Box$

We only mention the well-known fact that Lemma 3.2.9 governs the limit distribution of $\chi^2$ from (3.2.1). Actually, setting $\hat p_i = \mu_n(A_i)$ we get
$$ \chi^2 = n\,(\hat p_1 - p_1, \ldots, \hat p_{m-1} - p_{m-1})\, I\, (\hat p_1 - p_1, \ldots, \hat p_{m-1} - p_{m-1})^T $$
with
$$ I = \begin{pmatrix} p_1^{-1} + p_m^{-1} & p_m^{-1} & \cdots & p_m^{-1} \\ p_m^{-1} & \ddots & & \vdots \\ \vdots & & \ddots & \vdots \\ p_m^{-1} & \cdots & \cdots & p_{m-1}^{-1} + p_m^{-1} \end{pmatrix} $$
denoting the pertaining Fisher Information Matrix. From Lemma 3.2.9 and the Continuous Mapping Theorem, we obtain
$$ \chi^2 \xrightarrow{\mathcal{L}} Y^T I Y \quad \text{with } Y \sim N_{m-1}(0, \Sigma). $$
But $I = \Sigma^{-1}$. We therefore have the following result.
Theorem 3.2.10. As n ,
L

2 Z,
where Z has a 2 -distribution with m 1 degrees of freedom.
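The identity I = Σ^{−1} behind Theorem 3.2.10 can be verified in exact rational arithmetic. The following sketch (plain Python; the cell probabilities p are an arbitrary illustrative choice, not taken from the text) multiplies Σ by the Fisher Information Matrix and checks that the product is the identity.

```python
from fractions import Fraction

# cell probabilities of a partition A_1, ..., A_m (here m = 4); any
# positive Fractions summing to 1 will do
p = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
m = len(p)
assert sum(p) == 1

# Sigma: diagonal p_i(1 - p_i), off-diagonal -p_i p_j   (Lemma 3.2.9)
Sigma = [[p[i] * (1 - p[i]) if i == j else -p[i] * p[j]
          for j in range(m - 1)] for i in range(m - 1)]

# Fisher matrix: diagonal 1/p_i + 1/p_m, off-diagonal 1/p_m
I = [[(1 / p[i] if i == j else 0) + 1 / p[m - 1]
      for j in range(m - 1)] for i in range(m - 1)]

# exact matrix product Sigma * I
prod = [[sum(Sigma[i][k] * I[k][j] for k in range(m - 1))
         for j in range(m - 1)] for i in range(m - 1)]

# the product is the identity, hence I = Sigma^{-1}
assert all(prod[i][j] == (1 if i == j else 0)
           for i in range(m - 1) for j in range(m - 1))
```

Because all entries are `Fraction`s, the check is exact rather than up to floating-point error.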

3.3. ORDER STATISTICS

3.3

97

Order Statistics

In this section we collect some basic distributional properties of order statistics. Special emphasis is given on order statistics from the uniform and the
exponential distribution. Order statistics are important for at least three
reasons:
- Special quantiles like the median, upper and lower quartiles or, more generally, linear combinations of order statistics are important (robust) competitors to estimators based on empirical moments (like sample means or sample variances).

- As Lemma 3.1.9 has shown, properly transformed order statistics may appear in distribution-free statistics. To determine the distribution of these statistics, a detailed study of order statistics will be indispensable.

- In some data situations, e.g., in life testing, n items are put on test and monitored until the r-th failure. In such a situation only X_{1:n}, …, X_{r:n} are observable and any statistical inference therefore needs to be based on them.
Our first result deals with the distribution of a single X_{i:n}. For example, when i = n, we get

    P(X_{n:n} ≤ x) = P(X_i ≤ x for all i = 1, …, n) = F^n(x),   by independence.

For i = 1, we obtain

    P(X_{1:n} ≤ x) = 1 − P(X_{1:n} > x) = 1 − P(X_i > x for all i = 1, …, n) = 1 − [1 − F(x)]^n.

These examples are two special cases of the following more general result.

Lemma 3.3.1. Let X_1, …, X_n be a sample of independent observations from a d.f. F. Then

    G_i(x) = P(X_{i:n} ≤ x) = Σ_{k=i}^n C(n,k) F^k(x)(1 − F(x))^{n−k},   1 ≤ i ≤ n.

Proof. The assertion immediately follows from, see (1.3.4),

    P(X_{i:n} ≤ x) = P(i/n ≤ F_n(x))

and (3.1.1).
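Lemma 3.3.1 is easy to confirm by simulation. The sketch below (plain Python; the sample size, index and evaluation point are arbitrary choices) compares the Monte Carlo frequency of {X_{i:n} ≤ x} for uniform data, where F(x) = x, with the binomial sum of the lemma.

```python
import math
import random

def G(i, n, F):
    # right-hand side of Lemma 3.3.1: sum_{k=i}^n C(n,k) F^k (1-F)^{n-k}
    return sum(math.comb(n, k) * F**k * (1 - F)**(n - k)
               for k in range(i, n + 1))

random.seed(1)
n, i, x = 10, 3, 0.25          # uniform data, so F(x) = x
reps = 200_000
hits = sum(sorted(random.random() for _ in range(n))[i - 1] <= x
           for _ in range(reps))

assert abs(hits / reps - G(i, n, x)) < 0.01
```

The binomial tail appears because {X_{i:n} ≤ x} holds exactly when at least i of the n observations fall at or below x.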

The function Gi only gives limited information about the distributional character of order statistics. Therefore we next determine the joint distribution
of all Xi:n s. Since by construction X1:n X2:n . . . Xn:n this distribution is supported by the set K of all x = (x1 , . . . ,
xn ) of n-vectors
satisfying x1 x2 . . . xn . The set of rectangles ni=1 (ai , bi ] with
a1 b1 . . . an bn uniquely determines a distribution on K. From the
i.i.d. property of the Xi s, we obtain
P(ai < Xi:n bi for 1 i n) = n!

[F (bi ) F (ai )].


i=1

When F admits a Lebesgue density f , the Xi:n s also have a Lebesgue density
supported by K.
Lemma 3.3.2. Assume that X1 , . . . , Xn are independent from a density f .
Then (X1:n , . . . , Xn:n ) has the Lebesgue density
{

n! ni=1 f (xi ) for x1 x2 . . . xn


gn (x1 , . . . , xn ) =
.
0
elsewhere
For the sake of reference we display two important special cases:

F = F_U:

    g_n(x_1, …, x_n) = { n!   for 0 ≤ x_1 ≤ … ≤ x_n ≤ 1
                       { 0    elsewhere

F = Exp(1):

    g_n(x_1, …, x_n) = { n! exp[−x_1 − … − x_n]   for 0 ≤ x_1 ≤ … ≤ x_n < ∞
                       { 0                        elsewhere.

The distributional behavior of the order statistics 0 ≤ E_{1:n} ≤ … ≤ E_{n:n} pertaining to a sample E_1, …, E_n of independent random variables from an Exp(1) distribution is particularly nice. As seen before, (E_{1:n}, …, E_{n:n}) has density

    g_n(x_1, …, x_n) = { n! exp[−Σ_{i=1}^n x_i]   for 0 ≤ x_1 < … < x_n
                       { 0                        elsewhere.

Define E_{0:n} = 0 and set

    Y_{in} ≡ Y_i = (n − i + 1)(E_{i:n} − E_{i−1:n}),

whence

    E_{r:n} = Σ_{i=1}^r (E_{i:n} − E_{i−1:n}) = Σ_{i=1}^r Y_i/(n − i + 1).

The interesting fact about this representation is that the Y_i's have a simple distributional structure. See Rényi (1953).
Lemma 3.3.3. The random variables Y_1, …, Y_n are i.i.d. from Exp(1).

Proof. We have

    (Y_1, …, Y_n)^T = A (E_{1:n}, …, E_{n:n})^T

with the n × n matrix A defined as

    A = (    n       0       0    …     0
          −(n−1)    n−1      0    …     0
             0    −(n−2)    n−2   …     0
             ⋮               ⋱    ⋱     ⋮
             0       …      −2    2     0
             0       …       0   −1     1 ).

From calculus, the Y-vector has density

    (x_1, …, x_n) → |det A^{−1}| g_n( A^{−1}(x_1, …, x_n)^T ).

Since det A^{−1} = 1/det A = 1/n! and

    g_n( A^{−1}(x_1, …, x_n)^T ) = { n! exp[−Σ_{i=1}^n x_i]   for x_1, …, x_n ≥ 0
                                   { 0                        elsewhere,

the proof is complete.


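The Rényi representation of Lemma 3.3.3 can be checked numerically: the normalized spacings of exponential order statistics should again look like an i.i.d. Exp(1) sample. A minimal simulation sketch (plain Python; the sample size and replication count are arbitrary choices):

```python
import random

random.seed(2)
n, reps = 8, 50_000
Y = [[] for _ in range(n)]
for _ in range(reps):
    E = sorted(random.expovariate(1.0) for _ in range(n))
    prev = 0.0
    for i in range(n):
        # Y_{i+1} = (n - i)(E_{i+1:n} - E_{i:n}), with E_{0:n} = 0
        Y[i].append((n - i) * (E[i] - prev))
        prev = E[i]

# each normalized spacing should be Exp(1): mean 1, variance 1
for ys in Y:
    m = sum(ys) / reps
    v = sum((y - m) ** 2 for y in ys) / reps
    assert abs(m - 1.0) < 0.05 and abs(v - 1.0) < 0.15
```

A stricter test (e.g., of independence across spacings) would need more machinery; matching the first two Exp(1) moments for every spacing already exercises the multiplier (n − i + 1).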

If we represent the E_i's in terms of uniform random variables U_1, …, U_n:

    E_i = −ln(1 − U_i),

then

    −ln(1 − U_{r:n}) = Σ_{i=1}^r Y_i/(n − i + 1).    (3.3.1)

Solving for Y_r yields

    Y_r/(n − r + 1) = ln[ (1 − U_{r−1:n})/(1 − U_{r:n}) ]

and therefore

    Q_r := [ (1 − U_{r:n})/(1 − U_{r−1:n}) ]^{n−r+1} = exp(−Y_r).    (3.3.2)

We may therefore conclude from Lemma 3.3.3 the following Corollary. See Malmquist (1950).

Corollary 3.3.4. The random variables Q_1, …, Q_n are i.i.d. from F_U.
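Corollary 3.3.4 is likewise easy to test by simulation: each Q_r = [(1 − U_{r:n})/(1 − U_{r−1:n})]^{n−r+1} should be uniform on (0,1). A small sketch (plain Python; parameters are arbitrary choices):

```python
import random

random.seed(3)
n, reps = 6, 100_000
Q = [[] for _ in range(n)]
for _ in range(reps):
    U = sorted(random.random() for _ in range(n))
    prev = 0.0
    for r in range(n):
        # Q_{r+1} with exponent n - (r+1) + 1 = n - r, and U_{0:n} = 0
        Q[r].append(((1 - U[r]) / (1 - prev)) ** (n - r))
        prev = U[r]

for qs in Q:                               # each should be Uniform(0,1)
    assert abs(sum(qs) / reps - 0.5) < 0.01        # mean 1/2
    assert abs(sum(q <= 0.3 for q in qs) / reps - 0.3) < 0.01  # d.f. at 0.3
```
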
There is another interesting variant of Lemma 3.3.3 which brings the cumulative hazard function Λ_F back into play.

Corollary 3.3.5. Assume that X_1, …, X_n are i.i.d. from a continuous F. Then the variables

    Z_i = (n − i + 1)[Λ_F(X_{i:n}) − Λ_F(X_{i−1:n})],   1 ≤ i ≤ n,

are i.i.d. from Exp(1). For completeness, set Λ_F(X_{0:n}) = 0.

This Corollary is indeed an extension of Lemma 3.3.3 to the case of a general continuous F. Actually, when the X_i's are Exp(1), then Λ_F(t) = t so that the Z_i's equal the Y_i's from before. Corollary 3.3.5 reveals that Λ_F may serve as a tool to transform the dependent order statistics into a sample of independent random variables with a specified distribution.

Proof of Corollary 3.3.5. The assertion is a straightforward consequence of Lemma 3.3.3, the monotonicity of Λ_F and Lemma 2.2.5.

Relation (3.3.1) is the basic tool to analyze the probabilistic properties of uniform order statistics. First,

    1 − U_{r:n} = exp[ −Σ_{i=1}^r Y_i/(n − i + 1) ].    (3.3.3)

Since the Y's are independent, the partial sums are Markovian. So are U_{1:n}, …, U_{n:n}. Moreover, for k+1 ≤ j ≤ n, we have

    U_{j:n} = 1 − exp[ −Σ_{i=1}^j Y_i/(n − i + 1) ]
            = 1 − exp[ −Σ_{i=1}^k Y_i/(n − i + 1) ] · exp[ −Σ_{i=k+1}^j Y_i/(n − i + 1) ],

whence

    (U_{j:n} − U_{k:n})/(1 − U_{k:n})
        = 1 − exp[ −Σ_{i=k+1}^j Y_i/(n − i + 1) ]
        = 1 − exp[ −Σ_{i=k+1}^j Y_i/((n−k) − (i−k) + 1) ]    (3.3.4)
        = 1 − exp[ −Σ_{i=1}^{j−k} Y_{i+k}/(n − k − i + 1) ],

which, in view of (3.3.3) and Lemma 3.3.3, is distributed as U_{j−k:n−k} and independent of U_{1:n}, …, U_{k:n}.
From this we immediately get the following lemma.

Lemma 3.3.6. For each n ≥ 1, U_{1:n}, …, U_{n:n} is a Markov sequence such that for 0 ≤ y_1 ≤ … ≤ y_k ≤ 1:

    L( (U_{i:n} − y_k)/(1 − y_k), k+1 ≤ i ≤ n | U_{1:n} = y_1, …, U_{k:n} = y_k )
        = L( U_{1:n−k}, …, U_{n−k:n−k} ).

Here again L stands for distribution.
The following lemma presents a reverse Markov property for the U_{i:n}'s.

Lemma 3.3.7. For each n ≥ 1 and all 0 < y_{k+1} ≤ … < y_n < 1 we have

    L( U_{i:n}, 1 ≤ i ≤ k | U_{k+1:n} = y_{k+1}, …, U_{n:n} = y_n ) = L( y_{k+1} U_{i:k}, 1 ≤ i ≤ k ).

Proof. Put Ũ_i = 1 − U_i. The Ũ_i's are again i.i.d. from F_U. Since U_{i:n} = 1 − Ũ_{n−i+1:n}, we get from the preceding lemma, for any a_1 ≤ b_1 ≤ … ≤ a_k ≤ b_k ≤ y_{k+1}:

    P( a_i < U_{i:n} ≤ b_i for 1 ≤ i ≤ k | U_{k+1:n} = y_{k+1}, …, U_{n:n} = y_n )
        = P( 1 − b_i ≤ Ũ_{n−i+1:n} < 1 − a_i for 1 ≤ i ≤ k | Ũ_{n−k:n} = 1 − y_{k+1}, …, Ũ_{1:n} = 1 − y_n )
        = P( y_{k+1} − b_i ≤ y_{k+1} Ũ_{k−i+1:k} < y_{k+1} − a_i for 1 ≤ i ≤ k )
        = P( a_i < y_{k+1} U_{i:k} ≤ b_i for 1 ≤ i ≤ k ).

Our final result in this direction is the following

Lemma 3.3.8. For 1 < r < n and 0 < x < 1, given U_{r:n} = x, the random vectors (U_{1:n}, …, U_{r−1:n}) and (U_{r+1:n}, …, U_{n:n}) are independent and distributed as (xU_{1:r−1}, …, xU_{r−1:r−1}) and (x + (1−x)U_{1:n−r}, …, x + (1−x)U_{n−r:n−r}), respectively.

When F has a derivative f = F′, then also G_i has a Lebesgue density which may be immediately derived from the formula in Lemma 3.3.1. In the following we shall rather apply the Markov structure of the U_{i:n}'s to get some experience with a recursion technique, which also applies in a conditional setting.
First,

    G_n(x) = P(U_{n:n} ≤ x) = x^n   for 0 ≤ x ≤ 1.

Furthermore, for 1 ≤ i < n and 0 ≤ x ≤ 1, we have

    G_i(x) = P(U_{i:n} ≤ x) = ∫_0^1 P(U_{i:n} ≤ x | U_{i+1:n} = y) G_{i+1}(dy)
           = P(U_{i+1:n} ≤ x) + ∫_x^1 P(y U_{i:i} ≤ x) G_{i+1}(dy)
           = G_{i+1}(x) + x^i ∫_x^1 y^{−i} G_{i+1}(dy).    (3.3.5)

This recursion will be the key tool to compute the density g_i of U_{i:n}.

Lemma 3.3.9. For 1 ≤ i ≤ n,

    g_i(x) = { x^{i−1}(1 − x)^{n−i}/B(i, n−i+1)   for 0 ≤ x ≤ 1
             { 0                                  elsewhere.

Here

    B(a, b) = ∫_0^1 t^{a−1}(1 − t)^{b−1} dt,   a, b > 0,

denotes the Beta-function.

Remark 3.3.10. Recall that

    B(a, b) = Γ(a)Γ(b)/Γ(a + b)

and Γ is the Gamma function satisfying

    Γ(a) = (a − 1)!   for a ∈ N.

Proof of Lemma 3.3.9. For i = n,

    g_n(x) = n x^{n−1}   on 0 ≤ x ≤ 1,

which is the same as the expression in the lemma. For 1 ≤ i < n and 0 ≤ x ≤ 1, we apply (3.3.5) and use induction to get

    g_i(x) = g_{i+1}(x) + i x^{i−1} ∫_x^1 y^{−i} g_{i+1}(y) dy − x^i x^{−i} g_{i+1}(x)
           = i x^{i−1} ∫_x^1 y^{−i} y^i (1 − y)^{n−i−1} dy / B(i+1, n−i)
           = i x^{i−1} ∫_x^1 (1 − y)^{n−i−1} dy / B(i+1, n−i)
           = x^{i−1}(1 − x)^{n−i}/B(i, n−i+1),

where the last equality follows from

    (n − i) B(i+1, n−i)/i = B(i, n−i+1).





For a general d.f. F with density f, since

    P(X_{i:n} ≤ x) = P(U_{i:n} ≤ F(x)),

X_{i:n} has density

    f_i(x) = f(x) F^{i−1}(x)(1 − F(x))^{n−i}/B(i, n−i+1).    (3.3.6)
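The Beta-density of Lemma 3.3.9 pins down all moments of U_{i:n}: integrating x^j against g_i gives B(j+i, n−i+1)/B(i, n−i+1). A simulation cross-check (plain Python; n, i and the moments tested are arbitrary choices):

```python
import math
import random

def Beta(a, b):
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def moment(j, i, n):
    # E U_{i:n}^j implied by the density g_i of Lemma 3.3.9
    return Beta(j + i, n - i + 1) / Beta(i, n - i + 1)

random.seed(4)
n, i, reps = 9, 4, 100_000
draws = [sorted(random.random() for _ in range(n))[i - 1] for _ in range(reps)]

for j in (1, 2):
    mc = sum(d ** j for d in draws) / reps
    assert abs(mc - moment(j, i, n)) < 0.01
```

For j = 1 the ratio of Beta functions collapses to i/(n+1), which the simulated means reproduce.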

Lemma 3.3.9 immediately yields the moments of U_{i:n}.

Corollary 3.3.11. For j ≥ 1 and 1 ≤ i ≤ n, we have

    E U_{i:n}^j = B(j+i, n−i+1)/B(i, n−i+1).

As special cases we obtain

    E U_{i:n} = i/(n+1) ≡ μ_{i,n}

and

    E U_{i:n}² = i(i+1)/[(n+1)(n+2)].

From this

    Var U_{i:n} = μ_{i,n}(1 − μ_{i,n})/(n+2).

These formulae may also be easily obtained by using (3.3.3). For example, since

    E[exp(−λE)] = 1/(1+λ)   for λ > 0,

we get

    E[1 − U_{r:n}] = ∏_{i=1}^r 1/(1 + 1/(n−i+1)) = ∏_{i=1}^r (n−i+1)/(n−i+2) = (n−r+1)/(n+1),

whence

    E U_{r:n} = 1 − (n−r+1)/(n+1) = r/(n+1).

Furthermore, putting j = k+1 in (3.3.4), we get

    U_{k+1:n} = e^{−Y_{k+1}/(n−k)} U_{k:n} + 1 − e^{−Y_{k+1}/(n−k)}

and, by independence of U_{k:n} and Y_{k+1}:

    E[U_{k+1:n} | U_{1:n}, …, U_{k:n}] = (n−k)/(n−k+1) · U_{k:n} + 1/(n−k+1).

Lemma 3.3.12. The sequence

    (U_{i:n} − i/(n+1))/(n−i+1),   1 ≤ i ≤ n,

is a mean zero martingale w.r.t. the natural filtration

    F_i = σ(U_{j:n} : 1 ≤ j ≤ i),   1 ≤ i ≤ n.

The reverse Markov property of the U_{i:n}'s gives rise to a reverse martingale. The proof is similar to that of Lemma 3.3.12 and therefore omitted.

Lemma 3.3.13. The sequence

    (n+1)U_{i:n}/i,   1 ≤ i ≤ n,

is a mean-one reverse martingale w.r.t. the natural filtration F_i = σ(U_{j:n} : i ≤ j ≤ n).
In the last result of this section, we relate uniform order statistics to sums — and not order statistics — of independent exponential random variables.

Lemma 3.3.14. The vector (U_{1:n}, …, U_{n:n}) of uniform order statistics has the same distribution as

    ( Z_1/Z_{n+1}, …, Z_n/Z_{n+1} ),

where Z_i = E_1 + … + E_i is the i-th partial sum from a sequence E_1, …, E_{n+1} of independent Exp(1) random variables.

Proof. For 0 ≤ a_1 ≤ b_1 ≤ … ≤ a_{n+1} ≤ b_{n+1} a Fubini-type argument yields

    P(a_i < Z_i ≤ b_i for 1 ≤ i ≤ n+1)
        = ∫_{a_1}^{b_1} ∫_{a_2−x_1}^{b_2−x_1} … ∫_{a_{n+1}−x_1−…−x_n}^{b_{n+1}−x_1−…−x_n} exp[−x_1 − … − x_{n+1}] dx_{n+1} … dx_1.


Substituting y_{n+1} = x_{n+1} + (x_1 + … + x_n) one obtains for the last integral

    ∫_{a_1}^{b_1} ∫_{a_2−x_1}^{b_2−x_1} … ∫_{a_{n+1}}^{b_{n+1}} exp(−y_{n+1}) dy_{n+1} dx_n … dx_1
        = ∫_{a_1}^{b_1} ∫_{a_2}^{b_2} … ∫_{a_{n+1}}^{b_{n+1}} exp(−x_{n+1}) dx_{n+1} … dx_1.

This shows that (Z_1, …, Z_{n+1}) has the Lebesgue density

    (x_1, …, x_{n+1}) → { exp(−x_{n+1})   for 0 ≤ x_1 < … < x_{n+1}
                        { 0               elsewhere.

Conclude that, for 0 ≤ a_1 ≤ b_1 ≤ … ≤ a_n ≤ b_n ≤ 1,

    P(a_i < Z_i/Z_{n+1} ≤ b_i for 1 ≤ i ≤ n)
        = ∫_0^∞ ∫_{a_1 x_{n+1}}^{b_1 x_{n+1}} … ∫_{a_n x_{n+1}}^{b_n x_{n+1}} exp(−x_{n+1}) dx_n … dx_1 dx_{n+1}
        = ∏_{i=1}^n (b_i − a_i) ∫_0^∞ x^n exp(−x) dx = n! ∏_{i=1}^n (b_i − a_i),

as desired.
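Lemma 3.3.14 can be illustrated numerically: the ratios Z_i/Z_{n+1} of exponential partial sums should reproduce, e.g., the order-statistic means i/(n+1). A minimal sketch (plain Python; parameters are arbitrary choices):

```python
import random

random.seed(5)
n, reps = 5, 100_000
acc = [0.0] * n
for _ in range(reps):
    Z, s = [], 0.0
    for _ in range(n + 1):              # partial sums of Exp(1) variables
        s += random.expovariate(1.0)
        Z.append(s)
    for i in range(n):
        acc[i] += Z[i] / Z[n]           # Z_{i+1}/Z_{n+1}

for i in range(n):                      # E U_{i:n} = i/(n+1)
    assert abs(acc[i] / reps - (i + 1) / (n + 1)) < 0.01
```
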

The results presented in this section are fundamental tools for proving various properties of empirical d.f.s. In the following section they will be used
to get some exact boundary crossing probabilities for Fn .

3.4 Some Selected Boundary Crossing Probabilities

The Glivenko-Cantelli Theorem asserts that the empirical d.f.'s approach, as sample size increases to infinity, the true d.f. in the uniform metric. In this section we derive several results which characterize the deviation between F_n and F for fixed finite n.

Our first Theorem, which is originally due to Daniels (1945), yields the exact distribution of the maximal relative deviation between F_n and F:

    R_n = sup_{t: 0<F(t)} F_n(t)/F(t).

By the SLLN,

    F_n(t)/F(t) → 1   with probability one.    (3.4.1)


Note, however, that compared with F_n − F, the process of interest, F_n/F, is only defined on the left-open interval of those t's for which F(t) is positive, so that compactness arguments in connection with a uniform version of (3.4.1) do not apply here. Indeed,

    Var[ F_n(t)/F(t) ] = (1 − F(t))/(n F(t))

tends to infinity for fixed n, as F(t) → 0, so that R_n is likely to be heavily controlled by F_n(t) when t is small.

Theorem 3.4.1. For a continuous F, we have, for 0 < λ ≤ 1,

    G_n(λ) ≡ P(R_n < 1/λ) = 1 − λ.    (3.4.2)

Proof. That G_n is distribution-free follows as for the K-S statistic, since under continuity

    R_n = sup_{0<u≤1} F̃_n(u)/u ≡ R̃_n,

F̃_n being the empirical d.f. of the transformed (uniform) sample. It is noteworthy that also G_n is the same for each n ≥ 1. For n = 1, R̃_1 = 1/U_1 so that (3.4.2) is immediate.

For a general n > 1, we follow Rényi (1973) who suggested proving Theorem 3.4.1 by induction on n. Now,

    R̃_n = max_{1≤k≤n} k/(n U_{k:n})

so that

    G_n(λ) = P( min_{1≤k≤n} n U_{k:n}/k > λ ).

Recall that U_{n:n} has the Lebesgue density n y^{n−1} on 0 < y < 1. Thus, for n > 1, we may apply Lemma 3.3.7 and use the validity of (3.4.2) for n−1 to get

    G_n(λ) = ∫_λ^1 G_{n−1}( (n−1)λ/(n y) ) n y^{n−1} dy
           = ∫_λ^1 [ 1 − (n−1)λ/(n y) ] n y^{n−1} dy = 1 − λ.

If we set

    1/λ = 1 + ε   with ε ≥ 0,

we may restate Theorem 3.4.1 to get

    P( sup_{t: 0<F(t)} [F_n(t)/F(t)] − 1 > ε ) = 1/(1+ε).

It follows that, as expected, F_n/F does not converge to 1 uniformly in t. On the other hand, to formulate a positive result, Theorem 3.4.1 asserts that up to an event of probability λ, F_n is bounded from above by F/λ:

    F_n(t) ≤ F(t)/λ.    (3.4.3)

This inequality is also true when F(t) = 0, since then with probability one all data exceed t so that F_n(t) = 0. Since for uniformly distributed data, F(t) = t is a linear function, the bound (3.4.3) is sometimes called a linear bound (on the F-scale) for F_n.
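Theorem 3.4.1 is straightforward to confirm by Monte Carlo: the probability that R_n stays below 1/λ should be 1 − λ for every n. A small sketch (plain Python; n, λ and the replication count are arbitrary choices):

```python
import random

random.seed(6)
n, lam, reps = 20, 0.4, 50_000
below = 0
for _ in range(reps):
    U = sorted(random.random() for _ in range(n))
    # for uniform data, R_n = max_k (k/n)/U_{k:n}
    Rn = max((k + 1) / n / U[k] for k in range(n))
    below += Rn < 1 / lam

# Theorem 3.4.1: P(R_n < 1/lam) = 1 - lam, for every n
assert abs(below / reps - (1 - lam)) < 0.015
```

That the answer does not depend on n is the striking part of the theorem; rerunning with a different n leaves the frequency near 1 − λ.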
Next we investigate lower bounds corresponding to (3.4.3). Again it suffices to consider the case of uniformly distributed U's.

For 0 ≤ s ≤ n and a, b > 0 such that a + sb < 1, put

    P_{ns}(a, b) = P( U_{k:n} ≤ a + (k−1)b for 1 ≤ k ≤ s,  U_{s+1:n} > a + sb ),

with U_{n+1:n} ≡ 1.

Lemma 3.4.2 (Dempster). We have

    P_{ns}(a, b) = C(n,s) a (a + sb)^{s−1} (1 − a − sb)^{n−s}.

Proof. The assertion is true for n = 1. Also, for a general n, it holds for s = 0. For n ≥ 2 and s ≥ 1 we obtain upon conditioning on U_{1:n}:

    P_{ns}(a, b) = ∫_0^a n(1−y)^{n−1} P_{n−1,s−1}( (a+b−y)/(1−y), b/(1−y) ) dy.

Now, use induction on n to get the result.

From Dempster's formula we obtain further results on the relative deviation between F_n and F.

Corollary 3.4.3 (Rényi). For 0 < λ < 1 we have

    P( max_{1≤k≤n} n U_{k:n}/k > λ )
        = Σ_{s=0}^{n−1} C(n,s) (λ/n)^s (1+s)^{s−1} [1 − (1+s)λ/n]^{n−s}
        → Σ_{s=0}^∞ λ^s e^{−λ(s+1)} (s+1)^{s−1}/s!   as n → ∞.

Proof. From Dempster's formula, we have

    P( max_{1≤k≤n} n U_{k:n}/k > λ ) = Σ_{s=0}^{n−1} P_{ns}(λ/n, λ/n)
        = Σ_{s=0}^{n−1} C(n,s) (λ/n)^s (1+s)^{s−1} [1 − (1+s)λ/n]^{n−s}.

The limit result follows from an application of the dominated convergence theorem.

We now derive a formula for

    P( max_{1≤k≤n−1} n U_{k+1:n}/k > λ ),

which differs slightly from the probability appearing in Corollary 3.4.3. It is of interest because

    inf_{t>U_{1:n}} F_n(t)/t = [ max_{1≤k≤n−1} n U_{k+1:n}/k ]^{−1}

so that

    P( inf_{t>U_{1:n}} F_n(t)/t < 1/λ ) = P( max_{1≤k≤n−1} n U_{k+1:n}/k > λ ).

Now, conditioning on U_{1:n}, we obtain for 0 < λ < n

    P( max_{1≤k≤n−1} n U_{k+1:n}/k > λ )
        = P( U_{1:n} > λ/n )
        + ∫_0^{λ/n} n(1−y)^{n−1} Σ_{s=0}^{r} P_{n−1,s}( (λ/n − y)/(1−y), λ/(n(1−y)) ) dy,

where r is the largest integer with (r+1)λ < n, so that the Dempster formula applies to every summand. Applying Dempster's formula we get the following Corollary.

Corollary 3.4.4 (Chang). For 0 < λ < n,

    P( max_{1≤k≤n−1} n U_{k+1:n}/k > λ )
        = (1 − λ/n)^n + Σ_{i=1}^{r+1} C(n,i) (λ/n)^i (1 − iλ/n)^{n−i} (i−1)^{i−1},

with the convention 0^0 = 1 in the i = 1 term. Letting n → ∞, we get for all λ > 0

    P( max_{1≤k≤n−1} n U_{k+1:n}/k > λ ) → e^{−λ} + Σ_{k=1}^∞ (λ e^{−λ})^k (k−1)^{k−1}/k!.

Our last result in this section presents the exact distribution of

    D_n^+ = sup_{t∈R} [F_n(t) − F(t)] = sup_{0≤u≤1} [F̃_n(u) − u],

when F is continuous.

Theorem 3.4.5 (Birnbaum-Tingey). For all 0 < x ≤ 1 and n ≥ 1 we have

    P(D_n^+ ≤ x) = 1 − x Σ_{j=0}^{⌊n(1−x)⌋} C(n,j) (1 − x − j/n)^{n−j} (x + j/n)^{j−1}.

Proof. Obviously, D_n^+ ≤ x if and only if

    U_{i:n} ≥ i/n − x   for all i = n, n−1, …, K,

where the integer K is defined through

    K/n − x ≥ 0 > (K−1)/n − x.    (3.4.4)

In view of Lemma 3.3.2, P(D_n^+ ≤ x) equals the integral

    J(x, n, K) = n! ∫_{1−x}^1 ∫_{1−x−1/n}^{y_n} … ∫_{1−x−(n−K)/n}^{y_{K+1}} ∫_0^{y_K} … ∫_0^{y_2} dy_1 … dy_n.

By induction we obtain

    ∫_0^{y_K} … ∫_0^{y_2} dy_1 … dy_{K−1} = y_K^{K−1}/(K−1)!.

Inserting this expression into J(x, n, K) we obtain the recursion formula

    J(x, n, K) = J(x, n, K+1)
        − (1 − x − (n−K)/n)^K/K! · n! ∫_{1−x}^1 … ∫_{1−x−(n−K−1)/n}^{y_{K+2}} dy_{K+1} … dy_n
        ≡ J(x, n, K+1) − (1 − x − (n−K)/n)^K/K! · I(x, n, K+1),

whence

    J(x, n, K) = J(x, n, n) − Σ_{i=K}^{n−1} (1 − x − (n−i)/n)^i/i! · I(x, n, i+1).

Note that

    J(x, n, n) = 1 − (1 − x)^n,

while for the I's we obtain through induction on i:

    I(x, n, i+1) = n! x (x + (n−i)/n)^{n−i−1}/(n−i)!,   i = K, …, n−1.

Putting j = n−i, we obtain the assertion of the Theorem.
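The Birnbaum-Tingey formula can be checked against a direct simulation of D_n^+; for uniform data, D_n^+ = max_k (k/n − U_{k:n}). A sketch (plain Python; n, x and the replication count are arbitrary choices):

```python
import math
import random

def bt(x, n):
    # Birnbaum-Tingey: P(D_n^+ <= x)
    s = sum(math.comb(n, j) * (1 - x - j / n) ** (n - j) * (x + j / n) ** (j - 1)
            for j in range(int(n * (1 - x)) + 1))
    return 1 - x * s

random.seed(7)
n, x, reps = 10, 0.25, 100_000
count = 0
for _ in range(reps):
    U = sorted(random.random() for _ in range(n))
    count += max((k + 1) / n - U[k] for k in range(n)) <= x   # D_n^+
assert abs(count / reps - bt(x, n)) < 0.01
```

As a quick sanity check on the formula itself, for n = 2 and x = 1/2 it returns 3/4, which agrees with the elementary computation P(U_{2:2} ≥ 1/2) = 3/4.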

The distribution of D_n = max(D_n^+, D_n^−) is much more complicated. Because D_n^+ = D_n^− in distribution (when F is continuous), we have at least an inequality for the upper tail probabilities:

    P(D_n > x) ≤ 2 P(D_n^+ > x).


Chapter 4

U-Statistics

4.1 Introduction

Let X_1, …, X_n be a sample of i.i.d. random variables with d.f. F. For a given function φ, the empirical integral ∫φ dF_n then is a linear statistic to which standard results from probability such as the SLLN, CLT or the law of the iterated logarithm may be applied. As we have seen in previous sections there are many examples in which the statistic of interest is a multiple sum. The associated statistical functional then is of the form

    T(G) = ∫…∫ h(x_1, …, x_k) G(dx_1) … G(dx_k).

Here and in the following it will be assumed that the integral exists for G = F. For G = F_n, it is always finite.

The function h is called the kernel of T, while k is its degree. When k = 1, we are back at linear functionals, so here we shall focus on k > 1. For G = F_n, we obtain the V-statistic

    T(F_n) = n^{−k} Σ_{i_1=1}^n … Σ_{i_k=1}^n h(X_{i_1}, …, X_{i_k}).

When θ = T(F) is the parameter of interest, it is more common to look at

    U_n = ((n−k)!/n!) Σ_π h(X_{i_1}, …, X_{i_k}),

where summation takes place over all (incomplete) permutations π = (i_1, …, i_k) of pairwise distinct integers 1 ≤ i_j ≤ n, with length k ≤ n. The variable U_n is called a U-statistic of degree k, since U_n is an unbiased estimator of θ:

    EU_n = θ = T(F).
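For a concrete example of a U-statistic, take k = 2 and the symmetric kernel h(x, y) = (x − y)²/2, whose target is θ = Var(X); the resulting U_n coincides, for every fixed data set, with the unbiased sample variance. A sketch (plain Python; the data are an arbitrary example):

```python
import itertools
import math
import random
import statistics

random.seed(8)
X = [random.gauss(0.0, 1.0) for _ in range(8)]
n, k = len(X), 2

# U-statistic with kernel h(x,y) = (x-y)^2 / 2, degree k = 2
h = lambda x, y: (x - y) ** 2 / 2
Un = sum(h(X[i], X[j]) for i, j in itertools.permutations(range(n), k))
Un *= math.factorial(n - k) / math.factorial(n)

# for this kernel, U_n is exactly the unbiased sample variance
assert abs(Un - statistics.variance(X)) < 1e-12
```

The sum runs over all n!/(n−k)! ordered pairs of distinct indices, mirroring the definition above; for a symmetric kernel the unordered-pair form would give the same value.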
The goal of this chapter is to discuss some fundamental finite and large sample properties of U_n. We already remark now that in some applications the kernel h may vary with n, in which case special care is required when we let n tend to infinity.

Now, the main problem with U_n is its nonlinearity when k > 1. This limits the applicability of standard results from probability theory. Therefore, in Section 4.2, we shall look for the linear statistic Û_n which is closest to U_n, and determine the associated approximation error. Then, by expanding U_n into

    U_n = Û_n + (U_n − Û_n) ≡ Û_n + R_n,

we shall discuss when the remainder R_n may be neglected, if at all.
It is again useful to look at this problem from a geometric point of view. For this, let φ_1 and φ_2 be two F-integrable functions and let a_1, a_2 be two real numbers. If we combine the two linear statistics

    L_n^j = Σ_{i=1}^n φ_j(X_i),   j = 1, 2,

we obtain

    a_1 L_n^1 + a_2 L_n^2 = Σ_{i=1}^n (a_1 φ_1 + a_2 φ_2)(X_i),

which is a linear statistic with score function a_1φ_1 + a_2φ_2. In other words, the linear statistics form a subspace of all statistics. We enlarge this subspace to the subspace of all sums, where the transformations of the data may differ from i to i:

    L = Σ_{i=1}^n φ_i(X_i).    (4.1.1)

In the following section we present a famous result due to Hájek, who was able to compute the projection of a general square integrable statistic. Then we shall apply Hájek's projection lemma to U-statistics.


4.2 The Hájek Projection Lemma

The Projection Lemma which we will discuss now was established by Hájek (1968) and then applied to achieve asymptotic normality of linear rank statistics. Much earlier, Hoeffding (1948) had already used projection techniques to linearize U-statistics.

The original Hájek projection lemma only assumes independence of the underlying variables. Equality of the distributions is not required.
Lemma 4.2.1 (Hájek). Let X_1, …, X_n be independent random variables and let S = S(X_1, …, X_n) be a square-integrable function (statistic) of X_1, …, X_n. Set

    Ŝ = Σ_{i=1}^n E(S|X_i) − (n−1) ES.

Then Ŝ is a member of the previously mentioned subspace. Actually, Ŝ equals the projection of S and has the following properties:

(i) E(Ŝ) = E(S)

(ii) E(S − Ŝ)² = Var(S) − Var(Ŝ)

(iii) For any L of the form (4.1.1), one has

    E(S − L)² = E(S − Ŝ)² + E(Ŝ − L)²,

i.e., the left-hand side is minimized for L = Ŝ.


Proof. (i) is trivial, and (ii) follows from (iii), if we take for L the constant E(S) = E(Ŝ). To show (iii), assume E(S) = E(Ŝ) = 0 w.l.o.g. We have

    E[(S − Ŝ)(Ŝ − L)] = Σ_{i=1}^n E[(S − Ŝ)(E(S|X_i) − φ_i(X_i))]
        = Σ_{i=1}^n E[(E(S|X_i) − φ_i(X_i)) E(S − Ŝ|X_i)].

Because of independence,

    E(E(S|X_j)|X_i) = { E(S)       for i ≠ j
                      { E(S|X_i)   for i = j.

Hence, since E(S) = 0,

    E(Ŝ|X_i) = E(S|X_i),

i.e.

    E(S − Ŝ|X_i) = 0.

It follows that

    E[(S − Ŝ)(Ŝ − L)] = 0

and therefore (iii).
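The Pythagorean property (ii) can be illustrated by Monte Carlo for a simple nonlinear statistic. In the sketch below (plain Python; the choice S = X_1 + X_2 + X_1X_2 with centered uniform X_i is an arbitrary example) the projection works out to Ŝ = X_1 + X_2, and E(S − Ŝ)² should match Var(S) − Var(Ŝ):

```python
import random

random.seed(9)
reps = 200_000
S2 = Shat2 = diff2 = 0.0
for _ in range(reps):
    X1, X2 = random.random() - 0.5, random.random() - 0.5  # centered, EX = 0
    S = X1 + X2 + X1 * X2        # a nonlinear statistic with ES = 0
    Shat = X1 + X2               # E(S|X1) + E(S|X2) - (n-1)ES, here n = 2
    S2 += S * S
    Shat2 += Shat * Shat
    diff2 += (S - Shat) ** 2

# property (ii): E(S - Shat)^2 = Var(S) - Var(Shat)
assert abs(diff2 / reps - (S2 - Shat2) / reps) < 0.005
```

Here S − Ŝ = X_1X_2, so E(S − Ŝ)² equals (Var X_1)² = 1/144, which the simulation also reproduces.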



From time to time the statistic S also contains other variables which are measurable with respect to a σ-field F. In such a situation we are looking for an approximation of S through a statistic

    L = Σ_{i=1}^n Z_i,    (4.2.1)

where Z_i is a σ(X_i, F)-measurable function, i.e., Z_i may depend on X_i and on F, but not on the other observations. The following lemma is a conditional version of Hájek's lemma.

Assumption: E[E(S|X_i, F)|X_j, F] = E(S|F) for i ≠ j.

Lemma 4.2.2. Under ES² < ∞, let

    Ŝ = Σ_{i=1}^n E(S|X_i, F) − (n−1) E(S|F).

Then

(i) E(Ŝ|F) = E(S|F)

(ii) E[(S − Ŝ)²|F] = Var(S|F) − Var(Ŝ|F)

(iii) For any admissible L, one has

    E[(S − L)²|F] = E[(S − Ŝ)²|F] + E[(Ŝ − L)²|F],

i.e., Ŝ minimizes the left-hand side.

Proof. Needless to say, Ŝ is admissible. (i) is trivial, since F ⊂ σ(X_i, F). (ii) follows from (iii) upon setting

    L = E(S|F) = E(Ŝ|F).

To show (iii), assume E(S|F) = 0 = E(Ŝ|F) w.l.o.g. We then have

    E[(S − Ŝ)(Ŝ − L)|F] = Σ_{i=1}^n E[(S − Ŝ)(E(S|X_i, F) − Z_i)|F]
        = Σ_{i=1}^n E{ (E(S|X_i, F) − Z_i) E[S − Ŝ|X_i, F] | F }.

From the assumption,

    E[E(S|X_j, F)|X_i, F] = { E(S|F)         for i ≠ j
                            { E(S|X_i, F)    for i = j.

It follows

    E(Ŝ|X_i, F) = E(S|X_i, F),

whence

    E[(S − Ŝ)(Ŝ − L)|F] = 0.

Conclude (iii).

Equality (ii) is likely to be applied in the following way. Assume that as n → ∞ the right-hand side converges to zero in probability or P-a.s. Then so does the left-hand side. In case we may apply an integral convergence theorem, we infer S − Ŝ → 0 in L₂. Coming back to the right-hand side, the σ-fields may depend on n. In case they are increasing or decreasing, martingale arguments might be useful. Conditioning on F is always useful when S contains awkward F-measurable components.

4.3 Projection of U-Statistics

As before, consider

    U_n = ((n−k)!/n!) Σ_π h(X_{i_1}, …, X_{i_k}),

and denote with Û_n its Hájek projection. Define

    h_j(x) = ∫_{R^{k−1}} h(x_1, …, x_{j−1}, x, x_{j+1}, …, x_k) F(dx_1) … F(dx_{j−1}) F(dx_{j+1}) … F(dx_k).

If h is symmetric, i.e., invariant w.r.t. permutations of the x_i, the h_j's are all equal.

Lemma 4.3.1. We have

    Û_n = θ + n^{−1} Σ_{i=1}^n Σ_{j=1}^k [h_j(X_i) − θ],

where θ = E[h(X_1, …, X_k)], the target.


Proof. Obviously, for 1 ≤ i ≤ n,

    ((n−k)!/n!) Σ_π E[h(X_{i_1}, …, X_{i_k})|X_i]
        = ((n−k)!/n!) [ Σ_π^0 θ + Σ_{j=1}^k Σ_π^j h_j(X_i) ]
        = ((n−k)/n) θ + n^{−1} Σ_{j=1}^k h_j(X_i).

Here Σ^0 resp. Σ^j denote summation over all π not containing i resp. containing i at position j. It follows that

    Û_n = Σ_{i=1}^n E(U_n|X_i) − (n−1)θ
        = (n−k)θ + n^{−1} Σ_{i=1}^n Σ_{j=1}^k h_j(X_i) − (n−1)θ
        = θ + n^{−1} Σ_{i=1}^n Σ_{j=1}^k [h_j(X_i) − θ].

The proof is complete.


Now, putting

    h̃(x) = Σ_{j=1}^k [h_j(x) − θ],

we get

    Û_n = θ + n^{−1} Σ_{i=1}^n h̃(X_i).

The random variables h̃(X_1), …, h̃(X_n) are i.i.d. and centered. So

    n^{1/2}(Û_n − θ) → N(0, σ²)   in distribution    (4.3.1)

with

    σ² = ∫ h̃²(x) F(dx).

Also note that

    V_n ≡ U_n − Û_n = ((n−k)!/n!) Σ_π H(X_{i_1}, …, X_{i_k})

is again a U-statistic with kernel

    H(x_1, …, x_k) = h(x_1, …, x_k) − θ − (1/k) Σ_{j=1}^k h̃(x_j).

Lemma 4.3.2. The H̃ function associated with H vanishes: H̃ ≡ 0.

Proof. The statement follows from straightforward computations or the fact that

    V̂_n = Û_n − (Û_n)^∧ = Û_n − Û_n = 0,

since the projection of the linear statistic Û_n is Û_n itself.

U-statistics with kernel H such that H̃ ≡ 0 are henceforth called degenerate.
into
n ) + (Un U
n ) = (U
n ) + Vn .
Un = (U
Two dierent scenarios are possible.
Scenario 1:
n1/2 Vn 0 in probability and Un is nondegenerate.

120

CHAPTER 4. U -STATISTICS

In this case, (4.3.1) applies and we obtain


n1/2 (Un ) N (0, 2 ) in distribution.
Scenario 2:
vanishes and so does 2 in (4.3.1).
Un is degenerate, i.e., h
Now n1/2 is unlikely to be the correct standardizing factor, and it is not
clear how the correct approximation through a nondegenerate distribution
may look like.
A partial answer to these questions will be given in the following section,
when we discuss the variance of a U -statistic.

4.4 The Variance of a U-Statistic

In this section we compute the variance of a U-statistic and provide some useful upper bound. Again, let

    U_n = ((n−k)!/n!) Σ_π h(X_{i_1}, …, X_{i_k}),

where h is square-integrable and Σ_π denotes summation over the n!/(n−k)! permutations of k distinct elements from {1, …, n}. Set θ = EU_n. We may assume w.l.o.g. θ = 0. Clearly,

    U_n² = [(n−k)!/n!]² Σ_{π_1, π_2} h(X_{i_1}, …, X_{i_k}) h(X_{j_1}, …, X_{j_k}).

The sum over π_1 and π_2 may be written as

    Σ_{π_1, π_2} = Σ_{r=0}^k Σ^r_{π_1, π_2}.

Here Σ^r denotes summation over all permutations π_1 and π_2 (of length k) with exactly r indices in common. If these indices are in position τ_1 and τ_2, then by the i.i.d. property

    E[h(X_{i_1}, …, X_{i_k}) h(X_{j_1}, …, X_{j_k})]
        = ∫…∫_{R^{2k−r}} h(x_1, …, x_k) h(y_1, …, y_k) F(dx_1) … F(dx_{2k−r}),

in which the y's in position τ_2 agree with the x's in position τ_1 and are taken from x_{k+1}, …, x_{2k−r} otherwise. In particular, the expectation only depends on τ_1 and τ_2 and may thus be written as I(τ_1, τ_2). There are (n−r)!/(n−2k+r)! possibilities which lead to the same I(τ_1, τ_2). For r = 0, each I vanishes. Hence we come up with the following

Lemma 4.4.1. Let U_n be a (centered) square-integrable U-statistic. Then we have

    Var(U_n) = [(n−k)!/n!]² Σ_{r=1}^k (n−r)!/(n−2k+r)! Σ_{|τ_1|=r=|τ_2|} I(τ_1, τ_2).    (4.4.1)

For a symmetric h, I(τ_1, τ_2) ≡ I_r only depends on r. There are

    C(k,r) C(k,r) C(n,r) r! r!

possibilities for choosing τ_1 and τ_2. It follows that

    Var(U_n) = Σ_{r=1}^k C(n,k)^{−1} C(k,r) C(n−k, k−r) I_r.    (4.4.2)

Equality (4.4.2) is due to Hoeffding (1948).

Now, for a general kernel, by the Cauchy-Schwarz inequality,

    |I(τ_1, τ_2)| ≤ Eh²(X_1, …, X_k).

The r-th term in (4.4.1) is thus bounded from above in absolute value by

    C(n,k)^{−1} C(k,r) C(n−k, k−r) Eh² = O(n^{−r}).

In particular, if the sum over the I(τ_1, τ_2) vanishes for r = 1, …, s−1, then

    Var(U_n) = O(n^{−s}).    (4.4.3)

In some applications we need the last bound for a U_n in which the kernel h may depend on n. Under the assumptions leading to (4.4.3), we then get

    Var(U_n) = O(n^{−s} c_n),

when c_n is an upper bound for Eh_n²(X_1, …, X_k).
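Hoeffding's formula (4.4.2) can be checked by simulation for a kernel whose I_r are available in closed form. Below (plain Python; an arbitrary illustrative example) we take k = 2 and h(x, y) = x + y on Uniform(0,1) data, so that I_1 = Var(X_1) = 1/12, I_2 = Var(h(X_1, X_2)) = 2/12, and U_n reduces to twice the sample mean:

```python
import math
import random

random.seed(10)
n, reps = 8, 40_000

# closed-form ingredients for k = 2, h(x,y) = x + y, X ~ Uniform(0,1)
I1, I2 = 1 / 12, 2 / 12
var_formula = (math.comb(2, 1) * math.comb(n - 2, 1) * I1
               + math.comb(2, 2) * math.comb(n - 2, 0) * I2) / math.comb(n, 2)

vals = []
for _ in range(reps):
    X = [random.random() for _ in range(n)]
    vals.append(2 * sum(X) / n)        # U_n for this kernel is 2 * mean(X)
m = sum(vals) / reps
v = sum((u - m) ** 2 for u in vals) / reps

assert abs(v - var_formula) < 0.005    # here var_formula = 1/(3n)
```

For this linear kernel the formula collapses to Var(2X̄) = 4·Var(X)/n, so the check also confirms the algebra of (4.4.2).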


We are now in a position to discuss Scenario 1 from the previous section in full length. So suppose that h̃ does not vanish. We already noticed that V_n is a U-statistic with kernel H. The random variable H(X_{i_1}, …, X_{i_k}) has expectation zero, and for two permutations (i_1, …, i_k) and (j_1, …, j_k) having only one index in common (say i_l = j_s),

    E[H(X_{i_1}, …, X_{i_k})|X_{j_1}, …, X_{j_k}]
        = h_l(X_{j_s}) − (1/k) Σ_{j=1}^k E[h̃(X_{i_j})|X_{j_1}, …, X_{j_k}]
        = h_l(X_{j_s}) − (1/k) h̃(X_{j_s}).    (4.4.4)

For a symmetric kernel the term in (4.4.4) vanishes so that

    E[H(X_{i_1}, …, X_{i_k}) H(X_{j_1}, …, X_{j_k})] = 0.

For a general kernel h and for a given π = (j_1, …, j_k), the common index equal to j_s may sit in any position l, so that the sum over the terms in (4.4.4) also vanishes. We may conclude that for r = 1 the sum over the I(τ_1, τ_2)'s in (4.4.1) equals zero. From (4.4.3) we obtain

    Var(V_n) = EV_n² = O(n^{−2}).

Altogether this yields

Theorem 4.4.2. Assume that U_n is a nondegenerate square integrable U-statistic. Then

    n^{1/2}(U_n − θ) → N(0, σ²)   in distribution,

with

    σ² = ∫ h̃²(x) F(dx).

4.5 U-Processes: A Martingale Approach

Assume that X_1, …, X_n is a (finite) sequence of independent identically distributed (i.i.d.) random variables with distribution function (d.f.) F, defined on a probability space (Ω, A, P). Let h be a function (the kernel) on m-dimensional Euclidean space and set (for n ≥ m)

    U_n = ((n−m)!/n!) Σ_π h(X_{i_1}, …, X_{i_m}),

where Σ_π extends over all multiindices π = (i_1, …, i_m) of pairwise distinct 1 ≤ i_j ≤ n, 1 ≤ j ≤ m. Commonly U_n is called a U-statistic. U-statistics were introduced by Hoeffding (1948). They have been extensively investigated over the last 40 years. Most of the basic theory is contained in Serfling (1980), Denker (1985) and Lee (1990). See also Randles and Wolfe (1979).

More recently much attention has been given to what has been called a U-(statistic) process. For ease of representation we shall restrict ourselves to degree m = 2. The U-statistic process is then defined as follows: for real u and v set

    U_n(u, v) = (1/(n(n−1))) Σ_{1≤i≠j≤n} h(X_i, X_j) 1_{X_i≤u, X_j≤v}.

Write

    F_n(x) = (1/n) Σ_{i=1}^n 1_{X_i≤x},   x ∈ R,

the empirical d.f. of the sample. Then U_n(u, v) becomes (assuming no ties)

    U_n(u, v) = (n/(n−1)) ∫_{−∞}^u ∫_{−∞}^v h(x, y) 1_{x≠y} F_n(dx) F_n(dy).
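The two representations of U_n(u, v) — the normalized double sum and the F_n-integral form — agree exactly for tie-free data, which is easy to confirm numerically. A sketch (plain Python; the kernel and thresholds are arbitrary choices):

```python
import random

random.seed(11)
n = 30
X = [random.random() for _ in range(n)]
h = lambda x, y: x * y + 1.0
u, v = 0.6, 0.8

# double-sum definition of U_n(u, v)
ds = sum(h(X[i], X[j]) for i in range(n) for j in range(n)
         if i != j and X[i] <= u and X[j] <= v) / (n * (n - 1))

# F_n-integral form; the indicator 1{x != y} removes the diagonal
integral = sum(h(x, y) for x in X for y in X
               if x != y and x <= u and y <= v) / n ** 2
integral *= n / (n - 1)

assert abs(ds - integral) < 1e-12
```

With continuous data the values X_i are almost surely distinct, so excluding i = j and excluding x = y remove exactly the same diagonal terms.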

Now, a standard method to analyze the (large sample) distributional behavior of U_n is to write U_n as

    U_n = Û_n + R_n,

in which Û_n is the Hájek projection of U_n and the remainder R_n = U_n − Û_n is a degenerate U-statistic that is asymptotically negligible when compared with Û_n. In fact, provided that h has a finite pth moment and zero mean,

    n^{1/2} Û_n → N(0, σ²)   in distribution

and

    E|R_n|^p ≤ C n^{−p}.

See Serfling [(1980), page 188]. It follows that also

    n^{1/2} U_n → N(0, σ²)   in distribution.

Of course, this approach immediately applies to U_n(u, v) for each (u, v) fixed. Just replace h(x, y) by h(x, y) 1_{x≤u, y≤v}. Unfortunately, this is insufficient for handling U_n(u, v) [resp. R_n(u, v)] as a process in (u, v). Particularly, the (pointwise) Hájek approach does not yield bounds for sup_{u,v} |R_n(u, v)|. Such bounds are, however, extremely useful in applications. For example, in survival analysis, U-statistic processes or variants of them appear in the context of estimating the lifetime distribution F and the cumulative hazard function Λ when the data are censored or truncated [cf. Lo and Singh (1986), Lo, Mack and Wang (1989), Major and Rejtő (1988) and Chao and Lo (1988)]. In Lo and Singh (1986) the analysis of the remainder term incorporated known global and local properties of empirical processes. In Lo, Mack and Wang (1989) the error bounds were improved by applying a sharp moment bound for degenerate U-statistics due to Dehling, Denker and Philipp (1987). In Major and Rejtő (1988) a bound for sup_u |R_n(u, 1)| of large deviation type due to Major (1988) was applied, which required h to be bounded. In all these papers estimation of F (resp. Λ) could only be carried through on intervals strictly contained in the support of the distribution of the observed data; similarly in Chao and Lo (1988) for truncated data situations. This general drawback mainly arose because of lack of a sharp bound for sup_{u,v} |R_n(u, v)| when the kernel h is not necessarily bounded.

Classes of degenerate U-statistics also have been studied, from a different point of view, by Nolan and Pollard (1987). In their Theorem 6 they derive an upper bound for the mean of the supremum by first decoupling the U-process of interest and then using a chaining argument conditionally on the observations. Now, by Hölder, a more efficient inequality would be one relating the pth order mean of the supremum to the pth order mean of the envelope function, p ≥ 2. At least this is a typical feature of many other maximal inequalities. We also refer to de la Peña (1992) and the literature cited there. In these papers the main emphasis is on relating the maximum of interest to the maximum of a decoupled process. No explicit bounds for a degenerate U-statistic process are derived that are comparable to ours. Note, however, that in applications the leading (Hájek) part is well understood and it is the degenerate part that creates the more serious problems.

In this section we shall employ martingale methods to provide a maximal bound satisfying the above requirements. As a consequence we would be able to improve the a.s. representations of the product-limit estimators of F for censored and truncated data as discussed above; see Stute (1993, 1994).


Denote by $\hat U_n(u,v)$ the Hájek projection of $U_n(u,v)$. As for proofs, unfortunately, as a process in $(u,v)$,
$$U_n(u,v) - \hat U_n(u,v)$$
does not enjoy any particular properties to which standard maximal inequalities could be applied. As another possibility, assume for a moment that $h$ is nonnegative. Then $U_n(u,v)$ is nondecreasing in $(u,v)$ and adapted to the filtration
$$\mathcal F_{u,v} = \sigma\big(1_{\{X_i\le x\}} : x \le \max(u,v)\big).$$
Let $C_n(u,v)$ denote the compensator in the Doob–Meyer decomposition of $U_n(u,v)$; see, for example, Dozzi (1981). At first sight one might expect
$$U_n(u,v) - C_n(u,v)$$
to be a two-parameter martingale to which standard maximal bounds could be applied; see Cairoli and Walsh (1975). A serious drawback of this approach is that with this choice of $\mathcal F$:
1. The process $(U_n(\cdot) - C_n(\cdot), \mathcal F)$ does not satisfy the fundamental conditional independence property (F4) in Cairoli and Walsh (1975).
2. The compensator $C_n(\cdot)$ is still a U-statistic process rather than a sum of i.i.d. processes.
3. $U_n(\cdot) - C_n(\cdot)$ turns out not to be a degenerate U-statistic.
The last comments were meant only to express the author's difficulties, when writing the paper, in finding a proper decomposition of $U_n(u,v)$ in which the remainder term (at least the most interesting part of it) is both a two-parameter (strong) martingale in $(u,v)$ and a degenerate U-statistic for $(u,v)$ fixed. Given such a decomposition we could then apply standard maximal inequalities for (strong) two-parameter martingales. Having thus replaced $\sup_{u,v}|R_n(u,v)|$ by a single $R_n(u,v)$, $E|R_n(u,v)|^p$ could be further dealt with by applying Burkholder's inequality.
Furthermore, in our analysis, the Doob–Meyer decomposition of the process
$$\sum_j h(s, X_j)\,1_{\{X_j\le v\}}, \qquad s \text{ fixed},$$
will be employed. Finally, some global bounds for empirical d.f.'s à la Dvoretzky–Kiefer–Wolfowitz (1956) will be required.


Now, the process $U_n(u,v)$ may be written as
$$n(n-1)U_n(u,v) = \sum_{1\le i<j\le n} h(X_i,X_j)1_{\{X_i\le u,\,X_j\le v\}} + \sum_{1\le j<i\le n} h(X_i,X_j)1_{\{X_i\le u,\,X_j\le v\}} \equiv I_n(u,v) + II_n(u,v).$$

The following theorem contains the key representation of $I_n(u,v)$ in terms of a sum of independent random processes.

Theorem 4.5.1. Assume $h \in L^p(F\otimes F)$, with $p \ge 2$. Then we have
$$I_n(u,v) = \sum_{1\le i<j\le n}\Big[\int_{-\infty}^{u} h(x,X_j)1_{\{X_j\le v\}}\,F(dx) + \int_{-\infty}^{v} h(X_i,y)1_{\{X_i\le u\}}\,F(dy) - \int_{-\infty}^{u}\int_{-\infty}^{v} h(x,y)\,F(dx)F(dy)\Big] + R_n(u,v),$$
where for each $u_0, v_0$,
$$E\Big[\sup_{u\le u_0,\,v\le v_0}|R_n(u,v)|^p\Big] \le C^p n^p. \qquad(4.5.1)$$
The constant $C$ satisfies
$$C \le C^*\Big(\int_{-\infty}^{u_0}\int_{-\infty}^{v_0} |h(x,y)|^p\,F(dx)F(dy)\Big)^{1/p}$$
with $C^*$ depending only on $p$.
A similar representation also holds for $II_n(u,v)$. Putting these together we get the following corollary.

Corollary 4.5.2. Under the assumptions of Theorem 4.5.1,
$$n(n-1)U_n(u,v) = n(n-1)\Big[\int_{-\infty}^{u}\int_{-\infty}^{v} h(x,y)\,F_n(dx)F(dy) + \int_{-\infty}^{u}\int_{-\infty}^{v} h(x,y)\,F(dx)F_n(dy) - \int_{-\infty}^{u}\int_{-\infty}^{v} h(x,y)\,F(dx)F(dy)\Big] + R_n(u,v),$$
where the remainder satisfies (4.5.1), with $C$ replaced by $2C$.


Since (assuming no ties)
$$n(n-1)U_n(u,v) = n^2\int_{-\infty}^{u}\int_{-\infty}^{v} h(x,y)\,1_{\{x\ne y\}}\,F_n(dx)F_n(dy),$$
we may write the equation in Corollary 4.5.2 as
$$\frac{n}{n-1}\int_{-\infty}^{u}\int_{-\infty}^{v} h(x,y)\,1_{\{x\ne y\}}\,F_n(dx)F_n(dy) = \int_{-\infty}^{u}\int_{-\infty}^{v} h(x,y)\,F_n(dx)F(dy) + \int_{-\infty}^{u}\int_{-\infty}^{v} h(x,y)\,F(dx)F_n(dy) - \int_{-\infty}^{u}\int_{-\infty}^{v} h(x,y)\,F(dx)F(dy) + \frac{R_n(u,v)}{n(n-1)}.$$
Inequality (4.5.1) together with the Markov inequality yields, with probability 1,
$$\sup_{u\le u_0,\,v\le v_0}|R_n(u,v)| = o\big(n^{1+1/p}(\ln n)^{\delta}\big), \qquad(4.5.2)$$
whenever $\delta$ satisfies $\delta p > 1$. Furthermore, if $h$ is bounded, (4.5.1) may be applied for each $p \ge 2$.


So far we kept $u_0$ and $v_0$ fixed. In such a situation integrability of $|h|^p$ only up to $(u_0,v_0)$ is sufficient. Actually, it may happen that $(u_0,v_0) = (u_n,v_n)$ varies with $n$ in such a way that $|h|^p$ is not integrable over the whole plane, but $\int_{-\infty}^{u_n}\int_{-\infty}^{v_n}|h|^p\,dF\,dF \to \infty$ at a prescribed rate. Theorem 4.5.1 is particularly useful also in this case. On the other hand, if either $u_n$ or $v_n$ becomes small as $n\to\infty$ (such situations occur quite often in nonparametric curve estimation), then the integral
$$\int_{-\infty}^{u_n}\int_{-\infty}^{v_n} |h(x,y)|^p\,F(dx)F(dy)$$
also becomes small, to the effect that the bound in (4.5.2) may be replaced by smaller ones. The last remarks also apply to the results that follow.
Interestingly enough, (4.5.2) may be improved a lot. This is due to the fact that according to Berk (1966) a sequence of normalized U-statistics is a reverse-time martingale. Utilizing this, we get the following result.

Theorem 4.5.3. Under the assumptions of Theorem 4.5.1, with probability 1,
$$\sup_{u\le u_0,\,v\le v_0}|R_n(u,v)| = o\big(n(\ln n)^{\delta}\big)$$
whenever $\delta p > 1$. For bounded $h$'s, we may therefore take any $\delta > 0$.

With some extra work the logarithmic factor may be pushed down so as to get a bounded LIL. The necessary methodology may be found, for a fixed U-statistic rather than a process, in a notable paper by Dehling, Denker and Philipp (1986). After truncation, they applied their moment inequality, at stage $n$, with a $p = p_n$ depending on $n$ such that $p_n \to \infty$ slowly, to the effect that for a bounded LIL the moment inequality serves the same purpose as an exponential bound (personal communication by M. Denker). Since this method is well established now, we need not dwell on this here again.
In the next theorem we are concerned with a two-sample situation. Let $X_1,\ldots,X_n$ be i.i.d. with common d.f. $F$ and let, independently of the $X$'s, $Y_1,\ldots,Y_m$ be another i.i.d. sequence with common d.f. $G$. We shall derive a representation of the process
$$nmU_{nm}(u,v) = \sum_{i=1}^{n}\sum_{j=1}^{m} h(X_i,Y_j)1_{\{X_i\le u,\,Y_j\le v\}}.$$
Theorem 4.5.4. Assume $h \in L^p(F\otimes G)$, with $p \ge 2$. Then we have
$$nmU_{nm}(u,v) = \sum_{i=1}^{n}\sum_{j=1}^{m}\Big[\int_{-\infty}^{u} h(x,Y_j)1_{\{Y_j\le v\}}\,F(dx) + \int_{-\infty}^{v} h(X_i,y)1_{\{X_i\le u\}}\,G(dy) - \int_{-\infty}^{u}\int_{-\infty}^{v} h(x,y)\,F(dx)G(dy)\Big] + R_{nm}(u,v),$$
where for each $u_0, v_0$,
$$E\sup_{u\le u_0,\,v\le v_0}|R_{nm}(u,v)|^p \le [C^2 nm]^{p/2}.$$
The constant $C$ satisfies
$$C \le C^*\Big(\int_{-\infty}^{u_0}\int_{-\infty}^{v_0}|h(x,y)|^p\,F(dx)G(dy)\Big)^{1/p}.$$
The analogue of Theorem 4.5.3 is only formulated for $m = n$.

Theorem 4.5.5. Under the assumptions of Theorem 4.5.4, with probability 1 as $n\to\infty$,
$$\sup_{u\le u_0,\,v\le v_0}|R_{nn}(u,v)| = o\big(n(\ln n)^{\delta}\big)$$
whenever $\delta p > 1$.
Another variant (resp. extension) of Theorem 4.5.1, which is extremely useful in applications, comes up when, in addition to $X_i$, there is a $Y_i$ paired with $X_i$. Typically $X_i$ is correlated with $Y_i$. We may then form
$$I_n(u,v) = \sum_{1\le i<j\le n} h(X_i,Y_j)1_{\{X_i\le u,\,Y_j\le v\}}.$$
Clearly, this $I_n$ equals the $I_n$ from Theorem 4.5.1 if $X_i = Y_i$; similarly for $II_n(u,v)$. Theorem 4.5.6 below is the corresponding extension of Theorem 4.5.1 to paired observations.


Theorem 4.5.6. Assume that $(X_i,Y_i)$, $1\le i\le n$, is an i.i.d. sample from some bivariate d.f. $H$ with marginals $F$ and $G$. Assume $h\in L^p(F\otimes G)$ with $p\ge2$. Then we have
$$I_n(u,v) = \sum_{1\le i<j\le n}\Big[\int_{-\infty}^{u} h(x,Y_j)1_{\{Y_j\le v\}}\,F(dx) + \int_{-\infty}^{v} h(X_i,y)1_{\{X_i\le u\}}\,G(dy) - \int_{-\infty}^{u}\int_{-\infty}^{v} h(x,y)\,F(dx)G(dy)\Big] + R_n(u,v),$$
where $R_n$ satisfies (4.5.1) and the $h$-integral in the bound is taken w.r.t. $F\otimes G$. The assertion of Theorem 4.5.3 also extends to the present case.
Remark 4.5.7. The results of this section may be extended to U-statistic processes of degree $m > 2$, but proofs become more complicated and the notation even more cumbersome. As far as applications are concerned, however, the case $m = 2$ is by far the most important one.

We end this section by presenting five examples to which the theorems may be applied. For these we remark that in the formulation of the previous results, the point infinity could also be included in the parameter set. What matters is that the parameter sets of the coordinate spaces need to be linearly ordered.
Example 4.5.8. (Censored data). In the random censorship model the actually observed data are $Z_i = \min(X_i, C_i)$ and $\delta_i = 1_{\{X_i\le C_i\}}$, where $X_i$ is the variable of interest (the lifetime), which is at risk of being censored by $C_i$, the censoring variable. For estimation of the cumulative hazard function of $X$, a crucial role is played by the (one-parameter) process
$$I_n(u) = \sum_{1\le i<j\le n} \frac{\delta_i\,1_{\{Z_j>Z_i\}}}{(1-H)^2(Z_i)}\,1_{\{Z_i\le u\}}.$$
Here $H$ is the d.f. of $Z_i$. If we introduce
$$Y_i = \begin{cases} Z_i & \text{if } \delta_i = 1,\\ \infty & \text{if } \delta_i = 0,\end{cases}$$
then
$$\delta_i 1_{\{Z_i\le u\}} = 1_{\{Y_i\le u\}}, \qquad u\in\mathbb R,$$
and, therefore,
$$I_n(u) = \sum_{1\le i<j\le n} \frac{1_{\{Z_j>Y_i\}}}{(1-H)^2(Y_i)}\,1_{\{Y_i\le u\}} = I_n(u,\infty)$$
for an appropriate kernel $h$. The fact that $Y_i$ is an extended random variable is of no importance to us. The theorems have been formulated for real variables just for the sake of convenience, but may be generalized easily to the foregoing setup. This example is discussed in greater detail in Stute (1994).
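The reduction to the extended variables $Y_i$ is easy to check numerically. The sketch below is our own illustration (exponential lifetime and censoring rates are assumed, and the true $H$ is plugged in for simplicity); it evaluates the process both in its original form and through the $Y_i$:

```python
import math
import random

random.seed(0)
n = 40
X = [random.expovariate(1.0) for _ in range(n)]   # lifetimes
C = [random.expovariate(0.5) for _ in range(n)]   # censoring times
Z = [min(x, c) for x, c in zip(X, C)]
delta = [1 if x <= c else 0 for x, c in zip(X, C)]

# true survival function of Z = min(X, C) in this toy model
def one_minus_H(z):
    return math.exp(-1.5 * z)

# extended variables: Y_i = Z_i if delta_i = 1, +infinity otherwise
Y = [z if d == 1 else math.inf for z, d in zip(Z, delta)]

def I_n(u):
    # sum over i < j of delta_i 1{Z_j > Z_i} (1-H)^{-2}(Z_i) 1{Z_i <= u}
    return sum(delta[i] * (Z[j] > Z[i]) / one_minus_H(Z[i]) ** 2 * (Z[i] <= u)
               for i in range(n) for j in range(i + 1, n))

def I_n_via_Y(u):
    # the same sum rewritten through Y_i: 1{Z_j > Y_i} (1-H)^{-2}(Y_i) 1{Y_i <= u}
    return sum((Z[j] > Y[i]) / one_minus_H(Y[i]) ** 2
               for i in range(n) for j in range(i + 1, n) if Y[i] <= u)
```

Both versions agree exactly, since $\delta_i 1_{\{Z_i\le u\}} = 1_{\{Y_i\le u\}}$.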
Example 4.5.9. (Truncated data). Here one observes $(X_i,Y_i)$ only if $Y_i \le X_i$. Though originally $X_i$ is assumed independent of $Y_i$, the actually observed pair has dependent components. For estimation of the cumulative hazard function the following process constitutes a crucial part in the analysis:
$$I_n(u) = \sum_{1\le i<j\le n} \frac{1_{\{Y_j\le X_i\le X_j\}}}{C^2(X_i)}\,1_{\{X_i\le u\}}$$
for some particular function $C$. Obviously $I_n$ may be decomposed into two parts, each of which is of the type discussed in Theorem 4.5.6, with $v = \infty$. See Stute (1993) for a thorough discussion of this example.
Example 4.5.10. (Two samples). In the situation of Theorem 4.5.4, the Wilcoxon two-sample rank test for $H_0: F = G$ versus $H_1: F \ne G$ is based on the U-statistic
$$T_{nm} = \sum_{i=1}^{n}\sum_{j=1}^{m} h(X_i, Y_j),$$
with $h(x,y) = 1_{\{x\le y\}}$. In our previous notation,
$$T_{nm} = nmU_{nm}(\infty,\infty).$$
We may now consider the associated process $U_{nm}$ in order to construct tests for $H_0$ versus $H_1$ which are based on the whole of $U_{nm}$ rather than only on $T_{nm}$. It would be interesting to compare the power of these tests with that of the standard Wilcoxon test.
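For concreteness, $T_{nm}$ with $h(x,y)=1_{\{x\le y\}}$ can be computed either as a double sum or, equivalently, via sorting and binary search; the sketch below is our own illustration with simulated data under an assumed shift alternative:

```python
import bisect
import random

random.seed(1)
X = [random.gauss(0.0, 1.0) for _ in range(30)]
Y = [random.gauss(0.3, 1.0) for _ in range(25)]

def T_brute(X, Y):
    # T_nm = sum_i sum_j 1{X_i <= Y_j}
    return sum(1 for x in X for y in Y if x <= y)

def T_fast(X, Y):
    # count the same pairs with a sorted copy of X and binary search
    xs = sorted(X)
    return sum(bisect.bisect_right(xs, y) for y in Y)
```

Both computations must return the same integer count.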
Example 4.5.11. (Trimmed U-statistics). Gijbels, Janssen and Veraverbeke (1988) investigated so-called trimmed U-statistics
$$n(n-1)U_n^0(\alpha,\beta) = \sum_{i=1}^{[n\alpha]}\sum_{j=1}^{[n\beta]} h(X_{i:n}, X_{j:n}),$$
where $X_{1:n}\le\ldots\le X_{n:n}$ denotes the ordered sample. Since
$$n(n-1)U_n^0(\alpha,\beta) = \sum_{i=1}^{n}\sum_{j=1}^{n} h(X_i,X_j)\,1_{\{X_i\le F_n^{-1}(\alpha),\,X_j\le F_n^{-1}(\beta)\}},$$
we see that $U_n^0(\alpha,\beta)$ is related to $U_n(F_n^{-1}(\alpha), F_n^{-1}(\beta))$, neglecting the sum over $i = j$ for a moment. Observe that $u = F_n^{-1}(\alpha)$ and $v = F_n^{-1}(\beta)$ are random in this case. The (uniform) representation of $U_n$ in terms of a (simple) sum of independent random processes together with their tightness in the two-dimensional Skorokhod space allows for a simple analysis of $U_n^0(\alpha,\beta)$, not just for a fixed $(\alpha,\beta)$ [as done in Gijbels, Janssen and Veraverbeke (1988)], but as a process in $(\alpha,\beta)$. Details are omitted.
U-statistic processes also occur in the analysis of linear rank statistics. We only mention the possibility of representing a linear signed rank statistic (up to an error term) as a sum of i.i.d. random processes [cf. Sen (1981), Theorem 5.4.2].

Example 4.5.12. (Linear signed rank statistics). For a sample $X_1,\ldots,X_n$ and a proper score function $\varphi$, it is required to represent the double sum
$$\sum_{1\le i\ne j\le n} \varphi(X_i)\,1_{\{|X_j|\le X_i\}}\,1_{\{0\le X_i\le x\}}.$$
We see that Theorem 4.5.6 applies with $Y_j = |X_j|$.


Example 4.5.13. The Lorenz functional
$$L(p) = \int_0^p F^{-1}(u)\,du$$
serves as a tool to measure the economic imbalances in a population. It is also of interest to determine its variance. This leads to the function
$$L_1(p) = \int_0^p\int_0^p \big[F^{-1}(u) - F^{-1}(v)\big]^2\,du\,dv.$$
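An empirical plug-in version $L_n(p) = \int_0^p F_n^{-1}(u)\,du$ is computable in closed form from the order statistics, since $F_n^{-1}$ equals $X_{k:n}$ on $((k-1)/n,\,k/n]$. A minimal sketch (our own helper, not notation from the text):

```python
def lorenz(sample, p):
    # L_n(p) = integral_0^p F_n^{-1}(u) du, evaluated exactly as a sum of
    # order statistics times the widths of the quantile intervals they cover
    xs = sorted(sample)
    n = len(xs)
    total, covered = 0.0, 0.0
    for k, x in enumerate(xs, start=1):
        width = min(p, k / n) - covered
        if width <= 0:
            break
        total += x * width
        covered += width
    return total
```

Note that $L_n(1) = \bar X_n$, the sample mean.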

Chapter 5

Statistical Functionals
5.1 Empirical Equations

Empirical equations have been addressed briefly in Section 1.4. For a more detailed discussion, consider a parametric family $\mathcal M = \{f_\theta : \theta\in\Theta\}$ of densities w.r.t. a measure $\mu$. Assume that the true distribution function $F$ belongs to $\mathcal M$, i.e., $F = F_{\theta_0}$ with density $f_{\theta_0}$ for some (unknown) $\theta_0\in\Theta$. Set
$$K(\theta,\theta_0) = \int \ln\frac{f_{\theta_0}(x)}{f_\theta(x)}\,F(dx) = \int\Big[\ln\frac{f_{\theta_0}(x)}{f_\theta(x)}\Big] f_{\theta_0}(x)\,\mu(dx).$$
This quantity is called the Kullback–Leibler information. In the context of this section the following property of $K$ is important.

Lemma 5.1.1. We have $K(\theta,\theta_0)\ge0$, with $K(\theta,\theta_0)=0$ if and only if $\theta=\theta_0$.

The assertion of this lemma may be reformulated as follows: consider the mapping
$$T_F:\ \theta \mapsto \int \ln f_\theta(x)\,F(dx),$$
which of course depends on $F$. Denote with $T(F)$ the maximizer of $T_F$, i.e.,
$$T(F) = \arg\max_{\theta\in\Theta} T_F(\theta).$$
When $F = F_{\theta_0}$, Lemma 5.1.1 yields $T(F_{\theta_0}) = \theta_0$. This property is frequently called Fisher consistency. Since $F$ is unknown, so is $T_F$. Given an i.i.d.

sample from $F$, we may compute $F_n$ and consider the mapping $T_{F_n}$ instead of $T_F$:
$$T_{F_n}(\theta) = \int \ln f_\theta(x)\,F_n(dx) = \frac1n\sum_{i=1}^{n}\ln f_\theta(X_i).$$
The parameter $T(F_n)$ is the maximum likelihood estimator (MLE) of $\theta_0$. For our discussion of $T(F_n) = \arg\max_\theta T_{F_n}(\theta)$ it is sufficient to assume that $T(F_n)$ is well defined. We see again that the target may be obtained as $T(F)$, while the estimator equals $T(F_n)$. Under some smoothness assumptions, the MLE solves the (vector) equation
$$\frac1n\sum_{i=1}^{n}\frac{\partial\ln f_\theta(X_i)}{\partial\theta} = \int\varphi(x,\theta)\,F_n(dx) = 0, \qquad(5.1.1)$$
with
$$\varphi(x,\theta) = \frac{\partial\ln f_\theta(x)}{\partial\theta}.$$
Equation (5.1.1) is an example of an empirical equation. The same principle also applies to other $\varphi$'s; see below. In such a situation the resulting estimator is called an M- (or maximum likelihood type) estimator, and the associated functional attaching to a d.f. $G$ the solution $\theta$ of
$$\int\varphi(x,\theta)\,G(dx) = 0$$
is called an M-functional. As it will turn out, the class of M-functionals may be substantially enlarged if we let $\varphi$ also depend on $G$:
$$L(\theta,G) \equiv \int\varphi(x,\theta,G)\,G(dx).$$
The parameter of interest is the solution of $L(\theta,F) = 0$, and the estimator is obtained as the solution of the empirical equation
$$L(\theta,F_n) = 0.$$
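As a small numerical illustration (our own sketch, with an assumed exponential model $f_\theta(x)=\theta e^{-\theta x}$, for which $\varphi(x,\theta) = 1/\theta - x$), the empirical equation can be solved by any root finder; here its root is $1/\bar X_n$:

```python
def solve_empirical_equation(phi, data, lo, hi):
    # bisection for the root of L(theta, F_n) = (1/n) sum_i phi(X_i, theta) = 0;
    # assumes L changes sign exactly once on [lo, hi]
    def L(theta):
        return sum(phi(x, theta) for x in data) / len(data)
    sign_lo = L(lo) > 0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if (L(mid) > 0) == sign_lo:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

data = [0.5, 1.2, 0.3, 2.0, 0.8]
# phi(x, theta) = d/dtheta ln(theta * exp(-theta * x)) = 1/theta - x
theta_hat = solve_empirical_equation(lambda x, t: 1.0 / t - x, data, 0.01, 100.0)
```

For this sample $\bar X_n = 0.96$, so the solver should return $1/0.96$.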
We shall discuss several important estimators which all fit into this scheme:

R-estimators: Let $J$ be a nondecreasing function on $[0,1]$ such that $J(1-t) = -J(t)$ for $0\le t\le1$. Put
$$L(\theta,G) = \int J\Big(\frac{G(x)+1-G(2\theta-x)}{2}\Big)\,G(dx).$$
If $F$ is symmetric about $\theta_0$, then $L(\theta_0,F) = 0$, i.e., the center of symmetry $\theta_0$ may be characterized as a solution of this equation. The estimator satisfies $L(\theta,F_n) = 0$. For $J(t) = t - \frac12$, we obtain
$$\hat\theta_n = \mathrm{med}\Big\{\frac{X_i+X_j}{2} : 1\le i,j\le n\Big\},$$
the Hodges–Lehmann estimator.
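A direct $O(n^2)$ computation of the Hodges–Lehmann estimator (our own sketch; per the formula above, the median runs over all pairs, including $i = j$):

```python
import statistics

def hodges_lehmann(xs):
    # median of (X_i + X_j)/2 over all 1 <= i, j <= n
    return statistics.median((xi + xj) / 2 for xi in xs for xj in xs)
```

For the sample $\{1, 2, 3\}$ the nine pairwise averages have median 2.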


L-M-estimators: We put
$$\varphi(x,\theta,G) = J(G(x))\,\varphi_0(x-\theta).$$
For $J\equiv1$, we are back at the M-estimator. For $\varphi_0(y) = y$, we arrive at so-called L-estimators.

Minimum-distance estimators: Consider
$$d^2(G,F_\theta) = \int[G(x)-F_\theta(x)]^2\,G(dx).$$
The parameter of interest is the $\theta$ minimizing $d^2(F,F_\theta)$. Under smoothness this minimizer satisfies
$$\int\varphi(x,\theta,F)\,F(dx) = 0$$
with
$$\varphi(x,\theta,G) = [G(x)-F_\theta(x)]\,\frac{\partial}{\partial\theta}F_\theta(x).$$
More examples are discussed in Stute (1986), Stoch. Proc. Appl. 22, 223–244. This paper also provides conditions under which
$$n^{1/2}[T(F_n) - T(F)]$$
has a normal limit. Generally speaking, differentiability of $\varphi$ in $\theta$ and $G$ is required, guaranteeing that a Taylor expansion applies. For other estimators, like the least absolute deviation estimator, differentiability is not available, so that other arguments will be needed.

5.2 ANOVA Decomposition of Statistical Functionals

Let $S = S(X_1,\ldots,X_n)$ be a square integrable functional of independent variables $X_1,\ldots,X_n$. Set, for $T\subset\{1,\ldots,n\}$,
$$S_T = E(S\mid X_i,\, i\in T).$$
In particular,
$$S_\emptyset = E(S), \qquad S_T = S \ \text{ for } T = \{1,\ldots,n\}.$$
Put
$$Y_T = \sum_{U\subset T}(-1)^{|T\setminus U|} S_U.$$

Lemma 5.2.1. (Efron–Stein) We have
$$S = \sum_{T\subset\{1,\ldots,n\}} Y_T,$$
where the summands satisfy
(i) $E(Y_T) = 0$ for each $T\ne\emptyset$;
(ii) $\mathrm{Cov}(Y_{T_1}, Y_{T_2}) = 0$ for $T_1\ne T_2$.
Proof. The representation of $S$ follows from
$$\sum_{T\subset\{1,\ldots,n\}} Y_T = \sum_{T\subset\{1,\ldots,n\}}\sum_{U\subset T}(-1)^{|T\setminus U|}S_U = \sum_{U\subset\{1,\ldots,n\}} S_U\sum_{k=0}^{n-|U|}(-1)^k\binom{n-|U|}{k}.$$
The sum over $k$ equals zero unless $|U| = n$, in which case it is one. For (i),
$$E(Y_T) = \sum_{U\subset T}(-1)^{|T\setminus U|}E(S) = E(S)\sum_{k=0}^{r}(-1)^k\binom{r}{k} = 0,$$
where $r = |T| \ge 1$.



As to (ii), since $Y_\emptyset$ is a constant, we may assume $T_1, T_2 \ne \emptyset$ w.l.o.g. Suppose also that $T_2\setminus T_1$ is nonempty (otherwise, consider $T_1\setminus T_2$). Then
$$\mathrm{Cov}(Y_{T_1}, Y_{T_2}) = E(Y_{T_1}Y_{T_2}) = E\big(Y_{T_1}\,E(Y_{T_2}\mid X_i,\, i\in T_1)\big).$$
We show that the conditional expectation is zero. By independence we get
$$E(Y_{T_2}\mid X_i,\, i\in T_1) = \sum_{U\subset T_2}(-1)^{|T_2\setminus U|}E(S\mid X_i,\, i\in U\cap T_1) = \sum_{W\subset T_1\cap T_2}(-1)^{|T_2|-|W|-|T_2\setminus T_1|}E(S\mid X_i,\, i\in W)\sum_{V\subset T_2\setminus T_1}(-1)^{|T_2\setminus T_1|-|V|}.$$
Because of $|T_2\setminus T_1| > 0$, the sum over the $V$'s vanishes. □



Remark 5.2.2. Set

Wk =

YT ,

1 k n.

T {1,...,k}

Then
Wk =

(1)|T \U | SU

T {1,...,k} U T

k|U |

SU

U {1,...,k}

(1)

r=0

)
k |U |
= E(S|Xi , i = 1, . . . , k),
r

i.e. (Wk )k is a martingale w.r.t. Fk = (Xi , 1 i k), k = 1, . . . , n.

If, in the Efron-Stein representation of S, we only consider the T s with

cardinality not exceeding one, we obtain the Hajek projection S:


S :=

|T |1

YT =

i=1

E(S|Xi ) (n 1)ES.
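The decomposition of Lemma 5.2.1 can be verified exactly on a tiny discrete example, where each $S_T = E(S\mid X_i,\, i\in T)$ is computable by enumeration. In this sketch (our own illustration; the statistic $S$ and the two-point distribution are arbitrary choices) the identity $S = \sum_T Y_T$ is checked pointwise:

```python
import itertools

n = 3
support = [0, 1]                                  # X_i i.i.d. uniform on {0, 1}
S = lambda x: x[0] * x[1] + x[2] ** 2 + x[0]      # an arbitrary statistic

def S_T(T, x):
    # S_T = E(S | X_i, i in T): average S over the coordinates outside T
    rest = [i for i in range(n) if i not in T]
    vals = []
    for sub in itertools.product(support, repeat=len(rest)):
        y = list(x)
        for i, v in zip(rest, sub):
            y[i] = v
        vals.append(S(y))
    return sum(vals) / len(vals)

def Y_T(T, x):
    # Y_T = sum over U subset T of (-1)^{|T \ U|} S_U
    return sum((-1) ** (len(T) - r) * S_T(U, x)
               for r in range(len(T) + 1)
               for U in itertools.combinations(T, r))

def decomposition_holds():
    subsets = [T for r in range(n + 1)
               for T in itertools.combinations(range(n), r)]
    return all(abs(sum(Y_T(T, list(x)) for T in subsets) - S(list(x))) < 1e-12
               for x in itertools.product(support, repeat=n))
```

The check is pointwise in the sample, mirroring the telescoping argument in the proof.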

5.3 The Jackknife Estimate of Variance

Recall the Efron–Stein decomposition of a square integrable statistic $S = S(X_1,\ldots,X_n)$:
$$S_T := E(S\mid X_i,\, i\in T) \qquad\text{and}\qquad Y_T := \sum_{U\subset T}(-1)^{|T\setminus U|}S_U.$$
Then
$$S = \sum_{T\subset\{1,\ldots,n\}} Y_T.$$
Moreover, the $Y_T$'s are uncorrelated. In particular,
$$\mathrm{Var}(S) = \sum_{\emptyset\ne T\subset\{1,\ldots,n\}}\mathrm{Var}(Y_T).$$
In this section we study estimation of $\mathrm{Var}(S)$ in the case when the $X$'s are i.i.d. and $S$ is symmetric. Then $\mathrm{Var}(Y_T)$ only depends on the cardinality of $T$. Write $\sigma_k^2 = \mathrm{Var}(Y_T)$ when $|T| = k$. Hence
$$s_n^2 \equiv \mathrm{Var}(S) = \sum_{k=1}^{n}\binom{n}{k}\sigma_k^2.$$

Note that $\sigma_k^2$ depends on $n$. We now introduce the jackknife estimate of $s_n^2$. Denote with $S_{(i)} = S(X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_n)$ the value of $S$ after deleting $X_i$ from the sample. Observe that $S_{(i)}$ is computed for sample size $n-1$. Let $S_{(\cdot)}$ be the sample mean of $S_{(1)},\ldots,S_{(n)}$. The jackknife estimate of $s_{n-1}^2$ is then given by
$$\hat s_{n-1}^2 := \sum_{i=1}^{n}[S_{(i)} - S_{(\cdot)}]^2.$$
It is the purpose of this section to compare $E\hat s_{n-1}^2$ with $s_{n-1}^2$. For this, compare $S_{(i)}$ with $S_{(j)}$, $i\ne j$. Check that
$$S_{(i)} - S_{(j)} = \sum{}'\,Y_T - \sum{}''\,Y_T,$$
where $\sum'$ denotes summation over all $T\subset\{1,\ldots,i-1,i+1,\ldots,n\}$ containing $j$, and $\sum''$ denotes summation over all $T\subset\{1,\ldots,j-1,j+1,\ldots,n\}$


containing $i$. Since thus all $T$ appearing on the right-hand side are different, the corresponding $Y$'s are uncorrelated. By symmetry, we thus get
$$E(S_{(i)} - S_{(j)})^2 = 2\sum_{k=1}^{n-1}\binom{n-2}{k-1}\sigma_k^2.$$
Now,
$$\hat s_{n-1}^2 = \frac{1}{2n}\sum_{i\ne j}(S_{(i)} - S_{(j)})^2,$$
whence
$$E\hat s_{n-1}^2 = (n-1)\sum_{k=1}^{n-1}\binom{n-2}{k-1}\sigma_k^2.$$
Recall
$$s_{n-1}^2 = \sum_{k=1}^{n-1}\binom{n-1}{k}\sigma_k^2.$$
We find that
$$E\hat s_{n-1}^2 - s_{n-1}^2 = \sum_{k=2}^{n-1}(k-1)\binom{n-1}{k}\sigma_k^2 \ge 0.$$

It follows that the jackknife estimate of variance is always biased upwards. It is unbiased only when $\sigma_k^2 = 0$ for $k\ge2$. In this case
$$S = E(S) + \sum_{i=1}^{n} Y_{\{i\}} = \sum_{i=1}^{n}E(S\mid X_i) - (n-1)E(S),$$
i.e., $S$ coincides with its Hájek projection. Such $S$'s are called linear. They are called quadratic iff $\sigma_k^2 = 0$ for $k\ge3$. Such an $S$ admits a representation
$$S = E(S) + \sum_{i=1}^{n} Y_{\{i\}} + \sum_{1\le i<j\le n} Y_{\{i,j\}},$$
a U-statistic of degree two. More generally, every $S$ for which $\sigma_k^2 = 0$ when $k > m$ is a U-statistic of degree $m$ (provided $\sigma_m^2 \ne 0$).
Conversely, if $S$ only depends on $X_1,\ldots,X_m$, then $Y_T = 0$ for each $T\subset\{1,\ldots,n\}$ not wholly contained in $\{1,\ldots,m\}$. Thus, if $S = \sum h(X_{i_1},\ldots,X_{i_m})$ is a (square-integrable) U-statistic of degree $m$, we find that $S = \sum Y_T$, where $\sum$ extends over all $T\subset\{1,\ldots,n\}$ with $|T|\le m$. Hence $\sigma_k^2 = 0$ whenever $k > m$. In summary, $\sigma_k^2 = 0$ for $k > m$ if and only if $S$ is a U-statistic of degree $m$.
The $i$-th pseudo-value is defined as
$$S_i^* = nS(X_1,\ldots,X_n) - (n-1)S(X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_n) = nS - (n-1)S_{(i)}.$$
Denote with $\bar S^*$ their arithmetic mean. Then
$$\sum_{i=1}^{n}(S_i^* - \bar S^*)^2 = (n-1)^2\sum_{i=1}^{n}(S_{(i)} - S_{(\cdot)})^2,$$
i.e., in terms of the pseudo-values the jackknife estimate of variance equals
$$(n-1)^{-2}\sum_{i=1}^{n}(S_i^* - \bar S^*)^2.$$
Example 5.3.1. For the sample mean, $S_i^* = X_i$ and $\bar S^* = \bar X_n$. Since $\bar X_n$ is linear, the jackknife estimate of variance $(n-1)^{-2}\sum_{i=1}^{n}(X_i - \bar X_n)^2$ is unbiased (for sample size $n-1$).
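In code the jackknife estimate is a short loop over leave-one-out values; the sketch below (our own helper functions) reproduces the identity from Example 5.3.1, where for the sample mean the estimate equals $(n-1)^{-2}\sum_i(X_i-\bar X_n)^2$:

```python
def jackknife_variance(data, stat):
    # hat s^2_{n-1} = sum_i [S_(i) - S_(.)]^2
    n = len(data)
    loo = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    s_dot = sum(loo) / n
    return sum((s - s_dot) ** 2 for s in loo)

def mean(xs):
    return sum(xs) / len(xs)

data = [2.0, 4.0, 6.0, 8.0]   # here sum (X_i - Xbar)^2 = 20, so the estimate is 20/9
```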

Example 5.3.2. If $S$ is the median and $n = 2m$, then
$$\hat s_{n-1}^2 = \frac n4\,[X_{m+1:n} - X_{m:n}]^2.$$

5.4 The Jackknife Estimate of Bias
Let $S_n \equiv S = S(X_1,\ldots,X_n)$ be an integrable function of i.i.d. random variables $X_1,\ldots,X_n \sim F$. Assume that $S$ is designed to estimate a population parameter $\theta(F)$. Consider the bias
$$\mathrm{bias} := E_F(S) - \theta(F).$$
Quenouille's estimate of bias is
$$\widehat{\mathrm{BIAS}} := (n-1)[S_{(\cdot)} - S] = \frac{n-1}{n}\sum_{i=1}^{n}S_{(i)} - (n-1)S.$$

This leads to the bias-corrected jackknife estimate of $\theta(F)$:
$$S^* = S - \widehat{\mathrm{BIAS}} = nS - (n-1)S_{(\cdot)}.$$
Lemma 5.4.1. Assume that bias has an expansion
$$\mathrm{bias} = \frac{a_1}{n} + \frac{a_2}{n^2} + \frac{a_3}{n^3} + O(n^{-4}),$$
the parameters $a_1, a_2$ and $a_3$ usually depending on $F$. Then $S^*$ has the bias expansion
$$E(S^*) - \theta(F) = -\frac{a_2}{n(n-1)} + O(n^{-3}).$$
Proof. We have
$$E(S^*) - \theta(F) = nE(S_n) - (n-1)E(S_{n-1}) - \theta(F) = n[E(S_n) - \theta(F)] - (n-1)[E(S_{n-1}) - \theta(F)]$$
$$= \Big(a_1 + \frac{a_2}{n} + \frac{a_3}{n^2}\Big) - \Big(a_1 + \frac{a_2}{n-1} + \frac{a_3}{(n-1)^2}\Big) + O(n^{-3}) = a_2\Big(\frac1n - \frac{1}{n-1}\Big) + a_3\Big(\frac{1}{n^2} - \frac{1}{(n-1)^2}\Big) + O(n^{-3}),$$
whence the assertion. □

Example 5.4.2. Take $\theta(F) = \sigma^2(F)$, and let $S = n^{-1}\sum_{i=1}^{n}(X_i - \bar X_n)^2$. A simple calculation shows
$$\widehat{\mathrm{BIAS}} = -\frac{1}{n(n-1)}\sum_{i=1}^{n}(X_i - \bar X_n)^2,$$
yielding
$$S^* = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar X_n)^2,$$
the usual unbiased estimate of $\sigma^2(F)$.
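Example 5.4.2 can be replayed numerically: the bias-corrected jackknife value of the ML variance estimate coincides exactly with the unbiased sample variance. A sketch (our own code):

```python
def jackknife_bias_corrected(data, stat):
    # S* = n S - (n-1) S_(.)
    n = len(data)
    S = stat(data)
    S_dot = sum(stat(data[:i] + data[i + 1:]) for i in range(n)) / n
    return n * S - (n - 1) * S_dot

def var_mle(xs):
    # S = (1/n) sum (X_i - Xbar)^2, biased downward by the factor (n-1)/n
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def var_unbiased(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
```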


We now compute the bias corrected value of the jackknife estimate of variance
n

s2n1 =
[S(i) S() ]2 W.
i=1

Clearly,
W =

i=1

2
S(i)

i,j=1

S(i) S(j) .

142

CHAPTER 5. STATISTICAL FUNCTIONALS

Denote with Sir the value of S taken at the sample deleted by Xi and Xr .
Then
n
n

1 2
1
W() =
Sir
Sir Sjr
n
n(n 1)
r=1 i=r

r=1 i,j=r

so that by symmetry
2
EW() = (n 2)ES12
(n 2)ES13 S23 .

On the other hand,


EW = (n 1)ES12 (n 1)ES1 S2 .
Since

W = W BIAS = nW (n 1)W() ,

we nd that
EW = n(n 1)[ES12 ES1 S2 ]
2
(n 1)(n 2)[ES13
ES13 S23 ].

But

1
ES1 S2 = ES12 E(S1 S2 )2
2

and

1
2
ES13 S23 = ES13
E(S13 S23 )2 .
2

Conclude that
1
1
n(n 1)E(S1 S2 )2 (n 1)(n 2)E(S13 S23 )2
2
2
n1
n2
(n 2)
(n 3)
2
= n(n 1)

(n 1)(n 2)
2
.
k 1 k,n1
k 1 k,n2

EW =

i=1

k=1

2 indicates the underlying sample size. Note that the rst


The index i in ki
summand equals nE
s2n1 , while the second is (n 1)E
s2n2 . Both expectations exceed the corresponding true variances. By subtracting them (when
appropriately normalized) it might well be that W leads to a truly bias
corrected value of the jackknife estimate of variance.

To make this argument work one also needs some smoothness of the variance
as a function of n.

Remark 5.4.3. In terms of the pseudo-values $S_i^* = nS - (n-1)S_{(i)}$, we have
$$\widehat{\mathrm{BIAS}} = S - n^{-1}\sum_{i=1}^{n} S_i^*.$$
To motivate the choice of $\widehat{\mathrm{BIAS}}$, assume that $S_n = S(F_n)$ is a smooth statistical functional of the empirical d.f. $F_n$, i.e., $\theta(F) = \lim_{n\to\infty} S(F_n)$ in $L_1$. For sample size $n-1$,
$$\mathrm{bias} = ES(F_{n-1}) - \theta(F) = -\sum_{k=0}^{\infty}\big[ES(F_{n+k}) - ES(F_{n+k-1})\big].$$
Replace the series by the finite sum from $k = 0$ to $k = m$. In $S(F_{n+k}) - S(F_{n+k-1})$ we compare values of $S$ for two consecutive sample sizes. For $k = 0$ we may generate such a pair by deleting a point mass from $F_n$. In doing this for each $1\le i\le n$, we arrive at
$$ES(F_n) - ES(F_{n-1}) \approx \frac1n\sum_{i=1}^{n}[S_n - S_{(i)}] = S_n - S_{(\cdot)}.$$
Taking the last term also for estimating $ES(F_{n+k}) - ES(F_{n+k-1})$ for $k > 0$, and letting $m = n-2$, we finally arrive at $\widehat{\mathrm{BIAS}}$.


Chapter 6

Stochastic Inequalities
6.1 The D-K-W Bound

This section deals with the most famous bound on the deviation between $F_n$ and $F$, the Dvoretzky–Kiefer–Wolfowitz (1956) exponential bound for the upper tails of $D_n^+$.

Theorem 6.1.1. There exists a universal constant $c > 0$ so that for all $x \ge 0$ and $n \ge 1$
$$P(n^{1/2}D_n^+ > x) \le c\exp(-2x^2). \qquad(6.1.1)$$
Proof. In view of (3.1.17), it suffices to study the case $F = F_U$. The original proof of Dvoretzky–Kiefer–Wolfowitz, which is presented here, rests on Theorem 3.4.5. Since the left-hand side of (6.1.1) equals zero for $x \ge \sqrt n$, we only need to consider $0 \le x < \sqrt n$. Apply Theorem 3.4.5 to get
$$P(n^{1/2}D_n^+ > x) = (1 - x/\sqrt n)^n + x\sqrt n\sum_{j=\langle x\sqrt n\rangle+1}^{n-1} Q_n(j,x), \qquad(6.1.2)$$
with
$$Q_n(j,x) = \binom nj\,(j - x\sqrt n)^j\,(n - j + x\sqrt n)^{n-j-1}\,n^{-n}.$$
Taking logarithms and differentiating we see that on $0 \le x \le \sqrt n$ the function
$$x \mapsto (1 - x/\sqrt n)^n\exp(2x^2)$$
attains its maximum at zero. Conclude that
$$(1 - x/\sqrt n)^n \le \exp(-2x^2).$$
Moreover, for $x\sqrt n < j < n$, we get
$$\frac{d}{dx}\ln Q_n(j,x) = -\frac{xn^2}{(j - x\sqrt n)(n - j + x\sqrt n)} - \frac{\sqrt n}{n - j + x\sqrt n} \le -4x - \frac{16x}{n^2}\Big(\frac n2 - j + x\sqrt n\Big)^2,$$
since $(j - x\sqrt n)(n - j + x\sqrt n) = \frac{n^2}{4} - \big(\frac n2 - j + x\sqrt n\big)^2$. The last term, however, has the primitive
$$-2x^2 - \frac{8x^2}{n^2}\Big(\frac n2 - j + \frac{2x\sqrt n}{3}\Big)^2 - \frac{4x^4}{9n}.$$
Integration therefore leads to
$$Q_n(j,x) \le Q_n(j,0)\exp\Big[-2x^2 - \frac{8x^2}{n^2}\Big(\frac n2 - j + \frac{2x\sqrt n}{3}\Big)^2 - \frac{4x^4}{9n}\Big] \qquad(6.1.3)$$
as well as, for $x \ge 1$,
$$Q_n(j,x) \le c_1\,Q_n(j,1)\exp\Big[-2x^2 - \frac{8x^2}{n^2}\Big(\frac n2 - j + \frac{2x\sqrt n}{3}\Big)^2 - \frac{4x^4}{9n}\Big] \qquad(6.1.4)$$

with a universal constant $c_1 > 0$. To bound $Q_n(j,0)$ and $Q_n(j,1)$, we first consider the case $|j - \frac n2| \le \frac n4$. From Stirling's formula
$$k! \sim \sqrt{2\pi k}\,\Big(\frac ke\Big)^k$$
we may infer that
$$Q_n(j,0) \le c_2\,n^{-3/2}$$
for some generic constant $c_2$. Denote with $\sum'$ the sum over those $j$ in (6.1.2) satisfying $|j - \frac n2| \le \frac n4$, and let $\sum''$ be the sum over the remaining ones. From (6.1.3) we obtain
$$\sum{}' Q_n(j,x) \le c_2 n^{-3/2}\exp(-2x^2)\sum_j\exp\Big[-8x^2\Big(\frac12 - \frac jn + \frac{2x}{3\sqrt n}\Big)^2\Big] \le 2c_2 n^{-3/2}\exp(-2x^2)\sum_{j=0}^{\infty}\exp(-8x^2 j^2/n^2)$$
$$\le 2c_2 n^{-3/2}\exp(-2x^2)\Big[1 + n\int_0^{\infty}\exp(-8x^2t^2)\,dt\Big] \le c_3\,x^{-1}n^{-1/2}\exp(-2x^2),$$
and therefore
$$x\sqrt n\,\sum{}' Q_n(j,x) \le c_3\exp(-2x^2).$$
We now turn to the second sum. We assume w.l.o.g. that $x > 1$. If $2x\sqrt n/3 \le n/8$, the second term in the exponent of (6.1.4) does not exceed $-x^2/8$. For $x > 3\sqrt n/16$, the last term is less than or equal to $-(4/9)(3/16)^2x^2$, so that in either case
$$Q_n(j,x) \le c_1 Q_n(j,1)\exp[-2x^2 - c_4 x^2] \le c_5\,x^{-1}Q_n(j,1)\exp(-2x^2).$$
It follows, since $\sqrt n\sum_j Q_n(j,1) \le 1$, that
$$x\sqrt n\,\sum{}'' Q_n(j,x) \le c_5\exp(-2x^2)\,\sqrt n\sum_j Q_n(j,1) \le c_5\exp(-2x^2).$$
The proof of the Theorem is complete. □

Massart (1990) was able to refine the original arguments of D-K-W and proved that the optimal $c$ equals 1.

Theorem 6.1.1 immediately yields some rough rates for the convergence of $F_n$ to $F$. First, for a given positive $\varepsilon > 0$, the D-K-W bound yields
$$P(D_n > \varepsilon) \le 2c\exp(-2n\varepsilon^2).$$
Since the series of the terms on the right-hand side converges, the Borel–Cantelli Lemma yields the Glivenko–Cantelli Theorem, namely that $D_n \to 0$ with probability one. A slight modification of this argument yields the following corollary.

Corollary 6.1.2. With probability one,
$$\limsup_{n\to\infty}\Big[\frac{n}{\ln n}\Big]^{1/2} D_n < \infty.$$
Proof. Put $x = (c'\ln n)^{1/2}$ with $c' > 1/2$ and argue as before. □

The assertion of Corollary 6.1.2 is often stated in the form
$$D_n = O\Big(\Big(\frac{\ln n}{n}\Big)^{1/2}\Big) \qquad\text{with probability one.}$$
For many applications this bound is sufficient. The precise rate of convergence is of the order $\big(\frac{\ln\ln n}{n}\big)^{1/2}$, which indicates that the almost sure behavior of $F_n$, as $n\to\infty$, is governed by a Law of the Iterated Logarithm.
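For samples from the uniform d.f., $D_n$ has a closed form over the order statistics, which makes the D-K-W bound easy to experiment with. A sketch (our own helper functions; by Massart's result the constant 2 in the two-sided bound below is valid):

```python
import math

def D_n(sample):
    # sup_t |F_n(t) - t| for a uniform sample; the supremum is attained at a jump
    u = sorted(sample)
    n = len(u)
    return max(max(k / n - u[k - 1], u[k - 1] - (k - 1) / n)
               for k in range(1, n + 1))

def dkw_bound(n, eps):
    # P(D_n > eps) <= 2 exp(-2 n eps^2)
    return 2.0 * math.exp(-2.0 * n * eps ** 2)
```

For the sample $\{0.1, 0.5, 0.9\}$ one obtains $D_3 = 7/30$ by hand.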

6.2 Binomial Tail Bounds

As before, let $X_1,\ldots,X_n$ be i.i.d. from $F$ so that $\eta := nF_n(t) \sim \mathrm{Bin}(n,p)$ with $p = F(t)$. If $0 < p < 1$, the CLT yields
$$P\Big(n^{1/2}\,\frac{F_n(t) - F(t)}{\sqrt{F(t)(1-F(t))}} \ge x\Big) \to \frac{1}{\sqrt{2\pi}}\int_x^{\infty}e^{-u^2/2}\,du = P(\xi\ge x), \qquad(6.2.1)$$
where $\xi\sim N(0,1)$. Mills' ratio states that for all $x > 0$
$$\frac{1}{\sqrt{2\pi}}\Big(\frac1x - \frac{1}{x^3}\Big)\exp(-x^2/2) \le P(\xi\ge x) \le \frac{1}{\sqrt{2\pi}}\,\frac1x\exp(-x^2/2), \qquad(6.2.2)$$
i.e., neglecting the factors $1/x$ and $1/x^3$, the upper tails of a standard normal distribution decrease exponentially fast. In this section we show that a similar bound also holds, for finite $n$, for a standardized binomial variable. First, the Chebyshev inequality leads to
$$P\Big(n^{1/2}\,\frac{F_n(t) - F(t)}{\sqrt{F(t)(1-F(t))}} \ge x\Big) \le x^{-2},$$
which is far worse than what we may expect from (6.2.1) and (6.2.2). A much sharper bound is obtained if we incorporate the moment generating function of a binomial random variable.

Lemma 6.2.1. Let $\eta\sim\mathrm{Bin}(n,p)$, $0 < p < 1$. Then the moment generating function of $\eta$ equals
$$M(z) = E[\exp(z\eta)] = [(1-p) + p\exp z]^n.$$
Proof. Use the fact that $\eta$ equals, in distribution, the sum of $n$ independent Bernoulli random variables with parameter $p$, whose moment generating function equals $(1-p) + p\exp z$. □

To bound $P(\eta - np \ge \varepsilon)$ for $\varepsilon > 0$, note that the Markov inequality yields for each $z \ge 0$:
$$P(\eta - np \ge \varepsilon) \le E[\exp(z(\eta - np - \varepsilon))] = M(z)\exp[-z(np + \varepsilon)].$$
We thus get the following bound.


Lemma 6.2.2. For each $\varepsilon > 0$ we have
$$P(\eta - np \ge \varepsilon) \le \inf_{z\ge0}\big\{M(z)\exp[-z(np+\varepsilon)]\big\}.$$
To determine the infimum, let $f$ be defined by
$$\exp[-f(u)] = \inf_{z\ge0}\big\{M(z)\exp(-zu)\big\},$$
so that
$$P(\eta - np \ge \varepsilon) \le \exp[-f(np+\varepsilon)].$$
The function $f$ is called the Chernoff-function of $M$. Clearly, $f$ is nonnegative and nondecreasing.

Lemma 6.2.3. We have
$$f(u) \equiv f(u,n,p) = \begin{cases} 0 & \text{for } u \le 0,\\[2pt] u\ln\dfrac{u}{np} + (n-u)\ln\dfrac{n-u}{n(1-p)} & \text{for } 0 < u < n,\\[2pt] n\ln\dfrac1p & \text{for } u = n,\\[2pt] \infty & \text{for } n < u. \end{cases}$$
Proof. The assertion follows by formal differentiation of $M(z)\exp(-zu)$ in $(0,n)$ and a separate check outside of this interval. □

Before we further analyze the Chernoff-function we note that the lower tails $P(\eta - np \le -\varepsilon)$ equal $P(\tilde\eta - n(1-p) \ge \varepsilon)$, where $\tilde\eta = n - \eta \sim \mathrm{Bin}(n, 1-p)$, whence
$$P(\eta - np \le -\varepsilon) \le \exp\big[-f(n(1-p)+\varepsilon,\,n,\,1-p)\big].$$
Coming back to the upper tails with $u = np + \varepsilon$, we shall focus on $0 < u < n$, i.e., $\varepsilon < n(1-p)$. We write
$$f(u,n,p) = np\Big[1 + \frac{\varepsilon}{np}\Big]\ln\Big[1 + \frac{\varepsilon}{np}\Big] + n(1-p)\Big[1 - \frac{\varepsilon}{n(1-p)}\Big]\ln\Big[1 - \frac{\varepsilon}{n(1-p)}\Big].$$


For $|x| < 1$ we may expand
$$(1+x)\ln(1+x) = x + \frac{x^2}{1\cdot2} - \frac{x^3}{2\cdot3} + \frac{x^4}{3\cdot4} - \ldots$$
and
$$(1-x)\ln(1-x) = -x + \frac{x^2}{1\cdot2} + \frac{x^3}{2\cdot3} + \frac{x^4}{3\cdot4} + \ldots$$
With $x_1 = \varepsilon/np$ and $x_2 = \varepsilon/n(1-p)$ we thus get
$$f(u,n,p) = \frac{\varepsilon^2}{2np(1-p)} + np\Big[-\frac{x_1^3}{2\cdot3} + \frac{x_1^4}{3\cdot4} - \ldots\Big] + n(1-p)\Big[\frac{x_2^3}{2\cdot3} + \frac{x_2^4}{3\cdot4} + \ldots\Big].$$

When $0 < x_1, x_2 < 1$ it therefore follows that
$$f(u,n,p) \ge \frac{\varepsilon^2}{2np(1-p)} + np\Big[-\frac{x_1^3}{2\cdot3} + \frac{x_1^4}{3\cdot4} - \ldots\Big] \ge \frac{\varepsilon^2}{2np(1-p)} + \frac{np}{1-p}\Big[-\frac{x_1^3}{2\cdot3} + \frac{x_1^4}{3\cdot4} - \ldots\Big] = \frac{\varepsilon^2}{2np(1-p)}\,\psi\Big(\frac{\varepsilon}{np}\Big).$$
The function $\psi$ is defined, for $|x| < 1$, as
$$\psi(x) = 1 - \frac x3 + \frac{x^2}{6} - \ldots + \frac{(-1)^k\,2x^k}{(k+2)(k+1)} + \ldots$$
Check that
$$\psi(x) = 2h(1+x)/x^2 \qquad(6.2.3)$$
with
$$h(x) = x(\ln x - 1) + 1. \qquad(6.2.4)$$
In terms of (6.2.3) and (6.2.4), $\psi$ is defined for all positive $x$. Putting
$$g_1(\varepsilon) = f(np+\varepsilon,n,p) \qquad\text{and}\qquad g_2(\varepsilon) = \frac{\varepsilon^2}{2np(1-p)}\,\psi\Big(\frac{\varepsilon}{np}\Big),$$
we easily find that
$$g_1(0) = g_2(0) \qquad\text{and}\qquad \partial g_1/\partial\varepsilon \ge \partial g_2/\partial\varepsilon.$$
The second inequality is equivalent to
$$(1-p)\ln\Big[1 - \frac{\varepsilon}{n(1-p)}\Big] + p\ln\Big[1 + \frac{\varepsilon}{np}\Big] \le 0,$$
which easily follows from the concavity of $x\mapsto\ln(1+x)$. We thus obtain $g_1(\varepsilon) \ge g_2(\varepsilon)$.
Lemma 6.2.4. Let $\eta\sim\mathrm{Bin}(n,p)$. Then we have for all $\varepsilon > 0$:
(i)
$$P(\eta - np \ge \varepsilon) \le \exp\Big[-\frac{\varepsilon^2}{2np(1-p)}\,\psi\Big(\frac{\varepsilon}{np}\Big)\Big] = \exp\Big[-\frac{np}{1-p}\,h\Big(1 + \frac{\varepsilon}{np}\Big)\Big];$$
(ii)
$$P(\eta - np \le -\varepsilon) \le \exp\Big[-\frac{\varepsilon^2}{2np(1-p)}\,\psi\Big(\frac{\varepsilon}{n(1-p)}\Big)\Big].$$
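Part (i) of Lemma 6.2.4 can be checked against the exact binomial tail; the sketch below (our own code) uses the second form of the bound, with $h(x) = x(\ln x - 1) + 1$:

```python
import math
from math import comb

def binom_upper_tail(n, p, k):
    # exact P(eta >= k) for eta ~ Bin(n, p)
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))

def chernoff_bound(n, p, eps):
    # Lemma 6.2.4(i): P(eta - np >= eps) <= exp(-(np/(1-p)) h(1 + eps/(np)))
    h = lambda x: x * (math.log(x) - 1.0) + 1.0
    return math.exp(-(n * p / (1.0 - p)) * h(1.0 + eps / (n * p)))
```

With $n = 50$, $p = 0.3$ (so $np = 15$), the bound dominates the exact tail for every integer $\varepsilon < n(1-p)$.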

Improved bounds are obtained if we use sharper lower bounds for $f(u,n,p)$. For example, when $1-p \le p$ and therefore $p \ge 1/2$, we get for $u = np + \varepsilon$
$$f(u,n,p) \ge \frac{\varepsilon^2}{2np(1-p)}. \qquad(6.2.5)$$
For an arbitrary $0 < p < 1$, the somewhat weaker bound
$$f(u,n,p) \ge \frac{\varepsilon^2}{2np(1-p)} - \frac{np}{6}\Big(\frac{\varepsilon}{np}\Big)^3 = \frac{\varepsilon^2}{2np(1-p)}\Big[1 - \frac{\varepsilon(1-p)}{3np}\Big] \qquad(6.2.6)$$
will suffice for most applications. In many situations
$$\frac{\varepsilon(1-p)}{3np} \le \delta,$$
a given threshold. In this case
$$f(u,n,p) \ge \frac{\varepsilon^2(1-\delta)}{2np(1-p)}. \qquad(6.2.7)$$
For the standardized $F_n(t)$, we obtain for all $x \ge 0$ and a given threshold $0 < \delta < 1$
$$P\Big(\frac{n^{1/2}[F_n(t) - F(t)]}{\sqrt{F(t)(1-F(t))}} \ge x\Big) \le \begin{cases} \exp[-x^2/2] & \text{if } p = F(t) \ge 1/2,\\ \exp[-x^2(1-\delta)/2] & \text{for a general } p = F(t), \end{cases}$$
provided that $x \le \dfrac{3\delta\sqrt{np}}{(1-p)^{3/2}}$.
6.3 Oscillations of Empirical Processes

Consider the uniform empirical process
$$\alpha_n(t) = n^{1/2}[F_n(t) - t], \qquad 0\le t\le1,$$
based on a sample $U_1,\ldots,U_n$. Since each $U_i$ is a discontinuity of $\alpha_n$, the number of jumps increases with $n$. On the other hand the jump size $n^{-1/2}$ tends to zero, so that a priori it is not clear how the paths evolve as $n$ gets large. A measure for the roughness of a sample path is given through the oscillation modulus
$$\omega_n(a) = \sup_{|t-s|\le a}|\alpha_n(t) - \alpha_n(s)|.$$
Typically, $0 < a < 1$ is a small number so that $\omega_n$ measures local oscillations of $\alpha_n$. To study $\omega_n$ in greater detail we first keep $s$ fixed, say $s = 0$, and consider the one-sided (local) deviation $\sup_{0\le t\le a}\alpha_n(t)$. Since the supremum is taken over an uncountable set, it is wise to first bound $\alpha_n$ on a finite grid $0 = t_0 < t_1 < \ldots < t_m = a$ and then let the mesh of the partition tend to zero. For level $x > 0$, put
$$A_i = \{\alpha_n(t_i) > x\}.$$
We then have
$$P\Big(\sup_{0\le i\le m}\alpha_n(t_i) > x\Big) = P\Big(\bigcup_{i=0}^{m} A_i\Big).$$
Next we introduce an event
$$A_m^* = \{\alpha_n(t_m) > x^*\}.$$
The threshold $x^*$ will be a little smaller than $x$ so that $A_m \subset A_m^*$. We have
$$P\Big(\bigcup_{i=0}^{m} A_i\Big) = P\Big(\bigcup_{i=0}^{m} A_i\cap A_m^*\Big) + P\Big(\bigcup_{i=0}^{m} A_i\cap\bar A_m^*\Big) \le P(A_m^*) + \sum_{i=0}^{m-1} P\big(\bar A_m^*\cap A_i\cap\bar A_0\cap\ldots\cap\bar A_{i-1}\big).$$

We may further decompose each set Ai A0 . . . Ai1 into nitely many


sets on which the discrete variables
n (tj ), j i, attain specied values

$\bar\alpha_n(t_j)=x_j$. Write
$$\mathbb P\bigl((A_m^{*})^{c},\ \bar\alpha_n(t_j)=x_j\text{ for }0\le j\le i\bigr)
=\mathbb P\bigl((A_m^{*})^{c}\mid\bar\alpha_n(t_j)=x_j\text{ for }0\le j\le i\bigr)\,\mathbb P\bigl(\bar\alpha_n(t_j)=x_j\text{ for }0\le j\le i\bigr)$$
$$=\mathbb P\bigl((A_m^{*})^{c}\mid\bar\alpha_n(t_i)=x_i\bigr)\,\mathbb P\bigl(\bar\alpha_n(t_j)=x_j\text{ for }0\le j\le i\bigr),$$
where the last equality uses the Markov property of $\bar\alpha_n$. Suppose $x^{*}$ has been chosen so that each of the conditional probabilities is less than or equal to a constant $c$, $c<1$. This would yield
$$\mathbb P\Bigl(\bigcup_{i=0}^{m}A_i\Bigr)\le\mathbb P(A_m^{*})+c\sum_{i=0}^{m-1}\mathbb P(A_i\cap A_0^{c}\cap\dots\cap A_{i-1}^{c})\le\mathbb P(A_m^{*})+c\,\mathbb P\Bigl(\bigcup_{i=0}^{m}A_i\Bigr).$$
From this we immediately obtain
$$\mathbb P\Bigl(\sup_{0\le i\le m}\bar\alpha_n(t_i)>x\Bigr)\le\frac{1}{1-c}\,\mathbb P(\bar\alpha_n(a)>x^{*}).$$

Since the right-hand side does not depend on the chosen grid, the bound also holds when we let the mesh tend to zero. Since $\bar\alpha_n$ is continuous from the right and has left-hand limits, we finally arrive at
$$\mathbb P\Bigl(\sup_{0\le t\le a}\bar\alpha_n(t)>x\Bigr)\le\frac{1}{1-c}\,\mathbb P(\bar\alpha_n(a)>x^{*}).\tag{6.3.1}$$
Arguments similar to those which led to the maximal inequality (6.3.1) are well known and have been successfully applied also outside empirical process theory.
We now investigate conditions on $x^{*}\le x$ guaranteeing
$$\mathbb P\bigl((A_m^{*})^{c}\mid\bar\alpha_n(t_i)=x_i\bigr)\le c$$
for each $x_i>x$. Now,
$$\mathbb P\bigl(\bar\alpha_n(t_m)\le x^{*}\mid\bar\alpha_n(t_i)=x_i\bigr)=\mathbb P\Bigl((n-N)F_{n-N}\Bigl(\frac{t_m-t_i}{1-t_i}\Bigr)\le\sqrt n\,x^{*}+t_mn-N\Bigr),$$

with $N=\sqrt n\,x_i+nt_i$. The last probability equals
$$\mathbb P\Bigl\{(n-N)F_{n-N}\Bigl(\frac{t_m-t_i}{1-t_i}\Bigr)-(n-N)\frac{t_m-t_i}{1-t_i}\le\sqrt n\Bigl[x^{*}-x_i\frac{1-t_m}{1-t_i}\Bigr]\Bigr\}.$$
Since $x_i>x$,
$$\sqrt n\Bigl[x^{*}-x_i\frac{1-t_m}{1-t_i}\Bigr]\le\sqrt n\Bigl[x^{*}-x\frac{1-t_m}{1-t_i}\Bigr]\le\sqrt n\,[x^{*}-x(1-a)].$$
Write $x^{*}=x(1-\delta)$ with $a<\delta/2$. Then
$$\sqrt n\,[x^{*}-x(1-a)]\le-\sqrt n\,x\delta/2.$$


Under these conditions the Chebyshev inequality yields
$$\mathbb P\bigl(\bar\alpha_n(t_m)\le x^{*}\mid\bar\alpha_n(t_i)=x_i\bigr)\le\frac{(n-N)\frac{t_m-t_i}{1-t_i}}{nx^{2}\delta^{2}/4}\le\frac{4a}{x^{2}\delta^{2}}\le1/2,$$
where the last inequality holds provided that $8a\le x^{2}\delta^{2}$.

Lemma 6.3.1. Assume that
(i) $a<\delta/2$,
(ii) $8a\le x^{2}\delta^{2}$.
Then
$$\mathbb P\Bigl(\sup_{0\le t\le a}\bar\alpha_n(t)>x\Bigr)\le2\,\mathbb P\bigl(\bar\alpha_n(a)>x(1-\delta)\bigr).$$
A similar bound holds for the lower tails, so that summarizing we get under (i)-(ii)
$$\mathbb P\Bigl(\sup_{0\le t\le a}|\bar\alpha_n(t)|>x\Bigr)\le2\,\mathbb P\bigl(|\bar\alpha_n(a)|>x(1-\delta)\bigr).\tag{6.3.2}$$

Next we derive an upper bound for the oscillation modulus $\bar\omega_n$.

Lemma 6.3.2. For $0<a,\delta<1$ and $s>0$ such that
(i) $a<\delta/2$,
(ii) $8\le[s\delta/(1+\delta)]^{2}$,
(iii) $s\le x_\delta\sqrt{na}/4$ for some positive $x_\delta$ depending only on $\delta$,
we have
$$\mathbb P(\bar\omega_n(a)>s\sqrt a)\le Ca^{-1}\exp[-s^{2}(1-\delta)^{5}/2],$$
with $C=64\sqrt2$.

Proof. Put $x=s\sqrt a$ and let $R$ be the smallest integer satisfying $1/R\le a/2$. We obviously have
$$\bar\omega_n(a)\le\max_{0\le i\le R-1}\sup_{0\le t\le a}\Bigl|\bar\alpha_n\Bigl(\frac iR+t\Bigr)-\bar\alpha_n\Bigl(\frac iR\Bigr)\Bigr|+2\max_{0\le i\le R-1}\sup_{0\le\sigma\le1/R}\Bigl|\bar\alpha_n\Bigl(\frac iR+\sigma\Bigr)-\bar\alpha_n\Bigl(\frac iR\Bigr)\Bigr|.$$
Since $\bar\alpha_n$ has stationary increments, we obtain
$$\mathbb P(\bar\omega_n(a)>x)\le\mathbb P\Bigl(\max_{0\le i\le R-1}\sup_{0\le t\le a}\Bigl|\bar\alpha_n\Bigl(\frac iR+t\Bigr)-\bar\alpha_n\Bigl(\frac iR\Bigr)\Bigr|>\frac{x}{1+\delta}\Bigr)$$
$$\qquad+\mathbb P\Bigl(2\max_{0\le i\le R-1}\sup_{0\le\sigma\le1/R}\Bigl|\bar\alpha_n\Bigl(\frac iR+\sigma\Bigr)-\bar\alpha_n\Bigl(\frac iR\Bigr)\Bigr|>\frac{x\delta}{1+\delta}\Bigr)$$
$$\le R\,\mathbb P\Bigl(\sup_{0\le t\le a}|\bar\alpha_n(t)|>\frac{x}{1+\delta}\Bigr)+R\,\mathbb P\Bigl(\sup_{0\le t\le1/R}|\bar\alpha_n(t)|>\frac{x\delta}{2(1+\delta)}\Bigr).$$
It follows from (i)-(ii) and the choice of $R$ that (6.3.2) is applicable to each of the above probabilities. Hence
$$\mathbb P(\bar\omega_n(a)>x)\le2R\Bigl[\mathbb P\Bigl(|\bar\alpha_n(a)|>\frac{x(1-\delta)}{1+\delta}\Bigr)+\mathbb P\Bigl(\Bigl|\bar\alpha_n\Bigl(\frac1R\Bigr)\Bigr|>\frac{x\delta(1-\delta)}{2(1+\delta)}\Bigr)\Bigr].$$
Under (iii), we may apply (6.2.7) and thus get the final result. □

Theorem 6.3.3. For each $\varepsilon>0$ and every $\eta>0$ there exists a small $a>0$ such that for all $n\ge n_0(\varepsilon,\eta)$
$$\mathbb P(\bar\omega_n(a)\ge\varepsilon)\le\eta.\tag{6.3.3}$$
Proof. Put $\delta=1/2$. We shall apply Lemma 6.3.2 with $s=a^{-1/4}$. Conditions (i)-(iii) from Lemma 6.3.2 are then all satisfied for $a>0$ sufficiently small and $n$ sufficiently large. Moreover, we may choose $a>0$ so small that also $a^{1/4}\le\varepsilon$ and $Ca^{-1}\exp[-a^{-1/2}/64]\le\eta$ hold. This completes the proof. □

We also mention that Theorem 6.3.3 may be readily applied to bound the oscillation modulus of empirical processes from a general $F$ satisfying some weak smoothness assumptions. If, for example, $F$ is Lipschitz:
$$|F(x)-F(y)|\le M|x-y|,\tag{6.3.4}$$
then, in view of (6.3.4),
$$\omega_n(a)\le\bar\omega_n(Ma),\tag{6.3.5}$$
where
$$\omega_n(a)=\sup_{|t-s|\le a}|\alpha_n(t)-\alpha_n(s)|\tag{6.3.6}$$
is the oscillation modulus pertaining to a general empirical process. Condition (6.3.4) is satisfied if $F$ has a bounded Lebesgue density.
We may also let $a=a_n$ tend to zero as $n\to\infty$. A rough upper bound for $\bar\omega_n(a_n)$ may be obtained if we set $s=s_n=K\sqrt{\ln a_n^{-1}}$, for a large $K>0$.

Theorem 6.3.4. Assume
(i) $\ln a_n^{-1}=o(na_n)$
and
(ii) $\sum_{n\ge1}a_n^{r}<\infty$ for some $r>0$.
Then, with probability one,
$$\bar\omega_n(a_n)=O\Bigl(\sqrt{a_n\ln a_n^{-1}}\Bigr).$$
Improvements of these results may be found in Stute (1982). Extensions to
a general F satisfying (6.3.4) utilize (6.3.5).
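The behavior described by Theorems 6.3.3 and 6.3.4 can be illustrated numerically. The sketch below (a grid approximation with helper names of our own choosing) evaluates $\bar\alpha_n$ on a fine grid and computes $\bar\omega_n(a)$ as the largest range of the process over windows of length $a$:

```python
import numpy as np

def uniform_empirical_process(u, grid):
    """bar-alpha_n(t) = sqrt(n) * (F_n(t) - t), evaluated on the given grid."""
    n = len(u)
    Fn = np.searchsorted(np.sort(u), grid, side="right") / n
    return np.sqrt(n) * (Fn - grid)

def oscillation_modulus(alpha, grid, a):
    """Grid approximation of sup_{|t-s|<=a} |alpha(t) - alpha(s)|."""
    step = grid[1] - grid[0]
    w = int(round(a / step))          # number of grid steps within distance a
    best = 0.0
    for i in range(len(grid)):
        seg = alpha[i:i + w + 1]      # window [t_i, t_i + a]
        best = max(best, seg.max() - seg.min())
    return best

rng = np.random.default_rng(1)
u = rng.uniform(size=2000)
grid = np.linspace(0.0, 1.0, 2001)
alpha = uniform_empirical_process(u, grid)
w_small = oscillation_modulus(alpha, grid, 0.01)
w_large = oscillation_modulus(alpha, grid, 0.1)
```

By construction $\bar\omega_n$ is nondecreasing in $a$, and for small $a$ the modulus is far smaller than the global supremum, in line with the $\sqrt{a\ln a^{-1}}$ rate.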

6.4 Exponential Bounds for Sums of Independent Random Variables

Let $X_1,\dots,X_n$ be independent random variables, not necessarily from the same distribution. Set
$$S:=\sum_{i=1}^{n}X_i,\qquad\bar S:=S/n$$
and
$$\mu:=\mathbb E(\bar S)=n^{-1}\sum_{i=1}^{n}\mathbb E(X_i).$$
Clearly, for each $\varepsilon>0$ and any $h>0$,
$$\mathbb P(\bar S-\mu\ge\varepsilon)=\mathbb P(S-\mathbb E(S)\ge n\varepsilon)=\mathbb P\bigl(h(S-\mathbb E(S))\ge hn\varepsilon\bigr)\le e^{-hn\varepsilon-h\mathbb E(S)}\prod_{i=1}^{n}\mathbb Ee^{hX_i}.\tag{6.4.1}$$

Lemma 6.4.1. Suppose $a\le X\le b$, and let $h\in\mathbb R$. Then
$$\mathbb Ee^{hX}\le\frac{b-\mathbb E(X)}{b-a}e^{ha}+\frac{\mathbb E(X)-a}{b-a}e^{hb}.$$
Proof. We have
$$e^{hX}=e^{h\frac{b-X}{b-a}a+h\frac{X-a}{b-a}b}\le\frac{b-X}{b-a}e^{ha}+\frac{X-a}{b-a}e^{hb},$$
by convexity. Integrate out. □

Lemma 6.4.2 (Hoeffding). Assume that $X_1,\dots,X_n$ are independent with $0\le X_i\le1$. We then have for $0<\varepsilon<1-\mu$:
$$\mathbb P(\bar S-\mu\ge\varepsilon)\le\Bigl[\Bigl(\frac{\mu}{\mu+\varepsilon}\Bigr)^{\mu+\varepsilon}\Bigl(\frac{1-\mu}{1-\mu-\varepsilon}\Bigr)^{1-\mu-\varepsilon}\Bigr]^{n}\le e^{-n\varepsilon^{2}g(\mu)}\le e^{-2n\varepsilon^{2}},$$
where
$$g(\mu)=\begin{cases}\dfrac{1}{1-2\mu}\ln\dfrac{1-\mu}{\mu}&\text{for }0<\mu\le1/2\\[8pt]\dfrac{1}{2\mu(1-\mu)}&\text{for }1/2\le\mu\le1.\end{cases}$$

Proof. Apply Lemma 6.4.1 to get (with $\mu_i\equiv\mathbb EX_i$)
$$\mathbb Ee^{hX_i}\le1-\mu_i+\mu_ie^{h}$$
and therefore
$$\prod_{i=1}^{n}\mathbb Ee^{hX_i}\le\prod_{i=1}^{n}(1-\mu_i+\mu_ie^{h}).$$
Now use the fact that the geometric mean is always less than or equal to the arithmetic mean. It follows that
$$\prod_{i=1}^{n}(1-\mu_i+\mu_ie^{h})\le\Bigl[\frac1n\sum_{i=1}^{n}(1-\mu_i+\mu_ie^{h})\Bigr]^{n}=\bigl[1-\mu+\mu e^{h}\bigr]^{n}.$$
Plugging this into (6.4.1), we get
$$\mathbb P(\bar S-\mu\ge\varepsilon)\le\bigl[e^{-h\varepsilon-h\mu}(1-\mu+\mu e^{h})\bigr]^{n}.$$
Formal differentiation shows that the right-hand side is minimized for
$$h=\ln\frac{(1-\mu)(\mu+\varepsilon)}{\mu(1-\mu-\varepsilon)}.$$
Inserting this $h$ we obtain the first inequality. For the second we note that the first inequality may be written as
$$\mathbb P(\bar S-\mu\ge\varepsilon)\le e^{-n\varepsilon^{2}G(\varepsilon,\mu)}$$
with
$$G(\varepsilon,\mu)=\frac{\mu+\varepsilon}{\varepsilon^{2}}\ln\Bigl(\frac{\mu+\varepsilon}{\mu}\Bigr)+\frac{1-\mu-\varepsilon}{\varepsilon^{2}}\ln\Bigl(\frac{1-\mu-\varepsilon}{1-\mu}\Bigr).$$
Now check
$$g(\mu)=\inf_{0<\varepsilon<1-\mu}G(\varepsilon,\mu).$$
Finally, $g(\mu)\ge2$. □


Hoeffding's inequality may be extended to the case when the bounds vary with $i$.

Lemma 6.4.3 (Hoeffding). Assume that $X_1,\dots,X_n$ are independent with $a_i\le X_i\le b_i$. Then we have for each $\varepsilon>0$:
$$\mathbb P(\bar S-\mu\ge\varepsilon)\le\exp\Bigl[-\frac{2n^{2}\varepsilon^{2}}{\sum_{i=1}^{n}(b_i-a_i)^{2}}\Bigr].$$
Proof. Omitted. □
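Hoeffding's bound is easy to check by simulation. The sketch below compares a Monte Carlo estimate of $\mathbb P(\bar S-\mu\ge\varepsilon)$ for $U[0,1]$ summands (so $b_i-a_i=1$ and $\mu=1/2$) with the bound $\exp(-2n\varepsilon^{2})$; sample sizes are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)
n, eps, reps = 50, 0.1, 50_000

X = rng.uniform(size=(reps, n))               # independent U[0,1] summands, mu = 1/2
tail = np.mean(X.mean(axis=1) - 0.5 >= eps)   # Monte Carlo tail probability
bound = np.exp(-2 * n * eps**2)               # Lemma 6.4.3 with b_i - a_i = 1
```

The empirical tail is considerably smaller than the bound here, which reflects that Hoeffding's inequality is not tight for summands whose variance is much smaller than the worst case.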


Chapter 7

Invariance Principles

7.1 Continuity of Stochastic Processes

The functions $F_n$, $F_n^{-1}$ and $\alpha_n$ depend on the data $X_1,\dots,X_n$ and are thus random. Random functions are realizations of so-called Stochastic Processes. If, e.g., one studies $\alpha_n$ at finitely many points $t_1,\dots,t_m$, then the multivariate CLT may be used to approximate the distribution of $(\alpha_n(t_1),\dots,\alpha_n(t_m))$. See Lemma 3.2.8. The Kolmogorov-Smirnov (K-S) statistic
$$\bar D_n=\sup_{0\le t\le1}|\bar F_n(t)-t|$$
is an example of a quantity depending on all $\bar F_n(t)$, $0\le t\le1$, so that a reduction to a specific finite-dimensional subvector is not possible. In this chapter we shall discuss the problem of how to determine or approximate the distributions of a large class of functions of $\alpha_n$. First, we fix a general stochastic process $S=(S_t)_{0\le t\le1}$. Later on we shall specify $S$ and discuss important examples appearing in statistics. That the parameter set equals the unit interval $[0,1]$ is only for mathematical convenience. What is important is compactness. Along with $S$ we consider the associated family of finite-dimensional distributions (fidis), i.e., the distributions of subvectors $(S_{t_1},\dots,S_{t_m})$, where $0\le t_1<t_2<\dots<t_m\le1$ are arbitrary. From a measure-theoretic argument one can see that the distribution of $S$, which is defined on the space of all functions, is uniquely determined through its fidis. This means that if two processes have the same fidis, then they give equal probabilities to events which are measurable

w.r.t. the $\sigma$-field generated by all finite-dimensional projections
$$f\mapsto(f(t_1),\dots,f(t_m)),\qquad 0\le t_1<t_2<\dots<t_m\le1.$$
For any function $f$ defined on $I=[0,1]$, the degree of smoothness may be measured through the oscillation modulus
$$\omega(f,\delta)=\sup_{|t_1-t_2|\le\delta}|f(t_1)-f(t_2)|.$$
Uniform continuity of $f$ is equivalent to
$$\lim_{\delta\to0}\omega(f,\delta)=0,\tag{7.1.1}$$
while Lipschitz continuity
$$|f(t_1)-f(t_2)|\le C|t_1-t_2|$$
yields
$$\omega(f,\delta)\le C\delta.$$
Since
$$\omega(f,\delta_1)\le\omega(f,\delta_2)\quad\text{if }\delta_1\le\delta_2,$$
(7.1.1) is equivalent to
$$\lim_{n\to\infty}\omega(f,\delta_n)=0\quad\text{for any sequence }\delta_n\downarrow0.$$
Sample continuity of $S(\omega)$ for almost all $\omega$ is equivalent to
$$\lim_{\delta\to0}\mathbb P(\omega(S,\delta)>\varepsilon)=0\quad\text{for all }\varepsilon>0.\tag{7.1.2}$$
Verification of (7.1.2) is not easy since the oscillation modulus incorporates all pairs $|t_1-t_2|\le\delta$.
A well-known result due to A. N. Kolmogorov gives sufficient conditions through bivariate fidis which ensure uniform continuity of $S(\omega)$ with probability one, at least if one restricts $S$ to the dyadic numbers $T=\{k/2^{n}:0\le k\le2^{n},\ n\ge1\}$.

Theorem 7.1.1 (A. N. Kolmogorov). Let $S=(S_t)_{0\le t\le1}$ be any stochastic process such that there exist $K<\infty$, $a\ge0$ and $b>1$ so that for any $\varepsilon>0$
$$\mathbb P(|S_t-S_{t'}|>\varepsilon)\le K\varepsilon^{-a}|t-t'|^{b}.\tag{7.1.3}$$
Then with probability one $S(\omega)$ is uniformly continuous on $T$.

A sufficient condition for (7.1.3) is
$$\mathbb E|S_t-S_{t'}|^{a}\le K|t-t'|^{b}.\tag{7.1.4}$$
In fact, by the Chebyshev inequality, (7.1.4) implies
$$\mathbb P(|S_t-S_{t'}|>\varepsilon)\le\varepsilon^{-a}\mathbb E|S_t-S_{t'}|^{a}\le K\varepsilon^{-a}|t-t'|^{b},$$
i.e., (7.1.3) holds.
The restriction to the dyadic numbers is not serious, since by putting
$$\tilde S_t:=\lim_{t'\to t,\ t'\in T}S_{t'},$$
we obtain a continuous version $\tilde S$ which has the same fidis as $S$.

7.2 Gaussian Processes

Lemma 3.2.8 asserts that the fidis of $\alpha_n$ have normal limits. Processes which have normal fidis not only in the limit are of special importance since they are candidates for limit processes.

Definition 7.2.1. A stochastic process $(S_t)_{0\le t\le1}$ is called a Gaussian Process if all fidis are normal distributions.

Normal distributions are uniquely determined through
$$m(t)=\mathbb ES_t,\quad\text{the mean function},$$
and
$$K(s,t)=\operatorname{Cov}(S_s,S_t),\quad\text{the covariance function}.$$
If $m(t)=0$ for all $t$, $(S_t)_t$ is called a centered Gaussian Process. Recall that $K$ is symmetric:
$$K(s,t)=K(t,s)$$
and nonnegative definite: for any $t_1,\dots,t_k$ and $\lambda_1,\dots,\lambda_k\in\mathbb R$:
$$\sum_{i=1}^{k}\sum_{j=1}^{k}\lambda_iK(t_i,t_j)\lambda_j\ge0.$$

It can be shown that for any $K$ satisfying these two properties there exists a centered Gaussian process $(S_t)_t$ such that
$$\operatorname{Cov}(S_s,S_t)=K(s,t).$$
We now introduce the most famous Gaussian Process, the Brownian Motion. It belongs to
$$K(s,t)=\min(s,t).$$
Clearly, $K$ is symmetric. For $0\le t_1<t_2<\dots<t_k$ let $X_1,\dots,X_k$ be independent normal random variables such that
$$X_1\sim N(0,t_1),\ X_2\sim N(0,t_2-t_1),\ \dots,\ X_k\sim N(0,t_k-t_{k-1}).$$
Put
$$S_1=X_1,\quad S_2=X_1+X_2,\ \dots,\ S_k=X_1+\dots+X_k.$$
Then, for $i<j$, we have
$$\sigma_{ij}=\mathbb E[S_iS_j]=\mathbb E\Bigl[S_i\Bigl(S_i+\sum_{k=i+1}^{j}X_k\Bigr)\Bigr]=\mathbb ES_i^{2}=t_i=\min(t_i,t_j).$$
Hence $K(t_i,t_j)$ is a covariance function and therefore nonnegative definite.
Theorem 7.2.2. Let $B=(B_t)_{0\le t\le1}$ be a Gaussian Process with
$$\mathbb EB_t=0\quad\text{and}\quad\operatorname{Cov}(B_s,B_t)=\min(s,t).$$
Then we have:
(i) The process $B$ exists.
(ii) $B$ has continuous sample paths.
(iii) $B_0=0$ w.p. 1.
(iv) $B$ has independent increments with $B_t-B_s\sim N(0,t-s)$ for $0\le s\le t\le1$.
(v) $\mathbb E[(B_t-B_s)^{2}]=t-s$ for $0\le s\le t\le1$.

Proof. We have already seen that $\min(s,t)$ is an admissible covariance function. Furthermore, $m(t)\equiv0$. Hence $B$ exists. Putting $s=t$, we get $\operatorname{Var}B_t=t$. For $t=0$, we have $\mathbb EB_0=0=\operatorname{Var}B_0$ and therefore $B_0=0$ with probability one. The variable $B_t-B_s$ is a linear function of the Gaussian random vector $(B_s,B_t)$ and therefore Gaussian. Check that for $0\le s\le t\le1$
$$\mathbb E(B_t-B_s)^{2}=\mathbb EB_t^{2}+\mathbb EB_s^{2}-2\mathbb EB_tB_s=t+s-2s=t-s.$$
For increments $B_{t_1},B_{t_2}-B_{t_1},\dots,B_{t_k}-B_{t_{k-1}}$ we get for $i<j$:
$$\mathbb E\bigl[(B_{t_j}-B_{t_{j-1}})(B_{t_i}-B_{t_{i-1}})\bigr]=t_i-t_{i-1}-t_i+t_{i-1}=0.$$
Hence the increments are uncorrelated. Since they are jointly Gaussian, they are also independent. It remains to prove continuity. For this we verify Kolmogorov's criterion: For $s<t$, $B_t-B_s\sim N(0,t-s)$. Therefore,
$$\mathbb E[B_t-B_s]^{4}=(t-s)^{2}\mathbb E\xi^{4},\qquad\xi\sim N(0,1).$$
This completes the proof. □

We already mentioned that we could define a Brownian Motion on any interval $[0,T]$, and finally on $[0,\infty)$.

7.3 Brownian Motion

The Brownian Motion is the most important Gaussian Process. Since $B_0=0$, it starts in zero. The process
$$\tilde B_t=x+B_t$$
is called a Brownian Motion starting in $x$. The process
$$\tilde B_t=\sigma B_t$$
is called a Brownian Motion with scale parameter $\sigma>0$. For $\sigma=1$ and $x=0$, we call $B$ a standard Brownian Motion. In many applications, we come up with Brownian Motions which involve a transformation of time:
$$\tilde B_t=B_{\varphi(t)}.$$
This process is also a centered Gaussian process with covariance
$$\operatorname{cov}(\tilde B_s,\tilde B_t)=\min(\varphi(s),\varphi(t)).$$
If $\varphi$ is nondecreasing,
$$\min(\varphi(s),\varphi(t))=\varphi(s)\quad\text{for }s\le t.$$
From $B$ we may obtain other Gaussian processes.

Lemma 7.3.1. Let $\lambda>0$. Then
$$\tilde B_t=\lambda^{-1/2}B_{\lambda t},\qquad t\ge0,$$
is a Brownian Motion.

Proof. $(\tilde B_t)_t$ is a centered Gaussian process with covariance
$$\operatorname{Cov}(\tilde B_s,\tilde B_t)=\lambda^{-1}\operatorname{cov}(B_{\lambda s},B_{\lambda t})=\lambda^{-1}\lambda s=s\quad\text{for }0\le s\le t.\ \square$$

Lemma 7.3.2. Fix $0<t_0$. Then
$$\tilde B_t=B_{t_0+t}-B_{t_0},\qquad t\ge0,$$
is a Brownian Motion.

Proof. $\tilde B$ is a centered Gaussian process with covariance
$$\operatorname{Cov}(\tilde B_s,\tilde B_t)=\mathbb E\bigl[(B_{t_0+s}-B_{t_0})(B_{t_0+t}-B_{t_0})\bigr]=t_0+s-t_0-t_0+t_0=s\quad\text{for }0\le s\le t.\ \square$$

The following process is of central importance in the theory of Empirical Processes.

Lemma 7.3.3. Let $B=(B_t)_{0\le t\le1}$ be a Brownian Motion on the unit interval. Put
$$B_t^{0}=B_t-tB_1,\qquad 0\le t\le1.$$
Then $(B_t^{0})_{0\le t\le1}$ is a centered Gaussian process with covariance
$$\operatorname{Cov}(B_s^{0},B_t^{0})=s-st\quad\text{for all }0\le s\le t\le1.$$

Proof.
$$\operatorname{Cov}(B_s^{0},B_t^{0})=\mathbb E\bigl[(B_s-sB_1)(B_t-tB_1)\bigr]=s-st-st+st=s-st.\ \square$$

Remark 7.3.4. Note that $\bar\alpha_n$ and $B^{0}$ are both centered and have the same covariance structure. Also
$$\bar\alpha_n(0)=B_0^{0}=0=\bar\alpha_n(1)=B_1^{0}.$$
Therefore $B^{0}$ is called a Brownian Bridge.
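Lemma 7.3.3 is easy to reproduce in simulation: generate Brownian motion paths from independent $N(0,1/m)$ increments, form $B_t^{0}=B_t-tB_1$, and estimate the covariance at a pair $(s,t)$. The grid size, number of paths, and the pair $(s,t)=(0.25,0.5)$ below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
m, reps = 200, 5000
t = np.linspace(0.0, 1.0, m + 1)

# Brownian motion paths via cumulative sums of N(0, 1/m) increments
dB = rng.normal(scale=np.sqrt(1.0 / m), size=(reps, m))
B = np.hstack([np.zeros((reps, 1)), np.cumsum(dB, axis=1)])

B0 = B - t * B[:, -1:]            # Brownian bridge: B0_t = B_t - t * B_1

s_idx, t_idx = m // 4, m // 2     # s = 0.25, t = 0.5
cov_hat = np.mean(B0[:, s_idx] * B0[:, t_idx])
```

Here `cov_hat` should be close to $s(1-t)=0.125$, in agreement with $\operatorname{Cov}(B_s^{0},B_t^{0})=s-st$, and every path is tied down at both endpoints.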

7.4 Donsker's Invariance Principles

In Section 3.4 we have derived several probabilities for events depending on the sample path of $\bar F_n$. Most of these formulae are complicated and may be implemented only for a few small $n$. For other $n$, one may wonder if there is a limit as $n\to\infty$ which may then serve as an approximation when no exact probabilities are available.
The most important example of an approximation through a limit distribution is the CLT: Assume that $X_1,\dots,X_n$ are i.i.d. with d.f. $F$ and $\mathbb EX_1^{2}<\infty$; then
$$\lim_{n\to\infty}\mathbb P\Bigl((n\sigma^{2})^{-1/2}\sum_{i=1}^{n}(X_i-\mathbb EX_1)\le x\Bigr)=\Phi(x)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x}e^{-y^{2}/2}\,\mathrm dy,$$
where $\sigma^{2}=\operatorname{Var}X_1$.
In other words, if $G_n$ denotes the d.f. of $(n\sigma^{2})^{-1/2}\sum_{i=1}^{n}(X_i-\mathbb EX_1)$, then
$$\lim_{n\to\infty}G_n(x)=\Phi(x)\quad\text{for all }x\in\mathbb R.$$
This convergence is called convergence in distribution or convergence in law. In other situations, the limit distribution need not be continuous. Then convergence is only required at those points at which the limit function is continuous:
$$\lim_{n\to\infty}G_n(x)=G(x)\quad\text{for all }x\text{ at which }G\text{ is continuous}.$$
The same concept applies to random vectors.
To discuss the distributional convergence of random processes like $\bar\alpha_n$, as $n\to\infty$, one has to replace intervals or quadrants by sets $A$ in the space of functions which carry no mass on their boundary.

Definition 7.4.1. Assume that $(S_n(t))_{0\le t\le1}$ is a sequence of stochastic processes whose sample paths are in some metric space $(\mathcal X,d)$. Let $(S(t))_{0\le t\le1}$ be another process. Then $S_n$ converges in distribution to $S$,
$$S_n\xrightarrow{\mathcal L}S,$$
if and only if
$$\lim_{n\to\infty}\mathbb P(S_n\in A)=\mathbb P(S\in A)\quad\text{for all }A\subset\mathcal X\text{ such that }\mathbb P(S\in\partial A)=0.$$
Here $\partial A$ is the boundary of $A$.


A classical book on convergence of stochastic processes is Billingsley (1968). In applications the most important examples for the space $\mathcal X$ are $\mathcal X=C(T)$, the space of continuous functions over some compact space $T$. So far we only studied $T=[0,1]$, but we may also consider higher-dimensional cubes or other subsets of $\mathbb R^{d}$ as well. The other important space of sample functions is the so-called Skorokhod space $D[0,1]$, consisting of all right-continuous functions with left-hand limits. The empirical d.f. and the empirical process have sample paths in this space. Billingsley (1968) provides a thorough discussion of how $D[0,1]$ may be metrized. The following two results give sufficient conditions which guarantee convergence in distribution in $C$ and $D$.

Theorem 7.4.2. Let $(S_n(t))_{0\le t\le1}$ and $(S(t))_{0\le t\le1}$ be stochastic processes in $C[0,1]$ such that
$$(S_n(t_1),\dots,S_n(t_k))\xrightarrow{\mathcal L}(S(t_1),\dots,S(t_k))$$
for all $0<t_1<\dots<t_k<1$ (convergence of fidis) (7.4.1)
and
$$\lim_{\delta\to0}\limsup_{n\to\infty}\mathbb P(\omega(S_n,\delta)>\varepsilon)=0\quad\text{for all }\varepsilon>0.\tag{7.4.2}$$
Then we have: $S_n\to S$ in distribution.
In many applications the limit process will be Gaussian, and the convergence of the fidis follows from an application of the multivariate CLT. One possibility to check (7.4.2) is to show that (7.1.3) holds uniformly in $n$, with the same $K$, $a$ and $b$.
The following theorem constitutes the counterpart of Theorem 7.4.2 for the case $D[0,1]$.

Theorem 7.4.3. Let $(S_n(t))_{0\le t\le1}$ and $(S(t))_{0\le t\le1}$ be stochastic processes in $D[0,1]$ such that
$$S\text{ has continuous sample paths},\tag{7.4.3}$$
$$(S_n(t_1),\dots,S_n(t_k))\xrightarrow{\mathcal L}(S(t_1),\dots,S(t_k))$$
for all $0\le t_1<t_2<\dots<t_k\le1$ (convergence of fidis), (7.4.4)
$$\lim_{\delta\to0}\limsup_{n\to\infty}\mathbb P(\omega(S_n,\delta)>\varepsilon)=0\quad\text{for all }\varepsilon>0.\tag{7.4.5}$$
Then: $S_n\to S$ in distribution.

The last theorem covers an important special case, namely that the limit $S$ is continuous though for each $n\ge1$ the process $S_n$ may have jumps. This is exactly the case for the uniform empirical process $\bar\alpha_n$. In view of Lemma 3.2.8 the candidate for the limit is the Brownian Bridge $S=B^{0}$. While this process has continuous sample paths, $\bar\alpha_n$ has $n$ discontinuities. Note, however, that the jump size $n^{-1/2}$ tends to zero so that, as $n\to\infty$,
- the number of jumps tends to infinity,
- the jump sizes tend to zero,
- the limit process is continuous.
In many cases the continuity of $S$ is obtained through an application of Theorem 7.1.1, while the convergence of the fidis will again follow from an application of the multivariate CLT. Verification of (7.4.5) typically requires more work.
In the literature the convergence of $S_n$ to some limit $S$ is often called an Invariance Principle. The reason for this becomes clear if one recalls the classical CLT. Typically, the distribution $F$ of the summands $X_i$ is arbitrary and the distribution of the sums is complicated. The CLT then implies that in the limit this distribution is, up to the scale parameter $\sigma$, invariant w.r.t. $F$, i.e., does not depend on $F$. Similar things occur with the so-called functional limit theorems covered by Theorems 7.4.2 and 7.4.3.
The first invariance principles were due to Donsker (1952/53).

Theorem 7.4.4 (Donsker). Let $X_1,\dots,X_n$ be i.i.d. with $\mathbb EX_i=0$ and $\operatorname{Var}X_i=1$. Put
$$S_n(t)=\frac{1}{\sqrt n}\sum_{i=1}^{\lfloor nt\rfloor}X_i,\qquad 0\le t\le1,$$
the so-called partial sum process. Then we have
$$S_n\xrightarrow{\mathcal L}B,$$
where $B$ is the Brownian Motion.

Theorem 7.4.5 (Donsker). Let $U_1,\dots,U_n$ be i.i.d. from $U[0,1]$. Denote with $\bar\alpha_n$ the uniform empirical process. Then
$$\bar\alpha_n\xrightarrow{\mathcal L}B^{0},$$
where $B^{0}$ is the Brownian Bridge.

Proof. The assertion follows from Theorem 7.4.3 upon applying Lemma 3.2.8 and Theorem 6.3.3. □
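Theorem 7.4.4 can be visualized by simulating the partial sum process; as a minimal check, $\operatorname{Var}S_n(t)$ should approach $\operatorname{Var}B_t=t$. The Rademacher summands below are an illustrative choice of centered, unit-variance variables:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 1000, 8000

X = rng.choice([-1.0, 1.0], size=(reps, n))   # Rademacher: E X = 0, Var X = 1
S = np.cumsum(X, axis=1) / np.sqrt(n)         # S_n(t) at t = i/n, i = 1..n

var_half = S[:, n // 2 - 1].var()             # should be near Var B_{1/2} = 0.5
var_one = S[:, -1].var()                      # should be near Var B_1 = 1.0
```

The invariance in Donsker's theorem says that the same limit appears for any centered, unit-variance choice of the summands.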

There is an important technique which shows how useful invariance principles may be. This is based on the so-called Continuous Mapping Theorem (CMT).

Theorem 7.4.6 (CMT). Assume $S_n\xrightarrow{\mathcal L}S$. Let $T$ be any continuous functional. Then
$$T(S_n)\xrightarrow{\mathcal L}T(S).$$
We shall now demonstrate how Donsker's invariance principle for empirical processes in connection with the CMT may be applied to easily get the limit distribution of various statistics of interest.

The one- and two-sided Smirnov and Kolmogorov-Smirnov statistics have been defined as
$$D_n^{-}=\sup_{t\in\mathbb R}[F(t)-F_n(t)],\qquad D_n^{+}=\sup_{t\in\mathbb R}[F_n(t)-F(t)]$$
and
$$D_n=\sup_{t\in\mathbb R}|F_n(t)-F(t)|.$$
By the Glivenko-Cantelli Theorem, all three quantities converge to zero and are therefore degenerate in the limit. To obtain non-degenerate limits one needs to redefine them accordingly:
$$D_n^{-}=n^{1/2}\sup_{t\in\mathbb R}[F(t)-F_n(t)],\qquad D_n^{+}=n^{1/2}\sup_{t\in\mathbb R}[F_n(t)-F(t)]$$
and
$$D_n=n^{1/2}\sup_{t\in\mathbb R}|F_n(t)-F(t)|.$$
Under continuity of $F$ we have, e.g.,
$$D_n=n^{1/2}\sup_{0\le t\le1}|\bar F_n(t)-t|=\sup_{0\le t\le1}|\bar\alpha_n(t)|.$$
This is a continuous functional of $\bar\alpha_n$. Therefore, Donsker and the CMT imply
$$D_n\xrightarrow{\mathcal L}\sup_{0\le t\le1}|B^{0}(t)|.\tag{7.4.6}$$
Similarly,
$$D_n^{+}\xrightarrow{\mathcal L}\sup_{0\le t\le1}B^{0}(t)\qquad\text{and}\qquad D_n^{-}\xrightarrow{\mathcal L}\sup_{0\le t\le1}B^{0}(t).\tag{7.4.7}$$
We see that three convergence results easily follow from one basic result: the invariance principle for the underlying process $\bar\alpha_n$.
Another example is the Cramér-von Mises statistic
$$\mathrm{CvM}=n\int[F_n(t)-F(t)]^{2}\,F(\mathrm dt),$$
which under continuity of $F$ becomes
$$\mathrm{CvM}=n\int_{0}^{1}[\bar F_n(t)-t]^{2}\,\mathrm dt=\int_{0}^{1}\bar\alpha_n^{2}(t)\,\mathrm dt.$$

By Donsker and CMT we get
$$\mathrm{CvM}\xrightarrow{\mathcal L}\int_{0}^{1}[B^{0}(t)]^{2}\,\mathrm dt.$$
The distributions of the limits in (7.4.6) and (7.4.7) are known and tabulated.

Lemma 7.4.7. For $x\ge0$ we have
$$\mathbb P\Bigl(\sup_{0\le t\le1}B^{0}(t)\ge x\Bigr)=\exp[-2x^{2}],$$
$$\mathbb P\Bigl(\sup_{0\le t\le1}|B^{0}(t)|\le x\Bigr)=1-2\sum_{k=1}^{\infty}(-1)^{k+1}\exp[-2k^{2}x^{2}]=:K(x).$$
Hence, e.g., from (7.4.6) we get
$$\lim_{n\to\infty}\mathbb P(D_n\le x)=K(x).\tag{7.4.8}$$
Equation (7.4.8) is the basis for the classical Kolmogorov-Smirnov Goodness-of-Fit Test. The null hypothesis to be tested is
$$H_0:F=F_0,$$
where $F_0$ is a fully specified distribution (simple hypothesis). Consider
$$D_n:=n^{1/2}\sup_{t\in\mathbb R}|F_n(t)-F_0(t)|.$$
For a given significance level $0<\alpha<1$, choose $c>0$ so that $K(c)=1-\alpha$. Reject $H_0$ iff $D_n\ge c$. From (7.4.8) we get that asymptotically the error of the first kind equals $\alpha$.
The same procedure works for CvM. In this case the critical value $c$ needs to be taken from the distribution of $\int_{0}^{1}[B^{0}(t)]^{2}\,\mathrm dt$, which is tabulated.
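The resulting test is straightforward to implement. The sketch below (with helper names of our own choosing) computes $D_n$ for a simple hypothesis $F_0$ and the asymptotic p-value $1-K(D_n)$ from the series in Lemma 7.4.7:

```python
import numpy as np

def ks_statistic(x, F0):
    """D_n = sqrt(n) * sup_t |F_n(t) - F0(t)| for a fully specified F0."""
    u = F0(np.sort(x))
    n = len(x)
    i = np.arange(1, n + 1)
    # the sup is attained at the jump points of F_n, approached from above or below
    return np.sqrt(n) * max(np.max(i / n - u), np.max(u - (i - 1) / n))

def K(x, terms=100):
    """Kolmogorov d.f.: K(x) = 1 - 2 * sum_{k>=1} (-1)^(k+1) * exp(-2 k^2 x^2)."""
    k = np.arange(1, terms + 1)
    return 1.0 - 2.0 * np.sum((-1.0) ** (k + 1) * np.exp(-2.0 * k**2 * x**2))

rng = np.random.default_rng(3)
Dn = ks_statistic(rng.uniform(size=500), lambda t: t)  # H0 true: F0 = U[0,1]
p_value = 1.0 - K(Dn)
```

Under $H_0$ the p-value is asymptotically uniform; $H_0$ is rejected at level $\alpha$ iff $D_n\ge c$ with $K(c)=1-\alpha$.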

7.5 More Invariance Principles

The next invariance principle covers a two-sample situation. Let $X_1,\dots,X_n$ and $Y_1,\dots,Y_m$ be two independent samples with
$$X_1\sim F\quad\text{and}\quad Y_1\sim G\ (\text{continuous}),$$
both unknown. The null hypothesis to be tested is
$$H_0:F=G.$$
Denote with $F_n$ and $G_m$ the associated empirical d.f.s. The test statistic of K-S type equals
$$D_{nm}=\gamma_{nm}\sup_{t\in\mathbb R}|F_n(t)-G_m(t)|,$$
where at this moment we leave open how the standardizing factor $\gamma_{nm}$ looks. To determine critical values one needs to know or at least approximate the distribution of $D_{nm}$ under $H_0$. Now, under $H_0$,
$$D_{nm}=\gamma_{nm}\sup_{0\le t\le1}|\bar F_n(t)-\bar G_m(t)|.$$
Put
$$\bar\alpha_n(t)=n^{1/2}[\bar F_n(t)-t]\quad\text{and}\quad\bar\beta_m(t)=m^{1/2}[\bar G_m(t)-t],$$
and set $N=n+m$, the sample size of the pooled sample. With $\lambda_N=\frac nN$, we get
$$\sqrt N[\bar F_n(t)-\bar G_m(t)]=\sqrt N\bigl[n^{-1/2}\bar\alpha_n(t)-m^{-1/2}\bar\beta_m(t)\bigr]=\lambda_N^{-1/2}\bar\alpha_n(t)-(1-\lambda_N)^{-1/2}\bar\beta_m(t).$$
Assuming that
$$\lambda_N\to\lambda\quad\text{for some }0<\lambda<1,$$
we get from Donsker that
$$\sqrt N[\bar F_n-\bar G_m]\xrightarrow{\mathcal L}\lambda^{-1/2}B^{0}-(1-\lambda)^{-1/2}B^{1}=:\tilde B,$$
where $B^{0}$ and $B^{1}$ are two independent Brownian Bridges. The process $\tilde B$, as a linear combination of two independent Gaussian processes, is also a Gaussian process with expectation zero and covariance function, for $s\le t$,
$$\tilde K(s,t)=\mathbb E[\tilde B(s)\tilde B(t)]=\lambda^{-1}s(1-t)+(1-\lambda)^{-1}s(1-t)=\Bigl[\frac1\lambda+\frac{1}{1-\lambda}\Bigr]s(1-t)=\frac{s(1-t)}{\lambda(1-\lambda)}.$$
This gives us a clue how to choose the standardizing factor in the definition of $D_{nm}$. Put
$$\bar\alpha_N=\sqrt{\lambda_N(1-\lambda_N)}\,\sqrt N[\bar F_n-\bar G_m].$$
Then
$$\bar\alpha_N\to\tilde B_0\quad\text{in distribution},$$
where $\tilde B_0$ is a Brownian Bridge. Hence $H_0$ is rejected at level $\alpha$ iff
$$D_{nm}=\sqrt{\frac{nm}{N}}\sup_{t\in\mathbb R}|F_n(t)-G_m(t)|\ge c,$$
where $c$ is the $(1-\alpha)$-quantile of $K$ in Lemma 7.4.7.
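A compact implementation of this two-sample statistic (helper name ours) evaluates both empirical d.f.s at the pooled data points, where the supremum is attained:

```python
import numpy as np

def two_sample_ks(x, y):
    """D_{nm} = sqrt(nm/N) * sup_t |F_n(t) - G_m(t)|, with N = n + m."""
    n, m = len(x), len(y)
    xs, ys = np.sort(x), np.sort(y)
    pooled = np.concatenate([xs, ys])          # candidate locations for the sup
    Fn = np.searchsorted(xs, pooled, side="right") / n
    Gm = np.searchsorted(ys, pooled, side="right") / m
    return np.sqrt(n * m / (n + m)) * np.max(np.abs(Fn - Gm))

rng = np.random.default_rng(4)
d_null = two_sample_ks(rng.normal(size=300), rng.normal(size=200))             # F = G
d_shift = two_sample_ks(rng.normal(size=300), rng.normal(1.0, 1.0, size=200))  # F != G
```

Under $H_0$, `d_null` behaves like $\sup|\tilde B_0|$ and is compared with the quantiles of $K$; under the shift alternative the statistic diverges with the sample sizes.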
In our next example we use empirical process theory to design tests for symmetry of $F$ at zero:
$$H_0:F(x)+F(-x)=1\quad\text{for all }x\in\mathbb R.\tag{7.5.1}$$
We may expect that under $H_0$
$$F_n(x)+F_n(-x)\approx1\quad\text{for all }x\in\mathbb R.$$
Therefore we study the following variant of an empirical process:
$$\beta_n(x)=n^{1/2}[F_n(x)-1+F_n(-x)],\qquad x\in\mathbb R.$$
Under (7.5.1) we obtain
$$\beta_n(x)=n^{1/2}[\bar F_n(F(x))-1+\bar F_n(F(-x))]=n^{1/2}[\bar F_n(F(x))-1+\bar F_n(1-F(x))].$$
Putting $t=F(x)$, it remains to study
$$\bar\beta_n(t)=n^{1/2}[\bar F_n(t)-1+\bar F_n(1-t)]=\bar\alpha_n(t)+\bar\alpha_n(1-t),\qquad 0\le t\le1.$$
From Donsker we get
$$\bar\beta_n\xrightarrow{\mathcal L}\tilde B,$$
with
$$\tilde B(t)=B^{0}(t)+B^{0}(1-t).$$
Our next aim will be to determine the distribution of $\sup_{0\le t\le1}|\tilde B(t)|$. For this, put
$$R(t)=B^{0}\Bigl(\frac{1+t}{2}\Bigr)+B^{0}\Bigl(\frac{1-t}{2}\Bigr),\qquad-1\le t\le1.$$

Note that
$$\tilde B(t)=R(2t-1),\qquad 0\le t\le1.$$
The process
$$Z(t)=R(1-t),\qquad 0\le t\le1,$$
is a Brownian Motion on $[0,1]$. Actually, $Z$ is a zero-mean Gaussian process with covariance, if $0\le s\le t\le1$,
$$\begin{aligned}\mathbb E[Z(s)Z(t)]&=\mathbb E[R(1-s)R(1-t)]\\
&=\mathbb E\Bigl[B^{0}\Bigl(1-\frac s2\Bigr)B^{0}\Bigl(1-\frac t2\Bigr)\Bigr]+\mathbb E\Bigl[B^{0}\Bigl(1-\frac s2\Bigr)B^{0}\Bigl(\frac t2\Bigr)\Bigr]\\
&\quad+\mathbb E\Bigl[B^{0}\Bigl(\frac s2\Bigr)B^{0}\Bigl(1-\frac t2\Bigr)\Bigr]+\mathbb E\Bigl[B^{0}\Bigl(\frac s2\Bigr)B^{0}\Bigl(\frac t2\Bigr)\Bigr]\\
&=\Bigl(1-\frac t2\Bigr)\frac s2+\frac s2\,\frac t2+\frac s2\,\frac t2+\frac s2\Bigl(1-\frac t2\Bigr)=s.\end{aligned}$$
Since
$$\{\bar\beta_n(t):0\le t\le1\}\xrightarrow{\mathcal L}\{R(2t-1):0\le t\le1\},$$
the CMT implies
$$\sup_{0\le t\le1}|\bar\beta_n(t)|\xrightarrow{\mathcal L}\sup_{0\le t\le1}|R(2t-1)|.$$
Now, $R(-t)=R(t)$, whence
$$\sup_{0\le t\le1}|R(2t-1)|=\sup_{0\le t\le1}|R(t)|=\sup_{0\le t\le1}|Z(t)|.$$

Thus we have obtained

Theorem 7.5.1. Under $H_0$ (provided $F$ is continuous):
$$\sup_{x\in\mathbb R}|\beta_n(x)|\xrightarrow{\mathcal L}\sup_{0\le t\le1}|Z(t)|,$$
where $Z$ is a Brownian Motion.
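A direct implementation of the symmetry statistic $\sup_x|\beta_n(x)|$ evaluates the process at its jump points (the data points and their negatives, from both sides); the helper name below is our own:

```python
import numpy as np

def symmetry_statistic(x):
    """sup_x |sqrt(n) * (F_n(x) - 1 + F_n(-x))|."""
    n = len(x)
    xs = np.sort(x)
    Fn = lambda t: np.searchsorted(xs, t, side="right") / n
    pts = np.concatenate([xs, -xs])                         # jump points of the process
    vals = Fn(pts) - 1.0 + Fn(-pts)
    vals_left = Fn(pts - 1e-12) - 1.0 + Fn(-pts + 1e-12)    # left-hand limits
    return np.sqrt(n) * max(np.max(np.abs(vals)), np.max(np.abs(vals_left)))

rng = np.random.default_rng(5)
s_sym = symmetry_statistic(rng.normal(size=400))              # H0 holds
s_shift = symmetry_statistic(rng.normal(0.5, 1.0, size=400))  # symmetry at zero fails
```

Under $H_0$, `s_sym` is approximately distributed as $\sup|Z|$; under the shifted alternative the statistic grows like $\sqrt n$.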

7.6 Parameter Empirical Processes (Regression)

So far invariance principles were discussed for processes where the parameter set was identical with the scale of the observations $X_1,\dots,X_n$. In this section we study a typical situation in parametric statistics. The example is taken from regression. The observations $(X_i,Y_i)$, $1\le i\le n$, are independent from the same unknown d.f. $F$. The $X$'s take their values in $\mathbb R^{d}$, while the $Y$'s are real-valued. If $\mathbb E|Y_i|<\infty$, each $Y_i$ may be decomposed into
$$Y_i=m(X_i)+\varepsilon_i,$$
where $\mathbb E[\varepsilon_i|X_i]=0$. The function $m$ is the regression function of $Y$ w.r.t. $X$ and is also unknown. Very often one assumes that $m$ belongs to a parametric family $\mathcal M=\{m(\cdot,\theta):\theta\in\Theta\subset\mathbb R^{p}\}$ of functions, i.e., one assumes that the true $m$ satisfies
$$m(x)=m(x,\theta_0)\quad\text{for some }\theta_0\in\Theta\text{ and all }x\in\mathbb R^{d}.\tag{7.6.1}$$
The best studied case is the Linear Model, in which case
$$m(x,\theta)=\langle x,\theta\rangle=\sum_{i=1}^{d}x_i\theta_i.$$
If necessary, one may enlarge $x$ by a component 1, leading to an additional intercept parameter $\theta_0$:
$$m(x,\theta)=\theta_0+\sum_{i=1}^{d}x_i\theta_i.$$
Another class of popular models are the so-called Generalized Linear Models, in which the dependence on $\theta$ is nonlinear. See the book of Fahrmeir and Tutz.
The classical estimator of $\theta_0$ is the Least Squares Estimator (LSE)
$$\theta_n=\arg\min_{\theta}n^{-1}\sum_{i=1}^{n}[Y_i-m(X_i,\theta)]^{2}.$$
In this section we show how so-called Parameter Empirical Processes may be used to analyze $\theta_n$. Of course, the model assumption (7.6.1) need not be true at all, so that we have to study the behavior of $\theta_n$ whether $m\in\mathcal M$ is true or not.

In the first step we want to find out whether $\theta_n$ converges to some $\theta_0^{*}$. This limit should be the true parameter $\theta_0$ under $H_0:m\in\mathcal M$. If the model is not correctly specified, $\theta_n$ also makes sense, but there exists no such $\theta_0$. Summarizing, we may expect that $\theta_n$ approaches some $\theta_0^{*}$, as $n\to\infty$, which coincides with $\theta_0$ when the model is true.
In the second step we need to analyze the distributional convergence of $n^{1/2}(\theta_n-\theta_0^{*})$ a little closer. Among other things, we will come up with a so-called i.i.d. representation.
Before we analyze $\theta_n$, we fix $\theta$. Then, by the SLLN,
$$n^{-1}\sum_{i=1}^{n}[Y_i-m(X_i,\theta)]^{2}\to\mathbb E[Y-m(X,\theta)]^{2}=\mathbb EY^{2}-2\int m(x)m(x,\theta)F_0(\mathrm dx)+\int m^{2}(x,\theta)F_0(\mathrm dx),\tag{7.6.2}$$
where $F_0$ is the (marginal) d.f. of $X$. Expression (7.6.2) equals
$$\mathbb EY^{2}-\int m^{2}(x)F_0(\mathrm dx)+\int[m(x)-m(x,\theta)]^{2}F_0(\mathrm dx).\tag{7.6.3}$$
Since $\mathbb EY^{2}-\int m^{2}(x)F_0(\mathrm dx)$ does not depend on $\theta$, (7.6.2) and (7.6.3) only differ by a constant. Hence they have the same minimizer, say $\theta_0^{*}$. In other words,
$$\theta_0^{*}=\arg\min_{\theta}\int[m(x)-m(x,\theta)]^{2}F_0(\mathrm dx).$$
If $m(x)=m(x,\theta_0)$, then clearly $\theta_0^{*}=\theta_0$.
As regularity assumptions we require that, as a function of $\theta$, the function $m(x,\theta)$ is twice continuously differentiable. Put
$$m'(x,\theta)=\frac{\partial m(x,\theta)}{\partial\theta}\quad\text{and}\quad\frac{\partial^{2}m(x,\theta)}{\partial\theta_i\partial\theta_j}=:m''_{ij}(x,\theta)\equiv(M(x,\theta))_{ij},\qquad 1\le i,j\le p.\tag{7.6.4}$$
As we shall see, second-order differentiability is indispensable for the understanding of $\theta_n$ if $H_0$ is not satisfied. All vectors will be column vectors,

and $^{T}$ will denote their transpose. Recall $\theta_0^{*}$. A crucial role in our analysis will be played by the $p\times p$ matrix
$$A=\int m'(x,\theta_0^{*})[m'(x,\theta_0^{*})]^{T}F_0(\mathrm dx)+\int[m(x,\theta_0^{*})-m(x)]M(x,\theta_0^{*})F_0(\mathrm dx).$$
The second integral vanishes in two cases:
- when $H_0$ is satisfied, so that $\theta_0^{*}=\theta_0$ and $m(x,\theta_0^{*})=m(x)$;
- when $\mathcal M$ is a linear model, so that $M\equiv0$ is the null matrix.
The first matrix is well known and corresponds to the matrix $\mathbb XX^{T}$ in the linear model and for finite samples. Throughout this section we assume that
$$A\text{ is nonsingular}.\tag{7.6.5}$$
Our result provides a representation of $\theta_n$ which is valid under both $H_0:m\in\mathcal M$ and $H_1:m\notin\mathcal M$.

Theorem 7.6.1. Under (7.6.4) and (7.6.5), assume that $\theta_0^{*}$ is unique and lies in the interior of $\Theta$. Then
$$\lim_{n\to\infty}\theta_n=\theta_0^{*}\quad\text{with probability one}$$
and
$$n^{1/2}(\theta_n-\theta_0^{*})=n^{-1/2}A^{-1}\sum_{i=1}^{n}(Y_i-m(X_i,\theta_0^{*}))m'(X_i,\theta_0^{*})+o_{\mathbb P}(1).\tag{7.6.6}$$
The summands on the right-hand side are centered with, under independence of $X_i$ and $\varepsilon_i$, covariance matrix
$$\Sigma=\operatorname{Var}(\varepsilon)\,\mathbb E\bigl[m'(X,\theta_0^{*})[m'(X,\theta_0^{*})]^{T}\bigr]\tag{7.6.7}$$
$$\qquad+\mathbb E\bigl[(m(X)-m(X,\theta_0^{*}))^{2}m'(X,\theta_0^{*})[m'(X,\theta_0^{*})]^{T}\bigr].\tag{7.6.8}$$
The second term vanishes under $H_0$.

Equation (7.6.6) immediately yields the asymptotic (multivariate) normality of $n^{1/2}(\theta_n-\theta_0^{*})$:
$$n^{1/2}(\theta_n-\theta_0^{*})\to N(0,A^{-1}\Sigma A^{-1})\quad\text{in distribution}.$$
Under $H_0$,
$$A^{-1}\Sigma A^{-1}=\operatorname{Var}(\varepsilon)A^{-1},$$
a well known fact.


Proof of Theorem 7.6.1. Since 0 is in the interior set of , we may tacitly
assume that n takes its value there as well. Then we have
n

0=
(Yi m(Xi , n ))m (Xi , n )
i=1

from which we get


n

1/2

(Yi m(Xi ))m (Xi , n )

i=1
1/2

=n

(m(Xi , n ) m(Xi ))m (Xi , n ).

(7.6.9)

i=1

Dene the process


n () = n1/2

(Yi m(Xi ))m (Xi , ).

i=1

Standard arguments show that the n are uniformly continuous in C(),


the space of (vector-valued) continuous functions on , endowed with the
topology of uniform convergence on compacta. Since n 0 , we get
n

1/2
n
(Yi m(Xi ))m (Xi , n ) = n (n ) = n (0 ) + oP (1).
i=1

The right-hand side of (7.6.9) may be expanded into


n

1/2

(m(Xi , n ) m(Xi , 0 ))m (Xi , n )

i=1

+ n

1/2

(m(Xi , 0 ) m(Xi ))m (Xi , n )

i=1

= n1/2 (n 0 )T

m (Xi , n )m (Xi , n )

i=1

+ n1/2

(m(Xi , 0 ) m(Xi ))m (Xi , 0 )

i=1

+ n1/2 (n 0 )T

i=1

(m(Xi , 0 ) m(Xi ))M (Xi , n ),

180

CHAPTER 7. INVARIANCE PRINCIPLES

for appropriate n , n tending to 0 along with n . By the assumed continuity of M as a function of and a uniform version of the SLLN, equation
(7.6.9) may be rewritten as
n (0 ) + oP (1) = n1/2

(m(Xi , 0 ) m(Xi ))m (Xi , 0 )

i=1

+ [A + oP (1)]n1/2 (n 0 ).

Note that because 0 is the minimizer of [m(x) m(x, )]2 F0 (dx) lying in
the interior set of , the variables
[m(Xi , 0 ) m(Xi )]m (Xi , 0 ),

1 i n,

have expectation zero. Conclude that, up to an oP (1) term,


n1/2 (n 0 )
]
n [

1/2 1

Yi m(Xi ) m(Xi , 0 ) + m(Xi ) m (Xi , 0 )


=n
A
i=1

= n1/2 A1

n [

]
Yi m(Xi , 0 ) m (Xi , 0 ),

i=1

a sum of standardized centered independent random variables. For the


computation of , note that i = Yi m(Xi ) is orthogonal to the -eld
generated by Xi . Hence the covariance is the sum of the two covariances
Cov[(Y m(X))m (X, 0 )] and Cov[(m(X, 0 ) m(X))m (X, 0 )]. Finally,
use the independence of and X to come up with (7.6.7).

The result and its proof require some comments:
- The LSE is an example of an estimator which is defined as the minimizer of the data-dependent function
$$n^{-1}\sum_{i=1}^{n}[Y_i-m(X_i,\theta)]^{2}.\tag{7.6.10}$$
Only in very special cases (like the Linear Model) is it possible to get explicit representations of $\theta_n$. Generally this is not possible. To analyze its distributional behavior, the i.i.d. representation provides a powerful tool to determine the limit distribution of $n^{1/2}(\theta_n-\theta_0^{*})$.
- Our discussion exhibits, for the first time, that an assumed model may be incorrectly specified. Therefore $\theta_n$ needs to be studied under both $m\in\mathcal M$ and $m\notin\mathcal M$.
- To derive the i.i.d. representation we use arguments which will appear, in modified form, also in other situations. First, regularity (or smoothness) of the model is important. Taylor's expansion is important for a local linearization at $\theta_0^{*}$.
- This $\theta_0^{*}$ is the minimizer of the squared distance
$$D^{2}(\theta)=\int[m(x)-m(x,\theta)]^{2}F_0(\mathrm dx)$$
between $m$ and $\mathcal M$. In this way $m(\cdot,\theta_0^{*})$ is the projection of $m$ onto $\mathcal M$. The function (7.6.10) is, up to the constant $\mathbb EY^{2}-\int m^{2}(x)F_0(\mathrm dx)$, the empirical analog of $D^{2}(\theta)$.
- A crucial role in our arguments is played by the process $\bar\alpha_n(\theta)$. A continuity argument is used to get
$$\bar\alpha_n(\theta_n)=\bar\alpha_n(\theta_0^{*})+o_{\mathbb P}(1).$$
The process $\bar\alpha_n$ is an example of a Parameter Empirical Process. The summands $(Y_i-m(X_i))m'(X_i,\theta)$ of this process are independent and centered (which is always important!):
$$\mathbb E\bigl[(Y_i-m(X_i))m'(X_i,\theta)\bigr]=\mathbb E[\varepsilon_im'(X_i,\theta)]=\mathbb E\bigl[m'(X_i,\theta)\mathbb E[\varepsilon_i|X_i]\bigr]=0.$$
- The terms $Y_i-m(X_i,\theta_n)$, $1\le i\le n$, are the fitted errors. They are called residuals.


Chapter 8

Empirical Measures: A Dynamic Approach

8.1 The Single-Event Process

In many applied fields researchers are interested in the time $X$ elapsed until a certain event occurs. To give only a few examples, $X$ could be
• the survival time of a patient, i.e., the time from surgery to death
• the disease-free survival time, e.g., the time elapsed from a first chemotherapy until reappearance of metastases
• the appearance of the first customer in a service station
• the time at which a bond defaults
• the breakdown of a technical unit.
Since $X$ very often is a random variable, the associated process
$$S_t = 1_{\{X \le t\}}, \qquad t \in \mathbb R,$$
is a random process jumping from zero to one at $t = X$. As in Chapter 2 we call $(S_t)_t$ the Single-Event Process. This process may be extended in many ways. Nevertheless it constitutes the cornerstone of many other processes appearing in applied fields and statistics. Each $S_t$ is a Bernoulli 0-1 random variable with expectation
$$\mathbb E(S_t) = \mathbb P(X \le t) \equiv F(t),$$


the (unknown) distribution function (d.f.) of $X$. To better understand the following, assume that $X$ is nonnegative, like any lifetime variable, and imagine that we start at $t = 0$. Since we are no prophets we are not able to precisely forecast the value of $X$. It may, however, be the case that at time $t$ we have some information helpful to predict $X$. In the case of a patient, this could be some covariate information about his/her health status. In the simplest case, when there are no covariates, at least the fact that $X > s$ for all $s \le t$ means that the event has not occurred before $t$. The event history of the process is reflected through a $\sigma$-algebra of events which are observable up to $t$, say $\mathcal F_t$, and which may be helpful to predict the future development of the process. In the simplest case
$$\mathcal F_t = \sigma(\{1_{\{X \le s\}} : s \le t\}).$$
Note that $\mathcal F_t$ increases as $t$ increases. The value of $S_t$ is known only at time $t$, but not necessarily before. In technical terminology, this means that $(S_t)_t$ is adapted to the filtration $(\mathcal F_t)_t$.
To discuss the following we first restrict ourselves to finitely many $t$'s, say $0 = t_0 < t_1 < t_2 < \ldots < t_k$. Put $S_i = S_{t_i}$ for short. At time $t_i$, only $S_0, S_1, \ldots, S_i$ are known, but not necessarily $S_{i+1}, S_{i+2}, \ldots, S_k$. The main question we are concerned with now is how to predict future values of $S_j$ given the information at $t = t_i$. Clearly, for a variable $A_{i+1}$ to be a predictor for $S_{i+1}$ at $t_i$ means that this $A_{i+1}$ needs to be known and computable at $t_i$. Summarizing, we have come up with two important concepts, which can and will be discussed also in a context more general than the Single-Event Process and therefore are formulated in full generality.
Definition 8.1.1. Let $(\mathcal F_i)_{0 \le i \le k}$ be some increasing filtration, and let $(S_i)_{0 \le i \le k}$ and $(A_i)_{0 \le i \le k}$ be two sequences of random variables. Then we call
• $(S_i)_i$ adapted to $(\mathcal F_i)_i$ iff $S_i$ is $\mathcal F_i$-measurable
• $(A_i)_i$ predictable w.r.t. $(\mathcal F_i)_i$ iff $A_{i+1}$ is $\mathcal F_i$-measurable.
As we know from our experience in everyday life many quantities are not
predictable and therefore can be predicted only with error. Relevant issues
of this scenario will be discussed in the next section.


8.2 Martingales and the Doob-Meyer Decomposition

We know from probability theory that if $S$ is an unknown (square-integrable) variable and the information is given through events forming a $\sigma$-algebra $\mathcal F$, then the optimal predictor $A$ for $S$ minimizing the (integrated) squared prediction error $\mathbb E[(S - A)^2|\mathcal F]$ equals the conditional expectation $A = \mathbb E(S|\mathcal F)$. Coming back to a sequence of observations $S_0, S_1, S_2, \ldots, S_k$ adapted to $\mathcal F_0 \subset \mathcal F_1 \subset \ldots \subset \mathcal F_k$, one may wish to predict, at each $t_i$, the value of $S_{i+1}$ on the basis of $\mathcal F_i$. A possible candidate for $A_{i+1}$ is $S_i$ itself. This situation describes one of the best studied cases and deserves a special name.
Definition 8.2.1. Let $(S_i)_{0 \le i \le k}$ be a sequence of (integrable) random variables adapted to an increasing filtration $(\mathcal F_i)_{0 \le i \le k}$. Then $(S_i)_i$ is called a Martingale iff
$$\mathbb E(S_{i+1}|\mathcal F_i) = S_i \qquad \mathbb P\text{-almost surely}$$
for all $i = 0, \ldots, k-1$.

Being a martingale thus means that the best predictor of $S_{i+1}$ is just the previous value $S_i$.
Of course it would be very naive to believe that martingales are all around and that the last observation would always be an appropriate (and simple) choice for a predictor. Still, as the next important result will show us, we are at least very close to such a situation.
Lemma 8.2.2 (Doob-Meyer). Let $(S_i)_{0 \le i \le k}$ be adapted to $(\mathcal F_i)_{0 \le i \le k}$. Then, for a given initial value $a$, there is a unique decomposition (almost surely) of $S_i$ into
$$S_i = A_i + M_i, \qquad i = 0, 1, \ldots, k$$
such that
• $(M_i)_i$ is a martingale
• $(A_i)_i$ is predictable
• $A_0 = a$.



Proof. The proof is interesting because it explicitly shows how to compute $M_i$ and $A_i$. Set, by recursion,
$$A_i = \begin{cases} a & \text{for } i = 0 \\ A_{i-1} - S_{i-1} + \mathbb E(S_i|\mathcal F_{i-1}) & \text{for } i \ge 1 \end{cases}$$
$$M_i = \begin{cases} S_0 - a & \text{for } i = 0 \\ M_{i-1} + S_i - \mathbb E(S_i|\mathcal F_{i-1}) & \text{for } i \ge 1 \end{cases}$$
Clearly, by induction,
$$S_i = A_i + M_i, \qquad i = 0, 1, \ldots$$
Also, $(A_i)_i$ is predictable with $A_0 = a$. That $(M_i)_i$ is a martingale is readily checked. To show uniqueness, let $S_i = \tilde A_i + \tilde M_i$ be another such decomposition. Conclude that
$$A_i - \tilde A_i = \tilde M_i - M_i,$$
whence
$$\mathbb E\big[A_i - \tilde A_i|\mathcal F_{i-1}\big] = \mathbb E\big[\tilde M_i - M_i|\mathcal F_{i-1}\big].$$
Since $A_i$ and $\tilde A_i$ are measurable w.r.t. $\mathcal F_{i-1}$ and $M_i$ and $\tilde M_i$ are martingales, the last equation yields
$$A_i - \tilde A_i = \tilde M_{i-1} - M_{i-1} = A_{i-1} - \tilde A_{i-1}.$$
By induction it follows that
$$A_i - \tilde A_i = A_0 - \tilde A_0 = a - a = 0$$
and, finally, $M_i = \tilde M_i$. This concludes the proof of the lemma.

In the literature, the process (Ai )i is called the Compensator, while, from
time to time, (Mi )i is called the Innovation Martingale.
To interpret the Doob-Meyer decomposition, note that a martingale is trend-free. The compensator $A_i$ is already known at time $t_{i-1}$. If the original sequence $(S_i)_i$ includes a trend, this is compensated by $(A_i)_i$.
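The recursion from the proof is easy to run numerically. The following sketch (a hypothetical toy example, not from the text) applies it to a Gaussian random walk with known drift $\mu$, where $\mathbb E(S_i|\mathcal F_{i-1}) = S_{i-1} + \mu$ is available in closed form; the compensator then comes out as the deterministic trend $i\mu$, absorbed out of the noise:

```python
import random

random.seed(0)
mu = 0.3          # known drift of each increment (assumption of this toy model)
k = 50

# simulate an adapted sequence: S_i = partial sums of steps with mean mu
steps = [random.gauss(mu, 1.0) for _ in range(k)]
S = [0.0]
for x in steps:
    S.append(S[-1] + x)

# Doob-Meyer recursion with a = S_0 = 0; here E(S_i | F_{i-1}) = S_{i-1} + mu
a = 0.0
A = [a]
M = [S[0] - a]
for i in range(1, k + 1):
    cond_exp = S[i - 1] + mu                 # conditional expectation given the past
    A.append(A[-1] - S[i - 1] + cond_exp)    # predictable compensator
    M.append(M[-1] + S[i] - cond_exp)        # innovation martingale

# the decomposition reproduces S, and the compensator is exactly the trend i*mu
assert all(abs(S[i] - (A[i] + M[i])) < 1e-9 for i in range(k + 1))
assert all(abs(A[i] - i * mu) < 1e-9 for i in range(k + 1))
```

The two assertions hold pathwise, for every realization: the recursion is pure algebra once the conditional expectations are supplied.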
We would also like to mention a well-known decomposition of an output random variable $Y$ in terms of an input $X$ and noise $\varepsilon$:
$$Y = m(X) + \varepsilon. \qquad (8.2.1)$$
Here, $m(X) = \mathbb E(Y|X)$ and $m$ is the regression function of $Y$ w.r.t. $X$. The variable $\varepsilon$ is the so-called noise variable and satisfies $\mathbb E(\varepsilon|X) = 0$ $\mathbb P$-almost surely. Also (8.2.1) may be interpreted as a decomposition into a predictable part, namely $m(X)$, and some trend-free component $\varepsilon$. The difference between (8.2.1) and the Doob-Meyer decomposition is that in (8.2.1) only one pair $(X, Y)$ of random variables is considered, while in Lemma 8.2.2 we consider a sequence of random variables over time. At the same time the $\sigma$-algebras $\mathcal F_i$ containing the information at time $t_i$ may be quite arbitrary. Of course, the decomposition heavily depends on the choice of the $\mathcal F_i$ and will change when we replace the filtration by another one.

8.3 The Doob-Meyer Decomposition of the Single-Event Process

For a better understanding, we first consider $S_t = 1_{\{X \le t\}}$ at finitely many $t_0 \le t_1 \le \ldots \le t_k$, with
$$\mathcal F_i = \sigma\big(\{1_{\{X \le t_j\}} : 0 \le j \le i\}\big).$$
Put $S_i \equiv S_{t_i}$, and recall $F$, the d.f. of $X$.

Lemma 8.3.1. For $i = 1, \ldots, k$, we have
$$\mathbb E[S_i|\mathcal F_{i-1}] = 1_{\{X \le t_{i-1}\}} + 1_{\{X > t_{i-1}\}}\,\frac{F(t_i) - F(t_{i-1})}{1 - F(t_{i-1})}.$$
Proof. Write
$$S_i = 1_{\{X \le t_i\}} = 1_{\{X \le t_i\}} 1_{\{X \le t_{i-1}\}} + 1_{\{X \le t_i\}} 1_{\{X > t_{i-1}\}} = 1_{\{X \le t_{i-1}\}} + 1_{\{t_{i-1} < X \le t_i\}}.$$
The first summand is measurable w.r.t. $\mathcal F_{i-1}$. For the second summand we get
$$\mathbb E\big[1_{\{t_{i-1} < X \le t_i\}}\big|\mathcal F_{i-1}\big] = 1_{\{X > t_{i-1}\}}\,\frac{\int_{\{t_{i-1} < X\}} 1_{\{t_{i-1} < X \le t_i\}}\,d\mathbb P}{\mathbb P(t_{i-1} < X)} = 1_{\{X > t_{i-1}\}}\,\frac{F(t_i) - F(t_{i-1})}{1 - F(t_{i-1})}.$$
The conclusion follows.
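On the event $\{X > t_{i-1}\}$ the lemma predicts the jump with probability $(F(t_i)-F(t_{i-1}))/(1-F(t_{i-1}))$. A quick Monte Carlo check of this conditional probability for a uniform $X$ (an illustrative sketch, not part of the text; the tolerance covers sampling error):

```python
import random

random.seed(4)
t0, t1, N = 0.3, 0.5, 50000          # grid points t_{i-1} < t_i, X ~ U[0,1]
xs = [random.random() for _ in range(N)]
tail = [x for x in xs if x > t0]     # condition on the event {X > t_{i-1}}
cond = sum(x <= t1 for x in tail) / len(tail)
target = (t1 - t0) / (1.0 - t0)      # (F(t_i)-F(t_{i-1}))/(1-F(t_{i-1}))
assert abs(cond - target) < 0.02
```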

Now recall the proof of the Doob-Meyer decomposition. Put $t_0 = -\infty$ so that $S_0 \equiv 0$. Let $a = 0$. The martingale part of the Single-Event Process then satisfies the recursion
$$M_i = M_{i-1} + S_i - \mathbb E(S_i|\mathcal F_{i-1}) = M_{i-1} + 1_{\{t_{i-1} < X \le t_i\}} - 1_{\{X > t_{i-1}\}}\,\frac{F(t_i) - F(t_{i-1})}{1 - F(t_{i-1})}.$$
Summation from $i = 1$ to $i = k$ leads to
$$M_k = 1_{\{X \le t_k\}} - \sum_{i=1}^k 1_{\{X > t_{i-1}\}}\,\frac{F(t_i) - F(t_{i-1})}{1 - F(t_{i-1})}.$$
In other words, the compensator equals
$$A_k = \sum_{i=1}^k 1_{\{X > t_{i-1}\}}\,\frac{F(t_i) - F(t_{i-1})}{1 - F(t_{i-1})}.$$
From this we may get an idea of how $M$ and $A$ look when all processes are considered in continuous time. Just let the grid $\{t_i\}$ of points get finer and finer. If we keep $t_k = t$ fixed, then $A_k$ converges to
$$A_t = \int_{(-\infty,t]} \frac{1_{\{x \le X\}}}{1 - F(x-)}\,F(dx).$$
Hence we obtain the following result.


Theorem 8.3.2. Let $X$ be a real random variable with d.f. $F$. Put $S_t = 1_{\{X \le t\}}$ and set $\mathcal F_t = \sigma(\{1_{\{X \le s\}} : s \le t\})$. Then the innovation martingale of $(S_t)_t$ equals
$$M_t = 1_{\{X \le t\}} - \int_{(-\infty,t]} \frac{1_{\{x \le X\}}}{1 - F(x-)}\,F(dx).$$

Remark 8.3.3. Note that $(S_t)_t$ and $(A_t)_t$ are nondecreasing. The martingale $(M_t)_t$ is trend-free and satisfies
$$\mathbb E(M_t) = 0 \qquad \text{for all } t \in \mathbb R.$$



Next, it will be convenient to introduce a new measure:
$$d\Lambda = \frac{dF}{1 - F_-},$$
i.e.,
$$\Lambda(t) = \int_{(-\infty,t]} \frac{F(dx)}{1 - F(x-)}.$$
As in Chapter 1 the function $\Lambda$ is called the Cumulative Hazard Function of $F$. Also recall that, if $F$ is continuous, then
$$\Lambda(t) = \int_{(-\infty,t]} \frac{F(dx)}{1 - F(x)} = -\ln(1 - F(t)),$$
whence
$$\bar F(t) \equiv 1 - F(t) = \exp(-\Lambda(t)). \qquad (8.3.1)$$
The function $\bar F$ is called the Survival Function.


For a continuous $F$, we have $\Lambda(t) \to \infty$ as $t \to \infty$. Hence, in this case, $\Lambda$ is not finite. Note that when $F$ is discrete with a finite number of atoms, the associated $\Lambda$ is finite. If $F$ has a density $f$, then
$$d\Lambda = \frac{f}{1 - F}\,dx.$$
The function
$$\lambda(x) = \frac{f(x)}{1 - F(x)}, \qquad x \in \mathbb R,$$
is called the Hazard Function. The compensator becomes
$$A_t = \int_{(-\infty,t]} 1_{\{x \le X\}}\,\lambda(x)\,dx.$$
Hence $A$ has a Lebesgue density, which equals $\lambda(x)$ on $x \le X$ and vanishes afterwards. In most textbooks on Survival Analysis, the function $\lambda(x)$ is introduced as
$$\lambda(x) = \lim_{\Delta x \downarrow 0} \frac{\mathbb P(x < X \le x + \Delta x \mid x < X)}{\Delta x}.$$
Hence, for small $\Delta x > 0$,
$$\mathbb P(x < X \le x + \Delta x \mid x < X) \approx \lambda(x)\,\Delta x.$$



The quantity $\lambda(x)$ therefore describes the current speed or intensity at which, e.g., a patient approaches death, given that he/she has survived $x$. Therefore $\lambda(x)$ is an important parameter to describe the current risk status of the patient. Since after the occurrence of $X$ the Single-Event Process does not allow for more events, it is only natural that in the compensator the density jumps down to zero immediately after $X$.

Example 8.3.4. Assume that $X$ has a uniform distribution on $[0,1]$ so that
$$\lambda(x) = \frac{1}{1 - x} \qquad \text{on } 0 < x < 1.$$
Given that $X$ has exceeded $x$, the intensity for a default in the near future increases to infinity at a hyperbolic rate.
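For this example the identity $\Lambda(t) = -\ln(1 - F(t))$ from (8.3.1) can be checked numerically (a small sketch; the midpoint rule is just one convenient quadrature, and the tolerance reflects its discretization error):

```python
import math

# hazard of U[0,1]: f(x) = 1, F(x) = x on (0,1), so lambda(x) = 1/(1-x)
def hazard(x):
    return 1.0 / (1.0 - x)

def cum_hazard(t, n=200000):
    # midpoint Riemann sum of the hazard over (0, t]
    h = t / n
    return sum(hazard((i + 0.5) * h) for i in range(n)) * h

t = 0.9
# cumulative hazard should equal -ln(1 - t)
assert abs(cum_hazard(t) - (-math.log(1.0 - t))) < 1e-4
```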

8.4 The Empirical Distribution Function

In this section we extend the previous results to the empirical distribution function
$$F_n(t) = \frac1n \sum_{i=1}^n 1_{\{X_i \le t\}}, \qquad t \in \mathbb R.$$

Here, $X_1, \ldots, X_n, \ldots$ are independent identically distributed (i.i.d.) random variables from the same unknown d.f. $F$. The filtration now becomes, for sample size $n$,
$$\mathcal F_t = \sigma\big(\{1_{\{X_i \le s\}} : s \le t,\ 1 \le i \le n\}\big).$$
By independence of the $X_i$'s, the compensator and the martingale part of $F_n$ are just the sums of the compensators and innovation martingales of the single indicators. Hence we get the following result.
Theorem 8.4.1. The Doob-Meyer decomposition of $F_n$ equals
$$F_n(t) = M_n(t) + A_n(t),$$
where
$$M_n(t) = F_n(t) - \int_{(-\infty,t]} \frac{1 - F_n(x-)}{1 - F(x-)}\,F(dx). \qquad (8.4.1)$$

Since $M_n$ is a centered process, the covariance of $M_n$ becomes, for $s \le t$:
$$\mathrm{Cov}(M_n(s), M_n(t)) = \mathbb E[M_n(s)M_n(t)] = \mathbb E\{\mathbb E[M_n(s)M_n(t)|\mathcal F_s]\} = \mathbb E\{M_n(s)\,\mathbb E[M_n(t)|\mathcal F_s]\} = \mathbb E\{M_n^2(s)\} \equiv \frac1n\,\gamma(s).$$


The function $\gamma$ is computed in the next lemma.

Lemma 8.4.2. For $s \le t$, we have
$$\mathrm{Cov}(M_n(s), M_n(t)) = \frac1n\,\gamma(s)$$
with
$$\gamma(s) = F(s) - \int_{(-\infty,s]} \frac{F\{x\}}{1 - F(x-)}\,F(dx).$$
Here $F\{x\} = F(x) - F(x-)$ is the point mass at $x$. If $F$ is continuous, then $F\{x\} = 0$ and
$$\gamma(s) = F(s) \qquad \text{for all } s.$$
Proof. Since $\gamma(s) = \mathbb E M_1^2(s)$, we obtain
$$\gamma(s) = \mathbb E\Big[1_{\{X \le s\}} - \int_{(-\infty,s]} \frac{1_{\{x \le X\}}}{1 - F(x-)}\,F(dx)\Big]^2$$
$$= \mathbb E\, 1_{\{X \le s\}}^2 - 2\,\mathbb E\Big[1_{\{X \le s\}} \int_{(-\infty,s]} \frac{1_{\{x \le X\}}}{1 - F(x-)}\,F(dx)\Big] + \mathbb E\Big[\int_{(-\infty,s]} \frac{1_{\{x \le X\}}}{1 - F(x-)}\,F(dx)\Big]^2.$$
The second expectation equals
$$\mathbb E\Big[1_{\{X \le s\}} \int_{(-\infty,s]} \frac{1_{\{x \le X\}}}{1 - F(x-)}\,F(dx)\Big] = \int_{(-\infty,s]} \frac{\mathbb E\, 1_{\{x \le X \le s\}}}{1 - F(x-)}\,F(dx) = \int_{(-\infty,s]} \frac{F(s) - F(x-)}{1 - F(x-)}\,F(dx).$$
Finally,
$$\Big[\int_{(-\infty,s]} \frac{1_{\{x \le X\}}}{1 - F(x-)}\,F(dx)\Big]^2 = \int_{(-\infty,s]}\int_{(-\infty,s]} \frac{1_{\{x \le X\}}\,1_{\{y \le X\}}}{(1 - F(x-))(1 - F(y-))}\,F(dx)\,F(dy).$$
By Fubini, we obtain
$$\mathbb E[\ldots] = \int_{(-\infty,s]}\int_{(-\infty,s]} \frac{1 - F(\max(x,y)-)}{(1 - F(x-))(1 - F(y-))}\,F(dx)\,F(dy).$$
The double integral is split into two pieces: $y \le x \le s$ and $x < y \le s$. This leads to
$$\int\!\!\int 1_{\{y \le x\}}\,\frac{1 - F(x-)}{(1 - F(x-))(1 - F(y-))}\,F(dx)\,F(dy) = \int\!\!\int 1_{\{y \le x\}}\,\frac{1}{1 - F(y-)}\,F(dx)\,F(dy) = \int_{(-\infty,s]} \frac{F(s) - F(y-)}{1 - F(y-)}\,F(dy)$$
and
$$\int\!\!\int 1_{\{x < y \le s\}}\,\frac{1 - F(y-)}{(1 - F(x-))(1 - F(y-))}\,F(dx)\,F(dy) = \int\!\!\int 1_{\{x < y \le s\}}\,\frac{1}{1 - F(x-)}\,F(dy)\,F(dx) = \int_{(-\infty,s]} \frac{F(s) - F(x)}{1 - F(x-)}\,F(dx).$$
Summation of all terms yields the conclusion.
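Both the martingale property $\mathbb E M_n(t) = 0$ and the variance formula $\gamma(t) = F(t)$ for continuous $F$ from Lemma 8.4.2 can be checked by simulation. For a single $U[0,1]$ observation the integral in (8.4.1) is available in closed form, $M_1(t) = 1_{\{X \le t\}} + \ln(1 - \min(X,t))$ (an illustrative sketch; the tolerances account for Monte Carlo error):

```python
import math, random

random.seed(1)

def M1(x, t):
    # M_1(t) = 1_{X<=t} - ∫_0^{min(X,t)} ds/(1-s) = 1_{X<=t} + ln(1 - min(X,t))
    return (1.0 if x <= t else 0.0) + math.log(1.0 - min(x, t))

t, N = 0.5, 20000
vals = [M1(random.random(), t) for _ in range(N)]
mean = sum(vals) / N
var = sum(v * v for v in vals) / N - mean * mean
assert abs(mean) < 0.05       # martingale property: E M_1(t) = 0
assert abs(var - t) < 0.05    # Lemma 8.4.2: gamma(t) = F(t) = t for U[0,1]
```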

Remark 8.4.3. If $F$ has discontinuities, $\gamma = F$ is violated, but $\gamma \le F$ is always valid. In particular, $\gamma$ is bounded by 1. We could rewrite $\gamma$ to obtain
$$\gamma(s) = \int_{(-\infty,s]} \Big[1 - \frac{F\{x\}}{1 - F(x-)}\Big]\,F(dx) = \int_{(-\infty,s]} [1 - \Lambda\{x\}]\,F(dx).$$
Since
$$F\{x\} = \mathbb P(X = x) \le \mathbb P(X \ge x) = 1 - F(x-),$$
we get $\Lambda\{x\} \le 1$. Conclude that $\gamma$ is nondecreasing.

As a final comment, if $F$ is continuous, then
$$M_n(x) = \tilde M_n(F(x))$$
with
$$\tilde M_n(t) = \tilde F_n(t) - \int_{[0,t]} \frac{1 - \tilde F_n(u-)}{1 - u}\,du, \qquad 0 \le t \le 1,$$
where $\tilde F_n$ is the empirical d.f. of a uniform sample $U_1, \ldots, U_n$.


It is useful to write (8.4.1) in terms of differentials:
$$dM_n = dF_n - \frac{1 - F_{n-}}{1 - F_-}\,dF. \qquad (8.4.2)$$
If we replace the indicator $1_{\{x \le t\}}$ by an arbitrary function $\varphi$, then (8.4.2) implies
$$\int \varphi\,dM_n = \int \varphi\,dF_n - \int \varphi\,\frac{1 - F_{n-}}{1 - F_-}\,dF = \int \varphi\,d(F_n - F) + \int \varphi\,\frac{F_{n-} - F_-}{1 - F_-}\,dF.$$

8.5 The Predictable Quadratic Variation

Mean and variance are two basic concepts in elementary probability theory and statistics. In a dynamic framework we have seen so far that means should be updated at each time $t$ so as to incorporate the currently available information. This led to the Doob-Meyer decomposition, in which the martingale part could be viewed as a noise process and the mean, i.e., predictable part, is captured by the compensator. Extending these ideas to the variance, we need to also include expected variations over time.

Definition 8.5.1. Let $S_0, S_1, \ldots, S_k$ be a finite sequence of square-integrable random variables adapted to the filtration $\mathcal F_0 = \{\emptyset, \Omega\} \subset \mathcal F_1 \subset \ldots \subset \mathcal F_k$. Then we call
$$\langle S \rangle_k \equiv \langle S, S \rangle_k = S_0^2 + \sum_{i=1}^k \mathbb E\{(\Delta S_i)^2|\mathcal F_{i-1}\}$$
the Predictable Quadratic Variation (at time $k$). Here
$$\Delta S_i = S_i - S_{i-1}$$
is the $i$-th increment.
The process $\langle S \rangle_k$ is important when one studies squared martingales.



Lemma 8.5.2. Let $M_0, M_1, \ldots, M_k$ be a square-integrable martingale. Then
$$M_i^2 - \langle M \rangle_i, \qquad i = 0, 1, \ldots, k$$
is also a martingale. In other words, $\langle M \rangle_i$ is the compensator of $M_i^2$, $i = 0, 1, \ldots, k$.
Proof. We have
$$\mathbb E[M_i^2 - \langle M \rangle_i|\mathcal F_{i-1}] = \mathbb E[M_i^2|\mathcal F_{i-1}] - \langle M \rangle_i = \mathbb E[(\Delta M_i)^2|\mathcal F_{i-1}] + M_{i-1}^2 - \langle M \rangle_i = M_{i-1}^2 - \langle M \rangle_{i-1}.$$


Next we compute the predictable quadratic variation of the martingale
$$M_k = 1_{\{X \le t_k\}} - \sum_{i=1}^k 1_{\{X > t_{i-1}\}}\,\frac{F(t_i) - F(t_{i-1})}{1 - F(t_{i-1})}.$$
First,
$$\Delta M_k = M_k - M_{k-1} = 1_{\{t_{k-1} < X \le t_k\}} - 1_{\{X > t_{k-1}\}}\,\frac{F(t_k) - F(t_{k-1})}{1 - F(t_{k-1})},$$
and therefore
$$(\Delta M_k)^2 = 1_{\{t_{k-1} < X \le t_k\}} + 1_{\{X > t_{k-1}\}}\,\frac{[F(t_k) - F(t_{k-1})]^2}{[1 - F(t_{k-1})]^2} - 2 \cdot 1_{\{t_{k-1} < X \le t_k\}}\,\frac{F(t_k) - F(t_{k-1})}{1 - F(t_{k-1})}.$$
Conclude that
$$\mathbb E[(\Delta M_k)^2|\mathcal F_{k-1}] = 1_{\{t_{k-1} < X\}}\,\frac{F(t_k) - F(t_{k-1})}{1 - F(t_{k-1})} + 1_{\{t_{k-1} < X\}}\,\frac{[F(t_k) - F(t_{k-1})]^2}{[1 - F(t_{k-1})]^2} - 2 \cdot 1_{\{t_{k-1} < X\}}\,\frac{[F(t_k) - F(t_{k-1})]^2}{[1 - F(t_{k-1})]^2},$$
whence, with $t_0 = -\infty$,
$$\langle M \rangle_k = \sum_{i=1}^k 1_{\{t_{i-1} < X\}}\,\frac{[F(t_i) - F(t_{i-1})]\,(1 - F(t_i))}{[1 - F(t_{i-1})]^2}.$$
Going to the limit, we obtain $\langle M \rangle_t$ in continuous time.

Lemma 8.5.3.
$$\langle M \rangle_t = \int_{(-\infty,t]} \frac{1_{\{x \le X\}}\,(1 - F(x))}{[1 - F(x-)]^2}\,F(dx).$$
For sample size $n \ge 1$, the martingale $M_n$ from (8.4.1) has the predictable quadratic variation
$$\langle M_n \rangle_t = \frac1n \int_{(-\infty,t]} \frac{(1 - F_n(x-))(1 - F(x))}{(1 - F(x-))^2}\,F(dx).$$
Since $M_n^2 - \langle M_n \rangle$ is a centered martingale, it follows that
$$\mathbb E M_n^2(t) = \mathbb E\langle M_n \rangle_t = \frac1n\,\gamma(t) = \frac1n \int_{(-\infty,t]} \frac{1 - F(x)}{1 - F(x-)}\,dF.$$

8.6 Some Stochastic Differential Equations

From (8.4.2) we obtain
$$dM_n = d(F_n - F) + \frac{F_{n-} - F_-}{1 - F_-}\,dF. \qquad (8.6.1)$$
This is a differential equation in $F_n - F$. If we multiply both sides with $n^{1/2}$ and introduce the standardized processes
$$\bar M_n = n^{1/2} M_n \qquad \text{and} \qquad \alpha_n = n^{1/2}(F_n - F),$$
then, if $F$ is the uniform distribution on $[0,1]$,
$$n^{1/2} M_n \to B \qquad \text{and} \qquad \alpha_n \to B^0.$$
Here, $B$ and $B^0$ denote a Brownian Motion and a Brownian Bridge, respectively. This follows from Donsker's invariance principle for the empirical process $\alpha_n$. In the limit we obtain the differential equation
$$dB = dB^0 + \frac{B_t^0}{1 - t}\,dt.$$
Rewriting the last equation we get
$$dB^0 = -\frac{B_t^0}{1 - t}\,dt + dB.$$



This equation admits the solution
$$B^0(t) = (1 - t) \int_{[0,t]} \frac{1}{1 - s}\,B(ds), \qquad 0 \le t \le 1.$$
Going back to finite sample size, the following result therefore is not surprising.

Theorem 8.6.1. Under $F = U[0,1]$,
$$F_n(t) - t = (1 - t) \int_{[0,t]} \frac{M_n(ds)}{1 - s}.$$

Proof. From (8.4.2), the right-hand side equals
$$(1 - t)\Big[\int_{[0,t]} \frac{F_n(ds)}{1 - s} - \int_{[0,t]} \frac{1 - F_n(s-)}{(1 - s)^2}\,ds\Big]. \qquad (8.6.2)$$
It suffices to deal with $n = 1$. The first integral then becomes
$$\frac{1_{\{X_1 \le t\}}}{1 - X_1},$$
while the second equals
$$\int_{[0,t]} \frac{1_{\{X_1 \ge s\}}}{(1 - s)^2}\,ds = \Big[\frac{1}{1 - s}\Big]_0^{t \wedge X_1} = \frac{1}{1 - t \wedge X_1} - 1.$$
We have to consider two cases. When $X_1 \le t$, (8.6.2) equals
$$(1 - t)\Big[\frac{1}{1 - X_1} - \frac{1}{1 - X_1} + 1\Big] = 1 - t = 1_{\{X_1 \le t\}} - t.$$
If $X_1 > t$, we obtain
$$(1 - t)\Big[0 - \frac{1}{1 - t} + 1\Big] = -1 + 1 - t = 1_{\{X_1 \le t\}} - t.$$
This concludes the proof of the Theorem.
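Theorem 8.6.1 is an exact pathwise identity and can be verified on any finite sample. In the sketch below (illustrative, not from the text) the jump part of $M_n(ds)/(1-s)$ is summed directly, while the $ds$-part is integrated in closed form between consecutive order statistics, where $1 - F_n$ is constant:

```python
import random

random.seed(2)
n, t = 8, 0.6
X = sorted(random.random() for _ in range(n))

Fn_t = sum(x <= t for x in X) / n

# jump part of ∫_[0,t] M_n(ds)/(1-s): each atom of F_n contributes (1/n)/(1-X_i)
jump = sum((1.0 / n) / (1.0 - x) for x in X if x <= t)

# continuous part: -∫_0^t (1 - F_n(s))/(1-s)^2 ds, with 1 - F_n piecewise constant
knots = [0.0] + [x for x in X if x <= t] + [t]
cont = 0.0
for a, b in zip(knots, knots[1:]):
    surv = 1.0 - sum(x <= a for x in X) / n          # value of 1 - F_n on (a, b)
    cont -= surv * (1.0 / (1.0 - b) - 1.0 / (1.0 - a))

lhs = Fn_t - t
rhs = (1.0 - t) * (jump + cont)
assert abs(lhs - rhs) < 1e-9   # exact identity, up to floating-point error
```

The assertion holds for every sample, since the theorem is deterministic given the data.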

Since $M_n$ is a martingale and the integrand is a deterministic function, the integral is also a martingale.

Corollary 8.6.2. The process
$$\frac{F_n(t) - t}{1 - t} = \int_{[0,t]} \frac{M_n(ds)}{1 - s}, \qquad 0 \le t < 1,$$
is a martingale under $F = U[0,1]$.

8.7 Stochastic Exponentials

Stochastic Exponentials were briefly discussed in Section 1.10. In this section we study some extensions and apply them to the Empirical Process. The following result is an extension of Theorem 1.10.5.
Lemma 8.7.1 (Gill). Let $A$ and $B$ be two distribution functions (not necessarily finite or with total mass 1), such that
$$A\{x\} \le 1 \quad \text{and} \quad B\{x\} < 1 \qquad \text{for all } x \in \mathbb R.$$
Then the equation
$$Z_t = \int_{(-\infty,t]} \frac{1 - Z(s-)}{1 - B\{s\}}\,[A(ds) - B(ds)]$$
admits a unique solution, namely
$$Z(t) = 1 - \frac{\prod_{s \le t}(1 - A\{s\})\,\exp(-A^c(t))}{\prod_{s \le t}(1 - B\{s\})\,\exp(-B^c(t))}.$$
Proof. The function $1 - Z_t$ is the exponential of the process $(C_t)_t$ given by
$$dC_t = \frac{dB_t - dA_t}{1 - B\{t\}}.$$
As usual, we decompose $A$ and $B$ into their continuous and discrete parts:
$$A = A^c + A^d, \qquad B = B^c + B^d.$$
If we consider the exponential of $C$ in discrete time, i.e., over a finite grid of points, we have with (1.10.2)
$$\mathcal E(C)_n = \prod_{i=1}^n \Big[1 + \frac{\Delta B_i^c + \Delta B_i^d - \Delta A_i^c - \Delta A_i^d}{1 - B\{t_i\}}\Big] = \exp\Big\{\sum_{i=1}^n \ln\Big[1 + \frac{\Delta B_i^c + \Delta B_i^d - \Delta A_i^c - \Delta A_i^d}{1 - B\{t_i\}}\Big]\Big\}$$
$$= \exp\Big\{\sum_{i=1}^n \ln\Big[1 + \frac{B\{t_i\} - A\{t_i\}}{1 - B\{t_i\}}\Big] + \sum_{i=1}^n \frac{\Delta B_i^c - \Delta A_i^c}{1 - B\{t_i\}}\Big(1 + \frac{B\{t_i\} - A\{t_i\}}{1 - B\{t_i\}}\Big)^{-1} + o(1)\Big\}$$
$$\to \prod_{s \le t} \Big[1 + \frac{B\{s\} - A\{s\}}{1 - B\{s\}}\Big]\,\exp[B_t^c - A_t^c].$$
Since $1 + (B\{s\} - A\{s\})/(1 - B\{s\}) = (1 - A\{s\})/(1 - B\{s\})$, from this the conclusion follows.
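In the purely discrete case the lemma can be checked by direct recursion: over atoms $s_1 < s_2 < \ldots$ the equation reads $Z_{s_k} = Z_{s_{k-1}} + (1 - Z_{s_{k-1}})(A\{s_k\} - B\{s_k\})/(1 - B\{s_k\})$, and the closed-form solution reduces to a ratio of finite products (an illustrative sketch with hypothetical atom masses):

```python
# atom masses A{s} and B{s} at three common grid points (hypothetical values)
atoms_A = [0.1, 0.3, 0.2]
atoms_B = [0.05, 0.1, 0.4]

# solve Z_t = sum_{s<=t} (1 - Z(s-)) (A{s} - B{s})/(1 - B{s}) by recursion
Z = 0.0
for a, b in zip(atoms_A, atoms_B):
    Z = Z + (1.0 - Z) * (a - b) / (1.0 - b)

# closed form: 1 - Z(t) = prod(1 - A{s}) / prod(1 - B{s}) (no continuous parts)
prodA = prodB = 1.0
for a, b in zip(atoms_A, atoms_B):
    prodA *= 1.0 - a
    prodB *= 1.0 - b
assert abs((1.0 - Z) - prodA / prodB) < 1e-12
```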


In Corollary 8.6.2 we have seen that the process
$$\frac{F_n(t) - t}{1 - t} = \int_{[0,t]} \frac{M_n(ds)}{1 - s}$$
is a martingale if $F = U[0,1]$. Now we study the non-centered process
$$\xi_n(t) = \frac{1 - F_n(t)}{1 - t}, \qquad 0 \le t < 1.$$
Note that the expectation of $\xi_n(t)$ equals one. Now,
$$\xi_n(t) = \frac{1 - F_n(t)}{1 - t} = 1 - \frac{F_n(t) - t}{1 - t} = 1 - \int_{[0,t]} \frac{M_n(ds)}{1 - s} = 1 - \int_{[0,t]} \frac{1 - F_n(s-)}{1 - s}\,\frac{M_n(ds)}{1 - F_n(s-)}$$
$$= 1 + \int_{[0,t]} \xi_n(s-)\,M_n^0(ds), \qquad (8.7.1)$$
where
$$M_n^0(ds) = -\frac{M_n(ds)}{1 - F_n(s-)}.$$

Note that this is legitimate only for $s \le X_{n:n}$. To be correct we need to redefine $M_n^0$ so as to become
$$M_n^0(ds) = -\frac{1_{\{s \le X_{n:n}\}}}{1 - F_n(s-)}\,M_n(ds). \qquad (8.7.2)$$
For $t \le X_{n:n}$, (8.7.1) remains the same. For $t > X_{n:n}$ we have $\xi_n(t) = 0$. Hence the right-hand side of (8.7.1) becomes
$$1 - \int_0^{X_{n:n}} \frac{M_n(ds)}{1 - s} = 1 - \frac{F_n(X_{n:n}) - X_{n:n}}{1 - X_{n:n}} = 0.$$

Corollary 8.7.2. The process
$$\xi_n(t) = \frac{1 - F_n(t)}{1 - t}, \qquad 0 \le t < 1,$$
is the Stochastic Exponential of the martingale $M_n^0$ defined in (8.7.2).


Remark 8.7.3. Because of (8.4.2) we have, on $s \le X_{n:n}$,
$$dM_n^0 = -\frac{dM_n}{1 - F_{n-}} = -\frac{dF_n}{1 - F_{n-}} + \frac{dF}{1 - F_-}.$$
Hence (8.7.1) may also be written, for $t \le X_{n:n}$, in the form
$$\xi_n(t) = 1 - \int_{[0,t]} \xi_n(s-)\,(\Lambda_n(ds) - \Lambda(ds)).$$
Since $\xi_n$ is an integral of a deterministic function w.r.t. the martingale $M_n$, it is also a martingale. Since the index set $[0,1)$ is open on the right and $\xi_n(1) = 0$, $\xi_n$ cannot be extended to become a martingale on the closed set $[0,1]$. Therefore, to obtain maximal bounds for $\xi_n$, we first have to restrict ourselves to compact subintervals of $[0,1)$. For example, Doob's maximal inequality yields, for any $c > 0$:
$$\mathbb P\Big(\sup_{0 \le s \le t} \frac{1 - F_n(s)}{1 - s} > c\Big) \le \frac{\mathbb E\,\xi_n(t)}{c} = \frac1c.$$
If we let $t$ tend to one, then by $\sigma$-continuity
$$\mathbb P\Big(\sup_{0 \le s < 1} \frac{1 - F_n(s)}{1 - s} > c\Big) \le \frac1c.$$
If, for a given $\varepsilon > 0$, we choose $c > 0$ so large that $1/c \le \varepsilon$, then, outside an event with probability less than $\varepsilon$, we have
$$1 - F_n(s) \le c(1 - s) \qquad (8.7.3)$$
for all $0 \le s < 1$. Obviously, this inequality also holds true for $s = 1$. Inequality (8.7.3) provides us with a so-called linear upper bound for $1 - F_n$. Such bounds are very useful in proofs if one needs to bound $1 - F_n$ from above.
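The maximal bound can be probed by simulation. The supremum of $(1 - F_n(s))/(1 - s)$ over $[0,1)$ is approached just below an order statistic, where $1 - F_n$ is still large while $1 - s$ has already shrunk (an illustrative sketch; the tolerance only allows for Monte Carlo error around the bound $1/c$):

```python
import random

random.seed(3)

def sup_ratio(n):
    # sup_{0<=s<1} (1 - F_n(s))/(1 - s), approached just below each order statistic
    xs = sorted(random.random() for _ in range(n))
    return max((n - i) / (n * (1.0 - x)) for i, x in enumerate(xs))

n, c, reps = 20, 4.0, 4000
freq = sum(sup_ratio(n) > c for _ in range(reps)) / reps
assert freq <= 1.0 / c + 0.03   # Doob's bound P(sup > c) <= 1/c, up to MC error
```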


Chapter 9

Introduction to Survival Analysis

9.1 Right Censorship: The Kaplan-Meier Estimator

In Survival Analysis one is interested in phenomena evolving over time. E.g., one may focus on the vital health status of a patient since surgery. Information about the status may be given through covariates known at the beginning of the observational period and which may be updated from time to time. To give another example, in a financial setting, investors holding corporate bonds may wish to know more about the company. An updated rating may then be helpful to assess the actual risk status of the company as to possible defaults before or at maturity of the bonds.

In the following we let $Y$ denote the random time elapsed from the beginning of the study (surgery, investment) until default. We are interested in the survival function
$$S(y) = \bar F(y) = 1 - F(y) = \mathbb P(Y > y).$$
For example, in a medical setting, one may wish to know the probability that the disease-free survival time exceeds $y = 5$ years. A typical feature of such data is that, due to follow-up losses or early end of the study, the information is incomplete. This means that for a sample of size $n$ not all relevant $Y_1, \ldots, Y_n$ are available. Rather, in some cases, some surrogates are known which contribute less information than the full sample. Consequently, the empirical d.f. $F_n$ of $Y_1, \ldots, Y_n$ cannot be computed and therefore cannot

serve as an estimator of $F$. In the following sections we discuss several such situations and show how the statistical analysis needs to be properly adapted. In this section we briefly discuss the best studied case of incomplete data, namely Right Censorship. A typical example is shown in the following figure.

[Figure: staggered entries of five patients between the start and the end of the study. Patients 1 and 2 die after times $Y_1$ and $Y_2$; patients 3 and 5 are still under observation at the end of the study (censoring times $C_3$, $C_5$); patient 4 drops out early (censoring time $C_4$). Horizontal axis: time.]

We see that entries into the study are staggered. Patients 1 and 2 died after staying $Y_1$ resp. $Y_2$ time units in the group, while patients 3 and 5 were still alive at the end. Finally, patient 4 dropped out due to follow-up losses. In the last three cases, rather than $Y_i$, we observe $C_i$, the time spent under observation until follow-up loss or the end of the study. Summarizing, rather than $Y_i$, we observe a variable $Z_i$ which constitutes the minimum of $Y_i$ and a so-called censoring variable $C_i$, together with the information $\delta_i$ indicating which of $Y_i$ or $C_i$ was actually observed:
$$Z_i = \min(Y_i, C_i), \qquad \delta_i = 1_{\{Y_i \le C_i\}}.$$
Hence $\delta_i = 1$ indicates that $Y_i$ has not been censored while $\delta_i = 0$ means that instead of $Y_i$ the variable $C_i$ was observed.

The problem now becomes one of estimating $F$, the unknown d.f. of the lifetime $Y$, from the available $(Z_i, \delta_i)$, $1 \le i \le n$. Throughout it will be assumed that, for each $1 \le i \le n$, $Y_i$ is independent of $C_i$. Denote with $G$ the (unknown) d.f. of each $C$. For the moment we restrict ourselves to the case when no covariables are present.


To derive such an estimator we shall make heavy use of the approach outlined in the first chapter. For this, recall the cumulative hazard function of $F$, given through
$$d\Lambda = \frac{dF}{1 - F_-}.$$
Then clearly
$$d\Lambda = \frac{(1 - G_-)\,dF}{(1 - G_-)(1 - F_-)} \equiv \frac{(1 - G_-)\,dF}{1 - H_-}, \qquad (9.1.1)$$
where
$$1 - H(z) = (1 - F(z))(1 - G(z)) = \mathbb P(Y > z)\,\mathbb P(C > z) = \mathbb P(Y > z, C > z) = \mathbb P(Z > z).$$
Here the last equality follows from the assumed independence of $Y$ and $C$. Recall that $Z_1, \ldots, Z_n$ are observable. Hence $H$ can be nonparametrically estimated by the empirical d.f. of the $Z_i$'s:
$$H_n(z) = \frac1n \sum_{i=1}^n 1_{\{Z_i \le z\}}, \qquad z \in \mathbb R.$$

Next we consider the numerator in (9.1.1). Define $H^1$ by
$$dH^1 = (1 - G_-)\,dF.$$
Then we have the following

Lemma 9.1.1. For each $z \in \mathbb R$,
$$\mathbb P(Z \le z, \delta = 1) = H^1(z).$$
Proof. We have
$$\mathbb P(Z \le z, \delta = 1) = \mathbb P(Z \le z, Y \le C) = \mathbb P(Y \le z, Y \le C) = \int_{(-\infty,z]} (1 - G(y-))\,F(dy).$$

Also $H^1$ can be nonparametrically estimated through an Empirical Sub-Distribution:
$$H_n^1(z) = \frac1n \sum_{i=1}^n 1_{\{Z_i \le z, \delta_i = 1\}}.$$
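Lemma 9.1.1 is easy to probe by simulation. For exponential $Y$ and $C$ (hypothetical rates chosen purely for illustration) the sub-distribution has the closed form $H^1(z) = \frac{\lambda}{\lambda+\mu}\big(1 - e^{-(\lambda+\mu)z}\big)$:

```python
import math, random

random.seed(5)
lam, mu, z, N = 1.0, 0.5, 1.0, 40000   # Y ~ Exp(lam), C ~ Exp(mu), independent
hits = 0
for _ in range(N):
    y, c = random.expovariate(lam), random.expovariate(mu)
    if min(y, c) <= z and y <= c:      # the event {Z <= z, delta = 1}
        hits += 1
# closed form of H^1(z) = ∫_0^z (1 - G(y)) F(dy) for these exponentials
truth = lam / (lam + mu) * (1.0 - math.exp(-(lam + mu) * z))
assert abs(hits / N - truth) < 0.02
```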


Together $H_n$ and $H_n^1$ define an estimator for $\Lambda$, namely
$$d\Lambda_n = \frac{dH_n^1}{1 - H_{n-}}.$$
In an integrated form, we have
$$\Lambda_n(z) = \int_{(-\infty,z]} \frac{H_n^1(dy)}{1 - H_n(y-)} = \sum_{i=1}^n 1_{\{Z_i \le z, \delta_i = 1\}}\,\frac{1}{n - R_i + 1},$$
where $R_i$ is the rank of $Z_i$ among $Z_1, \ldots, Z_n$ and we assume for the moment that all $Z_i$ are distinct.
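The estimator $\Lambda_n$ (known in the survival literature as the Nelson-Aalen estimator) is a short computation; the sketch below assumes distinct $Z_i$, as in the text:

```python
# Lambda_n(z) = sum over uncensored Z_i <= z of 1/(n - R_i + 1), distinct Z_i
def cum_hazard_estimate(z, data):
    n = len(data)
    zs = sorted(t for t, _ in data)
    rank = {t: k + 1 for k, t in enumerate(zs)}   # R_i, 1-based rank of Z_i
    return sum(1.0 / (n - rank[t] + 1)
               for t, d in data if d == 1 and t <= z)

# toy sample (Z_i, delta_i): delta = 1 uncensored, delta = 0 censored
data = [(1.0, 1), (2.0, 0), (3.0, 1), (4.0, 1)]
# jumps: 1/4 at 1.0, none at 2.0, 1/2 at 3.0, 1 at 4.0
assert abs(cum_hazard_estimate(3.5, data) - (0.25 + 0.5)) < 1e-12
```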
The corresponding estimator of $F$ is obtained from
$$1 - F_n(t) = \prod_{z \le t} [1 - \Lambda_n\{z\}]. \qquad (9.1.2)$$
This estimator was introduced in a landmark paper by Kaplan and Meier (1958). For further analysis, note that $\Lambda_n$ only jumps at the $Z_i$ for which $\delta_i = 1$. The jump size equals
$$\Lambda_n\{Z_i\} = \frac{1}{n - R_i + 1}.$$
Plugging this into (9.1.2), we get for the Kaplan-Meier estimator
$$1 - F_n(t) = \prod_{Z_i \le t, \delta_i = 1} \frac{n - R_i}{n - R_i + 1} = \prod_{Z_i \le t} \Big[1 - \frac{\delta_i}{n - R_i + 1}\Big].$$

After ordering the $Z_i$'s, we get
$$1 - F_n(t) = \prod_{Z_{i:n} \le t} \Big[1 - \frac{\delta_{[i:n]}}{n - i + 1}\Big]. \qquad (9.1.3)$$
Here, $\delta_{[i:n]}$ is the $\delta$-concomitant of $Z_{i:n}$, i.e., $\delta_{[i:n]} = \delta_j$ if $Z_{i:n} = Z_j$. From (9.1.3) we finally get the mass given to $Z_{i:n}$:
$$W_{in} = \frac{\delta_{[i:n]}}{n - i + 1} \prod_{k=1}^{i-1} \Big[1 - \frac{\delta_{[k:n]}}{n - k + 1}\Big].$$
When all data are uncensored, all $\delta$'s equal one and $W_{in}$ collapses to $1/n$. Hence, when no censorship is present, the Kaplan-Meier estimator becomes the ordinary empirical d.f. In the general case, the weights $W_{in}$ depend on the location of the $\delta$-labels and are therefore random.
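The weights $W_{in}$ can be computed in one pass through the ordered sample (an illustrative sketch assuming distinct $Z$-values); in particular, with all $\delta_i = 1$ they collapse to $1/n$:

```python
def km_weights(z, d):
    # Kaplan-Meier mass W_{in} attached to the ordered Z's (distinct values)
    order = sorted(range(len(z)), key=lambda i: z[i])
    n = len(z)
    surv, w = 1.0, []
    for rank, i in enumerate(order, start=1):
        jump = d[i] / (n - rank + 1)
        w.append(surv * jump)       # delta/(n-i+1) times product of earlier factors
        surv *= 1.0 - jump
    return w

# no censoring: every weight collapses to 1/n
assert all(abs(wi - 0.25) < 1e-12 for wi in km_weights([3, 1, 4, 2], [1, 1, 1, 1]))

# with censoring the weights depend on the delta-pattern:
# jumps 1/4 at 1.0, none at 2.0, (3/4)*(1/2) at 3.0, (3/8)*1 at 4.0
w = km_weights([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 1])
assert all(abs(a - b) < 1e-12 for a, b in zip(w, [0.25, 0.0, 0.375, 0.375]))
```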


When $F$ has discontinuities, the data may have ties. In this general case the Kaplan-Meier estimator is computed as follows. Since $1 - F_n$ is the exponential of $-\Lambda_n$, we have
$$1 - F_n(t) = 1 - \int_{(-\infty,t]} (1 - F_n(s-))\,\Lambda_n(ds).$$
From this we get the point mass at $t$:
$$F_n\{t\} = \frac{1 - F_n(t-)}{1 - H_n(t-)}\,H_n^1\{t\}. \qquad (9.1.4)$$
Since $H_n^1$ attributes positive mass only to the uncensored data, the same is true for $F_n$. Now, let $t_1 < \ldots < t_k$ be the pairwise distinct values among $Z_1, \ldots, Z_n$. From (9.1.4) we obtain a recursive formula for the masses at the $t_i$. For $t_1$ we obtain
$$F_n\{t_1\} = H_n^1\{t_1\} = \frac{k_1}{n},$$
where $k_1$ is the number of $Z$-data with $Z_i = t_1$ and $\delta_i = 1$. Conclude that
$$1 - F_n(t_1) = 1 - F_n\{t_1\} = 1 - \frac{k_1}{n} = \frac{n - k_1}{n} = \frac{n-1}{n}\,\frac{n-2}{n-1}\,\cdots\,\frac{n - k_1}{n - k_1 + 1} = \prod_{i=1}^{k_1} \Big[1 - \frac{\delta_{[i:n]}}{n - i + 1}\Big].$$
Practically, we have done the same for the $k_1$ smallest uncensored $Z_i$ as for the empirical d.f. when no censorship was present. It could be that some censored variables also take on the value $t_1$. They do not contribute to the last product. If we continue in this way we find that also in the case of ties the formula
$$1 - F_n(t) = \prod_{Z_{i:n} \le t} \Big[1 - \frac{\delta_{[i:n]}}{n - i + 1}\Big] \qquad (9.1.5)$$
Zi:n t

holds true. If there is a $t_j$ which is attained by both censored and uncensored variables, the uncensored need to be ordered before the censored, while the ordering among the censored and among the uncensored may be arbitrary. For the weight $W_{jn}$ attributed to $t_j$ we get
$$W_{jn} = \frac{k_j}{n - nH_n(t_j-)} \prod_{Z_{i:n} < t_j} \Big[1 - \frac{\delta_{[i:n]}}{n - i + 1}\Big],$$
where again $k_j$ is the number of uncensored $Z$'s attaining the value $t_j$.


Next we consider an arbitrary function . For the associated KaplanMeier Integral dFn , we get

dFn =

Wjn (tj ).

j=1

In the analysis of classical empirical integrals,

1
dFn =
(Xi ),
n
n

i=1

we can make use of the fact that


X = F 1 (Ui )
so that

in distribution,

dFn =

F 1 dFn .

Here U1 , . . . , Un is a uniform sample. Note that with probability one the U s


have no ties. The existence of ties in the available X-data may be attributed
to the quantile transformation F 1 . Summarizing, replacing by F 1
we may assume w.l.o.g. that the data come from a (continuous) uniform
distribution with not ties present. To apply the same idea in the case of
right censored data we have to take into account censorship eects.
Lemma 9.1.2. Let $Y \sim F$ and $C \sim G$ be independent, and let $D = \{a_1, a_2, \ldots\}$ denote the possible discontinuities of $F$. Let $V \sim U[0,1]$ be independent of $Y$ and $C$. Put
$$U = \begin{cases} F(Y) & \text{if } Y \notin D \\ F(a-) + [F(a) - F(a-)]\,V & \text{if } Y = a \text{ and } a \in D \end{cases}$$
$$\tilde U = F(C).$$
Then we have:
(i) $U$ and $\tilde U$ are independent
(ii) $U \sim U[0,1]$
(iii) $Y = F^{-1}(U)$
(iv) $\delta = 1_{\{Y \le C\}} = 1_{\{U \le \tilde U\}} \equiv \tilde\delta$.
Proof. Elementary.

When we apply the previous lemma to the pairs $(Y_i, C_i)$, we obtain a sample $(U_i, \tilde U_i)$ with the same censorship structure as the original sample. In the $U$-sample no ties are present with probability one. For a general integrand $\varphi$, we have
$$\varphi(Y_i) = \varphi(F^{-1}(U_i)) \qquad \text{if } \delta_i = \tilde\delta_i = 1.$$
The Kaplan-Meier weights remain unchanged. As a consequence
$$\int \varphi\,dF_n = \sum_{i=1}^n W_{in}\,\varphi(F^{-1}(\tilde Z_{i:n}))$$
with $\tilde Z_i = \min(U_i, \tilde U_i)$. Hence, in proofs, we may assume w.l.o.g. that no ties are present.

9.2 Martingale Structures under Censorship

As in the previous section, let
$$Y \sim F \quad \text{and} \quad C \sim G$$
be two independent random variables such that
$$Z = \min(Y, C) \quad \text{and} \quad \delta = 1_{\{Y \le C\}}$$
are observed. Again, denote
$$1 - H(t) = \mathbb P(Z > t) = (1 - F(t))(1 - G(t)),$$
$$H^1(t) = \mathbb P(Z \le t, \delta = 1) = \int_{(-\infty,t]} (1 - G_-)\,dF.$$
Finally, set
$$H^0(t) = \mathbb P(Z \le t, \delta = 0) = \int_{(-\infty,t]} (1 - F)\,dG.$$


The corresponding estimators are given by
$$H_n(t) = \frac1n \sum_{i=1}^n 1_{\{Z_i \le t\}},$$
$$H_n^1(t) = \frac1n \sum_{i=1}^n 1_{\{Z_i \le t, \delta_i = 1\}},$$
$$H_n^0(t) = \frac1n \sum_{i=1}^n 1_{\{Z_i \le t, \delta_i = 0\}},$$
respectively. We have also seen that
$$d\Lambda = \frac{dF}{1 - F_-} = \frac{dH^1}{1 - H_-},$$
while the empirical variant became
$$d\Lambda_n = \frac{dH_n^1}{1 - H_{n-}}.$$

In the following we derive the Doob-Meyer decompositions of $H_n$, $H_n^1$ and $H_n^0$. As before, it is enough to study sample size $n = 1$. The filtration has to be chosen such that all three processes are adapted. Since
$$1_{\{Z \le t, \delta = 0\}} = 1_{\{Z \le t\}} - 1_{\{Z \le t, \delta = 1\}},$$
it suffices to consider
$$\mathcal F_t = \sigma\big(1_{\{Z \le s\}}, 1_{\{Z \le s, \delta = 1\}} : s \le t\big).$$
As before we again consider a finite grid $-\infty = t_0 < t_1 < \ldots < t_n < t_{n+1} = t$. Then
$$1_{\{Z \le t_k, \delta = 1\}} = 1_{\{Z \le t_{k-1}, \delta = 1\}} + 1_{\{t_{k-1} < Z \le t_k, \delta = 1\}}.$$
The first indicator is again measurable w.r.t. $\mathcal F_{t_{k-1}}$. For the second indicator we obtain, by the Markov property,
$$\mathbb E\big(1_{\{t_{k-1} < Z \le t_k, \delta = 1\}}\big|\mathcal F_{t_{k-1}}\big) = \mathbb E\big(1_{\{\ldots\}}\,\big|\,1_{\{Z \le t_{k-1}\}}, 1_{\{Z \le t_{k-1}, \delta = 1\}}\big).$$
The $\sigma$-algebra generated by the two events consists of $\emptyset$, $\Omega$, $\{Z \le t_{k-1}, \delta = 1\}$, $\{Z \le t_{k-1}, \delta = 0\}$, $\{Z > t_{k-1}\}$ and unions thereof. Hence
$$\mathbb E(1_{\{\ldots\}}|\ldots) = 1_{\{Z > t_{k-1}\}}\,\frac{\int_{\{Z > t_{k-1}\}} 1_{\{t_{k-1} < Z \le t_k, \delta = 1\}}\,d\mathbb P}{\mathbb P(Z > t_{k-1})} = 1_{\{Z > t_{k-1}\}}\,\frac{H^1(t_k) - H^1(t_{k-1})}{1 - H(t_{k-1})}.$$


From this we get the Doob-Meyer decomposition in discrete time and, finally, by going to the limit, in continuous time.

Theorem 9.2.1. Consider the process $S_t = 1_{\{Z \le t, \delta = 1\}}$ with the adapted filtration $\mathcal F_t = \sigma(1_{\{Z \le s\}}, 1_{\{Z \le s, \delta = 1\}} : s \le t)$. Then $(S_t)_t$ admits the innovation martingale
$$M_t^1 = 1_{\{Z \le t, \delta = 1\}} - \int_{(-\infty,t]} \frac{1_{\{x \le Z\}}}{1 - H(x-)}\,H^1(dx) = 1_{\{Z \le t, \delta = 1\}} - \int_{(-\infty,t]} \frac{1_{\{x \le Z\}}}{1 - F(x-)}\,F(dx).$$

Corollary 9.2.2. The process $H_n^1$ is adapted to $\mathcal F_t = \sigma(1_{\{Z_i \le s\}}, 1_{\{Z_i \le s, \delta_i = 1\}} : i = 1, \ldots, n,\ s \le t)$ and has the innovation martingale
$$M_n^1(t) = H_n^1(t) - \int_{(-\infty,t]} \frac{1 - H_n(x-)}{1 - F(x-)}\,F(dx). \qquad (9.2.1)$$

Remark 9.2.3. For $H_n^0$ the innovation martingale becomes
$$M_n^0(t) = H_n^0(t) - \int_{(-\infty,t]} \frac{1 - H_n(x-)}{1 - H(x-)}\,H^0(dx) = H_n^0(t) - \int_{(-\infty,t]} \frac{1 - H_n(x-)}{1 - H(x-)}\,(1 - F(x))\,G(dx).$$
If $F$ is continuous, the compensator becomes
$$\int_{(-\infty,t]} \frac{1 - H_n(x-)}{1 - G(x-)}\,G(dx).$$

Coming back to (9.2.1), in terms of differentials we obtain
$$dM_n^1 = dH_n^1 - \frac{1 - H_{n-}}{1 - F_-}\,dF.$$
On $t \le Z_{n:n}$ we divide both sides by $1 - H_{n-}$ to get
$$\frac{dM_n^1}{1 - H_{n-}} = \frac{dH_n^1}{1 - H_{n-}} - \frac{dF}{1 - F_-} = d\Lambda_n - d\Lambda. \qquad (9.2.2)$$


Integration leads to
$$(\Lambda_n - \Lambda)(t) = \int_{(-\infty,t]} \frac{M_n^1(ds)}{1 - H_n(s-)} \qquad \text{on } t \le Z_{n:n}. \qquad (9.2.3)$$
Equation (9.2.3) is the analog of Remark 8.7.3 when there is no censorship. So far we have discussed martingale structures for $H_n$, $H_n^1$ and $H_n^0$. Now we study the Kaplan-Meier estimator $F_n$ itself. For this, recall that
• $1 - F_n$ is the exponential of $-\Lambda_n$
• $1 - F$ is the exponential of $-\Lambda$.
Lemma 8.7.1 enables us to find the solution of the equation
$$Z_t = \int_{(-\infty,t]} \frac{1 - Z(s-)}{1 - \Lambda\{s\}}\,[\Lambda_n(ds) - \Lambda(ds)].$$

Since n is purely discrete, cn 0. It follows that

1 Fn (t)
st (1 n {s})
=1
.
Z(t) = 1
c
1 F (t)
st (1 {s}) exp( (t))
Hence, for F (t) < 1,
1 Fn (t)
1 F (t)

= 1

1 Z(s)
[n (ds) (ds)]
1 {s}

(,t]

= 1

1 Fn (s)
[n (ds) (ds)].
(1 F (s))(1 {s})

(,t]

Since
(1F (s))(1{s}) = (1F (s))(1

F {s}
) = 1F (s)F {s} = 1F (s)
1 F (s)

we obtain

1 Fn (t)
=1
1 F (t)
(,t]

1 Fn (s)
[n (ds) (ds)].
1 F (s)

(9.2.4)
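The relation between $F_n$ and $\Lambda_n$ above is also how the Kaplan-Meier estimator is computed in practice: $1-F_n(t)$ is the product of the factors $1-\Lambda_n\{s\}$ over $s\le t$. A minimal sketch (my own illustration, not code from the text; distinct observation times assumed):

```python
def km_via_hazard(z, d):
    """Kaplan-Meier survival 1 - F_n(t), computed as the product integral of the
    Nelson-Aalen increments Lambda_n{s} = H_n^1{s} / (1 - H_n(s-)).
    z: observed times Z_i = min(Y_i, C_i); d: labels delta_i (1 = uncensored).
    Assumes distinct z-values. Returns [(z_(i), 1 - F_n(z_(i))), ...]."""
    pairs = sorted(zip(z, d))
    n = len(pairs)
    surv, out = 1.0, []
    for i, (zi, di) in enumerate(pairs):
        at_risk = n - i               # n * (1 - H_n(zi-))
        surv *= 1.0 - di / at_risk    # factor (1 - Lambda_n{zi}); = 1 if censored
        out.append((zi, surv))
    return out
```

With all $\delta_i=1$ this reduces to the ordinary empirical survival function $1-H_n$.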


Equations (9.2.2) and (9.2.4) will enable us to further study the Kaplan-Meier estimator. First, from (9.2.4),

$$\frac{F_n(t)-F(t)}{1-F(t)}=\int_{(-\infty,t]}\frac{1-F_n(s-)}{1-F(s)}\,[\Lambda_n(ds)-\Lambda(ds)].$$

Using (9.2.2) we get the following

Theorem 9.2.4. On $t\le Z_{n:n}$ and for $F(t)<1$ we have

$$\frac{F_n(t)-F(t)}{1-F(t)}=\int_{(-\infty,t]}\frac{1-F_n(s-)}{(1-F(s))(1-H_n(s-))}\,M_n^1(ds). \tag{9.2.5}$$

Theorem 9.2.4 constitutes an extension of Theorem 8.6.1 to the Kaplan-Meier estimator. Note that in the classical case $1-F_n=1-H_n$, so that the (predictable) random ratio $(1-F_n^-)/(1-H_n^-)$ cancels out. Therefore, with no censorship present, the restriction $t\le U_{n:n}$ was not necessary. For the ratio of $1-F_n$ and $1-F$ a similar argument yields, for $t\le Z_{n:n}$,

$$\frac{1-F_n(t)}{1-F(t)}=1-\int_{(-\infty,t]}\frac{1-F_n(s-)}{(1-F(s-))(1-H_n(s-))(1-\Lambda\{s\})}\,M_n^1(ds). \tag{9.2.6}$$
Theorem 9.2.5. The process $t\mapsto\frac{1-F_n(t)}{1-F(t)}$ equals, on $t\le Z_{n:n}$, the exponential of the process $\tilde M_n$ defined through

$$\tilde M_n(ds)=-\frac{M_n^1(ds)}{(1-H_n(s-))(1-\Lambda\{s\})}.$$

The last result extends Corollary 8.7.2 in two ways:

- Censorship is admitted
- $F$ may have discontinuities


Next we determine the covariance structure of $M_n^1$. The proof is similar to that of Lemma 8.4.2. For $\psi(s)\equiv E\,M_1^2(s)$ we now obtain

$$\psi(s)=E\Big[\Big(1_{\{Z\le s,\,\delta=1\}}-\int_{(-\infty,s]}\frac{1_{\{x\le Z\}}}{1-F(x-)}\,F(dx)\Big)^2\Big]$$
$$=H^1(s)-2\,E\int_{(-\infty,s]}\frac{1_{\{x\le Z\le s,\,\delta=1\}}}{1-F(x-)}\,F(dx)+E\Big[\Big(\int_{(-\infty,s]}\frac{1_{\{x\le Z\}}}{1-F(x-)}\,F(dx)\Big)^2\Big].$$

Proceeding as for Lemma 8.4.2, we obtain

$$\psi(s)=H^1(s)-\int_{(-\infty,s]}\frac{H^1\{x\}}{1-F(x-)}\,F(dx). \tag{9.2.7}$$

Theorem 9.2.6. The innovation martingale $M_n^1$ has the covariance function

$$\operatorname{Cov}(M_n^1(s),M_n^1(t))=\frac1n\,\psi(s)\qquad\text{for }s\le t,$$

with $\psi$ as in (9.2.7).

For a continuous $H^1$, we have $\psi=H^1$. When there is no censorship, (9.2.7) and Lemma 1.4.2 coincide.

To determine the limit distribution of $F_n$, note that since the weights $W_{in}$ are random, the CLT for sums of independent random variables is not directly applicable. In such a situation, a representation in terms of stochastic integrals, see (9.2.5), may be helpful.
In the following we sketch the arguments. Starting with (9.2.5), the so-called Kaplan-Meier process

$$\alpha_n(t)=n^{1/2}\,[F_n(t)-F(t)]$$

may be written as

$$\alpha_n(t)=(1-F(t))\int_{(-\infty,t]}\frac{1-F_n(s-)}{(1-F(s))(1-H_n(s-))}\,\tilde M_n(ds),$$

with $\tilde M_n=n^{1/2}M_n^1$. The process $\tilde M_n$ is a martingale with covariance $\psi$. From the Glivenko-Cantelli Theorem we have, with probability one,

$$1-H_n\to 1-H\quad\text{uniformly.}$$

The extension of the Glivenko-Cantelli Theorem to the Kaplan-Meier estimator is due to Stute and Wang (Ann. Statist., 1993):

$$1-F_n\to 1-F\quad\text{uniformly.}$$

Hence, on an interval $(-\infty,t_0]$ with $F(t_0)<1$, $\alpha_n(t)$ is asymptotically equivalent to

$$\alpha_n^0(t)=(1-F(t))\int_{(-\infty,t]}\frac{(1-F(s-))\,\tilde M_n(ds)}{(1-F(s))(1-H(s-))}=(1-F(t))\int_{(-\infty,t]}\frac{\tilde M_n(ds)}{(1-F(s))(1-G(s-))}.$$

$\tilde M_n$ has the same covariance structure as $B\circ\psi$, where $B$ is a standard Brownian Motion. Therefore, we may expect that

$$\alpha_n(\cdot)\to(1-F(\cdot))\int_{(-\infty,\cdot]}\frac{d(B\circ\psi)}{(1-F)(1-G^-)}=:\tilde B(\cdot).$$

Since the integrand is deterministic, $\tilde B$ is a centered Gaussian process. The covariance equals

$$E[\tilde B(s)\tilde B(t)]=(1-F(s))(1-F(t))\int_{(-\infty,s\wedge t]}\frac{d\psi}{(1-F)^2(1-G^-)^2}$$
$$=(1-F(s))(1-F(t))\int_{(-\infty,s\wedge t]}\frac{1}{(1-F)^2(1-G^-)^2}\Big[dH^1-\frac{H^1\{\cdot\}}{1-F^-}\,dF\Big].$$

If $F$ and hence $H^1$ are continuous, the second integrating measure vanishes, and we obtain

$$E[\tilde B(s)\tilde B(t)]=(1-F(s))(1-F(t))\int_{(-\infty,s\wedge t]}\frac{dF}{(1-F)^2(1-G)}=(1-F(s))(1-F(t))\int_{(-\infty,s\wedge t]}\frac{d\Lambda}{1-H}. \tag{9.2.8}$$

The following result summarizes our findings.


Theorem 9.2.7 (Breslow-Crowley). The process $\alpha_n$ converges in distribution on each interval $(-\infty,t_0]$ with $H(t_0)<1$, with limit $\tilde B$. Here $\tilde B$ is a centered Gaussian process with covariance (if $F$ is continuous) (9.2.8).

Corollary 9.2.8. Assume $F$ is continuous. Then for all $x\in\mathbb R$

$$P\Big(\sup_{t\le t_0}|\alpha_n(t)|\le x\Big)\to P\Big(\sup_{t\le t_0}|\tilde B(t)|\le x\Big).$$

Proof. This assertion follows from the Breslow-Crowley result upon applying the continuous mapping theorem.


9.3 Confidence Bands for F under Right Censorship

In this section we demonstrate how the results of the previous section may be applied in a statistical context. Again, assume that we observe $(Z_i,\delta_i)$, $1\le i\le n$. We want to construct a confidence band for the unknown d.f. $F$, i.e., for each $t$ we aim at constructing an interval

$$I_n(t)=\Big[F_n(t)-\frac{x}{\sqrt n},\,F_n(t)+\frac{x}{\sqrt n}\Big]$$

such that with a predetermined probability $1-c$ we have

$$F(t)\in I_n(t)\quad\text{for all }t\le t_0.$$

The quantity $x$ may and will depend on $t$. Hence the width of the confidence band will change from $t$ to $t$. To find such $x$'s, introduce the function

$$C(t)=\int_{(-\infty,t]}\frac{d\Lambda}{1-H^-}=\int_{(-\infty,t]}\frac{dH^1}{(1-H^-)^2}.$$

The function $C$ is nondecreasing, and the limit of $\alpha_n$ in the Breslow-Crowley result may be written as

$$\tilde B=(1-F)\,B\circ C,$$

where again $B$ is a Brownian Motion. Put

$$K(t)=\frac{C(t)}{1+C(t)}$$



so that

$$C(t)=\frac{K(t)}{1-K(t)}.$$

Then

$$\tilde B=(1-F)\,B\Big(\frac{K}{1-K}\Big)=\frac{1-F}{1-K}\,(1-K)\,B\Big(\frac{K}{1-K}\Big)=\frac{1-F}{1-K}\,B^0\circ K,$$

where $B^0$ is a Brownian Bridge. Thus, putting

$$\tilde\alpha_n=\frac{1-K}{1-F}\,\alpha_n,$$

we get $\tilde\alpha_n\to B^0\circ K$ in distribution. From the Continuous Mapping Theorem,

$$P\Big(\sup_{t\le t_0}\Big|\frac{(1-K(t))\,\alpha_n(t)}{1-F(t)}\Big|\le x\Big)\to P\Big(\sup_{t\le t_0}|B^0\circ K(t)|\le x\Big).$$

Since along with $C$ also $K$ is monotone increasing in $t$,

$$P\Big(\sup_{t\le t_0}|B^0\circ K(t)|\le x\Big)=P\Big(\sup_{0\le u\le K(t_0)}|B^0(u)|\le x\Big).$$

For selected values of $K(t_0)$ and $x$, the last probabilities are tabulated. Choosing $x$ such that

$$P\Big(\sup_{0\le u\le K(t_0)}|B^0(u)|>x\Big)=c,$$

a pre-specified value, we then have

$$P\Big(\sup_{t\le t_0}\Big|\frac{(1-K(t))\,\alpha_n(t)}{1-F(t)}\Big|>x\Big)\to c.$$
The same holds when we replace $F$ with $F_n$ and $K$ by

$$K_n=\frac{C_n}{1+C_n}\qquad\text{with}\qquad C_n(t)=\int_{(-\infty,t]}\frac{dH_n^1}{(1-H_n^-)^2}. \tag{9.3.1}$$

Summarizing we obtain


Corollary 9.3.1 (Hall-Wellner Bands). For a given $0<c<1$, let $x$ be chosen as in (9.3.1). Put

$$I_n(t)=\Big[F_n(t)-\frac{n^{-1/2}\,x\,(1-F_n(t))}{1-K_n(t)},\;F_n(t)+\frac{n^{-1/2}\,x\,(1-F_n(t))}{1-K_n(t)}\Big].$$

Then we have

$$P\big(F(t)\in I_n(t)\ \text{for all }t\le t_0\big)\to 1-c.$$
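As a computational sketch of Corollary 9.3.1 (my own illustration, not code from the text), the band can be evaluated at the observed points; in practice $x$ would be taken from tables of the sup-distribution of the Brownian Bridge on $[0,K_n(t_0)]$:

```python
def hall_wellner_band(z, d, x, t0):
    """Hall-Wellner type band F_n(t) -+ x n^{-1/2}(1-F_n(t))/(1-K_n(t))
    at the observed points t <= t0; distinct z-values assumed.
    C_n(t) = int dH_n^1/(1-H_n^-)^2 and K_n = C_n/(1+C_n)."""
    pairs = sorted(zip(z, d))
    n = len(pairs)
    surv, C, band = 1.0, 0.0, []
    for i, (zi, di) in enumerate(pairs):
        if zi > t0:
            break
        at_risk = n - i                 # n * (1 - H_n(zi-))
        C += di * n / at_risk ** 2      # jump (di/n) / ((at_risk/n)^2)
        surv *= 1.0 - di / at_risk      # 1 - F_n(zi)
        K = C / (1.0 + C)
        half = x * surv / ((1.0 - K) * n ** 0.5)
        band.append((zi, (1.0 - surv) - half, (1.0 - surv) + half))
    return band
```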
Remark 9.3.2. The approach of Hall and Wellner incorporates a time transformation $K$ which leads us to a time-transformed Brownian Bridge. If one is not willing to follow this approach, one needs to base the analysis on the process $\alpha_n/(1-F_n)$. It follows from the Breslow-Crowley result and the consistency of $F_n$ that

$$P\Big(\sup_{t\le t_0}\Big|\frac{\alpha_n(t)}{1-F_n(t)}\Big|\le x\Big)\to P\Big(\sup_{t\le t_0}|B(C(t))|\le x\Big)=P\Big(\sup_{0\le u\le C(t_0)}|B(u)|\le x\Big)$$
$$=P\Big(\sup_{0\le u\le 1}|B(uC(t_0))|\le x\Big)=P\Big(\sqrt{C(t_0)}\,\sup_{0\le u\le 1}|B(u)|\le x\Big).$$

Replace $x$ with $x\sqrt{C(t_0)}$ resp. $x\sqrt{C_n(t_0)}$. Then

$$P\Big(\sup_{t\le t_0}\Big|\frac{\alpha_n(t)}{1-F_n(t)}\Big|\le x\sqrt{C_n(t_0)}\Big)\to P\Big(\sup_{0\le u\le 1}|B(u)|\le x\Big).$$

The resulting confidence band is of the form

$$I_n(t)=\Big[F_n(t)-x\,n^{-1/2}(1-F_n(t))\sqrt{C_n(t_0)},\;F_n(t)+x\,n^{-1/2}(1-F_n(t))\sqrt{C_n(t_0)}\Big].$$

9.4 Rank Tests for Censored Data

In Case-Control studies one is interested in a comparison of a new therapy, say, with an existing one. For this one needs to compare two samples drawn from each of the two populations. Under right censorship this amounts to comparing samples of the form

$$(Z_i^1,\delta_i^1),\ 1\le i\le n_1\qquad\text{and}\qquad(Z_i^2,\delta_i^2),\ 1\le i\le n_2.$$

Each $Z$ is of the form

$$Z_i^1=Y_i^1\wedge C_i^1\quad\text{resp.}\quad Z_i^2=Y_i^2\wedge C_i^2.$$

The lifetime distributions $F_1$ and $F_2$ satisfying

$$Y_i^1\sim F_1\quad\text{and}\quad Y_i^2\sim F_2$$

are the quantities of interest, while the censoring distributions $G_1$ and $G_2$ satisfying

$$C_i^1\sim G_1\quad\text{and}\quad C_i^2\sim G_2$$

are nonparametric nuisance parameters. Denote with $H_1$ and $H_2$ the d.f.s of $Z_i^1$ and $Z_i^2$, respectively. As before, we obtain

$$1-H_1=(1-F_1)(1-G_1),\qquad 1-H_2=(1-F_2)(1-G_2),$$

while the corresponding sub-distributions $H^1$ now become

$$H_1^1(t)=P(Z^1\le t,\,\delta^1=1)=\int_{(-\infty,t]}(1-G_1)\,dF_1,$$
$$H_2^1(t)=P(Z^2\le t,\,\delta^2=1)=\int_{(-\infty,t]}(1-G_2)\,dF_2,$$

respectively. A possible hypothesis to be tested might be

$$H_0:F_1=F_2\quad\text{versus}\quad H_1:F_1\ne F_2$$

or

$$H_0:F_1=F_2\quad\text{versus}\quad H_1:F_1\le F_2\ \text{and}\ F_1\ne F_2.$$

The distributions $G_1$ and $G_2$ may differ under $H_0$, so that the observed $Z^1$ and $Z^2$ may have different distributions $H_1$ and $H_2$ even under the null hypothesis. Hence a statistical test for $H_0$ has to take care of significant differences in the $Z$'s only caused by the fact that $G_1\ne G_2$.
A class of tests which have found a lot of interest in practical applications are so-called Linear Rank Tests. The word "linear" stands for sums or, in our notation, for empirical integrals. A crucial part is played by a weight function $W$ which needs to be chosen in such a way that under the alternative relevant departures in the samples are detected and upweighted, resulting in a test with hopefully large power.

To start with the theoretical quantities, we aim at comparing two terms which are identical under $H_0$, irrespective of whether $G_1=G_2$ or not. Since we shall restrict ourselves to integrals, the problem becomes one of choosing a weight function $W$ and two measures $\mu$ and $\nu$ such that

$$\int W\,d\mu=\int W\,d\nu\quad\text{under }H_0.$$

Focus on $\mu$ and $\nu$ first. Putting

$$\mu(dx)=\frac{(1-G_1)(1-G_2)(1-F_2)}{1-\bar H}\,dF_1\qquad\text{and}\qquad\nu(dx)=\frac{(1-G_1)(1-G_2)(1-F_1)}{1-\bar H}\,dF_2,$$

where

$$\bar H=\frac{n_1}{n_1+n_2}\,H_1+\frac{n_2}{n_1+n_2}\,H_2,$$

we get that

$$I=\int W(\bar H(x))\,[\mu(dx)-\nu(dx)]$$

vanishes under the null hypothesis. Note that in $I$ the weight function becomes $W\circ\bar H$ rather than $W$, with $W$ being defined on the unit interval. The function $\bar H$ may be viewed as the d.f. of the pooled sample $Z_1^1,\ldots,Z_{n_1}^1,Z_1^2,\ldots,Z_{n_2}^2$. The measures $\mu$ and $\nu$ are closely connected with the cumulative hazard functions. Actually, we have

$$\mu(dx)=\frac{1-H_2}{1-\bar H}\,H_1^1(dx)\qquad\text{and}\qquad\nu(dx)=\frac{1-H_1}{1-\bar H}\,H_2^1(dx).$$

The final test statistic is just the empirical analog of $I$. This leads to

$$\hat I=\int W(\hat{\bar H}(x))\Big[\frac{1-\hat H_2(x-)}{1-\hat{\bar H}(x-)}\,\hat H_1^1(dx)-\frac{1-\hat H_1(x-)}{1-\hat{\bar H}(x-)}\,\hat H_2^1(dx)\Big].$$

The function $\hat{\bar H}$ is the empirical d.f. of the pooled $Z$'s:

$$\hat{\bar H}(x)=\frac{1}{n_1+n_2}\Big[\sum_{i=1}^{n_1}1_{\{Z_i^1\le x\}}+\sum_{i=1}^{n_2}1_{\{Z_i^2\le x\}}\Big].$$

The functions $\hat H_1$ and $\hat H_2$ are the empirical d.f.s of the $Z$-subsamples:

$$\hat H_1(x)=\frac{1}{n_1}\sum_{i=1}^{n_1}1_{\{Z_i^1\le x\}}\qquad\text{and}\qquad\hat H_2(x)=\frac{1}{n_2}\sum_{i=1}^{n_2}1_{\{Z_i^2\le x\}}.$$

Conclude that

$$\hat{\bar H}=\frac{n_1}{n_1+n_2}\,\hat H_1+\frac{n_2}{n_1+n_2}\,\hat H_2,$$

and hence

$$1-\hat{\bar H}=\frac{n_1}{n_1+n_2}\,(1-\hat H_1)+\frac{n_2}{n_1+n_2}\,(1-\hat H_2).$$

Set

$$\eta_1=\frac{n_1}{n_1+n_2},\qquad\eta_2=\frac{n_2}{n_1+n_2}.$$

Then we obtain

$$\hat I=\int W\circ\hat{\bar H}\,\Big[\frac{1-\hat H_2^-}{\eta_1(1-\hat H_1^-)+\eta_2(1-\hat H_2^-)}\,d\hat H_1^1-\frac{1-\hat H_1^-}{\eta_1(1-\hat H_1^-)+\eta_2(1-\hat H_2^-)}\,d\hat H_2^1\Big].$$

Denote the data in the pooled $Z$-sample with $T_1<T_2<\ldots<T_N$, where $N=n_1+n_2$. Let $\delta_{(k)}$ be the label of $T_k$. If we introduce

$$\varepsilon_k=\begin{cases}1&\text{if }T_k\text{ comes from the first sample}\\0&\text{if }T_k\text{ comes from the second sample,}\end{cases}$$
then we easily express $\hat I$ as a sum:

$$\hat I=\frac{N}{n_1n_2}\sum_{k=1}^N W\Big(\frac kN\Big)\,\delta_{(k)}\Big[\varepsilon_k-\frac{n_{k1}}{N-k+1}\Big],$$

with

$$n_{k1}=\sum_{j=1}^N 1_{\{T_j\ge T_k\}}\,\varepsilon_j$$

being the number of data in the pooled sample which are at least as large as the $k$-th order statistic and come from the first sample.
Two famous tests:

$$W\equiv 1:\ \text{Mantel-Haenszel Test},\qquad W=\mathrm{Id}:\ \text{Gehan's Wilcoxon Test}.$$
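The sum representation makes the statistic straightforward to program. A sketch (my own illustration, not code from the text; distinct pooled observation times assumed), with $W\equiv 1$ as default:

```python
def linear_rank_stat(z1, d1, z2, d2, W=lambda u: 1.0):
    """Two-sample linear rank statistic in the sum form
    I = N/(n1*n2) * sum_k W(k/N) * delta_(k) * [eps_k - n_k1/(N-k+1)].
    W = 1 gives the Mantel-Haenszel (log-rank) test, W = identity Gehan's Wilcoxon."""
    pooled = sorted([(t, dd, 1) for t, dd in zip(z1, d1)] +
                    [(t, dd, 0) for t, dd in zip(z2, d2)])
    n1, n2 = len(z1), len(z2)
    N = n1 + n2
    total = 0.0
    for k, (t, delta, eps) in enumerate(pooled, start=1):
        # n_k1: pooled observations >= T_k that come from the first sample
        nk1 = sum(e for (s, _, e) in pooled if s >= t)
        total += W(k / N) * delta * (eps - nk1 / (N - k + 1))
    return N / (n1 * n2) * total
```

Exchanging the two samples flips the sign of the statistic, as one would expect.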

9.5 Parametric Modelling in Survival Analysis

As before we consider i.i.d. data $Y_1,\ldots,Y_n$ from a d.f. $F$, which are at risk of being censored from the right. In this section we shall discuss several issues when $F$ comes from a parametric model. Usually this model is given through densities $\mathcal M=\{f_\theta:\theta\in\Theta\}$, i.e., $F$ admits a density $f$ such that $f=f_{\theta_0}$ for some $\theta_0\in\Theta$.

The true parameter $\theta_0$ is unknown and needs to be estimated from the data. If all $Y$'s were observable, then the likelihood function

$$L_n(\theta)=\prod_{i=1}^n f_\theta(Y_i)$$

is available. It is known since Wald (1949) that its maximizer is strongly consistent for $\theta_0$ under weak regularity assumptions on $\mathcal M$.

In Survival Analysis it is more common to model $F$ through its hazard function $\lambda$. This is mainly because now the main interest is to model the risk in the near future, which is better captured by $\lambda$ than by $f$. Thus, for a given model $\mathcal M=\{\lambda_\theta:\theta\in\Theta\}$ of hazard functions, we have, by continuity,

$$\lambda_\theta(x)=\frac{f_\theta(x)}{1-F_\theta(x)}.$$

Conclude that

$$f_\theta(Y)=\lambda_\theta(Y)(1-F_\theta(Y))=\lambda_\theta(Y)\exp\Big(-\int_{-\infty}^Y\lambda_\theta(t)\,dt\Big)=\lambda_\theta(Y)\exp\Big(-\int\frac{1_{\{t\le Y\}}}{1-F_\theta(t)}\,F_\theta(dt)\Big).$$

The integral in the exponent equals the compensator of the Single-Event Process at $t=\infty$, when $F=F_\theta$. If a fully observable sample $Y_1,\ldots,Y_n$ is given, then the likelihood function becomes

$$L_n(\theta)=\prod_{i=1}^n\lambda_\theta(Y_i)\,\exp\Big(-\int\frac{\sum_{i=1}^n 1_{\{t\le Y_i\}}}{1-F_\theta(t)}\,F_\theta(dt)\Big).$$


So far we assumed in this section that the $Y$'s are i.i.d. and observable. As we shall see, in the presence of covariates, very often the assumption that the $Y$'s are identically distributed and come from a homogeneous population cannot be justified. In other words, the hazard function may vary with $i$ for $1\le i\le n$. With the same arguments as before, we obtain for the likelihood function

$$L_n(\theta)=\prod_{i=1}^n\lambda_\theta^i(Y_i)\,\exp\Big(-\int_{-\infty}^{Y_i}\lambda_\theta^i(t)\,dt\Big).$$
Example 9.5.1. A popular choice for modelling $\lambda_\theta^i$ is of the form

$$\lambda_\theta^i(x)=\lambda_\theta(x)\,\rho_i(x).$$

Here, the functions $\lambda_\theta$ are a model for the baseline risk of a population, while the $\rho_i$ describe an individual risk component. If, e.g., $\rho_1(x)>\rho_2(x)$ for all $x$, this implies that, whatever the true parameter $\theta$ would be, person 1 faces a larger risk than person 2.

In medical applications individual information often comes in through a covariate $X_i$ measured at the beginning of the study. In such a case the function $\rho_i$ does not depend on $t$.
Example 9.5.2. Maybe the best studied example is Cox Regression. Here the individual hazard rate is modelled as

$$\lambda_\theta^i(x)=\lambda(x)\exp(\beta^tX_i). \tag{9.5.1}$$

Hence $\theta=(\lambda,\beta)$. The function $\lambda$ is again called the Baseline Hazard Function.

There is one important comment to be made. Since in our setting the variable $X_i$ is a random prognostic factor, the function $\lambda_\theta^i$ in (9.5.1) equals the hazard function of $Y_i$ conditional on the covariate $X_i$. In other words, $\lambda_{\theta_0}^i(x)$ equals the hazard function associated with the conditional d.f. $F(y|X_i)$ of $Y$ given $X_i$. Therefore the function

$$L_n(\theta)=\prod_{i=1}^n\lambda(Y_i)\exp(\beta^tX_i)\,\exp\Big(-\exp(\beta^tX_i)\int_{-\infty}^{Y_i}\lambda(s)\,ds\Big)$$

is called the Conditional Likelihood Function. The full likelihood equals


$$\tilde L_n(\theta)=\prod_{i=1}^n f_\theta(X_i,Y_i),$$

where now $f_\theta$ denotes the joint density of $(X_i,Y_i)$ under $\theta$. We have

$$f_\theta(X_i,Y_i)=f_\theta^1(X_i)\,f_\theta^{2,1}(Y_i|X_i),$$

where in obvious notation

$$f_\theta^1=\text{density of }X_i,\qquad f_\theta^{2,1}=\text{conditional density of }Y_i\text{ given }X_i.$$

For $f_\theta^{2,1}$ we have by model assumption

$$f_\theta^{2,1}(Y_i|X_i)=\lambda_\theta^i(Y_i)\,\exp\Big(-\int_{-\infty}^{Y_i}\lambda_\theta^i(t)\,dt\Big)$$

with $\lambda_\theta^i$ given as in (9.5.1). Hence

$$\tilde L_n(\theta)=L_n(\theta)\prod_{i=1}^n f_\theta^1(X_i).$$

If the density $f_\theta^1$ does not depend on $\theta$, then $\tilde L_n$ and $L_n$ coincide up to a factor not depending on $\theta$. In this case their maximizers coincide. In the general case, $f_\theta^1$ will depend on $\theta$ and the maximizers of $L_n$ and $\tilde L_n$ are distinct. The common name of $L_n$ is also Partial Likelihood.
As another popular model we mention the following example.

Example 9.5.3. Now the individual risk acts as an additive component:

$$\lambda_\theta^i(x)=\lambda_\theta(x)+\rho_i(x).$$

A possible choice for $\rho_i$, when a covariate $X_i$ is given, is $\rho_i(x)=\beta^tX_i$.
So far our discussion focused on the likelihood approach. Interestingly enough, the compensator appeared in the exponent of $L_n$. A direct analysis of the martingale itself has also found a lot of interest. Our approach can be easily extended to non-identically distributed random variables. In such a situation the compensator equals, for $\theta=\theta_0$,

$$\frac1n\sum_{i=1}^n\int_{(-\infty,t]}1_{\{x\le Y_i\}}\,\lambda_\theta^i(x)\,dx.$$

Since $\theta_0$ is unknown, we consider

$$M_n^\theta(t):=F_n(t)-\frac1n\sum_{i=1}^n\int_{(-\infty,t]}1_{\{x\le Y_i\}}\,\lambda_\theta^i(x)\,dx \tag{9.5.2}$$

also for other $\theta$'s. The true parameter therefore may be characterized as being the parameter which makes $M_n^\theta$ a martingale. The process (9.5.2) is called the Martingale Residual Process.
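As a toy illustration of (9.5.2) (my own sketch, not from the text), take the homogeneous exponential model $\lambda_\theta^i(x)=\theta$ for $x\ge 0$; the compensator integral then reduces to $\theta\min(Y_i,t)$:

```python
def martingale_residual(theta, y, t):
    """M_n^theta(t) = F_n(t) - (1/n) sum_i theta * min(Y_i, t)
    for the exponential model lambda_theta(x) = theta on x >= 0 (assumed)."""
    n = len(y)
    Fn = sum(1 for yi in y if yi <= t) / n
    comp = theta * sum(min(yi, t) for yi in y) / n
    return Fn - comp
```

Only for $\theta$ near the true parameter will this process fluctuate around zero.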
So far we assumed that the $Y$'s are observable. When censorship is present we consider the process $1_{\{Z_i\le t,\,\delta_i=1\}}$. Its compensator is given by

$$\int_{(-\infty,t]}\frac{1_{\{x\le Z_i\}}}{1-F_i(x)}\,F_i(dx)=\int_{(-\infty,t]}1_{\{x\le Z_i\}}\,\lambda_i(x)\,dx.$$

The residual martingale becomes

$$M_n^\theta(t)=H_n^1(t)-\frac1n\sum_{i=1}^n\int_{(-\infty,t]}1_{\{x\le Z_i\}}\,\lambda_\theta^i(x)\,dx. \tag{9.5.3}$$

Chapter 10

Time To Event Data

10.1 Sampling Designs in Survival Analysis

It is worthwhile recalling the scenario which gave rise to data censored from the right. After entry into a study an individual is observed over a period of time until a certain event can be observed. Due to an early end of the study or due to follow-up losses, censorship may then occur. To determine the exact value of $Z_i$, however, it is necessary to monitor the history of a patient without gaps. In a real-life situation things may differ, however, in that patients appear for a check only at times $T_0<T_1<T_2<\ldots$ and monitoring is only possible then. Such a situation gives rise to the following sampling design.

Example 10.1.1. Let $T_0<T_1<\ldots$ be an increasing sequence of monitoring times. Suppose that at time $T_j$ it became known that a default took place at time $R$ between $T_{j-1}$ and $T_j$. The associated indicators

$$\delta_i=\begin{cases}1&\text{if }T_{i-1}<R\le T_i\\0&\text{else}\end{cases}$$

together with the monitoring times $T_i$ constitute the available information in the form $(T_k,\delta_k)$, $k=1,2,\ldots$ In the above example, $\delta_j=1$ while all other labels vanish. At time $t$, only those $(T_k,\delta_k)$ with $T_k\le t$ are available.

[Figure 10.1.1: Lifetime Data — monitoring times $T_0<T_1<\ldots<T_5$ on a time axis]

Example 10.1.2. Imagine one is interested in the age $Y$ at which a young child develops a certain ability. Since a correct measurement of $Y$ may require some experience, a group of children is observed over time in a kindergarten. Let $U$ denote the age of a child at which it entered the group, and let $V$ be the time of its exit. In such a situation we are at risk of not observing $Y$, the quantity of interest, for two possible reasons. If $Y<U$, the ability was already developed before entering the study, so that the available quantity is $U$. Similarly, if $V<Y$. Consequently, the observable quantity becomes

$$Z=\begin{cases}U&\text{if }Y<U\\Y&\text{if }U\le Y\le V\\V&\text{if }V<Y\end{cases}$$

together with its label

$$\delta=\begin{cases}1&\text{if }Y<U\\2&\text{if }U\le Y\le V\\3&\text{if }V<Y.\end{cases}$$

This sampling design is called Double Censorship.


Example 10.1.3. In epidemiological studies one often faces so-called Truncation Effects. For example, in an AIDS study one may be interested in the time elapsed from infection until diagnosis. Suppose that the study ends at time $c$ and available information is to be analyzed.
[Figure 10.1.2: Truncated Data — three cases with incubation periods $Y_1$, $Y_2$, $Y_3$ ending at diagnosis times $w_1$, $w_2$, $w_3$]

If, e.g., infection is known to be caused by blood transfusion, then the infection time $\sigma$ is known for case one, as is $Y_1$. For obvious reasons, case three is not included, while case two has not been diagnosed at time $c$, so it also cannot be included. More formally, put

$$\sigma=\text{Time of Infection},\qquad w=\text{Time of Diagnosis},$$

and let $Y=w-\sigma$ be the Incubation Period. A case is included if and only if $w\le c$ or, equivalently, iff

$$Y\le U:=c-\sigma. \tag{10.1.1}$$

The variable $U$ is called a Truncation Variable. If (10.1.1) is satisfied then both $Y$ and $U$ are observed. Compare this situation with right censorship, where either $Y$ or $C$ is known. Under right truncation cases with a smaller $Y$ have a better chance to be included in the study. This is a kind of length-biasedness which should be respected in the statistical analysis.
Example 10.1.4. In the previous example a datum was available and included in the analysis whenever $Y\le U$. This required monitoring of all cases diagnosed before $c$. In the present example imagine that the study only begins at time $c$ and that only those cases are of interest where infection took place before $c$ while not being diagnosed at $c$. In Figure 10.1.2 there is only one such case, namely case two. Since $Y>U$ it cannot be observed at $c$. Since compared with the previous example $c$ is now the beginning and not the end of the study, we need to define the sampling plan after $c$. If the observational period is $[c,\infty)$, then each case will be registered with probability one, subject to $\sigma+Y=w\ge c$. For such a $(Y,\sigma)$ we write $(Y^0,\sigma^0)$. We then have, for $t\le c$,

$$P(\sigma^0\le t,\,Y^0\le y)=P(\sigma\le t,\,Y\le y\mid\sigma+Y\ge c).$$
Example 10.1.5 (Sampling in a Parallelogram). The situation now is close to the previous one, with one important difference. The observation period starting at $c$ is not $[c,\infty)$ but terminates at $c+a$. Set $T_1=c$ and $T_2=c+a$. We thus observe $\sigma$ and $Y$ under the condition $\sigma+Y\ge T_1$ only if $\sigma+Y\le T_2$. Graphically, this means that we observe $(Y,\sigma)$ only if it falls into a parallelogram as depicted below.

[Figure 10.1.3: Sampling in a Parallelogram — the region $T_1\le\sigma+Y\le T_2$ in the $(\sigma,Y)$-plane]

Again the distribution of the observed $(Y^0,\sigma^0)$ is an appropriate conditional distribution of $(Y,\sigma)$:

$$P(\sigma^0\le t,\,Y^0\le y)=P(\sigma\le t,\,Y\le y\mid c\le\sigma+Y\le c+a).$$
Example 10.1.6. In the situation described in Example 10.1.4 it may well be that though $\sigma+Y\ge c$ the variable $Y$ is not observed due to right censorship. Hence given $\sigma+Y\ge c$ only $(\sigma,\min(Y,C),\delta)$ with $\delta=1_{\{Y\le C\}}$ are observable. This situation is called left truncation and right censorship.

Example 10.1.7. It may be that data are available only at time $c$ but ongoing observation is not possible. In this case $Y$ is never observable. One possibility is to study the age at time $c$. In other words, of interest is the distribution

$$P(c-\sigma\le t\mid\sigma+Y\ge c).$$
Summary. In this section we reviewed several sampling situations in survival analysis. So far only right censorship was discussed in detail. When
truncation comes into play, empirical estimators are substitutes for conditional probabilities. The main question will be one of reconstructing unconditional probabilities from these estimators.

10.2 Nonparametric Estimation in Counting Processes

The Single-Event Process

$$S_t=1_{\{Y\le t\}}$$

and its modification

$$S_t=1_{\{Z\le t,\,\delta=1\}}$$

are two simple examples of a type of stochastic process which in its general form is called a Counting Process. Another famous example is the Poisson Process. Not to forget the extensions of $S_t$ to larger sample sizes. The general concept of a counting process constitutes a perfect framework for modeling Time to Event data.
In its most general form a counting process is based on a sequence $T_1<T_2<T_3<\ldots$ of increasing random variables such that $T_i\to\infty$. This assumption guarantees that in any compact interval we only have finitely many jumps. If we only have finitely many points to distribute, the sequence is assumed to be finite. The associated counting process becomes

$$N_t=\sum_{i=1}^\infty 1_{\{T_i\le t\}}.$$

It can be shown that every $(N_t)_t$ admits a compensator and thus an innovation martingale. In the examples discussed so far the compensator admitted an intensity of the form

$$x\mapsto 1_{\{x\le Y\}}\,\lambda(x)\quad\text{resp.}\quad x\mapsto 1_{\{x\le Z\}}\,\lambda(x),$$

where

$$\frac{dF}{1-F}=\lambda\,dx.$$

For a time-homogeneous Poisson Process the intensity is a constant $\lambda$, while a time-heterogeneous Poisson Process has a deterministic intensity $\lambda=\lambda(x)$. These examples motivated researchers to study point processes with compensators which are multiplicative in the following sense.
Definition 10.2.1. A counting process is said to admit a multiplicative compensator, if there is a deterministic function $\lambda=\lambda(x)$ and a predictable process $Y=Y(x)$ such that

$$M_t:=N_t-\int_{(-\infty,t]}\lambda(x)\,Y(x)\,dx \tag{10.2.1}$$

is a martingale.


In the case of the Single-Event process, $Y(x)=1_{\{x\le Y\}}$ and $\lambda(x)$ is the hazard function of $F$. In this section we study the problem of estimating the function

$$\Lambda(t)=\int_{(-\infty,t]}\lambda(x)\,dx.$$

First work on this problem was due to Nelson (1969) and Aalen (1975). Therefore the resulting estimator is called the Nelson-Aalen Estimator. To start with, rewrite (10.2.1) in terms of differentials:

$$dM_t=dN_t-\lambda Y\,dt$$

or

$$\lambda Y\,dt=dN_t-dM_t. \tag{10.2.2}$$

If $Y(t)>0$, then

$$\lambda(t)\,dt=Y^{-1}(t)\,dN_t-Y^{-1}(t)\,dM_t \tag{10.2.3}$$

and therefore

$$\Lambda(t)=\int_{(-\infty,t]}Y^{-1}(x)\,N(dx)-\int_{(-\infty,t]}Y^{-1}(x)\,M(dx).$$

Now, $(M_t)_t$ is a martingale and $Y$ is predictable. It follows that the second process is a centered martingale. Neglecting this noisy part, the estimator of $\Lambda$ becomes

$$\hat\Lambda(t)=\int_{(-\infty,t]}Y^{-1}(x)\,N(dx)=\sum_{T_i\le t}Y^{-1}(T_i).$$

If $Y(x)$ may attain the value zero, we need to introduce the function $J(t)=1_{\{Y(t)>0\}}$ and consider the equation

$$J(t)\lambda(t)\,dt=\frac{J(t)}{Y(t)}\,dN_t-\frac{J(t)}{Y(t)}\,dM_t.$$

The estimator

$$\hat\Lambda(t)=\int_{(-\infty,t]}\frac{J(x)}{Y(x)}\,dN_x$$



therefore is an estimator of

(t) =

J(x)(x)dx.
(,t]

We have

(t) E(t)
=

(x)P(Y (x) > 0)dx.


(,t]

Therefore, in general, (t) is biased. The bias becomes small when P(Y (x) =
0) is small for x t.
When we have independent replications of $(N_t)_t$ the procedure needs to be slightly modified. In this case the arguments for (10.2.2) and (10.2.3) now yield

$$\Big(\sum_{i=1}^n Y_i(t)\Big)\lambda(t)\,dt=d\Big(\sum_{i=1}^n N_{it}\Big)-d\Big(\sum_{i=1}^n M_{it}\Big)$$

and finally

$$\hat\Lambda(t)=\int_{(-\infty,t]}\frac{J(x)}{\sum_{i=1}^n Y_i(x)}\,d\Big(\sum_{i=1}^n N_{ix}\Big).$$

For right-censored data,

$$Y_i(x)=1_{\{Z_i\ge x\}}\quad\text{and}\quad N_i(x)=1_{\{Z_i\le x,\,\delta_i=1\}}.$$

Conclude that

$$\hat\Lambda(t)=\int_{(-\infty,t]}\frac{J(x)}{1-H_n(x-)}\,H_n^1(dx).$$

The correction through $J(x)$ is only necessary for $x>\max(Z_i:1\le i\le n)$. But $H_n^1$ has no mass there, so that after all

$$\hat\Lambda(t)=\int_{(-\infty,t]}\frac{H_n^1(dx)}{1-H_n(x-)}.$$
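The final formula is a one-pass computation over the ordered data. A sketch (my own illustration, not code from the text; distinct observation times assumed):

```python
def nelson_aalen(z, d):
    """Nelson-Aalen estimator Lambda_hat(t) = int_(-inf,t] dH_n^1/(1 - H_n^-),
    evaluated at the ordered observed times; distinct z-values assumed."""
    pairs = sorted(zip(z, d))
    n = len(pairs)
    L, out = 0.0, []
    for i, (zi, di) in enumerate(pairs):
        L += di / (n - i)      # jump (di/n) divided by the at-risk fraction (n-i)/n
        out.append((zi, L))
    return out
```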

10.3 Nonparametric Testing in Counting Processes

In this section we apply the concepts developed so far to testing hypotheses. For this, imagine we observe counting processes $N_1,N_2,\ldots,N_k$ with $k$ fixed.


Each $N_j$ may itself be a sum of individual counting processes, as, e.g.,

$$N(t)=\sum_{i=1}^n 1_{\{Z_i\le t,\,\delta_i=1\}}.$$

The best is to view each $N_j$ as the result of several measurements under therapy $j$, $1\le j\le k$. We assume that each $N_j$ admits a multiplicative intensity $\lambda_jY_j$, i.e., the innovation martingale of $N_j$ equals

$$M_j(t)=N_j(t)-\int_{(-\infty,t]}\lambda_j(x)\,Y_j(x)\,dx.$$

Denote with

$$\Lambda_j(t)=\int_{(-\infty,t]}\lambda_j(x)\,dx$$

the cumulative hazard function associated with $\lambda_j$. One question which has been studied in detail in the literature is how to test the hypothesis

$$H_0:\lambda_1=\ldots=\lambda_k=\lambda_0$$

with $\lambda_0$ unspecified.

In the following we assume that $N_1,\ldots,N_k$ are independent. The first step in our analysis is to form the Grand Sum Process

$$N_\bullet=\sum_{j=1}^k N_j.$$

Similarly

$$Y_\bullet=\sum_{j=1}^k Y_j.$$

Denote with $\mathcal F_t=\sigma(\mathcal F_{1t},\ldots,\mathcal F_{kt})$, where $(\mathcal F_{jt})_t$ is the filtration pertaining to the $j$-th process. By the assumed independence, the innovation martingale of $N_\bullet$ becomes $M_\bullet$, defined through

$$dM_\bullet=dN_\bullet-\sum_{j=1}^k\lambda_jY_j\,dt\equiv dN_\bullet-(\lambda Y)_\bullet\,dt.$$

This leads to

$$dN_\bullet=(\lambda Y)_\bullet\,dt+dM_\bullet.$$

For each single $j$ we similarly obtain

$$dN_j=\lambda_jY_j\,dt+dM_j,\qquad 1\le j\le k,$$

respectively

$$\frac{dN_j}{Y_j}=\lambda_j\,dt+\frac{dM_j}{Y_j}\qquad(\text{if }Y_j(t)>0). \tag{10.3.1}$$

The crucial argument comes now: Under $H_0$,

$$(\lambda Y)_\bullet=\lambda_0\,Y_\bullet,$$

whence

$$\frac{dN_\bullet}{Y_\bullet}=\lambda_0\,dt+\frac{dM_\bullet}{Y_\bullet}. \tag{10.3.2}$$

Under the null hypothesis, the drift parts in (10.3.1) and (10.3.2) coincide, so that $dN_j/Y_j$ and $dN_\bullet/Y_\bullet$ only differ in their martingale parts. The difference between these two processes we can weight appropriately and then integrate. Then we come up with a linear statistic

$$Z_j(t)=\int_{(-\infty,t]}K_j(x)\,J_j(x)\Big[\frac{N_j(dx)}{Y_j(x)}-\frac{N_\bullet(dx)}{Y_\bullet(x)}\Big].$$

Under the null hypothesis, each $Z_j$ is "almost" centered. We may next consider the vector-valued process

$$Z_t=(Z_1(t),\ldots,Z_k(t))$$

and define a $\chi^2$-type statistic

$$\chi_t^2=Z_t\,\hat\Sigma_t^{-1}\,Z_t^t,$$

where $\hat\Sigma_t$ is an estimator of the covariance of $Z_t$. In the case that each $N_j$ is a sum of $n_j$ independent processes, we can do some asymptotic analysis and show that in the limit $\chi^2$ has a $\chi^2$-distribution with $k-1$ degrees of freedom (without proof).

Most popular weight functions are of the type

$$K_j(x)=K(x)\,Y_j(x).$$


Here $K$ is a predictable process only depending on $N_\bullet$ and $Y_\bullet$. For such a $K_j$ we get

$$Z_j(t)=\int_{(-\infty,t]}K(x)\,N_j(dx)-\int_{(-\infty,t]}K(x)\,\frac{Y_j(x)}{Y_\bullet(x)}\,N_\bullet(dx)$$
$$=\int_{(-\infty,t]}K(x)\,M_j(dx)-\int_{(-\infty,t]}K(x)\,\frac{Y_j(x)}{Y_\bullet(x)}\,M_\bullet(dx)=\sum_{l=1}^k\int_{(-\infty,t]}K(x)\Big[\delta_{lj}-\frac{Y_j(x)}{Y_\bullet(x)}\Big]\,M_l(dx).$$

Here $\delta_{lj}$ is the Kronecker symbol. In the literature two of several choices for $K$ are

$$K(x)=1_{\{Y_\bullet(x)>0\}}\qquad\text{and}\qquad K(x)=Y_\bullet(x).$$
We study $Z_j$ in a little greater detail for right-censored data. In this case

$$Z_j(t)=n_j\int_{(-\infty,t]}K(x)\,H_{n_j}^1(dx)-\sum_{l=1}^k\frac{n_jn_l}{N}\int_{(-\infty,t]}K(x)\,\frac{1-H_{n_j}(x-)}{1-\hat{\bar H}(x-)}\,H_{n_l}^1(dx),$$

with $N=n_1+n_2+\ldots+n_k$ and $\hat{\bar H}$ the empirical d.f. of the pooled sample. For $K=1-\hat{\bar H}^-$ the term simplifies to become

$$Z_j(t)=n_j\int_{(-\infty,t]}(1-\hat{\bar H}(x-))\,H_{n_j}^1(dx)-\sum_{l=1}^k\frac{n_jn_l}{N}\int_{(-\infty,t]}(1-H_{n_j}(x-))\,H_{n_l}^1(dx).$$

All integrals are empiricals which can be readily computed.
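To illustrate that claim, a sketch (my own illustration; function name and data layout are assumptions) evaluating $Z_j(t)$ for the weight $K=1-\hat{\bar H}^-$:

```python
def Zj_stat(samples, j, t):
    """Z_j(t) with weight K = 1 - Hbar_n^-:
    Z_j = sum_{x uncens. in sample j, x<=t} (1 - Hbar(x-))
        - sum_l (n_j n_l / N) int (1 - H_{n_j}(x-)) dH_{n_l}^1 over x <= t.
    samples: list of (z_list, d_list); all observation times distinct."""
    ns = [len(zl) for zl, _ in samples]
    N = sum(ns)
    pooled = [s for zl, _ in samples for s in zl]
    zj, dj = samples[j]
    # first term: jumps of H_{n_j}^1 are 1/n_j, cancelling the factor n_j
    first = sum(dd * sum(1 for s in pooled if s >= x) / N
                for x, dd in zip(zj, dj) if x <= t)
    second = 0.0
    for zl, dl in samples:
        second += (ns[j] / N) * sum(
            dd * sum(1 for s in zj if s >= x) / ns[j]
            for x, dd in zip(zl, dl) if x <= t)
    return first - second
```

Note that these statistics sum to zero over $j$, reflecting that only $k-1$ of them are free.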


Now, if $n_j\to\infty$ such that

$$\frac{n_j}{N}\to\kappa_j\quad\text{for }1\le j\le k,$$

then

$$\lim_{N\to\infty}\frac{Z_j(t)}{N}=\kappa_j\int_{(-\infty,t]}(1-\bar H(x-))\,H_j^1(dx)-\sum_{l=1}^k\kappa_j\kappa_l\int_{(-\infty,t]}(1-H_j(x-))\,H_l^1(dx),$$

where now $\bar H=\sum_{l=1}^k\kappa_lH_l$.


Under the null hypothesis the limit should be zero. Actually,

$$dH_l^1=(1-G_l^-)\,dF_l=(1-H_l^-)\,d\Lambda_l=(1-H_l^-)\,d\Lambda_0,$$

whence

$$\sum_{l=1}^k\kappa_l\,dH_l^1=(1-\bar H^-)\,d\Lambda_0.$$

Conclude

$$\sum_{l=1}^k\kappa_l\int_{(-\infty,t]}(1-H_j^-)\,dH_l^1=\int_{(-\infty,t]}(1-H_j^-)(1-\bar H^-)\,d\Lambda_0=\int_{(-\infty,t]}(1-\bar H^-)\,dH_j^1.$$

10.4 Maximum Likelihood Procedures

Recall that when we observe independent identically distributed random variables $Y_1,\ldots,Y_n$ with hazard function $\lambda_\theta$, $\theta=\theta_0$, then the likelihood function becomes

$$L_n(\theta)=\prod_{i=1}^n\lambda_\theta(Y_i)\,\exp\Big(-\int_{(-\infty,Y_i]}\lambda_\theta(t)\,dt\Big).$$
Taking logarithms leads to

$$\ln L_n(\theta)=\sum_{i=1}^n\Big[\ln\lambda_\theta(Y_i)-\int 1_{\{t\le Y_i\}}\,\lambda_\theta(t)\,dt\Big]. \tag{10.4.1}$$

For a general counting process $N$ with multiplicative intensity we get

$$\ln L_n(\theta)=\int\ln\lambda_\theta(t)\,N(dt)-\int Y(t)\,\lambda_\theta(t)\,dt. \tag{10.4.2}$$

The analysis of the MLE is more or less parallel to the classical theory. For right-censored data (10.4.2) takes on the form

$$\ln L_n(\theta)=\sum_{i=1}^n\Big[\delta_i\ln\lambda_\theta(Z_i)-\int_{(-\infty,Z_i]}\lambda_\theta(t)\,dt\Big].$$


Of more interest is the case when $\lambda$ depends on covariates. The best studied case is

$$\lambda(x)\equiv\lambda_\theta^i(x)=\lambda(x)\exp[\beta^tX_i]. \tag{10.4.3}$$

Since

$$\frac{\lambda_\theta^i(x)}{\lambda_\theta^j(x)}=\exp\big[\beta^t(X_i-X_j)\big]$$

does not depend on $x$, this model is called the Cox Proportional Hazards Model. The partial log-likelihood function becomes

$$\ln L_n(\theta)=\sum_{i=1}^n\Big[\delta_i\ln\lambda(Z_i)+\delta_i\beta^tX_i-\exp(\beta^tX_i)\int_{-\infty}^{Z_i}\lambda(t)\,dt\Big]. \tag{10.4.4}$$

If the baseline function $\lambda$ does not depend on a finite-dimensional parameter but may be any function, we arrive at a semiparametric model. In this case the parametric $\lambda_\theta$ needs to be replaced with an arbitrary $\lambda=\lambda(t)$, and maximization takes place w.r.t. $\lambda$ and $\beta$.
The following presents a detailed analysis of the Cox model. For a reference, see Tsiatis (1981). Assume that

$$\lambda_\theta^i(x)\equiv\lambda_i(x)=\lambda(x)\exp[\beta^tX_i]$$

and that

$$Y_i\ \text{and}\ C_i\ \text{are independent conditionally on}\ X_i. \tag{10.4.5}$$

This implies that the conditional d.f.s of $Y$ and $C$,

$$F(y|x)\equiv P(Y\le y|X=x)\quad\text{and}\quad G(y|x)\equiv P(C\le y|X=x),$$

satisfy

$$1-H(y|x)=(1-F(y|x))(1-G(y|x)),$$

where $H(y|x)=P(Z\le y|X=x)$. The hazard function of $F(y|x)$ is by assumption equal to (10.4.3), namely $\lambda(y)\exp(\beta^tx)$. There will be no conditions on $G(\cdot|x)$. Similar to the unconditional case, we have to introduce the (conditional) sub-distribution

$$H^1(y|x)=P(Z\le y,\,\delta=1|X=x).$$


Because of (conditional) independence,

$$H^1(y|x)=E\big[1_{\{Z\le y,\,Y\le C\}}\,|\,X=x\big]=\int_{(-\infty,y]}[1-G(z-|x)]\,F(dz|x)$$
$$=\int_{(-\infty,y]}[1-H(z-|x)]\,\lambda(z)\exp(\beta^tx)\,dz=\exp(\beta^tx)\int_{(-\infty,y]}[1-H(z-|x)]\,\lambda(z)\,dz. \tag{10.4.6}$$

As before, denote with

$$H^1(y)=P(Z\le y,\,\delta=1)$$

the unconditional sub-distribution, and let $c=c(x)$, $x\in\mathbb R^p$, be the density of the $p$-dimensional covariate $X$. Then

$$H^1(y)=\int H^1(y|x)\,c(x)\,dx=\int\int_{(-\infty,y]}\exp(\beta^tx)\,[1-H(z-|x)]\,\lambda(z)\,c(x)\,dz\,dx. \tag{10.4.7}$$

Lemma 10.4.1. The function $H^1$ is differentiable in $y$, and its derivative equals

$$[H^1(y)]'=\lambda(y)\int[1-H(y-|x)]\,\exp(\beta^tx)\,c(x)\,dx.$$

Proof. This follows from (10.4.7).
In the following, let $\varphi$ be any function defined on $\mathbb R^p$. Set

$$E_\varphi(y)\equiv E\big[\varphi(X)1_{\{Z\ge y\}}\big]=E\big[\varphi(X)P(Z\ge y|X)\big]=\int\varphi(x)\,[1-H(y-|x)]\,c(x)\,dx.$$

Finally, put

$$E_\varphi^1(y)=E\big[\varphi(X)1_{\{Z\le y,\,\delta=1\}}\big]=E\big[\varphi(X)H^1(y|X)\big]=\int\varphi(x)\,H^1(y|x)\,c(x)\,dx.$$

By (10.4.6),

$$E_\varphi^1(y)=\int\int_{(-\infty,y]}\varphi(x)\,[1-H(z-|x)]\,\exp(\beta^tx)\,\lambda(z)\,c(x)\,dz\,dx.$$

Lemma 10.4.2. The function $E_\varphi^1$ is differentiable with derivative

$$[E_\varphi^1(y)]'=\lambda(y)\int\varphi(x)\,[1-H(y-|x)]\,\exp(\beta^tx)\,c(x)\,dx.$$

Proof. As before.
If we take for $\varphi$ the function

$$\varphi_0(x)=\exp(\beta^tx)$$

we get two important relations.

Corollary 10.4.3.

$$\lambda(y)=\frac{[H^1(y)]'}{E_{\varphi_0}(y)}. \tag{10.4.8}$$

Equation (10.4.8) provides us with a representation of the unknown baseline function $\lambda$ in terms of two quantities which will be estimable. The proof of (10.4.8) is a direct consequence of Lemma 10.4.1 and the definition of $E_{\varphi_0}$. Furthermore, we get for any $\varphi$

$$[E_\varphi^1(y)]'=\frac{[H^1(y)]'\,E_{\varphi\varphi_0}(y)}{E_{\varphi_0}(y)}, \tag{10.4.9}$$

which is an easy consequence of Lemma 10.4.1, Lemma 10.4.2 and the definitions of the involved functions.
It is time to go back to the partial log-likelihood function (10.4.4). Write $X_i=(X_{i1},\ldots,X_{ip})^t$. If we take the partial derivative of (10.4.4) w.r.t. $\beta_j$, $1\le j\le p$, we obtain

$$\frac{\partial\ln L_n}{\partial\beta_j}=\sum_{i=1}^n\Big[\delta_iX_{ij}-X_{ij}\exp(\beta^tX_i)\int_{(-\infty,Z_i]}\lambda(t)\,dt\Big]. \tag{10.4.10}$$

Denote with $\pi_j$ the $j$-th projection of $\mathbb R^p$ onto $\mathbb R$. With $\varphi=\pi_j$, (10.4.10) takes on the form

$$\frac{\partial\ln L_n}{\partial\beta_j}=\sum_{i=1}^n\Big[\delta_i\varphi(X_i)-\varphi(X_i)\exp(\beta^tX_i)\int_{(-\infty,Z_i]}\lambda(t)\,dt\Big]. \tag{10.4.11}$$

The right-hand side also makes sense for β's other than the true parameter. Therefore, from now on, we denote the true parameter with β_0. From (10.4.8) and (10.4.9) we get, by Fubini,

    E[ φ(X_i) φ_0(X_i) ∫_(−∞,Z_i] λ(t) dt ] = ∫ λ(t) E[ φ(X_i) φ_0(X_i) 1{t ≤ Z_i} ] dt
                                            = ∫ λ(t) E_{φφ_0}(t) dt = ∫ [E^1_φ(t)]' dt
                                            = E^1_φ(∞) = E[ φ(X) 1{δ = 1} ].

Conclude

Lemma 10.4.4. At β = β_0 we have

    E[ ∂ ln L_n / ∂β_j ] = 0   for 1 ≤ j ≤ p.

The lemma asserts that β = β_0 is a solution of a certain equation. It suggests that β_0 should be estimated by solving

    ∂ ln L_n / ∂β_j = 0   for 1 ≤ j ≤ p.

Unfortunately, (10.4.11) contains the unknown function λ. A possible way out of this dilemma is to use (10.4.8). From this one gets

    ∫_(−∞,Z_i] λ(t) dt = ∫_(−∞,Z_i] H^1(dt) / E_{φ_0}(t).

H^1 and E_{φ_0} may be replaced with

    H_n^1(y) = (1/n) Σ_{i=1}^n 1{Z_i ≤ y, δ_i = 1}

and

    Ê(y) = (1/n) Σ_{i=1}^n exp(β^t X_i) 1{Z_i ≥ y},

with Ê still containing the parameter β. This leads to a term l_j(β) replacing (1/n) ∂ ln L_n / ∂β_j in the following way:

    l_j(β) = (1/n) Σ_{i=1}^n δ_i X_{ij} − (1/n) Σ_{i=1}^n X_{ij} exp(β^t X_i) (1/n) Σ_{k=1}^n δ_k 1{Z_k ≤ Z_i} Ê^{−1}(Z_k)
           = (1/n) Σ_{i=1}^n δ_i X_{ij} − (1/n) Σ_{k=1}^n δ_k [ Σ_{i=1}^n X_{ij} exp(β^t X_i) 1{Z_i ≥ Z_k} ] / [ Σ_{i=1}^n exp(β^t X_i) 1{Z_i ≥ Z_k} ].

We need to solve

    l_j(β) = 0   for 1 ≤ j ≤ p.

Denote with β̂ = (β̂_1, …, β̂_p)^t its solution.
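Numerically, β̂ can be obtained by a Newton iteration on the score l_j(β) displayed above. The following is a minimal sketch (NumPy; the simulated data and all names are ours, not the text's), using a numerical Jacobian; in practice one would use the analytic information matrix.

```python
import numpy as np

def cox_score(beta, Z, delta, X):
    """l(beta) from the text: l_j = (1/n) sum_i delta_i X_ij
       - (1/n) sum_k delta_k * [sum_i X_ij e^{b'X_i} 1{Z_i >= Z_k}]
                              / [sum_i e^{b'X_i} 1{Z_i >= Z_k}]."""
    n = len(Z)
    w = np.exp(X @ beta)                          # exp(beta^t X_i)
    A = (Z[:, None] >= Z[None, :]).astype(float)  # A[i,k] = 1{Z_i >= Z_k}
    denom = w @ A                                 # risk-set sums at each Z_k
    numer = (X * w[:, None]).T @ A                # covariate-weighted risk sums
    return (delta @ X) / n - ((numer / denom) @ delta) / n

def solve_score(Z, delta, X, iters=30, eps=1e-6):
    """Newton iteration with a numerical Jacobian for l(beta) = 0."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        s = cox_score(beta, Z, delta, X)
        J = np.column_stack([
            (cox_score(beta + eps * e, Z, delta, X) - s) / eps
            for e in np.eye(len(beta))])
        beta = beta - np.linalg.solve(J, s)
    return beta

# Simulated data from a Cox model with baseline hazard 1 and beta_0 = 0.8
rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 1))
T = rng.exponential(1.0 / np.exp(0.8 * X[:, 0]))  # true lifetimes
C = rng.exponential(2.0, size=n)                  # censoring times
Z, delta = np.minimum(T, C), (T <= C).astype(float)
beta_hat = solve_score(Z, delta, X)
```

The concavity of the partial log-likelihood makes the Newton step well behaved here; β̂ should land close to the true value 0.8.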
The rest of this section is a little technical. The lemmas will be needed to prove consistency of β̂.

Lemma 10.4.5. With probability one, we have

    sup_{y∈R} |H_n^1(y) − H^1(y)| → 0,        (10.4.12)

    sup_{y∈R} |Ê(y) − E_φ(y)| → 0        (10.4.13)

and

    sup_{y∈R} |Ê^1(y) − E^1_φ(y)| → 0.        (10.4.14)

Here Ê and Ê^1 are the estimators of E_φ and E^1_φ, respectively.

Proof. Follows from the Strong Law of Large Numbers and its extension to uniform convergence.

Lemma 10.4.6. For each β, with probability one

    lim_{n→∞} l_j(β) = E^1_{π_j}(∞) − ∫ [ E_{π_j φ}(y) / E_φ(y) ] H^1(dy),

where φ(x) = exp(β^t x).

Recall that the right-hand side in Lemma 10.4.6 equals zero for β = β_0.

Theorem 10.4.7 (Tsiatis). With probability one,

    lim_{n→∞} β̂ = β_0.

The last point to discuss is estimation of Λ. We already showed that

    Λ(t) = ∫_(−∞,t] λ(s) ds = ∫_(−∞,t] E_{φ_0}^{−1} dH^1.

From this the empirical estimate of Λ becomes

    Λ̂(t) = ∫_(−∞,t] Ê^{−1} dH_n^1
         = (1/n) Σ_{i=1}^n δ_i Ê^{−1}(Z_i) 1{Z_i ≤ t}
         = Σ_{i=1}^n δ_i 1{Z_i ≤ t} / Σ_{k=1}^n exp(β̂^t X_k) 1{Z_k ≥ Z_i}.

From the previous results we obtain

Corollary 10.4.8. For each t ∈ R we have, with probability one,

    lim_{n→∞} Λ̂(t) = Λ(t).

We finally discuss the problem of estimating the case-specific survival probabilities

    P(Y > t | X = x) = 1 − F(t|x).

Under the Cox model,

    1 − F(t|x) = exp[ − ∫_(−∞,t] λ(s) exp(β^t x) ds ]
               = exp[ − exp(β^t x) ∫_(−∞,t] λ(s) ds ].

Hence the estimator becomes

    1 − F̂(t|x) = exp[ − exp(β̂^t x) Λ̂(t) ].
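The estimator Λ̂ and the plug-in survival estimator are direct to compute from the formulas above. A minimal sketch (NumPy; names ours): with β̂ = 0 and no censoring, Λ̂ reduces to the Nelson-Aalen estimator, which gives a hand-checkable example.

```python
import numpy as np

def Lambda_hat(t, Z, delta, X, beta_hat):
    """Sum over uncensored Z_i <= t of
    1 / sum_k exp(beta_hat^t X_k) 1{Z_k >= Z_i}  (the formula above)."""
    w = np.exp(X @ beta_hat)
    risk = np.array([(w * (Z >= zi)).sum() for zi in Z])
    return (((delta == 1) & (Z <= t)) / risk).sum()

def surv_hat(t, x, Z, delta, X, beta_hat):
    """1 - F_hat(t|x) = exp[-exp(beta_hat^t x) * Lambda_hat(t)]."""
    return np.exp(-np.exp(x @ beta_hat) * Lambda_hat(t, Z, delta, X, beta_hat))

# Toy data, fully uncensored, one covariate that plays no role (beta_hat = 0):
Z = np.array([1.0, 2.0, 3.0])
delta = np.ones(3)
X = np.zeros((3, 1))
# Lambda_hat(3) = 1/3 + 1/2 + 1/1 = 11/6, the Nelson-Aalen value.
```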

10.5 Right-Truncation

Let Y_1, …, Y_N be a sequence of i.i.d. random variables from some unknown d.f. F. Under right-truncation one observes Y_i only if it falls below a threshold Z_i, 1 ≤ i ≤ N. Here, also Z_1, …, Z_N are i.i.d. from some unknown d.f. G such that for each 1 ≤ i ≤ N

    Y_i and Z_i are independent.        (10.5.1)

If Y_i ≤ Z_i, then both Y_i and Z_i are observed. Denote with

    n := Σ_{i=1}^N 1{Y_i ≤ Z_i}

the (random) number of actually observed pairs (Y_i, Z_i) with Y_i ≤ Z_i. The number N typically is unknown, so that the truncation probability 1 − α, with

    α = P(Y_1 ≤ Z_1),

cannot be easily estimated by

    α̂ = (1/N) Σ_{i=1}^N 1{Y_i ≤ Z_i}.

Denote the actually observed pairs with (Y_i^0, Z_i^0), i.e.,

    (Y_i^0, Z_i^0) = (Y_i, Z_i)   if Y_i ≤ Z_i.        (10.5.2)

(10.5.2)

By the constraint in (10.5.2) we only observe a pair (Yi , Zi ) if it falls above


y = z. We thus see that again we have only information in partial form. To
perform a proper statistical analysis of (Yi , Zi , 1 i n, say, we need to
study the relation between the observed and the original pairs. First, for
each real t,
{Y t, Y Z} = {Y t, Y Z}.
i.e., taking intersections with {Y Z}, we know that Y t i Y t.
Furthermore, the events we can observe are at time t of the type
Gn (t) = ({Yi < s < Zi }, {s Yi Zi } : t s, 1 i n) .
Note that the ltration is nonincreasing as t . Hence, rather with martingale, we have to deal with reverse-time martingales. We rst derive the
Doob-Meyer decomposition of the process
t 1{tY Z} .

As usual we first consider the discrete case. So, let

    t = t_{n+1} < t_n < t_{n−1} < … < t_1 < ∞ = t_0

be a finite grid. Then we have

    1{t_k ≤ Y ≤ Z} = 1{t_{k−1} ≤ Y ≤ Z} + 1{t_k ≤ Y < t_{k−1} < Z}
                     + 1{t_k ≤ Y ≤ Z ≤ t_{k−1}, Y < t_{k−1}}.

The first term is predictable. Because of the reverse Markov property, the conditional expectation of the second equals

    E[ 1{t_k ≤ Y < t_{k−1} < Z} | G_1(t_{k−1}) ]
        = 1{Y < t_{k−1} < Z} P(t_k ≤ Y < t_{k−1} < Z) / P(Y < t_{k−1} < Z).

Similarly, we obtain

    E[ 1{t_k ≤ Y ≤ Z ≤ t_{k−1}, Y < t_{k−1}} | G_1(t_{k−1}) ]
        = 1{Y ≤ Z ≤ t_{k−1}, Y < t_{k−1}} P(t_k ≤ Y ≤ Z ≤ t_{k−1}, Y < t_{k−1}) / P(Y ≤ Z ≤ t_{k−1}, Y < t_{k−1}).

Summing up, we get

    E[ 1{t_k ≤ Y ≤ Z} | G_1(t_{k−1}) ] = 1{t_{k−1} ≤ Y ≤ Z}
        + 1{Y < t_{k−1} < Z} P(t_k ≤ Y < t_{k−1} < Z) / P(Y < t_{k−1} < Z)
        + 1{Y ≤ Z ≤ t_{k−1}, Y < t_{k−1}} P(t_k ≤ Y ≤ Z ≤ t_{k−1}, Y < t_{k−1}) / P(Y ≤ Z ≤ t_{k−1}, Y < t_{k−1}).

In the Doob-Meyer decomposition, the martingale part satisfies

    M_k = M_{k−1} + 1{t_k ≤ Y < t_{k−1}, Y ≤ Z}
          − 1{Y < t_{k−1} < Z} P(t_k ≤ Y < t_{k−1} < Z) / P(Y < t_{k−1} < Z)
          − 1{Y ≤ Z ≤ t_{k−1}, Y < t_{k−1}} P(t_k ≤ Y ≤ Z ≤ t_{k−1}, Y < t_{k−1}) / P(Y ≤ Z ≤ t_{k−1}, Y < t_{k−1}).

By induction,

    M_k = 1{t_k ≤ Y ≤ Z} − Σ_{j=0}^{k−1} 1{Y < t_j < Z} P(t_{j+1} ≤ Y < t_j < Z) / P(Y < t_j < Z)        (10.5.3)
                         − Σ_{j=0}^{k−1} 1{Y ≤ Z ≤ t_j, Y < t_j} P(t_{j+1} ≤ Y ≤ Z ≤ t_j, Y < t_j) / P(Y ≤ Z ≤ t_j, Y < t_j).        (10.5.4)

Now, since Y and Z are independent, we have

    P(t_{j+1} ≤ Y < t_j < Z) / P(Y < t_j < Z)
        = [F(t_j−) − F(t_{j+1}−)] (1 − G(t_j)) / [F(t_j−) (1 − G(t_j))]
        = [F(t_j−) − F(t_{j+1}−)] / F(t_j−).

Hence the sum in (10.5.3) is the F-integral of a certain step function which, as the grid gets finer and finer, tends to the limit

    ∫_[t,∞) 1{Y ≤ u < Z} F(du) / F(u).

When F and G have no atoms in common, then the sum in (10.5.4) tends to zero. Without further mentioning, we assume that this holds.

Lemma 10.5.1. The process

    t → 1{t ≤ Y ≤ Z}

has the innovation martingale (in reverse time)

    M_t = 1{t ≤ Y ≤ Z} − ∫_[t,∞) 1{Y ≤ u < Z} F(du) / F(u).

Corollary 10.5.2. The process

    H_n^1(t) = (1/n) Σ_{i=1}^n 1{t ≤ Y_i ≤ Z_i}

has the innovation martingale

    M_n(t) = H_n^1(t) − ∫_[t,∞) C_n(u+) F(du) / F(u),

where

    C_n(u) = (1/n) Σ_{i=1}^n 1{Y_i ≤ u ≤ Z_i}.

This C_n is an estimator of the function

    C(u) = P(Y ≤ u ≤ Z | Y ≤ Z) = α^{−1} F(u) (1 − G(u−)),

which plays an important role in the analysis of truncated data. Actually,

    C_n(u) = (1/n) Σ_{i=1}^n 1{Y_i ≤ u ≤ Z_i}

is the ratio of a sample mean, estimating F(1 − G−) at u, and n/N, which is an (unknown) estimator of

    α = P(Y ≤ Z).
We now discuss the problem of estimating the unknown d.f. of Y. First we are looking for a relation which may be helpful to identify F. Denote with F* the d.f. of the actually observed Y^0's. Then

    F*(y) = P(Y^0 ≤ y) = P(Y ≤ y | Y ≤ Z).

This function can be estimated through

    F_n*(y) = (1/n) Σ_{i=1}^n 1{Y_i^0 ≤ y}.

Since we observe both Y_i^0 and Z_i^0, also the function

    H*(y, z) = P(Y ≤ y, Z ≤ z | Y ≤ Z)

may be estimated, namely by

    H_n*(y, z) = (1/n) Σ_{i=1}^n 1{Y_i^0 ≤ y, Z_i^0 ≤ z}.

Now, because Y and Z are independent,

    H*(y, z) = α^{−1} ∫_(−∞,y] ∫_(−∞,z] 1{u ≤ v} F(du) G(dv)
             = α^{−1} ∫_(−∞,y∧z] [G(z) − G(u−)] F(du).

Putting z = ∞, we get

    F*(y) = α^{−1} ∫_(−∞,y] [1 − G(u−)] F(du).        (10.5.5)

Thus the unknown α is contained in both C and F*. There is some hope that, by taking appropriate ratios, we may arrive at estimable quantities in which α has cancelled out. Now, rewrite (10.5.5) to get

    α dF* = (1 − G−) dF

and therefore

    dF = α dF* / (1 − G−).

Recall

    C(x) = P(Y ≤ x ≤ Z | Y ≤ Z) = α^{−1} F(x) (1 − G(x−)).

Then

    dF / F = dF* / C.        (10.5.6)

Equation (10.5.6) is the required identifiability equation: both terms on the right-hand side are estimable. For example, if F has a density f, then

    ∫_(t,∞) dF / F = ∫_t^∞ [f(x) / F(x)] dx = [ln F(x)]_t^∞ = − ln F(t)

and therefore

    F(t) = exp[ − ∫_(t,∞) dF / F ].        (10.5.7)

We need (10.5.7) for the general case, i.e., also when F has atoms. The product integration formula for survival functions is obtained by multiplying the variable of interest with minus one. So, write

    F(t) = P(Y ≤ t) = P(−Y ≥ −t) = P(X ≥ s),

where X = −Y and s = −t. Furthermore,

    P(X ≥ s) = lim_{u↑s} P(X > u).

Let, for a moment, G denote the d.f. of X, and let Λ denote the pertaining cumulative hazard function, i.e.,

    Λ(x) = ∫_(−∞,x] dG / (1 − G−).

Then

    1 − G(u−) = P(X ≥ u) = P(Y ≤ −u) = F(−u),

and therefore

    Λ(x) = ∫_(−∞,x] G(du) / F(−u) = E[ 1{X ≤ x} / F(−X) ] = E[ 1{Y ≥ −x} / F(Y) ].

With x = −t this becomes

    Λ(−t) = E[ 1{Y ≥ t} / F(Y) ] = ∫_[t,∞) F(dy) / F(y) =: Λ̃(t).        (10.5.8)

Hence x is a discontinuity point of Λ̃ if and only if x is a discontinuity point of F. For such an x,

    Λ̃{x} = F{x} / F(x).

Conclude that

    F(t) = lim_{u↑−t} [1 − G(u)] = lim_{u↑−t} exp[−Λ_c(u)] Π_{x≤u} [1 − Λ{x}]
         = exp[−Λ_c(−t)] Π_{y>t} [F(y−) / F(y)].

Since, from (10.5.8), Λ_c(−t) = Λ̃_c(t), where

    Λ̃_c(t) = Λ̃(t) − Σ_{y≥t} Λ̃{y}        (10.5.9)

denotes the continuous part of Λ̃, and upon noticing that Λ̃ is left-hand continuous and nonincreasing with Λ̃(∞) = 0, we get

Lemma 10.5.3. With Λ̃_c as in (10.5.9),

    F(t) = exp[−Λ̃_c(t)] Π_{y>t} [F(y−) / F(y)]
         = exp[−Λ̃_c(t)] Π_{y>t} [1 − Λ̃{y}].

We are now in the position to derive the estimator of F.
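Before doing so, the lemma can be sanity-checked numerically: for a purely discrete F the continuous part Λ̃_c vanishes, so the formula reduces to a finite product over the atoms above t. A toy check (NumPy; the three-point distribution is our own choice):

```python
import numpy as np

# Purely discrete F on {1, 2, 3}: Lambda_tilde_c = 0, so Lemma 10.5.3
# reduces to F(t) = prod_{y > t} [1 - F{y}/F(y)].
atoms = np.array([1.0, 2.0, 3.0])
mass = np.array([0.2, 0.3, 0.5])
F_at_atoms = np.cumsum(mass)              # F(y) evaluated at each atom

def F_via_product(t):
    keep = atoms > t
    return np.prod(1.0 - mass[keep] / F_at_atoms[keep])
# F_via_product(1.5) recovers F(1.5) = 0.2, F_via_product(2.5) = 0.5, etc.
```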

1. Step: By (10.5.6),

    Λ̃(t) = ∫_[t,∞) F(dy) / F(y) = ∫_[t,∞) F*(dy) / C(y),

and F* and C are estimated through

    F_n*(y) = (1/n) Σ_{i=1}^n 1{Y_i^0 ≤ y}

and

    C_n(y) = (1/n) Σ_{i=1}^n 1{Y_i^0 ≤ y ≤ Z_i^0}.

For Λ̃ we set

    Λ̃_n(t) = ∫_[t,∞) F_n*(dy) / C_n(y)
            = (1/n) Σ_{i=1}^n 1{t ≤ Y_i^0} C_n^{−1}(Y_i^0)
            = Σ_{i=1}^n [ 1{t ≤ Y_i^0} / Σ_{j=1}^n 1{Y_j^0 ≤ Y_i^0 ≤ Z_j^0} ].

Note that the denominator is always ≥ 1.


2. Step: In the next step we estimate F. Since Λ̃_n is purely discrete, Lemma 10.5.3 yields

    F_n(t) = Π_{y>t} [1 − Λ̃_n{y}].

If the Y_i^0 are pairwise distinct,

    F_n(t) = Π_{Y_i^0 > t} [ 1 − 1 / (n C_n(Y_i^0)) ].        (10.5.10)

This estimator is called the Lynden-Bell estimator. Recall (10.5.8), which in terms of differentials yields

    F dΛ̃ = −dF.

Integration leads to

    − ∫_(t,∞) F(y) Λ̃(dy) = 1 − F(t).        (10.5.11)

(10.5.11) is true for any Λ̃ and the associated F. For Λ̃_n and F_n we therefore get

    − ∫_(t,∞) F_n(y) Λ̃_n(dy) = 1 − F_n(t).        (10.5.12)
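The Lynden-Bell estimator (10.5.10) is straightforward to compute from the observed pairs. A minimal sketch (NumPy; names and toy data ours), assuming the Y_i^0 are pairwise distinct:

```python
import numpy as np

def lynden_bell_F(t, Y0, Z0):
    """F_n(t) from (10.5.10): product over observed Y_i^0 > t of
    [1 - 1/(n C_n(Y_i^0))], where n C_n(y) = #{j : Y_j^0 <= y <= Z_j^0}.
    Note this estimates the distribution function F, not a survival function."""
    Y0, Z0 = np.asarray(Y0, float), np.asarray(Z0, float)
    nC = np.array([np.sum((Y0 <= y) & (y <= Z0)) for y in Y0])  # always >= 1
    return float(np.prod(1.0 - 1.0 / nC[Y0 > t]))

# Toy sample of observed (right-truncated) pairs with Y_i^0 <= Z_i^0:
Y0 = [1.0, 2.0, 3.0]
Z0 = [2.5, 3.5, 4.0]
```

As usual for this estimator, F_n(t) = 0 below the smallest observation whose risk count n C_n equals one.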

We conclude our discussion with the estimation of α. We have seen that

    α dF* = (1 − G−) dF.

Integration gives

    α = ∫ (1 − G−) dF.

Since Z is, in turn, truncated through Y, we can use a Lynden-Bell-type estimator for G, say Ĝ_n. The estimator of α will be

    α̂ = ∫ (1 − Ĝ_n−) dF_n.

Chapter 11

The Multivariate Case

11.1 Introduction

Let X = (X_1, …, X_m), m ≥ 1, be a random vector with distribution function F, i.e.,

    F(x_1, …, x_m) = P(X_1 ≤ x_1, …, X_m ≤ x_m)

denotes the probability that the X_i jointly fall below given thresholds x_i, 1 ≤ i ≤ m. In many real-life situations the vector X comprises the individual lifetimes of the m components of a complex system. For example, in a financial portfolio of bonds, X_i usually denotes the time until default of the i-th asset. In a medical setup, imagine that a patient is constantly monitored after surgery and further breakdowns may be caused by so-called competing risks. Similarly in reliability theory, i.e., in a technical environment. In all these examples the X_i may be viewed as individual times-to-event and are therefore nonnegative. Consequently, F is concentrated on the m-fold product of the positive real line, so that

    P(X_1 ≥ 0, …, X_m ≥ 0) = 1.
In the analysis of survival data it may often happen that, due to time limitations, the X_i are not always observable. Rather, there will be a censoring variable Y_i, the time spent under investigation, so that only

    Z_i = min(X_i, Y_i),   1 ≤ i ≤ m,

and

    δ_i = 1{X_i ≤ Y_i},

the censoring status or label of the datum, are available. Given a number of independent replications of

    Z = (Z_1, …, Z_m)   and   δ = (δ_1, …, δ_m),

it is then the goal of statistical inference to estimate the unknown distribution function (d.f.) F.
While for m = 1, i.e., for real-valued survival data, the analysis already started, in a nonparametric framework, with the landmark paper of Kaplan and Meier (1958), the multivariate case has only been studied since the mid-1980s. An early reference is Dabrowska (1988), whose estimator, however, was not a bona fide estimator but could give negative (empirical) probability masses to certain events. Van der Laan (1995) derived implicit estimators which were asymptotically efficient under strong assumptions like complete availability of the censoring variables. Gill et al. (1995) showed that the estimators obtained until then were efficient under complete independence and continuity but inefficient otherwise. As a conclusion, the problem to find efficient bona fide estimators for F under censorship in a general scenario is still open when m ≥ 2. It is the goal of this chapter to provide such an estimator, i.e., to extend the Kaplan-Meier (KM) estimator to the multivariate case. The outline of this chapter is as follows. In the rest of this section we discuss some identifiability issues. In Section 11.2 we further motivate our approach, while in Section 11.3 we derive the multivariate KM-estimator. Section 11.4 provides its efficiency, while in Section 11.5 we report on some simulation results.

In the rest of this section we first show why the multivariate case is so much different from the univariate situation. For this, we remind the reader how the univariate KM-estimator is usually derived. When m = 1, the survival function

    F̄(x) = P(X ≥ x)

satisfies the trivial equation

    F̄(x) = 1 − F(x−),

where

    F(x−) = lim_{y↑x} F(y)

is the left-hand limit of F at x. Moreover, introduce H, the d.f. of Z = min(X, Y), and the joint sub-d.f. of Z and δ:

    H(x) = P(Z ≤ x)   and   H^1(x) = P(Z ≤ x, δ = 1).

Under independence of X and Y,

    dH^1 = (1 − G−) dF   and   1 − H = (1 − F)(1 − G),        (11.1.1)

with G(x) = P(Y ≤ x) denoting the (unknown) d.f. of Y. In conclusion we obtain

    F̄(x) = 1 − F(x−) = 1 − ∫_[0,x) [F̄(y) / F̄(y)] F(dy)
          = 1 − ∫_[0,x) F̄(y) Λ(dy),        (11.1.2)

where dΛ = dF / F̄ is the hazard measure associated with F. In view of (11.1.1), (11.1.2) becomes

    F̄(x) = 1 − ∫_[0,x) F̄(y) H^1(dy) / [1 − H(y−)].        (11.1.3)

There is one word of caution necessary. Namely, (11.1.3) is true only if Ḡ(y) > 0 for all y < x. Actually, suppose that Ḡ(y) = 0 for all y in a left neighborhood (x − ε, x) of x but dF > 0 there. This implies that with probability one all Y's are less than or equal to x − ε. Hence possible X's which exceed x − ε are necessarily censored. As a conclusion, F cannot be fully reconstructed from a set of censored data. Stute and Wang (1993) contains a detailed discussion of this issue. To avoid this here, we assume throughout, without further mentioning, that the support of F is included in that of G.

There are more important facts about (11.1.2) and (11.1.3). First, equation (11.1.2) is an inhomogeneous Volterra integral equation such that, given

    dΛ = dH^1 / (1 − H−),

its solution q satisfying q(0) = 1 is unique and equals q = F̄. In other words, the hazard measure dΛ uniquely determines F̄ resp. F through (11.1.2). An explicit representation is obtained through

    F̄(x) = exp[−Λ_c(x)] Π_{0≤t<x} [1 − Λ{t}],        (11.1.4)

the famous product-limit formula. Here Λ_c and Λ_d denote the continuous and the discrete parts of Λ, respectively. The KM-estimator is then obtained by first replacing H^1 and H by their empirical analogs, leading to

    dΛ_n = dH_n^1 / (1 − H_n−),

and then applying (11.1.4) to Λ_n.
We now come to the multivariate case. For ease of discussion, we only consider the case m = 2. The following example shows that things have changed.

Example 11.1.1. Let F̄_1 and F̄_2 be two absolutely continuous survival functions on the positive real line with densities f_1 and f_2. Set F̄(x_1, x_2) = F̄_1(x_1) F̄_2(x_2). The hazard function of F̄ is obviously

    λ(x_1, x_2) = f_1(x_1) f_2(x_2) / [F̄_1(x_1) F̄_2(x_2)] ≡ λ_1(x_1) λ_2(x_2),

say. Since, for any a > 0,

    λ(x_1, x_2) = (a λ_1(x_1)) (a^{−1} λ_2(x_2)),

the family of bivariate survival functions

    F̄_a(x_1, x_2) = exp[ − a ∫_0^{x_1} λ_1(t_1) dt_1 ] exp[ − a^{−1} ∫_0^{x_2} λ_2(t_2) dt_2 ]

all have the same hazard function as F̄.

The message of Example 11.1.1 is simple but striking, namely that in the multivariate case the hazard measure does not uniquely specify the survival function and hence F. There are more important differences. For example, if one tries to mimic the trivial derivation of (11.1.2) in the multivariate case, one finds out that this is impossible, simply because typically F and F̄ don't sum up to one. Actually, for m = 2, F(x_1, x_2) + F̄(x_1, x_2) equals the probability that the vector X falls into the area southwest or northeast of (x_1, x_2), the other two areas being totally neglected.
Summarizing our findings obtained so far, a simple extension of (11.1.2)-(11.1.4) to the multivariate case will be hopeless. Since it makes no sense to start the estimation procedure before solving the identifiability question, we have to discuss the first issue in greater detail.

Again, for ease of representation and readability, we concentrate on bivariate data. Recall X = (X_1, X_2) from F and, independently, Y = (Y_1, Y_2) from G. Also, as before,

    Z_i = min(X_i, Y_i)   and   δ_i = 1{X_i ≤ Y_i},   i = 1, 2.
Setting

    H^11(t_1, t_2) = P(Z_1 ≤ t_1, δ_1 = 1, Z_2 ≤ t_2, δ_2 = 1)

and

    H̄(t_1, t_2) = P(Z_1 ≥ t_1, Z_2 ≥ t_2),

we obtain for the survival functions

    F̄(x_1, x_2) = P(X_1 ≥ x_1, X_2 ≥ x_2)

and

    Ḡ(x_1, x_2) = P(Y_1 ≥ x_1, Y_2 ≥ x_2),

as in the univariate case,

    dH^11 = Ḡ dF   and   Ḡ F̄ = H̄.

Conclude that

    F̄(x_1, x_2) = ∫_[x_1,∞)×[x_2,∞) F̄(t_1, t_2) H^11(dt_1, dt_2) / H̄(t_1, t_2).        (11.1.5)

As in the univariate case, we have to assume that

    supp F ⊂ supp G.        (11.1.6)

Given H^11 and H̄, (11.1.5) exhibits that F̄ is a solution of a homogeneous Volterra equation rather than an inhomogeneous one as in (11.1.3). If one looks at the right-hand side of (11.1.5) and interprets the integral as an operator acting on F̄, (11.1.5) states that F̄ is an eigenfunction with eigenvalue one. There is little hope, though, in view of Example 11.1.1, that the eigenspace is one-dimensional, so that F̄ would be the only solution satisfying F̄(0, 0) = 1. Another complication comes in since, compared with (11.1.3), we now integrate over the north-eastern area, which very often (e.g., in the Lebesgue-dominated case) has infinite mass under the hazard measure dΛ. The following construction serves to replace dΛ by a finite measure.

Definition 11.1.2. For any bivariate vector X and 0 < ε < 1, let

    X^ε = { X     with probability 1 − ε
          { x^∞   with probability ε,

where x^∞ is a vector, possibly (∞, ∞), which exceeds all (x_1, x_2) (componentwise) in the support of F.

Note that in the discussion below, as well as in the estimation process, the choice of x^∞ is not important. It only serves as an intellectual tool for introducing a defective distribution on R^2 with survival function

    F̄^ε(x_1, x_2) = { (1 − ε) F̄(x_1, x_2) + ε   if (x_1, x_2) ∈ R^2
                    { ε                          if (x_1, x_2) = x^∞.

Of course,

    F̄(x) = lim_{ε↓0} F̄^ε(x)   uniformly in x = (x_1, x_2) ∈ R^2.        (11.1.7)

When it comes to estimation, we may let ε = ε_n depend on the sample size n in such a way that ε_n → 0 as n → ∞. It is also interesting to note that in the univariate case the Kaplan-Meier estimator is also defective whenever the δ for the largest Z equals zero. Then the point x^∞ carrying the remaining mass is from time to time called the cemetery.

Now, denote with dΛ^ε the hazard measure associated with F̄^ε. We have

    dΛ^ε = { 1                               on x^∞
           { (1 − ε) dF / [(1 − ε) F̄ + ε]   on R_+ × R_+,        (11.1.8)

a finite measure. Finally, observe that the homogeneous equation

    q^ε(x) = ∫ 1{t ≥ x} q^ε(t) Λ^ε(dt),        (11.1.9)

where t = (t_1, t_2) ≥ x = (x_1, x_2) is meant componentwise, admits the solution

    q^ε(x) = F̄^ε(x).        (11.1.10)

In the following section we shall show that, for a given ε > 0, F̄^ε is the only solution of (11.1.9) satisfying q^ε(0, 0) = 1 and q^ε(x^∞) = ε. In other words, for defective distributions the survival function is reconstructable from its hazard function Λ^ε. Also, we need to find a one-to-one relation between F̄^ε and Λ^ε serving the same purpose as (11.1.4) in the univariate case. Finally, recall (11.1.7) to get the original target F̄ back from the F̄^ε's.

11.2 Identification of Defective Survival Functions

Consider the eigenproblem (11.1.9) with Λ^ε defined as in (11.1.8). Since dΛ^ε is unknown in practice, we need to study (11.1.9) also for other distributions, e.g., for estimators of dΛ^ε. Therefore, in the following, we let P be any finite measure giving mass one to x^∞, and consider the integral equation

    q(x) = ∫ 1{t ≥ x} q(t) P(dt).        (11.2.1)

We then have the following result.

Theorem 11.2.1. Let P be any given finite measure with mass one at x^∞. Furthermore, let 0 < ε < 1 be a given positive number. Then there exists exactly one nonnegative q = q^ε(P) with q^ε(0, 0) = 1 and q^ε(x^∞) = ε satisfying (11.2.1). Moreover, this q^ε admits the Neumann representation

    q^ε(x) = ε [ 1 + Σ_{i=1}^∞ ∫⋯∫ 1{t_i ≥ t_{i−1} ≥ … ≥ t_1 ≥ x, t_i ≠ x^∞} P(dt_i) ⋯ P(dt_1) ]
           = [ 1 + Σ_{i=1}^∞ ∫⋯∫ 1{t_i ≥ t_{i−1} ≥ … ≥ t_1 ≥ x, t_i ≠ x^∞} P(dt_i) ⋯ P(dt_1) ]
             / [ 1 + Σ_{i=1}^∞ ∫⋯∫ 1{t_i ≥ t_{i−1} ≥ … ≥ t_1 ≥ 0, t_i ≠ x^∞} P(dt_i) ⋯ P(dt_1) ].        (11.2.2)

Proof. It is easy to check that the q^ε from (11.2.2) is nonnegative and satisfies q^ε(0, 0) = 1. Also, the series in the numerator vanishes at x = x^∞, so that q^ε(x^∞) = ε. We shall see from the proof below that both series converge. Actually, the positive ε has been introduced for one reason only, namely to enforce convergence in the Neumann representation. Finally, check that (11.2.2) is indeed a solution of (11.2.1). Hence it remains to prove that q^ε from (11.2.2) is the only solution of (11.2.1). So, let q be any other nonnegative solution of (11.2.1) satisfying the two constraints q(0, 0) = 1 and q(x^∞) = ε. By (11.2.1), it is necessarily non-increasing in x and bounded from above by 1. If we iterate (11.2.1), we obtain

    q(x) = q(x^∞) + ∫ 1{t ≥ x, t ≠ x^∞} q(t) P(dt)        (11.2.3)
         = q(x^∞) + ∫∫ 1{t_2 ≥ t_1 ≥ x, t_1 ≠ x^∞} q(t_2) P(dt_2) P(dt_1)
         = q(x^∞) + q(x^∞) ∫ 1{t_1 ≥ x, t_1 ≠ x^∞} P(dt_1)
           + ∫∫ 1{t_2 ≥ t_1 ≥ x, t_2 ≠ x^∞} q(t_2) P(dt_2) P(dt_1).
258

CHAPTER 11. THE MULTIVARIATE CASE

If we repeat this several times we get for any xed k N:


{
}

k1

q(x) = q(x ) 1 +
. . . 1{ti ti1 ...x,ti =x } P (dti ) . . . P (dt1 )

i=1

...

1{tk tk1 ...x,tk1 =x } q(tk )P (dtk ) . . . P (dt1 )


}
}
{
{
k1
k1

bi + ck .
q(x ) 1 +
bi + ck = 1 +
i=1

i=1

Clearly, all bk s and ck s are nonnegative. In particular, the sum of the bs


is non-decreasing in k. Conclude that
{
}

1+
bi q(x).
i=1

Since is positive, the series is nite. Especially, we get bk 0 as k .


But

{
}
0 ck q(0) . . . 1{tk tk1 ...x,tk1 =x } P (dtk ) . . . P (dt1 ) q(0)bk1 P (R+ )2 + 1 .
Therefore, ck 0 as k and thus
}
{

q(x) = 1 +
bi
{
= 1+

i=1

...

1{ti ti1 ...t1 x, ti =x } P (dti ) . . . P (dt1 ) .

i=1

Since q(0, 0) = 1, we get in particular


{
}

1= 1+
. . . 1{ti ...t1 0,ti =x } P (dti ) . . . P (dt1 ) ,
i=1

so that in summary we have obtained (11.2.2).
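For a discrete P, the Neumann representation (11.2.2) turns into a finite linear system: the numerator N satisfies N(t_j) = 1 + Σ_k P{t_k} 1{t_k ≥ t_j} N(t_k) over the support points, from which q^ε(x) = N(x)/N(0, 0) and ε = 1/N(0, 0). A toy illustration (NumPy; the measure below and all names are our own invention):

```python
import numpy as np

# Toy finite measure P: four support points in R_+^2 with total mass 0.8,
# plus (implicitly) mass one at x_inf.
pts = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [2.0, 2.0]])
mass = np.array([0.3, 0.2, 0.2, 0.1])
m = len(pts)

geq = lambda a, b: bool(np.all(a >= b))          # componentwise a >= b

# Linear system for the Neumann numerator at the support points
B = np.array([[mass[k] * geq(pts[k], pts[j]) for k in range(m)]
              for j in range(m)])
N_pts = np.linalg.solve(np.eye(m) - B, np.ones(m))

def N(x):
    return 1.0 + sum(mass[j] * N_pts[j] for j in range(m) if geq(pts[j], x))

eps = 1.0 / N(np.zeros(2))        # forced by the constraint q(0,0) = 1

def q(x):                         # the unique solution of (11.2.1)
    return N(x) / N(np.zeros(2))
```

By construction, q satisfies the fixed-point equation q(x) = ε + Σ_j P{t_j} 1{t_j ≥ x} q(t_j), i.e., (11.2.1) with mass one at x^∞.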


In view of Theorem 11.2.1, we know that for dP = d the only solution of
(11.1.9) equals q = q ( ) = F . Moreover, (11.2.2) yields the Neumann
representation of F . Next, we study (11.2.1) for
dP dP0 =

(1 )dH 11
+
+
+ on R R
(1 )H

11.2. IDENTIFICATION OF DEFECTIVE SURVIVAL FUNCTIONS259


(and one on x ) in greater detail. Let q0 = q (P0 ) be the corresponding q.
Since, by

F = H,

dH 11 = GdF
and G
we have
dP0 =

(1 )GdF
(1 )dF

= d ,

(1 )GF +
(1 )F +

(11.2.2) yields
q0 (x) q (x) for all x.
To obtain a lower bound, we need to strengthen the identifiability condition (11.1.6) a little bit, namely to

    supp F ⊂ supp G strictly.

In technical terms this means that for each x in the support of F we have

    Ḡ(x) ≥ γ > 0        (11.2.4)

for an appropriate (small) 0 < γ < 1. This means, for a vector X = (X_1, X_2) taking its value in the extreme north-eastern corner, that there is a little though positive chance of not being censored. Under (11.2.4) we have, for 0 < ε < γ < 1,

    dP_0^ε = (1 − ε) dF / [(1 − ε) F̄ + ε/Ḡ] ≥ (1 − ε) dF / [(1 − ε) F̄ + ε/γ]
           ≥ (1 − ε/γ) dF / [(1 − ε/γ) F̄ + ε/γ] = dΛ^{ε/γ}.

Hence by the Neumann representation (11.2.2) we get

    q_0^ε(x) ≥ q^{ε/γ}(x)   for all x.

Taking into account (11.1.7), we thus obtain the following

Theorem 11.2.2. Under (11.2.4), we have, uniformly in x,

    F̄ = lim_{ε↓0} q^ε(Λ^ε) = lim_{ε↓0} q^ε(P_0^ε).        (11.2.5)

11.3 The Multivariate Kaplan-Meier Estimator

In this section we derive and study the multivariate extension of the Kaplan-Meier estimator. Again, we restrict ourselves to dimension two. We start with equation (11.1.5) and replace the unknown H^11 and H̄ by their empirical counterparts

    H_n^11(t_1, t_2) = (1/n) Σ_{j=1}^n δ_{1j} δ_{2j} 1{Z_{1j} ≤ t_1, Z_{2j} ≤ t_2},

    H̄_n(t_1, t_2) = (1/n) Σ_{j=1}^n 1{Z_{1j} ≥ t_1, Z_{2j} ≥ t_2},

where from now on we denote with (Z_{1j}, Z_{2j}, δ_{1j}, δ_{2j}), 1 ≤ j ≤ n, a sample of independent replicates of Z = (Z_1, Z_2) and δ = (δ_1, δ_2) available to the analyst. See Section 11.1 for the notion of Z and δ. To obtain a defective estimator, we choose ε = 1/(n+1), so that the empirical equivalent of (11.2.1) becomes

    q(x_1, x_2) = 1/(n+1) + ∫_[x_1,∞)×[x_2,∞) q(t_1, t_2) H_n^11(dt_1, dt_2) / [H̄_n(t_1, t_2) + 1/n].        (11.3.1)

Denote with

    F̄_n = q^ε(P_n^0)

the unique nonnegative solution of (11.3.1) satisfying q(0, 0) = 1, where dP_n^0 = dH_n^11 / (H̄_n + 1/n). F̄_n will be the bivariate analog of the univariate Kaplan-Meier estimator. Extensions to higher dimension are straightforward at the expense of some more notation. As to computational aspects, one possibility would be the Neumann representation (11.2.2) with dP = dP_n^0. We prefer to solve (11.3.1) directly. Forgetting the leading summand 1/(n+1) for a moment, equation (11.3.1) says that F̄_n is the eigenfunction (with eigenvalue one) of an appropriate (empirical) operator equation.

To begin with the computation: since H_n^11 is a discrete measure supported by certain (Z_{1i}, Z_{2i}), 1 ≤ i ≤ n, so is F̄_n. Let p_i denote the mass given to (Z_{1i}, Z_{2i}) under F̄_n. These are the quantities we are interested in, because they uniquely determine F̄_n. Conclude that

    F̄_n(Z_{1i}, Z_{2i}) = Σ_{j=1}^n a_{ij} p_j + 1/(n+1),
where a_{ij} is the indicator

    a_{ij} = { 1   if Z_{1j} ≥ Z_{1i} and Z_{2j} ≥ Z_{2i}
             { 0   otherwise.

Since in this section n is fixed, we may extend this definition to

    a_{i,n+1} = 1   for 1 ≤ i ≤ n or i = n + 1,

to describe the position of x^∞ in the sample. Also, let p_{n+1} = 1/(n+1) denote the mass of x^∞. Finally, let

    b_i = (δ_{1i} δ_{2i} / n) / [ H̄_n(Z_{1i}, Z_{2i}) + 1/n ]

be the mass given to (Z_{1i}, Z_{2i}) under dH_n^11 / (H̄_n + 1/n). Let (Z_{1,n+1}, Z_{2,n+1}) correspond to x^∞, and put b_{n+1} = 1. Then equation (11.3.1) with x_1 = Z_{1i} and x_2 = Z_{2i} may be rewritten as

    Σ_{j=1}^{n+1} a_{ij} p_j = Σ_{k=1}^{n+1} F̄_n(Z_{1k}, Z_{2k}) a_{ik} b_k
                             = Σ_{k=1}^{n+1} a_{ik} b_k Σ_{l=1}^{n+1} a_{kl} p_l
                             = Σ_{l=1}^{n+1} [ Σ_{k=1}^{n+1} a_{ik} b_k a_{kl} ] p_l.
For i = n + 1, i.e., at x = x^∞, the equation is also true, since both sides equal 1/(n+1). Hence, if we introduce the vector p = (p_1, …, p_n, p_{n+1})^t and the matrices

    A = (a_{ij})_{1≤i,j≤n+1}   and   B = diag(b_1, …, b_{n+1}),

then equation (11.3.1) becomes

    Ap = ABAp.        (11.3.2)

Hence the vector Ap is an eigenvector of the matrix AB, with eigenvalue one. To simplify things, we may order the first components Z_{1j} in increasing order to come up with the associated order statistics

    Z_{11:n} ≤ … ≤ Z_{1n:n}.

After ordering, the a_{ij} become

    a_{ij} = { 0        if j < i
             { 1        if j = i
             { 1 or 0   if j > i,        (11.3.3)

assuming for the time being that no ties are present. An ordering of the second components would lead to the same numerical results. Now, under (11.3.3), the matrix A becomes an invertible upper triangular matrix, and equation (11.3.2) boils down to

    p = BAp.        (11.3.4)

Of course, now p_j is the mass given to (Z_{1j:n}, Z_{[2j:n]}), the second Z being the concomitant of the first. Furthermore, upon setting δ_j = δ_{1j} δ_{2j}, we have

    BA = (b_i a_{ij})_{1≤i,j≤n+1}

with

    b_i a_{ij} = δ_i a_{ij} / Σ_{k=i}^{n+1} a_{ik}.

In particular, BA too is upper triangular, with diagonal elements

    b_i = δ_i / Σ_{k=i}^{n+1} a_{ik},   1 ≤ i ≤ n + 1.

Note that b_{n+1} = 1 but 0 ≤ b_i < 1 for 1 ≤ i ≤ n, simply because a_{ii} = 1 = a_{i,n+1}. We are now in the position to formulate the main result of this section.

Theorem 11.3.1. The equation (11.3.4) has a unique admissible solution p = (p_1, …, p_{n+1})^t, i.e., all p_i satisfy p_i ≥ 0 and

    Σ_{i=1}^{n+1} p_i = 1.

Proof. From (11.3.4) we obtain, for each 1 ≤ i ≤ n + 1:

    p_i = Σ_{j=1}^{n+1} b_i a_{ij} p_j.

Since a_{ij} = 0 for i > j, it follows that

    p_i = b_i p_i + Σ_{j=i+1}^{n+1} b_i a_{ij} p_j.

Recall p_{n+1} = 1/(n+1). For 1 ≤ i ≤ n we get

    p_i = c_i Σ_{j=i+1}^{n+1} a_{ij} p_j,        (11.3.5)

with

    c_i = b_i / (1 − b_i),   1 ≤ i ≤ n,

being well-defined because of b_i < 1. Since all b_i and a_{ij} are nonnegative, so are the p_i's. Furthermore, by backward recursion based on (11.3.5), all p_i's are unique up to a common factor. Hence, by dividing each p_i through Σ_{j=1}^{n+1} p_j, we obtain the admissible solution.
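The proof's backward recursion translates directly into an algorithm. A minimal sketch of the bivariate weights (NumPy; all names ours; no ties among the first components assumed, as in the text):

```python
import numpy as np

def bivariate_km_weights(Z1, Z2, d1, d2):
    """Masses (p_1,...,p_n, p_{n+1}) of the bivariate estimator via the
    backward recursion (11.3.5); the last entry is the mass at x_inf."""
    n = len(Z1)
    o = np.argsort(Z1)                    # order by the first component
    Z1, Z2, d = Z1[o], Z2[o], (d1 * d2)[o]
    a = np.ones((n + 1, n + 1))
    a[:n, :n] = (Z1[None, :] >= Z1[:, None]) & (Z2[None, :] >= Z2[:, None])
    a[n, :n] = 0.0                        # only x_inf lies northeast of x_inf
    b = np.append(d / a[:n].sum(axis=1), 1.0)   # b_i = delta_i / sum_k a_ik
    c = b[:n] / (1.0 - b[:n])             # b_i < 1 since a_ii = a_{i,n+1} = 1
    p = np.zeros(n + 1)
    p[n] = 1.0                            # unnormalized p_{n+1}
    for i in range(n - 1, -1, -1):        # backward recursion (11.3.5)
        p[i] = c[i] * (a[i, i + 1:] @ p[i + 1:])
    return p / p.sum()
```

When the second components are degenerate, the routine reproduces the univariate weights discussed below.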

Note that the backward recursion (11.3.5) leads to an explicit representation of p_i through p_{n+1} (before normalizing): for 1 ≤ i ≤ n we have

    p_i = c_i p_{n+1} [ 1 + Σ_{j=i+1}^n a_{ij} c_j + Σ_{j=i+1}^n Σ_{k=j+1}^n a_{ij} a_{jk} c_j c_k
                        + … + a_{i,i+1} a_{i+1,i+2} ⋯ a_{n−1,n} c_{i+1} c_{i+2} ⋯ c_n ] ≡ c_i d_i p_{n+1},

and therefore

    Σ_{i=1}^{n+1} p_i = p_{n+1} [ 1 + Σ_{i=1}^n c_i d_i ].

To get a proper distribution, p_{n+1} therefore needs to be updated and becomes

    p_{n+1} = [ 1 + Σ_{i=1}^n c_i d_i ]^{−1}.

(11.3.5) then applies to compute the other p_i's from the set of c's and a's, i.e., from the data. As mentioned earlier, in the univariate case, the Kaplan-Meier estimator is always defective and thus unprotected against loss of mass when the δ of the largest Z equals zero. Hence it is interesting to look at our approach also in dimension one. In this case a_{ij} = 1 for j ≥ i, after ordering. Conclude that

    b_i = δ_i / (n + 2 − i),   1 ≤ i ≤ n + 1.

The backward recursion (11.3.5) simplifies dramatically and becomes

    p_i = c_i Σ_{j=i+1}^{n+1} p_j,   1 ≤ i ≤ n.

From this we obtain

    p_i = (c_i / b_{i+1}) p_{i+1}

and thus, because of b_{n+1} = 1,

    p_i = (c_i / b_{i+1}) (c_{i+1} / b_{i+2}) (c_{i+2} / b_{i+3}) ⋯ p_{n+1}
        = c_i p_{n+1} Π_{j=i+1}^n (1 + c_j)

for 1 ≤ i ≤ n. The updated p_{n+1} equals

    p_{n+1} = 1 / Π_{j=1}^n (1 + c_j),

so that, at the end of the day, for 1 ≤ i ≤ n,

    p_i = c_i / Π_{j=1}^i (1 + c_j) = b_i Π_{j=1}^{i−1} (1 − b_j).

These are exactly the Kaplan-Meier weights given to the sample (Z_i, δ_i), 1 ≤ i ≤ n, enriched by x^∞. The choice of x^∞ is not important when we restrict our estimator to the data. Furthermore, as was shown by Stute and Wang (1993) and Stute (1995), the fundamental large sample results, the Strong Law of Large Numbers and the Central Limit Theorem, continue to hold for KM-integrals and are not affected by the possible (negligible) defectiveness of the estimator. Similarly for the multivariate case, where (before normalization) the mass 1/(n+1) given to x^∞ is positive but small as n gets larger.
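The claimed reduction to the Kaplan-Meier weights can be checked numerically: the backward recursion (11.3.5) with a_{ij} = 1 for j ≥ i reproduces b_i Π_{j<i}(1 − b_j). A toy check (NumPy; the data are ours):

```python
import numpy as np

# Univariate sanity check of the reduction to KM weights.
Z = np.array([2.0, 1.0, 4.0, 3.0])
delta = np.array([1.0, 0.0, 1.0, 1.0])
o = np.argsort(Z)
Z, delta = Z[o], delta[o]
n = len(Z)
b = np.append(delta / (n + 1 - np.arange(n)), 1.0)  # b_i = delta_i/(n+2-i)
c = b[:n] / (1.0 - b[:n])
p = np.zeros(n + 1)
p[n] = 1.0                                          # unnormalized p_{n+1}
for i in range(n - 1, -1, -1):                      # recursion (11.3.5)
    p[i] = c[i] * p[i + 1:].sum()
p /= p.sum()
# Closed form from the text: b_i * prod_{j<i} (1 - b_j)
km = np.array([b[i] * np.prod(1.0 - b[:i]) for i in range(n)])
```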

11.4 Efficiency of the Estimator

While in the previous section the focus was on the construction of a nonparametric estimator of a multivariate survival function and some of its numerical properties, this section is devoted to its statistical behavior. For this, let P again be any finite measure on the Euclidean space enriched by x^∞ and giving mass one to x^∞. Recall that

    q(x) = ∫ 1{t ≥ x} q(t) P(dt)        (11.4.1)

admits exactly one nonnegative solution satisfying q(0, 0) = 1 and q(x^∞) = ε. Moreover, q equals

    q(x) = [ 1 + Σ_{i=1}^∞ ∫⋯∫ 1{t_i ≥ t_{i−1} ≥ … ≥ t_1 ≥ x, t_i ≠ x^∞} P(dt_i) ⋯ P(dt_1) ]
           / [ 1 + Σ_{i=1}^∞ ∫⋯∫ 1{t_i ≥ t_{i−1} ≥ … ≥ t_1 ≥ 0, t_i ≠ x^∞} P(dt_i) ⋯ P(dt_1) ].        (11.4.2)

Since q(x) is uniquely determined through P, we may view q(x) as the value of a statistical functional q(x) = T_x(P) evaluated at P. Equation (11.4.2) exhibits that T_x is a ratio of two infinite-order U-functionals. Similar to Serfling (1980), one can show that T_x is Gateaux-differentiable. The left-hand side of (11.4.1) yields not only a function in x but also a distribution, which is again denoted with q. By taking increments in (11.4.1), we obtain for any rectangle I

    q(I) = ∫ 1_I(t) q(t) P(dt).        (11.4.3)

From this we get, by summing up both sides of (11.4.3), that

    ∫ φ dq = ∫ φ(t) q(t) P(dt)        (11.4.4)

for any elementary function φ. Finally, by going to the limit, (11.4.4) holds also for any bounded continuous function φ, and finally for any bounded measurable φ. The boundedness is needed to guarantee, for any finite P, the existence of the integrals. Relevant q obtained and discussed so far are the ones associated with

    dP = dΛ^ε,   dP = dP_0^ε = (1 − ε) dH^11 / [(1 − ε) H̄ + ε]

and

    dP = dP_n^0 = (1 − 1/(n+1)) dH_n^11 / [(1 − 1/(n+1)) H̄_n + 1/(n+1)] = dH_n^11 / (H̄_n + 1/n).

The true but unknown hazard measure dP = dF/F̄ does not belong to this collection, since in many cases it puts infinite mass on the Euclidean space. To also include integrals w.r.t. this last P, we assume that the function φ is such that F̄ is bounded away from zero on the support of φ. This guarantees that all integrals and moments to appear below exist. Important examples of such φ's are the indicators of the rectangles [0, x_1] × [0, x_2] with x_1 ≤ T_1 and x_2 ≤ T_2 such that F̄(T_1, T_2) > 0. Of course, for such a φ,

    ∫ φ dF = F(x_1, x_2).

Recall that for sample size n the mass sitting at x^∞ equals 1/(n+1). With this ε = ε_n we have

    F̄_n(x) = T_x(P_n^0).


CHAPTER 11. THE MULTIVARIATE CASE

Furthermore, set

    F̄_{n0}(x) = T_x(P_0),

a non-random survival function. To study the efficiency of our estimator F̄_n, we consider a linear functional ∫ φ dF̄_n of F̄_n with φ satisfying the above support condition. Then, by (11.4.4),

    ∫ φ dF̄_n = ∫ φ(t) F̄_n(t) P_{n0}(dt)

and

    ∫ φ dF̄_{n0} = ∫ φ(t) F̄_{n0}(t) P_0(dt).

Conclude that

    ∫ φ dF̄_n − ∫ φ dF̄_{n0} = ∫ φ(t)[F̄_n(t) − F̄_{n0}(t)] P_{n0}(dt) + ∫ φ(t) F̄_{n0}(t)[P_{n0}(dt) − P_0(dt)] = I + II.

The second integral equals (recall ε = 1/(n+1), so that ε/(1−ε) = 1/n)

    II = ∫ φ F̄_{n0} [ dH_n^11/(H̄_n + 1/n) − dH^11/(H̄ + 1/n) ]
       = ∫ φ F̄_{n0} / [(H̄_n + 1/n)(H̄ + 1/n)] · [ H̄ dH_n^11 − H̄_n dH^11 + (1/n)(dH_n^11 − dH^11) ].

Standard empirical process theory thus yields, in view of Theorem 11.2.2,

    n^{1/2} II = n^{1/2} ∫ (φ F̄ / H̄²) [H̄ dH_n^11 − H̄_n dH^11] + o_P(1)
              = n^{1/2} { ∫ (φ F̄ / H̄)(dH_n^11 − dH^11) + ∫ (φ F̄ / H̄²)[H̄ − H̄_n] dH^11 } + o_P(1)
              = n^{1/2} { ∫ (φ/Ḡ)(dH_n^11 − dH^11) − ∫ (φ/H̄)[H̄_n − H̄] dF } + o_P(1).

The term in brackets is a sum of centered i.i.d. random variables which after multiplication with n^{1/2} is asymptotically normal. Because of dH^11 = Ḡ dF, it simplifies to

    ∫ φ [ dH_n^11/Ḡ − H̄_n dF/H̄ ].    (11.4.5)


Conclude that

    n^{1/2} II = ∫ φ dα_n + o_P(1),

where the associated centered empirical process α_n is defined through

    dα_n = n^{1/2} [ dH_n^11/Ḡ − H̄_n dF/H̄ ].

It will also be relevant for I. Actually, since both F̄_n and F̄_{n0} pertain to the same ε, namely ε = 1/(n+1), we obtain from (11.2.2)

    F̄_n(t) − F̄_{n0}(t) = (1/(n+1)) Σ_{i=1}^∞ ∫⋯∫ 1_{t_i ≥ … ≥ t_1 ≥ t, t_i ≠ x_∞} [P_{n0}(dt_i)⋯P_{n0}(dt_1) − P_0(dt_i)⋯P_0(dt_1)].

The linear expansion of the right-hand side is obtained through the Hajek projection; see Serfling (1980). More precisely, define β_n through dβ_n = dα_n/F̄. Then we have, up to an o_P(1) term,

    n^{1/2}[F̄_n(t) − F̄_{n0}(t)]
      = (1/(n+1)) Σ_{i=1}^∞ Σ_{k=1}^i ∫⋯∫ 1_{t_i ≥ … ≥ t_1 ≥ t, t_i ≠ x_∞} P_0(dt_i)⋯β_n(dt_k)⋯P_0(dt_1)
      = (1/(n+1)) Σ_{k=1}^∞ ∫⋯∫ 1_{t_k ≥ … ≥ t_1 ≥ t, t_k ≠ x_∞} { 1 + Σ_{i=k+1}^∞ ∫⋯∫ 1_{t_i ≥ … ≥ t_k, t_i ≠ x_∞} P_0(dt_i)⋯P_0(dt_{k+1}) } β_n(dt_k)⋯P_0(dt_1)
      = Σ_{k=1}^∞ ∫⋯∫ T_{t_k}(P_0) 1_{t_k ≥ … ≥ t_1 ≥ t, t_k ≠ x_∞} β_n(dt_k)⋯P_0(dt_1).

From Theorem 11.2.2 and the tightness of β_n the last series becomes, up to an o_P(1) term,

    Σ_{k=1}^∞ ∫⋯∫ F̄(t_k) 1_{t_k ≥ … ≥ t_1 ≥ t, t_k ≠ x_∞} β_n(dt_k) [F(dt_{k−1})/F̄(t_{k−1})] ⋯ [F(dt_1)/F̄(t_1)].

On the support of φ, by the Glivenko-Cantelli Theorem, the distribution function of P_{n0} converges uniformly to that of dF/F̄, with probability one.


Summarizing, we have up to an o_P(1) term,

    n^{1/2} I = Σ_{k=1}^∞ ∫⋯∫ (φ(t)/F̄(t)) 1_{t_k ≥ … ≥ t_1 ≥ t, t_k ≠ x_∞} β_n(dt_k) [F(dt_{k−1})/F̄(t_{k−1})] ⋯ [F(dt_1)/F̄(t_1)] F(dt).

Taking (11.4.5) into account, we have thus obtained

    n^{1/2} [ ∫ φ dF̄_n − ∫ φ dF̄_{n0} ] = ∫ ψ dα_n + o_P(1)    (11.4.6)

with

    ψ(s) = φ(s) + Σ_{k=1}^∞ ∫⋯∫ φ(t_1) 1_{s ≤ t_k ≤ … ≤ t_1} [F(dt_k)/F̄(t_k)] ⋯ [F(dt_1)/F̄(t_1)].    (11.4.7)

Hence the Gateaux differential of a linear functional of F̄_n belongs to the closed linear span of the empirical process α_n, which obviously belongs to the closed linear span of the model at F. For efficiency, since ∫ ψ dα_n is a linear functional, it suffices to consider sample size n = 1. Now

    ∫ ψ dα_1 = ∫ ψ [ dH_1^11/Ḡ − H̄_1 dF/H̄ ]
             = δ_1 δ_2 ψ(Z_1, Z_2)/Ḡ(Z_1, Z_2) − ∫ 1_{Z_1 ≥ z_1, Z_2 ≥ z_2} (ψ(z_1, z_2)/H̄(z_1, z_2)) F(dz_1, dz_2).

It follows that

    E[ ∫ ψ dα_1 | X_1, X_2 ] = ψ(X_1, X_2) − ∫ 1_{X_1 ≥ z_1, X_2 ≥ z_2} (ψ(z_1, z_2)/F̄(z_1, z_2)) F(dz_1, dz_2).

If we plug (11.4.7) into the last expression, the terms in the series cancel out and the conditional expectation drops down to

    E[ ∫ ψ dα_1 | X_1, X_2 ] = φ(X_1, X_2).    (11.4.8)

11.5 Simulation Results

[Figure: surface plots of the estimated survival function Fehat and the true survival function Fbar over (x1s, x2s); n=100, theta=1.125, tao=0.125, epsilon=(n+1)^(-1).]


Bibliography

[1] Dabrowska, D. M., Kaplan-Meier estimate on the plane, Ann. Statist., 16, 1475-1489 (1988).
[2] Gill, R. D., van der Laan, M. J. and Wellner, J. A., Inefficient estimators of the bivariate survival function for three models, Ann. Inst. H. Poincare Probab. Statist., 31, 545-597 (1995).
[3] Prentice, R. L., Moodie, F. and Wu, J., Hazard-based nonparametric survivor function estimation, J. R. Stat. Soc. Ser. B Stat. Methodol., 66, 305-319 (2004).
[4] Serfling, R. J., Approximation theorems of mathematical statistics, Wiley, New York (1980).
[5] Stute, W., The central limit theorem under random censorship, Ann. Statist., 23, 422-439 (1995).
[6] Stute, W. and Wang, J. L., The strong law under random censorship, Ann. Statist., 21, 1591-1607 (1993).
[7] van der Laan, M. J., Efficient estimation in the bivariate censoring model and repairing NPMLE, Ann. Statist., 24, 596-627 (1996).
[8] van der Vaart, A., On differentiable functions, Ann. Statist., 19, 178-204 (1991).


Chapter 12

Nonparametric Curve Estimation

12.1 Nonparametric Density Estimation

Let X_1, …, X_n be a sample of independent random variables from a d.f. F with density f = F′. In this section we discuss a nonparametric estimator of f which was first proposed by Rosenblatt (1956) and Parzen (1962) and which has subsequently been extensively studied in the literature. We already mentioned that F_n does not allow for a straightforward estimation of f. Noting, however, that

    f(x) = F′(x) = lim_{a↓0} [F(x + a/2) − F(x − a/2)] / a,

we could consider the right-hand side, for a fixed bandwidth a > 0, and estimate the ratio through

    f_n(x) = [F_n(x + a/2) − F_n(x − a/2)] / a.

Since F_n is not differentiable at the data, it is not permitted to take the limit a → 0. For any such bandwidth, we have

    E f_n(x) = [F(x + a/2) − F(x − a/2)] / a

and

    Var f_n(x) = (1/(na²)) [F(x + a/2) − F(x − a/2)] [1 − F(x + a/2) + F(x − a/2)] ≈ f(x)/(na).

The formula for E f_n(x) reveals that in general f_n(x) is a biased estimator of f(x). Put

    Bias f_n(x) := E f_n(x) − f(x).

If f is twice continuously differentiable, then Taylor's formula yields

    Bias f_n(x) = (a²/24) f″(x) + O(a³).

An overall measure for the deviation between f_n and f at x is the Mean Squared Error

    MSE f_n(x) := Bias² f_n(x) + Var f_n(x).

Neglecting error terms we get

    MSE f_n(x) = (a⁴/576) [f″(x)]² + f(x)/(na) = c_1 a⁴ + c_2/(na).

This expression is minimized for

    a = [c_2 / (4 c_1 n)]^{1/5} = c n^{−1/5}.

The unpleasant feature of this result is that the constant c depends on f and is therefore unknown. A way out of this dilemma is to parametrize the bandwidth as t n^{−1/5} and look at f_n as a stochastic process in t. This will be studied in a forthcoming section.
Our results on the oscillation modulus immediately apply to get an almost sure bound for the stochastic error. Actually, for any bandwidth a = a_n tending to zero,

    (na_n)^{1/2} [f_n(x) − E f_n(x)] = a_n^{−1/2} [ α_n(x + a_n/2) − α_n(x − a_n/2) ],

which in absolute value is uniformly in x bounded from above by

    a_n^{−1/2} ω_n(a_n) = O( √(ln a_n^{−1}) ).

Hence for the optimal rate a_n = t n^{−1/5},

    sup_x |f_n(x) − E f_n(x)| = O( √(ln n) / n^{2/5} ).

The analytic error (bias) equals, as we have seen before, O(a_n²) = O(n^{−2/5}).
In this section we extend f_n to a more general class of estimators. Put

    K(y) = 1_{−1/2 ≤ y < 1/2},  y ∈ R.    (12.1.1)

Then we have

    f_n(x) = (1/a) ∫ K((x − y)/a) F_n(dy).    (12.1.2)

The function K is nonnegative and has Lebesgue integral

    ∫ K(y) dy = 1,

i.e., K is a Lebesgue density. Equation (12.1.2) applies to other densities K as well. If we replace the indicator by a differentiable K, the resulting f_n will also be differentiable, allowing for estimation of higher order derivatives of F. The function K is called the kernel and the resulting f_n is a kernel estimator. As for the naive kernel (12.1.1), the optimal choice of a_n depends on unknown quantities. Introducing the function

    K_a(y) = (1/a) K(y/a),  y ∈ R,

the kernel estimator equals K_a * F_n, where * denotes convolution. Most of the results obtained so far extend to a general K, under some mild tail conditions on K(x), as |x| → ∞.

12.2 Nonparametric Regression: Stochastic Design

An important issue in Multivariate Statistics is that of investigating the dependence structure of the components of multivariate data. This question was initiated by F. Galton, who coined the term regression. Let (X, Y) be a random vector in R^{d+1}. A classical model assumption is that, up to a stochastic error, Y depends linearly on X:

    Y = X^T θ + ε,  E(ε|X) = 0,  Var ε = σ² < ∞.    (12.2.1)

Here, θ is the unknown parameter of interest. Apart from (12.2.1) it is often assumed that ε has a normal distribution. In this case, many distributions of quantities of interest (e.g., of the LSE of θ) may be explicitly computed. Otherwise, asymptotic distributional results may be obtained under appropriate assumptions. Estimation and testing procedures are based on an i.i.d. sample (X_i, Y_i), 1 ≤ i ≤ n, from the same distribution as (X, Y). Much less has been known for a long time when the model assumptions (12.2.1) cannot be justified. For a general approach to this question, observe that

    E[Y|X] = E[X^T θ | X] + E[ε | X] = X^T θ,

i.e.,

    E[Y | X = x] = x^T θ.

In other words, knowing θ is tantamount to knowing the regression function

    m(x) = E[Y | X = x].    (12.2.2)

Nonparametric regression is concerned with estimation of m without making any model assumptions like (12.2.1). As becomes clear from (12.2.2), the regression function m is the factorization of E[Y|X] w.r.t. X. It is known from measure theory that the existence of m follows from the Radon-Nikodym Theorem, but no explicit representation of m is available in general. Estimation of m from an i.i.d. sample was independently initiated by Nadaraya (1964) and Watson (1964). To point out their idea, suppose that x is an X-atom, i.e., P(X = x) > 0. We then know that (a version of) m(x) is given by

    m(x) = (1/P(X = x)) ∫_{X=x} Y dP.    (12.2.3)



The SLLN implies that with probability one

    n^{−1} Σ_{i=1}^n Y_i 1_{X_i = x} → ∫_{X=x} Y dP

and

    n^{−1} Σ_{i=1}^n 1_{X_i = x} → P(X = x).

Thus, defining

    m_n(x) = Σ_{i=1}^n Y_i 1_{X_i = x} / Σ_{i=1}^n 1_{X_i = x} = Σ_{i=1}^n Y_i 1_{x}(X_i) / Σ_{i=1}^n 1_{x}(X_i),

we get

    m_n(x) → m(x) with probability one.

If x is not an X-atom, (12.2.3) is no longer true. Nadaraya and Watson proposed (when d = 1) to replace the one-point set {x} with an interval (x − a_n, x + a_n]:

    m_n(x) = Σ_{i=1}^n Y_i 1_{x − a_n < X_i ≤ x + a_n} / Σ_{i=1}^n 1_{x − a_n < X_i ≤ x + a_n}    (12.2.4)

(= 0 if the denominator is zero). Here a_n > 0 is again a bandwidth (or window) tending to zero as n gets large. a_n will play the same role as the bandwidth in the kernel density estimator. Similarly, also here, we are likely to lose the local information contained in the data and increase the bias if we choose a_n too large (oversmooth), and have too much variance if a_n is too small (undersmooth).
Choice of a proper bandwidth will therefore be an important issue also here.
Notice that m_n of (12.2.4) may be written as

    m_n(x) = Σ_{i=1}^n Y_i K((x − X_i)/a_n) / Σ_{i=1}^n K((x − X_i)/a_n)    (12.2.5)

with

    K(y) = 1_{[−1,1)}(y).

Needless to say, (12.2.5) may be extended to more general kernels K. Note also that, unlike the density case, we need not require ∫ K(u) du = 1


(the case K = 0 is always ruled out). Also, because m may also take
on negative values, there are no psychological (or emotional) objections to
kernels attaining negative values.
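As a concrete illustration of (12.2.5), here is a minimal Python sketch of the Nadaraya-Watson estimator with the rectangular kernel K = 1_{[−1,1)} (the simulated data and all numerical choices are assumptions for the example only, not from the text):

```python
import math
import random

def nadaraya_watson(xs, ys, a, K=lambda y: 1.0 if -1.0 <= y < 1.0 else 0.0):
    """m_n of (12.2.5): kernel-weighted average of the Y_i with X_i near x.

    Returns 0 when the denominator vanishes, as stipulated after (12.2.4)."""
    def m_n(x):
        w = [K((x - xi) / a) for xi in xs]
        s = sum(w)
        return sum(wi * yi for wi, yi in zip(w, ys)) / s if s > 0 else 0.0
    return m_n

random.seed(1)
n = 2000
xs = [random.uniform(-2.0, 2.0) for _ in range(n)]
ys = [math.sin(xi) + random.gauss(0.0, 0.2) for xi in xs]  # m(x) = sin(x)
m_n = nadaraya_watson(xs, ys, a=n ** (-1 / 5))
```

With a bandwidth of order n^{−1/5}, m_n(1.0) lies close to sin(1) ≈ 0.84, illustrating the consistency argument sketched next.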
If x is not an atom of X, more sophisticated arguments will be needed for showing, e.g., consistency of m_n(x). Here is a heuristic argument which uses the fact that m(X) = E[Y|X]. Only the means of the numerator and the denominator are considered. We have

    E[ (1/(na_n)) Σ_{i=1}^n Y_i K((x − X_i)/a_n) ] = E[ (1/a_n) Y K((x − X)/a_n) ]
      = E{ E[… |X] } = E{ (1/a_n) K((x − X)/a_n) E[Y|X] }
      = E{ (1/a_n) m(X) K((x − X)/a_n) }.

Assuming that X has a density f, the last expectation becomes

    (1/a_n) ∫ m(y) f(y) K((x − y)/a_n) dy → m(x) f(x) ∫ K(u) du

as a_n → 0, under appropriate regularity conditions on f and m. The denominator of (12.2.5) formally equals the numerator if we set Y_i ≡ 1. Thus the expectation of the denominator is likely to converge to f(x) ∫ K(u) du. Provided that f(x) > 0, the ratio therefore in the limit becomes m(x), as desired.
Originally, m_n was motivated upon assuming that (X, Y) has a (bivariate) density g. In this case

    m(x) = ∫ y g(x, y) dy / f(x),    (12.2.6)

where

    f(x) = ∫ g(x, y) dy

is the marginal density of X. Denote with

    H_n(u, v) = n^{−1} Σ_{i=1}^n 1_{X_i ≤ u, Y_i ≤ v},  u, v ∈ R,

the bivariate empirical d.f. of (X_i, Y_i), 1 ≤ i ≤ n. Set
    g_n(x, y) = a_n^{−2} ∫∫ K((x − u)/a_n) K((y − v)/a_n) H_n(du, dv)
              = a_n^{−2} n^{−1} Σ_{i=1}^n K((x − X_i)/a_n) K((y − Y_i)/a_n).

Then g_n constitutes the extension of the Rosenblatt-Parzen kernel density estimator to the bivariate case, with kernel

    K_0(u, v) = K(u) K(v).

Check that, by Fubini,

    ∫ y g_n(x, y) dy = a_n^{−2} ∫∫∫ y K((x − u)/a_n) K((y − v)/a_n) H_n(du, dv) dy
                     = a_n^{−2} ∫∫∫ y K((x − u)/a_n) K((y − v)/a_n) dy H_n(du, dv)
                     = a_n^{−1} ∫∫∫ (a_n w + v) K(w) dw K((x − u)/a_n) H_n(du, dv)
                     = a_n^{−1} ∫∫ v K((x − u)/a_n) H_n(du, dv)
                     = a_n^{−1} n^{−1} Σ_{i=1}^n Y_i K((x − X_i)/a_n),

provided that ∫ K(w) dw = 1 and ∫ w K(w) dw = 0. If in (12.2.6) g and f are replaced with g_n and

    f_n(x) = a_n^{−1} n^{−1} Σ_{i=1}^n K((x − X_i)/a_n),

respectively, then

    m_n(x) = ∫ y g_n(x, y) dy / f_n(x)

becomes (12.2.5), as expected.


Remark 12.2.1. Setting

    W_in(x) = K((x − X_i)/a_n) / Σ_{j=1}^n K((x − X_j)/a_n),

we have

    m_n(x) = Σ_{i=1}^n W_in(x) Y_i.    (12.2.7)

12.3 Consistent Nonparametric Regression

Stone (1977) considered estimation of m through estimators of the form (12.2.7) for general weights {W_n}, where W_n = W_n(x) = (W_1n(x), …, W_nn(x)) and W_in(x) = W_in(x, X_1, …, X_n):

    m_n(x) = Σ_{i=1}^n W_in(x) Y_i.

Let (X, Y) ∈ R^{d+1}, where d ≥ 1. ‖·‖ will denote some norm on R^d.

Definition 12.3.1. A sequence {m_n} is said to be consistent in L^r, r ≥ 1, if

    m_n(X) → m(X) in L^r,    (12.3.1)

i.e.,

    E ∫ |m_n(x) − m(x)|^r μ(dx) → 0.

Here μ is the distribution of X and m(x) = E[Y|X = x] is the true regression function. Without further mentioning it is assumed that E|Y|^r < ∞.
Consider the following set of assumptions:

(i) E[ Σ_{i=1}^n |W_in(X)| f(X_i) ] ≤ C E f(X) for every f ≥ 0 and n ≥ 1, some C ≥ 1.

(ii) For some D ≥ 1,

    P( Σ_{i=1}^n |W_in(X)| ≤ D ) = 1.

(iii) Σ_{i=1}^n |W_in(X)| 1_{‖X_i − X‖ > ε} → 0 in probability, for each ε > 0.

(iv) Σ_{i=1}^n W_in(X) → 1 in probability.

(v) max_{1≤i≤n} |W_in(X)| → 0 in probability.

Theorem 12.3.2. (Stone). Under (i) - (v),

    m_n(X) → m(X) in L^r.
To prove the theorem, we need a series of lemmas.
Lemma 12.3.3. Under (i) - (iii), assume that E|f(X)|^r < ∞. Then

    E[ Σ_{i=1}^n |W_in(X)| |f(X_i) − f(X)|^r ] → 0.

Proof. For δ > 0 given, let h be a continuous function on R^d with compact support such that

    E|f(X) − h(X)|^r ≤ δ.

By (i),

    E[ Σ_{i=1}^n |W_in(X)| |f(X_i) − h(X_i)|^r ] ≤ C E|f(X) − h(X)|^r ≤ Cδ.

From (ii),

    E[ Σ_{i=1}^n |W_in(X)| |f(X) − h(X)|^r ] ≤ D E|f(X) − h(X)|^r ≤ Dδ.

Altogether this shows that we need to prove the lemma only for continuous f with compact support. Set M = ‖f‖_∞. Since f is uniformly continuous, given δ > 0, we may find some ε > 0 such that ‖x − x_1‖ ≤ ε implies |f(x) − f(x_1)|^r ≤ δ. By (ii),

    E[ Σ_{i=1}^n |W_in(X)| |f(X_i) − f(X)|^r ] ≤ (2M)^r E{ Σ_{i=1}^n |W_in(X)| 1_{‖X_i − X‖ > ε} } + Dδ.

Use (ii) and (iii) to conclude that the last expectation converges to zero. This completes the proof of the lemma.

Lemma 12.3.4. Under (i) - (iii), let {W_n} be a sequence of nonnegative weights such that for nonnegative constants M_n and N_n

    P( M_n ≤ Σ_{i=1}^n W_in(X) ≤ N_n ) → 1.

Let f be a nonnegative Borel function on R^d such that E f(X) < ∞. Then

    liminf_n E Σ_{i=1}^n W_in(X) f(X_i) ≥ (liminf_n M_n) E f(X)

and

    limsup_n E Σ_{i=1}^n W_in(X) f(X_i) ≤ (limsup_n N_n) E f(X).

Proof. Set

    A_n = { M_n ≤ Σ_{i=1}^n W_in(X) ≤ N_n }.

Assume w.l.o.g. M_n ≤ D. Then

    M_n − D 1_{A_n^c} ≤ Σ_{i=1}^n W_in(X) ≤ N_n + D 1_{A_n^c}

and therefore

    M_n E f(X) − D E 1_{A_n^c} f(X) ≤ E[ Σ_{i=1}^n W_in(X) f(X) ] ≤ N_n E f(X) + D E 1_{A_n^c} f(X).

Now, P(A_n) → 1 implies E 1_{A_n^c} f(X) → 0. Consequently

    liminf_n E[ Σ_{i=1}^n W_in(X) f(X) ] ≥ (liminf_n M_n) E f(X).

Apply Lemma 12.3.3 to prove the liminf part of the lemma. A similar argument yields the limsup part.


Lemma 12.3.5. Under (i) - (iii), assume that for some constants M_n and N_n

    P( M_n ≤ Σ_{i=1}^n W_in²(X) ≤ N_n ) → 1.

Then for every nonnegative μ-integrable function f

    liminf_n E[ Σ_{i=1}^n W_in²(X) f(X_i) ] ≥ (liminf_n M_n) E f(X)

and

    limsup_n E[ Σ_{i=1}^n W_in²(X) f(X_i) ] ≤ (limsup_n N_n) E f(X).

Proof. Apply the last lemma to {W_n²}, with C and D replaced by CD and D², respectively.

Lemma 12.3.6. Under (i) - (iii), for every Borel function f,

    Σ_{i=1}^n |W_in(X)| 1_{|f(X_i) − f(X)| > ε} → 0 in probability

for each ε > 0.

Proof. Let ε > 0 be given. Assume first that f is bounded. By Lemma 12.3.3,

    E[ Σ_{i=1}^n |W_in(X)| |f(X_i) − f(X)| ] → 0

and therefore

    Σ_{i=1}^n |W_in(X)| 1_{|f(X_i) − f(X)| > ε} → 0 in probability.

For an arbitrary f, we may choose an M > 0 such that P(|f(X)| > M) ≤ δ, a given small positive constant. Set f̃ = (f ∧ M) ∨ (−M). Since

    {f̃(X_i) ≠ f(X_i)} ⊂ {|f(X_i)| > M},

we obtain

    Σ_{i=1}^n |W_in(X)| 1_{|f(X_i) − f(X)| > ε}
      ≤ Σ_{i=1}^n |W_in(X)| 1_{|f̃(X_i) − f̃(X)| > ε} + Σ_{i=1}^n |W_in(X)| 1_{|f(X_i)| > M or |f(X)| > M}.

The first sum converges to zero in probability, by boundedness of f̃. The expectation of the second sum is less than or equal to (use (i) and (ii))

    (C + D) P(|f(X)| > M) ≤ (C + D)δ.

Lemma 12.3.7. Under (i) - (iv), assume that f(X) ∈ L^r, some r ≥ 1. Then

    Σ_{i=1}^n W_in(X) f(X_i) → f(X) in L^r.

Proof. By (ii),

    | Σ_{i=1}^n W_in(X) − 1 |^r ≤ (1 + D)^r with probability one.

Thus, by (iv),

    E[ | Σ_{i=1}^n W_in(X) − 1 |^r |f(X)|^r ] → 0.    (12.3.2)

Apply (ii), Lemma 12.3.3 and Holder's inequality to obtain

    E| Σ_{i=1}^n W_in(X) [f(X_i) − f(X)] |^r → 0.    (12.3.3)

Clearly, (12.3.2) and (12.3.3) imply the assertion of the lemma.

We are now in the position to give the

Proof of Theorem 12.3.2. Assume (i) - (v), and let r ≥ 1. Since E|Y|^r < ∞ by assumption,

    E|m(X)|^r = E|E(Y|X)|^r ≤ E|Y|^r < ∞

by Jensen's inequality. Now,

    Σ_{i=1}^n W_in(X) Y_i − m(X) = [ Σ_{i=1}^n W_in(X) m(X_i) − m(X) ] + Σ_{i=1}^n W_in(X) Z_i,

where

    Z_i = Y_i − E[Y_i | X_i] = Y_i − m(X_i).

The term in brackets converges to zero in L^r by Lemma 12.3.7. For the second sum consider r = 2 first. It follows from the very definition of m and the independence of the data that

    E[ Σ_{i=1}^n W_in(x) Z_i ]² = Σ_{i=1}^n Σ_{j=1}^n E[W_in(x) W_jn(x) Z_i Z_j]
      = Σ_{i=1}^n Σ_{j=1}^n E[ E(W_in(x) W_jn(x) Z_i Z_j | X_1, …, X_n) ]
      = Σ_{i=1}^n Σ_{j=1}^n E[ W_in(x) W_jn(x) E(Z_i Z_j | X_1, …, X_n) ]
      = Σ_{i=1}^n E[ W_in²(x) E(Z_i² | X_i) ] = Σ_{i=1}^n E[ W_in²(x) h(X_i) ],

say. It follows that

    E[ Σ_{i=1}^n W_in(X) Z_i ]² = E[ Σ_{i=1}^n W_in²(X) h(X_i) ].

By Lemma 12.3.5, the limsup of the right-hand side is less than or equal to

    (limsup_n N_n) E h(X),    (12.3.4)

where N_n is such that

    P( Σ_{i=1}^n W_in²(X) ≤ N_n ) → 1.

By (ii) and (v),

    Σ_{i=1}^n W_in²(X) → 0 in probability,

i.e., for N_n we may take N_n ≡ δ, where δ > 0 is any small number. Expression (12.3.4) thus becomes δ E h(X). Since δ is arbitrary, this proves the theorem.

Remark 12.3.8. Observing that

    W_in(X) = W_in(X; X_1, …, X_n)

is a measurable function of the i.i.d. random vectors X, X_1, …, X_n, we get

    E[|W_in(X)| f(X_i)] = E[|W_in(X_i; X_1, …, X, …, X_n)| f(X)],

so that (i) amounts to

    E[ { Σ_{i=1}^n |W_in(X_i; X_1, …, X, …, X_n)| } f(X) ] ≤ C E f(X),

any f ≥ 0. In particular, (i) is satisfied whenever

(i)* Σ_{i=1}^n |W_in(X_i; X_1, …, x, …, X_n)| ≤ C, each x ∈ R^d,    (12.3.5)

holds.
Remark 12.3.9. Let g be a Borel-measurable function such that g(Y) ∈ L^r. Replacing Y_i by g(Y_i), we obtain

    m_n(x) = Σ_{i=1}^n W_in(x) g(Y_i),

which, under (i) - (v), is consistent for

    m(x) = E[g(Y) | X = x].

As an example, take g = 1_{(−∞, t]}; then

    m_n(x) = m_n(t; x) = Σ_{i=1}^n W_in(x) 1_{Y_i ≤ t}

and

    m(x) = m(t; x) = P(Y ≤ t | X = x),

the empirical and the true conditional d.f. at t given X = x. Similarly, combining m_n for g(y) = y² and g(y) = y, we obtain

    σ_n²(x) = Σ_{i=1}^n W_in(x) Y_i² − [ Σ_{i=1}^n W_in(x) Y_i ]²

as an estimator for the conditional variance

    E[Y² | X = x] − {E[Y | X = x]}² = Var(Y | X = x).

12.4 Nearest-Neighbor Regression Estimators

In this section we shall introduce and discuss a family of weights satisfying (i)* from the last section. Also, all of the other conditions in Theorem 12.3.2 will be seen to hold. To be specific, let X_1, …, X_n be a random sample in R^d with distribution μ. Fix x ∈ R^d. Set d_i := ‖x − X_i‖, a nonnegative random variable. Clearly, d_1, …, d_n are i.i.d. Denote with d_{1:n} ≤ d_{2:n} ≤ … ≤ d_{n:n} the ordered d_i's.

Definition 12.4.1. For 1 ≤ k ≤ n, X_i is called the k-nearest neighbor (k-NN) of x iff d_i = d_{k:n}. If the d's have ties, we may break them by attaching independently a uniform Z_i to X_i and then replacing X_i by (X_i, Z_i), thus embedding X_i into R^{d+1}. X_i is said to be among the k-NN if d_i = d_{r:n}, r ≤ k.

Definition 12.4.2. For any μ, supp(μ) is the smallest closed set S for which μ(S) = 1.

Remark 12.4.3. In Topological Measure Theory it is shown that supp(μ) exists.

In the following, k = k_n may and will depend on n as n → ∞.

Lemma 12.4.4. Assume k_n/n → 0 as n → ∞. Then for each x ∈ supp(μ) we have

    d_{k_n:n} = d_{k_n:n}(x) → 0 with probability one,

i.e., the k_n-NN converges to x P-a.s.
Proof. For each ε > 0, denote with

    S_ε(x) = {y : ‖x − y‖ < ε}

the open ε-ball with center x. For x ∈ supp(μ), we have μ(S_ε(x)) > 0. In fact, from μ(S_ε(x)) = 0 we would obtain μ(A) = 1, where A = supp(μ) \ S_ε(x) is a closed set strictly contained in supp(μ), a contradiction. Now, from the SLLN,

    μ_n(S_ε(x)) → μ(S_ε(x)) ≡ a > 0  P-a.s.

Hence,

    Σ_{i=1}^n 1_{X_i ∈ S_ε(x)} ≥ na/2 ≥ k_n

eventually with probability one. Conclude d_{k_n:n} < ε. This proves the lemma.

In the following, replace x by X, so that d_{k_n:n} becomes d_{k_n:n}(X).

Lemma 12.4.5. Under k_n/n → 0 as n → ∞,

    d_{k_n:n}(X) → 0 with probability one.

Proof. Set

    A = {(ω, x) ∈ Ω × R^d : d_{k_n:n}(x) → 0}.

Then A ∈ 𝒜 ⊗ 𝔅^d. From Lemma 12.4.4,

    P(A^x) = 1 for each x ∈ supp(μ).

By independence of X and the training sample,

    P(d_{k_n:n}(X) → 0) = ∫_{supp(μ)} P(A^x) μ(dx) = 1.

We now introduce the k_n-NN weights. For x ∈ R^d, set

    W_in(x) = 1/k_n if X_i is among the k_n-NN of x, and W_in(x) = 0 otherwise.

Lemma 12.4.6. Under k_n/n → 0, we have for each ε > 0

    Σ_{i=1}^n |W_in(X)| 1_{‖X_i − X‖ > ε} → 0 in probability.

This is condition (iii) in Theorem 12.3.2.


Proof. Set

    A_n := {d_{k_n:n}(X) > ε}.

Lemma 12.4.5 yields P(A_n) → 0 as n → ∞. Furthermore,

    Σ_{i=1}^n |W_in(X)| 1_{‖X_i − X‖ > ε} ≤ 1_{A_n},

whence the result.

Since W_in(x) ≥ 0,

    Σ_{i=1}^n W_in(x) = 1

and

    max_{1≤i≤n} W_in(x) = k_n^{−1} → 0  (provided that k_n → ∞),

also conditions (ii), (iv) and (v) are satisfied. It remains to verify condition (i) resp. (i)*:

(i)* Σ_{i=1}^n W_in(X_i; X_1, …, x, …, X_n) ≤ C, each x ∈ R^d.

Notice that the left-hand side of (i)* equals k_n^{−1} times the number of i's for which x is among the k_n-NN of X_i. We first treat the case k_n = 1.
We rst treat the case kn = 1.
Lemma 12.4.7. (Bickel and Breiman). For each d ≥ 1 and any norm ‖·‖, there exists ν(d) < ∞ such that it is possible to write R^d as the union of ν(d) disjoint cones C_1, …, C_{ν(d)} with 0 as their common peak such that if x, y ∈ C_j (x, y ≠ 0), then ‖x − y‖ < max(‖x‖, ‖y‖), j = 1, …, ν(d).

Proof. Set

    S(0, 1) = {x ∈ R^d : ‖x‖ = 1},

the unit sphere in R^d. Since S(0, 1) is compact, we can find disjoint sets C̃_1, …, C̃_{ν(d)} such that

    ⋃_{j=1}^{ν(d)} C̃_j = S(0, 1) and ‖x̃ − ỹ‖ < 1 for x̃, ỹ ∈ C̃_j.

Let

    C_j = {β x̃ : x̃ ∈ C̃_j, β ≥ 0},  1 ≤ j ≤ ν(d).

Suppose x = β x̃, y = γ ỹ with x̃, ỹ ∈ C̃_j. W.l.o.g., assume β ≤ γ. Then

    ‖x − y‖ = ‖β x̃ − γ ỹ‖ ≤ β ‖x̃ − ỹ‖ + (γ − β) ‖ỹ‖ < β + (γ − β) = γ = ‖y‖.

Corollary 12.4.8. For any set of n distinct points in R^d, say x_1, …, x_n, x_1 can be the NN (k = 1) of at most ν(d) points. Hence, for k_n = 1, (i)* holds with C = ν(d).

Proof. Assume x_1 = 0 w.l.o.g. Lemma 12.4.7 yields that in each C_j there exists at most one point for which x_1 is a NN. In fact, for two points x, y from the same C_j such that ‖x‖ ≤ ‖y‖, we have ‖x − y‖ < ‖y‖, i.e., 0 cannot be a NN of y.

Corollary 12.4.9. For a general k_n, x_1 can be among the k_n-NN of at most k_n ν(d) points. Hence (i)* again holds true with C = ν(d).

Proof. Similar to the last Corollary.

Altogether, we have shown that the k_n-NN weights satisfy the assumptions of Theorem 12.3.2. Hence we may conclude:

Theorem 12.4.10. Under k_n/n → 0 and k_n → ∞, the k_n-NN regression estimate

    m_n(X) = Σ_{i=1}^n W_in(X) Y_i

fulfills

    m_n(X) → m(X) in L^r.

Notice that Theorem 12.4.10 holds under no assumption on the underlying distribution of (X, Y) (up to E|Y|^r < ∞). Therefore we say that m_n(X) is universally consistent.

12.5 Nonparametric Classification

Suppose that Y is some variable attaining only finitely many values, say {1, …, m}. Rather than Y, we observe some random vector X, which typically is correlated with Y. The problem of classification is one of specifying the value of Y on the basis of X. In other words, we are looking for a function (or rule) g of X such that g(X) may serve as a substitute for Y. Denote with

    P(g(X) ≠ Y) the probability of misclassification.

g_0 is optimal if the probability of misclassification is minimal:

    inf_g P(g(X) ≠ Y) ≡ L* = P(g_0(X) ≠ Y).

In the following we shall derive the optimal rule. Now,

    L* = 1 − sup_g P(g(X) = Y).

But

    P(g(X) = Y) = Σ_{i=1}^m P(g(X) = i, Y = i) = Σ_{i=1}^m ∫_{g(X)=i} P(Y = i | X) dP
                = Σ_{i=1}^m ∫_{x : g(x)=i} p_i(x) μ(dx)
                ≤ Σ_{i=1}^m ∫_{x : g(x)=i} max_j p_j(x) μ(dx) = ∫ max_j p_j(x) μ(dx),    (12.5.1)

where as in the previous section μ denotes the distribution of X and

    p_i(x) = P(Y = i | X = x)

denotes the so-called a posteriori probability of {Y = i}. Notice that we obtain equality in (12.5.1) if g is such that

    g(x) = i iff p_i(x) = max_j p_j(x).    (12.5.2)

The rule g_0 defined by (12.5.2) is called the Bayes rule. For this g_0, we have

    P(g_0(X) = Y) = ∫ max_j p_j(x) μ(dx)

and therefore

    L* = 1 − E(max_{1≤j≤m} p_j(X)), the Bayes risk.

This is the minimal probability of misclassification attainable. The point now is that for (12.5.2) we need to know p_j, 1 ≤ j ≤ m. Since in practice this is not possible, they need to be estimated from a learning sample (X_i, Y_i), 1 ≤ i ≤ n. Set

    p_jn(x) = Σ_{i=1}^n W_in(x) 1_{Y_i = j},  1 ≤ j ≤ m.

Under the conditions of Theorem 12.3.2, for each 1 ≤ j ≤ m,

    p_jn(X) → p_j(X) in L^r as n → ∞.    (12.5.3)

According to (12.5.2) it is tempting to consider the empirical Bayes rule

    ĝ_n(x) = i iff p_in(x) = max_j p_jn(x)

as a substitute for g_0. From Theorem 12.4.10 we already know that (12.5.3) holds for the k_n-NN rule. In this case p_jn(x) is the relative number of Y's being equal to j subject to the condition that the pertaining X is among the k_n-NN of x. The resulting ĝ_n is also called the majority vote.
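The majority vote is easily sketched in Python (one-dimensional X, two classes with Gaussian class-conditional distributions; all numerical choices are illustrative assumptions, not from the text):

```python
import random

def knn_classify(xs, labels, k):
    """Empirical Bayes rule with k-NN weights: p_jn(x) is the fraction of
    label j among the k nearest neighbors of x; return the maximizing label."""
    def g_n(x):
        nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))[:k]
        votes = {}
        for i in nearest:
            votes[labels[i]] = votes.get(labels[i], 0) + 1
        return max(votes, key=votes.get)
    return g_n

random.seed(3)
n = 1000
xs, labels = [], []
for _ in range(n):
    y = random.choice([1, 2])                     # two classes, equal priors
    xs.append(random.gauss(0.0 if y == 1 else 2.0, 1.0))
    labels.append(y)
g_n = knn_classify(xs, labels, k=50)
```

For this model the Bayes rule classifies x as 1 iff x < 1; away from that boundary the majority vote recovers the same decision.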
Definition 12.5.1. A sequence {g_n}_n of rules is said to be consistent in Bayes risk if

    L_n ≡ P(g_n(X) ≠ Y) → L* as n → ∞.

Theorem 12.5.2. Suppose that ĝ_n is an empirical Bayes rule such that (12.5.3) holds. Then {ĝ_n} is consistent in Bayes risk.

Proof. Set

    e_n(X) = max_{1≤j≤m} |p_jn(X) − p_j(X)|.

From (12.5.3) we obtain

    E e_n(X) → 0 as n → ∞.    (12.5.4)

Now, by definition of L*,

    L* ≤ P(ĝ_n(X) ≠ Y | X_1, Y_1, …, X_n, Y_n)  P-a.s.

Integrate out to get

    L* ≤ P(ĝ_n(X) ≠ Y).

Furthermore,

    P(ĝ_n(X) = Y | X, X_1, Y_1, …, X_n, Y_n) = Σ_{i=1}^m 1_{ĝ_n(X)=i} p_i(X)
      = Σ_{i=1}^m 1_{ĝ_n(X)=i} p_in(X) + Σ_{i=1}^m 1_{ĝ_n(X)=i} [p_i(X) − p_in(X)]
      = max_j p_jn(X) + Σ_{i=1}^m 1_{ĝ_n(X)=i} [p_i(X) − p_in(X)]
      ≥ max_j p_jn(X) − e_n(X) ≥ max_j p_j(X) − 2 e_n(X),

whence

    P(ĝ_n(X) = Y) ≥ ∫ max_j p_j(x) μ(dx) − 2 E e_n(X) = 1 − L* − 2 E e_n(X).

From (12.5.4) we may infer

    limsup_n P(ĝ_n(X) ≠ Y) ≤ L*,

as desired.

Remark 12.5.3. The consistency in Bayes risk of the k_n-NN rule was first proved (under further regularity conditions) by Fix and Hodges (1951).

Remark 12.5.4. Much attention has also been given to the case k ≡ 1, i.e., Y is classified as the value of that Y_i for which X_i is the nearest neighbor of X. It may be shown that

    L = lim_n P(g_n(X) ≠ Y)

exists, with

    L* ≤ L ≤ L* ( 2 − (m/(m−1)) L* ).

Classical paper: Cover and Hart (1967), IEEE Transactions on Information Theory 13, 21-27.

An important issue in pattern classification is that of estimating the unknown probability of misclassification. L_n is the overall probability of misclassification, i.e., it constitutes the mean risk averaged over all outcomes of X and Y and the training sample. A more informative quantity, which represents the probability of misclassification given the particular sample X, (X_i, Y_i), 1 ≤ i ≤ n, on hand, is the conditional probability

    P(ĝ_n(X) ≠ Y | X, X_1, Y_1, …, X_n, Y_n) = L_n1.

Another quantity, which is of interest when only the training sample is known but X has not been sampled so far, is

    P(ĝ_n(X) ≠ Y | X_1, Y_1, …, X_n, Y_n) = L_n2.

Note that both L_n1 and L_n2 are random. Recall that in the proof of Theorem 12.5.2 we have seen that

    1 − L_n1 = Σ_{i=1}^m 1_{ĝ_n(X)=i} p_i(X) ≥ max_j p_j(X) − 2 e_n(X).    (12.5.5)

Integrating with respect to X leads to

    1 − L_n2 = P(ĝ_n(X) = Y | X_1, Y_1, …, X_n, Y_n) = Σ_{i=1}^m ∫ 1_{ĝ_n(x)=i} p_i(x) μ(dx)
             ≥ ∫ max_j p_j(x) μ(dx) − 2 ∫ e_n(x) μ(dx) = 1 − L* − 2 ∫ e_n(x) μ(dx).    (12.5.6)

Theorem 12.5.5. Under the assumptions of Theorem 12.5.2,

    L_n2 → L* in probability.

Proof. Similar to the proof of Theorem 12.5.2.

Theorem 12.5.5 asserts that L_n2 is also a consistent estimate of L*. Though L_n2 is measurable w.r.t. the training sample, it cannot be computed, due to the fact that the p_i, 1 ≤ i ≤ m, in

    1 − L_n2 = Σ_{i=1}^m ∫_{ĝ_n = i} p_i(x) μ(dx)


are not known. It is tempting, however, to consider


n2 =
1L

i=1

pin (x)(dx)

{
gn =i}

n2 is called the apparent error rate.


instead. L
Corollary 12.5.6. Under the assumptions of Theorem 12.5.2,
n2 L in probability.
L

Proof. In view of Theorem 12.5.5 it remains to prove that


n2 0 in probability.
Ln2 L
But
n2 |
|Ln2 L

i=1

|pi (x) pin (x)|(dx)

|pi (x) pin (x)|(dx).

i=1

{
gn =i}

The last term converges to zero in the mean and hence in probability.

Notice that L̂_n2 is computable only when μ is known. If μ is unknown, L̂_n2 is redefined so as to become

    1 − L̂_n2 = n^{−1} Σ_{j=1}^n max_{1≤i≤m} p_in(X_j).

In the literature it is often reported that L̂_n2 underestimates L_n2. We now introduce two variants of L̂_n2:

L̂_n2^CV, the cross-validated estimate of L_n2: for this, compute p_{i,n−1} as before, but this time for (X_1, Y_1), …, (X_{j−1}, Y_{j−1}), (X_{j+1}, Y_{j+1}), …, (X_n, Y_n), i.e., for the whole sample with (X_j, Y_j) deleted. Write p_{i,n−1}^(j) for this p and put

    1 − L̂_n2^CV = n^{−1} Σ_{j=1}^n max_{1≤i≤m} p_{i,n−1}^(j)(X_j).

L̂_n2^H, the half-sample estimate of L_n2: choose 1 ≤ k < n, and compute p_{i,n−k} from the sample (X_1, Y_1), …, (X_{n−k}, Y_{n−k}). Use X_{n−k+1}, …, X_n for estimation of μ. This results in

    1 − L̂_n2^H = k^{−1} Σ_{j=n−k+1}^n max_{1≤i≤m} p_{i,n−k}(X_j).

Note that for L̂_n2^H we do not require Y_{n−k+1}, …, Y_n.

Note 12.5.7. Versions of L̂_n2^CV and L̂_n2^H which are more widely used in the literature are (in obvious notation)

    L̂_n2^CV = n^{−1} Σ_{i=1}^n 1_{ĝ_n^(i)(X_i) ≠ Y_i}

and

    L̂_n2^H = k^{−1} Σ_{j=n−k+1}^n 1_{ĝ_{n,n−k}(X_j) ≠ Y_j}.

12.6 Smoothed Empirical Integrals

Consider an i.i.d. sequence X_1, X_2, … from a density f. For a given bandwidth a > 0, set

    f_n(x) = (1/(na)) Σ_{i=1}^n K((x − X_i)/a) = (1/a) ∫ K((x − y)/a) F_n(dy).

Set

    F̃_n(t) = ∫_{−∞}^t f_n(x) dx,  t ∈ R,

the smoothed empirical d.f. For a given score-function φ, the smoothed empirical φ-integral equals

    I = ∫ φ(x) F̃_n(dx) = ∫ φ(x) f_n(x) dx
      = ∫∫ (1/a) φ(x) K((x − y)/a) dx F_n(dy) ≡ ∫ φ_a(y) F_n(dy).

We have

    E(I) = ∫ φ_a(y) F(dy) ≡ μ_a

and

    n Var(I) = ∫ [φ_a − μ_a]² F(dy) ≡ σ²(a),

provided φ_a is square-integrable.
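The identity I = ∫ φ dF̃_n = ∫ φ_a dF_n can be checked numerically. In the sketch below (illustrative Python, not from the text) we take φ(x) = x and a Gaussian kernel K; then φ_a(y) = (1/a) ∫ x K((x − y)/a) dx = y exactly, so the second route reduces to the sample mean, which a Riemann-sum evaluation of the first route must reproduce:

```python
import math
import random

def gauss_kernel(y):
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)

random.seed(5)
n, a = 300, 0.3
data = [random.gauss(1.0, 1.0) for _ in range(n)]

def f_n(x):
    # Rosenblatt-Parzen estimator with Gaussian kernel and bandwidth a
    return sum(gauss_kernel((x - xi) / a) for xi in data) / (n * a)

# Route 1: I = integral of phi(x) f_n(x) dx with phi(x) = x,
# approximated by a Riemann sum on a wide grid.
grid = [-5.0 + 0.01 * j for j in range(1201)]
I_smooth = sum(x * f_n(x) * 0.01 for x in grid)

# Route 2: I = integral of phi_a dF_n = (1/n) sum_i phi_a(X_i);
# for phi(x) = x and this symmetric kernel, phi_a(y) = y.
I_conv = sum(data) / n
```

The agreement of the two routes is exact up to discretization and tail-truncation error of the Riemann sum.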
In the following lemma we formulate some conditions on $f$ and $\varphi$ under which the above integrals converge to those with $\varphi$ in place of $\varphi_a$. For this, note that
$$\mu_a = \int \varphi(x) \Big[\frac{1}{a} \int f(y) K\Big(\frac{x - y}{a}\Big)\,dy\Big]\,dx.$$
Under some mild conditions on $K$ the inner integrals converge to $f(x)$ for all $x$ up to a Lebesgue null set. To show
$$\mu_a \to \int \varphi\, dF \qquad \text{as } a \to 0 \qquad (12.6.1)$$
we need to justify the assumptions in the Lebesgue dominated convergence theorem. Whenever $K(x) = T(\|x\|)$, $T$ nonincreasing, by Wheeden and Zygmund, p. 156,
$$\sup_{a > 0} |f * K_a(x)| \le c f^*(x) \qquad (12.6.2)$$
for some generic constant $c$. Here, for any integrable function $g$,
$$g^*(x) = \sup_{Q} \frac{1}{\lambda(Q)} \int_Q |g(y)|\,dy$$
denotes the Hardy-Littlewood maximal function of $g$, where $Q$ is an interval centered at $x$ and $\lambda(Q)$ is the Lebesgue measure of $Q$. This definition extends to the multivariate case, in which the supremum is taken over all cubes with center $x$ and edges parallel to the coordinate axes.
As a conclusion, (12.6.2) implies (12.6.1), provided that $\varphi f^*$ is Lebesgue-integrable.
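The bound (12.6.2) can be checked numerically. The following sketch (grid-based approximations; the Gaussian choices for $f$ and $K$ are illustrative assumptions) approximates $f^*$ by maximizing interval averages over a finite set of radii and verifies $f * K_a(x) \le c f^*(x)$ with $c = \|K\|_1 = 1$ here:

```python
import numpy as np

# grid for a density f (standard normal, an illustrative choice)
x_grid = np.linspace(-10, 10, 8001)
dx = x_grid[1] - x_grid[0]
f = np.exp(-x_grid**2 / 2) / np.sqrt(2 * np.pi)

def f_star(x):
    """Hardy-Littlewood maximal function f*(x): sup over intervals Q centered at x
    of the average of |f| over Q; the sup is approximated over a grid of radii."""
    best = 0.0
    for r in np.linspace(0.01, 8.0, 400):
        inside = np.abs(x_grid - x) <= r
        best = max(best, f[inside].sum() * dx / (2 * r))
    return best

def f_conv_Ka(x, a):
    """(f * K_a)(x) for a standard normal kernel K, with K_a(u) = K(u/a)/a."""
    Ka = np.exp(-((x - x_grid) / a) ** 2 / 2) / (a * np.sqrt(2 * np.pi))
    return (f * Ka).sum() * dx

# numerical check of (12.6.2): sup_a f*K_a(x) <= c f*(x), c = ||K||_1 = 1
for x0 in (0.0, 1.0, 2.5):
    bound = f_star(x0)
    assert all(f_conv_Ka(x0, a) <= 1.01 * bound for a in (0.05, 0.3, 1.0, 3.0))
print("maximal-function bound holds on the test points")
```

For small $a$ the convolution concentrates near $x$, for large $a$ it averages widely; in both regimes the maximal function dominates, which is exactly what makes dominated convergence applicable.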
As to
$$\int [\varphi_a - \mu_a]^2\, F(dy) \to \int [\varphi - \mu]^2\, F(dy),$$
when (12.6.1) is satisfied, we need to show
$$\int \varphi_a^2(y)\, F(dy) \to \int \varphi^2(y)\, F(dy). \qquad (12.6.3)$$

We have
$$\Big|\int [\varphi_a^2 - \varphi^2]\, dF\Big| \le \int |\varphi_a - \varphi|\,|\varphi_a + \varphi|\, dF \le \|\varphi_a - \varphi\|_2 \big(\|\varphi_a - \varphi\|_2 + 2\|\varphi\|_2\big),$$
the $\|\cdot\|_2$ being taken in $L^2(F)$. First, by Cauchy-Schwarz,
$$\int \varphi_a^2(y)\, F(dy) \le \int \frac{1}{a} \int \varphi^2(x) K\Big(\frac{x - y}{a}\Big)\,dx\, F(dy) = \int \varphi^2(x)\, f * K_a(x)\,dx \le c \int \varphi^2(x) f^*(x)\,dx < \infty,$$
provided $\varphi^2 f^* \in L^1(\mathbb{R})$. Finally, using $\int K(x)\,dx = 1$ and Cauchy-Schwarz,
$$\int (\varphi_a - \varphi)^2\, dF \le \int \frac{1}{a} \int [\varphi(x) - \varphi(y)]^2 K\Big(\frac{x - y}{a}\Big)\,dx\, F(dy) = \int \frac{1}{a} \int [\varphi(x) - \varphi(y)]^2 f(y) K\Big(\frac{x - y}{a}\Big)\,dy\,dx.$$
One can show that
$$\frac{1}{a} \int [\varphi(x) - \varphi(y)]^2 f(y) K\Big(\frac{x - y}{a}\Big)\,dy \to 0$$
for all $x$ up to a Lebesgue null set, since $\varphi f \in L^1(\mathbb{R})$ and $\varphi^2 f \in L^1(\mathbb{R})$. Application of the dominated convergence theorem follows as before. In summary, we get
Lemma 12.6.1. Provided $\varphi^2 f^* \in L^1(\mathbb{R})$ and $\varphi f^* \in L^1(\mathbb{R})$, under standard regularity assumptions on $K$,
$$\mathbb{E}(I) \to \int \varphi\, dF \qquad \text{and} \qquad n \operatorname{Var}(I) \to \int [\varphi - \mu]^2\, dF$$
as $a \to 0$.

Now, consider the stochastic process
$$Z_n(a) \equiv Z_n = n^{1/2} \int \varphi_a(y)\,[F_n(dy) - F(dy)].$$


For $a > 0$ fixed, the CLT easily yields
$$Z_n(a) \to N(0, \sigma^2(a)) \qquad (12.6.4)$$
in distribution.
Setting $\varphi_0(y) = \varphi(y)$, (12.6.4) also holds for $a = 0$ when $\varphi \in L^2(F)$. If we let $a$ depend on $n$ in such a way that $a = a_n \to 0$ as $n \to \infty$, then
$$\operatorname{Var}\big(Z_n(a_n) - Z_n(0)\big) \le \int [\varphi_{a_n} - \varphi]^2\, dF \to 0,$$
as was shown before. As a conclusion we get the following

Theorem 12.6.2. Under the assumptions of Lemma 12.6.1, with $a_n \to 0$ as $n \to \infty$,
$$n^{1/2} \int \varphi_{a_n}(y)\,[F_n(dy) - F(dy)] \to N(0, \sigma^2(0))$$
in distribution.

Under an appropriate smoothness assumption on $f$, for $K$ even,
$$\int \varphi_a(y)\, F(dy) = \int \varphi(x)\, F(dx) + \frac{a^2}{2} \int\!\!\int z^2 K(z)\, \varphi(x) f''(x)\,dz\,dx + O(a^3).$$
We see that
$$n^{1/2} \Big[\int \varphi_{a_n}(y)\, F_n(dy) - \int \varphi(y)\, F(dy)\Big] \to N(0, \sigma^2(0))$$
provided that $n^{1/2} a_n^2 \to 0$. Under different conditions on $a_n$ it may happen that the normal limit has a non-zero expectation.
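A small Monte Carlo experiment illustrates this central limit theorem for the smoothed empirical integral. The choices below ($F$ standard normal, uniform smoothing kernel, $\varphi = 1_{\{x \le 0\}}$, so that $\sigma^2(0) = F(0)(1 - F(0)) = 1/4$) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def phi_a(y, a):
    """Smoothed version of phi = 1{x <= 0} for the uniform kernel K = (1/2)1_[-1,1]:
    phi_a(y) = P(y + aU <= 0), U ~ Uniform[-1,1], i.e. clip((a - y)/(2a), 0, 1)."""
    return np.clip((a - y) / (2 * a), 0.0, 1.0)

a, n, B = 0.1, 1000, 3000

# centering constant mu_a = ∫ phi_a dF by quadrature (F = N(0,1) here)
x = np.linspace(-8, 8, 20001)
dx = x[1] - x[0]
dens = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
mu_a = np.sum(phi_a(x, a) * dens) * dx

# B replicates of Z_n(a) = n^{1/2} ∫ phi_a(y) [F_n(dy) - F(dy)]
Z = np.array([np.sqrt(n) * (phi_a(rng.normal(size=n), a).mean() - mu_a)
              for _ in range(B)])

# limit: N(0, sigma^2(0)) with sigma^2(0) = F(0)(1 - F(0)) = 1/4 for this phi
print(round(Z.mean(), 2), round(Z.var(), 2))
```

The sample mean of the replicates is near 0 and their variance near 1/4, the unsmoothed limit variance, even though each replicate uses the smoothed score $\varphi_a$.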


Chapter 13

Conditional U-Statistics

13.1 Introduction and Main Result

Let $(X, Y)$ be a random vector in $\mathbb{R}^{d+1}$. The regression function
$$m(x) = \mathbb{E}[Y \mid X = x]$$
then is a convenient tool to describe the dependence between $X$ and $Y$. Since in practice $m$ will be unknown, it needs to be estimated from a sample $(X_i, Y_i)$, $1 \le i \le n$. Starting with Nadaraya (1964) and Watson (1964), by now there is a huge literature on nonparametric estimation of $m$. Stone (1977) investigated estimates of the form
$$m_n(x) = \sum_{i=1}^{n} W_{in}(x)\, Y_i.$$
He could show that under very broad assumptions on the weights, for $p \ge 1$,
$$\mathbb{E}\big[|m_n(X) - m(X)|^p\big] \to 0. \qquad (13.1.1)$$
In particular, he was able to verify these assumptions for nearest-neighbor-type estimators. Window estimators were dealt with independently by Devroye and Wagner (1980) and Spiegelman and Sacks (1980). Devroye (1981) considered $L^p$-convergence at a point:
$$\mathbb{E}\big[|m_n(x_0) - m(x_0)|^p\big] \to 0. \qquad (13.1.2)$$
Both (13.1.1) and (13.1.2) have important implications on the Bayes-risk consistency of empirical Bayes rules in discrimination.


There are several situations in which one is not only interested in the relation between a single $X$ and the corresponding $Y$, but in the dependence structure of a few $X$'s and a function of the associated $Y$'s. To be precise, let $h = h(Y_1, \ldots, Y_k)$ be a $p$-times integrable function of $Y_1, \ldots, Y_k$, $p \ge 1$. $h$ is called a kernel, and $k$ is its degree. The $Y$'s need not be real but may be random vectors in any Euclidean space. Put, for $x = (x_1, \ldots, x_k)$,
$$m(x) = \mathbb{E}\big[h(Y_1, \ldots, Y_k) \mid X_1 = x_1, \ldots, X_k = x_k\big].$$
The problem of estimating $m$ was studied in Stute (1991). As estimators we considered generalizations of the Nadaraya-Watson estimator, namely
$$m_n(x) = \frac{\sum_{\beta} h(Y_{\beta_1}, \ldots, Y_{\beta_k}) \prod_{j=1}^{k} K[(x_j - X_{\beta_j})/b_n]}{\sum_{\beta} \prod_{j=1}^{k} K[(x_j - X_{\beta_j})/b_n]}. \qquad (13.1.3)$$
Here $K$ is a smoothing kernel and $b_n > 0$ is a bandwidth tending to zero at an appropriate rate. Summation extends over all permutations $\beta = (\beta_1, \ldots, \beta_k)$ of length $k$, i.e., over all pairwise distinct $\beta_1, \ldots, \beta_k$ taken from $1, \ldots, n$. Clearly, setting
$$W_n^{\beta}(x) \equiv W^{\beta}(x) = \frac{\prod_{j=1}^{k} K[(x_j - X_{\beta_j})/b_n]}{\sum_{\beta} \prod_{j=1}^{k} K[(x_j - X_{\beta_j})/b_n]}$$
and $Y_{\beta} = (Y_{\beta_1}, \ldots, Y_{\beta_k})$, we have
$$m_n(x) = \sum_{\beta} W^{\beta}(x)\, h(Y_{\beta}). \qquad (13.1.4)$$
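For the window kernel, (13.1.3) reduces to a plain average of $h(Y_\beta)$ over all index tuples $\beta$ whose $X$'s fall into the windows around $x_1, \ldots, x_k$. A brute-force sketch (not from the book; the regression model and the kernel $h$ in the demo are illustrative assumptions):

```python
import numpy as np
from itertools import permutations

def cond_u_stat(X, Y, h, x, b):
    """Window-kernel conditional U-statistic, i.e. (13.1.3) with K = 1_{[-1,1]}:
    average of h over all k-tuples of pairwise distinct indices whose X-values
    fall into the windows of half-width b around x_1, ..., x_k."""
    n, k = len(X), len(x)
    num = den = 0.0
    for beta in permutations(range(n), k):
        if all(abs(x[j] - X[beta[j]]) <= b for j in range(k)):
            num += h(*(Y[i] for i in beta))
            den += 1.0
    return num / den if den > 0 else float("nan")

# demo: estimate m(x1, x2) = P(Y1 <= Y2 | X1 = x1, X2 = x2)
# in the (assumed) model Y = X + 0.1 * eps, eps standard normal
rng = np.random.default_rng(3)
X = rng.uniform(0, 1, 60)
Y = X + 0.1 * rng.normal(size=60)
h = lambda y1, y2: 1.0 if y1 <= y2 else 0.0
est = cond_u_stat(X, Y, h, x=(0.2, 0.8), b=0.1)
print(est)   # close to 1 for this well-separated pair of design points
```

The enumeration over all $k$-tuples is $O(n^k)$; for larger samples one would restrict attention to the points falling in each window before forming tuples.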

Statistics of the type (13.1.4) may be called conditional (or local) $U$-statistics, because in spirit they are similar to Hoeffding's (1948) $U$-statistic, extending the sample mean.
The analog of (13.1.1) for conditional $U$-statistics has been obtained in Stute (1994). In the present paper we consider $L^p$-convergence at a point $x_0$, say, of a window estimator, i.e., of the form (13.1.3) with $K = 1_{[-1,1]^d}$. Extensions to more general kernels will be dealt with in Remark 13.1.2 below. The examples in Stute (1991) could also be mentioned here, but we prefer to discuss another example, which may be of independent interest. See Sections 13.2 and 13.3 for details.
In what follows $\|\cdot\|$ denotes the sup-norm on a Euclidean space. For the window estimator,
$$W^{\beta}(x) = \frac{1_{\{\|X_{\beta} - x\| \le b_n\}}}{\sum_{\beta} 1_{\{\|X_{\beta} - x\| \le b_n\}}},$$
where $X_{\beta} = (X_{\beta_1}, \ldots, X_{\beta_k})$. Let $\mu$ denote the distribution of $X$.


Theorem 13.1.1. Assume that $b_n \to 0$ and $n b_n^d \to \infty$. Then for $\mu \otimes \cdots \otimes \mu$-almost all $x_0$,
$$\mathbb{E}\big[|m_n(x_0) - m(x_0)|^p\big] \to 0 \qquad \text{as} \qquad n \to \infty.$$

Remark 13.1.2. The assertion of Theorem 13.1.1 may be easily extended to kernels $K$ on $\mathbb{R}^d$ satisfying
$$c_1 1_{\{\|x\| \le r_1\}} \le K(x) \le c_2 1_{\{\|x\| \le r_2\}},$$
for some $0 < c_1, c_2 < \infty$, $0 < r_1 < r_2 < \infty$, provided that
$$\limsup_{b \downarrow 0} \frac{\mu(x : \|x - x_i\| \le b r_2)}{\mu(x : \|x - x_i\| \le b r_1)} < \infty \qquad (13.1.5)$$
for $1 \le i \le k$. In particular, (13.1.5) is satisfied if $x_i$ is a $\mu$-atom or the restriction of $\mu$ to some neighborhood of $x_i$ is dominated by Lebesgue measure.

13.2 Discrimination

In discrimination an unobservable random variable $Y$ taking values in a finite set $\{1, 2, \ldots, M\}$, say, has to be estimated from some (hopefully) correlated $X$. The optimal predictor $g_0(X)$ minimizing the probability of error $\mathbb{P}(g(X) \ne Y)$ is given by
$$g_0(x) = \arg\max_{1 \le j \le M} p^j(x),$$
where
$$p^j(x) = \mathbb{P}(Y = j \mid X = x), \qquad 1 \le j \le M.$$
$g_0$ is called the Bayes rule and
$$L^* = \mathbb{P}(g_0(X) \ne Y) = 1 - \mathbb{E}\Big[\max_{1 \le j \le M} p^j(X)\Big]$$
is the associated Bayes risk. Since in practice the a posteriori probabilities $p^j$ are seldom known, they need to be estimated from a training sample $(X_i, Y_i)$, $1 \le i \le n$. Replacing $Y_i$ by $1_{\{Y_i = j\}}$, the local average estimators become
$$m_n^j(x) = \sum_{i=1}^{n} W_{in}(x)\, 1_{\{Y_i = j\}}.$$


Put
$$g_n(x) = \arg\max_{1 \le j \le M} m_n^j(x).$$
(13.1.1) then is useful to show Bayes-risk consistency of $g_n$:
$$\mathbb{P}(g_n(X) \ne Y) \to L^*.$$
See Stone (1977) or Chapter 12. On the other hand, (13.1.2) turns out to be useful to handle the probability of error given the training sample:
$$\mathbb{P}(g_n(X) \ne Y \mid X_1, Y_1, \ldots, X_n, Y_n).$$
See Devroye (1981).
Now, in applications, it often happens that more than one $Y$ needs to be classified. Suppose that $X_1^0, \ldots, X_k^0$ are independent copies of $X$. The corresponding $Y_1^0, \ldots, Y_k^0$ are unobservable. They may be discrete or continuous random vectors in any Euclidean space. Rather than predicting their values item by item, one may, e.g., be interested to compare their (expected) values on the basis of the input vectors $X_1^0, \ldots, X_k^0$.
To be precise, let $h$ be a function of degree $k$ taking values in $\{1, 2, \ldots, M\}$. The sets
$$A_j = \{(y_1, \ldots, y_k) : h(y_1, \ldots, y_k) = j\}, \qquad 1 \le j \le M,$$
then form a partition of the feature space. $h(Y_1^0, \ldots, Y_k^0) = j$ then is tantamount to $(Y_1^0, \ldots, Y_k^0) \in A_j$, which is to be considered as one of $M$ concurrent situations of interest. Take, for example, $k = 2$ and let the $Y$'s be real-valued. We may then wonder whether $Y_1^0 \le Y_2^0$ or not. Putting
$$h(y_1, y_2) = \begin{cases} 1 & \text{if } y_1 \le y_2, \\ 0 & \text{if } y_1 > y_2, \end{cases}$$
we arrive at a discrimination problem for $h(Y_1^0, Y_2^0)$. Given $(X_1^0, X_2^0)$ and a training sample $(X_i, Y_i)$, $1 \le i \le n$, a decision has to be made as to whether $h = 1$ or $h = 0$. This example will be considered in the simulation study of Section 13.3. Others are mentioned in Stute (1994).
It is easily seen that now the Bayes rule equals
$$g_0(x) = \arg\max_{1 \le j \le M} m^j(x),$$
where
$$m^j(x) = \mathbb{P}\big(h(Y_1^0, \ldots, Y_k^0) = j \mid X_1^0 = x_1, \ldots, X_k^0 = x_k\big).$$
The Bayes risk becomes
$$L^* = 1 - \mathbb{E}\Big[\max_{1 \le j \le M} m^j(\mathbf{X})\Big],$$
with $\mathbf{X} = (X_1^0, \ldots, X_k^0)$. The empirical Bayes rule is given by
$$g_n^0(x) = \arg\max_{1 \le j \le M} m_n^j(x),$$
with
$$m_n^j(x) = \sum_{\beta} W^{\beta}(x)\, 1_{\{h(Y_{\beta}) = j\}}.$$
One can show that under mild assumptions on the weights one obtains Bayes-risk consistency:
$$\mathbb{P}\big(g_n^0(\mathbf{X}) \ne h(\mathbf{Y})\big) \to L^*,$$
with $\mathbf{Y} = (Y_1^0, \ldots, Y_k^0)$. In this section we discuss the limit behavior of the probability of error given a particular training sample at hand:
$$L_n := \mathbb{P}\big(g_n^0(\mathbf{X}) \ne h(\mathbf{Y}) \mid X_1, Y_1, \ldots, X_n, Y_n\big).$$

Theorem 13.2.1. Assume $b_n \to 0$ and $n b_n^d \to \infty$. Then, as $n \to \infty$,
$$L_n \to L^* \qquad \text{in the mean.}$$

Similar to classical discrimination,
$$1 - L_n = \sum_{j=1}^{M} \int_{\{g_n^0 = j\}} m^j(x)\, \mu(dx_1) \ldots \mu(dx_k).$$
Since the $m^j$ are unknown, we are tempted to estimate $L_n$ by the apparent error rate $\hat L_n$ given as
$$1 - \hat L_n = \sum_{j=1}^{M} \int_{\{g_n^0 = j\}} m_n^j(x)\, \mu(dx_1) \ldots \mu(dx_k) = \int \max_{1 \le j \le M} m_n^j(x)\, \mu(dx_1) \ldots \mu(dx_k).$$


Corollary 13.2.2. Under the assumptions of Theorem 13.2.1,
$$\hat L_n \to L^* \qquad \text{in the mean.}$$

In the most general situation, $\mu$ also will be unknown. We may then replace $\mu$ in the definition of $\hat L_n$ by the empirical measure $\mu_n$ of $X_1, \ldots, X_n$. So we come up with the following estimate of $1 - L_n$:
$$1 - \hat L_n = n^{-k} \sum_{1 \le i_1, \ldots, i_k \le n} \max_{1 \le j \le M} m_n^j(X_{i_1}, \ldots, X_{i_k}).$$
Two modifications are suggested in order to reduce the bias of $\hat L_n$:
1. Summation should only be extended over all $\beta = (i_1, \ldots, i_k)$ with pairwise distinct $i_r$'s.
2. Given such a multi-index $\beta$, $m_n^j$ should be replaced by $m_n^{j,\beta}$, computed from the sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ with $(X_{i_r}, Y_{i_r})$, $1 \le r \le k$, deleted.
Hence we obtain
$$1 - \tilde L_n := \frac{1}{n(n-1) \cdots (n-k+1)} \sum_{\beta} \max_{1 \le j \le M} m_n^{j,\beta}(X_{\beta}),$$
the jackknife-corrected estimate of $1 - L_n$.
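A brute-force sketch of the jackknife-corrected estimate for $k = 2$ may look as follows (illustrative only: the model and $h$ are assumptions, and tuples with empty windows are skipped, a convention the text leaves open):

```python
import numpy as np
from itertools import permutations

def jackknife_error_estimate(X, Y, h, b, M=2):
    """Sketch of 1 - L~_n for k = 2 with the window kernel: for every pair
    beta = (i1, i2) of distinct indices, evaluate max_j m_n^{j,beta}(X_beta),
    where m_n^{j,beta} uses window-weighted class frequencies computed WITHOUT
    observations i1, i2; then average over all pairs."""
    n = len(X)
    total, count = 0.0, 0
    for i1, i2 in permutations(range(n), 2):
        keep = [i for i in range(n) if i != i1 and i != i2]
        freq = np.zeros(M)
        for g1, g2 in permutations(keep, 2):
            if abs(X[i1] - X[g1]) <= b and abs(X[i2] - X[g2]) <= b:
                freq[int(h(Y[g1], Y[g2]))] += 1
        if freq.sum() > 0:
            total += freq.max() / freq.sum()
            count += 1
    return total / count if count else float("nan")

# demo on a small sample (model and h illustrative, as in the k = 2 example)
rng = np.random.default_rng(4)
X = rng.uniform(0, 1, 15)
Y = X + rng.normal(size=15)
h = lambda y1, y2: 1 if y1 <= y2 else 0
one_minus_L = jackknife_error_estimate(X, Y, h, b=0.6)
print(one_minus_L)   # an estimate of 1 - L_n
```

The double enumeration costs $O(n^4)$ for $k = 2$, which is why the simulation study below works with small $n$.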


Example 13.2.3. Suppose that the $Y$'s are real-valued and have a continuous distribution. This assumption is not very crucial; it is only to guarantee that $Y_1^0, \ldots, Y_k^0$ have no ties. Any randomization will do the job. Put
$$h(y_1, \ldots, y_k) = j \quad \text{iff } y_j \text{ is the maximum among } y_1, \ldots, y_k.$$
Then $h$ is well defined on the range of $(Y_1^0, \ldots, Y_k^0)$. Of course, $M = k$, and the problem becomes one of determining the index of that $Y$ among $Y_1^0, \ldots, Y_k^0$ which is the largest. Assume for simplicity that $X_1^0, \ldots, X_k^0$ are pairwise distinct, and let $\delta := \min_{i \ne r} \|X_i^0 - X_r^0\| > 0$. For $b_n < \delta/2$ and $1 \le j \le k$, compute the number of $\beta$'s such that $Y_{\beta_j}$ is the maximum among $(Y_{\beta_1}, \ldots, Y_{\beta_k})$ subject to $\|X_{\beta_i} - X_i^0\| \le b_n$ for $1 \le i \le k$. Then $g_n^0(\mathbf{X})$ is just a majority vote among these numbers.

13.3 Simulation Study

In our simulation study we consider data generated from the linear regression model $Y = \beta X + \varepsilon$, in which $X$ is uniformly distributed on the unit interval and $\varepsilon$ is standard normal, both independent of each other. For various $\beta$'s, training samples $(X_i, Y_i)$, $1 \le i \le n$, were drawn. Given two input variables $X_1^0$ and $X_2^0$, it is required to make a decision as to whether $Y_1^0$ exceeds $Y_2^0$ or not; see Example 13.2.3 with $k = 2$. The rule $g_n^0$ was applied for various bandwidths. Finally, the jackknife-corrected $\tilde L_n$ was computed for each training sample. Table 13.3.2 below presents the estimated mean and variance based on 1000 replicates. For the sake of comparison the true values of $L^*$ are listed in Table 13.3.1. It becomes apparent that $L^*$ attains its maximum for $\beta = 0$ and decreases as $|\beta|$ increases. This should also be intuitively clear, since for small $|\beta|$ the input variables contain less information on the $Y$'s than for large $|\beta|$'s. In the extreme, when $\beta = 0$, $Y$ does not depend on $X$ at all, so that $X_1^0$ and $X_2^0$ become uninformative as to our multi-input classification problem. By symmetry, $L^* = 0.5$.
Next we present the summary statistics for sample size $n = 10$ (values for $n = 20$ in parentheses) for some selected values of $b_n$ and $\beta$.
To measure the performance of the rule $g_n^0$, it is also instructive to compare $L^*$ and $\tilde L_n$ with the true error rate $L_n$. This is done for $n = 10$ and $\beta = 1$ (see Table 13.3.3).
The true error rate is always bigger than $L^*$ (by definition of $L^*$). It is pretty well approximated by $\tilde L_n$ for $0.6 \le b_n \le 0.8$. Also, $L_n$ seems less vulnerable to an improper choice of $b_n$.
In Table 13.3.4 we list, for $n = 10$ and various $\beta$'s, the bandwidths leading to a $\tilde L_n$ which in the mean is closest to $L^*$.
We see that $b_n$ decreases as $\beta$ increases. Hence for large $\beta$'s efficient discrimination between the $Y$'s is based on smaller neighborhoods of the input variables. After a moment of thought this observation becomes quite natural, since for these values of $\beta$, $Y$ is strongly correlated with $X$, so that small neighborhoods contain enough information on the $Y$'s. On the other hand, small neighborhoods have the advantage to properly divide the training sample into disjoint parts such that each part contains information which is only relevant for the corresponding pair $(X^0, Y^0)$.

Needless to mention, the optimal bandwidth decreases as sample size increases.
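As a check on Table 13.3.1: in this model $Y_1^0 - Y_2^0$ given the inputs is normal with variance 2, so $m^1(x_1, x_2) = \Phi(\beta(x_2 - x_1)/\sqrt{2})$, and $L^*$ reduces to a one-dimensional integral. The following sketch (not from the book) reproduces the tabulated values:

```python
import numpy as np
from math import erf, sqrt

Phi = lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0)))   # standard normal d.f.

def L_star(beta, grid=20000):
    """True Bayes risk for Y = beta*X + eps, X ~ U(0,1), eps ~ N(0,1), k = 2,
    h = 1{y1 <= y2}: since m^1(x1, x2) = Phi(beta*(x2 - x1)/sqrt(2)),
    L* = 1 - E max(m^1, 1 - m^1) = 1 - ∫_0^1 2(1 - t) Phi(beta*t/sqrt(2)) dt,
    using that |X2 - X1| has density 2(1 - t) on [0, 1]; midpoint rule below."""
    t = (np.arange(grid) + 0.5) / grid
    vals = 2.0 * (1.0 - t) * np.array([Phi(beta * ti / sqrt(2.0)) for ti in t])
    return 1.0 - vals.mean()

for beta in (0.0, 0.5, 1.0, 2.0):
    print(beta, round(L_star(beta), 3))   # → 0.5, 0.453, 0.408, 0.328 (cf. Table 13.3.1)
```

The computed values agree with the corresponding rows of Table 13.3.1 to three decimals.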

β      L*
0.0    0.500
0.1    0.491
0.2    0.481
0.3    0.472
0.4    0.463
0.5    0.453
0.6    0.444
0.7    0.435
0.8    0.426
0.9    0.417
1.0    0.408
1.3    0.383
1.5    0.366
1.8    0.343
2.0    0.328

Table 13.3.1: True Bayes risk $L^*$ for selected values of β.

β     b_n   Est. mean of $\tilde L_n$   Est. variance of $\tilde L_n$
0.1   0.8   0.48 (0.48)                 0.0004 (0.00009)
      0.7   0.46 (0.47)                 0.0006 (0.0002)
      0.6   0.44 (0.46)                 0.0010 (0.0003)
      0.5   0.42 (0.45)                 0.0020 (0.0005)
0.5   0.8   0.47 (0.48)                 0.0005 (0.0001)
      0.7   0.45 (0.47)                 0.0010 (0.0003)
      0.6   0.43 (0.46)                 0.0020 (0.0006)
      0.5   0.41 (0.44)                 0.0020 (0.0009)
1.0   0.8   0.46 (0.47)                 0.0006 (0.0003)
      0.7   0.44 (0.46)                 0.0010 (0.0006)
      0.6   0.42 (0.44)                 0.0020 (0.001)
      0.5   0.40 (0.43)                 0.0030 (0.001)
1.5   0.8   0.46                        0.0008
      0.7   0.44                        0.0017
      0.6   0.42                        0.0025
      0.5   0.40                        0.0030
      0.4   0.37                        0.0040
      0.3   0.34                        0.0040
2.0   0.8   0.46                        0.009
      0.7   0.43                        0.0018
      0.6   0.40                        0.0026
      0.5   0.37                        0.0036
      0.4   0.34                        0.0040
      0.3   0.31                        0.0052

Table 13.3.2: Estimated mean and variance of $\tilde L_n$ for n = 10 (n = 20 in parentheses), based on 1000 replicates.

b_n   L*      Est. mean of $\tilde L_n$   Est. mean of $L_n$
0.8   0.408   0.46                        0.46
0.7   0.408   0.44                        0.45
0.6   0.408   0.42                        0.45
0.5   0.408   0.40                        0.45

Table 13.3.3: Comparison of $L^*$, $\tilde L_n$, and the true error rate $L_n$ for n = 10 and β = 1.


β     b_n
0.1   0.8
0.5   0.7
1.0   0.5
1.5   0.4
2.0   0.3

Table 13.3.4: Bandwidths for which the mean of $\tilde L_n$ is closest to $L^*$ (n = 10).
