

Rodrigo Labouriau

December, 1996

Agriculture and Fisheries and Department of Theoretical Statistics, Aarhus University.


Acknowledgement

I have learned to do research from many people: my supervisor Professor Ole E. Barndorff-Nielsen, my previous supervisor at IMPA and co-worker Professor Bent Jørgensen, Professor Richard Gill, Professor Amari, and, from my childhood, Professor Luiz Fernando Gouvêa Labouriau and Professor Maria Lea Salgado Labouriau, among others. I would like to thank all of them.

My research on estimating functions, one of the main topics of this thesis, was started under the supervision of Professor Bent Jørgensen at IMPA. When it was not possible to continue my formation at IMPA, Professor Barndorff-Nielsen offered me the opportunity to stay under his umbrella at the University of Aarhus. There my work gained special momentum, especially when working with Professor Barndorff-Nielsen and Professor Amari.

This work was conducted initially with financial support from the Conselho Nacional de Pesquisa - CNPq (Brazil); later on I received a grant from the European programme Human Capital and Mobility (HCM) at the MRI, under the supervision of Professor Richard Gill (University of Utrecht). Partial financial support was also provided by Fundação Apolodoro Plausônio. I would like to thank the Research Centre Foulum, especially the Department of Biometry and Informatics, for providing facilities and for being flexible in allowing me time off for the preparation of this thesis. In particular, Aage Nielsen was always supportive and a good friend.

Contents

1 Introduction 7

1.1 Semiparametric models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.1.1 Classical optimality theory and estimating functions for semiparametric models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.1.2 Some classes of semiparametric models . . . . . . . . . . . . . . . . 10

1.2 Description of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1.3 Basic set-up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2 Path and Functional Differentiability 27

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.2 Differentiable paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

2.2.1 General definition of path differentiability . . . . . . . . . . . . . . 30

2.2.2 Hellinger and weak path differentiability . . . . . . . . . . . . . . . 31

2.2.3 Lq path differentiability . . . . . . . . . . . . . . . . . . . . . . . . 33

2.2.4 Essential path differentiability . . . . . . . . . . . . . . . . . . . . 35

2.2.5 Tangent spaces and tangent sets . . . . . . . . . . . . . . . . . . . . 36

2.3 Functional differentiability . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

2.3.1 Definition and first properties of functional differentiability . . . . . 39

2.3.2 Asymptotic bounds for functional estimation . . . . . . . . . . . . . 44

2.4 Asymptotic bounds for semiparametric models . . . . . . . . . . . . . . . . 49

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

3.2 Basic definitions and properties . . . . . . . . . . . . . . . . . . . . . . . . 57

3.2.1 Estimating and quasi-estimating functions . . . . . . . . . . . . . . 57

3.2.2 First characterization of regular estimating functions . . . . . . . . 58


3.2.3 Second characterization of regular estimating functions . . . . . . . 59

3.3 Optimality theory for estimating functions . . . . . . . . . . . . . . . . . . 63

3.3.1 Classic optimality theory . . . . . . . . . . . . . . . . . . . . . . . . 63

3.3.2 Lower bound for the asymptotic covariance of estimators obtained through estimating functions . . . . . . . . . . . . . . . . . . . . . . 64

3.3.3 Attainability of the semiparametric Cramér-Rao bound . . . . . . . 67

3.4 Further aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

3.4.1 Optimal estimating functions via conditioning . . . . . . . . . . . . 70

3.4.2 Generalized estimating functions . . . . . . . . . . . . . . . . . . . 73

els 75

4 Semiparametric Location and Scale Models 77

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

4.2 Semiparametric Location-Scale Models . . . . . . . . . . . . . . . . . . . . 79

4.3 Calculation of the nuisance tangent space . . . . . . . . . . . . . . . . . . . 82

4.4 Calculation of the efficient score function . . . . . . . . . . . . . . . . . . . 84

4.4.1 The case where the first two standardized cumulants are fixed . . . 85

4.4.2 The case where the first three standardized cumulants are fixed . . 87

4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

4.6 Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

4.6.1 The Laplace transform and polynomial approximation in L2 . . . . 92

4.6.2 Calculation of the L2 - nuisance tangent space at (0, 1, a) . . . . . . 98

4.6.3 Calculation of the tangent space at an arbitrary point . . . . . . . . 103

4.6.4 Calculation of the first four orthogonal polynomials in L2 (a) . . . . 108

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

5.2 Semiparametric Models with L2 Restrictions . . . . . . . . . . . . . . . . . 113

5.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

5.3.1 Location-scale models . . . . . . . . . . . . . . . . . . . . . . . . . 114

5.3.2 Multivariate location-shape models . . . . . . . . . . . . . . . . . . 117

5.3.3 Covariance selection models . . . . . . . . . . . . . . . . . . . . . . 119

5.3.4 Linear structural relationships models (LISREL) . . . . . . . . . . . 122

5.3.5 Growth curve models . . . . . . . . . . . . . . . . . . . . . . . . . . 125

5.4 Efficient Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128


5.4.2 Efficient score functions . . . . . . . . . . . . . . . . . . . . . . . . 129

5.4.3 Characterization of the class of regular estimating functions . . . . 130

5.4.4 Robustness of estimators derived from regular estimating functions 134

5.4.5 Attainability of the semiparametric Cramér-Rao bound via estimating functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

5.4.6 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

5.5 Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

5.5.1 Technical proofs related with L2 - restricted models . . . . . . . . . 142

5.5.2 Proof of a theorem related with location and scale models . . . . . 145

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

6.2 Extended L2 - restricted models . . . . . . . . . . . . . . . . . . . . . . . . 149

6.3 Partial parametric models . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

6.4 Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160

6.4.1 Technical proofs related with extended L2 - restricted models . . . . 160

6.4.2 Technical proofs related with partial parametric models . . . . . . . 164


Chapter 1

Introduction

A common problem in scientific research is to draw conclusions about some phenomena on the basis of observations. Statistics provides tools to support the procedures for drawing such conclusions. In statistical models the observations are thought of as originating from a random mechanism governed by a certain probability law. The underlying probability law is usually unknown and assumed to belong to a certain given class of laws, termed the statistical model or simply the model. The problem of modeling, in this context, can be viewed as composed of a deductive process, where the class of supposed possible underlying laws is determined, and an inductive process, where one tries to gain information about the underlying law on the basis of the observations. In this thesis we treat some techniques related to the inductive process mentioned, i.e. given certain classes of laws, we develop techniques for inferring the specific law that generated the observations.

It occurs very often that the given class of underlying laws (i.e. the model) is indexed by a finite number of parameters. Such classes are called parametric models. In many situations the parameters reflect the state of nature of some phenomena under study. It is then of interest to obtain information about these characteristics or parameters. In these cases, one may take advantage of the well developed inferential theory of parametric statistical models. However, there are situations where, due to the nature of the problem or due to our lack of knowledge, one cannot deduce a reasonable parametric model.

A way to circumvent the lack of adequate parametric models is to use nonparametric models. Those models are, as the name says, composed of large classes of distributions that cannot be indexed by a finite number of parameters. Typical examples are models that only require the distributions to be symmetric or continuous. Nonparametric models are statistically useful tools; however, when using them one can typically draw inference only on some very basic aspects of the phenomena under study. In other words, the specificity of interpretation of parametric models is necessarily lost when using classic nonparametric models. That is the price one sometimes has to pay for dealing with extremely large families of distributions.

Recently some attention has been devoted to a kind of nonparametric model that can be placed in an intermediate position between the two extreme situations described above. They are called by the suggestive name of semiparametric models. Roughly speaking, semiparametric models are families of distributions indexed by a parameter that can be decomposed into two sub-parameters: the first sub-parameter is called the interest parameter and belongs to a finite dimensional space; the second sub-parameter is termed the nuisance parameter and lives in an infinite dimensional space. Semiparametric models are genuine nonparametric models in the sense that they cannot be indexed by any finite dimensional parameter. What distinguishes semiparametric models from classic nonparametric models is that in the former we can identify a finite dimensional interest parameter. This identification of the interest parameter is in principle guided by our interest. In this way, we incorporate in the model parameters that reflect characteristics of the phenomena under study via the interest parameter, and keep the flexibility of nonparametric models to adapt to the unknown peculiarities of the phenomena studied.

Another distinctive characteristic of the theory of semiparametric models is the methods of statistical inference that are used. Typically, when dealing with semiparametric models, one tries to take advantage of the (partially) parametric structure of the model. We discuss this point more precisely below.

1.1.1 Classical optimality theory and estimating functions for semiparametric models

There is a developed branch of the theory of parametric models that treats models with (finite dimensional) nuisance parameters. Due to the nature of semiparametric models, it is natural to hope that the techniques for dealing with parametric models with nuisance parameters could be somehow adapted to semiparametric models. Unfortunately, this is only partially true. However, there are some classic notions from the parametric theory, such as the efficient score function and the efficient Fisher information, that possess an adapted version in the modern theory of semiparametric models. The generalization of these structures to semiparametric models has generated a beautiful optimality theory that in a certain sense is analogous to some classic theories for parametric models. We review this theory in the first part of this thesis. The theory we refer to here is called the optimality theory for regular asymptotic linear estimating sequences (see chapter 2 for details). It should be noted that, in spite of the elegance of the optimality theory mentioned above, applying it usually involves non-trivial work, even in some rather simple semiparametric models. In the second part of this thesis we carry out all the computations necessary to apply the optimality theory of regular asymptotic linear estimating sequences to some classes of semiparametric models.

In this thesis we also study the use, in a context of semiparametric models, of a relatively well developed technique for parametric models: the so-called estimating functions. In the approach of estimating functions we consider estimators which can be expressed as solutions of an equation such as

Ψ(x; θ) = 0 . (1.1)

Here Ψ is a function of the given data, say x, and the parameter, say θ, of a certain statistical model. We call Ψ an estimating function, also known as an inference function (the precise definition will be given later). Following the same procedure as in the classical theories, one introduces some constraints on the class of inference functions to be considered, a criterion for ordering the estimators obtained from the estimating functions in the restricted class is given, and the uniformly best estimator is chosen. It is clear that in most of the classical “well-behaved” cases, the maximum likelihood estimator is given by the solution of an estimating equation. Moreover, the criterion for ordering inference functions is closely related to the asymptotic variance of the associated estimators. In that way, the approach of estimating equations can be viewed as a generalization of the maximum likelihood theory. Due to the optimal behavior of the maximum likelihood in “regular” cases, it is not surprising that the optimal inference function gives exactly the maximum likelihood estimator. However, there are some situations in which the maximum likelihood theory fails and the estimating equation theory works well. Moreover, the theory of estimating equations provides alternative justifications for important statistical techniques for parametric models, such as conditional inference (see Jørgensen and Labouriau, 1995).
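To fix ideas, the following minimal sketch solves an equation of the form (1.1) numerically. The data-generating distribution and the choice Ψ(x; θ) = Σᵢ(xᵢ − θ) are purely illustrative assumptions, not taken from the thesis; in this simple case the root is just the sample mean.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.5, size=1000)  # illustrative simulated data

# An illustrative estimating function for a location parameter theta:
# Psi(x; theta) = sum_i (x_i - theta).  It is strictly decreasing in
# theta, so a bracketing root-finder applies; its root is the sample mean.
def psi(theta, x):
    return float(np.sum(x - theta))

theta_hat = brentq(psi, x.min(), x.max(), args=(x,))

assert abs(theta_hat - x.mean()) < 1e-8
```

The same pattern (define Ψ, find its root) carries over to less trivial estimating functions, where no closed-form root exists.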

Estimating functions have been used for inference in parametric models for a long time. The earliest mention of the idea of estimating equations is probably due to Fisher (1935) (he used the term “equation of estimation”). A remarkable example of an early non-trivial use of inference functions can be found in Kimball (1946), where estimating equations were used to give confidence regions for the parameters of the family of Gumbel distributions (or extreme value distributions). There, the idea of “stable” estimating equations, i.e. inference functions whose expectations are independent of the parameter, was introduced, anticipating the theory of sufficiency and ancillarity for inference functions proposed by McLeish and Small (1987) and Small and McLeish (1988). The theory of optimality of inference functions appears in the pioneering paper of Godambe (1960). In the same year Durbin (1960) introduced the notion of unbiased linear inference functions and proved some optimality theorems particularly suited to applications in time series analysis. Since that time, the theory of inference functions has been developed a great


deal, both by Godambe (cf. Godambe, 1976, 1980, 1984; Godambe and Thompson, 1974, 1976) and by others, in different contexts and with different names and approaches. We mention, for instance, the so-called theory of M-estimators, developed in the seventies in order to obtain robust estimators, and the quasi-likelihood methods used in generalized linear models. As one can see, the theory of inference functions was not only inspired by an alternative optimality theory for point estimation. One could say that there is now a firm and well established theory of inference functions for parametric models, with many branches, some of them based on very deep mathematical foundations.

Estimating functions have been applied with relative success to estimation in parametric models with nuisance parameters (see Jørgensen and Labouriau, 1995, and the references therein). It is then natural to ask whether this technique produces reasonable results in a context of semiparametric models. We will show in this thesis that the class of estimators derived from estimating functions is in fact rather limited for semiparametric models. However, estimating functions will prove useful as an auxiliary tool for obtaining efficient regular asymptotic linear estimating sequences.

1.1.2 Some classes of semiparametric models

We present next an example of a semiparametric model. Consider the following class of probability distributions on the real line:

$$\mathcal{P} = \left\{ P_{\theta z} : \frac{dP_{\theta z}}{d\mu}(\,\cdot\,) = z(\,\cdot\, - \theta),\ \theta \in \Theta = \mathbb{R},\ z \in \mathcal{Z} \right\}. \qquad (1.2)$$

Here $\mathcal{Z}$ is the class of probability densities z : IR −→ IR such that (1.3)-(1.6) given below hold:

$$\forall x \in \mathbb{R},\ z(x) > 0; \qquad (1.3)$$

$$\int_{\mathbb{R}} z(x)\,d\mu(x) = 1; \qquad (1.4)$$

$$\int_{\mathbb{R}} x\,z(x)\,d\mu(x) = 0 \quad\text{or equivalently}\quad \int_{\mathbb{R}} x\,z(x-\theta)\,d\mu(x) = \theta; \qquad (1.5)$$

and

$$\int_{\mathbb{R}} x^{2} z(x)\,d\mu(x) < \infty. \qquad (1.6)$$

Conditions (1.3) and (1.4) ensure that z is a density of a probability measure with support equal to the whole real line. By condition (1.5) the parametrization (θ, z) ↦ Pθz is identifiable, i.e. this map is one-to-one.
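As an illustration of the role of condition (1.5), the following hedged sketch checks by simulation that the sample mean estimates θ whatever the shape of z. The skewed normal mixture standing in for z is an arbitrary choice satisfying (1.3)-(1.6), not an example from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 2.0      # interest (location) parameter
n = 100_000

# One admissible nuisance density z: a skewed two-component normal
# mixture 0.7*N(-0.9, 1) + 0.3*N(2.1, 1).  Its mean is
# 0.7*(-0.9) + 0.3*2.1 = 0, its density is positive on all of IR and it
# has a finite second moment, so conditions (1.3)-(1.6) hold.
comp = rng.random(n) < 0.7
z_draw = np.where(comp, rng.normal(-0.9, 1.0, n), rng.normal(2.1, 1.0, n))

# Sampling from P_{theta z} just shifts a draw from z by theta.
x = theta + z_draw

# Since (1.5) pins the mean of every P_{theta z} at theta, the sample
# mean estimates theta without any knowledge of z.
theta_hat = x.mean()
assert abs(theta_hat - theta) < 0.05
```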


Clearly, the model described above is a semiparametric extension of the so-called location model. This model is an example contained in the main class of models studied in this thesis. The models in the class referred to are constructed by assuming that the expectations of a number of given square integrable functions are equal to some fixed functions of a finite dimensional interest parameter. In the example above this corresponds to condition (1.5). These models are termed “L2 - restricted models” and they incorporate semiparametric extensions of many important statistical models, such as multivariate location and shape models, growth curve models, linear relationship models (LISREL) and some graphical models.

Two other closely related classes of models are also studied. The first is obtained by considering L2 - restricted models that coincide entirely with a parametric model (parametrized exclusively by the interest parameter) in some given region of the sample space. The distributions are assumed to vary freely (subject only to the L2 - restricted model constraints) outside those regions. An example here is the semiparametric location model described above with the additional condition that the densities of the distributions in the model coincide with the density φ of a normal distribution with mean θ and variance $\sigma^{2} := \int_{\mathbb{R}} x^{2}\varphi(x)\,d\mu(x)$ in the interval [θ − 2σ, θ + 2σ]. These models are called “partial parametric models”. The best known examples are the trimmed models and the location model with free tails.

The second class of models is constructed by adding to an L2 - restricted model constraints obtained by assuming that a (non-linear) function of the expectation of some square integrable functions equals a function of the interest parameter. These models are referred to as “extended L2 - restricted models”. For example, one could additionally introduce the assumption that the coefficient of variation is constant in the semiparametric location model described above, i.e. insert the condition

$$\frac{\sqrt{\int_{\mathbb{R}} x^{2} z(x)\,d\mu(x)}}{\int_{\mathbb{R}} x\,z(x)\,d\mu(x)} = k.$$

Examples of extended L2 - restricted models are regression models with link functions and some types of covariance selection models.

1.2 Description of the thesis

The thesis is divided in two parts. The first part treats some topics of the estimation theory for semiparametric models in general. There the classic optimality theory is reviewed and presented in a form suitable for the developments given afterwards. Furthermore, the theory of estimating functions is developed in detail in a context of semiparametric models. To the knowledge of the author, no such systematic treatment of estimating functions for semiparametric models exists in the literature.

The second part studies some classes of semiparametric models described previously. The material contained in this part of the thesis constitutes an original contribution: there can be found the detailed characterization of the class of regular estimating functions, a calculation of efficient regular asymptotic linear estimating sequences (i.e. the classical optimality theory) and a discussion of the attainability of the bounds for the concentration of regular asymptotic linear estimating sequences by estimators derived from estimating functions.

There follows a more detailed description of the contents of each chapter of the thesis. Chapters 5 and 6 are intentionally described in more detail, since they constitute the main contribution of the thesis.

Chapter 1 This chapter contains some introductory material, an overview of the thesis

and some notational conventions.

Chapter 2 In this chapter the classic optimality theory for non- and semiparametric models is studied. A range of notions of path differentiability and tangent spaces is introduced and their inter-relations are studied. Next, some concepts of statistical functional differentiability are studied. Here differentiability is taken relative to a pointed cone contained in the tangent space, and not relative to the whole tangent space, as is customary in the literature. These cones are referred to as tangent cones. The optimality theory of differentiable functionals is reviewed next. Again, the results are stated relative to the tangent cone and not with respect to the whole tangent space, as is usual. The estimation of the interest parameter of semiparametric models is studied by applying the optimality theory to a specially designed functional, called the interest parameter functional, which associates to any probability measure in the model in play the value of the interest parameter associated with it. An increasing range of tangent cones is considered. Here, the larger the tangent cone used, the sharper the bound obtained for the concentration of regular estimators. However, too large tangent cones may imply that the interest parameter functional is differentiable only under somewhat stringent regularity conditions on the model. It is shown how the imposition of such conditions, usually made in the literature, can be avoided by adequate choices of tangent cones. The bound for the concentration of regular estimating sequences obtained with this choice of the tangent cone is referred to as the semiparametric Cramér-Rao bound.

Chapter 3 The chapter extends the theory of estimating functions, classically considered for parametric models, to a context of semiparametric models. A class of regular estimating functions (REFs) is defined and characterized in two alternative forms. The first characterization says essentially that the components of any REF are in the intersection (over the values of the nuisance parameter) of the orthogonal complements of the so-called (strong or L2) nuisance tangent spaces. This result is original, even though it can already be found in Jørgensen and Labouriau (1995, chapter 4). The second characterization, shown to be equivalent to the first, is a modification of the (informal) characterization recently given by Amari and Kawanabe (1996), based on differential geometric considerations. The first characterization is used to obtain an optimality theory for REFs by using a projection technique. It is proved that the semiparametric Cramér-Rao bound coincides with the bound for the concentration of estimators based on REFs if and only if the orthogonal complement of the nuisance tangent space does not depend on the nuisance parameter. This result is original and is used in the subsequent chapters to check, in a range of semiparametric models, whether the semiparametric Cramér-Rao bound is attained by estimators based on REFs. The chapter closes with a discussion of a generalization of the notion of REFs in which dependence on the nuisance parameter is allowed.

Most of the material presented in this chapter is the result of the author's work in recent years. The result concerning the coincidence of the semiparametric Cramér-Rao bound and the bound for the concentration of estimators based on REFs is original.

Chapter 4 The chapter studies a one dimensional semiparametric extension of the location and scale model. The goal there is to gain intuition, and restrictions are introduced largely in order to make the basic theory work in a simple way. The treatment of the location and scale model is refined later, in chapter 5. The basic computations necessary for obtaining efficient regular asymptotic linear estimating sequences (RALES) are given in detail for location and scale models where a number of standardized cumulants are fixed. The main point there is the calculation of a version of the efficient score function. The efficient score function does not depend on the nuisance parameter in the case where only the first two standardized cumulants are fixed, and its roots are the sample mean and the sample standard deviation. In that case, the efficient score function is a regular estimating function (REF) and its roots provide efficient RALES; in other words, they attain the semiparametric Cramér-Rao bound. In the case where more than two standardized cumulants are fixed, the semiparametric Cramér-Rao bound is not always attained by estimators based on REFs. Moreover, the efficient score function does depend on the nuisance parameter, but only through an intermediate finite dimensional parameter, suggesting the use of some “plug-in” estimating procedure for obtaining efficient estimation. The location and scale models considered in this chapter do not incorporate the assumption of symmetry of the distributions, as occurs in the previous treatments of similar models found in the literature. However, strong conditions on the behavior of the tails of the distributions and on the Laplace transform of some given functions of the density are imposed. This allows the calculations to be performed with the help of polynomial expansions, a technique used in the early stages of the work. These technical restrictions are eliminated in chapter 5, in a more general context. An auxiliary technical condition for having the class of polynomials dense in L2 is given in the appendices of the chapter.

The chapter reports essentially joint work with Professor Shun-Ichi Amari (University of Tokyo) and Professor Ole E. Barndorff-Nielsen (University of Aarhus), developed in a preliminary stage of the thesis work.
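The two-cumulant case can be illustrated by a small numerical sketch. The estimating function below, with components fixing the first two moments, is an illustrative stand-in (not the thesis's efficient score itself), and the data distribution is an arbitrary choice; its root reproduces the sample mean and the (1/n-normalized) sample standard deviation.

```python
import numpy as np
from scipy.optimize import fsolve

rng = np.random.default_rng(2)
x = rng.normal(loc=1.0, scale=2.0, size=5000)  # illustrative data

# Estimating function for the pair (theta, sigma), fixing the first
# two moments (an illustrative stand-in for the efficient score).
def psi(par, x):
    theta, sigma = par
    return [np.mean(x - theta),
            np.mean((x - theta) ** 2 - sigma ** 2)]

theta_hat, sigma_hat = fsolve(psi, x0=[0.0, 1.0], args=(x,))

# The roots are the sample mean and the 1/n sample standard deviation.
assert abs(theta_hat - x.mean()) < 1e-5
assert abs(sigma_hat - x.std()) < 1e-5
```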

Chapter 5 This chapter studies the main classes of models considered in the thesis, the L2 - restricted semiparametric models, which we define next. The results presented in the chapter are original and are presented in relatively more detail than those of the previous chapters.

Consider a measurable space (X , B) on which a σ-finite measure λ and a family of probability measures P are defined. The sample space X is assumed to be a locally compact Hausdorff space, typically a Euclidean space.

The family P is parametrized in the following way:

P = {Pθz : θ ∈ Θ , z ∈ Z} . (1.7)

Here Θ is an open subset of IRq and Z is a class of arbitrary nature, typically infinite dimensional. We consider z as a nuisance parameter and θ as a parameter of interest which we want to estimate. It is assumed that the parametrization of P is identifiable, i.e. the mapping (θ, z) ↦ Pθz is a bijection between Θ × Z and P. Each element of the family P is dominated by λ and has support equal to the whole sample space X . We can then represent P alternatively by

$$\mathcal{P}^{*} = \left\{ p(\,\cdot\,;\theta,z) = \frac{dP_{\theta z}}{d\lambda}(\,\cdot\,) : \theta \in \Theta,\ z \in \mathcal{Z} \right\}. \qquad (1.8)$$

It is assumed, without loss of generality, that the versions of the Radon-Nikodym derivatives used in (1.8) to define p( · ; θ, z) : X −→ IR+ are strictly positive, i.e. for all x ∈ X , θ ∈ Θ and z ∈ Z, p(x; θ, z) > 0.


It is convenient to introduce the following submodels. For each θ0 ∈ Θ define the submodel

Pθ0 = {Pθ0 z : z ∈ Z} . (1.9)

We also introduce the notation P∗θ0 to denote the class of densities of the probability measures of Pθ0 .

Now we can introduce the kind of models studied in the chapter. Suppose that there are k, m ∈ N ∪ {0}, l ∈ N ∪ {0, ∞}, functions f1 , . . . , fk : X −→ IR and functions g1 , . . . , gm : X × Z −→ IR such that for each θ0 ∈ Θ the submodel P∗θ0 is the class of functions p : X −→ IR+ for which conditions (1.10)-(1.15) hold. The conditions referred to are, for j = 1, . . . , k and i = 1, . . . , m:

$$\forall x \in \mathcal{X},\ p(x) > 0; \qquad (1.10)$$

$$p \ \text{is of class}\ C^{l}; \qquad (1.11)$$

$$\int_{\mathcal{X}} p(x)\,\lambda(dx) = 1; \qquad (1.12)$$

$$\int_{\mathcal{X}} f_{j}^{2}(x)\, p(x)\,\lambda(dx) < \infty; \qquad (1.13)$$

$$\int_{\mathcal{X}} f_{j}(x)\, p(x)\,\lambda(dx) = M_{j}(\theta_0); \qquad (1.14)$$

$$\text{for each}\ z \in \mathcal{Z},\quad \int_{\mathcal{X}} g_{i}(x, z)\, p(x)\,\lambda(dx) \in B_{i}(\theta_0), \qquad (1.15)$$

where Mj (θ0 ) ∈ IR and Bi (θ0 ) is a given open subset of IR. We refer to the models of the form described above as L2 - restricted semiparametric models or simply L2 - restricted models.

Conditions (1.10) and (1.12) ensure that p is a probability density of a distribution with support equal to the whole sample space X . Conditions (1.13)-(1.15) are used to restrict each submodel Pθ0 (and consequently shrink the model P). These conditions can be used to express partial a priori knowledge about the phenomena we study, or to ensure some desirable mathematical characteristics of the model, such as identifiability and regularity of the partial score functions. For instance, conditions (1.15) can be used to ensure that the partial score functions are in L2. Condition (1.11) can be assumed to hold apart from a λ-null set.
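A minimal concrete instance of conditions (1.10)-(1.15) may help fix ideas. The choices below (X = IR, λ the Lebesgue measure, k = 2, m = 0, f1(x) = x, f2(x) = x², M1(θ) = θ, M2(θ) = θ² + 1, i.e. mean θ and variance fixed at 1) are illustrative assumptions, not from the thesis; the sketch verifies numerically that a normal density satisfies (1.10), (1.12) and (1.14).

```python
import numpy as np
from scipy.integrate import quad

theta0 = 0.5

# One member of the submodel P_{theta0}: the N(theta0, 1) density, which
# is strictly positive (1.10) and smooth (1.11).
def p(x):
    return np.exp(-0.5 * (x - theta0) ** 2) / np.sqrt(2 * np.pi)

total, _ = quad(p, -np.inf, np.inf)                    # (1.12)
m1, _ = quad(lambda x: x * p(x), -np.inf, np.inf)      # (1.14), j = 1
m2, _ = quad(lambda x: x**2 * p(x), -np.inf, np.inf)   # (1.14), j = 2

assert abs(total - 1.0) < 1e-6
assert abs(m1 - theta0) < 1e-6                 # M1(theta0) = theta0
assert abs(m2 - (theta0**2 + 1.0)) < 1e-6      # M2(theta0) = theta0**2 + 1
```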

The following examples of L2 - restricted models are considered in detail in the thesis: location and scale models (without the restrictive conditions on the tails and on the existence of the Laplace transform used in chapter 4), multivariate location and shape models, covariance selection models defined on the location and shape model, linear structural relationship models (LISREL) and some growth curve models with modeled covariance structure. The covariance selection models are distinguished among the other examples because the restricting functions are not polynomials.

It is shown that all the notions of nuisance tangent spaces considered in the first part coincide and are equal to the orthogonal complement of the L2 - closure of the space spanned by the functions f1 − E(f1 ), . . . , fk − E(fk ). More precisely, define $H_k(\theta, z)$ as the closure in $L_0^2(P_{\theta z})$ of the linear span of $f_1 - E_{\theta z}(f_1), \ldots, f_k - E_{\theta z}(f_k)$. The orthogonal complement of $H_k(\theta, z)$ in $L_0^2(P_{\theta z})$ is denoted by $H_k^{\perp}(\theta, z)$ and is equal to the nuisance tangent space. Note that Hk depends only on θ, which in the light of the theory developed in chapter 3 implies that the semiparametric Cramér-Rao bound coincides with the bound for REFs.

The next step is to calculate the efficient score function by projecting the partial score function onto the orthogonal complement of the nuisance tangent space. To do so, consider the result of a Gram-Schmidt orthonormalization process, in the space L2(Pθz), applied to the functions 1, f1, . . . , fk and denoted by 1, ξ1^θz, . . . , ξk^θz. Since ξ1^θz, . . . , ξk^θz form an orthonormal basis of Hk = TN⊥(θ, z), the projection of l/θi(· ; θ, z) onto TN⊥(θ, z) is, for i = 1, . . . , q,

lE/θi(· ; θ, z) = Σ_{j=1}^{k} ⟨ l/θi(· ; θ, z), ξj^θz(·) ⟩_{L2(Pθz)} ξj^θz(·) ,   (1.16)

where ⟨ · , · ⟩_{L2(Pθz)} is the inner product of L2(Pθz). The representation above can be written in matricial form in the following way: for each (θ, z) ∈ Θ × Z,

lE(· ; θ, z) = A(θ, z) ξ^θz(·) ,   (1.17)

where ξ^θz(·) = (ξ1^θz(·), . . . , ξk^θz(·))^T and the transpose A^T(θ, z) = [ ⟨ l/θj(· ; θ, z), ξi^θz(·) ⟩_{L2(Pθz)} ]_{i=1,...,k; j=1,...,q} is a k × q matrix.

The covariance matrix of the efficient score function is given by

Covθz{ lE(· ; θ, z) } = A(θ, z) A^T(θ, z) .   (1.18)

The inverse of the matrix A(θ, z)A^T(θ, z) provides a lower bound for the asymptotic variance of regular asymptotic linear estimating sequences, i.e. the semiparametric Cramér-Rao bound. Here we use the partial order of matrices (i.e., A ≥ B iff A − B is positive semidefinite).
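The projection (1.16) can be made concrete with a small numerical sketch (plain Python; the three-point sample space, the weights and the functions below are hypothetical illustrations, not taken from the thesis): orthonormalize 1, f1 in L2(p) by Gram-Schmidt and expand a centered function in the resulting basis.

```python
# Sketch of the projection behind (1.16) on a hypothetical three-point
# sample space {0, 1, 2} with probabilities p; <f, g> = sum_x f(x)g(x)p(x).
p = [0.2, 0.5, 0.3]

def inner(f, g):
    return sum(fi * gi * pi for fi, gi, pi in zip(f, g, p))

def gram_schmidt(funcs):
    # Classical Gram-Schmidt: orthonormalize functions (as value vectors) in L2(p).
    basis = []
    for f in funcs:
        v = list(f)
        for e in basis:
            c = inner(f, e)
            v = [vi - c * ei for vi, ei in zip(v, e)]
        nrm = inner(v, v) ** 0.5
        basis.append([vi / nrm for vi in v])
    return basis

one = [1.0, 1.0, 1.0]
f1 = [0.0, 1.0, 2.0]                    # hypothetical restriction function
_, xi1 = gram_schmidt([one, f1])        # xi1 is centered and has norm 1

raw = [0.0, 1.0, 4.0]                   # hypothetical (uncentered) partial score
score = [ri - inner(raw, one) for ri in raw]   # center it under p

c = inner(score, xi1)                   # coefficient as in (1.16)
proj = [c * xi for xi in xi1]           # projection onto span{xi1}
resid = [si - pi for si, pi in zip(score, proj)]
print(round(inner(resid, xi1), 12))     # residual is orthogonal to xi1
```

The residual score − proj is, up to rounding, orthogonal to ξ1, which is exactly what makes (1.16) an orthogonal projection.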

1.2. DESCRIPTION OF THE THESIS 17

The class of regular estimating functions is studied next. Using the first characteri-

zation of estimating functions given in chapter 3 it is shown that any regular estimating

function can be represented in the following way,

Ψ( · ; θ) = α(θ){f ( · ) − M (θ)} , (1.19)

where α(θ) is a q×k matrix, f ( · ) = (f1 ( · ), . . . , fk ( · ))T and M (θ) = (M1 (θ), . . . , Mk (θ))T .

It can be shown from the properties of the regular estimating functions that the only possible root of any REF, under a repeated sampling scheme with sample x = (x1, . . . , xn)^T, is the solution of the system

0 = IPn f (x) − M(θ̂) ,

where IPn f (x) = n^{-1} Σ_{i=1}^{n} f(xi) denotes the empirical mean of f over the sample. In other words, there is one and only one moment estimator associated with any REF. The basic properties of this moment estimator, such as consistency and asymptotic normality, are studied. The Hampel influence function of the moment estimator referred to is also derived.
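To make the system 0 = IPn f − M(θ̂) concrete, here is a minimal numerical sketch (an illustration, not the thesis' own computation): the choice f(x) = (x, x²) with M(μ, σ) = (μ, μ² + σ²) is a hypothetical location-scale example for which the moment equations solve in closed form.

```python
import random

# Moment estimator sketch: hypothetical choice f(x) = (x, x^2) with
# M(mu, sigma) = (mu, mu^2 + sigma^2), so 0 = P_n f - M(theta_hat)
# can be solved in closed form.
def moment_estimator(sample):
    n = len(sample)
    m1 = sum(sample) / n                 # empirical mean of f1(x) = x
    m2 = sum(x * x for x in sample) / n  # empirical mean of f2(x) = x^2
    mu_hat = m1                          # solves m1 = mu
    sigma_hat = (m2 - m1 * m1) ** 0.5    # solves m2 = mu^2 + sigma^2
    return mu_hat, sigma_hat

random.seed(1)
data = [random.gauss(2.0, 3.0) for _ in range(100_000)]
mu_hat, sigma_hat = moment_estimator(data)
print(mu_hat, sigma_hat)   # consistent: close to (2, 3) for large n
```

The sketch only illustrates the moment-equation mechanics and consistency; for a bona fide REF the same estimator is obtained regardless of the matrix α(θ).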

It turns out that the Hampel influence function of an estimator derived from a regular estimating function is bounded if and only if the function f is bounded. That is, θ̂n is resistant to gross errors, i.e. B-robust, if and only if the function f is bounded. On the other hand, the Hampel influence function is continuous (in x) if and only if f is continuous, and this is the condition for having bounded sensitivity to lateral shifts, i.e. V-robustness.
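The boundedness criterion can be seen in a toy sensitivity-curve computation, a standard finite-sample proxy for the Hampel influence function (the sample and the estimator below are hypothetical illustrations): for the sample mean, the moment estimator with f(x) = x, the empirical influence of a contaminating point x equals x minus the sample mean, which is unbounded in x, matching the criterion since f is unbounded.

```python
# Empirical sensitivity curve (n+1)*(T(x_1..x_n, x) - T(x_1..x_n)) for the
# sample mean, the moment estimator with f(x) = x.  It equals x - mean(sample),
# unbounded in x, in line with "bounded influence iff f is bounded".
def mean(xs):
    return sum(xs) / len(xs)

def sensitivity(sample, x):
    n = len(sample)
    return (n + 1) * (mean(sample + [x]) - mean(sample))

sample = [1.0, 2.0, 3.0, 4.0]
for x in (0.0, 10.0, 100.0):
    print(x, sensitivity(sample, x))   # grows linearly in x
```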

The attainability of the semiparametric Cramér-Rao bound is discussed by comparing

the asymptotic variance associated with the moment estimator derived from the REFs

with the inverse of the covariance of the efficient estimating function. It is proved that

under a semiparametric L2 - restricted model the semiparametric Cramér-Rao bound is

attained by estimators derived from REFs if and only if

∇M(θ)^T = A(θ, z) .   (1.20)

Using this result the attainability of the semiparametric Cramér-Rao bound is discussed in

many examples of L2 - restricted models. In particular it is shown that the semiparametric

Cramér-Rao bound is not attained by any estimator derived from REF in some cases of

the location-scale model with the third standardized cumulant fixed.

The chapter also presents a complete determination of the class of REFs for the semiparametric location-scale models with some standardized cumulants fixed.

Chapter 6 In this chapter two extensions of the L2 - restricted models are studied. We

term these: extended L2 - restricted models and partial parametric models.


The extended L2-restricted models are defined as L2-restricted models with the additional restrictions given by (1.21) and (1.22) below. Assume that

there are r, s ∈ N ∪{0}, functions b1 , . . . , br : X −→ IR in L2 (p), h1 , . . . , hs : X ×Z −→ IR,

b : IRr −→ IR and a continuous function h : IRs −→ IR such that

b( ∫_X b1(x) p(x) λ(dx), . . . , ∫_X br(x) p(x) λ(dx) ) = G(θ)   (1.21)

and

∀z ∈ Z,  h( ∫_X h1(x, z) p(x) λ(dx), . . . , ∫_X hs(x, z) p(x) λ(dx) ) ∈ H(θ) .   (1.22)

The model described above is called an extended L2-restricted model. Excluding the trivial choices for b and h, the conditions (1.21) and (1.22) cannot be expressed in terms of

the conditions of the L2 - restricted models. Hence the model described above is indeed an

extension of the class of L2 - restricted models. Examples of these models are: covariance

selection models on which the covariance matrix is not part of the interest parameter,

regression models with link function and some proportional odds models.

It is convenient to introduce the following notation. For each (θ, z) ∈ Θ × Z denote

Hk (θ, z) = Hk = span {f1 ( · ) − Eθz (f1 ), . . . , fk ( · ) − Eθz (fk )}

and

(Hk + Br)(θ, z) = (Hk + Br) = span{ f1(·) − Eθz(f1), . . . , fk(·) − Eθz(fk), b1(·) − Eθz(b1), . . . , br(·) − Eθz(br) } .

Moreover, let Hk⊥ (θ, z) = Hk⊥ and (Hk + Br )⊥ (θ, z) = (Hk + Br )⊥ denote the orthogonal

complement of Hk (θ, z) and (Hk + Br )(θ, z) in L20 (Pθz ), respectively. Note that Hk (θ, z)

in fact does not depend on z, and we write sometimes simply Hk (θ).

It is difficult, in general, to calculate the nuisance tangent spaces of extended L2 -

restricted models, but the following result is proved. Under weak regularity conditions

(that hold in all the examples considered) one has:

i) For each θ ∈ Θ,

∩z∈Z TN^{2⊥}(θ, z) = ∩z∈Z TN^{1⊥}(θ, z) = Hk(θ) .

ii) If Ψ is a regular estimating function with components ψ1, . . . , ψq, then for all θ ∈ Θ and i ∈ {1, . . . , q} we have the representation

ψi(· ; θ) = Σ_{j=1}^{k} αij(θ) { fj(·) − Mj(θ) } .


The result is used to characterize the class of REFs for the covariance selection models

on which the covariance matrix is not part of the interest parameter. Moreover, the

estimators derived from REFs are shown to be necessarily moment estimators of the

same type as the moment estimators derived from REFs for the case of the L2 - restricted

model considered in chapter 5. The following result proved in the chapter can be used

to calculate precisely the nuisance tangent spaces for the example of regression with link

function. If, in addition to the previous assumptions, one assumes that the function b is injective, then

TN (θ, z) = [span{f1 − E(f1 ), . . . , fk − E(fk ), g1 − E(g1 ), . . . , gs − E(gs )}]⊥ .

The partial parametric models impose a restriction of a nature different from that determined by the restrictions considered previously in the thesis. The restriction says essentially that the

distribution is known (or known to be in a certain parametric model) in a certain region

of the sample space. A typical example is the trimmed model.

The precise definition is given next. Consider the class of probability measures

P = {Pθz : θ ∈ Θ ⊆ IRq , z ∈ Z} = ∪θ∈Θ Pθ

on (X , A), as before. Suppose that for each θ ∈ Θ there is a function fθ : X −→ IR+ and

a measurable set Iθ such that the class of densities Pθ∗ is given by

Pθ∗ = {p : X −→ IR such that (1.10)-(1.15) and (1.23) hold} .

The additional condition in the definition above says that the density p is equal to fθ in

the set Iθ . More precisely, consider the condition

∀x ∈ Iθ , p(x) = fθ (x), λ − a.e. . (1.23)

The family P is termed a partial parametric model. Note that if (1.23) is suppressed from the definition above, the model becomes an L2-restricted model.

The following sequence of results extends the theory developed in chapter 5 to L2 - re-

stricted models for the context of partial parametric models. It is convenient to introduce

the following notation, for each θ ∈ Θ and z ∈ Z,

Iθc = X \ Iθ ;

H̄k⊥(θ, z) = H̄k⊥ = { ν ∈ L20(Pθz) : ν ∈ Hk⊥(θ, z) and supp(ν) ⊆ Iθc } ;

H̄k(θ, z) = H̄k = { H̄k⊥(θ, z) }⊥ .


Here supp(ν) is the support of the function ν ∈ L20 (Pθz ). We identify the L2 functions

that are almost surely equal and adopt the convention that supp(ν) ⊆ A means that

ν( · )χAc ( · ) = 0, λ-almost everywhere.

Theorem 1 Under a partial parametric semiparametric model, for each θ ∈ Θ, z ∈ Z and m ∈ [1, ∞],

i) TN^m(θ, z) = H̄k⊥(θ, z);

ii) TN^{m⊥}(θ, z) = H̄k(θ, z). Moreover,

H̄k(θ, z) = span[ { fi(·)χIθc(·) − Ei(θ) : i = 1, . . . , k } ∪ { f ∈ L20(Pθz) : supp(f) ⊆ Iθ } ] ,

where Ei(θ) = ∫_{Iθc} fi(x) p(x; θ, z) λ(dx).

Note that the function Ei(θ) in fact does not depend on the nuisance parameter, and we sometimes write simply H̄k(θ).

Corollary 1 Under a partial parametric model, for each θ ∈ Θ and m ∈ [1, ∞],

i) ∩z∈Z TN^{m⊥}(θ, z) = H̄k(θ);

ii) Any regular estimating function Ψ, with components ψ1, . . . , ψq, can be written in the form, for i = 1, . . . , q,

ψi(·) = ξi(· ; θ) + Σ_{j=1}^{k} αij(θ) { fj(·)χIθc(·) − Ej(θ) } ,   (1.24)

where ξi(· ; θ) ∈ L20(Pθz) and supp(ξi(· ; θ)) ⊆ Iθ.

The regular estimating function Ψ in part ii) of the corollary above has matricial

representation

Ψ(· ; θ) = ξ(· ; θ) + α(θ){ f(·)χIθc(·) − E(θ) } ,   (1.25)

where ξ( · ; θ) = (ξ1 ( · ; θ), . . . , ξq ( · ; θ))T , α(θ) = [αij (θ)]i,j and E(θ) = (E1 (θ), . . . , Ek (θ))T .

Theorem 2 Consider a partial parametric model. Suppose that the function E is dif-

ferentiable. Then we have for any regular estimating function with representation (1.25)

and for each θ ∈ Θ and z ∈ Z:


a) The Godambe information of Ψ is given by

JΨ(θ, z) = Jξ(θ, z) + {∇E(θ)}^T { Covθz(f χIθc) }^{-1} {∇E(θ)} .   (1.26)

b) The Godambe information is maximized (at (θ, z)) over the class of regular estimating functions by a choice of ξ and α given explicitly in the chapter.

Finally, the following theorem, which solves the problem of optimality, is proved.

Theorem 3 Consider a partial parametric model. Suppose that the function E is differentiable. Then the semiparametric Cramér-Rao bound is attained by regular estimating functions at (θ, z) ∈ Θ × Z if and only if a condition given explicitly in the chapter holds.


1.3 Basic set-up

We describe next the basic setup used throughout this thesis. Let us consider a family

of probability measures P defined on a common measurable space (X , A). It is assumed

that the elements of P possess a common support, say X , and that there exists a σ-finite

measure λ defined on (X , A) such that each member of P is absolutely continuous with re-

spect to λ. The family P defines the basic model on which we construct the mathematical

machinery necessary to attack later the problems of estimation under the semiparametric

models we have in mind. In this way, even though no parametric structure is assumed

in P, we will not construct a theory of nonparametric inference in full generality. It will

be seen that the assumption of fixed support and the existence of a common dominating

measure, usually avoided in the ordinary theory of nonparametric models, will provide

us a simple mathematical environment suitable for our purposes. Clearly, each P ∈ P is

identified with a version of its Radon-Nikodym derivative with respect to λ. Denote the

class of these densities by

P∗ = { dP/dλ (·) : P ∈ P } .

We shall use capital letters to denote the elements of P and small letters to represent the

elements of P ∗ . Without loss of generality we assume that the versions of the Radon-

Nikodym derivatives used are such that for all P ∈ P and for each x ∈ X ,

p(x) = dP/dλ (x) > 0 .   (1.28)

Moreover, when treating elements of P ∗ it will be assumed tacitly that λ almost every-

where equal functions are identified.

We introduce next the notation for Lq spaces used throughout. For q ∈ [1, ∞), the Lq-space with respect to the probability measure P ∈ P with density p ∈ P∗ will be denoted

by

Lq(P) = Lq(p) = { f : X −→ IR : ∫_X |f(x)|^q p(x) λ(dx) < ∞ } .

The usual norm of Lq(p) will be denoted by ‖ · ‖Lq(p). In the special case of the Hilbert space L2(p), the natural inner product will be denoted, for all f, g ∈ L2(p), by

⟨f, g⟩P = ⟨f, g⟩p = ∫_X f(x) g(x) p(x) λ(dx) .

The subspace of L2(P) functions that have zero expectation under P is denoted by

L20(P) = L20(p) = { f ∈ L2(P) : ∫_X f(x) p(x) λ(dx) = 0 } .

1.3. BASIC SET-UP 23

Given a set A ⊆ L20 (P ), we denote the closure of A with respect to the topology of L2 (P )

by clL2 (P ) (A) = clL2 (p) (A), and the orthogonal complement of A in L20 (P ) by A⊥ . Finally,

we consider the space L∞(P) = L∞(p) of functions from X to IR essentially bounded with respect to the probability P ∈ P (or p ∈ P∗), equipped with the norm given, for each f ∈ L∞(p), by

‖f‖L∞(p) = ess sup_{x∈X} |f(x)|

(see Dunford and Schwartz, 1964). We stress that for 1 ≤ r ≤ q ≤ ∞, Lq(p) ⊆ Lr(p). Moreover, for all f ∈ Lq(p),

‖f‖Lr(p) ≤ ‖f‖Lq(p) .   (1.29)
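A quick numerical check of this monotonicity of the Lq(p) norms under a probability measure (the discrete space and the function below are hypothetical illustrations):

```python
# For a probability measure p and exponents r <= q, the Lyapunov/Jensen
# inequality gives ||f||_{L^r(p)} <= ||f||_{L^q(p)}; checked on a
# hypothetical three-point space.
p = [0.2, 0.5, 0.3]
f = [3.0, -1.0, 2.0]

def lq_norm(f, q):
    return sum(abs(fi) ** q * pi for fi, pi in zip(f, p)) ** (1.0 / q)

norms = [lq_norm(f, q) for q in (1, 2, 4, 8)]
print(norms)   # nondecreasing in q
```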

We introduce next the basic structure and notation common to all the semiparametric models treated in the thesis. Consider a family of distributions P dominated by the σ-finite measure λ with representation

P∗ = { dPθz/dλ (·) = p(· ; θ, z) : θ ∈ Θ ⊆ IR^q, z ∈ Z } .

Here θ is the interest parameter and z is a nuisance parameter of arbitrary nature (typically infinite dimensional). We assume that Θ is open and that the mapping (θ, z) 7→ p(· ; θ, z) is a bijection between Θ × Z and P∗.

Let us assume from now on that for each (θ0, z0) ∈ Θ × Z,

∀x ∈ X , p(x; θ0, z0) > 0 .

Moreover, suppose that the partial score function

l(x; θ0, z0) = ∇θ p(x; θ, z0)|θ=θ0 / p(x; θ0, z0) = (l1(x; θ0, z0), . . . , lq(x; θ0, z0))^T

is λ-almost everywhere well defined, and assume that for i = 1, . . . , q, li(· ; θ0, z0) ∈ L2(Pθ0z0).


Part I

Models


Chapter 2

Path and Functional Differentiability

2.1 Introduction

We consider in this chapter some aspects of the general theory of non-parametric statistical

models which will be useful for the theory of semiparametric models. The key notions

introduced here are the path differentiability, the associated concept of tangent spaces

and tangent sets, and the notions of functional differentiability.

In section 2.2 we study a range of concepts of path differentiability and comparisons

of those notions are provided. An important point there is the equivalence between the

Hellinger differentiability, often used in the literature (see Bickel et al., 1993), and the

weak differentiability (see Pfanzagl, 1982, 1985 and 1990). Two auxiliary notions of path

differentiability are introduced: strong and mean differentiability. It is proved that weak

(or Hellinger differentiability) is an intermediate notion of path differentiability, weaker

than strong differentiability and stronger than mean differentiability. A new notion of

path differentiability, called essential differentiability, is introduced. We will interpret the

tangents of essential differentiable paths as score functions of one dimensional “regular

submodels” in the classical sense. Since the essential differentiability is weaker than the

other notions provided, this interpretation extends immediately to all the other path

differentiability notions considered.

In section 2.3 some differentiability notions of functionals are studied. In the approach given, a cone contained in the tangent set (i.e. the class of tangents of differentiable paths) is chosen, and the differentiability of the functional in question is defined relative to this cone (termed the tangent cone). Alternative notions of functional differentiability are

given by adopting different notions of path differentiability and/or using different tangent

cones. As we will see, the stronger the path differentiability notion used and the smaller the tangent cone, the weaker the notion of differentiable functionals induced, in the sense

that more statistical functionals are differentiable. We provide next some lower bounds


28 CHAPTER 2. PATH AND FUNCTIONAL DIFFERENTIABILITY

under a repeated sampling scheme. The weaker the path differentiability required and

the larger the tangent cone adopted, the sharper are the bounds obtained. The theory

will be applied to estimation in semiparametric models in section 2.4.

2.2 Differentiable paths

The main purpose of this section is to introduce the mathematical machinery necessary to

extend the notion of score function, classically defined for parametric models, to a context

where no (or only a partial) finite dimensional parametric structure is assumed. The key

idea here is to consider one-dimensional submodels of the family P of probability measures

(typically infinite dimensional). These submodels will be called paths. Following the steps

of Stein (1956), one should consider a class of submodels (or paths) sufficiently regular

in order to have a score function well defined and well behaved for each submodel, in the

sense that, at least, each score function should be unbiased (i.e. have expectation zero)

and have finite variance. Stein’s idea is to use the worst possible regular submodel to

assess the difficulty of statistical inference procedures for the entire family P. Evidently,

if the class of ”regular submodels” is too small, no sensible results are to be expected

from that procedure. On the other hand, if the class of ”regular submodels” is too large,

Stein’s procedure can become intractable or no simplification is really gained, which is not in the spirit of the method proposed. Hence, when applying the Stein procedure it is our task to find a class of ”regular submodels” of adequate size.

The idea of ”regular submodel” mentioned will be formalized by introducing the no-

tion of path differentiability. A range of concepts of path differentiability are studied

in this section, all of them fulfilling the minimal requirement for a ”regular submodel”,

i.e. the score functions of the differentiable paths (viewed as submodels) will be auto-

matically well defined, unbiased and possess finite variances. The strongest notion of

path differentiability considered is the L∞ differentiability (or pointwise differentiability)

and the weakest notion is the essential differentiability. It will turn out that a notion

of path differentiability called “Hellinger differentiability” (or “weak differentiability”) is

the weakest notion that captures some important essential statistical properties of the

model P. Another distinguished notion considered is the L2 differentiability which will

involve calculations with Hilbert spaces, simplifying all the computations required. The

L2 differentiability coincides with the Hellinger differentiability in most of the examples

considered in this thesis. It turns that the L2 differentiability will be useful in the theory

of estimating functions.

This section is organized as follows. Subsection 2.2.1 studies the basic notion of path

differentiability and some general properties of differentiable paths. Some specific concepts

of differentiability are introduced in the subsections 2.2.2, 2.2.3 and 2.2.4 where weak

or Hellinger, Lq and essential differentiability are studied, respectively. The associated

notions of tangent sets and tangent spaces are discussed in subsection 2.2.5.


We give next a more precise definition of the terms ”submodel” and ”regular submodel”

informally used in the previous discussion. Recall that we were interested in defining a

one-dimensional submodel contained in the family P for which the score function would

be well defined and well behaved.

Let us consider a subset V of [0, ∞) which contains zero and for which zero is an

accumulation point. The set V will play the role of the parameter space in the ”submodel”

we define. Typical examples are: [0, ε) for some ε > 0 and {1/n : n ∈ N} ∪ {0}. A mapping

from V into P ∗ assuming the value p ∈ P ∗ at zero is said to be a path converging to p.

Here the image of V under a path plays the role of the ”submodel” of P and the path acts

as a one-dimensional parametrization of the ”submodel”. It is convenient to represent a

path by a generalized sequence {pt }t∈V = {pt }, where for each t ∈ V , pt ∈ P ∗ is the value

of the path at t.

We introduce next the notion of differentiability which will enable us to formalize more

precisely what in the Stein program is the class of ”regular submodels”. A path {pt }t∈V

(converging to p) is differentiable at p ∈ P ∗ if for each t ∈ V we have the representation

pt ( · ) = p( · ) + tp( · )ν( · ) + tp( · )rt ( · ) (2.1)

for a certain ν( · ) ∈ L20 (p), and

rt −→ 0, as t ↓ 0 . (2.2)

The convergence in (2.2) is in some appropriate sense to be specified later. In fact, in

the next subsections we explore several notions of path differentiability by introducing

alternative definitions for that convergence. The term rt in (2.1) will be referred to as the

remainder term.

The function ν : X −→ IR given in (2.1) is said to be the tangent associated to the

differentiable path {pt }. Here the tangent plays the role of the score function of the

submodel parametrized by t ∈ V at p0 = p. To see the analogy with the score function

suppose that the convergence of rt in (2.2) is in the sense of the pointwise convergence.

In that case the tangent coincides with the score function of the submodel associated

with the differentiable path {pt } at p0 = p. In the general case, where the convergence

of rt is not necessarily pointwise convergence, the general chain rule for differentiation

of functions in metric spaces (see Dieudonné , 1960) can often be applied to justify our

interpretation of the tangent. We stress that according to our definition, the tangent of

a differentiable path (or alternatively the score of a regular submodel) has automatically

finite variance and mean zero (i.e. it is in L20 (p)).

Before embracing the study of notions of differentiability generated by some specific

definitions of the convergence of rt , we give a useful and trivial general property of re-

mainder terms of differentiable paths. Suppose that a path {pt } is differentiable at p ∈ P ∗


with representation given by (2.1), with ν ∈ L20(p). Then we have, for each t ∈ V,

rt(·) = { pt(·) − p(·) } / { t p(·) } − ν(·)   (2.3)

and

∫_X rt(x) p(x) λ(dx) = ∫_X [ { pt(x) − p(x) } / { t p(x) } − ν(x) ] p(x) λ(dx) = 0 .
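A concrete instance of this property (a hypothetical three-point space with counting measure; an illustration, not part of the text): for the path pt = p·exp(tν)/Z(t), whose tangent at t = 0 is the centered function ν, the remainder computed from (2.3) shrinks as t ↓ 0 and integrates to zero under p.

```python
import math

# Path p_t = p * exp(t*nu) / Z(t) on a hypothetical three-point space
# (counting measure); its tangent at t = 0 is the centered function nu.
p = [0.2, 0.5, 0.3]
nu = [-1.1, -0.1, 0.9]        # centered: sum(nu_i * p_i) = 0

def path(t):
    w = [pi * math.exp(t * ni) for pi, ni in zip(p, nu)]
    z = sum(w)                 # normalizing constant Z(t)
    return [wi / z for wi in w]

def remainder(t):
    # r_t from (2.3): r_t = (p_t - p) / (t * p) - nu
    pt = path(t)
    return [(pti - pi) / (t * pi) - ni for pti, pi, ni in zip(pt, p, nu)]

for t in (0.1, 0.01, 0.001):
    rt = remainder(t)
    sup = max(abs(ri) for ri in rt)
    mean_rt = sum(ri * pi for ri, pi in zip(rt, p))
    print(t, sup, mean_rt)     # sup shrinks with t; the p-mean stays 0
```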

Most of the estimation theory for non- and semi-parametric models found in the lit-

erature (see Bickel et al., 1993 and references therein) is developed using the notion

of Hellinger differentiability studied next. This notion appears in the literature in two

equivalent forms: weak differentiability (see Pfanzagl 1982, 1985 and 1990) and Hellinger

differentiability (see Hájek, 1962, LeCam, 1966 and Bickel et al., 1993). This notion of

differentiability plays a central role in the theory presented because it enables us to grasp

some essential statistical properties of the models considered. For instance, the Hellinger

differentiability is equivalent to local asymptotic normality of the submodel defined by

the path. Moreover, the Hellinger differentiability is used in the so called convolution the-

orem, which gives a bound for the concentration of a rich class of estimators (the regular

asymptotic linear estimators).

We begin by introducing the weak differentiability, which is in the general form of path differentiability formulated before. A path {pt}t∈V is weakly differentiable at p ∈ P∗ if there exist ν ∈ L20(p) and a generalized sequence of functions {rt}t∈V such that for each t ∈ V the representation (2.1) holds, and

(1/t) ∫_{{x : t|rt(x)| > 1}} |rt(x)| p(x) λ(dx) −→ 0 , as t ↓ 0 ,   (2.4)

and

∫_{{x : t|rt(x)| ≤ 1}} |rt(x)|^2 p(x) λ(dx) −→ 0 , as t ↓ 0 .   (2.5)

This is the general definition of path differentiability with the convergence of the generalized sequence {rt} given by (2.4) and (2.5).


Let us introduce now the Hellinger differentiability of paths. The key idea in this ap-

proach is to characterize the family P of probability measures by the class of square roots

of the densities, instead of the densities. The advantage of this alternative characterization

is that the square roots of the densities are in the Hilbert space

L2(λ) = { f : X −→ IR : ∫_X f^2(x) λ(dx) < ∞ } .

In this way the statistical model in play is naturally embedded into a space with a rich

mathematical structure. Using the usual topology of L2 (λ) one defines the differentiability

of paths in the sense of Fréchet (or in this case, since the domain of the path is contained

in IR, the equivalent notions of Hadamard and Gateaux differentiability could be used

also). The precise definition of Hellinger differentiability is the following. A path {pt}t∈V is Hellinger differentiable at p ∈ P∗ if there exist ν ∈ L20(p) and a generalized sequence {st}t∈V in L20(p) converging to zero as t ↓ 0, i.e.

‖st‖p −→ 0 , as t ↓ 0 ,   (2.6)

such that, for each t ∈ V,

pt^{1/2}(·) = p^{1/2}(·) + (1/2) t p^{1/2}(·) ν(·) + t p^{1/2}(·) st(·) .   (2.7)

The factor 1/2 in the second term of the right side of (2.7) serves to keep (2.7) consistent with the other notions of differentiability. Note that each st is in fact in L20(p). For, from (2.7),

st(·) = { pt^{1/2}(·) − p^{1/2}(·) } / { t p^{1/2}(·) } − ν(·)/2 .   (2.8)

Since ∫_X { pt^{1/2}(x)/p^{1/2}(x) }^2 p(x) λ(dx) = ∫_X pt(x) λ(dx) = 1 < ∞, we have that pt^{1/2}(·)/p^{1/2}(·) ∈ L2(p), and hence

{ pt^{1/2}(x) − p^{1/2}(x) } / { t p^{1/2}(x) } = (1/t) { pt^{1/2}(x)/p^{1/2}(x) − 1 } ∈ L2(p) .
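As a numerical illustration of (2.8) (a hypothetical three-point space; not part of the text): for the path pt = p·exp(tν)/Z(t) with ν centered under p, the remainder st computed from (2.8) has L2(p) norm tending to zero as t ↓ 0, as (2.6) requires.

```python
import math

# Hellinger remainder check: for p_t = p * exp(t*nu) / Z(t) on a
# hypothetical three-point space, s_t from (2.8) satisfies ||s_t||_p -> 0.
p = [0.2, 0.5, 0.3]
nu = [-1.1, -0.1, 0.9]        # centered under p

def path(t):
    w = [pi * math.exp(t * ni) for pi, ni in zip(p, nu)]
    z = sum(w)
    return [wi / z for wi in w]

def s_norm(t):
    # s_t = (p_t^{1/2} - p^{1/2}) / (t p^{1/2}) - nu / 2, measured in L2(p)
    pt = path(t)
    s = [(math.sqrt(pti) - math.sqrt(pi)) / (t * math.sqrt(pi)) - ni / 2.0
         for pti, pi, ni in zip(pt, p, nu)]
    return sum(si * si * pi for si, pi in zip(s, p)) ** 0.5

for t in (0.1, 0.01, 0.001):
    print(t, s_norm(t))        # decreases roughly linearly in t
```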

Proposition 1 A path {pt} is Hellinger differentiable if and only if {pt} is weakly differentiable.


We study next a useful range of path differentiability notions. These notions will serve to gauge how strong the Hellinger or weak differentiability is, and they will be used as auxiliary tools in the calculation of the weak tangents of weakly differentiable paths. In spite

of the secondary role these differentiability notions play in our development, they are

important in the general theory of differentiability of statistical functionals, in particular

in the theory of von Mises functionals. The L2 differentiability defined below will be

useful when studying the use of estimating functions for semiparametric models.

The main idea here is to consider the Lq convergence for the generalized sequence {rt }

appearing in the definition of differentiable paths. The precise definition is the following.

A path {pt}t∈V ⊆ P∗ is Lq differentiable at p ∈ P∗, for q ∈ [1, ∞], if there exist ν ∈ L20(p) and a generalized sequence {rt} in Lq(p) such that for each t ∈ V,

pt(·) = p(·) + t p(·) ν(·) + t p(·) rt(·)   (2.9)

and

‖rt‖Lq(p) −→ 0 , as t ↓ 0 .   (2.10)

Proposition 2 If 1 ≤ r ≤ q ≤ ∞ and a path {pt} is Lq differentiable at p ∈ P∗, then it is also Lr differentiable at p with the same tangent.

Proof: The proposition follows immediately from the fact that convergence in Lq(p) implies convergence in Lr(p). □

There are two distinguished cases of Lq path differentiability: strong and mean differentiability, corresponding to L2 and L1 differentiability respectively. The L1 differentiability is remarkable because it is the weakest notion of differentiability found in the literature, and the L2 differentiability distinguishes itself because the L2 spaces, when endowed with the natural inner product, are Hilbert spaces, which simplifies the calculations significantly.

We study next the relation between weak and Lq path differentiability. As we will see in propositions 3 and 4 below, weak differentiability is an intermediate notion of path differentiability between L2 and L1 differentiability.

Proposition 3 If a path {pt} is L2 differentiable at p ∈ P∗, then it is also weakly differentiable at p, with the same tangent.


Proof: Let {pt} be a differentiable path in the L2 sense with representation (2.9) and ‖rt‖L2(p) −→ 0 as t ↓ 0. We show that the path {pt} fulfills the conditions (2.4) and (2.5) for the convergence of the remainder term in the sense of the weak path differentiability. For,

(1/t) ∫_{{x : t|rt(x)| > 1}} |rt(x)| p(x) λ(dx) ≤ (1/t) ∫_{{x : t|rt(x)| > 1}} t |rt(x)| |rt(x)| p(x) λ(dx)
 = ∫_{{x : t|rt(x)| > 1}} |rt(x)|^2 p(x) λ(dx)
 ≤ ∫_X |rt(x)|^2 p(x) λ(dx)
 = ‖rt‖p^2 −→ 0 , as t ↓ 0 .

Hence {rt} satisfies (2.4). On the other hand,

∫_{{x : t|rt(x)| ≤ 1}} |rt(x)|^2 p(x) λ(dx) ≤ ∫_X |rt(x)|^2 p(x) λ(dx) = ‖rt‖p^2 −→ 0 , as t ↓ 0 .

Hence {rt} satisfies (2.5). We conclude that {pt} is differentiable in the weak sense with tangent ν. □

Proposition 4 If a path {pt} is weakly differentiable at p ∈ P∗, then it is also L1 differentiable (or differentiable in mean) at p, with the same tangent.

Proof: Take a path {pt } weakly differentiable at p with tangent ν. There exists a

generalized sequence of functions {rt } satisfying (2.4) and (2.5), such that for all t ∈ V ,

pt ( · ) = p( · ) + tp( · )ν( · ) + tp( · )rt ( · ) .

Note that (2.4) implies that

∫_{{x : t|rt(x)| > 1}} |rt(x)| p(x) λ(dx) −→ 0 , as t ↓ 0 .   (2.11)

We claim, moreover, that

∫_{{x : t|rt(x)| ≤ 1}} |rt(x)| p(x) λ(dx) −→ 0 , as t ↓ 0 .   (2.12)

Indeed, let st(·) = rt(·) χ_{{x : t|rt(x)| ≤ 1}}(·); by (2.5), ‖st‖L2(p) −→ 0, and from (1.29),

∫_{{x : t|rt(x)| ≤ 1}} |rt(x)| p(x) λ(dx) = ‖st‖L1(p) ≤ ‖st‖L2(p) −→ 0 , as t ↓ 0 .


Hence

∫_X |rt(x)| p(x) λ(dx) = ∫_{{x : t|rt(x)| ≤ 1}} |rt(x)| p(x) λ(dx) + ∫_{{x : t|rt(x)| > 1}} |rt(x)| p(x) λ(dx) −→ 0 , as t ↓ 0 . □

We study next the weakest notion of path differentiability considered in this text. A path {pt}t∈V ⊆ P∗ is essentially differentiable at p ∈ P∗ if there exist ν ∈ L20(p) and a generalized sequence {rt : X −→ IR}t∈V of (A, B(IR))-measurable functions such that for each t ∈ V,

pt(·) = p(·) + t p(·) ν(·) + t p(·) rt(·) ,

and for any sequence {kn}n∈N ⊆ V such that kn −→ 0 as n → ∞ there is a subsequence {ki}i∈N ⊆ {kn}n∈N such that rki(·) −→ 0 p-almost surely as i → ∞.

We show next that essential differentiability is weaker than differentiability in mean, which, in view of propositions 3 and 4, implies that essential differentiability is the weakest notion of path differentiability considered here.

Proposition 5 If a path {pt} is L1 differentiable at p ∈ P∗, then it is also essentially differentiable at p, with the same tangent.

Proof: The generalized sequence {rt}t∈V is Cauchy, because it converges in L1 to zero. Using theorem 3.12 in Rudin (1987, page 68) the essential differentiability follows. □


The following scheme represents the interrelation between the various notions of path differentiability considered:

L∞ differentiability
⇓
Lp differentiability
⇓
Lq differentiability
⇓
L2 differentiability
⇓
weak differentiability ⇔ Hellinger differentiability
⇓
L1 differentiability
⇓
essential differentiability

Here 2 < q < p < ∞.

Returning to the Stein approach, the notion of differentiable path formalized the idea of

”regular one-dimensional submodel”, the tangent of a differentiable path playing the role

of the score function of these submodels. Here we elaborate the notion of tangent set

which is the class of all possible tangents of differentiable paths. This will be useful to

work with the idea of ”worst possible case” contained informally in the Stein method,

and to specify global properties common to all the scores of ”regular one-dimensional

submodels”. For technical reasons we need in fact to work in many situations with the

smallest closed subspace containing the tangent set, which is called the tangent space.

In the next section we will define a notion of differentiability for statistical functionals.

There the tangent set will play the role of ”test functions”, analogous to the role of test

functions when one defines the differentiability of tempered distributions (see Rudin,

1973). The notion of tangent space plays a crucial role when studying the theory of

models with nuisance parameters. There we will need to obtain a component of a partial

score function orthogonal (in the L2 sense, i.e. uncorrelated) to the scores of a model

obtained by fixing the parameter of interest and letting the nuisance parameter vary.

This component of the partial score function is obtained by orthogonal projection of the

score function onto the orthogonal complement of the tangent space (or nuisance tangent


space as we will call the tangent space of the submodel we mentioned). It will then be

comfortable to work with a closed subspace of L2 . We remark that it can be proved that

the tangent set is a pointed cone, but in general not even a vector space. Hence the need to introduce the notion of tangent space as given here.

The formal definition of tangent space and tangent set depends on the notion of path

differentiability one uses. We give next a general definition of tangent set and tangent

space which will be made precise when we specify the notion of path differentiability we

use. Suppose we adopt a certain definition of path differentiability according to which a differentiable path at p ∈ P ∗ , say {pt }, has representation, for each t ∈ V ,

pt ( · ) = p( · ) + tp( · )ν( · ) + tp( · )rt ( · ) , (2.14)

and

rt −→ 0 , as t ↓ 0 , (2.15)

where the convergence in (2.15) is in a certain sense known. Then the tangent set of P

at p ∈ P ∗ is the class

T 0 (p) = T 0 (p, P) = { ν ∈ L20 (p) : ∃ V, {pt }t∈V ⊆ P ∗ , {rt }t∈V , such that ∀t ∈ V , (2.14) and (2.15) hold } .

Since the tangent sets and spaces depend on the notion of path differentiability adopted,

we speak of Lq (for q ∈ [1, ∞]), weak (or Hellinger) tangent sets and tangent spaces. When necessary we use the notation T W for the weak tangent space. The Lq tangent spaces are represented by T q and the essential tangent spaces by T e .

The following proposition relates the notions of tangent sets and tangent spaces given.

T ∞ (p) ⊆ T r (p) ⊆ T q (p) ⊆ T 2 (p) ⊆ T W (p) ⊆ T 1 (p) ⊆ T e (p) .

Proof: Straightforward from the interrelations between the notions of path differentiability. □

We close this section with two examples of the calculation of tangent spaces.


Example 1 Consider the class P of all distributions in IR dominated by the Lebesgue measure with continuous

density (with respect to the Lebesgue measure) and with support (of the density) equal to

the whole real line. Denote the class of densities of P by P ∗ . We calculate the tangent

space of P at each p ∈ P ∗ .

Take an arbitrary element ν of Cb ∩ L20 (p). Here Cb denotes the class of continuous compactly supported functions from IR to IR. It is a classical result of analysis that Cb is dense in L2 (p) (see Rudin, 1966), hence Cb ∩ L20 (p) is dense in L20 (p). We show that

ν ∈ T 0 (p, P) (for any notion of tangent sets defined before). Consider the path {pt } given, for t ∈ [0, ∞) small enough, by

pt ( · ) = p( · ) + tp( · )ν( · ) . (2.16)

We claim that for t sufficiently small, pt ∈ P ∗ , which implies that ν ∈ T 0 (p, P). It suffices

to verify that pt is positive and integrates to 1. For t small, pt is positive because ν is bounded and p is bounded on the support of ν; hence the second term on the right-hand side of (2.16) is smaller than p in absolute value (for t small). That pt integrates to 1 follows from the fact that ν has expectation zero (with respect to p). □

It is not surprising that the previous enormous class of distributions possesses a “full”

tangent space. The next example shows that this can be the case even in families where

we have a lot of information about the distributions of the family.

Example 2 (Full tangent space for families with information on the moments)

Consider the class P of all distributions in IR dominated by the Lebesgue measure with

continuous density (with respect to the Lebesgue measure) and with support (of the density)

equal to the whole real line. Suppose further that the moments of all orders exist and that

there exist a δ > 0, a k ∈ N and constants m1 , . . . , mk such that for each i ∈ {1, . . . , k} the moment of order i is contained in the open interval (mi − δ, mi + δ). We claim that the tangent space of P at any p ∈ P ∗ is L20 (p). The proof follows the same line of argument

as given in the previous example. Take a path as in (2.16) with ν ∈ Cb ∩ L20 (p). For t

sufficiently small, pt will be positive, integrate to one, possess finite moments of all orders, and its moments of order i, for i ≤ k, will be contained in the interval (mi − δ, mi + δ). □

2.3 Functional differentiability

2.3.1 Definition and first properties of functional differentiability

We consider in this section a functional φ : P ∗ −→ IRq (for some q ∈ N ) which will play the role of a parameter of interest that we want to estimate. Typical examples are the mean and the second moment functionals defined by φ(p) = ∫X x p(x)λ(dx) and φ(p) = ∫X x2 p(x)λ(dx) respectively. An important non-trivial example for the theory of

semiparametric models is the interest parameter functional defined next and studied in

detail in section 2.4.

Suppose that the family P ∗ of probability densities with respect to a measure λ can be

represented in the form

P ∗ = {p( · ; θ, z) : θ ∈ Θ ⊆ IRq , z ∈ Z} .

Here θ is a q-dimensional interest parameter, z is a nuisance parameter, and the mapping (θ, z) 7→ p( · ; θ, z) is assumed to be a bijection between Θ × Z and P ∗ . The interest parameter functional φ : P ∗ −→ IRq is defined, for each p( · ; θ, z) ∈ P ∗ ,

by

φ{p( · ; θ, z)} = θ .

□

We introduce next a notion of functional differentiability that will be the basis of a theory of estimation for the functional φ. Let p be a fixed element of P ∗ . Consider a

non empty subset T (p) of the tangent space at p. A functional φ : P ∗ −→ IRq is said to

be differentiable at p ∈ P ∗ with respect to T (p) if there exists a function φ•p : X −→ IRq ,

such that φ•p ∈ {L20 (p)}q and for each ν ∈ T (p) there is a differentiable path {pt } with

tangent ν and

{φ(pt ) − φ(p)}/t −→ < φ•p , ν >p , as t ↓ 0 . (2.17)

Here < φ•p , ν >p is the vector with components given by the inner product of the q

components of φ•p and ν. The function φ•p : X −→ IRq is said to be a gradient of the

functional φ at p (with respect to T (p)). Note that φ•p depends on the point p at which

we study the differentiability of the functional φ. If a functional φ is differentiable at each

p ∈ P ∗ we say that φ is differentiable.


Since the definition of functional differentiability depends on the notion of path dif-

ferentiability, we speak of L∞ , Lp , strong (L2 ), weak, mean (L1 ) and essential functional

differentiability. When necessary we superimpose a symbol indicating the notion of path

differentiability in play. When we are speaking generically or when it is clear from the

context which notion of path differentiability is in play, we just use the notation φ•p for

the gradient and T 0 (p, P ∗ ) = T 0 (p), T (p, P ∗ ) = T (p) for the tangent set and the tangent

space of P ∗ at p respectively.

Note that the notion of functional differentiability introduced here involves a subset

T (p) of the tangent space and not necessarily the whole tangent space as is current in

the literature. This will give much more flexibility to the estimation theory developed.

Clearly, the smaller the class T (p) (or the stronger the notion of path differentiability) used, the weaker the related notion of functional differentiability. On the other hand, the larger the class T (p), the sharper the results of the related estimation theory, in the sense that the lower bounds for the asymptotic variance will be larger or the optimality results will include more estimating sequences. In this sense the ideal would be to choose the largest T (p) (and the strongest path differentiability) that makes the functional under study differentiable. Of course, we will have to require some mathematical properties of the classes T (p) in order to obtain a notion of functional differentiability useful for the estimation theory of differentiable functionals. For instance, it will be assumed throughout (and silently) that T (p) is a pointed cone (i.e. if ν ∈ T (p), then for each α ∈ IR+ ∪ {0}, αν ∈ T (p)). We will refer from now on to T (p) as the tangent cone. It will be necessary

sometimes to require the tangent cones to be convex.

We consider next a simple example that illustrates the mechanics of functional differentiability.

Example 4 (Mean functional) Let λ be a σ-finite measure on a measurable space (X , A). Consider a family of probability measures P on (X , A) dominated by λ given by the following representation:

P = { dP/dλ ( · ) = p( · ) : (2.19)–(2.22) hold } , (2.18)

where

p(x) > 0 , for each x ∈ X ; (2.19)

∫X p(x)λ(dx) = 1 ; (2.20)

p is continuous ; (2.21)

∫X x2 p(x)λ(dx) ∈ IR+ . (2.22)

We denote the class of densities of the elements of P with respect to λ by P ∗ . Define the

functional M : P ∗ −→ IR by, for each p ∈ P ∗ ,

M (p) = ∫X x p(x)λ(dx) .

As we have seen in the previous section, the tangent space of P at any p ∈ P ∗ is the whole space L20 (p).

Take p ∈ P ∗ fixed and an arbitrary L2 -differentiable path at p, say {pt }t∈V , with

representation given by for each t ∈ V

pt ( · ) = p( · ) + tp( · )ν( · ) + tp( · )rt ( · ) ,

where ν ∈ L20 (p), {rt } ⊂ L20 (p) and rt −→ 0 in L2 (p) as t ↓ 0. We have

{M (pt ) − M (p)}/t = { ∫X x pt (x)λ(dx) − ∫X x p(x)λ(dx) }/t (2.23)
= < ν( · ), ( · ) >p + < rt ( · ), ( · ) >p
−→ < ν( · ), ( · ) >p , as t ↓ 0 .

The last convergence comes from the continuity of the inner product and the L2 (p) con-

vergence of the path remainder term to zero.

Define the function

Mp• ( · ) = ( · ) − ∫X x p(x)λ(dx) .

Since ν has expectation zero under p,

< ν, Mp• >p = < ν( · ), ( · ) − ∫X x p(x)λ(dx) >p = < ν( · ), ( · ) >p . (2.24)

Since (2.24) and (2.23) hold for any L2 differentiable path, we conclude that M is differ-

entiable with respect to the L2 tangent set and Mp• is a gradient of M . An argument based on subsequences (cf. Labouriau, 1996) yields the differentiability of M with respect to the essential tangent set, i.e. the mean functional is differentiable in the strongest sense we can define in our setup.

Note that in this example (2.17) holds for any differentiable path with tangent ν ∈ T (p). However, according to our definition of functional differentiability it would be enough if condition (2.17) held for one path with tangent ν. □
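The mechanics of this example can be checked numerically. The sketch below is a discretized illustration only (the truncated grid, the standard normal choice of p and the compactly supported direction ν are ad-hoc assumptions, not part of the text): it compares the difference quotient {M (pt ) − M (p)}/t along a path of the form (2.16) with the inner product < Mp• , ν >p .

```python
import numpy as np

# Discretize a truncated real line (an approximation for illustration only).
x = np.linspace(-12.0, 12.0, 100001)
dx = x[1] - x[0]

def integral(f):
    # crude Riemann-sum quadrature; adequate for smooth, decaying integrands
    return f.sum() * dx

p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # baseline density p (standard normal)

def M(q):
    return integral(x * q)                   # mean functional M(q)

# A bounded, compactly supported tangent direction; it is odd, so its
# p-expectation vanishes by symmetry and nu lies in L^2_0(p).
nu = np.where(np.abs(x) < 2.0, np.sin(x), 0.0)

t = 1e-4
pt = p * (1 + t * nu)                        # the path (2.16); positive since t|nu| < 1

grad = x - M(p)                              # the gradient M_p^bullet of the example
diff_quot = (M(pt) - M(p)) / t               # left-hand side of (2.17)
inner = integral(grad * nu * p)              # <M_p^bullet, nu>_p
print(diff_quot, inner)
```

Because M is linear along this path, the difference quotient equals the inner product exactly (up to quadrature and rounding error), reflecting the fact that the remainder term plays no role for this functional.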


It follows immediately from the definition of gradient that a function φ?p : X −→ IRq with components in L20 (p) is also a gradient of φ at p if and only if

∀ν ∈ T (p), < ν, φ•p >p = < ν, φ?p >p . (2.25)

We conclude from the remark above that if φ•p is a gradient of φ at p and ξ ∈ {T (p)}⊥ (i.e.

ξ is in the orthogonal complement of the tangent space with respect to L20 (p)), then φ•p + ξ

is also a gradient of φ at p. Hence, in general the gradient of a differentiable functional is

not unique.

A gradient φ•p of a differentiable functional at p ∈ P ∗ is said to be a canonical gradient

if φ•p ( · ) ∈ T̄ (p). Here T̄ (p) denotes the L2 closure of the space spanned by T (p). The following proposition shows that there exists only one canonical gradient (apart from almost surely equal functions) and gives a recipe to compute the canonical gradient, namely by orthogonally projecting any gradient onto T̄ (p). We will see that the canonical gradient plays a crucial role in the theory of estimation of functionals.

If φ•p : X −→ IRq and φ∗p : X −→ IRq are gradients of φ at p, then the vectors formed by the orthogonal projections of their components onto T̄ (p) coincide, that is,

(Π{φ•1p |T̄ (p)}, . . . , Π{φ•qp |T̄ (p)})T = (Π{φ∗1p |T̄ (p)}, . . . , Π{φ∗qp |T̄ (p)})T ,

p almost surely. Here Π{ · |T̄ (p)} denotes the orthogonal projection onto T̄ (p) in L20 (p).

Proof: We prove the proposition for the case where q = 1. The same argument applied componentwise proves the case of general q ∈ N , but with more notation. From the projection theorem we have the orthogonal decomposition

φ•p = Π{φ•p |T̄ (p)} + Π{φ•p |T̄ ⊥ (p)} .

Since Π{φ•p |T̄ ⊥ (p)} is orthogonal to T̄ (p), we conclude from (2.25) that Π{φ•p |T̄ (p)} is a gradient.


Now let φ•p and φ∗p be two gradients of φ at p and set ν = Π{φ•p |T̄ (p)} − Π{φ∗p |T̄ (p)} ∈ T̄ (p). Since, by continuity of the inner product, (2.25) extends from T (p) to T̄ (p), we have

< ν, Π{φ•p |T̄ (p)} >p = < ν, Π{φ∗p |T̄ (p)} >p , (2.26)

which yields

‖ν‖2p = < ν, ν >p = 0 ,

that is, ν = 0 p-almost surely. We conclude that the projections of the components of any two gradients onto T̄ (p) coincide p-almost surely. □

Example 5 (Mean functional continued) It can be shown that the tangent space of

the model P given by (2.18) at each p ∈ P ∗ is the whole space L20 (p). Hence the gradient

calculated in example 4 is the canonical gradient. Moreover, the canonical gradient is the

only possible gradient for the mean functional. Note that if we drop the condition that

requires the existence of the variance of p (i.e. condition (2.22)), then Mp• is no longer a

gradient (because it is not in L2 (p)) and M is not differentiable at p. □

We consider next a proposition giving trivial (but useful) rules for calculating gradients of “composed” functionals.

Let Ψ and φ be functionals differentiable at p ∈ P ∗ with gradients (canonical gradients) at p ∈ P ∗ Ψ• and φ• (Ψ? and φ? ) respectively, and let g : IRq −→ IRq be a differentiable function. Then:
i) For a, b ∈ IR, the functional aΨ + bφ is differentiable at p with gradient (canonical gradient) given by aΨ• + bφ• (aΨ? + bφ? ).
ii) The functional g ◦ φ is differentiable at p, and if φ• is the canonical gradient of φ then ∇g{φ(p)}{φ• ( · )}T is the canonical gradient of g ◦ φ.


Proof:
i) Straightforward.
ii) We give the proof for the case where q = 1; the general case is obtained in a similar way. Take an arbitrary differentiable path {pt } with tangent ν and define ξ(t) = φ(pt ). We have

{φ(pt ) − φ(p)}/t −→ < ν, φ• >p = ξ ′ (0) .

Now,

{(g ◦ φ)(pt ) − (g ◦ φ)(p)}/t −→ (g ◦ ξ)′ (0) = g ′ (ξ(0)) < ν, φ• >p = < ν, g ′ {φ(p)}φ• >p . □
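Rule ii) can be illustrated numerically with the mean functional M of example 4 and g(u) = u2 , for which the proposition gives 2M (p){ · − M (p)} as the gradient of g ◦ M . The sketch below is an ad-hoc discretized check (the N (1, 1) choice of p and the direction ν are assumptions for illustration, not part of the text).

```python
import numpy as np

x = np.linspace(-11.0, 13.0, 100001)
dx = x[1] - x[0]

def integral(f):
    return f.sum() * dx                   # crude quadrature, fine for smooth decay

p = np.exp(-(x - 1.0)**2 / 2) / np.sqrt(2 * np.pi)   # N(1, 1) density, so M(p) = 1

def M(q):
    return integral(x * q)                # mean functional of example 4

# Bounded, compactly supported direction, odd about 1, hence E_p[nu] = 0.
nu = np.where(np.abs(x - 1.0) < 2.0, np.sin(x - 1.0), 0.0)

t = 1e-5
pt = p * (1 + t * nu)

g = lambda u: u**2                        # outer map; g o M is the squared mean
grad_gM = 2 * M(p) * (x - M(p))           # grad g{M(p)} M_p^bullet, as in rule ii)

diff_quot = (g(M(pt)) - g(M(p))) / t
inner = integral(grad_gM * nu * p)
print(diff_quot, inner)
```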

We study next some results concerning the estimation of a differentiable statistical func-

tional under repeated sampling. These results will illustrate the importance of the canon-

ical gradient and will guide the choice of the notion of path differentiability and tangent

cone to be used.

We start by defining sequences of estimators, based on samples, for a given differentiable functional φ : P ∗ −→ IRq (with respect to some tangent cones T (p)). A sequence of functions {φ̂n }n∈N = {φ̂n } such that for each n ∈ N , φ̂n : X n −→ IRq is (An , B(IRq ))-

measurable is said to be an estimating sequence . Next we introduce two notions of

regularity of estimating sequences often found in the literature. An estimating sequence

{φ̂n } is said to be weakly regular (for estimating φ, with respect to the choice of tangent

cones made) if for each p ∈ P ∗ and each ν ∈ T (p) there exists a differentiable path

{pn−1/2 }n∈N converging to p and with domain V = {n−1/2 : n ∈ N }, for which

√n {φ(pn−1/2 ) − φ(p)} −→ ∫X φ• (x, p)ν(x)p(x)λ(dx)

and there exists a probability distribution Lpν (not depending on the path) such that

Ln pn−1/2 [ √n {φ̂n ( · ) − φ(p)} ] −→D Lpν .

If the distributions Lpν above do not depend on the tangent ν, then we say that {φ̂n }n∈N

is regular.


An important class of estimating sequences are the asymptotic linear sequences defined next. An estimating sequence {φ̂n } is said to be asymptotic linear (for estimating φ) if there exists a function ICφ : X × P ∗ −→ IRq such that for each p ∈ P ∗ the function ICφ ( · ; p) : X −→ IRq has components in L20 (p) and for each n ∈ N , given a sample x = (x1 , . . . , xn ) of size n, φ̂n admits the following representation:

φ̂n (x) = φ(p) + (1/n) ∑ni=1 ICφ (xi ; p) + opn (n−1/2 ) . (2.27)

The function ICφ is called the influence function of φ. The representation (2.27) can be re-written as

√n { φ̂n − φ(p) } = (1/√n) ∑ni=1 ICφ (xi ; p) + opn (1) .

Hence, by the central limit theorem,

√n { φ̂n − φ(p) } −→D N [0, Covp {ICφ ( · ; p)}] ,

where

Covp {ICφ ( · ; p)} = ∫X ICφ (x; p)ICφT (x; p)p(x)λ(dx) . (2.28)
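The simplest instance of (2.27) is the empirical mean: it is exactly linear, with influence function ICφ (x; p) = x − φ(p) and no remainder. The Monte Carlo sketch below (standard normal p, fixed seed and sample sizes are ad-hoc assumptions, not from the text) illustrates that the variance of √n{φ̂n − φ(p)} is close to Covp {ICφ ( · ; p)}.

```python
import numpy as np

rng = np.random.default_rng(0)

n, reps = 400, 4000
# Take p = N(0, 1): phi(p) = 0 and IC(x; p) = x, so Cov_p{IC} = 1.
samples = rng.standard_normal((reps, n))

phi_hat = samples.mean(axis=1)              # empirical mean: (2.27) with o_p term zero
root_n_err = np.sqrt(n) * (phi_hat - 0.0)   # sqrt(n){phi_hat_n - phi(p)}

print(root_n_err.mean(), root_n_err.var())  # variance should be close to 1
```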

Theorem 4 Let {φ̂n } be an asymptotic linear estimating sequence with influence function IC. Suppose that for each p ∈ P ∗ the tangent cone is given by the weak tangent set, T (p) = T W 0 (p, P ∗ ). Then {φ̂n } is regular if and only if for all p ∈ P ∗ , φ is differentiable at p (with respect to T (p)) and IC( · ; p) is a gradient of φ at p.

Proof: See Pfanzagl (1990) for the case where q = 1 or Bickel et al. (1995). □

The theorem above identifies (influence functions of) regular asymptotic linear sequences of estimators for estimating the functional φ with the gradients of φ. The covariance, ∫ φ•p (x)φ•p (x)T p(x)λ(dx), of a gradient φ•p of φ is the asymptotic covariance of

the corresponding regular asymptotic linear estimating sequence (under p) with influence

function φ•p . On the other hand, since the components of the canonical gradient φ∗ of φ

are the orthogonal projection of the components of any gradient onto the tangent space,

we have for a given gradient φ•p and for all p ∈ P ∗

φ•p ( · ) = φ?p ( · ) + R( · ; p) ,


for some R( · ; p) ∈ {T ⊥ (p; P)}q . A standard argument yields then that, for all p ∈ P ∗ ,

∫X {φ∗p (x)}{φ∗p (x)}T p(x)λ(dx) ≤ ∫X {φ•p (x)}{φ•p (x)}T p(x)λ(dx) , (2.29)

with inequality in the sense of the Löwner partial order of matrices. That is, the covari-

ance of the canonical gradient is a lower bound for the asymptotic covariance of regular

asymptotic linear estimating sequences. Moreover, only an asymptotic linear estimating

sequence with influence curve equal to the canonical gradient achieves this bound. We

say that an asymptotic linear estimating sequence is optimal if, for each p ∈ P ∗ , its in-

fluence function is the canonical gradient of φ. The bound (2.29) is sometimes called the

semiparametric Cramér-Rao bound.
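The inequality (2.29) reflects a general fact about orthogonal projections in L20 (p): projecting a gradient onto the tangent space can only decrease its covariance. The finite sketch below (the weighted grid standing in for L20 (p), the two-function tangent space and the particular “gradient” are ad-hoc assumptions, not from the text) computes the projection by weighted least squares and checks that the variance drops while the residual is orthogonal to the tangent space.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 40001)
dx = x[1] - x[0]
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def ip(f, g):
    return (f * g * p).sum() * dx        # inner product <f, g>_p

# Toy tangent space: span of two centered functions in L^2_0(p).
basis = [x, x**2 - 1.0]

# A centered "gradient" that does not lie in the tangent space.
grad = x + 0.5 * np.sin(3 * x)

# Normal equations of the weighted least-squares projection.
G = np.array([[ip(bi, bj) for bj in basis] for bi in basis])
b = np.array([ip(bi, grad) for bi in basis])
c = np.linalg.solve(G, b)
grad_proj = c[0] * basis[0] + c[1] * basis[1]   # projection onto the tangent space
resid = grad - grad_proj

var_grad = ip(grad, grad)            # "asymptotic variance" of the original gradient
var_proj = ip(grad_proj, grad_proj)  # variance of the projected (canonical) gradient
print(var_proj, var_grad)            # var_proj <= var_grad, as in (2.29)
```

The Pythagorean identity var_grad = var_proj + ‖resid‖² makes the variance reduction explicit.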

In spite of the elegance of this theory, some care should be observed in applying it.

Firstly, there is a certain degree of arbitrariness in choosing only the class of regular

asymptotic linear estimating sequences. When restricting to that class one can discard

many interesting sequences. This criticism applies, of course, to any optimality approach.

A second, more specific criticism is the following: It occurs very often that the tangent

space of large (semi- or non-parametric) models is the whole space L20 (see the examples

at the end of the section on tangent spaces). In those cases, due to the uniqueness of the

canonical gradient, each differentiable functional possesses only one gradient. We conclude

from the previous discussion that then there is only one possible influence function and

hence all regular asymptotic linear estimating sequences are asymptotically equivalent (as

far as the asymptotic variance is concerned). Therefore an optimality theory for regular

asymptotic linear estimators is meaningless for the models with tangent spaces equal to

the whole L20 . We refine next the optimality theory for functional estimation.

It is convenient to introduce the following notation. Given a differentiable functional

φ with respect to the tangent cones {T (p) : p ∈ P ∗ } and with canonical gradient φ? ( · , p) at each p ∈ P ∗ , denote ∫X φ?p (x)φ?p (x)T p(x)λ(dx) by Iφ (p); that is, Iφ (p) is the covariance matrix of the canonical gradient. A weakly regular estimating sequence {φ̂n } is asymptotically of constant bias at p ∈ P ∗ if for each ν, η ∈ T (p)

∫ x dLpν (x) = ∫ x dLpη (x) ∈ IRq .

Theorem 5 (van der Vaart's extended Cramér-Rao theorem) Let φ : P ∗ −→ IRq be differentiable at p ∈ P ∗ with respect to T (p) ⊆ T W (p). Suppose that the sequence {φ̂n }

is weakly regular and asymptotically of constant bias at p ∈ P ∗ . Suppose also that the

covariance matrix of Lp0 exists. Then

Cov(Lp0 ) ≥ Iφ (p) , (2.30)


where the symbol “≥” is understood in the sense of the Löwner partial order of matrices1 . Moreover, the equality in (2.30) occurs only if

√n {φ̂n − φ(p)} = (1/√n) ∑nj=1 φ?p (xj ) + oP (1) . (2.31)

□

We see from the theorem above that the larger the tangent cones T (p) used, the sharper the inequality (2.30). Small tangent cones make the differentiability of the functional more likely, but can also make the bound in (2.30) unattainable.

Another important optimality result in the theory of estimation of functionals is the

convolution theorem, of which we give the following version.

Theorem 6 (Convolution theorem) Suppose that T (p) is convex and φ : P ∗ −→ IRq

is differentiable at p ∈ P ∗ with respect to T (p). Then any limiting distribution Lp of a

regular estimating sequence for φ at p satisfies

Lp = N (0, Iφ (p)) ∗ M , (2.32)

where M is a probability measure on IRq .

Proof: See Pfanzagl (1990) for the case where q = 1 and T (p) = T W (p), and van der Vaart (1980) for the general case. □

The expression (2.32) shows that, under the assumptions of the convolution theorem, a regular estimating sequence cannot possess asymptotic covariance smaller than the L2 (p) squared norm of the canonical gradient. This extends the interpretation of the optimality theory for regular asymptotic linear estimating sequences. In fact, even when the tangent cone is the whole L20 , the “optimal” regular asymptotic linear estimating sequence attains the bound for the concentration of regular estimating sequences given by the convolution theorem, provided the functional is differentiable. An advantage of the version of the convolution theorem presented is that we need not work with the whole tangent space but only with a convex cone contained in it. This can be useful when the functional under study is not differentiable or when the calculation of the (weak) tangent space is not feasible.

We close this section by presenting a theorem that gives a minimax approach to the problem of estimation of functionals. A function l : IRq −→ IR is said to be bowl-shaped if l(0) = 0, l(x) = l(−x) and, for all k ∈ IR, {x : l(x) ≤ k} is convex.

1 That is, A ≥ B means that A − B is nonnegative definite.


Theorem 7 (Local asymptotic minimax theorem) Suppose that T (p) ⊆ T W (p) is convex and φ : P ∗ −→ IRq is differentiable at p ∈ P ∗ with respect to T (p). Then

i) For any sequence of estimators which is weakly regular at p and any bowl-shaped loss function l,

supν∈T (p) ∫ l(x) dFpν (x) ≥ ∫ l(x) dN (0, Iφ (p))(x) . (2.33)

ii) For any bowl-shaped loss function l and any estimating sequence {φ̂n },

limc→∞ lim inf n→∞ supQ∈Hn (p,c) EQ {l[ √n {φ̂n − φ(Q)}]} ≥ ∫ l(x) dN (0, Iφ (p))(x) , (2.34)

where Hn (p, c) := {Q ∈ P : n ∫ {q 1/2 (x) − p1/2 (x)}2 λ(dx) ≤ c} (with q = dQ/dλ) is the intersection of P with a Hellinger ball of center p and radius of order n−1/2 . □

Note that from part i) one can obtain a bound, based on the canonical gradient, for the concentration of weakly regular estimating sequences, provided φ is differentiable with respect to some convex tangent cone. In particular, if there exists an optimal asymptotic linear estimating sequence and the assumptions of the theorem hold (i.e. differentiability of φ and convexity of the tangent cone), then the bound for weakly regular estimating sequences given by (2.33) is attained by this regular asymptotic linear estimating sequence. In this way, in the case where the tangent space is the whole L20 , the optimality of the (unique) regular asymptotic linear estimating sequence can be justified. The bound of the second part of the theorem above holds for the whole class of estimators; however, it is in general not attainable.

2.4 Asymptotic bounds for semiparametric models

We consider a family of distributions P dominated by a σ-finite measure λ with representation

P ∗ = { dPθz /dλ ( · ) = p( · ; θ, z) : θ ∈ Θ ⊆ IRq , z ∈ Z } .

Here θ is a q- dimensional interest parameter and z is a nuisance parameter of arbitrary

nature. We assume that Θ is open and that the mapping (θ, z) 7→ p( · ; θ, z) is a bijection

between Θ × Z and P ∗ . The interest parameter functional φ : P ∗ −→ IRq is defined, for

each p( · ; θ, z) ∈ P ∗ , by

φ{p( · ; θ, z)} = θ .

We will consider the differentiability of the interest parameter functional φ for a range of

tangent cones.

Recall that we assumed that for each (θ0 , z0 ) ∈ Θ × Z,

∀x ∈ X , p(x; θ0 , z0 ) > 0 ,

that the partial score function

l(x; θ0 , z0 ) = { ∇θ p(x; θ, z0 )|θ=θ0 } / p(x; θ0 , z0 ) = (l1 (x; θ0 , z0 ), . . . , lq (x; θ0 , z0 ))T

is λ- almost everywhere well defined and that for i = 1, . . . , q,

li (x; θ0 , z0 ) ∈ L20 (Pθ0 z0 ) .

Let us consider a fixed (θ, z) ∈ Θ × Z at which we will study the differentiability of

φ. For notational simplicity we denote p( · ; θ, z) by p( · ).

The first tangent cone we consider is

T1 (p) = span{li (x; θ, z) : i = 1, . . . , q} .

Take ν ∈ T1 (p). There exists α ∈ IRq such that ν( · ) = lT ( · ; θ, z)α. Define (for t small

enough) the path

pt ( · ) = p( · ; θ + tα, z) .

Clearly, there exists {rt } such that

lT ( · ; θ, z)α = {p( · ; θ + tα, z) − p( · )}/{tp( · )} + rt ( · ) , (2.35)


with rt ( · ) −→ 0 λ- almost everywhere. Hence the path {pt } is (L∞ ) differentiable with

tangent lT ( · ; θ, z)α. Moreover,

{φ(pt ) − φ(p)}/t = {(θ + tα) − θ}/t = α .

Defining φ?p ( · ) = Cov−1θz {l( · ; θ, z)} l( · ; θ, z) we obtain

∫X φ?p (x)ν(x)p(x)λ(dx) = Cov−1θz {l( · ; θ, z)} ∫X l(x; θ, z)lT (x; θ, z)α p(x)λ(dx) = α = limt→0 {φ(pt ) − φ(p)}/t .

We conclude that φ is differentiable at p with respect to T1 (p). Moreover,

Covθz ( l( · ; θ, z) )−1 l( · ; θ, z)

is the canonical gradient of φ. Note that in (2.35) we implicitly used L∞ path differentiability; however, the argument presented holds for any weaker notion of path differentiability. Indeed, the essential point is that we identify (through (2.35)) any element of the tangent cone T1 (p) with an L∞ differentiable path. If we adopt a path differentiability weaker than L∞ differentiability, then the L∞ differentiable paths identified with the elements of the tangent cone are also differentiable in the current sense, and the differentiability of the functional φ follows from the argument presented above.

The matrix Iφ (p) (i.e. the covariance matrix of the canonical gradient of φ at p) is the inverse of the covariance matrix of the score function l( · ; θ, z). The bounds for the

asymptotic variance obtained with this naive choice of tangent cones are not attainable

in general. This will be apparent from the development that follows, where sharper bounds will be presented.
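As a degenerate but concrete check of the T1 (p) computation, take the one-dimensional normal location model (an assumed toy setting, not from the text): the score is x − θ, Covθz {l} is the Fisher information, and the canonical gradient Cov−1 {l} l has covariance equal to the inverse information. The discretized sketch below verifies this, and the resulting bound is the one attained by the empirical mean, whose influence function is exactly x − θ.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 40001)
dx = x[1] - x[0]
theta = 0.0
p = np.exp(-(x - theta)**2 / 2) / np.sqrt(2 * np.pi)   # N(theta, 1), variance known

def ip(f, g):
    return (f * g * p).sum() * dx       # inner product <f, g>_p

score = x - theta                       # score l(x; theta) of the location model
info = ip(score, score)                 # Cov_theta{l}: the Fisher information (= 1 here)
grad = score / info                     # canonical gradient Cov^{-1}{l} l

bound = ip(grad, grad)                  # its covariance: the inverse information
print(info, bound)
```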

We introduce the notion of nuisance tangent space, which plays a fundamental role in

the estimation theory in semiparametric models. For each θ0 ∈ Θ consider the submodels

Pθ∗0 = {p( · ; θ0 , z) : z ∈ Z} .

The nuisance tangent set at (θ, z) ∈ Θ × Z, TN0 (θ, z), is the tangent set of Pθ∗ , i.e.

TN0 (θ, z) = T 0 (p, Pθ∗ ). The closure of the space spanned by the nuisance tangent set is

called the nuisance tangent space and denoted by TN (θ, z). Here we do not specify the

notion of path differentiability adopted, but when necessary a symbol will be superim-

posed.

An alternative for the tangent cone, better than T1 (p), is

T2 (p) = T1 (p) ∪ TN0 (θ, z) .

We show next that φ is differentiable with respect to T2 (p), no matter which notion of

path differentiability we use. Consider a ν ∈ TN0 (p) ⊂ T2 (p). There is a differentiable

path {pt } contained in Pθ∗ with tangent ν. Since for each t, pt ∈ Pθ∗ , φ(pt ) = θ = φ(p) and

{φ(pt ) − φ(p)}/t = 0 .

From the definition of functional differentiability, any gradient φ•p of φ must satisfy, for each ν ∈ TN0 (θ, z),

0 = limt↓0 {φ(pt ) − φ(p)}/t = ∫X φ•p (x)ν(x)p(x)λ(dx) . (2.36)

On the other hand, the argument presented in the case of the tangent cone T1 (p) implies that, if ν ∈ span{li ( · ; θ, z) : i = 1, . . . , q}, say ν( · ) = l( · ; θ, z)T α for some α ∈ IRq , then any gradient φ•p of φ satisfies

α = ∫X φ•p (x)ν(x)p(x)λ(dx) . (2.37)

Clearly, the conditions (2.36) and (2.37) are sufficient to ensure that φ•p is a gradient

of φ. From these considerations, a natural candidate for being a gradient of φ is the

(standardized) projection of the score function onto the orthogonal complement of the

nuisance tangent space. Formally, define the function lE : X × Θ × Z −→ IRq by, for each (θ, z) ∈ Θ × Z, lE ( · ; θ, z) = (l1E ( · ; θ, z), . . . , lqE ( · ; θ, z))T where, for i = 1, . . . , q,

liE ( · ; θ, z) = Π{li ( · ; θ, z) | TN⊥ (θ, z)} .

Here Π(g|A) is the orthogonal projection of g ∈ L20 (Pθz ) onto A ⊆ L20 (Pθz ). Moreover, TN⊥ (θ, z) is the orthogonal complement of TN (θ, z) in L20 (Pθz ). The function lE is called

the efficient score function and we define the efficient information by

J(θ, z) = ∫X lE (x; θ, z)lE (x; θ, z)T p(x)λ(dx) .

Define

φ?p ( · ) = J(θ, z)−1 lE ( · ; θ, z) .

Clearly φ?p satisfies (2.36) and (2.37). We conclude that φ?p is a gradient of φ. Moreover,

φ?p is the canonical gradient (with respect to T2 (p)), since φ?p is in the closure of the span

of the tangent cone.
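A discretized sketch of the projection defining lE , under an assumed toy model (not from the text): N (θ, σ 2 ) with σ 2 as nuisance parameter, the nuisance tangent space being spanned by the σ 2 -score. In this model the two scores happen to be orthogonal, so the efficient score coincides with the θ-score and J(θ, σ 2 ) = 1/σ 2 .

```python
import numpy as np

x = np.linspace(-12.0, 12.0, 40001)
dx = x[1] - x[0]
theta, s2 = 0.0, 1.0
p = np.exp(-(x - theta)**2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

def ip(f, g):
    return (f * g * p).sum() * dx                # inner product <f, g>_p

l_theta = (x - theta) / s2                       # partial score for the interest parameter
l_nuis = ((x - theta)**2 - s2) / (2 * s2**2)     # score for the nuisance parameter sigma^2

# Project l_theta onto the orthogonal complement of span{l_nuis}.
l_eff = l_theta - ip(l_theta, l_nuis) / ip(l_nuis, l_nuis) * l_nuis

J = ip(l_eff, l_eff)                             # efficient information
print(J)                                         # equals 1/s2 in this model
```

In models where the two scores are not orthogonal, the same projection strictly shrinks l_theta, and J is then strictly smaller than the information for θ with z known.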

Note that choosing T2 (p) as the tangent cone, the functional φ is still differentiable and we obtain a bound, related with the extended Cramér-Rao inequality, sharper than the bound obtained with T1 (p). However, since T2 (p) is not necessarily convex, it is impossible to use the convolution theorem and the local minimax theorem.

A third alternative for the tangent cone is

T3 (p) = { l( · ; θ, z)T α + η( · ) : α ∈ IRq , η ∈ TN0 (θ, z) } .

We introduce next an additional assumption on the model that will make φ differentiable. Suppose that for each α ∈ IRq and each η ∈ TN0 (θ, z) there exists a generalized sequence {zt } = {zt (θ, z)} such that {pt } ⊂ P ∗ , given by

pt ( · ) = p( · ; θ + tα, zt ) , (2.38)

is a differentiable path with tangent l( · ; θ, z)T α + η( · ). Assumptions of this type appear often in the literature in an implicit form (see for instance Pfanzagl, 1990, page 17, for the case where q = 1). We prove differentiability of φ at p with respect to T3 (p) under

(2.38). Given ν( · ) = l( · ; θ, z)T α + η( · ) ∈ T3 (p), and taking a path {pt } as in (2.38) we

obtain

{φ(pt ) − φ(p)}/t = {(θ + tα) − θ}/t = α .

On the other hand, since lE is orthogonal to TN (θ, z),

∫X J −1 (θ, z) lE (x; θ, z)ν(x)p(x)λ(dx) = J −1 (θ, z) ∫X lE (x; θ, z)lT (x; θ, z)α p(x)λ(dx) + J −1 (θ, z) ∫X lE (x; θ, z)η(x)p(x)λ(dx) = α = limt↓0 {φ(pt ) − φ(p)}/t .

Hence φ is differentiable at p with respect to T3 (p) and

φ?p ( · ) = J(θ, z)−1 lE ( · ; θ, z)

is the canonical gradient. In other words, we obtain the same canonical gradient of φ whether we work with T2 (p) or T3 (p), and consequently the extended Cramér-Rao bound is also the same with the two choices of tangent cone. Note that T3 (p) is convex, hence

we can use the convolution and the local asymptotic minimax theorems. This provides an additional justification of the extended Cramér-Rao bound (via the convolution theorem) and an optimality theory involving a larger class of estimators, namely the weakly regular asymptotic linear estimating sequences (as in the first part of the local asymptotic minimax theorem) or even arbitrary estimating sequences (as in the second part of the local asymptotic minimax theorem). However, we pay a price for these improvements: we have to introduce regularity conditions on the model in order to obtain the differentiability of the interest parameter functional.

It is current in the literature to take the whole (weak or Hellinger) tangent set as the

tangent cone, assume that the tangent set is equal to T3 (p) and use (implicitly) assump-

tions equivalent to (2.38) (see Pfanzagl, 1990 page 17). The strength of the approach

based on tangent cones, and not necessarily on the whole tangent set, is that it allow us

to graduate the regularity conditions. We can avoid the assumptions mentioned above

in the difficult cases or take full advantage of them in the sufficiently regular cases. The

approach based on tangent cones allow us to treat the cases where the tangent set is dif-

ficult (or virtually impossible) to calculate (see for instance the semiparametric extended

L2 - restricted models considered in chapter 5).

We conclude the chapter with a comment regarding reparametrizations. Suppose

that we reparametrize the model by considering the interest parameter g(θ) instead of θ.

Here g is a one-to-one differentiable application from IRq to IRq . The interest parameter functional becomes g ◦ φ(Pθz ) = g(θ). An application of proposition 8 and the chain rule shows that if an estimating sequence {θ̂n } attains the semiparametric Cramér-Rao bound for estimating θ then the transformed sequence {g(θ̂n )} attains the Cramér-Rao bound for estimating g(θ).


Chapter 3

Estimating and Quasi Estimating Functions

3.1 Introduction

In this chapter the theory of estimating functions for semiparametric models is studied.

The basic definitions and properties of estimating functions are given in section 3.2.

There a related notion called quasi estimating function is also introduced. Quasi esti-

mating functions are essentially functions of the observations, the interest parameter and

(differently from estimating functions) of the nuisance parameter. They will provide a clearer way to formalize the theory of estimating functions and to relate estimating functions to regular asymptotic linear estimators. In order to construct

an optimality theory for estimating functions, we define a class of what we call regular

estimating functions. Two alternative (and equivalent) characterizations of the regular

estimating functions are provided in the subsections 3.2.2 and 3.2.3. The second charac-

terization is motivated by differential geometric considerations concerning the statistical

model (inspired by Amari and Kawanabe, 1996).

The characterizations referred to are used to derive an optimality theory in section 3.3. A necessary and sufficient condition for the coincidence of the bound for the concentration of estimators based on estimating functions and the semiparametric Cramér-Rao bound is provided in subsection 3.3.3. This condition says essentially that the nuisance tangent space should not depend on the nuisance parameter.

The last section contains some complementary material. Subsection 3.4.1 studies a technique for obtaining optimal estimating functions when the likelihood function can be decomposed in a certain way. In this way an alternative justification for the so called principle of conditioning will be provided. A generalization of the notion of estimating function is introduced in subsection 3.4.2. The chapter closes with a result concerning estimators derived from regular estimating functions; this result turns out to be useful in the subsequent chapters.

3.2 Estimating functions: basic definitions and properties

3.2.1 Estimating and quasi-estimating functions

A function Ψ : X × Θ −→ IRq such that, for each θ ∈ Θ, the associated function Ψ( · ; θ) : X −→ IRq is measurable is termed an estimating function. Estimating functions are used to define sequences of estimators for the parameter of interest θ in the following way. Under a repeated independent sampling scheme, given a sample x = (x1, . . . , xn)T of size n from the (unknown) distribution Pθz ∈ P, define θ̂n implicitly as the solution of the equation

Σ^n_{i=1} Ψ(xi ; θ̂n) = 0 .   (3.1)

Under regularity conditions each θ̂n is well defined and the sequence {θ̂n} is consistent (for estimating θ) and asymptotically normally distributed. We explore this fact to construct an optimality theory.
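As a hedged numerical aside (not part of the thesis), equation (3.1) can be solved by any standard root-finding routine once an estimating function Ψ has been chosen. The sketch below, with hypothetical names throughout, uses bisection for the illustrative choice Ψ(x; θ) = x − θ, whose root is the sample mean.

```python
# Hypothetical illustration of solving the estimating equation (3.1),
# sum_i Psi(x_i; theta) = 0, by bisection.  Psi(x; theta) = x - theta is
# used only as a toy example; its root is the sample mean.

def solve_estimating_equation(psi, xs, lo, hi, tol=1e-10):
    """Find theta in [lo, hi] with sum(psi(x, theta) for x in xs) = 0,
    assuming the summed function changes sign exactly once on [lo, hi]."""
    def g(theta):
        return sum(psi(x, theta) for x in xs)

    a, b = lo, hi
    ga = g(a)
    while b - a > tol:
        m = 0.5 * (a + b)
        if (ga > 0) == (g(m) > 0):
            a, ga = m, g(m)
        else:
            b = m
    return 0.5 * (a + b)

xs = [1.2, -0.7, 3.1, 0.4, 2.0]
theta_hat = solve_estimating_equation(lambda x, t: x - t, xs, -10.0, 10.0)
assert abs(theta_hat - sum(xs) / len(xs)) < 1e-8  # root = sample mean
```

In practice Newton-type iterations are more common, but any method that locates the root yields the same estimating sequence.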

We introduce next a notion related to estimating functions. A function Ψ : X × Θ × Z −→ IRq, of the parameters and the observations, such that for each θ ∈ Θ and each z ∈ Z the function Ψ( · ; θ, z) : X −→ IRq is measurable, is called a quasi-estimating function. Each estimating function can be naturally identified with a quasi-estimating function, namely the one that is constant in the nuisance parameter. We make no distinction between estimating functions and the corresponding quasi-estimating functions. This abuse of language causes, in general, no risk of ambiguity.

A quasi-estimating function Ψ : X × Θ × Z −→ IRq such that the conditions (3.2)-(3.6) below are satisfied is said to be a regular quasi-estimating function. The conditions are, with ψi denoting the ith component of Ψ and for all θ0 ∈ Θ, all z ∈ Z and all i, j ∈ {1, . . . , q},

ψi( · ; θ0, z) ∈ L20(Pθ0z) ;   (3.2)

the partial derivative with respect to θ is well defined (almost everywhere), i.e.

(∂/∂θj)ψi( · ; θ, z)|θ=θ0 exists ;   (3.3)

the order of differentiation with respect to θ and integration can be exchanged in the following sense:

(∂/∂θj) ∫_X ψi(x; θ, z)p(x; θ, z)λ(dx)|θ=θ0 = ∫_X (∂/∂θj)[ψi(x; θ, z)p(x; θ, z)]θ=θ0 λ(dx) ;   (3.4)

Eθz{∇θΨ( · ; θ, z)} = [ ∫_X (∂/∂θj)ψi(x; θ, z)|θ=θ0 p(x; θ0, z)λ(dx) ]i,j=1,...,q ;   (3.5)

and

Eθz{Ψ( · ; θ0, z)ΨT( · ; θ0, z)} = [ ∫_X ψi(x; θ0, z)ψj(x; θ0, z)p(x; θ0, z)λ(dx) ]i,j=1,...,q   (3.6)

is positive definite.

It is presupposed that the parametric partial score function is a regular quasi-estimating

function.

A regular quasi-estimating function that does not depend on the nuisance parameter

z is said to be a regular estimating function.

3.2.2 A first characterization

In this section we give a characterization of the class of regular estimating functions.

Proposition 9 Let Ψ : X × Θ × Z −→ IRq be a regular quasi-estimating function with components ψ1, . . . , ψq. For all (θ, z) ∈ Θ × Z and i ∈ {1, . . . , q},

ψi( · ; θ, z) ∈ TN⊥(θ, z) .

Here and in the rest of this chapter TN(θ, z) = T2N(θ, z) and TN⊥(θ, z) is the orthogonal complement of the nuisance tangent space T2N(θ, z) in L20(Pθz).

Proof: Take (θ, z) ∈ Θ × Z and i ∈ {1, . . . , q} fixed and ν ∈ TN0(θ, z) arbitrary. We prove that ν and ψi( · ; θ, z) are orthogonal in L2(Pθz). This implies the proposition, because of the continuity of the inner product.

Let {pt}t∈V be a differentiable path at (θ, z) with tangent ν and remainder term {rt}t∈V. Using (2.1), for each t ∈ V,

⟨ν( · ), ψi( · ; θ, z)⟩θz
= (1/t) ∫_X ψi(x; θ, z)pt(x)dµ(x) − (1/t) ∫_X ψi(x; θ, z)p(x; θ, z)dµ(x) − ⟨rt( · ), ψi( · ; θ, z)⟩θz
= −⟨rt( · ), ψi( · ; θ, z)⟩θz .

Since rt −→ 0 in L2(Pθz), from the continuity of the inner product we conclude that

⟨ν( · ), ψi( · ; θ, z)⟩θz = 0 .   □

If the quasi-estimating function does not depend on the nuisance parameter (i.e. it corresponds to a genuine estimating function), then we can obtain a sharper result.

Proposition 10 Let Ψ : X × Θ −→ IRq be a regular estimating function with components ψ1, . . . , ψq. For all θ ∈ Θ and i ∈ {1, . . . , q},

ψi( · ; θ) ∈ ∩z∈Z TN⊥(θ, z) .

In fact, the proposition above holds for the whole class of quasi-estimating functions whose expectation is invariant with respect to the nuisance parameter.

Proof: Take θ ∈ Θ and i ∈ {1, . . . , q} fixed, and arbitrary z ∈ Z and ν ∈ TN0(θ, z). We prove that ν and ψi( · ; θ) are orthogonal in L2(Pθz).

Let {pt}t∈V be a differentiable path at (θ, z) with tangent ν and remainder term {rt}t∈V. Using (2.1), for each t ∈ V,

⟨ν( · ), ψi( · ; θ)⟩θz = −⟨rt( · ), ψi( · ; θ)⟩θz .

Since rt −→ 0 in L2(Pθz), from the continuity of the inner product we conclude that

⟨ν( · ), ψi( · ; θ)⟩θz = 0 .   □

3.2.3 A geometric characterization of regular estimating functions

We present in this section a variant of the geometric theory of estimating functions for semiparametric models given in Amari and Kawanabe (1996). The development presented is closely connected with the theory given in that paper; however, it is not exactly the same. We point out the most remarkable differences at the end of the section.

Take (θ, z) ∈ Θ × Z fixed. Given a ∈ L20(Pθz) and z∗ ∈ Z, denote p( · ; θ, z) and p( · ; θ, z∗) by p( · ) and p∗( · ) respectively, and define the m-parallel transport of a from z to z∗ by

Π(m)zz∗ a( · ) = {p( · )/p∗( · )} a( · ) .

If a possesses a finite expectation under Pθz∗ we define the e-parallel transport of a from z to z∗ by

Π(e)zz∗ a( · ) = a( · ) − ∫_X a(x)p(x; θ, z∗)λ(dx) .

The basic properties of the m- and e-parallel transport are given next.

Proposition 11 We have for each z, z∗ ∈ Z and each a, b ∈ L20(p) ∩ L20(p∗):

∫_X Π(m)zz∗ b(x)p∗(x)λ(dx) = ∫_X Π(e)zz∗ b(x)p∗(x)λ(dx) = 0 ;   (3.7)

⟨a, Π(m)zz∗ b⟩θz∗ = ⟨a, b⟩θz ;   (3.8)

⟨Π(e)zz∗ a, Π(m)zz∗ b⟩θz∗ = ⟨a, b⟩θz   (3.9)

and

Π(m)z∗z Π(m)zz∗ a( · ) = Π(e)z∗z Π(e)zz∗ a( · ) = Π(m)zz a( · ) = Π(e)zz a( · ) = a( · ) .   (3.10)   □
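The identities (3.8) and (3.9) can be checked directly on a finite sample space, where the integrals become sums; the following hypothetical stand-in (not from the thesis) does this on a three-point space, with b chosen to have mean zero under p as used in (3.9).

```python
# Hypothetical finite-sample-space check of the transport identities (3.8)
# and (3.9): on a three-point space the integrals become sums, with
# <u, v>_dens = sum_x u(x) v(x) dens(x).

p  = [0.2, 0.3, 0.5]           # density p( . ) = p( . ; theta, z)
ps = [0.4, 0.4, 0.2]           # density p*( . ) = p( . ; theta, z*)
a  = [1.0, -2.0, 0.5]
b  = [0.3, 1.1, -0.78]         # chosen so that E_p(b) = 0, as (3.9) uses

def inner(u, v, dens):
    return sum(ui * vi * di for ui, vi, di in zip(u, v, dens))

def m_transport(u):            # m-transport from z to z*: u * p / p*
    return [ui * pi / qi for ui, pi, qi in zip(u, p, ps)]

def e_transport(u):            # e-transport from z to z*: u - E_{p*}(u)
    mean = sum(ui * qi for ui, qi in zip(u, ps))
    return [ui - mean for ui in u]

# (3.8): <a, Pi(m) b>_{p*} = <a, b>_p
assert abs(inner(a, m_transport(b), ps) - inner(a, b, p)) < 1e-12
# (3.9): <Pi(e) a, Pi(m) b>_{p*} = <a, b>_p
assert abs(inner(e_transport(a), m_transport(b), ps) - inner(a, b, p)) < 1e-12
```

The m-transport reweights by the likelihood ratio p/p∗, exactly the "correction for the distribution" described below.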

The parallel transports defined above have their origin in differential geometric considerations for statistical parametric models (α-connections). We will not enter into details of the geometric theory for semiparametric models, but refer instead to Amari and Kawanabe (1996) for an informal discussion. The parallel transports permit us to change the inner product (see (3.8)), i.e. they permit us to move from one L2 space to another, keeping to a certain extent the structure given by the inner product of the first space. For instance, L2 orthogonality (i.e. noncorrelation) is preserved after m-parallel transporting. From the statistical viewpoint the e- and the m-parallel transport correspond to correcting for the mean and correcting for the distribution, respectively, when we move from one L2 space to another.

The following class of functions will be of interest in the theory of estimating functions,

FIA(θ, z) = { r ∈ TN⊥(θ, z) : ∀z∗ ∈ Z and ∀ν∗ ∈ TN(θ, z∗), ⟨Π(m)zz∗ ν∗, r⟩L2(Pθz) = 0 } .

When the e-parallel transport is well defined one can alternatively use the relation ⟨ν∗, Π(e)zz∗ r⟩L2(Pθz∗) = 0 instead of ⟨Π(m)zz∗ ν∗, r⟩L2(Pθz) = 0. Informally, FIA is the class of functions r in TN⊥(θ, z) such that r, corrected for the mean or corrected for the distribution, is orthogonal to each TN(θ, z∗) under Pθz∗ (for z∗ running in the whole of Z).

Proposition 12 For each (θ, z) ∈ Θ × Z, FIA(θ, z) is a closed subspace of L20(Pθz).

Proof: The linearity and the continuity of ⟨Π(m)zz∗ ν∗, ( · )⟩L2(Pθz) imply that FIA(θ, z) is a vector subspace and a closed set in L2(Pθz), respectively.   □

The following proposition characterizes the regular estimating functions in terms of the classes of functions FIA's.

Proposition 13 Let Ψ : X × Θ −→ IRq be a regular estimating function with components ψ1, . . . , ψq. Then we have, for i = 1, . . . , q, for all θ ∈ Θ and all z ∈ Z,

ψi( · , θ) ∈ FIA(θ, z) .

Proof: Take z∗ ∈ Z and ν∗ ∈ TN(θ, z∗) arbitrary. Since Ψ is a regular estimating function, we have from proposition 10 that ψi( · ; θ) ∈ TN⊥∗(θ, z∗), and then, using (3.8),

⟨Π(m)zz∗ ν∗, ψi( · ; θ)⟩L2(Pθz) = 0 .   □

The proposition above can easily be sharpened by making the components of the regular estimating functions belong to the intersection (over the nuisance parameter) of the FIA's. However, the following theorem shows that this is not necessary, since FIA does not in fact depend on the nuisance parameter. We will sometimes use the notation FIA(θ).

Theorem 7 For each (θ, z) ∈ Θ × Z,

FIA(θ, z) = ∩z∗∈Z TN⊥∗(θ, z∗) .

Here TN⊥∗(θ, z∗) denotes the orthogonal complement of TN(θ, z∗) in L20(Pθz∗).

Proof:

'⊆' Take η ∈ FIA(θ, z), z∗ ∈ Z arbitrary and ν∗ ∈ TN(θ, z∗). Applying (3.8) yields

⟨ν∗, η⟩L2(Pθz∗) = ⟨Π(m)zz∗ ν∗, η⟩L2(Pθz) = 0 .

Hence η ∈ TN⊥∗(θ, z∗). Since z∗ was chosen arbitrarily in Z, η ∈ ∩z∗∈Z TN⊥∗(θ, z∗).

'⊇' Take η ∈ ∩z′∈Z TN⊥∗(θ, z′), an arbitrary z∗ ∈ Z and ν∗ ∈ TN(θ, z∗). Using (3.8) we obtain

⟨Π(m)zz∗ ν∗, η⟩L2(Pθz) = ⟨ν∗, η⟩L2(Pθz∗) = 0 .   □

The characterization obtained here is equivalent to the one obtained in the last section. We remark that the characterization based on the intersection of the nuisance tangent spaces can be found in Jørgensen and Labouriau (1995), and the characterization based on parallel transports (i.e. based on FIA) is a variant of the results of Amari and Kawanabe (1996). The main difference between the variant presented here and the original formulation in Amari and Kawanabe (1996) is that here FIA is defined via the m-parallel transport whereas there it is constructed through the e-parallel transport. Both formulations are equivalent from this point of view, provided the e-parallel transport is well defined. Moreover, when defining FIA via the m-parallel transport the class is automatically a closed subspace of L20.

3.3 Optimality theory for estimating functions

3.3.1 Classic optimality theory

Given a regular quasi-estimating function Ψ we define the Godambe information of Ψ by JΨ : Θ × Z −→ IRq×q, where for each θ ∈ Θ and each z ∈ Z,

JΨ(θ, z) = Eθz{∇θΨ( · ; θ, z)} Eθz{Ψ( · ; θ, z)ΨT( · ; θ, z)}−1 Eθz{∇θΨ( · ; θ, z)}T = SΨ(θ, z)VΨ−1(θ, z)SΨT(θ, z) .

Here

SΨ(θ, z) := Eθz{∇θΨ( · ; θ, z)} and VΨ(θ, z) := Eθz{Ψ( · ; θ, z)ΨT( · ; θ, z)}

are called the sensitivity and the variability of Ψ (at (θ, z)), respectively.

Using standard arguments based on a Taylor expansion of Ψ it can be shown that, under some additional regularity conditions (each ψi twice continuously differentiable with respect to each component of θ, for instance), a sequence {θ̂n} of roots of a regular estimating function is asymptotically normally distributed with asymptotic variance given by JΨ−1(θ, z), provided {θ̂n} is weakly consistent (see Jørgensen and Labouriau, 1995, for conditions for consistency and asymptotic normality). Hence, we say that a regular estimating function Ψ is optimal when for all θ ∈ Θ, for all z ∈ Z and for each regular estimating function Φ,

JΦ(θ, z) ≤ JΨ(θ, z) .

Here "≤" is understood in the sense of the Löwner partial order of matrices, given by the positive semi-definiteness of the difference.
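As a small hypothetical aside (not from the thesis), the Löwner comparison JΦ ≤ JΨ can be checked by testing that the difference JΨ − JΦ is positive semi-definite; for symmetric 2 × 2 matrices this reduces to a trace and determinant condition.

```python
# Hypothetical helper: for symmetric 2x2 matrices, D = B - A is positive
# semi-definite iff trace(D) >= 0 and det(D) >= 0, which gives an
# elementary test of the Loewner order A <= B.

def loewner_leq(A, B, eps=1e-12):
    """A <= B in the Loewner order, for symmetric 2x2 matrices given as
    ((a11, a12), (a12, a22))."""
    d11 = B[0][0] - A[0][0]
    d12 = B[0][1] - A[0][1]
    d22 = B[1][1] - A[1][1]
    trace = d11 + d22
    det = d11 * d22 - d12 * d12
    return trace >= -eps and det >= -eps

A = ((1.0, 0.2), (0.2, 1.0))
B = ((2.0, 0.2), (0.2, 3.0))
assert loewner_leq(A, B)      # B - A = diag(1, 2) is PSD
assert not loewner_leq(B, A)  # the order is only partial, not symmetric
```

For larger matrices one would instead check that all eigenvalues of the difference are nonnegative.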

In the literature on estimating functions it is customary to say that it is possible to justify the use of some estimators by finite sample arguments via estimating functions and the Godambe information (see the articles of Godambe referred to). The argument used there is that the Godambe information is a quantity that should be maximized when

using estimating functions. We do not share this point of view. The estimating functions

themselves are not the object of our direct interest. Our concern with estimating functions

is only through the estimators (or inferential procedures) associated with them. Hence

one should judge estimating functions only through the properties of such inferential

procedures. In fact, apart from the asymptotic variance, there are no clear connections

between the Godambe information and the (asymptotic or finite sample) properties of the

estimators associated with regular estimating functions.

We say that two regular quasi-estimating functions, Ψ, Φ : X × Θ × Z −→ IRk, are equivalent if for each θ ∈ Θ and z ∈ Z there exists a k × k matrix K(θ, z) with full rank such that

Ψ(x; θ, z) = K(θ, z)Φ(x; θ, z) , Pθz-a.s.


We stress that K(θ, z) must not depend on the observation x. Clearly, two equivalent

estimating functions have the same roots almost surely and hence produce essentially the

same estimators, i.e. they are equivalent from the statistical point of view. Moreover,

it is easy to see that two equivalent quasi-estimating functions share the same Godambe

information for each value of the parameters.
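To make the sensitivity, variability and Godambe information concrete, here is a hypothetical one-dimensional sample-based sketch (our illustration, not the thesis's): the three quantities are estimated by empirical averages evaluated at the root of the estimating equation.

```python
# Hypothetical empirical versions of the sensitivity S, the variability V
# and the Godambe information J = S V^{-1} S^T, in dimension q = 1.

def godambe_information(psi, dpsi_dtheta, xs, theta):
    n = len(xs)
    S = sum(dpsi_dtheta(x, theta) for x in xs) / n   # empirical sensitivity
    V = sum(psi(x, theta) ** 2 for x in xs) / n      # empirical variability
    return S * S / V                                  # J = S V^{-1} S^T

xs = [0.5, 1.5, 2.5, 3.5]
theta_hat = sum(xs) / len(xs)        # root of sum(x - theta) over the sample
J = godambe_information(lambda x, t: x - t, lambda x, t: -1.0, xs, theta_hat)
assert abs(J - 1.0 / 1.25) < 1e-12   # reciprocal of the empirical variance
```

The inverse J⁻¹ is then the familiar sandwich estimate of the asymptotic variance of the associated estimator.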

3.3.2 A bound for the concentration of estimators obtained through estimating functions

We define the information score function, lI : X × Θ × Z −→ IRq, by the orthogonal projection of the partial score function l onto FIA(θ). More precisely, for each θ ∈ Θ and z ∈ Z, the ith component of the information score function (i = 1, . . . , q) at (θ, z) is given by

liI( · ; θ, z) = Π{li( · ; θ, z) | FIA(θ)} .

The space spanned by the components l1I , . . . , lqI of the information score function at

(θ, z) ∈ Θ × Z is denoted by E(θ, z), i.e.

E(θ, z) = span{liI ( · ; θ, z) : i = 1, . . . , q} .

Note that E(θ, z) is a closed subspace of L20(Pθz) (since it is a finite-dimensional vector space). Hence given any regular estimating function Ψ : X × Θ −→ IRq with components ψ1, . . . , ψq we have, for all θ ∈ Θ, z ∈ Z and i ∈ {1, . . . , q}, the orthogonal decomposition

ψi( · ; θ, z) = ψiA( · ; θ, z) + ψiI( · ; θ, z) ,   (3.11)

where ψiI ( · ; θ, z) ∈ E(θ, z) and ψiA ( · ; θ, z) ∈ A(θ, z) := E ⊥ (θ, z). Here A(θ, z) is the

orthogonal complement of E(θ, z) in L20 (Pθz ). The decomposition above induces the fol-

lowing decomposition of each regular quasi-estimating function

Ψ( · ; θ, z) = ΨA( · ; θ, z) + ΨI( · ; θ, z) ,   (3.12)

where the components ψ1A( · ; θ, z), . . . , ψqA( · ; θ, z) of ΨA at (θ, z) are in A(θ, z) and the components ψ1I( · ; θ, z), . . . , ψqI( · ; θ, z) of ΨI at (θ, z) are in E(θ, z).

We show next that taking the “component” ΨI of a regular (quasi-) estimating function improves the Godambe information. However, at this stage a technical difficulty appears: the function ΨI is not necessarily a regular quasi-estimating function, and hence does not necessarily possess a well-defined Godambe information. For this reason we introduce next an extension of the notion of sensitivity, and consequently of Godambe information, which will enable us to speak of the Godambe information of some non-regular (quasi-) estimating functions Ψ : X × Θ × Z −→ IRq. We characterize the sensitivity of Ψ in an alternative form that will suggest the extension one should define. For each (θ, z) ∈ Θ × Z

and each i, j ∈ {1, . . . , q} we have

0 = (∂/∂θi) ∫_X ψj(x; θ, z)p(x; θ, z)λ(dx)   (3.13)

(differentiating under the integral sign)

= ∫_X (∂/∂θi){ψj(x; θ, z)p(x; θ, z)} λ(dx)

= ∫_X (∂/∂θi){ψj(x; θ, z)} p(x; θ, z)λ(dx) + ∫_X ψj(x; θ, z)(∂/∂θi){p(x; θ, z)} λ(dx) .

Hence

∫_X (∂/∂θi){ψj(x; θ, z)} p(x; θ, z)λ(dx)
= − ∫_X ψj(x; θ, z)(∂/∂θi){p(x; θ, z)} λ(dx)
= − ∫_X ψj(x; θ, z)li(x; θ, z)p(x; θ, z)λ(dx) = −⟨ψj( · ; θ, z), li( · ; θ, z)⟩θz

(decomposing li = liA + liI with liI ∈ FIA and liA ∈ TN)

= −⟨ψj( · ; θ, z), liI( · ; θ, z)⟩θz − ⟨ψj( · ; θ, z), liA( · ; θ, z)⟩θz

(since ψj ∈ FIA and liA is orthogonal to FIA)

= −⟨ψj( · ; θ, z), liI( · ; θ, z)⟩θz

(decomposing ψj = ψjA + ψjI and using the orthogonality of liI and ψjA)

= −⟨ψjI( · ; θ, z), liI( · ; θ, z)⟩θz .

That is,

SΨ(θ, z) = [ −⟨ψjI( · ; θ, z), liI( · ; θ, z)⟩θz ]j=1,...,k; i=1,...,k .   (3.14)

Here [aij]j=1,...,k; i=1,...,k denotes the matrix formed by the aij's, with i indexing the columns and j indexing the rows.

We define the extended sensitivity (or simply the sensitivity) of Ψ by the matrix on the right-hand side of (3.14). The (extended) Godambe information is defined in the same way as before but using the extended sensitivity instead of the sensitivity. Note

that both the standard and the extended versions of the sensitivity (and of the Godambe information) coincide in the case where Ψ is regular. Moreover, the extended sensitivity is defined for each quasi-estimating function whose components are in L20, not only for regular estimating functions. According to the new definition both Ψ and ΨI possess the same sensitivity.

Indeed, since SΨ = SΨI and, by the orthogonality of the decomposition (3.12), VΨ = VΨI + VΨA,

JΨ−1(θ, z) = SΨ−1(θ, z)VΨ(θ, z)SΨ−T(θ, z)
= SΨI−1(θ, z){VΨI(θ, z) + VΨA(θ, z)}SΨI−T(θ, z)
= SΨI−1(θ, z)VΨI(θ, z)SΨI−T(θ, z) + SΨI−1(θ, z)VΨA(θ, z)SΨI−T(θ, z)
≥ SΨI−1(θ, z)VΨI(θ, z)SΨI−T(θ, z) = JΨI−1(θ, z) .   □

The following proposition gives further properties of regular inference functions, which will allow us to establish an upper bound for the Godambe information. For a regular estimating function Ψ we have:

(i) ΨI ∼ lI ;

(ii) span{ψiI( · ; θ, z) : i = 1, . . . , q} = E(θ, z).

Proof:

(i) Assume without loss of generality that the components l1I( · ; θ, z), . . . , lqI( · ; θ, z) of the information score function are orthonormal in L20(Pθz). For each i ∈ {1, . . . , q}, expanding ψi( · ; θ, z) in a Fourier series with respect to a basis whose first q elements are l1I( · ; θ, z), . . . , lqI( · ; θ, z), one obtains

ψi( · ; θ, z) = ⟨l1I( · ; θ, z), ψi( · ; θ, z)⟩θz l1I( · ; θ, z) + · · · + ⟨lqI( · ; θ, z), ψi( · ; θ, z)⟩θz lqI( · ; θ, z) + ψiA( · ; θ, z) .

That is,

ψiI( · ; θ, z) = ⟨l1I( · ; θ, z), ψi( · ; θ, z)⟩θz l1I( · ; θ, z) + · · · + ⟨lqI( · ; θ, z), ψi( · ; θ, z)⟩θz lqI( · ; θ, z) .   (3.15)

Moreover, for j = 1, . . . , q,

⟨ljI( · ; θ, z), ψi( · ; θ, z)⟩θz = − ∫_X {(∂/∂θj)ψi(x; θ, z)} p(x; θ, z)λ(dx) .   (3.16)

We conclude from (3.15) and (3.16) that ΨI( · ; θ, z) = −SΨ(θ, z)lI( · ; θ, z), which means that ΨI and lI are equivalent.

(ii) From the previous discussion span{ψiI( · ; θ, z) : i = 1, . . . , q} is the space spanned by the components of −SΨ(θ, z)lI( · ; θ, z), which is the span of {liI( · ; θ, z) : i = 1, . . . , q}, since the sensitivity by assumption is of full rank.

(iii) Straightforward.   □

A consequence of the last two propositions is that JlI is an upper bound for the Godambe information of regular quasi-inference functions. This upper bound is attained by any (if any exists) extended regular inference function with components in E. In particular, if lI is a regular (quasi-) inference function, then it is an optimal (quasi-) inference function.

3.3.3 Attainability of the semiparametric Cramér-Rao bound

We study in this section the attainability of the semiparametric Cramér-Rao bound through regular estimating functions. More precisely, we give a necessary and sufficient condition for the coincidence of the semiparametric Cramér-Rao bound and the bound given in the previous section for the asymptotic variance of estimators derived from regular estimating functions.

Let us consider the interest parameter functional Φ : P∗ −→ Θ given by, for each (θ, z) ∈ Θ × Z,

Φ(p( · ; θ, z)) = θ .

The functional Φ is taken to be differentiable at each p( · ; θ, z) ∈ P∗, with respect to the tangent cone T(p) = TN(θ, z) ∪ span{li( · ; θ, z) : i = 1, . . . , q}.

Here we adopt the L2 path differentiability, since this is the path differentiability used

to characterize the class of regular estimating functions. We stress that the theory of

functional differentiability used here is compatible with any notion of path differentiability


stronger than (or equal to) the weak path differentiability, in particular the L2 path

differentiability is allowed. Moreover, in the examples we have in mind (L2 - restricted

models) the notions of weak and L2 path differentiability coincide. Take a fixed p( · ; θ, z) =

p( · ) in P ∗ . Consider the function

Φ•p( · ) = Covp(lI)−1 lI( · ; θ, z) ,

where Covp(lI) is the covariance matrix of lI( · ; θ, z) under p.

The following lemma will allow us to connect the optimality theory for estimating functions with the semiparametric Cramér-Rao lower bound. It states that Φ•p is a gradient of Φ at p with respect to the tangent cone T(p); that is,

∀ν ∈ TN0(θ, z), ∫_X φ•p(x)ν(x)p(x)λ(dx) = 0   (3.17)

and

∫_X φ•p(x)lT(x; θ, z)p(x)λ(dx) = Iq ,   (3.18)

where Iq denotes the q × q identity matrix.

Proof:

Take ν ∈ TN0(θ, z). Since each component of lI( · ; θ, z) is in FIA(θ, z) ⊆ TN⊥(θ, z), condition (3.17) holds. On the other hand,

∫_X φ•p(x)lT(x; θ, z)p(x)λ(dx) = Covp(lI)−1 ∫_X lI(x; θ, z)lT(x; θ, z)p(x)λ(dx)
= Covp(lI)−1 ∫_X lI(x; θ, z)lI(x; θ, z)T p(x)λ(dx)
= Iq ,

that is, condition (3.18) holds. We conclude that Φ•p is a gradient of Φ at p with respect to the tangent cone T(p).   □

According to the lemma above Φ•p is a gradient of Φ at p, but not necessarily the canonical gradient. In fact the canonical gradient of the functional Φ at p with respect to T(p) is

Φ⋆p( · ) = J−1(θ, z)lE( · ; θ, z) ,

where lE( · ; θ, z) is the efficient score function at (θ, z) and J(θ, z) is the covariance matrix of lE( · ; θ, z) under Pθz (see chapter 2). The uniqueness of the canonical gradient implies that Φ•p is the canonical gradient if and only if it is equal to Φ⋆p, and this occurs if and only if TN⊥(θ, z) = FIA(θ). The covariance of Φ⋆p( · ) (under Pθz), that is J−1(θ, z), gives the semiparametric Cramér-Rao lower bound. On the other hand, the lower bound for the asymptotic covariance of estimators obtained from regular estimating functions is

for the asymptotic covariance of estimators obtained from regular estimating functions is

the covariance (under Pθz ) of Φ•p . We conclude that the following result holds.

Theorem 8 The semiparametric Cramér-Rao lower bound coincides with the bound for the asymptotic covariance of estimators defined through regular estimating functions at (θ, z) ∈ Θ × Z if and only if

TN⊥(θ, z) = FIA(θ) .

The theorem above implies that estimating functions produce efficient estimators only if the orthogonal complement of the nuisance tangent space does not depend on the nuisance parameter. Examples of that situation are the L2-restricted and the partially parametric models considered in chapters 5 and 6.

3.4 Further aspects

3.4.1 Optimal estimating functions via conditioning

We present in this section some results which allow us to compute optimal estimating

functions in many practical situations. The results will be in accordance with the so

called conditioning principle. For the sake of simplicity we study here only models with

a one-dimensional parameter of interest.

We study the situation where we have a likelihood factorization of the following form.

Suppose that there exists a statistic T = t(X) such that, for all θ ∈ Θ, all z ∈ Z and all x ∈ X,

p(x; θ, z) = ft(x; θ) h{t(x); θ, z} .   (3.19)

Theorem 9 Assume that there exists a statistic T such that one has the decomposition (3.19). Moreover, suppose that the class {Ptθz : z ∈ Z}, where Ptθz is the distribution of T = t(X) under Pθz (i.e. when X ∼ Pθz), is complete. Then the regular inference function given by

Ψ(x; θ) = (∂/∂θ) log ft(x; θ) , ∀x ∈ X , ∀θ ∈ Θ   (3.20)

is optimal. Moreover, if Φ is also an optimal inference function then Φ is equivalent to Ψ.

The theorem above gives an alternative justification for the use of conditional inference.
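A classical worked example in the spirit of Theorem 9 (our own hypothetical illustration, not taken from the thesis): for independent pairs y1 ∼ Poisson(z) and y2 ∼ Poisson(θz), the likelihood factorizes as in (3.19) through t = y1 + y2, since y2 given t is Binomial(t, θ/(1 + θ)) while t is Poisson(z(1 + θ)), and the Poisson family in t is complete. The conditional score is Ψ((y1, y2); θ) = y2/θ − t/(1 + θ), and the resulting estimating equation has the closed-form root θ̂ = Σy2/Σy1.

```python
# Hypothetical example in the spirit of Theorem 9: conditional score for
# Poisson pairs y1 ~ Poisson(z), y2 ~ Poisson(theta * z), nuisance z.

def conditional_score(pair, theta):
    y1, y2 = pair
    return y2 / theta - (y1 + y2) / (1.0 + theta)

def solve(pairs, lo=1e-6, hi=1e6):
    # The summed conditional score is positive below its root and negative
    # above it, so plain bisection applies.
    def g(theta):
        return sum(conditional_score(p, theta) for p in pairs)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

pairs = [(3, 6), (1, 2), (4, 9), (2, 3)]
theta_hat = solve(pairs)
# Closed form: sum(y2) / sum(y1) = 20 / 10 = 2.
assert abs(theta_hat - 2.0) < 1e-6
```

Note that the nuisance parameter z never enters the estimating function, exactly as the theorem requires.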

The following technical (and trivial) lemma will be the kernel of the proofs that follow. But first it is convenient to introduce some notation. Given a regular inference function Ψ : X × Θ −→ IRk, we define

Ψ̃(x; θ) = Ψ(x; θ) / Eθz{Ψ′(θ)} ,

which is called the standardized version of Ψ. Here Ψ′(θ) = ∇θΨ(θ). Along this section we denote the class of all regular estimating functions by G.

Lemma 2 Let Ψ and Φ be regular inference functions. For each θ ∈ Θ and z ∈ Z, the following assertions hold:

(i)

Eθz{Ψ(θ)l(θ; z)} / Eθz{Ψ′(θ)} = −1 ,

where l(θ; z) is the partial score function at (θ, z);

(ii)

Eθz{Φ̃2(θ)} = Eθz[{Φ̃(θ) − Ψ̃(θ)}2] + 2Eθz{Φ̃(θ)Ψ̃(θ)} − Eθz{Ψ̃2(θ)} .

Proof: Since Ψ is a regular inference function, for each θ ∈ Θ and z ∈ Z,

∫_X Ψ(x; θ)p(x; θ, z)dµ(x) = 0 .

Differentiating the expectation above with respect to θ and interchanging the order of differentiation and integration, we obtain

Eθz{(∂/∂θ)Ψ(θ)} + Eθz{Ψ(θ)l(θ; z)} = 0 ,

which is equivalent to the first part of the lemma. The second part is straightforward.   □

The following lemma gives a useful tool for computing optimal inference functions.

Lemma 3 Assume the previous regularity conditions. Consider two functions A : Θ −→ IR\{0} and R : X × Θ × Z −→ IR. Suppose that, for each regular inference function Φ, one has, for each θ ∈ Θ and z ∈ Z,

∫_X R(x; θ, z)Φ(x; θ)p(x; θ, z)dµ(x) = 0 .

If a regular inference function Ψ can be written in the form

Ψ(x; θ) = A(θ)l(x; θ, z) + R(x; θ, z) ,   (3.21)

for x Pθz-almost surely (Ψ does not depend on z even though l and R do), then Ψ is optimal. Furthermore, a regular inference function Φ is optimal if and only if for all (θ, z) ∈ Θ × Z,

Φ̃(θ) = Ψ̃(θ) , for x Pθz-almost surely,

provided that there exists a decomposition as in (3.21) above.

Proof: Using the decomposition (3.21), the orthogonality of R and Φ, and part (i) of lemma 2,

Eθz{Φ̃(θ)Ψ̃(θ)} = Eθz[ {Φ(θ)A(θ)l(θ, z) + Φ(θ)R( · ; θ, z)} / (Eθz{Φ′(θ)}Eθz{Ψ′(θ)}) ]   (3.22)
= {A(θ)/Eθz{Ψ′(θ)}} {Eθz{Φ(θ)l(θ, z)}/Eθz{Φ′(θ)}}
= −A(θ)/Eθz{Ψ′(θ)} .


Taking Φ = Ψ in the display above shows that Eθz{Ψ̃2(θ)} = −A(θ)/Eθz{Ψ′(θ)} as well, so Eθz{Φ̃(θ)Ψ̃(θ)} = Eθz{Ψ̃2(θ)}. Hence, by lemma 2(ii),

Eθz{Φ̃2(θ)} = Eθz[{Φ̃(θ) − Ψ̃(θ)}2] + 2Eθz{Φ̃(θ)Ψ̃(θ)} − Eθz{Ψ̃2(θ)}   (3.23)
= Eθz[{Φ̃(θ) − Ψ̃(θ)}2] + Eθz{Ψ̃2(θ)}
≥ Eθz{Ψ̃2(θ)} ,

and therefore

JΦ(θ, z) = 1/Eθz{Φ̃2(θ)} ≤ 1/Eθz{Ψ̃2(θ)} = JΨ(θ, z) .   (3.24)

We conclude that Ψ is optimal. For the second part of the lemma, note that one has equality in (3.23), and hence in (3.24), if and only if ∀θ ∈ Θ, ∀z ∈ Z, Eθz[{Φ̃(θ) − Ψ̃(θ)}2] = 0. That is, if a regular inference function Φ is optimal then Φ̃( · ; θ) = Ψ̃( · ; θ) Pθz-a.s., ∀θ ∈ Θ, ∀z ∈ Z.   □

Proof of Theorem 9: Note that

l(x; θ, z) = (∂/∂θ) log p(x; θ, z) = (∂/∂θ) log ft(x; θ) + (∂/∂θ) log h{t(x); θ, z} .   (3.25)

We apply lemma 3 to prove that Ψ is a (“unique”) optimal inference function. More precisely, defining A(θ) = 1 and R(x; θ, z) = −(∂/∂θ) log h{t(x); θ, z}, and using (3.25), we can write Ψ in the form

Ψ(x; θ) = (∂/∂θ) log ft(x; θ) = A(θ)l(x; θ, z) + R(x; θ, z) .

According to lemma 3, if R is orthogonal to every regular inference function, then Ψ is optimal; moreover, Ψ is then the unique optimal inference function, apart from equivalent inference functions.

Take an arbitrary regular inference function φ. We show that φ and R are orthogonal. Note that for each z ∈ Z,

0 = ∫ φ(x; θ)p(x; θ, z)dµ(x) = ∫ φ(x; θ)ft(x; θ)h{t(x); θ, z}dµ(x) .

On the other hand Eθz(φ|T) = ∫ φ(x; θ)ft(x; θ)dµ(x), which is independent of z. We write Eθ(φ|T) for Eθz(φ|T), and we have Eθz{Eθ(φ|T)} = 0. Since T is complete, Eθ(φ|T) = 0, Pθz-almost surely. We have then, since R depends on x only through t(x),

Eθz{φ( · ; θ)R( · ; θ, z)} = Eθz[ R( · ; θ, z)Eθ{φ( · ; θ) | T} ] = 0 .   □

3.4.2 Generalized estimating functions

A quasi-estimating function is said to be a generalized estimating function if it is equivalent to an estimating function. More precisely, a quasi-estimating function Ψ : X × Θ × Z −→ IRq is a generalized estimating function if for each (θ, z) ∈ Θ × Z there exists a non-singular q × q matrix A(θ, z) and a measurable function Φ( · ; θ) : X −→ IRq such that

Ψ( · ; θ, z) = A(θ, z)Φ( · ; θ) , Pθz-almost surely.

When, in addition, Ψ satisfies the conditions (3.2)-(3.6) we speak of regular generalized estimating functions.

Generalized estimating functions are used for estimating the interest parameter in the following way. Given a sample x = (x1, . . . , xn)T of size n from an unknown probability measure of the model, define the estimator θ̂n implicitly by the solution of the equation

0 = Σ^n_{i=1} Ψ(xi; θ̂n; z) = A(θ̂n, z) Σ^n_{i=1} Φ(xi; θ̂n) ,

which is equivalent to

Σ^n_{i=1} Φ(xi; θ̂n) = 0 .

In other words, for each generalized estimating function there is an estimating function that yields the same estimating sequence. In fact generalized estimating functions are just a tool that will simplify some formalizations. Examples of generalized estimating functions are the efficient score functions of most of the models studied in the subsequent chapters. The following result will be useful later on.
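The root-preservation argument above is elementary, but a quick numerical check may help; the snippet below (hypothetical, with q = 1) verifies that multiplying an estimating function by a non-singular factor A(θ, z) leaves the roots of the estimating equation unchanged for every value of the nuisance parameter.

```python
# Hypothetical check (q = 1) that a generalized estimating function
# Psi(x; theta, z) = A(theta, z) * Phi(x; theta), with A non-zero, has the
# same roots as Phi for every value of the nuisance parameter z.

def phi(x, theta):
    return x - theta ** 3            # an arbitrary toy estimating function

def psi(x, theta, z):
    A = 2.0 + z ** 2                 # non-singular 1x1 "matrix" A(theta, z)
    return A * phi(x, theta)

xs = [7.0, 9.0, 8.0]
theta_root = 2.0                     # sum(xs) = 24 = 3 * theta_root**3
assert abs(sum(phi(x, theta_root) for x in xs)) < 1e-12
for z in (0.0, 1.5, -3.0):
    assert abs(sum(psi(x, theta_root, z) for x in xs)) < 1e-12
```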


Proposition 17 If for each (θ, z) ∈ Θ × Z, FIA(θ) = TN⊥(θ, z), and the efficient score function is a generalized estimating function, then the estimating function equivalent to the efficient score function attains the semiparametric Cramér-Rao bound.

Proof: Take an arbitrary (θ, z) ∈ Θ × Z. Since FIA(θ) = TN⊥(θ, z), the efficient score function lE coincides with the information score function lI at (θ, z). Hence the extended sensitivity of the efficient score function at (θ, z) is −Covθ,z(lE) and the Godambe information of lE at (θ, z) is

JlE(θ, z) = Covθ,z(lE) ,

whose inverse is the semiparametric Cramér-Rao bound. If Φ is a regular estimating function equivalent to the efficient score function then its Godambe information is equal to the Godambe information of the efficient score function, that is, Φ attains the semiparametric Cramér-Rao bound at (θ, z). The proof follows now from the fact that (θ, z) was chosen arbitrarily.   □

Part II

Semiparametric Models

Chapter 4

Semiparametric Location and Scale Models

4.1 Introduction

This chapter studies some semiparametric extensions of the location-scale model. The

classic location-scale model is constructed by taking a (fixed) distribution and applying

to it a shift (location) and rescaling (scale) transformation. Now consider the situation

where instead of having a fixed distribution one deals with a given class of distributions. A

shift-rescaling transformation is applied to an unknown element of this class. Our interest

is in estimating the shift and the rescaling, but now in the presence of the indetermination

given by the unknown particular element of the class of distribution that generated the

data, i.e. the location (shift) and the scale (rescaling) are the parameters of interest and

the unknown distribution is the nuisance parameter.

We will consider location-scale models defined for distributions contained in exponen-

tial families and with the support equal to the whole real line. Here we do not know in

which exponential family the supposed data distribution is contained. Additionally, we

consider some models for which some standardized cumulants are fixed and known. This

will allow us to obtain a range of semiparametric models of various sizes.

The main purpose here is to study the behavior of the efficient score function (i.e. the canonical gradient of the interest parameter functional) for estimating the location and the scale. This will allow us to gain intuition from a simple example before treating more complicated cases. We will not be concerned at this stage with generality, and stringent restrictions on the family of distributions will be freely imposed in order to keep the mathematical argumentation transparent. The location and scale model will be studied as an example in chapter 5 under less restrictive conditions.

Section 4.2 presents and discusses the semiparametric location-scale model with which


we work. The nuisance tangent spaces are calculated in section 4.3 and some details

supplied in the appendices. Section 4.4 studies the efficient score function and in the

subsections 4.4.1 and 4.4.2 we specialize to the case where the first two and the first three

standardized cumulants are fixed, respectively. Some discussion is provided in section

4.5. There is an appendix with a brief summary of the theory of the Laplace transform,

adapted to the context we need, where we prove a sufficient condition, in terms of the Laplace transform, for the class of polynomials to be dense in the L² space associated with a certain probability measure.

4.2 Semiparametric location-scale models

In this section we define the semiparametric location-scale model used in the rest of the chapter.

Let λ be the Lebesgue measure and P a family of probability measures defined on (IR, B(IR)), dominated by λ and given by
\[
\mathcal{P} = \left\{ P_{\mu\sigma a} : \frac{dP_{\mu\sigma a}}{d\lambda}(\,\cdot\,) = \frac{1}{\sigma}\, a\!\left(\frac{\cdot - \mu}{\sigma}\right),\ \mu \in \mathbb{R},\ \sigma \in \mathbb{R}_+,\ a \in A \right\} . \tag{4.1}
\]

Here A is the class of functions a : IR −→ IR such that (4.2)-(4.10) given below hold.

\[ \forall x \in \mathbb{R},\quad a(x) > 0 ; \tag{4.2} \]
\[ \int_{\mathbb{R}} a(x)\,\lambda(dx) = 1 ; \tag{4.3} \]
\[ \int_{\mathbb{R}} x\, a(x)\,\lambda(dx) = 0 ; \tag{4.4} \]
\[ \int_{\mathbb{R}} x^2\, a(x)\,\lambda(dx) = 1 \tag{4.5} \]
and
\[ a \text{ is differentiable } \lambda\text{-almost everywhere.} \tag{4.6} \]

We consider also the following technical conditions involving the Laplace transform and the behavior of the function a in the tails. Assume that there exists a δ > 0 (we stress that δ may depend on a) such that for all s ∈ (−δ, δ)
\[ M[s,\, a(\,\cdot\,)] = \int_{\mathbb{R}} e^{sx}\, a(x)\,\lambda(dx) < \infty ; \tag{4.7} \]
\[ M\!\left[s,\, \frac{1}{a(\,\cdot\,)}\right] = \int_{\mathbb{R}} \frac{e^{sx}}{a(x)}\,\lambda(dx) < \infty . \tag{4.8} \]
Assume further that
\[ \forall i \in \mathbb{N}_0, \quad \lim_{x\to+\infty} x^i a(x) = \lim_{x\to-\infty} x^i a(x) = 0 . \tag{4.9} \]


Finally, assume that
\[ \int_{\mathbb{R}} x^i a(x)\,\lambda(dx) = m_i \quad \text{for } i = 3, \dots, k , \tag{4.10} \]
where k is an integer greater than 1 and m_3, ..., m_k are real quantities supposed given and fixed (if k = 2 we adopt the convention that condition (4.10) is vacuous). Note that m_1 = 0, m_2 = 1, m_3, ..., m_k are in fact the first k standardized cumulants of the distributions of P, which are hence assumed to be fixed. Here the term standardized cumulants refers to the moments (about zero) of the standardized distribution (i.e. the distribution shifted and rescaled in order to have mean zero and variance one).
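As a quick numerical sanity check of the normalization (4.3)-(4.5) and of the reading of the m_i as moments of the standardized distribution, the sketch below (plain Python with trapezoidal quadrature; the standardized Gumbel density and the integration grid are illustrative choices, not part of the model) verifies that a standardized Gumbel density integrates to one, has mean zero and variance one, and has skewness m_3 ≈ 1.14:

```python
import math

# Illustrative density: the standardized Gumbel.  If X has density
# g(z) = exp(-z - exp(-z)) then E X = EULER and Var X = pi^2/6, so
# a(x) = SD * g(SD * x + EULER) satisfies (4.3)-(4.5).
EULER = 0.5772156649015329        # Euler-Mascheroni constant
SD = math.pi / math.sqrt(6.0)     # standard deviation of the Gumbel law

def a(x):
    z = SD * x + EULER
    return SD * math.exp(-z - math.exp(-z))

def moments(imax, lo=-10.0, hi=30.0, n=200000):
    # trapezoidal quadrature of int x^i a(x) dx for i = 0..imax
    h = (hi - lo) / n
    out = [0.5 * (lo ** i * a(lo) + hi ** i * a(hi)) for i in range(imax + 1)]
    for k in range(1, n):
        x = lo + k * h
        ax = a(x)
        for i in range(imax + 1):
            out[i] += x ** i * ax
    return [h * s for s in out]

m0, m1, m2, m3 = moments(3)   # expect 1, 0, 1, and skewness ~ 1.1395
```

The value 1.1395 is the known skewness of the Gumbel family, 12√6 ζ(3)/π³, which standardization leaves unchanged.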

We will treat a ∈ A as the non-parametric component of a semiparametric model and

(µ, σ) as the parameters of interest. Conditions (4.2) and (4.3) imply that each a ∈ A

is a density of a probability measure with support equal to the whole real line. The

identifiability of the parametrization given is a consequence of (4.4) and (4.5). Clearly,

this is not the only possible way to obtain identifiability, but it turns out to be convenient

for our purposes.

We stress that conditions (4.2)-(4.9) are essential for the development given. On the

other hand, condition (4.10) was imposed only to enable us to control the size of the class

of models to be considered. From the statistical viewpoint condition (4.10) can be used to

study the impact of knowing higher order standardized cumulants of the distribution in

play. A potential field of application for these techniques is in the study of turbulent flow

of fluids where, due to the Kolmogorov theory, one can predict the values of the cumulants

of the distribution involved (see Barndorff-Nielsen, 1978, and Barndorff-Nielsen et al., 1990). From the theoretical point of view we use condition (4.10) to impose constraints on A,

yielding semiparametric models of differing types. In fact, the model obtained without

condition (4.10) is, according to the classification of Wellner et al. (1994), a nonparametric

model, in the sense that the tangent spaces are the whole L_0^2 space. On the other hand, by imposing condition (4.10) (with some integer k greater than 2) one obtains a genuine semiparametric model, in the sense that the tangent spaces are infinite dimensional proper subsets of the L_0^2 spaces. We will then study the effect of this qualitative change on the

efficient score function. This will illustrate how the estimation problem becomes harder

when one jumps from a nonparametric model to a genuine semiparametric model. The

meaning of the conditions (4.2)-(4.9) is discussed in detail in the following. We fix for the

rest of this section (µ, σ, a) ∈ IR × IR+ × A.

We show now that condition (4.8) implies that the location and scale scores are in L²(P_{μσa}). The components of the score function with respect to the location and the scale parameters are given respectively by
\[
l_{/\mu}(\,\cdot\,) = \frac{-1}{\sigma}\,\frac{a'\!\left(\frac{\cdot-\mu}{\sigma}\right)}{a\!\left(\frac{\cdot-\mu}{\sigma}\right)} \tag{4.11}
\]
and
\[
l_{/\sigma}(\,\cdot\,) = \frac{-1}{\sigma}\left\{ 1 + \frac{\cdot-\mu}{\sigma}\,\frac{a'\!\left(\frac{\cdot-\mu}{\sigma}\right)}{a\!\left(\frac{\cdot-\mu}{\sigma}\right)} \right\} . \tag{4.12}
\]
Note that condition (4.8) together with proposition 19 in appendix 4.6.1 ensures that the functions \{a'(\,\cdot\,)\}^2/a(\,\cdot\,) and (\,\cdot\,)^2\{a'(\,\cdot\,)\}^2/a(\,\cdot\,) are Lebesgue integrable.
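The score formulas can be checked against closed forms in the Gaussian case: for a = φ one has φ'(y)/φ(y) = −y, so the location score is (x−μ)/σ² and the scale score is {((x−μ)/σ)² − 1}/σ. A minimal sketch (the numerical inputs are arbitrary test values):

```python
import math

def phi(y):
    # standard normal density: here a = phi
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)

def dphi(y):
    # a'(y) = -y * phi(y)
    return -y * phi(y)

def score_mu(x, mu, sigma):
    # (4.11): -(1/sigma) * a'(y)/a(y) with y = (x - mu)/sigma
    y = (x - mu) / sigma
    return -(1.0 / sigma) * dphi(y) / phi(y)

def score_sigma(x, mu, sigma):
    # (4.12): -(1/sigma) * (1 + y * a'(y)/a(y))
    y = (x - mu) / sigma
    return -(1.0 / sigma) * (1.0 + y * dphi(y) / phi(y))

mu, sigma, x = 1.5, 2.0, 0.7        # arbitrary test values
s_mu = score_mu(x, mu, sigma)       # closed form (x - mu)/sigma^2 = -0.2
s_sg = score_sigma(x, mu, sigma)    # closed form ((x-mu)^2/sigma^2 - 1)/sigma = -0.42
```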

It can be seen that condition (4.9) implies that
\[
\int_{\mathbb{R}} l_{/\mu}(x)\, a(x)\,\lambda(dx) = \int_{\mathbb{R}} l_{/\sigma}(x)\, a(x)\,\lambda(dx) = 0 ,
\]
i.e. the location and the scale partial scores are unbiased. We will need condition (4.9) with polynomials of arbitrary order in the calculation of the nuisance tangent space and the projection of the score function onto the orthogonal complement of the nuisance tangent space.

Condition (4.7) implies that the distribution associated with a possesses finite moments of all orders and that the polynomials are dense in L²(a) (see proposition 19 and theorem 11 in appendix 4.6.1). Those properties will be crucial in the calculation of

the nuisance tangent space and in the projection of the score function onto the orthogonal

complement of the nuisance tangent space.

The following proposition gives a useful sufficient condition for verifying whether a

given probability density satisfies the technical conditions (4.7)-(4.9).

Proposition 18 Let a : IR −→ IR be a function for which (4.2), (4.3) and (4.6) hold.

Assume moreover that there exists δ > 0 such that for all s ∈ (−δ, δ),

lim esx a(x) = lim esx a(x) = 0

x→+∞ x→−∞

and

a0 (x) a0 (x)

lim esx q = lim esx q = 0

x→+∞ x→−∞

a(x) a(x)

Then the technical conditions (4.7)-(4.9) hold.

Proof: Conditions (4.7) and (4.8) follow from proposition 20 in appendix 4.6.1, part i)

and (4.9) from part ii). u

t

Using the proposition above it is easy to see that the following classic families of

distributions have probability densities satisfying the technical conditions (4.7)-(4.9): the

normal distributions, the hyperbolic distributions, the Gumbel distributions, the double

exponential or Laplace distribution.

82 CHAPTER 4. SEMIPARAMETRIC LOCATION AND SCALE MODELS

In this section we characterize the nuisance tangent spaces of the semiparametric location-

scale model given in section 4.2 in terms of orthonormal polynomials. More precisely,

we will calculate the L2 nuisance tangent space for the semiparametric location-scale

model presented. Here (µ, σ, a) will be treated as a fixed (but arbitrary) point of the

parameter space IR × IR+ × A. We denote by {ei }i∈N0 the result of a Gram-Schmidt

orthonormalization process with respect to the inner product of L2 (a) applied to the

sequence of monomials, say {1, ( · ), ( · )2 , ...}. Let us adopt the convention that for each

i ∈ N0 , the polynomial ei ( · ) is of degree i and introduce the notation

· −µ

e∗i ( · ) = ei .

σ

The following theorem gives a polynomial characterization for the L2 - nuisance tangent

space.

Theorem 10 The L2 - nuisance tangent space of the location-scale model given by (4.1)

at (µ, σ, a) ∈ IR × IR+ × A is

2

∗

T N (µ, σ, a) = clL2 (Pµσa ) [span{ei ( · ) : i = k + 1, k + 2, ...}] .

Proof: We give now the main steps of the proof (the details can be found in appendix

4.6.2).

First of all, it can be shown that under a semiparametric location-scale model the L2 -

nuisance tangent space at (µ, σ, a) ∈ IR × IR+ × A is given by

( )

· −µ 2

2

T N (µ, σ, a) = ν : ν ∈TN (0, 1, a)

σ

(see theorem 12 in the appendix 4.6.3 for a detailed proof). Therefore there is no loss of

generality in restricting our attention to the case where (µ, σ, a) = (0, 1, a) and show that

2

T N (0, 1, a) = clL2 (P01a ) {span [{ei ( · ) : i = k + 1, k + 2, ...}]} .

2

For notational simplicity, in this proof we denote T N (0, 1, a) by TN (and use the

analogous convention for the tangent sets). Define, for each i ∈ N ,

Hi = span{e1 ( · ), ..., ei ( · )} .

4.3. CALCULATION OF THE NUISANCE TANGENT SPACE 83

L20 (P01a ). Take an arbitrary ν ∈ TN0 and h ∈ Hk . There exists a differentiable path (at

a) {at } ⊂ A with tangent ν. Let {rt } ⊂ L20 (P01a ) be the sequence of remainder terms of

{at }. For each t,

at − a

| < ν, h >a | = < − rt , h >a

ta

1

Z Z

= h(x)a t (x)λ(dx) −

h(x)at (x)λ(dx) − < rt , h >a

t

(from (4.3)-(4.5) and (4.10) )

= | < rt , h >a | .

L2 (P01a )

Since rt −→ 0, < rt , h >a −→ 0 as t ↓ 0. We conclude that < ν, h >a = 0. Therefore

TN0 ⊆ Hk⊥ and since Hk⊥ is a closed linear space,

Next we sketch the proof that Hk⊥ ⊆ TN . The verification of this inclusion above can

be reduced to proving that for each i ∈ {k + 1, k + 2, ...}

ei ( · ), if i is even,

hi ( · ) =

ei+1 ( · ) − ei ( · ), if i is odd,

is in TN (0, 1, a). The proof is done by showing that for t small enough we have that

at ( · ) = a( · ) + ta( · )hi ( · )

belongs to A. The conditions (4.2)-(4.10) for at are verified in the appendix 4.6.2. There,

the crucial point is the verification of (4.7) and (4.8), which is done with the Laplace

transform properties given in appendix 4.6.1. u

t

We remark that in fact the weak and the L1 tangent spaces coincide with the L2

tangent space for the location and scale models. This will be proved in a more general

context in chapter 5. In the rest of the chapter we suppress the symbol “2” indicating

that we work with the L2 tangent space.

84 CHAPTER 4. SEMIPARAMETRIC LOCATION AND SCALE MODELS

We calculate in this section the efficient score function for the location-scale model (4.1)

by projecting the location and the scale partial scores onto the orthogonal complement of

the nuisance tangent space.

Thought this section (µ, σ, a) is an arbitrary element of the parameter space IR×IR+ ×

A. We denote the probability measure Pµσa by P0 and the inner product and the norm

of L2 (P0 ) by < · , · > and k · k respectively. Moreover, A⊥ will denote the orthogonal

complement of A in L20 (P0 ).

Recall that the L2 - nuisance tangent space at (µ, σ, a) is given by

TN (µ, σ, a) = clL2 (P0 ) [span{e∗k+1 , e∗k+2 , ...}]⊥ .

Since {e∗i }i∈N0 is an orthonormal basis in L2 (P0 ), the orthogonal complement of TN (µ, σ, a)

in L20 (P0 ) is given by

TN⊥ (µ, σ, a) = span{e∗1 , ..., e∗k } . (4.13)

The efficient score function is calculated now by expanding the location and scale

scores in Fourier series and taking the terms of indices 1, . . . , k. More precisely, since

l/µ ( · ) is in L2 (P0 ) we have the following Fourier expansion in terms of the orthonormal

basis {e∗i }i∈N0 ,

∞

c1 e∗1 ( · ) ck e∗k ( · ) ci e∗i ( · )

X

l/µ ( · ) = + ... + + (4.14)

i=k+1

and hence the the location component of the efficient score function at (µ, σ, a) is given

by

E

l/µ ( · ) = c1 e∗1 ( · ) + ... + ck e∗k ( · ) (4.15)

and analogously the scale component of the efficient score function is

E

l/σ ( · ) = d1 e∗1 ( · ) + ... + dk e∗k ( · ) . (4.16)

Here the Fourier coefficients in (4.15) and (4.16) are given, for all i ∈ N , by

ci = < l/µ ( · ), e∗i ( · ) > (4.17)

Z

1 a0 x−µ

σ

x−µ 1 x−µ

= −

x−µ

ei a λ(dx)

IR σ a σ σ σ σ

1

Z

e0i (y)a(y)λ(dy)

+∞

= − ei (y)a(y) −∞ −

σ IR

1Z 0

= e (y)a(y)λ(dy) .

σ IR i

4.4. CALCULATION OF THE EFFICIENT SCORE FUNCTION 85

Here we used the condition (4.9). A similar calculation leads to the following formula for

the coefficients of (4.16):

1Z

di = < l/σ ( · ), e∗i ( · ) >= ye0i (y)a(y)λ(dy), for i = 1, . . . , k . (4.18)

σ IR

We discuss now the dependence of the efficient score function on the nuisance param-

eter. First of all, since for each i ∈ N , ei ( · ) is a polynomial of degree i, the coefficient ci

is a linear combination of the standardized cumulants up to order k − 1 of the distribu-

tion with density a (see (4.17)). Moreover, the coefficients di are linear combinations of

the standardized cumulants of the distribution in play up to order k. We conclude that

E E

the coefficients of l/µ ( · ) and l/σ ( · ) (given in (4.15) and (4.16) ) depend on the nuisance

parameter a only through the standardized cumulants up to order k. However, the depen-

dence of the efficient score function on the nuisance parameter is more complex because

the polynomials e0 , e1 , ..., ek generated by the orthonormalization procedure (in L2 (a))

depend on higher order standardized cumulants. In fact the polynomial ek ( · ) depends on

the moments of order up to 2k, because in order to normalize the polynomial of degree

k in the Gram-Schmidt procedure, we have to divide the polynomial by its L2 (a) norm,

which clearly depends on the standardized cumulant of order 2k.

Summing up, the dependence of the efficient score function on the infinite dimensional

parameter for the location-scale model under study occurs here via a finite dimensional

intermediate parameter involving only the standardized cumulants of order up to 2k.

4.4.1 The case where the first two standardized cumulants are

fixed

We study now in detail the case where only the standardized cumulants up to order 2

are fixed (i.e. k = 2). It will be shown that in this case the efficient score function is

equivalent to an estimating function, which is independent of the nuisance parameter and

has the sample mean and standard deviation as roots.

The first three elements of the basis {ei }i∈N0 are

e0 ( · ) = 1 , e 1 ( · ) = ( · ) (4.19)

1 q

e2 ( · ) = {( · )2 − m3 ( · ) − 1} , where ∆2 = m4 − m23 − 1 .

∆2

The detailed calculations are given in appendix 4.6.4. There can be found also an argu-

ment showing that ∆2 > 0, and hence (4.19) is well defined. Using the formulas (4.17)

and (4.18) to calculate the coefficients c0 , c1 , c2 , d0 , d1 , d2 of the efficient score function

86 CHAPTER 4. SEMIPARAMETRIC LOCATION AND SCALE MODELS

gives

c1 = σ1 c2 = −m

σ∆2

3

2

d1 = 0 d2 = σ∆ 2

.

Inserting the coefficients given above into (4.15) and (4.16) we obtain that the efficient

score function at (µ, σ, a) is given by

E

l/µ ( · ) = c1 e∗1 ( · ) + c2 e∗2 ( · ) (4.20)

" ( 2 )#

1 · −µ m3 · −µ · −µ

= − 2 − m3 −1

σ σ ∆2 σ σ

and

( 2 )

2 · −µ · −µ

E

l/σ (·) = d2 e∗2 ( · ) = − m3 −1 . (4.21)

σ∆22 σ σ

Under independent repeated sampling, with sample x = (x1 , ..., xn )T we obtain the

following expression for the efficient score function,

xi −µ 2

Pn xi −µ m3 Pn xi −µ

i=1 σ

− ∆22 i=1 σ

− m3 σ

−1

−1

lE (x

; µ, σ, a) =σ .

xi −µ 2

2 Pn xi −µ

− ∆22 i=1 σ

− m3 σ

−1

− m23

" #

1

M =σ −m23 −∆22

m3 2

we obtain the following estimating function which is equivalent to the efficient score

function

xi −µ̂

P

n

i=1

σ̂

M · lE (x ; µ, σ, a) = Pn xi −µ̂ 2 .

i=1 σ̂

−n

Note that the matrix M is indeed nonsingular (its determinant is −σ∆22 /2 6= 0, see a

justification in appendix 4.6.4). Hence the efficient score function is equivalent to an

estimating function independent of the nuisance parameter with roots

v

n n

1 u1 X

u

X

µ̂ = xi and σ̂ = t (x i − µ̂)2 .

n i=1 n i=1

4.4. CALCULATION OF THE EFFICIENT SCORE FUNCTION 87

the efficient score function is optimal.

Clearly, the sample mean and the sample variance are regular asymptotic linear esti-

mators (for µ and σ 2 ). They are efficient, since the full tangent space is the whole L20 .

Note that, in this case, the bound given by the L2 path differentiability is attained (by

regular linear asymptotic estimators), hence it coincides with the bound given by the

weak path differentiability.

4.4.2 The case where the first three standardized cumulants are

fixed

We show in this section that in the case where the standardized cumulants up to order 3

are fixed (i.e. k = 3) the roots of the efficient score function do depend on the nuisance

parameter through the cumulants up to order 6. Moreover, we exemplify some situations

where the roots of the efficient score function are not the sample mean and the sample

variance.

We have computed in the last section the first three coefficients of the efficient score

function, namely c0 , c1 , c2 , d0 , d1 and d2 . We calculate now the coefficients c3 and d3 which

will allow us to compute the efficient score function by using (4.15) and (4.16). Note that

the polynomial e3 is given by (see appendix 4.6.4)

( ! ! !)

1 γ m3 γ γ

e3 ( · ) = ( · )3 − 2

( · )2 − m4 − 2 ( · ) − m3 − 2

∆3 ∆2 ∆2 ∆2

where

γ = m5 − m3 m4 − m3

and

! ! !

γ m3 γ γ

∆3 =
( · )3 − ( · )2 − m4 − 2 ( · ) − m3 − 2

2

∆2 ∆2 ∆2

Note also that according to the argument given in appendix 4.6.4, ∆3 > 0 and hence

e3 ( · ) is well defined.

Using (4.17) and (4.18) we obtain

( )

1Z 0 1 m3 γ

c3 = e (x)a(x)λ(dx) = 3 − m4 − 2

σ IR 3 σ∆3 ∆2

and

( )

1Z 0 1 γ

d3 = xe (x)a(x)λ(dx) = 3m3 − 2 2 . (4.22)

σ IR 3 σ∆3 ∆2

88 CHAPTER 4. SEMIPARAMETRIC LOCATION AND SCALE MODELS

E

l/µ ( · ) = c1 e∗1 ( · ) + c2 e∗2 ( · ) + c3 e∗3 ( · )

" ( 2 )#

1 · −µ m3 · −µ · −µ

= − 2 − m3 −1

σ σ ∆2 σ σ

( 3 ! 2

c3 · −µ γ · −µ

+ −

σ∆3 σ ∆22 σ

! !)

m3 γ · −µ γ

− m4 − 2 − m3 − 2

∆2 σ ∆2

and

E

l/σ ( · ) = d2 e∗2 ( · ) + d3 e∗3 ( · )

" ( 2 )#

1 2 · −µ · −µ

= − 2 − m3 −1

σ ∆2 σ σ

( 3 ! 2

d3 · −µ γ · −µ

+ −

σ∆3 σ ∆22 σ

! !)

m3 γ · −µ γ

− m4 − 2 − m3 − 2 .

∆2 σ ∆2

We consider now some examples where the efficient score function simplifies a bit.

Example 6 Let us consider the case where m3 = m5 = 0 and m4 = 3,i.e. the standard-

ized cumulants up to order 3 coincide with those of the normal distribution. We have then

that the coefficients c3 and d3 vanish and hence the efficient score function coincides with

the one obtained in the case where only the cumulants up to order 2 were fixed. There-

fore, in this case the roots of the efficient score function are the sample mean and standard

deviation. u

t

Example 7 We study now the case where m3 = m5 = 0 and m4 6= 3. This is the case

for instance for the Laplace distribution (double exponential) or the hyperbolic distribution

with symmetry parameter β vanishing, which are symmetric (hence m3 = m5 = 0) and

have the standardized cumulant of fourth order different from 3.

In this case the coefficients of the efficient score function are given by

n o

c3 = 1

σ∆3

(3 − m4 ) 6= 0 , d3 = σ∆1

3

3m3 − 2 ∆γ2 = 0

2

−m3 2

c2 = σ∆2

=0 , d2 = σ∆ 2

1

c1 = σ

, d1 = 0

4.4. CALCULATION OF THE EFFICIENT SCORE FUNCTION 89

1 ∗ 1

E

l/µ (·) = e1 ( · ) + (3 − m4 )e∗3 ( · )

σ ∆3

3 − m4 · − µ 3 3 − m4 + ∆23 · − µ

≡ − +

∆23 σ ∆23 σ

and

2

· −µ

E

l/σ (·) = d2 e∗2 ( · ) ≡ +1 .

σ

Under independent repeated sampling, with sample x = (x1 , ..., xn )T , we obtain the fol-

lowing expression for the efficient score function

n n

3 − m4 X xi − µ 3 3 − m4 + ∆23 X xi − µ

E

l/µ (x ) ≡ − + (4.23)

∆23 i=1 σ ∆23 i=1 σ

and

n 2

E

X xi −µ

l/σ (x ) ≡ +n . (4.24)

i=1 σ

Now, equating (4.24) to zero we obtain

v

n

u1 X

u

σ̂ = t (x i − µ̂)2 . (4.25)

n i=1

Equating (4.23) to zero, inserting (4.25) and rearranging we obtain the following equation

n

!

− n (2A + 1) µ̂3 + (4A + 3) µ̂2

X

xi

i=1

!2

n n

!

x2i + 2(A + 1)

X X

− {(n + 1)A + n} xi µ̂

i=1 i=1

n n n

( ! ! !)

1

x3i x2i

X X X

+ A + (A + 1) xi = 0,

i=1 n i=1 i=1

4

. We obtained then that µ̂ is a root of a polynomial of third degree with

coefficients depending on the standardized cumulants up to order 6 (note that ∆3 depends

on m6 ). Hence µ̂ depends on the nuisance parameter and the efficient score function

cannot be equivalent to an estimating function independent of the nuisance parameter.

u

t

90 CHAPTER 4. SEMIPARAMETRIC LOCATION AND SCALE MODELS

4.5 Discussion

There exist in the literature many studies of the extensions of the pure location model. We

refer to Stein (1956), van der Vaart (1988,1991) and Bickel et al. (1993), among others.

In all the studies referred to the unknown distributions are assumed to be symmetric,

which can simplify the mathematical treatment considerably. In this paper we presented

a class of distributions that are not necessarily symmetric and we treat at the same

time the problem of estimating the scale. It is to be noted that we did not attempt to

obtain the largest class (or even a very large class) of distributions, containing asymmetric

distributions, for which it is possible to treat the problem of estimating the location (and

the scale). Rather, we restricted the discussion to some interesting infinite dimensional

classes of distributions. It would be a considerable improvement to eliminate the technical

assumptions on the Laplace transform and the polynomial decay (of all orders) of the tails

of the densities (i.e. conditions (4.7)-(4.9) ).

In the case of the location-scale model where only the first two moments are fixed,

the efficient score function does not essentially depend on the nuisance parameter and

is in fact a regular estimating function. This implies, according to the discussion in

chapter 2 (see also Jørgensen and Labouriau, 1995), that the efficient score function is an

optimal estimating function, in the sense that it maximizes the Godambe information.

The roots of the efficient score function yield the sample mean and the sample standard

desviation as estimators of the location and scale respectively. This is in agreement

with the literature for the location model with symmetry and with the common sense in

statistics. In this way, the theory of estimating functions does not suggest new estimators

and its merit apparently is only to justify the classic statistical estimation procedure.

However, some care should be taken at this point. We showed that the only possible

estimating sequence derived from regular estimating functions, in this example, is the

sample mean and the sample standard desviation. Hence, that estimating sequence is

optimal in a class containing only one element. Moreover, the usual claim that it is

possible to obtain B-robust estimating sequences from estimating functions fails in this

example.

Note that in the case of the location-scale model where only the first two moments are

fixed, the (global) tangent space is the whole space L20 (see Labouriau,1996a). That is, the

model is not a semiparametric model in the terminology of J. Wellner (see Groeneboom

and Wellner, 1992, page 7) but rather is a nonparametric model. This implies that there

is essentially only one sequence of regular linear asymptotic estimators. Clearly, this

sequence of estimators is optimal, even though optimality in a class containing only one

element does not say very much. Hence, a naive application of the direct extension of the

estimating function theory to semiparametric models and the classical theory of regular

linear asymptotic both fail in this example.

4.5. DISCUSSION 91

When one fixes some standardized cumulants of higher order, the model becomes a

genuine semiparametric model, in the sense that its (global) tangent space becomes a

proper subspace of L20 . In this case the estimation problem is harder even though the

family of distributions now is smaller than the “less restricted case” discussed before.

It will be shown in chapter 5 that, in fact, the semiparametric Cramèr-Rao bound is

not attained in this example. The efficient score function and its roots now depend on

the nuisance parameter, however through a finite dimensional intermediate parameter,

namely only through a finite number of standardized cumulants (2k). This suggests

that plugging some reasonable estimators of the standardized cumulants (of order up

to 2k) into the efficient score function, can produce reasonable asymptotic results. In

fact, since the standardized cumulants can be estimated consistently from the sample

moments, pluging these estimators in the efficient score function should produce efficient

estimators. However, since one has to estimate standardized cumulants of high order

(2k), one can expect a poor performance for finite samples of moderate size. The method

of local efficient estimation and the method of sieves can perhaps offer some attractive

alternatives.

92 CHAPTER 4. SEMIPARAMETRIC LOCATION AND SCALE MODELS

4.6 Appendices

4.6.1 The Laplace transform and polynomial approximation in

L2

Basic properties of the Laplace Transform

In this section we review the basic properties of the Laplace transform and prove some

technical lemmas required in the study of the semiparametric location-scale model defined

in section 4.2. The properties presented here are essentially well known for the case were

the distribution is concentrated on the positive real line, however the result concerning

densities of polynomials in L2 is original.

Let f : IR −→ [0, ∞) be a function such that for some s ∈ IR the integral

Z

M (s; f ) = esx f (x)λ(dx) (4.26)

IR

converges. The function M ( · ; f ) : IR −→ [0, ∞] such that for each s ∈ IR, M (s; f ) is

given by (4.26) is said to be the Laplace transform of f .

We now study some properties of the functions with finite Laplace transform in a

neighborhood of zero.

Proposition 19 Let f : IR −→ [0, ∞) be a continuous function such that for some δ > 0

and for all s ∈ (−δ, δ)

Z

xn f (x)λ(dx) ∈ IR .

IR

Proof: Since for all s ∈ (−δ, δ), M (s; f ) < ∞, e|sx| ≤ esx + e−sx and using the series

version of the monotone convergence theorem (see Billingsley 1986 page 214 theorem 16.6

1

) we have

Z Z

∞ > eδx f (x)λ(dx) + e−δx f (x)λ(dx)

IR IR

Z

≥ e|δx| f (x)λ(dx)

IR

1

RP P R

The theorem referred to states: ”If fn ≥ 0, then n fn dµ = n fn dµ.”.

4.6. APPENDICES 93

(∞

|δx|k

Z )

X

= f (x)λ(dx)

IR k=0 k !

(from theorem 16.6 in Billingsley 1986)

∞

|δx|k

(Z )

X

= f (x)λ(dx) ,

k=0 IR k !

t

The notion of Laplace transform can be extended to functions with range equal to the

whole real line in the following way. Given a function f : IR −→ IR we define the positive

and the negative part of f respectively by

Here χA ( · ) is the indicator function of the set A. We have clearly the decomposition

f ( · ) = f +( · ) − f −( · ) .

IR −→ [−∞, ∞] given by

M ( · ; f ) = M ( · ; f +) − M ( · ; f −) , (4.28)

provided that at least one of the terms of the right side of (4.28) is finite (otherwise the

Laplace transform of f is not defined). The following proposition will be useful for the

calculation of the L2 - nuisance tangent space of the location-scale model considered in

section 4.2.

Proposition 20 Let f : IR −→ IR and δ > 0 be such that for all s ∈ [−δ, δ] M (s; f ) ∈ IR.

Then, for all n ∈ N and all s ∈ (−δ/2, δ/2) we have

M [s; ( · )n f ( · )] ∈ IR .

Proof: Assume without loss of generality that the function f is nonnegative. Take an

arbitrary s ∈ [−δ/2, δ/2] and n ∈ N . By hypothesis, f has finite Laplace transform in

a neighborhood of zero; then, from proposition 19, f has finite moments of all orders, in

particular

Z

x2n f (x)λ(dx) ∈ IR .

IR

94 CHAPTER 4. SEMIPARAMETRIC LOCATION AND SCALE MODELS

= | < e( · )s f 1/2 ( · ), ( · )n f 1/2 ( · ) >λ |

(Cauchy-Schwartz inequality)

≤ e( · )s f 1/2 ( · ) ( · )n f 1/2 ( · )

Z 1/2Z 1/2

2sx 2n

= e f (x)λ(dx) x f (x)λ(dx) < ∞

IR IR

u

t

In this section we give a sufficient condition for having the class of polynomials dense in

L2 (a). Here a is a density with respect to the Lebesgue measure λ of a positive finite

measure on (IR, B(IR)) and L2 (a) is endowed with the usual inner product and norm

denoted by < · , · >a and k · ka respectively. The conditions we give will ensure that the

measure a possesses all moments finite, i.e. for all k ∈ N ,

Z

xk a(x)λ(dx) ∈ IR .

IR

In that case we can define the sequence of polynomials {ei ( · )}i∈N0 ⊆ L2 (a) as the result

of a Gram-Schmidt orthonormalization process applied to the sequence {1, ( · ), ( · )2 , ...}.

The following theorem gives a sufficient condition for {ei ( · )} to be a complete sequence

in L2 (a), which implies that the polynomials are dense in L2 (a).

Z

∃δ > 0 such that ∀s ∈ [−δ, δ], M (s; a) = esx a(x)λ(dx) < ∞ . (4.30)

IR

Proof: First of all we observe that condition (4.30) implies that the measure determined

by a possesses finite moments of all orders (see proposition 19).

4.6. APPENDICES 95

Z

xk f (x)a(x)λ(dx) = 0 . (4.31)

IR

We prove that f ( · ) = 0 a-a.e. which implies the theorem (see Luenberg, 1969, Lemma 1,

page 61).

Define for each k ∈ N0 , t ∈ [−δ/2, δ/2] and x ∈ IR,

We will use a series version of the dominated convergence theorem applied to {fk }. In

the following we find a Lebesgue integrable function dominating uniformly (i.e. for all

k) the functions fk , which will enable as to use the referred theorem. We have for each

n ∈ N , k ∈ N0 , t ∈ [−δ/2, δ/2] and x ∈ IR,

n n n

X X X |xt|k

f k (x) ≤ |fk (x)| = |f (x)|a(x) (4.32)

k=0 k !

k=0 k=0

n ∞

X |xt|k X |xt|k

= |f (x)|a(x) ≤ |f (x)|a(x)

k=0 k ! k=0 k !

q q

xt −xt

= |f (x)| a(x) a(x)(e + e )

= g(x) ,

where the function g is given, for all x ∈ IR, by

q q

g(x) = |f (x)| a(x) a(x)(ext + e−xt ) . (4.33)

q 2 Z

|f (x)|2 a(x)λ(dx) = kf ( · )k2a < ∞.

|f ( · )| a( · ) =

2

L (λ) IR

Then the first term in the right side of (4.33) is in L2 (λ). On the other hand,

q
2 Z

a( · )e( · )t
e2tx a(x)λ(dx) = M (2t; a) < ∞

2 =

L (λ) IR

and
q
2 Z

a( · )e−( · )t
e−2tx a(x)λ(dx) = M (−2t; a) < ∞ .

2 =

L (λ) IR

96 CHAPTER 4. SEMIPARAMETRIC LOCATION AND SCALE MODELS

Then the second term in the right side of (4.33) is in L2 (λ). Using the Cauchy-Schwartz

inequality (see Luenberg, 1969, lemma 1, page 47) we obtain

Z q q

< |f ( · )| a( · ) , a( · ) e( · )t + e−( · )t >λ

g(x)λ(dx) =

IR

q
q

( · )t

+ e−( · )t

≤
|f ( · )| a( · )

2
a( · ) e

< ∞.

L (λ) L2 (λ)

Since (4.32) holds for each n ∈ N , x ∈ IR, t ∈ [−δ/2, δ/2] and g is Lebesgue integrable

we can use the series version of the dominated convergence theorem (see Billingsley, 1986,

theorem 16.7 page 214 2 ) to obtain

(∞

(xt)k

Z Z )

xt

X

e f (x)a(x)λ(dx) = f (x)a(x) λ(dx)

IR IR k=0 k !

(from the series dominated convergence theorem)

∞

(xt)k

(Z )

X

= f (x)a(x)λ(dx) = 0 .

k=0 IR k !

M [t; f ( · )a( · )] = 0 . (4.34)

We show that (4.34) implies that f ( · ) = 0 a-a.e. . For,

kf ( · )k2a = | < f ( · ), 1 >a |

q q

f ( · ) f ( · ), e( · )δ/4 e−( · )δ/4 >a

= <

q q

( · )δ/4

, f ( · )e−( · )δ/4

= <

f ( · )e >a

(from the Cauchy-Schwartz inequality)

q q

f ( · )e( · )δ/4 f ( · )e−( · )δ/4

≤

a a

Z 1/2 Z 1/2

δ/2x −δ/2x

= f (x)e a(x)λ(dx) f (x)e a(x)λ(dx)

IR IR

= (from (4.34)) = 0 .

u

t

2

P Pn

The theorem states: ”IfP n fn converges almost everywhere andR P| k=1 fk |P

≤ gR almost everywhere,

where g is integrable, then n fn and the fn are integrable, and f

n n dµ = n fn dµ”.

4.6. APPENDICES 97

The following proposition gives a sufficient condition for having the Laplace transform

defined in a neighborhood of zero, which is easy to verify.

Proposition 21 Let f : IR −→ [0, ∞) be a continuous function such that for some δ > 0

and for all s ∈ [−δ, δ]

x→+∞ x→−∞

Then we have:

lim xk f (x) = lim xk f (x) = 0

x→+∞ x→−∞

Proof:

i) Take s ∈ (−δ, δ). Condition (4.35) implies that there exists L ∈ IR+ such that for all

x ∈ IR \ [−L, L], eδx f (x) < 1 and e−δx f (x) < 1. We have then

Z

M (s; f ) = esx f (x)λ(dx)

ZIR Z Z

= esx f (x)λ(dx) + esx f (x)λ(dx) + esx f (x)λ(dx)

[−L,L] [L,∞) (−∞,−L]

Z Z Z

= sx

e f (x)λ(dx) + e e f (x)λ(dx) + e(δ−s)x e−δx f (x)λ(dx)

(s−δ)x δx

[−L,L] [L,∞) (−∞,−L]

Z Z Z

≤ esx f (x)λ(dx) + e(s−δ)x λ(dx) + e(δ−s)x λ(dx) < ∞ .

[−L,L] [L,∞) (−∞,−L]

x→+∞ x→+∞

and

lim xk f (x) = lim {eδx xk }{e−δx f (x)} = 0

x→−∞ x→−∞

u

t

98 CHAPTER 4. SEMIPARAMETRIC LOCATION AND SCALE MODELS

In this appendix we complete the details of the second part of the proof of theorem ??. Recall the notational conventions given there: $\{e_i\}_{i\in\mathbb{N}_0}$ denotes the result of a Gram-Schmidt orthonormalization process with respect to the inner product of $L^2_0(a)$ applied to the sequence of polynomials $\{1, (\,\cdot\,), (\,\cdot\,)^2, \ldots\}$. Moreover, $T_N(0, 1, a)$ is denoted by $T_N$ (and the analogous convention is used for the tangent sets). Define, for each $i \in \mathbb{N}$,
$$ H_i = \mathrm{span}\{e_0(\,\cdot\,), \ldots, e_i(\,\cdot\,)\} $$
and
$$ h_i(\,\cdot\,) = \begin{cases} e_i(\,\cdot\,), & \text{if } i \text{ is even,} \\ e_{i+1}(\,\cdot\,) - e_i(\,\cdot\,), & \text{if } i \text{ is odd.} \end{cases} $$
We will prove that for $i \in \{k+1, k+2, \ldots\}$, $h_i(\,\cdot\,) \in T_N^0 \subseteq T_N$. This implies the lemma. To see that, note that $T_N$ is a linear space and hence contains every linear combination of $\{h_{k+1}(\,\cdot\,), h_{k+2}(\,\cdot\,), \ldots\}$. In particular, if $i$ is even, then $e_i(\,\cdot\,) = h_i(\,\cdot\,) \in T_N$, and if $i$ is odd, then $e_i(\,\cdot\,) = h_{i+1}(\,\cdot\,) - h_i(\,\cdot\,) \in T_N$. Hence for all $i \in \{k+1, k+2, \ldots\}$, $e_i(\,\cdot\,)$ is in $T_N$. Since $T_N$ is a closed linear subspace of $L^2(a)$,
$$ \mathrm{cl}_{L^2(a)}\big[\mathrm{span}\{e_i(\,\cdot\,) : i \in \{k+1, k+2, \ldots\}\}\big] \subseteq T_N . $$
On the other hand, the condition (4.7) and the theorem 11 imply that the polynomials are dense in $L^2(a)$ and, since $H_k = \mathrm{span}\{e_0(\,\cdot\,), \ldots, e_k(\,\cdot\,)\}$ and $\{e_i(\,\cdot\,)\}_{i\in\mathbb{N}_0}$ is an orthonormal system, the closure above contains the orthogonal complement of $H_k$ in $L^2_0(a)$. We conclude that the lemma is proved if we show that for all $i \in \{k+1, k+2, \ldots\}$, $h_i(\,\cdot\,) \in T_N^0$, and this is proved below.

Let $\nu(\,\cdot\,) = h_i(\,\cdot\,)$ for a fixed but arbitrary $i \in \{k+1, k+2, \ldots\}$. We prove that for $t$ small enough,
$$ a_t(\,\cdot\,) := a(\,\cdot\,) + t\,\nu(\,\cdot\,)a(\,\cdot\,) \in A . \tag{4.36} $$
Note that (4.36) amounts to exhibiting a differentiable path with vanishing remainder term. Hence, if we show (4.36) we prove in fact that $\nu \in T_N^0$. We verify next that each $a_t$ (for $t$ in a neighborhood of zero) satisfies the conditions (4.2)-(4.10).


Verification of (4.2):
Note that
$$ \lim_{x\to+\infty} \nu(x) = \lim_{x\to-\infty} \nu(x) = \infty . $$
Hence there exists $L > 0$ such that for all $x \in \mathbb{R}\setminus[-L,L]$, $\nu(x) > 0$. Then, since $a(\,\cdot\,)$ is strictly positive, for all $x \in \mathbb{R}\setminus[-L,L]$,
$$ a_t(x) = a(x)\{1 + t\,\nu(x)\} > 0 . $$
On the other hand, since $\nu(\,\cdot\,)$ is continuous its restriction to the compact interval $[-L,L]$ is bounded, and since $a(\,\cdot\,)$ is continuous and strictly positive, the restriction of $a(\,\cdot\,)$ to $[-L,L]$ is bounded away from zero. It can easily be shown then that for $t$ small enough and for all $x \in [-L,L]$, $a_t(x) > 0$.

Verification of (4.3)-(4.5) and (4.10):
Given $i \in \{1, \ldots, k\}$ and $t \in \mathbb{R}_+$,
$$ \int_{\mathbb{R}} x^i a_t(x)\lambda(dx) = \int_{\mathbb{R}} x^i a(x)\lambda(dx) + t\int_{\mathbb{R}} x^i \nu(x) a(x)\lambda(dx) = m_i + t\cdot 0 = m_i . $$
The second equality comes from the fact that $\{e_i\}_{i\in\mathbb{N}_0}$ is an orthogonal system in $L^2(a)$, so that $\nu$ is orthogonal to every polynomial of degree at most $k$, and from (4.3)-(4.5) and (4.10).
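The moment-preservation computation above can be illustrated numerically. The sketch below (an illustration, not from the thesis) takes $a$ to be the standard normal density and $\nu$ the fourth probabilists' Hermite polynomial $He_4(x) = x^4 - 6x^2 + 3$, which is orthogonal in $L^2(a)$ to all polynomials of degree at most 3 and tends to $+\infty$ at both ends, mimicking the properties of $h_i$ used here:

```python
import numpy as np

x = np.linspace(-30, 30, 600001)
a = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # a = N(0,1) density

# nu = He_4, the 4th probabilists' Hermite polynomial: orthogonal in L^2(a)
# to every polynomial of degree <= 3, and nu(x) -> +infinity at both ends.
nu = x**4 - 6 * x**2 + 3

def integral(g):
    # trapezoidal quadrature on the fixed grid
    return float(np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(x)))

t = 0.05
a_t = a * (1 + t * nu)                        # perturbed density a_t = a(1 + t nu)
assert a_t.min() > 0                          # a_t stays positive for small t

# moments m_0, ..., m_3 are unchanged along the path, as in the verification
# of (4.3)-(4.5): the perturbation integrates against x^i to zero.
for i in range(4):
    assert abs(integral(x**i * a) - integral(x**i * a_t)) < 1e-8
```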

Verification of (4.6):

Since polynomials are of class $C^\infty$, the property follows immediately for $a_t$.

Verification of (4.7):
We have for all $s \in (-\delta, \delta)$ (with $\delta = \delta(a)$) and all $t \in \mathbb{R}_+$
$$ M(s; a_t) = M(s; a) + t\, M(s; a\nu) . $$
Since $\nu$ is a polynomial and $M(s; a) < \infty$ by hypothesis, from proposition 20 (in the appendix on the Laplace transform), $M(s; a\nu) < \infty$; then, for all $s \in (-\delta, \delta)$,
$$ M(s; a_t) < \infty . $$

Verification of (4.8):
A routine calculation yields
$$ \{a_t'(\,\cdot\,)\}^2 = p(\,\cdot\,)\{a'(\,\cdot\,)\}^2 + q(\,\cdot\,)\{a(\,\cdot\,)\}^2 + w(\,\cdot\,)a(\,\cdot\,)a'(\,\cdot\,) , \tag{4.37} $$
for some polynomials $p$, $q$ and $w$. We will show that for all $s \in [-\delta/2, \delta/2]$ and for $t \in \mathbb{R}_+$ small enough,
$$ M[s;\, p(\,\cdot\,)\{a'(\,\cdot\,)\}^2/a_t(\,\cdot\,)]\,,\quad M[s;\, q(\,\cdot\,)a^2(\,\cdot\,)/a_t(\,\cdot\,)]\,,\quad M[s;\, w(\,\cdot\,)a(\,\cdot\,)a'(\,\cdot\,)/a_t(\,\cdot\,)] \;\in\; \mathbb{R} . \tag{4.38} $$
From (4.37) it then follows that
$$ M\!\left[s;\, \frac{\{a_t'(\,\cdot\,)\}^2}{a_t(\,\cdot\,)}\right]
 = M\!\left[s;\, p(\,\cdot\,)\frac{\{a'(\,\cdot\,)\}^2}{a_t(\,\cdot\,)}\right]
 + M\!\left[s;\, q(\,\cdot\,)\frac{a^2(\,\cdot\,)}{a_t(\,\cdot\,)}\right]
 + M\!\left[s;\, w(\,\cdot\,)\frac{a(\,\cdot\,)a'(\,\cdot\,)}{a_t(\,\cdot\,)}\right] \in \mathbb{R} , \tag{4.39} $$
for all $s \in [-\delta/2, \delta/2]$, which implies (4.8).

We prove (4.38). Take $t$ small enough so that for all $x \in \mathbb{R}$, $a_t(x) > 0$, and let $s$ be an arbitrary element of $[-\delta/2, \delta/2]$. The Cauchy-Schwarz inequality gives
$$ M\!\left[s;\, w(\,\cdot\,)\frac{a(\,\cdot\,)a'(\,\cdot\,)}{a_t(\,\cdot\,)}\right]
 = \Big\langle e^{(\,\cdot\,)s/2}\, w(\,\cdot\,)\frac{a(\,\cdot\,)}{\sqrt{a_t(\,\cdot\,)}},\;
   e^{(\,\cdot\,)s/2}\frac{a'(\,\cdot\,)}{\sqrt{a_t(\,\cdot\,)}}\Big\rangle_{L^2(\lambda)}
 \le \Big\| e^{(\,\cdot\,)s/2}\, w(\,\cdot\,)\frac{a(\,\cdot\,)}{\sqrt{a_t(\,\cdot\,)}} \Big\|_{L^2(\lambda)}
     \Big\| e^{(\,\cdot\,)s/2}\frac{a'(\,\cdot\,)}{\sqrt{a_t(\,\cdot\,)}} \Big\|_{L^2(\lambda)} . \tag{4.40} $$
We verify that each of the terms on the right side of (4.40) is finite. Note that since $\lim_{x\to\pm\infty}\nu(x) = \infty$, there exists a $L > 0$ such that for all $x \in \mathbb{R}\setminus[-L,L]$, $a_t(x) > a(x)$. We have then
$$
\begin{aligned}
\Big\| e^{(\,\cdot\,)s/2}\, w(\,\cdot\,)\frac{a(\,\cdot\,)}{\sqrt{a_t(\,\cdot\,)}} \Big\|^2_{L^2(\lambda)}
&= \int_{\mathbb{R}} e^{sx} w^2(x)\frac{a^2(x)}{a_t(x)}\lambda(dx) \\
&= \int_{[-L,L]} e^{sx} w^2(x)\frac{a^2(x)}{a_t(x)}\lambda(dx)
 + \int_{\mathbb{R}\setminus[-L,L]} e^{sx} w^2(x)\frac{a^2(x)}{a_t(x)}\lambda(dx) \\
&\le \int_{[-L,L]} e^{sx} w^2(x)\frac{a^2(x)}{a_t(x)}\lambda(dx)
 + \int_{\mathbb{R}} e^{sx} w^2(x)\frac{a^2(x)}{a(x)}\lambda(dx) \;<\; \infty .
\end{aligned} \tag{4.41}
$$
The first integral in the last line is finite because its integrand is continuous and hence bounded on $[-L,L]$, and the second integral is finite because it coincides with the Laplace transform of $w^2(\,\cdot\,)a(\,\cdot\,)$, which according to proposition 20 is finite. Moreover,
$$ \Big\| e^{(\,\cdot\,)s/2}\frac{a'(\,\cdot\,)}{\sqrt{a_t(\,\cdot\,)}} \Big\|^2_{L^2(\lambda)}
 = \int_{\mathbb{R}} e^{sx}\frac{\{a'(x)\}^2}{a_t(x)}\lambda(dx)
 \le \int_{[-L,L]} e^{sx}\frac{\{a'(x)\}^2}{a_t(x)}\lambda(dx)
 + \int_{\mathbb{R}} e^{sx}\frac{\{a'(x)\}^2}{a(x)}\lambda(dx) \;<\; \infty . \tag{4.42} $$
The first integral in the last line is finite because its integrand is continuous and hence bounded on $[-L,L]$, and the second integral is finite because it coincides with the Laplace transform of $\{a'(\,\cdot\,)\}^2/a(\,\cdot\,)$, which according to condition (4.8) is finite. Inserting (4.41) and (4.42) in (4.40) we obtain, for all $s \in [-\delta/2, \delta/2]$,
$$ M\!\left[s;\, w(\,\cdot\,)\frac{a(\,\cdot\,)a'(\,\cdot\,)}{a_t(\,\cdot\,)}\right] < \infty . $$

We show now that $M[s;\, q(\,\cdot\,)a^2(\,\cdot\,)/a_t(\,\cdot\,)]$ is finite. Using the Cauchy-Schwarz inequality we obtain
$$ M\!\left[s;\, q(\,\cdot\,)\frac{a^2(\,\cdot\,)}{a_t(\,\cdot\,)}\right]
 = \Big\langle e^{(\,\cdot\,)s/2}\, q(\,\cdot\,)\frac{a(\,\cdot\,)}{\sqrt{a_t(\,\cdot\,)}},\;
   e^{(\,\cdot\,)s/2}\frac{a(\,\cdot\,)}{\sqrt{a_t(\,\cdot\,)}}\Big\rangle_{L^2(\lambda)}
 \le \Big\| e^{(\,\cdot\,)s/2}\, q(\,\cdot\,)\frac{a(\,\cdot\,)}{\sqrt{a_t(\,\cdot\,)}} \Big\|_{L^2(\lambda)}
     \Big\| e^{(\,\cdot\,)s/2}\frac{a(\,\cdot\,)}{\sqrt{a_t(\,\cdot\,)}} \Big\|_{L^2(\lambda)} < \infty . \tag{4.43} $$
To see that the right side of the expression above is finite, note that for every polynomial, say $r$, we have
$$
\begin{aligned}
\Big\| e^{(\,\cdot\,)s/2}\, r(\,\cdot\,)\frac{a(\,\cdot\,)}{\sqrt{a_t(\,\cdot\,)}} \Big\|^2_{L^2(\lambda)}
&= \int_{\mathbb{R}} e^{sx} r^2(x)\frac{\{a(x)\}^2}{a_t(x)}\lambda(dx) \\
&\le \int_{[-L,L]} e^{sx} r^2(x)\frac{\{a(x)\}^2}{a_t(x)}\lambda(dx)
 + \int_{\mathbb{R}\setminus[-L,L]} e^{sx} r^2(x)\frac{\{a(x)\}^2}{a(x)}\lambda(dx) \\
&\le \int_{[-L,L]} e^{sx} r^2(x)\frac{\{a(x)\}^2}{a_t(x)}\lambda(dx) + M[s;\, r^2(\,\cdot\,)a(\,\cdot\,)] \;<\; \infty .
\end{aligned} \tag{4.44}
$$
Note that proposition 20 implies that $M[s;\, r^2(\,\cdot\,)a(\,\cdot\,)] < \infty$.

Finally, we show that $M[s;\, p(\,\cdot\,)\{a'(\,\cdot\,)\}^2/a_t(\,\cdot\,)]$ is finite. Indeed,
$$
\begin{aligned}
M\!\left[s;\, p(\,\cdot\,)\frac{\{a'(\,\cdot\,)\}^2}{a_t(\,\cdot\,)}\right]
&= \int_{\mathbb{R}} e^{sx} p(x)\frac{\{a'(x)\}^2}{a_t(x)}\lambda(dx) \\
&\le \int_{[-L,L]} e^{sx} |p(x)|\frac{\{a'(x)\}^2}{a_t(x)}\lambda(dx)
 + \int_{\mathbb{R}\setminus[-L,L]} e^{sx} |p(x)|\frac{\{a'(x)\}^2}{a(x)}\lambda(dx) \\
&\le \int_{[-L,L]} e^{sx} |p(x)|\frac{\{a'(x)\}^2}{a_t(x)}\lambda(dx)
 + M\!\left[s;\, |p(\,\cdot\,)|\frac{\{a'(\,\cdot\,)\}^2}{a(\,\cdot\,)}\right] \;<\; \infty .
\end{aligned}
$$

Verification of (4.9):
Given $i \in \mathbb{N}_0$ we have for each $t \in \mathbb{R}_+$,
$$ \lim_{x\to\pm\infty} x^i a_t(x) = \lim_{x\to\pm\infty}\{x^i a(x) + t\, x^i\nu(x)a(x)\} = 0 , $$
because $x^i\nu(x)$ is a polynomial and, from (4.9), for each polynomial $p(\,\cdot\,)$ we have
$$ \lim_{x\to\pm\infty} p(x)a(x) = 0 . \qquad \Box $$


In this section we show how to extend the calculation of the strong nuisance tangent space of a semiparametric location-scale model at the point $(0, 1, a_0)$ to an arbitrary point $(\mu_0, \sigma_0, a_0) \in \mathbb{R}\times\mathbb{R}_+\times A$. More precisely, we prove the following theorem.

Theorem The strong nuisance tangent space at an arbitrary point $(\mu_0, \sigma_0, a_0) \in \mathbb{R}\times\mathbb{R}_+\times A$ is given by
$$ T_N(\mu_0, \sigma_0, a_0) = \left\{ \nu\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right) : \nu \in T_N(0, 1, a_0) \right\} . $$

The theorem above holds for any of the notions of path differentiability given in Labouriau (1996a), and in particular for any path differentiability used in this paper. Therefore we do not specify the path differentiability adopted in this appendix.

The point $(\mu_0, \sigma_0, a_0)$ will be considered fixed in the rest of this section. We introduce the following notation:
$$ P_{\mu_0\sigma_0 a_0} = P_0 \,; \qquad a_\bullet(\,\cdot\,) = \frac{1}{\sigma_0}\, a_0\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right) . $$
The inner product and the norm of $L^2(P_0)$ will be denoted by $\langle\,\cdot\,,\,\cdot\,\rangle_0$ and $\|\cdot\|_0$ respectively. We also use the symbol $P_0^*$ to denote $P^*_{\mu_0\sigma_0}$. Note that
$$ P_0^* = \left\{ \frac{1}{\sigma_0}\, a\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right) : a \in A \right\} $$
and
$$ T_N^0(\mu_0, \sigma_0, a_0) = \left\{ \nu \in L^2_0(P_0) : \exists\,\epsilon > 0,\ \{a_t'\}_{t\in[0,\epsilon)} \subseteq P_0^*,\ \{r_t'\}_{t\in[0,\epsilon)} \subseteq L^2(P_0) \text{ such that (4.45) and (4.46) hold} \right\} , $$
where
$$ a_t'(\,\cdot\,) = a_\bullet(\,\cdot\,) + t\, a_\bullet(\,\cdot\,)\nu(\,\cdot\,) + t\, a_\bullet(\,\cdot\,) r_t'(\,\cdot\,) \tag{4.45} $$
and
$$ r_t'(\,\cdot\,) \longrightarrow 0 \,, \text{ as } t \downarrow 0 , \tag{4.46} $$
where the convergence above is one of the forms of convergence defined in section ??.

Lemma 5 If $\nu \in T_N^0(0, 1, a_0)$, then $\nu\big(\frac{\,\cdot\, - \mu_0}{\sigma_0}\big) \in T_N^0(\mu_0, \sigma_0, a_0)$.


Proof: Since $\nu \in T_N^0(0, 1, a_0)$ there exist $\epsilon > 0$, $\{a_t\}_{t\in[0,\epsilon)} \subseteq P^*_{01} = A$ and $\{r_t\}_{t\in[0,\epsilon)} \subseteq L^2(P_{01a_0})$ such that for all $t \in [0,\epsilon)$,
$$ a_t(\,\cdot\,) = a_0(\,\cdot\,) + t\, a_0(\,\cdot\,)\nu(\,\cdot\,) + t\, a_0(\,\cdot\,) r_t(\,\cdot\,) \tag{4.47} $$
and
$$ r_t(\,\cdot\,) \longrightarrow 0 \,, \text{ as } t \downarrow 0 . $$
Using (4.47), for all $t \in [0,\epsilon)$, we can write
$$
\begin{aligned}
\frac{1}{\sigma_0}\, a_t\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right)
&= \frac{1}{\sigma_0}\, a_0\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right)
 + t\,\frac{1}{\sigma_0}\, a_0\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right)\nu\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right)
 + t\,\frac{1}{\sigma_0}\, a_0\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right) r_t\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right) \\
&= a_\bullet(\,\cdot\,) + t\, a_\bullet(\,\cdot\,)\,\nu\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right)
 + t\, a_\bullet(\,\cdot\,)\, r_t\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right) .
\end{aligned} \tag{4.48}
$$
Clearly, $\frac{1}{\sigma_0} a_t\big(\frac{\,\cdot\, - \mu_0}{\sigma_0}\big) \in P_0^*$, because $a_t(\,\cdot\,) \in A$. Suppose now that the convergence of the remainder term is in the $L^q$ sense (for $q \in [1,\infty)$). We have then
$$ \Big\| r_t\Big(\frac{\,\cdot\, - \mu_0}{\sigma_0}\Big) \Big\|^q_{L^q(P_0)}
 = \int_{\mathbb{R}} \Big| r_t\Big(\frac{x - \mu_0}{\sigma_0}\Big)\Big|^q a_\bullet(x)\lambda(dx)
 = \int_{\mathbb{R}} |r_t(y)|^q a_0(y)\lambda(dy) = \|r_t(\,\cdot\,)\|^q_{L^q(a_0)} . \tag{4.49} $$
Since, for all $t \in [0,\epsilon)$, $r_t(\,\cdot\,) \in L^q(P_{01a_0})$, then from (4.49), $r_t\big(\frac{\,\cdot\, - \mu_0}{\sigma_0}\big) \in L^q(P_0)$, and since $\|r_t(\,\cdot\,)\|_{L^q(a_0)} \longrightarrow 0$, also $\big\| r_t\big(\frac{\,\cdot\, - \mu_0}{\sigma_0}\big)\big\|_{L^q(P_0)} \longrightarrow 0$. We conclude that $\nu\big(\frac{\,\cdot\, - \mu_0}{\sigma_0}\big) \in T_N^0(\mu_0, \sigma_0, a_0)$. An analogous argument can be used to prove the lemma for the weak nuisance tangent spaces. The idea again is to use a suitable change of variables in the integrals used to define the convergence of the remainder term. $\Box$

Lemma 6 If $\nu\big(\frac{\,\cdot\, - \mu_0}{\sigma_0}\big) \in T_N^0(\mu_0, \sigma_0, a_0)$, then $\nu(\,\cdot\,) \in T_N^0(0, 1, a_0)$.

Proof: We give next the argument for the $L^2$-nuisance tangent space. For the other notions of path differentiability the argument is analogous. Since $\nu\big(\frac{\,\cdot\, - \mu_0}{\sigma_0}\big) \in T_N^0(\mu_0, \sigma_0, a_0)$, there exist $\epsilon > 0$, $\{a_t'\}_{t\in[0,\epsilon)} \subseteq P_0^*$ and $\{r_t'\}_{t\in[0,\epsilon)} \subset L^2(P_0)$ such that
$$ a_t'(\,\cdot\,) = a_\bullet(\,\cdot\,) + t\, a_\bullet(\,\cdot\,)\,\nu\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right) + t\, a_\bullet(\,\cdot\,)\, r_t'(\,\cdot\,) \tag{4.50} $$
and
$$ \| r_t'(\,\cdot\,)\|_0 \longrightarrow 0 \,, \text{ as } t \downarrow 0 . $$
Since for each $t \in [0,\epsilon)$, $a_t'(\,\cdot\,) \in P_0^*$, there exists $a_t(\,\cdot\,) \in A$ such that
$$ a_t'(\,\cdot\,) = \frac{1}{\sigma_0}\, a_t\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right) . $$
Then (4.50) is equivalent to
$$ \frac{1}{\sigma_0}\, a_t\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right)
 = \frac{1}{\sigma_0}\, a_0\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right)
 + t\,\frac{1}{\sigma_0}\, a_0\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right)\nu\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right)
 + t\,\frac{1}{\sigma_0}\, a_0\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right) r_t'(\,\cdot\,) . $$
Eliminating the common factor $\frac{1}{\sigma_0}$ and changing variables we obtain
$$ a_t(\,\cdot\,) = a_0(\,\cdot\,) + t\, a_0(\,\cdot\,)\nu(\,\cdot\,) + t\, a_0(\,\cdot\,)\, r_t'[\sigma_0(\,\cdot\,) + \mu_0] . $$
Note that
$$ \| r_t'[\sigma_0(\,\cdot\,) + \mu_0] \|^2_{L^2(P_{01a_0})}
 = \int_{\mathbb{R}} \{r_t'[\sigma_0 x + \mu_0]\}^2 a_0(x)\lambda(dx)
 = \int_{\mathbb{R}} \{r_t'(y)\}^2 \frac{1}{\sigma_0}\, a_0\!\left(\frac{y - \mu_0}{\sigma_0}\right)\lambda(dy)
 = \|r_t'(\,\cdot\,)\|^2_{L^2(P_0)} \longrightarrow 0 \,, \text{ as } t \downarrow 0 . \tag{4.51} $$
We conclude that $\nu(\,\cdot\,) \in T_N^0(0, 1, a_0)$. $\Box$

Proof of the theorem: From lemmas 5 and 6,
$$ T_N^0(\mu_0, \sigma_0, a_0) = \left\{ \nu\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right) : \nu \in T_N^0(0, 1, a_0) \right\} . $$
Hence
$$
\begin{aligned}
T_N(\mu_0, \sigma_0, a_0)
&= \mathrm{cl}_{L^2(P_0)}\,\mathrm{span}\left\{ \nu\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right) : \nu \in T_N^0(0, 1, a_0) \right\} \\
&= \mathrm{cl}_{L^2(P_0)}\left\{ \nu\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right) : \nu \in \mathrm{span}\{T_N^0(0, 1, a_0)\} \right\} \qquad \text{(from lemma 7)} \\
&= \left\{ \nu\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right) : \nu \in \mathrm{cl}_{L^2(a_0)}[\mathrm{span}\{T_N^0(0, 1, a_0)\}] \right\} \qquad \text{(from lemma 8)} \\
&= \left\{ \nu\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right) : \nu \in T_N(0, 1, a_0) \right\} . \qquad \Box
\end{aligned}
$$

We present now the two technical lemmas required in the proof given above.

Lemma 7 Given a class of functions $A$ we have
$$ \mathrm{span}\left\{ \nu\!\left(\frac{\,\cdot\, - \mu}{\sigma}\right) : \nu \in A \right\}
 = \left\{ \nu\!\left(\frac{\,\cdot\, - \mu}{\sigma}\right) : \nu \in \mathrm{span}(A) \right\} . $$

Proof:
'$\subseteq$': Take $h \in \mathrm{span}\{\nu(\frac{\,\cdot\, - \mu_0}{\sigma_0}) : \nu \in A\}$. Then there exist $n \in \mathbb{N}$, $t_1, \ldots, t_n \in \mathbb{R}$ and $h_1, \ldots, h_n \in \{\nu(\frac{\,\cdot\, - \mu_0}{\sigma_0}) : \nu \in A\}$ such that
$$ h(\,\cdot\,) = \sum_{i=1}^n t_i h_i(\,\cdot\,) . $$
Since for each $i \in \{1, \ldots, n\}$, $h_i \in \{\nu(\frac{\,\cdot\, - \mu_0}{\sigma_0}) : \nu \in A\}$, there exists $\nu_i \in A$ such that
$$ \nu_i\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right) = h_i(\,\cdot\,) . $$
Clearly $\sum_{i=1}^n t_i\nu_i(\,\cdot\,) \in \mathrm{span}(A)$ and
$$ h(\,\cdot\,) = \sum_{i=1}^n t_i\,\nu_i\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right) . $$
Hence $h \in \{\nu(\frac{\,\cdot\, - \mu_0}{\sigma_0}) : \nu \in \mathrm{span}(A)\}$.

'$\supseteq$': Take $z \in \{\nu(\frac{\,\cdot\, - \mu_0}{\sigma_0}) : \nu \in \mathrm{span}(A)\}$. Then there exists $\nu \in \mathrm{span}(A)$ such that $z(\,\cdot\,) = \nu(\frac{\,\cdot\, - \mu_0}{\sigma_0})$. Since $\nu \in \mathrm{span}(A)$, there exist $n \in \mathbb{N}$, $t_1, \ldots, t_n \in \mathbb{R}$ and $\nu_1, \ldots, \nu_n \in A$ such that $\nu(\,\cdot\,) = \sum_{i=1}^n t_i\nu_i(\,\cdot\,)$. We have then
$$ z(\,\cdot\,) = \sum_{i=1}^n t_i\,\nu_i\!\left(\frac{\,\cdot\, - \mu_0}{\sigma_0}\right) $$
and hence $z \in \mathrm{span}\{\nu(\frac{\,\cdot\, - \mu}{\sigma}) : \nu \in A\}$. $\Box$

Lemma 8 Given a class of functions $A \subseteq L^2(a_0)$ we have
$$ \mathrm{cl}_{L^2(P_0)}\left\{ \nu\!\left(\frac{\,\cdot\, - \mu}{\sigma}\right) : \nu \in A \right\}
 = \left\{ \nu\!\left(\frac{\,\cdot\, - \mu}{\sigma}\right) : \nu \in \mathrm{cl}_{L^2(a_0)}(A) \right\} . $$

Proof: Note first that, for $f \in L^2(a_0)$,
$$ \Big\| f\Big(\frac{\,\cdot\, - \mu_0}{\sigma_0}\Big) \Big\|^2_{L^2(P_0)}
 = \int_{\mathbb{R}} f^2\Big(\frac{x - \mu_0}{\sigma_0}\Big)\,\frac{1}{\sigma_0}\, a_0\Big(\frac{x - \mu_0}{\sigma_0}\Big)\lambda(dx)
 = \int_{\mathbb{R}} f^2(y)\,a_0(y)\lambda(dy) = \|f(\,\cdot\,)\|^2_{L^2(a_0)} . \tag{4.52} $$
We prove now the lemma.

"$\subseteq$": Take $z \in \mathrm{cl}_{L^2(P_0)}\{\nu(\frac{\,\cdot\, - \mu}{\sigma}) : \nu \in A\}$. Then there exists a sequence $\{z_n\} \subseteq \{\nu(\frac{\,\cdot\, - \mu}{\sigma}) : \nu \in A\}$ such that $z_n \stackrel{L^2(P_0)}{\longrightarrow} z$. Moreover, for all $n \in \mathbb{N}$ we have $z_n(\,\cdot\,) = \nu_n(\frac{\,\cdot\, - \mu_0}{\sigma_0})$, for some $\nu_n \in A$. Since $\{z_n\}$ is convergent, it is a Cauchy sequence in $L^2(P_0)$. From (4.52), $\{\nu_n\}$ is a Cauchy sequence in $L^2(a_0)$ and, since $L^2(a_0)$ is complete, $\{\nu_n\}$ is convergent, say
$$ \nu_n \stackrel{L^2(a_0)}{\longrightarrow} \nu , $$
for some $\nu \in \mathrm{cl}_{L^2(a_0)}(A)$. Using (4.52) and the continuity of the norm $\|\cdot\|_{L^2(P_0)}$ we obtain
$$ \Big\| \nu\Big(\frac{\,\cdot\, - \mu}{\sigma}\Big) - z(\,\cdot\,) \Big\|_{L^2(P_0)}
 = \lim_{n\uparrow\infty} \Big\| \nu_n\Big(\frac{\,\cdot\, - \mu}{\sigma}\Big) - z_n(\,\cdot\,) \Big\|_{L^2(P_0)} = 0 . $$
Hence $z(\,\cdot\,) = \nu(\frac{\,\cdot\, - \mu}{\sigma})$ and the inclusion follows.

"$\supseteq$": Take $z \in \{\nu(\frac{\,\cdot\, - \mu}{\sigma}) : \nu \in \mathrm{cl}_{L^2(a_0)}(A)\}$, say $z(\,\cdot\,) = \nu(\frac{\,\cdot\, - \mu_0}{\sigma_0})$ with $\nu \in \mathrm{cl}_{L^2(a_0)}(A)$. Then there exists a sequence $\{\nu_n\} \subseteq A$ such that $\nu_n \stackrel{L^2(a_0)}{\longrightarrow} \nu$. Note that $\{\nu_n\}$ is a Cauchy sequence in $L^2(a_0)$. Define the sequence $\{z_n\}$ in $L^2(P_0)$ by, for all $n$, $z_n(\,\cdot\,) = \nu_n(\frac{\,\cdot\, - \mu_0}{\sigma_0})$. From (4.52) we see that $\{z_n\}$ is a Cauchy sequence in $L^2(P_0)$ and then it is convergent with limit, say $\xi \in \mathrm{cl}_{L^2(P_0)}\{\nu(\frac{\,\cdot\, - \mu_0}{\sigma_0}) : \nu \in A\}$. Using (4.52) and the continuity of the norm $\|\cdot\|_{L^2(P_0)}$ we obtain
$$ \|\xi(\,\cdot\,) - z(\,\cdot\,)\|_{L^2(P_0)} = \lim_{n\uparrow\infty} \Big\| z_n(\,\cdot\,) - \nu_n\Big(\frac{\,\cdot\, - \mu_0}{\sigma_0}\Big) \Big\|_{L^2(P_0)} = 0 , \tag{4.53} $$
hence $z(\,\cdot\,) = \xi(\,\cdot\,)$ and the inclusion follows. $\Box$
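The change-of-variables isometry (4.52), which drives both inclusions in the proof above, can be checked numerically. The sketch below uses illustrative choices of $f$, $\mu_0$, $\sigma_0$, the density $a_0$ and the quadrature grid, and compares the two squared norms:

```python
import numpy as np

mu0, sigma0 = 2.0, 3.0
f = lambda y: y**3 - 2 * y + 1            # any f in L^2(a0); choice is arbitrary
a0 = lambda y: np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)   # a0 = N(0,1) density

def integral(g, x):
    # trapezoidal quadrature
    return float(np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(x)))

# ||f||^2_{L^2(a0)}
y = np.linspace(-30, 30, 400001)
lhs = integral(f(y)**2 * a0(y), y)

# ||f((. - mu0)/sigma0)||^2_{L^2(P0)}, where dP0/dlambda = a0((x-mu0)/sigma0)/sigma0
x = np.linspace(mu0 - 30 * sigma0, mu0 + 30 * sigma0, 400001)
rhs = integral(f((x - mu0) / sigma0)**2 * a0((x - mu0) / sigma0) / sigma0, x)

assert abs(lhs - rhs) < 1e-8              # the two norms coincide, as in (4.52)
```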


We calculate here the first four orthogonal polynomials in $L^2(a)$ (i.e. $e_0$, $e_1$, $e_2$ and $e_3$) in terms of the standardized cumulants of the distribution given by $a$. Here $a$ is an arbitrary element of the class of probability densities $A$ defined in section 4.2.

Throughout this appendix the inner product and the norm of $L^2(a)$ will be denoted by $\langle\,\cdot\,,\,\cdot\,\rangle$ and $\|\cdot\|$ respectively. We also use the notation, for $i \in \mathbb{N}_0$, $m_i = \int_{\mathbb{R}} x^i a(x)\lambda(dx)$.

Recall that $\{e_i\}_{i\in\mathbb{N}_0}$ is the result of a Gram-Schmidt orthonormalization procedure applied to the sequence of polynomials $\{1, (\,\cdot\,), (\,\cdot\,)^2, \ldots\}$. We start the Gram-Schmidt procedure by setting
$$ e_0(\,\cdot\,) = 1 , $$
and observing that $\|e_0\| = 1$. Taking $e_1(\,\cdot\,) = (\,\cdot\,)$ we have
$$ \langle e_0, e_1 \rangle = \int_{\mathbb{R}} x\,a(x)\lambda(dx) = 0 $$
and
$$ \|e_1\|^2 = \int_{\mathbb{R}} x^2 a(x)\lambda(dx) = 1 . \tag{4.54} $$

We calculate now $e_2$. The Gram-Schmidt orthogonalization procedure (see Luenberger, 1969, page 55) gives the following polynomial of degree 2 orthogonal to $e_0$ and $e_1$:
$$ p_2(\,\cdot\,) = (\,\cdot\,)^2 - \langle e_0, (\,\cdot\,)^2\rangle\, e_0(\,\cdot\,) - \langle e_1, (\,\cdot\,)^2\rangle\, e_1(\,\cdot\,)
 = (\,\cdot\,)^2 - m_3(\,\cdot\,) - 1 . $$
Note that the polynomials $\{1, (\,\cdot\,), (\,\cdot\,)^2\}$ are linearly independent in $L^2(a)$ and $p_2$ is a linear combination of these polynomials with non-vanishing leading coefficient. Hence $p_2 \ne 0$ and consequently $\|p_2\| \ne 0$. We denote $\|p_2\|$ by $\Delta_2$. We stress that $\Delta_2 > 0$, which is a known result (see Kendall and Stuart, 1952). We have
$$ \Delta_2^2 = \int_{\mathbb{R}} (x^2 - m_3 x - 1)^2 a(x)\lambda(dx) = m_4 - m_3^2 - 1 . $$
We define
$$ e_2(\,\cdot\,) = \frac{1}{\Delta_2}\{(\,\cdot\,)^2 - m_3(\,\cdot\,) - 1\} . $$

We calculate now $e_3$. The Gram-Schmidt procedure gives the following polynomial of degree 3 orthogonal to $e_0$, $e_1$ and $e_2$:
$$ p_3(\,\cdot\,) = (\,\cdot\,)^3 - \frac{m_5 - m_3 m_4 - m_3}{\Delta_2^2}(\,\cdot\,)^2
 - \left\{ m_4 - \frac{m_3(m_5 - m_3 m_4 - m_3)}{\Delta_2^2} \right\}(\,\cdot\,)
 - \left\{ m_3 - \frac{m_5 - m_3 m_4 - m_3}{\Delta_2^2} \right\} . $$
Note that $p_3$ is a linear combination of linearly independent elements of $L^2(a)$ (in fact $e_0$, $e_1$, $e_2$ and $(\,\cdot\,)^3$) with a non-vanishing leading coefficient. Then $p_3(\,\cdot\,) \ne 0$ and consequently $\Delta_3 = \|p_3\| > 0$ (this generalizes the result of Kendall and Stuart (1952) concerning $\Delta_2$). We have then
$$ e_3(\,\cdot\,) = \frac{1}{\Delta_3}\, p_3(\,\cdot\,)
 = \frac{1}{\Delta_3}\left[ (\,\cdot\,)^3 - \frac{m_5 - m_3 m_4 - m_3}{\Delta_2^2}(\,\cdot\,)^2
 - \left\{ m_4 - \frac{m_3(m_5 - m_3 m_4 - m_3)}{\Delta_2^2} \right\}(\,\cdot\,)
 - \left\{ m_3 - \frac{m_5 - m_3 m_4 - m_3}{\Delta_2^2} \right\} \right] . $$
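The closed-form expression for $e_2$ can be compared with a direct numerical Gram-Schmidt orthonormalization. The sketch below is illustrative: it uses the standardized exponential density, for which $m_3 = 2$ and $m_4 = 9$, hence $\Delta_2 = 2$, and checks the formula $e_2(x) = (x^2 - m_3 x - 1)/\Delta_2$:

```python
import numpy as np

# density of the standardized exponential: a(x) = e^{-(x+1)} on [-1, inf),
# with mean 0, variance 1, m3 = 2, m4 = 9, so Delta_2^2 = m4 - m3^2 - 1 = 4.
x = np.linspace(-1.0, 60.0, 2000001)
a = np.exp(-(x + 1.0))

def inner(f, g):
    # <f, g> in L^2(a), by trapezoidal quadrature
    h = f * g * a
    return float(np.sum(0.5 * (h[1:] + h[:-1]) * np.diff(x)))

# Gram-Schmidt on {1, x, x^2} with respect to <.,.>
basis = [np.ones_like(x), x, x**2]
e = []
for b in basis:
    p = b - sum(inner(b, ei) * ei for ei in e)
    e.append(p / np.sqrt(inner(p, p)))

# compare e2 with the closed form (x^2 - m3*x - 1)/Delta_2 = (x^2 - 2x - 1)/2
closed = (x**2 - 2.0 * x - 1.0) / 2.0
assert np.max(np.abs(e[2] - closed)) < 1e-4
```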


Chapter 5

Semiparametric Models with L2 Restrictions

5.1 Introduction

This chapter treats a class of semiparametric models defined via restrictions imposed on the moments of some square integrable functions. The class of models studied includes semiparametric extensions of many important models, such as the multivariate location and shape models, covariance selection models, growth curve models with modeled variance, generalized linear models, factor analysis, multivariate structural models, and linear structural relationships models, among others. In fact we present a tool to produce semiparametric extensions of parametric models for which the moment structure plays a structural role.

The class of semiparametric models presented possesses a simple mathematical structure, which makes it attractive for testing general inference procedures. We

will be able to calculate explicitly the nuisance tangent spaces, which are the orthogonal

complements of the spaces spanned by the square integrable functions used to introduce

the restrictions defining the model. It is interesting to observe that the different notions

of nuisance tangent spaces defined in chapter 2 coincide for the models considered. More-

over, the orthogonal complement of the nuisance tangent spaces does not depend on the

nuisance parameter. This simplifies the treatment of some classic techniques for inference

in semiparametric models. For instance, any regular estimating function will be a lin-

ear combination of the restriction functions used to define the model corrected by their

means. In fact it will be shown that there is only one possible root for a regular estimating

function. That root is obtained in the form of a moment estimator. The efficient score

function will be a linear combination of the mean corrected restriction functions, however

with the coefficients in general depending on the nuisance parameter. The semiparametric


112 CHAPTER 5. SEMIPARAMETRIC MODELS WITH L2 RESTRICTIONS

Cramér-Rao bound for the asymptotic variance of regular asymptotic linear estimating

sequences will be obtained from the quadratic norm of the efficient score function. We

will give a necessary and sufficient condition for attaining this bound with estimating

sequences based on regular estimating functions. Examples will be presented where the bound is attained, and examples where it is not attained, by roots of regular estimating functions.

The chapter is organized in the following way. Section 5.2 introduces the main class of semiparametric models we deal with, and some examples are given in section 5.3. Estimation via regular asymptotic linear estimating sequences and regular estimating functions is considered in section 5.4. Most of the technical proofs are given in section 5.5.

5.2 Semiparametric models with L2 restrictions

We define next the class of semiparametric models studied in this chapter. Suppose that there are $k, m \in \mathbb{N}\cup\{0\}$, $l \in \mathbb{N}\cup\{0,\infty\}$, functions $f_1, \ldots, f_k : X \longrightarrow \mathbb{R}$ and functions $g_1, \ldots, g_m : X\times Z \longrightarrow \mathbb{R}$ such that for each $\theta_0 \in \Theta$ the submodel $P^*_{\theta_0}$ is the class of functions $p : X \longrightarrow \mathbb{R}_+$ for which the conditions (5.1)-(5.6) hold. The conditions referred to are, for $j = 1, \ldots, k$ and $i = 1, \ldots, m$:

$$ \forall x \in X,\ p(x) > 0; \tag{5.1} $$
$$ p \text{ is of class } C^l, \tag{5.2} $$
where for $r \in \mathbb{N}$, $C^r$ is the class of functions $r$ times continuously differentiable, $C^0$ is the class of continuous functions, and $C^\infty$ is the class of infinitely differentiable functions;
$$ \int_{X} p(x)\lambda(dx) = 1; \tag{5.3} $$
$$ \int_{X} f_j^2(x)p(x)\lambda(dx) < \infty; \tag{5.4} $$
$$ \int_{X} f_j(x)p(x)\lambda(dx) = M_j(\theta_0); \tag{5.5} $$
$$ \text{for each } z \in Z,\ \int_{X} g_i(x,z)p(x)\lambda(dx) \in B_i(\theta_0). \tag{5.6} $$

The conditions (5.1) and (5.3) ensure that p is a probability density of a distribution

with support equal to the whole sample space X . Conditions (5.4)-(5.6) are used to restrict

each submodel $P_{\theta_0}$ (and consequently shrink the model $P$). These conditions can be used

to express a partial a priori knowledge about the phenomena we study or to ensure some

desirable mathematical characteristics of the model, such as identifiability and regularity

of the partial score functions. For instance, the conditions (5.6) can be used to ensure

that the partial score functions are in L2 . Condition (5.2) can be assumed to hold apart

from a λ- null set.

Note that there are redundancies in the model. In fact, condition (5.4) can be expressed in terms of condition (5.6); however, it is important to make explicit that the functions $f_1, \ldots, f_k$ are in $L^2(p)$. Furthermore, without loss of generality, we assume that the functions $f_1, \ldots, f_k$ are linearly independent (as elements of $L^2$), because if $f_1, \ldots, f_k$ were linearly dependent, then condition (5.4) could be expressed with a smaller number of functions.

We refer to the models of the form described above as L2 - restricted semiparametric

models or simply L2 - restricted models.
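As a toy illustration of how conditions of the form (5.5) pin down a moment estimator (the concrete model and the estimator below are illustrative choices, not taken from the text): with $f_1(x) = x$, $f_2(x) = x^2$, $M_1(\mu,\sigma) = \mu$ and $M_2(\mu,\sigma) = \sigma^2 + \mu^2$, equating the sample means of the $f_j$ to the $M_j(\theta)$ yields the usual moment estimator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy L2-restricted model on the real line (illustrative assumption):
# f1(x) = x, f2(x) = x^2 with M1(mu, sigma) = mu, M2(mu, sigma) = sigma^2 + mu^2.
# Equating sample means of f1, f2 to M1, M2 gives the moment estimator:
def moment_estimator(sample):
    m1 = sample.mean()
    m2 = (sample**2).mean()
    return m1, np.sqrt(m2 - m1**2)

# Data from one density in the submodel P*_{mu sigma}: a shifted and scaled
# standardized exponential (the nuisance "parameter" is the density shape).
mu, sigma = 1.5, 0.5
sample = mu + sigma * (rng.exponential(size=200000) - 1.0)

mu_hat, sigma_hat = moment_estimator(sample)
assert abs(mu_hat - mu) < 0.01 and abs(sigma_hat - sigma) < 0.01
```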


5.3 Examples

5.3.1 Location-scale models

In this example we consider a collection of location-scale models defined on the real line, for which we fix the first $k$ (for $k \in \mathbb{N}$, $k \ge 2$) standardized cumulants of the distributions. The models we define are larger than the semiparametric extensions of the location-scale models considered in chapter 4. In particular, we avoid using strong conditions on the tails and on the Laplace transform of the distributions, but those conditions can be incorporated in the model through condition (5.6) without essentially modifying the results obtained. These models will serve as prototypes of $L^2$-restricted models.

The sample space $X$ here is the real line, $B$ is the Borel $\sigma$-algebra in $\mathbb{R}$ and $\lambda$ is the Lebesgue measure defined on $(\mathbb{R}, B)$. Consider the following family of probability measures dominated by $\lambda$:
$$ P = \left\{ P_{\mu\sigma a} : \frac{dP_{\mu\sigma a}}{d\lambda}(\,\cdot\,) = \frac{1}{\sigma}\,a\!\left(\frac{\,\cdot\,-\mu}{\sigma}\right),\ \mu \in \mathbb{R},\ \sigma \in \mathbb{R}_+,\ a \in A \right\} , \tag{5.7} $$
where $A$ is the class of probability densities $a : \mathbb{R} \longrightarrow \mathbb{R}_+$ such that the conditions (5.8)-(5.16) are satisfied. The conditions referred to are:
$$ \forall x \in \mathbb{R},\ a(x) > 0; \tag{5.8} $$
$$ a \in C^1; \tag{5.9} $$
$$ \int_{\mathbb{R}} a(x)\lambda(dx) = 1; \tag{5.10} $$
$$ \int_{\mathbb{R}} x\,a(x)\lambda(dx) = 0; \tag{5.11} $$
$$ \int_{\mathbb{R}} x^2 a(x)\lambda(dx) = 1; \tag{5.12} $$
$$ \int_{\mathbb{R}} x^{2k} a(x)\lambda(dx) < \infty; \tag{5.13} $$
$$ \text{for } j = 3, \ldots, k,\ \int_{\mathbb{R}} x^j a(x)\lambda(dx) = m_j . \tag{5.14} $$


Here $m_3, \ldots, m_k$ are given constants fixing the standardized moments up to order $k$. We adopt the convention that condition (5.14) is vacuous if $k = 2$. Furthermore, we assume that the score functions of the submodels defined by fixing the value of $a \in A$ are in $L^2$, i.e.
$$ \int_{\mathbb{R}} \left\{\frac{a'(x)}{a(x)}\right\}^2 a(x)\lambda(dx) \in (0,\infty) ; \tag{5.15} $$
$$ \int_{\mathbb{R}} \left\{\frac{x\,a'(x)}{a(x)}\right\}^2 a(x)\lambda(dx) \in (0,\infty) . \tag{5.16} $$

The parameter of interest is θ := (µ, σ) ∈ Θ := IR × IR+ . The nuisance parameter is

a ∈ A.

The characterization given above is natural for the location and scale model; however, it is not in the form of a $L^2$-restricted semiparametric model. In fact, we define the submodel $P^*_{(0,1)}$ ($= A$) and then use the affine transformation $(\,\cdot\,) \mapsto \sigma(\,\cdot\,) + \mu$ to characterize the whole model $P$. On the other hand, in the definition of the $L^2$-restricted model we give conditions satisfied by each submodel $P_\theta$ (for $\theta \in \Theta$).

We give next an alternative characterization of the location and scale model given by (5.7), which is in the form required for the $L^2$-restricted models. For each $(\mu,\sigma) \in \mathbb{R}\times\mathbb{R}_+$ consider the following class of probability densities on the real line,
$$ P^*_{\mu\sigma} = \{ p_{\mu,\sigma} : \mathbb{R} \longrightarrow \mathbb{R}_+ : \text{(5.18)-(5.24) hold} \} . \tag{5.17} $$
The conditions referred to in the definition above are, for each $(\mu,\sigma) \in \mathbb{R}\times\mathbb{R}_+$,
$$ \forall x \in \mathbb{R},\ p_{\mu,\sigma}(x) > 0; \tag{5.18} $$
$$ p_{\mu,\sigma} \in C^1; \tag{5.19} $$
$$ \int_{\mathbb{R}} p_{\mu,\sigma}(x)\lambda(dx) = 1; \tag{5.20} $$
$$ \text{for } j = 1, \ldots, k,\ \int_{\mathbb{R}} x^j p_{\mu,\sigma}(x)\lambda(dx) = \sum_{i=0}^{j}\binom{j}{i}\sigma^{j-i}\mu^i m_{j-i} . \tag{5.21} $$
The terms on the right-hand side of (5.21) come from the binomial expansion of $(\sigma x + \mu)^j$ with the $i$th power of $x$ replaced by $m_i$ ($i = 0, \ldots, j$). We assume further that
$$ \int_{\mathbb{R}} x^{2k} p_{\mu,\sigma}(x)\lambda(dx) < \infty; \tag{5.22} $$


$$ \int_{\mathbb{R}} \left\{\frac{p_{\mu,\sigma}'(x)}{p_{\mu,\sigma}(x)}\right\}^2 p_{\mu,\sigma}(x)\lambda(dx) \in (0,\infty) ; \tag{5.23} $$
$$ \int_{\mathbb{R}} \left\{\frac{x\,p_{\mu,\sigma}'(x)}{p_{\mu,\sigma}(x)}\right\}^2 p_{\mu,\sigma}(x)\lambda(dx) \in (0,\infty) . \tag{5.24} $$
It is easy to see that the model given by (5.17) coincides with the model given by (5.7). Moreover, the model given by (5.17) is clearly a $L^2$-restricted model. Here the condition (5.21) corresponds to the condition (5.5) in the definition of $L^2$-restricted model, with $f_j(\,\cdot\,) = (\,\cdot\,)^j$ and $M_j(\mu,\sigma)$ given by the right side of (5.21).
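The binomial moment identity behind (5.21) can be verified numerically. The sketch below is illustrative, with $X$ standard normal so that $m_0, \ldots, m_4 = 1, 0, 1, 0, 3$; it checks $E(\sigma X + \mu)^j = \sum_i \binom{j}{i}\sigma^{j-i}\mu^i m_{j-i}$ for $j = 1, \ldots, 4$:

```python
import numpy as np
from math import comb

# If Y = sigma*X + mu and m_i = E(X^i), then
# E(Y^j) = sum_i C(j,i) sigma^(j-i) mu^i m_{j-i}, as in (5.21).
# Here X ~ N(0,1), whose moments m_0..m_4 are 1, 0, 1, 0, 3.
m = [1.0, 0.0, 1.0, 0.0, 3.0]
mu, sigma = 0.7, 1.3

x = np.linspace(-30, 30, 400001)
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # density of X

def integral(g):
    # trapezoidal quadrature
    return float(np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(x)))

for j in range(1, 5):
    lhs = integral((sigma * x + mu)**j * p)  # E(Y^j) via substitution
    rhs = sum(comb(j, i) * sigma**(j - i) * mu**i * m[j - i] for i in range(j + 1))
    assert abs(lhs - rhs) < 1e-8
```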


5.3.2 Multivariate location and shape models

We define next a semiparametric version of a $q$-dimensional location and shape model ($q \in \mathbb{N}$). Even though this model is a straightforward extension of the one-dimensional location and scale model, it will serve as a basis for the next example, where we illustrate alternative uses of the conditions (5.5) and (5.6).

Let $V = V_q$ be the class of symmetric and nonsingular $q\times q$ matrices. Consider the following class of probability densities on $\mathbb{R}^q$, for each $\mu \in \mathbb{R}^q$ and $\Sigma \in V$,
$$ P^*_{\mu\Sigma} = \{ p : \mathbb{R}^q \longrightarrow \mathbb{R}_+ \text{ such that (5.26)-(5.33) hold} \} . \tag{5.25} $$
The conditions referred to are:
$$ \forall x \in \mathbb{R}^q,\ p(x) > 0; \tag{5.26} $$
$$ p \in C^1; \tag{5.27} $$
$$ \int_{\mathbb{R}^q} p(x)\lambda(dx) = 1; \tag{5.28} $$
$$ \int_{\mathbb{R}^q} x\,p(x)\lambda(dx) = \mu; \tag{5.29} $$
$$ \int_{\mathbb{R}^q} x\,x^T p(x)\lambda(dx) = \Sigma + \mu\mu^T; \tag{5.30} $$
$$ \int_{\mathbb{R}^q} (x^T x)^2\, p(x)\lambda(dx) < \infty. \tag{5.31} $$

Assume moreover that the components of the partial score function are in $L^2$, i.e.
$$ \text{for } i = 1, \ldots, q,\ \int_{\mathbb{R}^q} \{l_{/\mu_i}(x)\}^2 p(x)\lambda(dx) \in (0,\infty) ; \tag{5.32} $$
$$ \text{for } i, j = 1, \ldots, q,\ i \le j,\ \int_{\mathbb{R}^q} \{l_{/\sigma_{ij}}(x)\}^2 p(x)\lambda(dx) \in (0,\infty) , \tag{5.33} $$
where $l_{/\mu_i}$ and $l_{/\sigma_{ij}}$ are the components corresponding to the partial derivatives with respect to the $i$th component of $\mu$ and the $(i,j)$th entry of $\Sigma$, respectively, in the score function for the location and shape generated by the distribution associated with the density $p$. Note that by (5.26) and (5.28), $p$ is a probability density on $\mathbb{R}^q$ with support equal to the whole $\mathbb{R}^q$. Moreover, (5.27) implies that the partial score functions referred to in (5.32) and (5.33) are well defined.

The multivariate location and shape model is defined by
$$ P^* = \bigcup_{(\mu,\Sigma)\in\mathbb{R}^q\times V} P^*_{\mu\Sigma} . $$
A multivariate location model can be defined in a similar way, simply by dropping the parameter $\Sigma$, eliminating the conditions (5.30) and (5.33) and replacing condition (5.31) by
$$ \int_{\mathbb{R}^q} x^T x\, p(x)\lambda(dx) < \infty . $$


5.3.3 Covariance selection models

The models considered in this example are constructed by imposing some additional restrictions on the multivariate location and the multivariate location and shape semiparametric models defined previously. We will assume that some entries of the inverse of the covariance matrix vanish. This corresponds to assuming that some pairs of variables are conditionally uncorrelated given the others (see Whittaker, 1990).

We consider here two situations: first the case where the covariance matrix is to be estimated (the covariance selection model on the semiparametric multivariate location and shape model), and then the case where the covariance matrix is unknown but is not to be estimated (the covariance selection model on the semiparametric location case). The two cases differ significantly: the first is a $L^2$-restricted model while the second is not.

It is convenient to denote by $\sigma_{ij}$ and $\sigma^{ij}$ the $(i,j)$th entries of the matrices $\Sigma$ and $\Sigma^{-1}$, respectively (for $i, j \in \{1, \ldots, q\}$). Clearly each $\sigma^{ij}$ is a function of the entries of the matrix $\Sigma$, say
$$ \sigma^{ij} = f^{ij}(\sigma_{11}, \sigma_{21}, \ldots, \sigma_{qq}) , $$
where $f^{ij}$ is a function from $\mathbb{R}^{q+(q-1)+\cdots+1}$ into $\mathbb{R}$. Note that in the definition of $f^{ij}$ we used only the triangle below the diagonal of the matrix $\Sigma$, in order to avoid ambiguity due to the symmetry of $\Sigma$. In fact one should adopt a similar convention in the parametrization of the location and shape model.

Let us consider a semiparametric multivariate location and shape model as defined by (5.25), where for each $(\mu,\Sigma) \in \mathbb{R}^q\times V$ the submodel $P^*_{\mu\Sigma}$ is the class of functions $p : \mathbb{R}^q \longrightarrow \mathbb{R}_+$ such that (5.26)-(5.33) and the additional conditions (5.34) and (5.35) given below hold. The additional conditions referred to are, for some pairs $i, j \in \{1, \ldots, q\}$, $i > j$,
$$ \sigma^{ij} = 0 , \tag{5.34} $$
and for the remaining pairs $k, l \in \{1, \ldots, q\}$, $k > l$,
$$ \sigma^{kl} \ne 0 . \tag{5.35} $$
Specifying conditions like (5.34) and (5.35) we can define whether each pair of variables (entries of the random vector of observations) is conditionally correlated or not given the other variables. This corresponds to specifying a covariance selection model.


Note that condition (5.34) can be put in the form of condition (5.5), with restriction function the constant function equal to 0 and $M(\mu,\Sigma) = f^{ij}(\sigma_{11}, \ldots, \sigma_{qq})$. Since the constant functions are in $L^2$, the associated condition (5.4) holds also. By the same argument the condition (5.35) can be expressed as a condition of the type of condition (5.6). We conclude that the covariance selection model described above is a $L^2$-restricted model, even though in a rather artificial form.

We consider now the case where the covariance matrix is not part of the parameter space and is unknown. In this case we cannot use (5.34) or (5.35) to state that a certain entry of the inverse of the covariance matrix vanishes or not, because $\sigma^{ij}$ is not a function of the fixed parameter of interest as before. We have to introduce instead a condition in terms of the mixed second moments, as done below.

Consider for each $\mu \in \mathbb{R}^q$ the class of probability densities on $\mathbb{R}^q$ given by
$$ P^*_\mu = \{ p : \mathbb{R}^q \longrightarrow \mathbb{R}_+ : \text{(5.37)-(5.43) hold} \} . \tag{5.36} $$

The conditions referred to are, for some given pairs $(i,j)$ and $(k,l)$, with $i, j, k, l \in \{1, \ldots, q\}$, $i > j$ and $k > l$,
$$ p \in C^1; \tag{5.37} $$
$$ \int_{\mathbb{R}^q} p(x)\lambda(dx) = 1; \tag{5.38} $$
$$ \int_{\mathbb{R}^q} x\,p(x)\lambda(dx) = \mu; \tag{5.39} $$
$$ \int_{\mathbb{R}^q} |x|^2 p(x)\lambda(dx) < \infty; \tag{5.40} $$
$$ \text{for } i = 1, \ldots, q,\ \int_{\mathbb{R}^q} \{l_{/\mu_i}(x)\}^2 p(x)\lambda(dx) \in (0,\infty) ; \tag{5.41} $$
$$ f^{ij}\!\left( \int_{\mathbb{R}^q} x_1 x_1\, p(x)\lambda(dx), \ldots, \int_{\mathbb{R}^q} x_q x_q\, p(x)\lambda(dx) \right) = 0 ; \tag{5.42} $$
$$ f^{kl}\!\left( \int_{\mathbb{R}^q} x_1 x_1\, p(x)\lambda(dx), \ldots, \int_{\mathbb{R}^q} x_q x_q\, p(x)\lambda(dx) \right) \ne 0 , \tag{5.43} $$
where for $x \in \mathbb{R}^q$, $|x|^2 = x_1^2 + \cdots + x_q^2$ and $x_1, \ldots, x_q$ are the components of the vector $x$. The model in study is given by $P^* = \bigcup_{\mu\in\mathbb{R}^q} P^*_\mu$.

Note that the conditions (5.42) and (5.43) cannot be expressed in terms of the conditions (5.1)-(5.6), and hence the model presented is not a $L^2$-restricted model. We will study this kind of model later.

We close this example by considering in detail the case where the dimension is $q = 3$, which is relatively simple. Let us consider the case where the pair formed by the first and the second variables is conditionally uncorrelated, and the other pairs are conditionally correlated, given the other variables. The conditions in terms of the unknown entries of the inverse of the covariance matrix are

$$ \sigma^{12} = 0 \,, \quad \sigma^{13} \ne 0 \quad \text{and} \quad \sigma^{23} \ne 0 , $$
or equivalently,
$$ \sigma_{13}\sigma_{23} - \sigma_{12}\sigma_{33} = 0 \,, \quad \sigma_{12}\sigma_{23} - \sigma_{13}\sigma_{22} \ne 0 \quad \text{and} \quad \sigma_{13}\sigma_{12} - \sigma_{11}\sigma_{23} \ne 0 . $$
In terms of the second moments these conditions read
$$ \int_{\mathbb{R}^3} x_1x_3\,p(x)\lambda(dx) \int_{\mathbb{R}^3} x_2x_3\,p(x)\lambda(dx) - \int_{\mathbb{R}^3} x_1x_2\,p(x)\lambda(dx) \int_{\mathbb{R}^3} x_3x_3\,p(x)\lambda(dx) = 0 , $$
$$ \int_{\mathbb{R}^3} x_1x_2\,p(x)\lambda(dx) \int_{\mathbb{R}^3} x_2x_3\,p(x)\lambda(dx) - \int_{\mathbb{R}^3} x_1x_3\,p(x)\lambda(dx) \int_{\mathbb{R}^3} x_2x_2\,p(x)\lambda(dx) \in (-\infty,0)\cup(0,\infty) , $$
$$ \int_{\mathbb{R}^3} x_1x_3\,p(x)\lambda(dx) \int_{\mathbb{R}^3} x_1x_2\,p(x)\lambda(dx) - \int_{\mathbb{R}^3} x_1x_1\,p(x)\lambda(dx) \int_{\mathbb{R}^3} x_2x_3\,p(x)\lambda(dx) \in (-\infty,0)\cup(0,\infty) . $$
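The equivalence between $\sigma^{12} = 0$ and the cofactor condition $\sigma_{13}\sigma_{23} - \sigma_{12}\sigma_{33} = 0$ can be checked on a concrete matrix. In the sketch below the numerical values are arbitrary illustrative choices; the covariance is built by inverting a precision matrix whose $(1,2)$ entry vanishes:

```python
import numpy as np

# Precision matrix with K[0,1] = 0 (values chosen arbitrarily), so that the
# covariance S = K^{-1} satisfies sigma^{12} = 0.
K = np.array([[2.0, 0.0, 0.5],
              [0.0, 1.5, 0.7],
              [0.5, 0.7, 1.8]])
S = np.linalg.inv(K)     # covariance matrix Sigma

# cofactor form of the conditions: sigma^{12} = 0 iff s13*s23 - s12*s33 = 0,
# and similarly for sigma^{13}, sigma^{23}.
c12 = S[0, 2] * S[1, 2] - S[0, 1] * S[2, 2]
c13 = S[0, 1] * S[1, 2] - S[0, 2] * S[1, 1]
c23 = S[0, 2] * S[0, 1] - S[0, 0] * S[1, 2]

assert abs(c12) < 1e-12                     # pair (1,2) conditionally uncorrelated
assert abs(c13) > 1e-12 and abs(c23) > 1e-12
```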


5.3.4 Linear structural relationships models (LISREL)

We consider in this section a very rich class of multivariate models called the linear structural relationships models (LISREL). This class contains some important models, for example the multivariate linear regression models, the factor analysis models and the path analysis models.

The scenario of LISREL models is the following. Suppose that we observe a $p$-dimensional random vector $Y$ and a $q$-dimensional random vector $X$. The vectors $Y$ and $X$ are linearly related to two (non-observable) latent variables $\eta$ and $\xi$, with dimensions $m\times 1$ and $n\times 1$ respectively, through the following equations,
$$ Y = \Lambda_y \eta + \epsilon \,, \qquad X = \Lambda_x \xi + \delta . $$
Here $\Lambda_y$ and $\Lambda_x$ are $p\times m$ and $q\times n$ matrices of parameters respectively; $\epsilon$ and $\delta$ are $p\times 1$ and $q\times 1$ random vectors respectively with
$$ \mathrm{cov}(\epsilon) = \Theta_\epsilon \,; \quad \mathrm{cov}(\delta) = \Theta_\delta \,; \quad \mathrm{cov}(\xi) = \Phi \,; \quad E(\xi) = 0 \,; \quad E(\eta) = 0 . $$
We assume that $\Theta_\epsilon$, $\Theta_\delta$ and $\Phi$ are given positive definite matrices. Here $E(\,\cdot\,)$ and $\mathrm{cov}(\,\cdot\,)$ are the expectation and covariance operators respectively. The latent variables are related through the following equation,
$$ \eta = \Gamma \xi + \zeta , $$
where $\Gamma$ is a $m\times n$ matrix of parameters and $\zeta$ is a $m\times 1$ random vector with
$$ E(\zeta) = 0 \,; \quad \mathrm{cov}(\zeta) = \Psi . $$
Here $\Psi$ is a given positive definite matrix. We assume additionally that $\zeta$, $\epsilon$ and $\delta$ are mutually uncorrelated; $\epsilon$ and $\eta$ are uncorrelated; and $\delta$ is uncorrelated with $\xi$. It is assumed that $m$, $n$, $p$ and $q$ are such that the number of parameters is less than or equal to the number of entries of the inferior triangle of the joint covariance matrix of $Y$ and $X$, i.e. $(p+q)(p+q+1)/2$. Some of the entries of $\Theta_\epsilon$, $\Theta_\delta$, $\Psi$ or $\Phi$ can sometimes be considered as parameters.

The class of possible joint distributions of Y and X with the properties mentioned

above is said to be a Linear structural relationships model (LISREL) (see Johnson and

Wichern, 1992). Sometimes in the literature maximum likelihood estimation is performed

based on the assumption of multivariate normal distributions for the random quantities.

However, in general, it is not specified any particular distribution for the random quanti-

ties referred and moment estimation via the covariance matrix is performed. When using

5.3. EXAMPLES 123

LISREL models one is interested in modeling the covariance structure of a group of vari-

ables (for example in factor analysis) and an alternative definition of a LISREL model

is given by taking the class of distributions with a special covariance structure. The

following covariance structure can be easily obtained from the model description given

above:

cov(Y ) = Λy (ΓΦΓT + Ψ)ΛyT + Θε , (5.44)

cov(X) = Λx ΦΛxT + Θδ (5.45)

and

cov(Y, X) = Λy ΓΦΛxT . (5.46)

We show that the LISREL models are examples of L2 - restricted models. The conditions (5.44)-(5.46) can be expressed in terms of conditions of the type of (5.5) with integrands given by, for all x = (x1 , . . . , xp+q ) ∈ IRp+q and for i, j = 1, . . . , p + q, i ≥ j,

fij (x) = xi xj ,

(here we index the functions fi ’s given in (5.5) by a double index) and the left side of the

equation (i.e. Mij (θ)) determined by the equations (5.44)-(5.46).

If the matrices Θ , Θδ , Ψ or Φ are viewed as parameters (or some entries as parameters

and some entries known) the model is still a L2 - restricted model. In this case the right

side of the equations of type (5.5) becomes more complicated. However, if some entries

of Θ , Θδ , Ψ or Φ are nuisance parameters (i.e. are not interest parameters and are

unknown), then the model in general is no longer a L2 - restricted model. An extension of

the L2 restricted models given in the next chapter will cover these cases.

Example 8 This example is extracted from Johnson and Wichern (1992, page 445). Sup-

pose that we want to model firm performance and managerial talent in a certain economic

system. Since these two quantities cannot be measured directly, one might represent them,

by two latent variables, say η and ξ, and use a set of related observable variables in a

LISREL model. In this example the firm performance is characterized by the profit, Y1 ,

and the common stock price, Y2 . The managerial talent is represented by the time of

chief executive experience, X1 and memberships of boards of directors, X2 . A LISREL

model with the specifications given below is pointed out by Johnson and Wichern (1992) as an

alternative for modeling this situation. Take m = n = 1, p = q = 2,

" # " # " # " #

1 λ2 θ1 0 θ3 0

Λy = , Λx = , Θ = , Θδ = ,

λ1 1 0 θ2 0 θ4

124 CHAPTER 5. SEMIPARAMETRIC MODELS WITH L2 RESTRICTIONS

var(ξ) = Φ and var(ζ) = Ψ. Denoting the covariance of the ith and the jth entries of

the joint vector (Y T |X T )T by κij we obtain from the covariance structure of a LISREL

model the following equations:

κ11 = γ²φ + ψ + θ1 , κ12 = λ1 (γ²φ + ψ),
κ22 = λ1² (γ²φ + ψ) + θ2 , κ13 = λ2 γφ,
κ14 = γφ, κ23 = λ1 λ2 γφ,
κ24 = λ1 γφ, κ33 = λ2² φ + θ3 ,
κ34 = λ2 φ, κ44 = φ + θ4 .

The parameters were estimated by replacing the second order moments in the equations

above by the sample moments and solving for the parameters. We will show that this

intuitively appealing procedure can be justified by the theory of semiparametric models.
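The solve-for-the-parameters step admits a simple closed form in this example. The sketch below is our illustration, not taken from Johnson and Wichern: the helper name `moment_estimates`, the particular inversion order, and the use of the entries κ11 and κ12 are assumptions, and the formulas require κ14 ≠ 0.

```python
def moment_estimates(k):
    """Solve the moment equations of Example 8 for the parameters.

    `k` maps the pair (i, j), i <= j, to the second order moment kappa_ij
    of the joint vector (Y1, Y2, X1, X2).  One possible closed-form
    inversion (an illustrative assumption), valid when kappa_14 != 0.
    """
    gphi = k[(1, 4)]                 # kappa_14 = gamma * phi
    lam1 = k[(2, 4)] / gphi          # kappa_24 = lambda_1 * gamma * phi
    lam2 = k[(1, 3)] / gphi          # kappa_13 = lambda_2 * gamma * phi
    phi = k[(3, 4)] / lam2           # kappa_34 = lambda_2 * phi
    gamma = gphi / phi
    var_eta = k[(1, 2)] / lam1       # kappa_12 = lambda_1 * (gamma^2 phi + psi)
    return {
        "lambda1": lam1, "lambda2": lam2, "gamma": gamma, "phi": phi,
        "psi": var_eta - gamma ** 2 * phi,
        "theta1": k[(1, 1)] - var_eta,
        "theta2": k[(2, 2)] - lam1 ** 2 * var_eta,
        "theta3": k[(3, 3)] - phi * lam2 ** 2,
        "theta4": k[(4, 4)] - phi,
    }
```

Feeding the sample second order moments into `k` gives exactly the moment estimator discussed in the text.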


Consider a growth curve data such that for a given subject, out of several independent

subjects, we observe a data vector X = (X0 , X1 , . . . , Xn )T representing observations taken

at times t0 = 0 < t1 < . . . < tn , respectively. We will assume that X0 and the increments

∆Xi ≡ Xi − Xi−1 , for i = 1, . . . , n, are distributed according to distributions all contained

in a family of distributions F parametrized in an identifiable form by the mean and the

variance, with F (µ, σ 2 ) representing the element of F with mean µ and variance σ 2 . The

model states that

X0 ∼ F (α, ηV1 (α/η)) (5.47)

and, for i = 1, . . . , n,

∆Xi ∼ F (β∆ti , λV1 (β/λ)∆ti ) , (5.48)

where ∆ti ≡ ti − ti−1 , V1 is a suitable given function and α, β, η and λ are parameters.

Moreover, suppose that X0 , ∆X1 , . . . , ∆Xn are uncorrelated. It is easy to see that we

have the following moment structure:

E(X) = (α, α + βt1 , . . . , α + βtn )T (5.49)

and

" #

ηV1 (α/η) ηV1 (α/η)eT

Cov(X) = (5.50)

ηV1 (α/η)e ηV1 (α/η)E + λV1 (β/λ)T

where e = (1, . . . , 1)T , E = [Eij ] and T = [Tij ] are n × n matrix with Eij = 1 for all

(i, j) and Tij = Tji = ti for j ≥ i. Hence the model is suitable for studying linear growth from data with non-constant variance over time. When the family F is an exponential

dispersion model with variance function V1 and parametrized in a suitable way (such that

X satisfies (5.49) and (5.50)) the model described above coincides with the latent growth

process studied in Jørgensen, Labouriau and Lundbye-Christensen (1996). Consider now

the class F of all families of distributions in IR parametrized by the mean and the variance

in an identifiable form, for which all the members of the family possess finite moments of

fourth order. For each F ∈ F we can generate a linear growth curve model by using

(5.47) and (5.48). The union of all these models, obtained by taking each element of F

and applying the construction described above, provides a semiparametric extension of the

original parametric model described above. Adding suitable differentiability conditions

(satisfied by the exponential dispersion models, for example) one obtains a L2 - restricted

model.
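As a numerical sanity check of the covariance structure (5.50), the sketch below assembles Cov(X) for arbitrary illustrative choices of V1 and of the parameter values; the function name and all values are our assumptions, not part of the model specification.

```python
import numpy as np

def growth_cov(t, alpha, beta, eta, lam, V1):
    """Cov(X) for X = (X0, X1, ..., Xn) under (5.50).

    t: observation times (t1, ..., tn); t0 = 0 is implicit.
    """
    n = len(t)
    v0 = eta * V1(alpha / eta)       # Var(X0) = eta * V1(alpha/eta)
    w = lam * V1(beta / lam)         # increment variance per unit time
    C = np.full((n + 1, n + 1), v0)  # eta*V1(alpha/eta) fills every entry
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            C[i, j] += w * t[min(i, j) - 1]  # T_ij = t_min(i,j)
    return C
```

With V1 the identity, the diagonal grows linearly in time while the increments Xi − Xi−1 stay uncorrelated with X0 and with each other, as the model prescribes.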


Let us consider now a more complicated structure. Suppose that X is not directly

observable. We can observe only a vector Y = (Y1 , . . . , Yn )T with, for i = 1, . . . , n,

Yi | X ∼ G(Xi , δV2 (Xi )) , (5.51)

where G(µ, σ²) denotes the element with mean µ and variance σ² of a family of distributions G parametrized in an identifiable form by the mean and the variance; V2 is a suitable given function and δ a dispersion parameter; and the Yi are conditionally uncorrelated given X. We can think of Y as a noisy observation of the latent linear growth process X. We have,

for i = 1, . . . , n,

E(Yi ) = α + βti , (5.52)

with the covariance structure (5.53) given below.

Hence we have again a model suitable for linear growth data with variance varying with

time. If the families F and G are exponential dispersion models with variance functions

V1 and V2 respectively suitably parametrized, then the model defined by (5.47), (5.48)

and (5.51) coincides with a linear growth curve considered in Jørgensen, Labouriau and

Lundbye-Christensen (1996). The covariance matrix of Y is

Cov(Y ) = ηV1 (α/η)E + λV1 (β/λ)T + δD . (5.53)

Here D is a diagonal matrix with ith diagonal entry α + βti ; E and T are defined as before.

The component of the covariance matrix proportional to E reflects the between subjects

variation at time t0 , the component proportional to T arises because the X- process has

stationary and independent increments, and the third component proportional to D can be interpreted as a measure of the observation noise. If the family G is allowed to vary freely in a

class of distribution families, as we did before with the semiparametric model for X, one

obtains a L2 - restricted model. The restrictions corresponding to (5.5) are obtained from

equations (5.52) and (5.53).

If the matrix D is the identity and the functions V1 and V2 are constant functions, then

the model described above becomes a generalization of the univariate Brownian growth

model of Lundbye-Christensen (1991).

Linear growth in biology occurs only in a few cases. Let us generalize the model

described above to a situation where the growth does not follow a linear pattern in time.

Suppose that there is an unobserved linear latent process X defined as in (5.47) and (5.48).

We observe a growth process Y = (Y1 , . . . , Yn )T with, for i = 1, . . . , n,

Yi | X ∼ G(b(Xi ), δV2 (b(Xi ))) ,

where b is a sufficiently smooth one-to-one increasing given real function (the growth curve). We have, for i = 1, . . . , n,

E(Yi ) = b(α + βti ) ,

and Cov(Y ) is given by (5.53) with D diagonal with ith diagonal entry given by b(α + βti ).

Applying the same construction used before, we obtain a L2 - restricted model.


5.4 Efficient estimation via regular asymptotic linear estimating sequences and regular estimating functions

In this section we study two classic techniques for obtaining estimators of the interest

parameter under the semiparametric L2 restriction models, namely: regular linear asymp-

totic estimating sequences and regular estimating functions. To pursue this task the first

step is to calculate the nuisance tangent spaces.

It is convenient to introduce the following notation. For each (θ, z) ∈ Θ × Z denote the L20 - linear space spanned by the projection of the functions f1 , . . . , fk onto L20 (Pθz ) by Hk (θ, z). More precisely,

Hk (θ) = span{ f1 ( · ) − ∫X f1 (x)(dPθz /dλ)(x)λ(dx) , . . . , fk ( · ) − ∫X fk (x)(dPθz /dλ)(x)λ(dx) } .

The orthogonal complement of Hk (θ, z) in L20 (Pθz ) is denoted by Hk⊥ (θ, z). Note that Hk

depends only on θ, but Hk⊥ depends on both parameters, because we took the orthogonal

complement in L20 (Pθz ).

We denote the Lp nuisance tangent space (for 1 ≤ p ≤ ∞) and the weak (or Hellinger) nuisance tangent space at (θ, z) ∈ Θ × Z by TN^p (θ, z) and TN^w (θ, z) respectively (see Labouriau, 1996a for details).

Theorem 13 Under the L2 - restricted semiparametric model the nuisance tangent spaces are given, for θ ∈ Θ and z ∈ Z, by

TN^1 (θ, z) = TN^w (θ, z) = TN^2 (θ, z) = Hk⊥ (θ, z) .

The proof of theorem 13 is given in two lemmas in appendix 5.5.1. First it is proved that TN^1 ⊆ Hk⊥ (lemma 9) by using an argument based on the existence of almost surely convergent subsequences of each L1 convergent sequence. The second step is to prove that Hk⊥ ⊆ TN^2 (lemma 10), which is done by directly verifying that the continuous compactly supported elements of Hk⊥ are in the L2 nuisance tangent set. The inclusion follows from the fact that the class of continuous compactly supported functions is dense in L2 , provided the sample space is locally compact Hausdorff.

5.4. EFFICIENT ESTIMATION 129

A careful analysis of the argument presented shows that in fact theorem 13 holds

for any Lp nuisance tangent space (for 1 ≤ p ≤ ∞) or even for weaker notions of nuisance

tangent space.

Note that the nuisance tangent space is determined only by the functions involved

in the condition (5.5). The restrictions associated with (5.6) produce no effects on the

nuisance tangent space. Note that condition (5.5) can be improved by replacing the

property of the integral being equal to a constant by the property of belonging to a real

set that possesses only isolated points.

The condition (5.4) associated with (5.5) is indeed necessary, otherwise the space Hk

would not be in L2 , which would produce irregularities in the nuisance tangent spaces.

That is, we should indeed use square integrable functions to restrict the model.

We calculate now the so-called efficient score function of a well behaved L2 - restricted model. It is presupposed that the parametric partial score function is well defined and regular, in the sense given below. Consider the function l/θ : X × Θ × Z −→ IRq with components l/θ1 , . . . , l/θq : X × Θ × Z −→ IR given by, for each θ0 ∈ Θ, z0 ∈ Z and i ∈ {1, . . . , q},

l/θi ( · , θ0 , z0 ) = ∂/∂θi log{(dPθz0 /dλ)( · )} |θ=θ0 .

It is assumed that for each (θ0 , z0 ) ∈ Θ × Z and each i ∈ {1, . . . , q}, l/θi ( · , θ0 , z0 ) is well

defined and is in L2 (Pθ0 z0 ).

The efficient score function is obtained by orthogonal projecting the parametric partial

score function on the orthogonal complement of the nuisance tangent space. Note that

different versions of the efficient score function are obtained by using alternative definitions

of the nuisance tangent space. However, from theorem 13 the different notions of tangent

space coincide for the models we work with. Hence we can just reffer generically to the

efficient score function.

Consider the function l/θE : X × Θ × Z −→ IRq with components l/θE1 , . . . , l/θEq : X × Θ × Z −→ IR given by, for each (θ0 , z0 ) ∈ Θ × Z and i ∈ {1, . . . , q},

l/θEi ( · , θ0 , z0 ) = Π( l/θi ( · , θ0 , z0 ) | TN⊥ (θ0 , z0 ) ) ,

where Π( · | A) is the orthogonal projection operator onto A ⊆ L20 (Pθ0 z0 ) and TN⊥ (θ0 , z0 ) is the orthogonal complement of the nuisance tangent space at (θ0 , z0 ) in L20 (Pθ0 z0 ). The function l/θE is the efficient score function.


We calculate now the efficient score function at (θ, z) ∈ Θ × Z. Denote the result of a Gram-Schmidt orthonormalization process, in the space L2 (Pθz ), applied to the functions 1, f1 , . . . , fk by 1, ξ1θz , . . . , ξkθz . Since ξ1θz , . . . , ξkθz form an orthonormal basis of Hk = TN⊥ (θ, z), the projection of l/θi ( · , θ, z) onto TN⊥ (θ, z) is, for i = 1, . . . , q,

l/θEi ( · , θ, z) = Σ_{j=1}^{k} ⟨l/θi ( · , θ, z), ξjθz ( · )⟩L2 (Pθz ) ξjθz ( · ) , (5.56)

where ⟨ · , · ⟩L2 (Pθz ) is the inner product of L2 (Pθz ). The representation above can be written in matrix form in the following way: for each (θ, z) ∈ Θ × Z,

lE ( · ; θ, z) = A(θ, z)ξ θz ( · ) , (5.57)

where ξ θz ( · ) = (ξ1θz ( · ), . . . , ξkθz ( · ))T and A(θ, z) = [aij (θ, z)] is the q × k matrix with entries

aij (θ, z) = ⟨ξj , l/θi ⟩θz , i = 1, . . . , q , j = 1, . . . , k . (5.58)

The covariance of the efficient score function is then

Covθz {lE ( · ; θ, z)} = A(θ, z)Covθz {ξ θz ( · )}AT (θ, z) = A(θ, z)AT (θ, z) . (5.59)

The inverse of the covariance matrix above provides a lower bound for the asymptotic

variance of regular asymptotic linear estimating sequences. Here we use the partial order

of matrices (i.e. A ≥ B iff A − B is positive semidefinite).

5.4.3 Regular estimating functions

In this section we characterize the class of regular estimating functions for the L2 restricted

models considered so far.

Let us recall the definition of regular estimating function. A function Ψ : X ×Θ −→ IRq

is said to be a regular estimating function when (i)-(v) given below hold. The properties referred to are, for i = 1, . . . , q:

(i) for all (θ, z) ∈ Θ × Z, Eθz {ψi ( · ; θ)} = 0;

(ii) for all (θ, z) ∈ Θ × Z, ψi ( · ; θ) ∈ L2 (Pθz ) and ψi (x; · ) is differentiable with respect to θ for λ- almost all x;

(iii) for j = 1, . . . , q,

∂/∂θj { ∫X ψi (x; θ)(dPθz /dλ)(x)λ(dx) } = ∫X ψi (x; θ) ∂/∂θj {(dPθz /dλ)(x)} λ(dx) ;

(iv) for all (θ, z) ∈ Θ × Z, the sensitivity matrix SΨ (θ, z) := Eθz {∇θ Ψ( · ; θ)} is non singular;

(v) for all (θ, z) ∈ Θ × Z, Eθz {Ψ( · ; θ)ΨT ( · ; θ)} is a positive definite matrix.

Here ψ1 , . . . , ψq are the components of the function Ψ and Eθz (X) is the expectation of a

random vector X under Pθz .

Theorem 14 Under the semiparametric L2 - restricted model we have that any regular

estimating function Ψ : X × Θ −→ IRq can be expressed in the following form, for all

θ ∈ Θ and for i = 1, . . . , q,

ψi ( · ; θ) = Σ_{j=1}^{k} αij (θ){fj ( · ) − Mj (θ)} . (5.60)

Proof: From the characterization of regular estimating functions in terms of nuisance tangent spaces (see chapter 3 or Jørgensen and Labouriau, 1996, chapter 4), for all θ ∈ Θ and i ∈ {1, . . . , q},

ψi ( · ; θ) ∈ ∩z∈Z TN^2⊥ (θ, z) . (5.61)

From theorem 13, for each z ∈ Z,

TN^2⊥ (θ, z) = span {f1 ( · ) − M1 (θ), . . . , fk ( · ) − Mk (θ)} .

Since, for θ fixed and for i = 1, . . . , k, Mi (θ) does not depend on z, TN^2⊥ (θ, z) in fact does not depend on z and (5.61) becomes (5.60). ⊓⊔

The representation (5.60) can be written in matrix form in the following way:

Ψ( · ; θ) = α(θ){f ( · ) − M (θ)} , (5.62)

where α(θ) = [αij (θ)] is the q × k matrix with entries αij (θ), f ( · ) = (f1 ( · ), . . . , fk ( · ))T and M (θ) = (M1 (θ), . . . , Mk (θ))T .

The sensitivity matrix of Ψ (see chapter 3 or Jørgensen and Labouriau, 1996, chapter 4) is, for each θ ∈ Θ and z ∈ Z,

SΨ (θ, z) = −α(θ)∇M (θ) . (5.63)

This can be easily seen by differentiating Ψ and observing that Eθz {f ( · ) − M (θ)} = 0. The variability matrix of Ψ is given by, for each θ ∈ Θ and z ∈ Z,

VΨ (θ, z) = α(θ)Covθz {f ( · )}αT (θ) , (5.64)

so that the Godambe information is

JΨ (θ, z) := SΨT (θ, z)VΨ−1 (θ, z)SΨ (θ, z) = {∇M (θ)}T [Covθz {f ( · )}]−1 {∇M (θ)} .

The condition (iv) in the definition of regular estimating function says that, for all (θ, z) ∈ Θ × Z, SΨ (θ, z) is non singular. We have then rank(SΨ (θ, z)) = q. Since (see Mardia et al., 1979, page 464, formula A.4.2.d)

rank(SΨ (θ, z)) = rank(α(θ)∇M (θ)) ≤ min{rank(α(θ)), rank(∇M (θ))} ,

α(θ) and ∇M (θ) must be of full rank (i.e. have rank equal to q).

We study now the roots of a regular estimating function Ψ based on a repeated inde-

pendent sample x = (x1 , . . . , xn )T . The estimator of θ based on Ψ and x is given by θ̂

such that

n

X

Ψ(xl ; θ̂) = 0 , (5.65)

l=1

which is equivalent to

0 = Σ_{l=1}^{n} α(θ̂){f (xl ) − M (θ̂)} = α(θ̂) Σ_{l=1}^{n} {f (xl ) − M (θ̂)} = n α(θ̂){ (1/n) Σ_{l=1}^{n} f (xl ) − M (θ̂) } = n α(θ̂){IP n f (x) − M (θ̂)} .


Here IP n is the empirical sample operator given by IP n g(x) = (1/n) Σ_{l=1}^{n} g(xl ), for any g : X −→ IRq . Since the rank of α(θ̂) is q, we can construct a q × q matrix γ(θ̂) which has as columns q linearly independent columns of α(θ̂). The system (5.65) is equivalent to

0 = γ(θ̂){IP n f (x) − M (θ̂)} .

Since the rank of γ(θ̂) is q, the matrix γ(θ̂) is non singular and the system above has the unique solution M (θ̂) = IP n f (x). Hence we proved the following.

Theorem 15 Under a L2 - restricted model, the only possible estimator for the interest

parameter θ based on a regular estimating function and a sample x = (x1 , . . . , xn )T is

θ̂n = θ̂n (x) such that

M (θ̂n ) = IP n f (x) := (1/n) Σ_{l=1}^{n} f (xl ) . (5.66)

In other words, under a L2 - restricted model, the moment estimator (5.66) is the only estimator that can arise from the method of (regular) estimating functions. This shows how poor this class of estimators is for the models under discussion. In particular, any optimality theory for estimating functions (such as maximizing the Godambe information) is absolutely meaningless for L2 - restricted models.
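A minimal concrete instance of the moment estimator (5.66): assuming the two-cumulant location-scale restrictions f (x) = (x, x²) and M (µ, σ²) = (µ, µ² + σ²) (our illustrative choice, under which M can be inverted in closed form), the estimator reduces to matching the empirical first two moments.

```python
def moment_estimator(sample):
    """Moment estimator (5.66) for the location-scale model restricted by
    its first two moments: f(x) = (x, x^2), M(mu, sigma^2) = (mu, mu^2 + sigma^2).
    The closed-form inversion of M is specific to this illustrative choice."""
    n = len(sample)
    m1 = sum(sample) / n                 # IP_n f_1
    m2 = sum(x * x for x in sample) / n  # IP_n f_2
    mu = m1                              # first coordinate of M^{-1}
    sigma2 = m2 - m1 * m1                # second coordinate of M^{-1}
    return mu, sigma2
```

The result is, unsurprisingly, the sample mean and the (divisor-n) sample variance.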

We give next conditions for having consistency and asymptotic normality of the mo-

ment estimators arising from regular estimating functions.

Theorem 16 Consider a L2 - restricted model, a regular estimating function Ψ and the

estimator θ̂n (based on a sample of size n) given by the solution of (5.66) in theorem 15.

i) If the application θ 7→ M (θ) from Θ ⊆ IRq to IRq is invertible with continuous inverse,

then {θ̂n } is consistent;

ii) If in addition to the assumptions of the previous item, each Mj ( · ) and αij ( · ) are

twice differentiable, then

√n (θ̂n − θ) −→D N (0, JΨ−1 (θ, z)) .

Proof:

i) Let M −1 be the inverse of M , i.e. for all θ ∈ Θ, M −1 {M (θ)} = θ. The law of large numbers gives

IP n f −→P Eθz (f ) = M (θ) .


Since M −1 is continuous,

θ̂n = M −1 {IP n f } −→P M −1 {M (θ)} = θ .

ii) Take (θ, z) ∈ Θ × Z fixed. Note that, by the assumptions, the entries of Eθz {∇∇Ψ( · ; θ)} are finite. Moreover, from the previous item, θ̂n −→P θ. Using the results 4.9 and 4.10 in Jørgensen and Labouriau (1996, pages 10 and 171) we conclude the proof. ⊓⊔

5.4.4 Robustness of the estimators derived from regular estimating functions

We study next the robustness of the estimators of the type (5.66), i.e. the estimators derived from regular estimating functions. Since the right side of (5.66) is a mean, its Hampel influence function is given by

IF (x) = f (x) − M (θ) .

Assuming M invertible and differentiable and applying the chain rule for the Hampel influence function (see Labouriau, 1989) we obtain that the Hampel influence function of the estimator given by (5.66) is

IF (x) = {∇M (θ)}−1 {f (x) − M (θ)} .

Hence the Hampel influence function of an estimator derived from a regular estimating function is bounded if and only if the function f is bounded. That is, θ̂n is resistant to gross errors, i.e. B- robust, if and only if the function f is bounded. On the other hand, the Hampel influence function is continuous (in x) if and only if f is continuous, and that is the condition for having bounded sensitivity to local shifts, i.e. V - robustness.
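The criterion can be made concrete in the same two-moment location-scale instance used above (an illustrative assumption, not the general case): there the influence function {∇M (θ)}⁻¹{f (x) − M (θ)} works out to (x − µ, (x − µ)² − σ²), which is continuous in x but unbounded, so the estimator is V- robust but not B- robust.

```python
def influence(x, mu, sigma2):
    """Hampel influence function of the moment estimator of (mu, sigma^2)
    in the two-moment location-scale model (a worked illustration):
    {grad M}^{-1} {f(x) - M} with f(x) = (x, x^2), M = (mu, mu^2 + sigma^2)."""
    if_mu = x - mu
    # applying {grad M}^{-1} = [[1, 0], [-2 mu, 1]] to (x - mu, x^2 - mu^2 - sigma^2)
    if_s2 = (x - mu) ** 2 - sigma2
    return if_mu, if_s2
```

Evaluating at increasingly extreme x shows the unbounded growth responsible for the lack of B- robustness.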

5.4.5 Attainability of the semiparametric Cramér-Rao bound via estimating functions

We treat next the question of whether the lower bound for the asymptotic covariance of regular asymptotic linear estimating sequences is attained by estimators obtained from regular estimating functions. First, note that for each (θ, z) ∈ Θ × Z we have

JΨ (θ, z) ≤ Covθz {lE ( · ; θ, z)} ,

which allows us to compare the semiparametric bound with the bound for the concentration of estimators derived from estimating functions.


We show that the efficient score function is a generalized estimating function. Recall

that the efficient score function has representation

lE ( · ; θ, z) = A(θ, z)ξ θz ( · ) ,

where A(θ, z) is given by (5.58). On the other hand, f1 ( · ) − M1 (θ), . . . , fk ( · ) − Mk (θ)

generate the same space as the orthonormal basis {ξ1θz ( · ), . . . , ξkθz ( · )}. Then there

exists a nonsingular matrix, say B(θ, z) such that

ξ θz ( · ) = B(θ, z){f ( · ) − M (θ)} .

Hence,

lE ( · ; θ, z) = A(θ, z)B(θ, z){f ( · ) − M (θ)} .

We conclude that lE ( · ; θ, z) is equivalent to {f ( · ) − M (θ)} provided A(θ, z) is of full rank. An application of proposition 17 in chapter 3 shows that in fact the semiparametric Cramér-Rao bound is attained by the estimating function given in theorem 15 (i.e. given by (5.66)), provided A(θ, z) is of full rank. We have proved then the following result.

Proposition 22 Suppose that for each (θ, z) ∈ Θ × Z the matrix A(θ, z) given by (5.58) is of full rank. Then the semiparametric Cramér-Rao bound is attained by the moment estimator (5.66).

The following construction gives a more precise result about the attainability of the

semiparametric Cramér-Rao bound for L2 - restricted models. The first step will be to

obtain an alternative representation of regular estimating functions that will permit us

to compare the Godambe information with the covariance of the efficient score function.

Take (θ, z) ∈ Θ × Z fixed. Define ξ θz ( · ) = (ξ1θz ( · ), . . . , ξkθz ( · ))T , where 1, ξ1θz , . . . , ξkθz is the result of an orthonormalization process (with respect to the inner product of L2 (Pθz )) applied to 1, f1 , . . . , fk . That is, the equations of type (5.5) become, for each (θ0 , z0 ) ∈ Θ × Z and for i = 1, . . . , k,

∫X ξiθz (x)(dPθ0 z0 /dλ)(x)λ(dx) = Miθz (θ0 ) .

Note that by construction Miθz (θ) = 0. From theorem 14, any regular estimating function Ψ evaluated at θ can be represented as

Ψ( · ; θ) = γ θz (θ){ξ θz ( · ) − M θz (θ)} = γ θz (θ)ξ θz ( · ) ,

where M θz (θ0 ) = (M1θz (θ0 ), . . . , Mkθz (θ0 ))T . The sensitivity at (θ, z) is given by

SΨ (θ, z) = −γ θz (θ)∇M θz (θ) ,

the variability is

VΨ (θ, z) = γ θz (θ)Covθz {ξ θz }γ θz (θ)T = γ θz (θ)γ θz (θ)T ,

and the Godambe information is

JΨ (θ, z) = ∇M θz (θ)T [Covθz {ξ θz }]−1 ∇M θz (θ) = ∇M θz (θ)T ∇M θz (θ) .

Comparing the Godambe information given above with the variance of the efficient score

function given by (5.59) and using theorem 16 yields the following result.

Theorem 17 Consider a L2 - restricted model for which the assumptions of theorem 16

(item ii)) hold. Then the bound for the asymptotic variance of regular asymptotic linear

estimating sequences is attained by a regular estimating function at (θ, z) ∈ Θ × Z if and

only if

∇M θz (θ)T = A(θ, z) , (5.67)

where A(θ, z) is the matrix defined in (5.58).

We will present one example where the semiparametric bound is not attained by estimators

derived from regular estimating functions. In fact, in the example the bound is not

attained even by regular asymptotic linear estimating sequences.

5.4.6 Examples

In this section the examples considered before are revisited.

Let us consider a one dimensional location and scale model, as in section 5.3.1, with the first k standardized cumulants fixed. The nuisance tangent space at (µ, σ, a) ∈ IR × IR+ × A is given by

TN (µ, σ, a) = Hk⊥ = [ span{ ( · − µ)/σ , . . . , (( · − µ)/σ)^k − mk } ]⊥ .

Any regular estimating function Ψ = (ψ1 , ψ2 )T is of the form, for i = 1, 2,

ψi ( · ; µ, σ) = Σ_{j=1}^{k} αij (µ, σ){ (( · − µ)/σ)^j − mj } . (5.68)

The following theorem gives a sufficient condition for characterizing a regular estimating

function.


Theorem 18 A function Ψ : IR × IR+ −→ IR² of the form (5.68) is a regular estimating function, provided the functions αij (i = 1, 2 and j = 1, . . . , k) are differentiable and, for all (µ, σ) ∈ IR × IR+ ,

( Σ_{j=1}^{k} j α1j (µ, σ) m_{j−1} )( Σ_{j=1}^{k} j α2j (µ, σ) m_j ) − ( Σ_{j=1}^{k} j α2j (µ, σ) m_{j−1} )( Σ_{j=1}^{k} j α1j (µ, σ) m_j ) ≠ 0 . (5.69)

In the case k = 2 the condition (5.69) in the theorem above becomes equivalent to saying that the matrix α(µ, σ) is non singular. Taking the functions α11 (µ, σ) = α12 (µ, σ) = α21 (µ, σ) = α22 (µ, σ) = 1, for all µ ∈ IR and σ ∈ IR+ , one obtains an estimating function which is not regular (it does not satisfy (iv)), but has all its components in ∩a∈A TN⊥ (µ, σ, a). This shows that we do need condition (5.69) for characterizing regular estimating functions.

We verify next whether the roots of regular estimating functions attain the bound for the asymptotic covariance of regular asymptotically linear estimating sequences in the case

of the location and scale model. Consider first the case where k = 2. Take a fixed value

(µ, σ, a) in the parameter space. Direct calculation yields the following representation for

the efficient score function

( 2 )

· −µ 1 · −µ

ξ1µσa ( · ) = and ξ2µσa ( · ) = − m3 (a) − 1 ,

σ ∆2 (a) σ

" 1 −2µ

hξ1µσa , l/µ iµσa hξ1µσa , l/σ iµσa

" # #

σ ∆2 (a)σ 2

A(µ, σ, a) = = .

hξ2µσa , l/µ iµσa hξ2µσa , l/σ iµσa 0 2

∆2 (a)σ

Note that m3 (a) and ∆2 (a) depend on the fixed density a. For each (µ0 , σ0 ) ∈ IR × IR+ ,

M1µσa (µ0 , σ0 ) := Eµ0 σ0 a {ξ1µσa ( · )} = (µ0 − µ)/σ

and

M2µσa (µ0 , σ0 ) := Eµ0 σ0 a {ξ2µσa ( · )} = (1/∆2 (a)){ (σ0² − 2µµ0 − µ²)/σ² − m3 (a)(µ0 − µ)/σ − 1 } .

Hence

∇M µσa (µ0 , σ0 )|(µ0 ,σ0 )=(µ,σ) = | 1/σ              0           |
                                 | −2µ/(∆2 (a)σ²)   2/(∆2 (a)σ) | = AT (µ, σ, a) .

We conclude from theorem 17 that the bound for regular asymptotically linear estimating sequences is attained by estimators based on regular estimating functions, for the semiparametric location and scale model with two standardized cumulants fixed. It is easy

to see that the moment estimator associated with any regular estimating function in this

example is the sample mean and the sample variance.

An alternative argument is that the global tangent space is the whole L20 , which implies the efficiency of any regular asymptotic linear estimating sequence, in particular of the roots of regular estimating functions.

We study now the case where k = 3. The efficient score function has representation

(5.70) with ξ µσa ( · ) = (ξ1µσa ( · ), ξ2µσa ( · ), ξ3µσa ( · ))T , where ξ1µσa and ξ2µσa are as in the case

where k = 2 and

" 3 2

1 · −µ −m3 m4 − m3 · −µ

ξ3µσa ( · ) = −

∆3 σ ∆22 σ

( )

m3 (m5 − m3 m4 − m3 ) · −µ

− m4 −

∆22 σ

( )#

m5 − m3 m4 − m3

− m3 −

∆22

( 3 2 )

1 · −µ · −µ · −µ

=: + k2 + k1 + k0 .

∆3 σ σ σ

(see chapter 4 for a detailed calculation). Here ∆3 is a constant depending on the standard-

ized cumulants up to order 6 of the density a. Let us study the case where m3 = m5 = 0.

We have then

" 3 #

1 · −µ · −µ

ξ3µσa ( · ) = + m4 .

∆3 σ σ

Moreover,

⟨ξ3µσa ( · ), l/µ ( · , µ, σ, a)⟩µσa = (3σ² + 1)/∆3 .


" 3 #

1 µ0 − µ µ0 − µ

= 3(µ0 − µ) + − m4 .

∆3 σ σ

and

∂ 1 m4

M3µσa (µ0 , σ) = 3− ,

∂µ0

µ0 =µ

∆3 σ

which is not equal to hξ3µσa ( · ), l/µσa iµσa . We conclude, from theorem 17 that the bound

for regular asymptotically linear estimating sequences is not attained by estimators based on regular estimating functions. That is, the moment estimator associated with regular estimating functions does not generate an optimal regular asymptotic linear estimating sequence. Note that this conclusion easily extends to the case where k > 3.

Let us consider initially the semiparametric multivariate location model introduced in section 5.3.1. The restrictions of type (5.5) involve the functions

f1 (x) = x1 , . . . , fq (x) = xq ,

and the following functions of the interest parameter on the right hand side of (5.5):

M1 (µ) = µ1 , . . . , Mq (µ) = µq .

The orthonormal basis of Hk obtained by the Gram-Schmidt procedure is

ξ1 ( · ) = (f1 ( · ) − µ1 )/‖f1 − µ1 ‖L2 (p) , . . . , ξq ( · ) = (fq ( · ) − µq )/‖fq − µq ‖L2 (p) .

The Fourier coefficients for the expansion of the efficient score function with respect to this basis are, for i, j = 1, . . . , q with i ≠ j, ⟨l/µi , ξj ⟩L2 (p) = 0 and, for i = 1, . . . , q,

⟨l/µi , ξi ⟩L2 (p) =: ki ≠ 0 .

Hence the efficient score function is

lE (x; µ, p) = (k1 (x1 − µ1 ), . . . , kq (xq − µq ))T = A(µ, p){f (x) − M (µ)} ,

where A(µ, p) is given by (5.58) and f and M are defined as before. Note that the (i, i)th

(for i = 1, . . . , q) entry of the matrix A(µ, p) is given by

aii (µ, p) := ⟨ξi , l/µi ⟩µp = 1/‖fi ( · ) − µi ‖L2 (p) .

Moreover, for each µ∗ ∈ IRq and i = 1, . . . , q,

Miµp (µ∗ ) = (µ∗i − µi )/‖fi ( · ) − µi ‖L2 (p)

and

∂/∂µ∗i Miµp (µ∗ )|µ∗ =µ = 1/‖fi ( · ) − µi ‖L2 (p) .

Hence the diagonal of the matrices A and ∇M are equal. It is easy to see that both A

and ∇M are diagonal matrices and hence they are equal. We conclude from theorem

17 that the bound for regular asymptotically linear estimating sequences is attained by

estimators based on regular estimating functions.

It is easy to see that for the multivariate location and shape model the orthonormal basis of Hk is composed of the functions

ξi ( · ) = (fi ( · ) − µi )/‖fi − µi ‖L2 (p) , ξij ( · ) = (fij ( · ) − σij )/‖fij − σij ‖L2 (p) ,

for i, j = 1, . . . , q, i ≥ j. Here fij (x) = xi xj and σij is the (i, j)th entry of the covariance

matrix. Reasoning as in the case of the multivariate location model given above we conclude that the semiparametric Cramér-Rao bound is attained by estimating sequences derived from regular estimating functions.

5.4. EFFICIENT ESTIMATION 141

The nuisance tangent spaces of the covariance selection model defined on the location and

shape model coincide with the nuisance tangent spaces of the multivariate location and

shape model. This is an immediate consequence of the fact that the additional restrictions

inserted in the location and shape model for obtaining the covariance selection model are of

the type of condition (5.5), but with the integrands on the left side (i.e. the corresponding functions fi 's) equal to zero.

We conclude that the inference for the covariance selection model based on regular

estimating functions (and on regular asymptotic linear estimating sequences) can be done

in two steps: first estimating the mean vector and the covariance matrix as in the location

and shape model and then inverting the estimated covariance matrix.
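The two-step procedure can be sketched as follows; the use of `numpy` and the divisor-n moment estimator of the covariance are our illustrative choices, not requirements of the theory.

```python
import numpy as np

def two_step_precision(data):
    """Two-step inference sketched above: estimate the mean vector and the
    covariance matrix by moments, then invert the covariance estimate to
    obtain the estimated concentration (inverse covariance) matrix."""
    X = np.asarray(data, dtype=float)      # rows are independent observations
    mu_hat = X.mean(axis=0)
    # moment estimator of the covariance (divisor n, matching IP_n f)
    S = (X - mu_hat).T @ (X - mu_hat) / X.shape[0]
    K = np.linalg.inv(S)                   # estimated concentration matrix
    return mu_hat, K
```

For a covariance selection model proper, the zero restrictions on the concentration matrix would be imposed on (or checked against) K in a further step.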

The restrictions of type (5.5) for LISREL models are obtained by (5.44)-(5.46), i.e. the

entries of the covariance matrix of (X T , Y T )T should be equal to some functions of the

interest parameters. Hence the orthogonal complements of the nuisance tangent spaces

are given by the space spanned by the functions fij : IRp+q −→ IR, for i, j = 1, . . . , p + q, i ≥ j, corrected by their means. The referred functions are defined by

fij (z) = zi zj .

From theorem 15, the root of any regular estimating function will be the solution of the equations (5.44)-(5.46) with the covariances on the left side replaced by the corresponding sample covariances.

There are basically two estimation procedures for LISREL models described in the literature (see Johnson and Wichern, 1992, and the references therein): maximum likelihood estimation under the assumption of multivariate normality and direct solution of the equations (5.44)-(5.46) using the sample covariance matrix. The previous discussion shows that the intuitive second estimation procedure can be justified by assuming the semiparametric embedding of the model as we described and using regular asymptotic linear estimating sequences or regular estimating functions.


5.5 Appendices

5.5.1 Technical proofs related to L2 - restricted models

Lemma 9 Under the L2 - restricted semiparametric model, for each (θ, z) ∈ Θ × Z,

TN^1 (θ, z) ⊆ Hk⊥ (θ, z) .

Proof: Take (θ, z) ∈ Θ × Z fixed and denote by p the density of Pθz . For simplicity we drop p, θ and z from the notation along this proof. We prove that TN^{1,0} ⊆ Hk⊥ , which implies the lemma.

Suppose by hypothesis of absurdum that there is a ν ∈ TN^{1,0} (ν ≠ 0) such that ν ∉ Hk⊥ . Since ν ∈ L20 , ν ∈ Hk . That is, ν is of the form ν( · ) = αT f ( · ), for some α ∈ IRk and f ( · ) := ( f1 ( · ) − M1 (θ), . . . , fk ( · ) − Mk (θ) )T . Since ν ∈ TN^{1,0} , there exists a differentiable path {pt } ⊂ Pθ∗ and functions {rt } such that

pt ( · ) = p( · ){1 + t αT f ( · ) + t rt ( · )} (5.71)

and

‖rt ‖L1 −→ 0 as t ↓ 0 . (5.72)

From (5.71),

rt ( · ) = (pt ( · ) − p( · ))/(t p( · )) − αT f ( · ) .

Hence

‖rt ‖L1 = ‖ (pt ( · ) − p( · ))/(t p( · )) − αT f ( · ) ‖L1 ≥ ‖αT f ( · )‖L1 − ‖ (pt ( · ) − p( · ))/(t p( · )) ‖L1 = ‖αT f ( · )‖L1 > 0 . (5.73)

Here we used that ‖f − g‖L1 ≥ ‖f ‖L1 − ‖g‖L1 ¹ and that the remaining term involving (pt ( · ) − p( · ))/(t p( · )) vanishes, which can easily be verified. Clearly, (5.73) contradicts (5.72). The lemma follows by reductio ad absurdum. ⊓⊔

contradicts (5.72). The lemma follows by reductio ad absurdum. u

t

1

R R

For, kf − gkL1 = |f − g|pdλ ≥ |f | − |g|pdλ = kf kL1 − kgkL1 .

5.5. APPENDICES 143

Lemma 10 Under the semiparametric model with L2 restrictions, for each (θ, z) ∈ Θ × Z,

Hk⊥ (θ, z) ⊆ TN^2 (θ, z) . (5.74)

Proof: Take (θ, z) ∈ Θ × Z and ν ∈ Hk⊥ (θ, z) ∩ Cc fixed and denote by p the density of Pθz . Here Cc is the class of infinitely differentiable functions from X into IR with compact support. For simplicity we drop p, θ and z from the notation along

this proof. Define for each t ∈ IR+ the generalized sequence {pt }t∈IR+

pt ( · ) = p( · ) + tp( · )ν( · ) .

We show that for t small enough pt ∈ Pθ∗ , i.e. ν is in the L2 tangent set at (θ, z) and we

conclude that

n o 2

Hk⊥ (θ, z) ∩ Cc ⊆TN0 (θ, z) .

Since Cc is dense in L2 (Pθz ) (because X is a locally compact Hausdorff space, see Rudin,

1987), taking the closure in the expression above yields (5.74) and proves the lemma.

We verify next that for t small enough pt satisfies the conditions (5.1)-(5.6). Since

ν is continuous and compact supported, it is bounded and so is the restriction of the

continuous function p to the support of ν. Hence ν( · )p( · ) is bounded. Moreover, the

restriction of the strictly positive and continuous function p( · ) to the compact support

of ν is bounded away form zero. Hence for t small enough pt is strictly positive, i.e. (5.1)

holds.

Condition (5.2) holds for all t ∈ IR+ , because ν is continuous and infinitely many differ-

entiable.

The function pt (for each t ∈ IR+ ) integrates 1 because ν integrates 0 with respect to

p( · )λ(d · ).

The functions pt satisfy condition (5.4) because for each fj

Z Z Z

fj2 (x)pt (x)λ(dx) = fj2 (x)p(x)λ(dx) + t fj2 (x)ν(x)p(x)λ(dx)

X X X

Z Z

= fj2 (x)p(x)λ(dx) + t fj2 (x)ν(x)p(x)λ(dx)

X supp(ν)

Z Z

≤ fj2 (x)p(x)λ(dx) + tξ fj2 (x)p(x)λ(dx) ,

X X

where supp(ν) is the support of ν and ξ is a superior bounded for the bounded function

ν.

The condition (5.5) holds for pt because ν is orthogonal to each fj (j = 1, . . . , k).

144 CHAPTER 5. SEMIPARAMETRIC MODELS WITH L2 RESTRICTIONS

The following argument shows that for t small enough pt satisfies condition (5.6). We

have, for i = 1, . . . , k,

Z Z Z

gi (x)pt (x)λ(dx) − gi (x)p(x)λ(dx) = t gi (x)ν(x)p(x)λ(dx)

X X X

Z

≤ tξ gi (x)p(x)λ(dx) ,

X

function ν. Hence for t small enough

the distance between X i g (x)p t (x)λ(dx) and g

X i (x)p(x)λ(dx) is arbitrarily small. Since

X gi (x)p(x)λ(dx) ∈ Bi (θ) and Bi (θ) is a real open set, X gi (x)pt (x)λ(dx) ∈ Bi (θ), for t

R R

small enough. u

t


The following theorem corresponds to theorem 19.

Theorem (theorem 19) A function Ψ : IR × IR × IR+ −→ IR2 of the form (5.68) is a regular estimating function, provided the functions $\alpha_{ij}$ (i = 1, 2 and j = 1, …, k) are differentiable and, for all (µ, σ) ∈ IR × IR+ ,
$$ \sum_{j=1}^k j \alpha_{1j}(\mu, \sigma)\, m_{j-1} \sum_{j=1}^k j \alpha_{2j}(\mu, \sigma)\, m_j - \sum_{j=1}^k j \alpha_{2j}(\mu, \sigma)\, m_{j-1} \sum_{j=1}^k j \alpha_{1j}(\mu, \sigma)\, m_j \ne 0 . \qquad (5.75) $$

Proof: Let Ψ : IR × IR × IR+ −→ IR2 be a function of the form (5.68), with the functions $\alpha_{ij}$ (for i = 1, 2 and j = 1, …, k) differentiable and satisfying (5.75) for all (µ, σ) ∈ IR × IR+ . We show that in this case Ψ is a regular estimating function.

Condition (i) holds because, for j = 1, …, k, the functions $\left(\frac{\cdot - \mu}{\sigma}\right)^j - m_j$ are in $L_0^2(P_{\mu\sigma a})$.

The differentiability of the functions $\alpha_{ij}$ and of $\left(\frac{\cdot - \mu}{\sigma}\right)^j - m_j$ implies condition (ii).

We check the first part of condition (iii) by direct calculation (i.e. the part related to differentiability with respect to µ). Take i = 1, 2 and (µ, σ) ∈ IR × IR+ fixed. Since Ψ is unbiased,
$$ \frac{\partial}{\partial \mu} \int \psi_i(x; \mu, \sigma)\, \frac{1}{\sigma}\, a\!\left(\frac{x - \mu}{\sigma}\right) d\lambda(x) = 0 , $$
and hence the proof of the first part of (iii) reduces to the verification of
$$ \int \frac{\partial}{\partial \mu} \left\{ \psi_i(x; \mu, \sigma)\, \frac{1}{\sigma}\, a\!\left(\frac{x - \mu}{\sigma}\right) \right\} d\lambda(x) = 0 . \qquad (5.76) $$
Note that
$$ \int \frac{\partial}{\partial \mu} \left\{ \psi_i(x; \mu, \sigma) \right\} \frac{1}{\sigma}\, a\!\left(\frac{x - \mu}{\sigma}\right) d\lambda(x) = \int \frac{\partial}{\partial \mu} \left\{ \sum_{j=1}^k \alpha_{ij}(\mu, \sigma) \left[ \left(\frac{x - \mu}{\sigma}\right)^j - m_j \right] \right\} \frac{1}{\sigma}\, a\!\left(\frac{x - \mu}{\sigma}\right) d\lambda(x) $$
$$ = \sum_{j=1}^k \left[ \frac{\partial \alpha_{ij}}{\partial \mu}(\mu, \sigma) \int \left\{ \left(\frac{x - \mu}{\sigma}\right)^j - m_j \right\} \frac{1}{\sigma}\, a\!\left(\frac{x - \mu}{\sigma}\right) d\lambda(x) \right] + \sum_{j=1}^k \alpha_{ij}(\mu, \sigma) \int \frac{-j}{\sigma} \left(\frac{x - \mu}{\sigma}\right)^{j-1} \frac{1}{\sigma}\, a\!\left(\frac{x - \mu}{\sigma}\right) d\lambda(x) $$
$$ = \sum_{j=1}^k \frac{-j}{\sigma}\, \alpha_{ij}(\mu, \sigma)\, m_{j-1} \qquad (5.77) $$
and
$$ \int \psi_i(x; \mu, \sigma)\, \frac{\partial}{\partial \mu} \left\{ \frac{1}{\sigma}\, a\!\left(\frac{x - \mu}{\sigma}\right) \right\} d\lambda(x) = \sum_{j=1}^k \alpha_{ij}(\mu, \sigma) \int \left\{ \left(\frac{x - \mu}{\sigma}\right)^j - m_j \right\} \frac{-1}{\sigma^2}\, a'\!\left(\frac{x - \mu}{\sigma}\right) d\lambda(x) = \sum_{j=1}^k \frac{j}{\sigma}\, \alpha_{ij}(\mu, \sigma)\, m_{j-1} . \qquad (5.78) $$
Inserting (5.77) and (5.78) into the left hand side of (5.76) leads to the conclusion that the equality (5.76) holds, and the proof of the first part of (iii) is concluded.

We verify the second part of condition (iii). Since Ψ is unbiased,
$$ \frac{\partial}{\partial \sigma} \int \psi_i(x; \mu, \sigma)\, \frac{1}{\sigma}\, a\!\left(\frac{x - \mu}{\sigma}\right) d\lambda(x) = 0 . $$
Hence the proof of the second part of (iii) reduces to the verification of
$$ \int \frac{\partial}{\partial \sigma} \left\{ \psi_i(x; \mu, \sigma)\, \frac{1}{\sigma}\, a\!\left(\frac{x - \mu}{\sigma}\right) \right\} d\lambda(x) = 0 . \qquad (5.79) $$
Proceeding in a similar way as in the proof of the first part of property (iii), after some routine calculations, one obtains
$$ \int \frac{\partial}{\partial \sigma} \left\{ \psi_i(x; \mu, \sigma) \right\} \frac{1}{\sigma}\, a\!\left(\frac{x - \mu}{\sigma}\right) d\lambda(x) = \sum_{j=1}^k \frac{-j}{\sigma}\, \alpha_{ij}(\mu, \sigma)\, m_j \qquad (5.80) $$
and
$$ \int \psi_i(x; \mu, \sigma)\, \frac{\partial}{\partial \sigma} \left\{ \frac{1}{\sigma}\, a\!\left(\frac{x - \mu}{\sigma}\right) \right\} d\lambda(x) = \sum_{j=1}^k \frac{j}{\sigma}\, \alpha_{ij}(\mu, \sigma)\, m_j . \qquad (5.81) $$
Using (5.80) and (5.81) in the left hand side of (5.79) we conclude that (5.79) holds.

We verify property (iv). From the previous calculations (see (5.77) and (5.80)) we have
$$ E_{\mu\sigma a} \{ \nabla \Psi(\,\cdot\,; \mu, \sigma) \} = \frac{-1}{\sigma} \left[ \begin{array}{cc} \sum_{j=1}^k j \alpha_{1j}(\mu, \sigma)\, m_{j-1} & \sum_{j=1}^k j \alpha_{1j}(\mu, \sigma)\, m_j \\ \sum_{j=1}^k j \alpha_{2j}(\mu, \sigma)\, m_{j-1} & \sum_{j=1}^k j \alpha_{2j}(\mu, \sigma)\, m_j \end{array} \right] . $$
Hence $\det E_{\mu\sigma a}\{\nabla \Psi(\,\cdot\,; \mu, \sigma)\} \ne 0$ if and only if (5.75) holds.

We use lemma 7.1 in Lehmann (1983) ² to prove property (v). Note that the functions
$$ \left\{ \frac{\cdot - \mu}{\sigma} - m_1 , \ldots , \left(\frac{\cdot - \mu}{\sigma}\right)^k - m_k \right\} \qquad (5.82) $$
are affinely independent in the sense of the referred lemma, i.e. there do not exist constants $a_1, \ldots, a_k$ and b, not all zero, such that
$$ \sum_{j=1}^k a_j \left\{ \left(\frac{x - \mu}{\sigma}\right)^j - m_j \right\} = b \quad \mbox{with probability 1.} $$
For i = 1, 2, at least one among the numbers $\alpha_{i1}(\mu, \sigma), \ldots, \alpha_{ik}(\mu, \sigma)$ does not vanish, otherwise the determinant in (5.75) vanishes. Moreover, the vectors
$$ ( \alpha_{11}(\mu, \sigma), \ldots, \alpha_{1k}(\mu, \sigma) ) \quad \mbox{and} \quad ( \alpha_{21}(\mu, \sigma), \ldots, \alpha_{2k}(\mu, \sigma) ) $$
are not equal, otherwise the determinant given in (5.75) vanishes. This, together with the affine independence of the functions given in (5.82), implies that $\psi_1(\,\cdot\,; \mu, \sigma)$ and $\psi_2(\,\cdot\,; \mu, \sigma)$ are affinely independent. Since $\psi_1(\,\cdot\,; \mu, \sigma)$ and $\psi_2(\,\cdot\,; \mu, \sigma)$ are in $L^2(P_{\mu\sigma a})$, we conclude that the covariance matrix of $\psi_1(\,\cdot\,; \mu, \sigma)$ and $\psi_2(\,\cdot\,; \mu, \sigma)$ is positive definite. □

² Lemma 7.1 (Lehmann, 1983, page 124): For any random variables $X_1, \ldots, X_n$ with finite second moments, the covariance matrix is positive semidefinite; it is positive definite if and only if the $X_i$ are affinely independent, that is, there do not exist constants $(a_1, \ldots, a_r)$ and b such that $\sum a_i X_i = b$ with probability 1.
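To make the construction concrete, here is a hedged numerical sketch (not from the thesis) of the simplest instance of (5.68): with the core a standardised so that $m_1 = 0$ and $m_2 = 1$, the choice $\alpha_{11} = \alpha_{22} = 1$, $\alpha_{12} = \alpha_{21} = 0$ gives $\psi_1 = (x-\mu)/\sigma$ and $\psi_2 = ((x-\mu)/\sigma)^2 - 1$, whose joint empirical root is the sample mean and standard deviation; the code also evaluates the determinant-type condition (5.75) for this choice.

```python
import numpy as np

# Moments m_0, m_1, m_2, m_3 of a standardised symmetric core a( ).
m = [1.0, 0.0, 1.0, 0.0]
alpha = np.array([[1.0, 0.0],            # psi_1 = (x - mu)/sigma
                  [0.0, 1.0]])           # psi_2 = ((x - mu)/sigma)**2 - 1
k = 2
# Left hand side of (5.75); range index j runs over 0..k-1 for degree j+1.
det = (sum((j + 1) * alpha[0, j] * m[j] for j in range(k)) *
       sum((j + 1) * alpha[1, j] * m[j + 1] for j in range(k)) -
       sum((j + 1) * alpha[1, j] * m[j] for j in range(k)) *
       sum((j + 1) * alpha[0, j] * m[j + 1] for j in range(k)))
print(det)                               # 2.0, so condition (5.75) holds

rng = np.random.default_rng(1)
x = 2.0 + 3.0 * rng.standard_normal(200_000)
mu_hat, sigma_hat = x.mean(), x.std()    # joint root of the two empirical equations
print(abs(mu_hat - 2.0) < 0.05, abs(sigma_hat - 3.0) < 0.05)
```

The sample mean and standard deviation are, of course, exactly what one would use in practice; the point is that they arise as the root of a regular estimating function of the form (5.68).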


Chapter 6

Variants of the L2 - Restricted Models

6.1 Introduction

We study in this chapter two extensions of the L2 - restricted models, which we call extended L2 - restricted models and partial parametric models. The first is obtained by imposing on an L2 - restricted model an additional condition involving a function of the expected values of a group of functions. Examples of extended L2 - restricted models are the covariance selection model defined on a pure location model and regression models with link. The second class of models considered here is constructed by assuming that the density is known (or known to lie in a parametric model) in a region of the sample space and belongs to a semiparametric L2 - restricted model outside this region. A typical example of a partial parametric model is the trimmed model derived from a location model.

As will be apparent from the text, the mathematical structure of the models considered in this chapter is more complex than in the case of L2 - restricted models.

6.2 Extended L2 - Restricted Models

Consider a model defined as an L2 - restricted model with the additional restrictions given by (6.1) and (6.2) below. More precisely, suppose that each submodel $\mathcal{P}_\theta^*$ is the class of functions p : X −→ IR+ such that (5.1)-(5.6) and the following additional conditions hold. Assume that there are r, s ∈ N ∪ {0}, functions $b_1, \ldots, b_r : X \longrightarrow IR$ in $L^2(p)$, $h_1, \ldots, h_s : X \times \mathcal{Z} \longrightarrow IR$, $b : IR^r \longrightarrow IR$ and a continuous $h : IR^s \longrightarrow IR$ such that
$$ b\!\left( \int_X b_1(x) p(x) \lambda(dx), \ldots, \int_X b_r(x) p(x) \lambda(dx) \right) = G(\theta) \qquad (6.1) $$
and
$$ \forall z \in \mathcal{Z}, \quad h\!\left( \int_X h_1(x, z) p(x) \lambda(dx), \ldots, \int_X h_s(x, z) p(x) \lambda(dx) \right) \in H(\theta) . \qquad (6.2) $$

The model described above is called an extended L2 - restricted model. Excluding trivial choices of b and h, the conditions (6.1) and (6.2) cannot be expressed in terms of the conditions of the L2 - restricted models. Hence the model described above is indeed an extension of the class of L2 - restricted models.

Example 9 (Covariance selection model on the location model) Let us consider the model described in section 5.3.3, given by the multivariate location model with the additional constraints
$$ f^{ij}\!\left( \int_{IR^q} x_1 x_1\, p(x) \lambda(dx), \ldots, \int_{IR^q} x_q x_q\, p(x) \lambda(dx) \right) = 0 ; \qquad (6.3) $$
$$ f^{kl}\!\left( \int_{IR^q} x_1 x_1\, p(x) \lambda(dx), \ldots, \int_{IR^q} x_q x_q\, p(x) \lambda(dx) \right) \ne 0 . \qquad (6.4) $$
Clearly (6.3) is a condition of the type of (6.1) and (6.4) is of the type of (6.2). We conclude that the covariance selection model defined on the location model is an extended L2 - restricted model. We will work more with this example at the end of this section.
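As a hedged numerical aside (not from the thesis; the matrix below is invented), in dimension q = 3 and with the location taken as zero, so that raw second moments coincide with covariances, a constraint of the type of (6.3) imposing a zero (1, 2) entry of the concentration matrix is, up to sign, the cofactor identity $\Sigma_{13}\Sigma_{23} - \Sigma_{12}\Sigma_{33} = 0$:

```python
import numpy as np

# Build a concentration matrix K with K[0, 1] = 0, invert it to obtain the
# second-moment matrix S, and check the cofactor form of the constraint.
K = np.array([[2.0, 0.0, 0.5],
              [0.0, 1.5, 0.3],
              [0.5, 0.3, 1.0]])
S = np.linalg.inv(K)
constraint = S[0, 2] * S[1, 2] - S[0, 1] * S[2, 2]
print(abs(constraint) < 1e-12)           # True: the constraint holds exactly
```

This is only a sketch of why constraints on the inverse covariance translate into polynomial constraints on the moments, which is the structure exploited in this section.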

The calculation of the nuisance tangent spaces for an extended L2 - restricted model is in general a difficult task. We give in the following a sequence of lemmas that, under regularity conditions, yield a characterization of the intersection (over the values of the nuisance parameter) of the nuisance tangent spaces for each value of the interest parameter. This will allow us to give a representation of the regular estimating functions and to study the effect of the introduction of restrictions of the type of (6.1) and (6.2) on the problem of estimation via estimating functions. The proofs of the lemmas are given in appendix 6.4.1.

It is convenient to introduce the following notation. For each (θ, z) ∈ Θ × Z denote
$$ H_k(\theta, z) = H_k = span \{ f_1(\,\cdot\,) - E_{\theta z}(f_1), \ldots, f_k(\,\cdot\,) - E_{\theta z}(f_k) \} $$
and
$$ (H_k + B_r)(\theta, z) = (H_k + B_r) = span \left\{ \begin{array}{l} f_1(\,\cdot\,) - E_{\theta z}(f_1), \ldots, f_k(\,\cdot\,) - E_{\theta z}(f_k), \\ b_1(\,\cdot\,) - E_{\theta z}(b_1), \ldots, b_r(\,\cdot\,) - E_{\theta z}(b_r) \end{array} \right\} . $$
Moreover, $H_k^\perp(\theta, z) = H_k^\perp$ and $(H_k + B_r)^\perp(\theta, z) = (H_k + B_r)^\perp$ denote the orthogonal complements of $H_k(\theta, z)$ and $(H_k + B_r)(\theta, z)$ in $L_0^2(P_{\theta z})$, respectively. Note that $H_k(\theta, z)$ in fact does not depend on z, and we sometimes write simply $H_k(\theta)$.


Lemma 11 Under the extended L2 - restricted model, for each (θ, z) ∈ Θ × Z,

i) $(H_k + B_r)^\perp(\theta, z) \subseteq T_N^2(\theta, z)$ ;

ii) $T_N^{2\,\perp}(\theta, z) \subseteq (H_k + B_r)(\theta, z)$ ;

iii) $\cap_{z \in \mathcal{Z}}\, T_N^{2\,\perp}(\theta, z) \subseteq \cap_{z \in \mathcal{Z}}\, (H_k + B_r)(\theta, z) = H_k(\theta)$, provided for each $b_i$ (i = 1, …, r) there exist $z_i$ and $w_i \in \mathcal{Z}$ such that $E_{\theta z_i}(b_i) \ne E_{\theta w_i}(b_i)$.

Lemma 12 Under the extended L2 - restricted model, for each (θ, z) ∈ Θ × Z,

i) $T_N^1(\theta, z) \subseteq H_k^\perp(\theta, z)$ ;

ii) $H_k(\theta) \subseteq T_N^{1\,\perp}(\theta, z)$ ;

iii) $H_k(\theta) \subseteq \cap_{z \in \mathcal{Z}}\, T_N^{1\,\perp}(\theta, z)$ .

Theorem 20 Under an extended L2 - restricted model that satisfies the condition given in lemma 11 iii):

i) For each θ ∈ Θ,
$$ \bigcap_{z \in \mathcal{Z}} T_N^{2\,\perp}(\theta, z) = \bigcap_{z \in \mathcal{Z}} T_N^{1\,\perp}(\theta, z) = H_k(\theta) . $$

ii) If Ψ is a regular estimating function with components $\psi_1, \ldots, \psi_q$, then for each θ ∈ Θ and i ∈ {1, …, q} we have the representation
$$ \psi_i(\,\cdot\,; \theta) = \sum_{j=1}^k \alpha_{ij}(\theta)\, \{ f_j(\,\cdot\,) - M_j(\theta) \} . $$

The theorem shows that the estimation via regular estimating functions is not affected by the introduction of the constraints given by (6.1) and (6.2).


Example 10 (Covariance selection model on the location model, continued) Consider the three dimensional (q = 3) covariance selection model defined on the location model. The additional conditions added to the location model are

$$ \int_{IR^3} x_1 x_3\, p(x) \lambda(dx) \cdot \int_{IR^3} x_2 x_3\, p(x) \lambda(dx) - \int_{IR^3} x_1 x_2\, p(x) \lambda(dx) \cdot \int_{IR^3} x_3 x_3\, p(x) \lambda(dx) = 0 , \qquad (6.5) $$
$$ \int_{IR^3} x_2 x_3\, p(x) \lambda(dx) \cdot \int_{IR^3} x_1 x_2\, p(x) \lambda(dx) - \int_{IR^3} x_1 x_3\, p(x) \lambda(dx) \cdot \int_{IR^3} x_2 x_2\, p(x) \lambda(dx) \in (-\infty, 0) \cup (0, \infty) \qquad (6.6) $$
and
$$ \int_{IR^3} x_1 x_3\, p(x) \lambda(dx) \cdot \int_{IR^3} x_1 x_2\, p(x) \lambda(dx) - \int_{IR^3} x_1 x_1\, p(x) \lambda(dx) \cdot \int_{IR^3} x_2 x_3\, p(x) \lambda(dx) \in (-\infty, 0) \cup (0, \infty) . \qquad (6.7) $$

Condition (6.5) corresponds to (6.1) with
$$ b_1(x) = x_1 x_3 , \quad b_2(x) = x_2 x_3 , \quad b_3(x) = x_1 x_2 , \quad b_4(x) = x_3 x_3 $$
(where $x = (x_1, x_2, x_3)^T$) and
$$ b(b_1, b_2, b_3, b_4) = b_1 b_2 - b_3 b_4 . $$
Defining also $b_5(x) = x_2 x_2$ and $b_6(x) = x_1 x_1$, we have that (6.5)-(6.7) are equivalent to
$$ E_p(b_1) E_p(b_2) - E_p(b_3) E_p(b_4) = 0 , $$
$$ E_p(b_2) E_p(b_3) - E_p(b_1) E_p(b_5) \ne 0 $$
and
$$ E_p(b_1) E_p(b_3) - E_p(b_6) E_p(b_2) \ne 0 , $$

respectively. Note that the condition of part iii) of lemma 11 is satisfied. To verify it, take $p, q \in \mathcal{P}_\theta^*$ such that
$$ E_p(b_1) = E_p(b_2) = E_p(b_3) = E_p(b_4) = 1 \quad \mbox{and} \quad E_q(b_1) = E_q(b_2) = E_q(b_3) = E_q(b_4) = \frac{1}{2} . $$
Assume moreover that
$$ E_p(b_5) = E_q(b_5) = E_p(b_6) = E_q(b_6) = \frac{1}{4} . $$

A direct verification shows that conditions (6.5)-(6.7) hold for both p and q; however, $E_p(b_i) \ne E_q(b_i)$ for i = 1, …, 4.
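The two moment configurations can be checked mechanically; the following sketch (assuming, as the construction above suggests, that $E(b_1), \ldots, E(b_4)$ share a common value under each law) verifies that both satisfy (6.5) and that neither annuls (6.6) or (6.7):

```python
# E(b1) = ... = E(b4) equals 1 under p and 1/2 under q, while
# E(b5) = E(b6) = 1/4 under both laws.
for e in (1.0, 0.5):
    b1 = b2 = b3 = b4 = e
    b5 = b6 = 0.25
    assert b1 * b2 - b3 * b4 == 0        # (6.5) holds
    assert b2 * b3 - b1 * b5 != 0        # (6.6) does not vanish
    assert b1 * b3 - b6 * b2 != 0        # (6.7) does not vanish
print("both configurations pass")
```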

We conclude that theorem 20 holds for the covariance selection model described above (it holds also for covariance selection models defined on a location model in general), and hence the estimation of the mean via estimating functions is not affected by the constraints defined via the inverse of the covariance matrix. This is in accordance with the literature on covariance selection models.

There is one situation in which we can calculate the nuisance tangent space for an extended L2 - restricted model, given in the following theorem.

Theorem 21 Under an extended L2 - restricted model, if the function b is injective, then for each (θ, z) ∈ Θ × Z,
$$ T_N(\theta, z) = [\, span \{ f_1, \ldots, f_k, g_1, \ldots, g_s \} \,]^\perp . $$

A situation where theorem 21 applies is where s = 1 and g is a bijection, as for example in the regression models with link and in the classic proportional odds models for continuous variables used in genetics. Note that when theorem 21 applies, the theory for estimation through regular estimating functions and regular asymptotic linear estimating sequences developed in section 5.4 for L2 - restricted models applies in this case with some minor obvious modifications.

6.3 Partial Parametric Models

We study in this section some semiparametric models defined by imposing a restriction of

a different nature than the previous restrictions considered in this thesis. The restriction

says essentially that the distribution is known (or known to be in a certain parametric

model) in a certain region of the sample space. A typical example is the trimmed model

that we study in example 11.

Consider the class of probability measures
$$ \mathcal{P} = \{ P_{\theta z} : \theta \in \Theta \subseteq IR^q , z \in \mathcal{Z} \} = \cup_{\theta \in \Theta}\, \mathcal{P}_\theta $$
on (X , A), as before. Suppose that for each θ ∈ Θ there are a function $f_\theta : X \longrightarrow IR^+$ and a measurable set $I_\theta$ such that the class of densities $\mathcal{P}_\theta^*$ is given by
$$ \mathcal{P}_\theta^* = \{ p : X \longrightarrow IR \ \mbox{such that (5.1)-(5.6) and (6.8) hold} \} . $$
The additional condition in the definition above says that the density p is equal to $f_\theta$ on the set $I_\theta$. More precisely, consider the condition
$$ p(x) = f_\theta(x) \quad \mbox{for all } x \in I_\theta , \ \lambda\mbox{-a.e.} \qquad (6.8) $$

The family P is termed a partial parametric model. Note that if (6.8) is suppressed from the definition above, the model becomes an L2 - restricted model.

Example 11 (Location models with arbitrary tails) Suppose that X = IR, λ is the Lebesgue measure on the real line, Θ = IR and for each θ ∈ IR
$$ \mathcal{P}_\theta^* = \{ p : X \longrightarrow IR \ \mbox{such that (6.9)-(6.15) hold} \} . $$
Suppose furthermore that
$$ \forall x \in X , \ p(x) > 0 ; \qquad (6.9) $$
$$ p \ \mbox{is of class } C^1 ; \qquad (6.10) $$
$$ \int_X p(x) \lambda(dx) = 1 ; \qquad (6.11) $$
$$ \int_{IR} x^2 p(x) \lambda(dx) < \infty ; \qquad (6.12) $$
$$ \int_{IR} x\, p(x) \lambda(dx) = \theta ; \qquad (6.13) $$
$$ \int_{IR} \left\{ \frac{p'(x)}{p(x)} \right\}^2 p(x) \lambda(dx) < \infty ; \qquad (6.14) $$
and
$$ \forall x \in I_\theta , \ p(x) = f_\theta(x) , \qquad (6.15) $$
where $I_\theta = [\theta - \xi_1 , \theta + \xi_2]$ (for $\xi_1 > 0$ and $\xi_2 > 0$) and $f_\theta(\,\cdot\,) = f_0(\,\cdot\, - \theta)$ for a given probability density $f_0(\,\cdot\,)$ with $\int_{IR} x f_0(x) \lambda(dx) = 0$.

The model described above is composed of distributions that coincide with the location model generated by the density $f_0$ on the central interval $I_\theta$ and possess free tails. If we define $\xi_1$ and $\xi_2$ in such a way that $\int_{-\infty}^{-\xi_1} f_0(x) \lambda(dx) = \int_{\xi_2}^{\infty} f_0(x) \lambda(dx) = \alpha/2$, for some α ∈ (0, 1), then the model is called the α-trimmed model.
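The cut points of the α-trimmed model are easy to compute for a concrete core; a hedged sketch (assuming, purely for illustration, that $f_0$ is the standard normal density, which the text does not require):

```python
from statistics import NormalDist

# For the alpha-trimmed model the cut points satisfy
#   F0(-xi1) = 1 - F0(xi2) = alpha/2,
# so by symmetry of the standard normal, xi1 = xi2.
alpha = 0.10
f0 = NormalDist()                       # assumed core: standard normal
xi = f0.inv_cdf(1 - alpha / 2)          # common cut point xi1 = xi2
Q = f0.cdf(xi) - f0.cdf(-xi)            # mass of the central interval I_theta
print(abs(Q - (1 - alpha)) < 1e-9)      # True: the central mass is 1 - alpha
```

For α = 0.10 this gives a cut point near 1.645, i.e. the model is parametric on roughly the central 90% of the distribution and free in the two 5% tails.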

The following theorem characterizes the nuisance tangent spaces and their orthogonal complements in $L_0^2$ for the partial parametric models. It is convenient to introduce the following notation, for each θ ∈ Θ and z ∈ Z:
$$ I_\theta^c = X \setminus I_\theta ; $$
$$ \bar H_k^\perp(\theta, z) = \bar H_k^\perp = \left\{ \nu \in L_0^2(P_{\theta z}) : \nu \in H_k^\perp(\theta, z) \ \mbox{and} \ supp(\nu) \subseteq I_\theta^c \right\} ; $$
$$ \bar H_k(\theta, z) = \bar H_k = \left\{ \bar H_k^\perp(\theta, z) \right\}^\perp . $$
Here supp(ν) is the support of the function $\nu \in L_0^2(P_{\theta z})$. We identify L2 functions that are almost surely equal, and we adopt the convention that supp(ν) ⊆ A means that $\nu(\,\cdot\,) \chi_{A^c}(\,\cdot\,) = 0$, λ-almost everywhere.

Theorem 22 Under a partial parametric model, for each (θ, z) ∈ Θ × Z and m ∈ [1, ∞],

i) $T_N^m(\theta, z) = \bar H_k^\perp(\theta, z)$ ;

ii) $T_N^{m\,\perp}(\theta, z) = \bar H_k(\theta, z)$ . Moreover,
$$ \bar H_k(\theta, z) = span \left[ \{ f_i(\,\cdot\,) \chi_{I_\theta^c}(\,\cdot\,) - E_i(\theta) : i = 1, \ldots, k \} \cup \{ f \in L_0^2(P_{\theta z}) : supp(f) \subseteq I_\theta \} \right] , $$
where $E_i(\theta) = \int_{I_\theta^c} f_i(x)\, \frac{dP_{\theta z}}{d\lambda}(x)\, \lambda(dx)$.


Since
$$ E_i(\theta) = \int_{I_\theta^c} f_i(x)\, \frac{dP_{\theta z}}{d\lambda}(x)\, \lambda(dx) = M_i(\theta) - \int_{I_\theta} f_i(x)\, \frac{dP_{\theta z}}{d\lambda}(x)\, \lambda(dx) = M_i(\theta) - \int_{I_\theta} f_i(x)\, f_\theta(x)\, \lambda(dx) , $$
the function $E_i(\theta)$ does not depend on the nuisance parameter. Furthermore, $T_N^{m\,\perp}(\theta, z)$ does not depend on the nuisance parameter z, and we sometimes write simply $\bar H_k(\theta)$.

Corollary Under a partial parametric model, for each θ ∈ Θ and m ∈ [1, ∞]:

i) $\cap_{z \in \mathcal{Z}}\, T_N^{m\,\perp}(\theta, z) = \bar H_k(\theta)$ ;

ii) Any regular estimating function Ψ, with components $\psi_1, \ldots, \psi_q$, can be written in the form, for i = 1, …, q,
$$ \psi_i(\,\cdot\,) = \xi_i(\,\cdot\,; \theta) + \sum_{j=1}^k \alpha_{ij}(\theta)\, \{ f_j(\,\cdot\,) \chi_{I_\theta^c}(\,\cdot\,) - E_j(\theta) \} , \qquad (6.16) $$
where $supp(\xi_i(\,\cdot\,; \theta)) \subseteq I_\theta$ .

The regular estimating function Ψ in part ii) of the corollary above has the matrix representation
$$ \Psi(\,\cdot\,; \theta) = \xi(\,\cdot\,; \theta) + \alpha(\theta)\, \{ f(\,\cdot\,) \chi_{I_\theta^c}(\,\cdot\,) - E(\theta) \} , \qquad (6.17) $$
where $\xi(\,\cdot\,; \theta) = (\xi_1(\,\cdot\,; \theta), \ldots, \xi_q(\,\cdot\,; \theta))^T$, $\alpha(\theta) = [\alpha_{ij}(\theta)]_{i,j}$ and $E(\theta) = (E_1(\theta), \ldots, E_k(\theta))^T$.

Theorem 23 Consider a partial parametric model. Suppose that the function E is differentiable. Then:

a) For any regular estimating function with representation (6.17) and for each θ ∈ Θ and z ∈ Z,
$$ J_\Psi(\theta, z) = J_\xi(\theta, z) + \{ \nabla E(\theta) \}^T \{ Cov_{\theta z}( f \chi_{I_\theta^c} ) \}^{-1} \{ \nabla E(\theta) \} . \qquad (6.18) $$

b) The Godambe information is maximized (at (θ, z)) over the class of regular estimating functions by taking ξ equivalent to the parametric partial score restricted to $I_\theta$.

We calculate next the efficient score function. Let $l_{/\theta}(\,\cdot\,; \theta, z) = l_{/\theta}$ be the parametric partial score evaluated at (θ, z). We assume that for each z ∈ Z, $l_{/\theta}(\,\cdot\,; \,\cdot\,, z)$ is regular, in the sense that the properties of regular estimating functions hold. Take (θ, z) fixed. Note that, for i = 1, …, q,
$$ E_{\theta z} \left\{ l_{/\theta_i}\, \chi_{I_\theta^c} \right\} = \int_{I_\theta^c} \frac{\partial}{\partial \theta_i}\, p(x; \theta, z)\, \lambda(dx) = \frac{\partial}{\partial \theta_i} \int_{I_\theta^c} p(x; \theta, z)\, \lambda(dx) = 0 . \qquad (6.20) $$
The differentiation under the integral sign is allowed because it is assumed that the parametric partial score is regular. Let $\xi_1^{\theta z}, \ldots, \xi_k^{\theta z}$ be the result of an orthonormalization process applied to $f_1 \chi_{I_\theta^c} - E_1(\theta), \ldots, f_k \chi_{I_\theta^c} - E_k(\theta)$. We have, for i = 1, …, k and j = 1, …, q,
$$ \langle \xi_i^{\theta z} , l_{/\theta_j} \rangle_{\theta z} = \int_{I_\theta^c} \xi_i^{\theta z}(x)\, \frac{\partial}{\partial \theta_j} \{ p(x; \theta, z) \}\, \lambda(dx) = \mbox{(from (6.20))} = \int_{I_\theta^c} \frac{\partial}{\partial \theta_j} \{ \xi_i^{\theta z}(x) \}\, p(x; \theta, z)\, \lambda(dx) := \aleph_{ij}(\theta, z) . $$
The components of the efficient score function are then given by, for j = 1, …, q,
$$ l_j^E(\,\cdot\,; \theta, z) = \sum_{i=1}^k \aleph_{ij}(\theta, z)\, \xi_i^{\theta z}(\,\cdot\,) + \chi_{I_\theta}(\,\cdot\,)\, l_{/\theta_j}(\,\cdot\,; \theta) , $$
or, in matrix form,
$$ l^E(\,\cdot\,; \theta, z) = A(\theta, z)^T \xi^{\theta z}(\,\cdot\,) + \chi_{I_\theta}(\,\cdot\,)\, l_{/\theta}(\,\cdot\,; \theta) , $$
where A(θ, z) is the matrix formed by the $\aleph_{ij}(\theta, z)$s and $\xi^{\theta z}(\,\cdot\,) = (\xi_1^{\theta z}(\,\cdot\,), \ldots, \xi_k^{\theta z}(\,\cdot\,))^T$. The covariance of the efficient score function can be computed directly from this representation.


We give next an alternative representation of the model, evaluated at (θ, z), that will allow us to determine whether the semiparametric Cramér-Rao bound is attained by regular estimating functions. We use the functions $\xi_i^{\theta z}(\,\cdot\,)$ instead of the functions $\{ f_i(\,\cdot\,) \chi_{I_\theta^c}(\,\cdot\,) - E_i \}$. Since each $\xi_i^{\theta z}(\,\cdot\,)$ is a linear combination of $f_1(\,\cdot\,) \chi_{I_\theta^c}(\,\cdot\,) - E_1, \ldots, f_k(\,\cdot\,) \chi_{I_\theta^c}(\,\cdot\,) - E_k$, there exists a non singular matrix $\gamma^{\theta z}$ such that
$$ f(\,\cdot\,)\, \chi_{I_\theta^c}(\,\cdot\,) - E(\theta) = \gamma^{\theta z}\, \xi^{\theta z}(\,\cdot\,) . $$

The matrix $\gamma^{\theta z}$ is the change-of-basis matrix. Now define the functions $f_1^{\theta z}, \ldots, f_k^{\theta z} : X \longrightarrow IR$ such that $f^{\theta z}(\,\cdot\,) = (f_1^{\theta z}(\,\cdot\,), \ldots, f_k^{\theta z}(\,\cdot\,))^T$ and $f(\,\cdot\,) = \gamma^{\theta z} f^{\theta z}(\,\cdot\,)$. Clearly, for i = 1, …, k, $\xi_i^{\theta z}(\,\cdot\,) = f_i^{\theta z}(\,\cdot\,)\, \chi_{I_\theta^c}(\,\cdot\,)$. For any $\theta_0 \in \Theta$, z ∈ Z and i = 1, …, k, we have
$$ \int_X f_i^{\theta z}(x)\, p(x; \theta_0, z)\, \lambda(dx) = M_i^{\theta z}(\theta_0) , \qquad (6.23) $$

that is, the integral above does not depend on the choice of the nuisance parameter z. The condition (6.23) can be used to characterize the model alternatively, instead of condition (5.5). The nuisance tangent spaces can be characterized alternatively in the following way: for each $(\theta_0, z_0) \in \Theta \times \mathcal{Z}$,
$$ \bar H_k(\theta_0, z_0) = span \left[ \{ f_i^{\theta z}(\,\cdot\,)\, \chi_{I_{\theta_0}^c}(\,\cdot\,) - E_i^{\theta z}(\theta_0) : i = 1, \ldots, k \} \cup \{ f \in L_0^2(P_{\theta_0 z_0}) : supp(f) \subseteq I_{\theta_0} \} \right] , $$
where
$$ E_i^{\theta z}(\theta_0) = M_i^{\theta z} - \int_{I_{\theta_0}} f_i^{\theta z}(x)\, \frac{dP_{\theta_0 z_0}}{d\lambda}(x)\, \lambda(dx) . $$
In particular, any regular estimating function has, for each $\theta_0 \in \Theta$, a representation analogous to (6.17) with $f^{\theta z}$ and $E^{\theta z}(\theta_0) = \gamma^{\theta z} E(\theta_0)$ playing the roles of f and E. Using the theory already developed, the Godambe information of Ψ at (θ, z) is then given by the analogue of (6.18).


Theorem 24 Consider a partial parametric model. Suppose that the function E is differentiable. Then the semiparametric Cramér-Rao bound is attained by regular estimating functions at (θ, z) ∈ Θ × Z if and only if,

Example 12 (Location models with arbitrary tails) Consider the location model with arbitrary tails introduced in example 11. We have k = 1, $f_1(\,\cdot\,) = (\,\cdot\,)$ and $M_1(\theta) = \theta$. Suppose that $\xi_1 = \xi_2$ and that $f_0(\,\cdot\,)$ is symmetric about the origin. We have then
$$ E_1(\theta) = \theta - \int_{\theta - \xi_1}^{\theta + \xi_1} x\, f_0(x - \theta)\, \lambda(dx) = \theta (1 - Q) , $$
where $Q = \int_{-\xi_1}^{\xi_1} f_0(x)\, \lambda(dx)$. Any regular estimating function Ψ is of the form (6.16). Moreover,
$$ \gamma^{\theta p_\theta} = \frac{1}{\| (\,\cdot\,)\, \chi_{IR \setminus (\theta - \xi_1, \theta + \xi_1)}(\,\cdot\,) \|_{L^2(p_\theta)}} . $$
Hence
$$ \nabla E^{\theta z}(\theta)^T = \nabla E(\theta)^T \{ \gamma^{\theta z} \}^T = \frac{(1 - Q)}{\| (\,\cdot\,)\, \chi_{IR \setminus (\theta - \xi_1, \theta + \xi_1)}(\,\cdot\,) \|_{L^2(p_\theta)}} $$
and
$$ A(\theta, p_\theta) = \int_{I_\theta^c} \frac{\partial}{\partial \theta} \{ \xi_1^{\theta z}(x) \}\, p(x; \theta, z)\, \lambda(dx) = \frac{(1 - Q)}{\| (\,\cdot\,)\, \chi_{IR \setminus (\theta - \xi_1, \theta + \xi_1)}(\,\cdot\,) \|_{L^2(p_\theta)}} . $$
Since $A(\theta, p_\theta)$ coincides with $\nabla E^{\theta z}(\theta)^T$, the semiparametric Cramér-Rao bound is attained by regular estimating functions. An optimal estimating function is obtained by taking
$$ \xi(\,\cdot\,, \theta) = - \frac{f_0'(\,\cdot\, - \theta)}{f_0(\,\cdot\, - \theta)}\, \chi_{(\theta - \xi_1, \theta + \xi_1)}(\,\cdot\,) . $$
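A hedged Monte Carlo sketch of this optimal estimating function (assuming, purely for illustration, a standard normal core $f_0$, so that $-f_0'/f_0$ is the identity, and data whose tails also happen to be normal; the constant c below plays the role of an arbitrary coefficient $\alpha(\theta)$):

```python
import numpy as np

# psi(x; theta) = (x - theta) * 1{|x - theta| < xi}                   (score on I_theta)
#               + c * ( x * 1{|x - theta| >= xi} - theta * (1 - Q) )  (tail part)
# Unbiasedness at the true theta is checked by simulation.
theta, xi, c = 1.5, 1.6449, 0.7          # xi cuts off roughly 5% in each tail
Q = 0.90                                 # mass of (-xi, xi) under the standard normal
rng = np.random.default_rng(2)
x = theta + rng.standard_normal(2_000_000)
inside = np.abs(x - theta) < xi
psi = (x - theta) * inside + c * (x * ~inside - theta * (1 - Q))
print(abs(psi.mean()) < 0.01)            # approximately unbiased at the true theta
```

The inner term uses only the parametric part of the model, while the tail term uses only the moment information, mirroring the decomposition (6.16).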


6.4 Appendices

6.4.1 Technical proofs related to extended L2 - restricted models

Lemma 13 (lemma 11) Under the extended L2 - restricted model, for each (θ, z) ∈ Θ × Z,

i) $(H_k + B_r)^\perp(\theta, z) \subseteq T_N^2(\theta, z)$ ;

ii) $T_N^{2\,\perp}(\theta, z) \subseteq (H_k + B_r)(\theta, z)$ ;

iii) $\cap_{z \in \mathcal{Z}}\, T_N^{2\,\perp}(\theta, z) \subseteq \cap_{z \in \mathcal{Z}}\, (H_k + B_r)(\theta, z) = H_k(\theta)$, provided for each $b_i$ (i = 1, …, r) there exist $z_i$ and $w_i \in \mathcal{Z}$ such that $E_{\theta z_i}(b_i) \ne E_{\theta w_i}(b_i)$.

Proof:
i) Take (θ, z) ∈ Θ × Z fixed and $\nu \in (H_k + B_r)^\perp(\theta, z)$. Define for each t ∈ IR+ the generalized sequence $\{p_t\}_{t \in IR^+}$,
$$ p_t(\,\cdot\,) = p(\,\cdot\,) + t\, p(\,\cdot\,)\, \nu(\,\cdot\,) . $$
We show that for t small enough $p_t \in \mathcal{P}_\theta^*$. The functions $p_t$ (for t small) satisfy the conditions (5.1)-(5.6), as verified in the proof of lemma 10. We check the conditions (6.1) and (6.2). For i = 1, …, r,
$$ \int_X b_i(x) p_t(x) \lambda(dx) = \int_X b_i(x) p(x) \lambda(dx) + t \int_X b_i(x) \nu(x) p(x) \lambda(dx) = \int_X b_i(x) p(x) \lambda(dx) , $$
hence (6.1) holds. For i = 1, …, s,
$$ \int_X h_i(x) p_t(x) \lambda(dx) - \int_X h_i(x) p(x) \lambda(dx) = t \int_X h_i(x) \nu(x) p(x) \lambda(dx) . $$
Hence for t small enough $\int_X h_i(x) p_t(x) \lambda(dx)$ is arbitrarily close to $\int_X h_i(x) p(x) \lambda(dx)$, and (6.2) follows from the continuity of h.

ii) Straightforward from i).

iii) For each i ∈ {1, …, r} there are $z_i, w_i \in \mathcal{Z}$ such that $E_{\theta z_i}(b_i) \ne E_{\theta w_i}(b_i)$. This implies that $b_i$ is not a constant function and that
$$ span \{ b_i(\,\cdot\,) - E_{\theta z_i}(b_i) \} \cap span \{ b_i(\,\cdot\,) - E_{\theta w_i}(b_i) \} = \{ 0 \} . \qquad (6.25) $$


Hence,
$$ \bigcap_{z \in \mathcal{Z}} (H_k + B_r)(\theta, z) = \bigcap_{z \in \mathcal{Z}} span \left\{ \begin{array}{l} f_1(\,\cdot\,) - E_{\theta z}(f_1), \ldots, f_k(\,\cdot\,) - E_{\theta z}(f_k), \\ b_1(\,\cdot\,) - E_{\theta z}(b_1), \ldots, b_r(\,\cdot\,) - E_{\theta z}(b_r) \end{array} \right\} $$
$$ = \bigcap_{z \in \mathcal{Z}} span \left\{ \begin{array}{l} f_1(\,\cdot\,) - M_1(\theta), \ldots, f_k(\,\cdot\,) - M_k(\theta), \\ b_1(\,\cdot\,) - E_{\theta z}(b_1), \ldots, b_r(\,\cdot\,) - E_{\theta z}(b_r) \end{array} \right\} = H_k(\theta) , $$
where the last equality follows from (6.25). □

Lemma 14 (lemma 12) Under the extended L2 - restricted model, for each (θ, z) ∈ Θ × Z,

i) $T_N^1(\theta, z) \subseteq H_k^\perp(\theta, z)$ ;

ii) $H_k(\theta) \subseteq T_N^{1\,\perp}(\theta, z)$ ;

iii) $H_k(\theta) \subseteq \cap_{z \in \mathcal{Z}}\, T_N^{1\,\perp}(\theta, z)$ .

Proof:
i) The same argument as in lemma 9 yields this part of the lemma.
ii) Straightforward from i).
iii) Follows immediately from taking the intersection (over z ∈ Z) on both sides of ii) and observing that $H_k(\theta)$ does not depend on z. □

Theorem 25 (theorem 20) Under an extended L2 - restricted model that satisfies the condition given in lemma 11 iii):

i) For each θ ∈ Θ,
$$ \bigcap_{z \in \mathcal{Z}} T_N^{2\,\perp}(\theta, z) = \bigcap_{z \in \mathcal{Z}} T_N^{1\,\perp}(\theta, z) = H_k(\theta) . $$

ii) If Ψ is a regular estimating function with components $\psi_1, \ldots, \psi_q$, then for each θ ∈ Θ and i ∈ {1, …, q} we have the representation
$$ \psi_i(\,\cdot\,; \theta) = \sum_{j=1}^k \alpha_{ij}(\theta)\, \{ f_j(\,\cdot\,) - M_j(\theta) \} . $$

Proof:
i) Take (θ, z) ∈ Θ × Z fixed. Since $T_N^2(\theta, z) \subseteq T_N^1(\theta, z)$, taking intersections yields
$$ \bigcap_{z \in \mathcal{Z}} T_N^2(\theta, z) \subseteq \bigcap_{z \in \mathcal{Z}} T_N^1(\theta, z) . \qquad (6.26) $$
Combining (6.26) with lemma 13 item iii) and lemma 14 item iii) implies the first part of the theorem.

ii) Take θ ∈ Θ and i ∈ {1, …, q} fixed. The result follows from item i) and because $\psi_i(\,\cdot\,; \theta) \in \cap_{z \in \mathcal{Z}}\, T_N^{2\,\perp}(\theta, z)$ (see chapter 3). □

Theorem 26 Under an extended L2 - restricted model, if the function b is injective, then
$$ T_N(\theta, z) = [\, span \{ f_1, \ldots, f_k, g_1, \ldots, g_s \} \,]^\perp . $$

Proof:
"⊆" Take $\nu \in T_N^{1,0}$. Then, for each t, $\nu = \frac{p_t - p}{t\, p} - r_t$, where $p_t \in \mathcal{P}_\theta^*$ and $r_t \stackrel{L^1}{\longrightarrow} 0$. Using the same argument given in the proof of lemma 9, we conclude that $\langle f_i , \nu \rangle = 0$. Moreover,
$$ b\!\left( \int_X b_1(x) p_t(x) \lambda(dx), \ldots, \int_X b_s(x) p_t(x) \lambda(dx) \right) = b\!\left( \int_X b_1(x) p(x) \lambda(dx), \ldots, \int_X b_s(x) p(x) \lambda(dx) \right) , $$
and hence, since b is injective, for i = 1, …, s,
$$ \int_X b_i(x) p_t(x) \lambda(dx) = \int_X b_i(x) p(x) \lambda(dx) . $$
Hence, for i = 1, …, s,
$$ \langle b_i , \nu \rangle = \frac{1}{t} \left\{ \int_X b_i(x) p_t(x) \lambda(dx) - \int_X b_i(x) p(x) \lambda(dx) \right\} - \int_X b_i(x)\, r_t(x)\, p(x) \lambda(dx) = - \int_X b_i(x)\, r_t(x)\, p(x) \lambda(dx) . $$
Using the same argument given in the proof of lemma 9, we conclude that $\langle b_i , \nu \rangle = 0$.

"⊇" Take $\nu \in [\, span \{ f_1, \ldots, f_k, g_1, \ldots, g_s \} \,]^\perp \cap C_c$. Define, for each t ∈ IR+ ,
$$ p_t(\,\cdot\,) = p(\,\cdot\,) + t\, p(\,\cdot\,)\, \nu(\,\cdot\,) . $$
We show that for t small enough $p_t \in \mathcal{P}_\theta^*$. Note that the argument given in the proof of lemma 10 shows that the conditions (5.1)-(5.6) hold. To check condition (6.1), observe that, for i = 1, …, s,
$$ \int_X b_i(x) p_t(x) \lambda(dx) = \int_X b_i(x) p(x) \lambda(dx) + t \int_X b_i(x) \nu(x) p(x) \lambda(dx) = \int_X b_i(x) p(x) \lambda(dx) , $$
hence (6.1) holds. Reasoning in a similar way, we conclude that (6.2) holds too, which concludes the proof. □

6.4.2 Technical proofs related to partial parametric models

The following three lemmas prove theorem 22 in section 6.3.

Lemma 15 Under a partial parametric model, for each θ ∈ Θ and z ∈ Z,
$$ \bar H_k(\theta, z) = span \left[ \{ f_i(\,\cdot\,) \chi_{I_\theta^c}(\,\cdot\,) - E_i(\theta) : i = 1, \ldots, k \} \cup \{ f \in L_0^2(P_{\theta z}) : supp(f) \subseteq I_\theta \} \right] , $$
where $E_i(\theta) = \int_{I_\theta^c} f_i(x)\, \frac{dP_{\theta z}}{d\lambda}(x)\, \lambda(dx)$.

Proof: Take (θ, z) ∈ Θ × Z fixed. We suppress θ and z from the notation in the rest of this proof. Define
$$ F_k = \{ f_i(\,\cdot\,) \chi_{I_\theta^c}(\,\cdot\,) - E_i(\theta) : i = 1, \ldots, k \} \cup \{ f \in L_0^2(P_{\theta z}) : supp(f) \subseteq I_\theta \} . $$
We prove first that $span(F_k) \subseteq \bar H_k$. Take $\nu \in \bar H_k^\perp$; by definition, $supp(\nu) \subseteq I_\theta^c$ and $\nu \in H_k^\perp$. For i = 1, …, k,
$$ \langle \nu , f_i \chi_{I_\theta^c} - E_i \rangle = \langle \nu , f_i \chi_{I_\theta^c} \rangle - \langle \nu , E_i \rangle = \langle \nu , f_i \chi_{I_\theta^c} \rangle = \mbox{(since } supp(\nu) \subseteq I_\theta^c \mbox{)} = \langle \nu , f_i \rangle = \mbox{(since } \nu \in H_k^\perp \mbox{)} = 0 . $$
Moreover, for any f with supp(f) ⊆ $I_\theta$,
$$ \langle \nu , f \rangle = 0 , $$
because $supp(\nu) \subseteq I_\theta^c$. Hence $F_k \subseteq \{ \bar H_k^\perp \}^\perp = \bar H_k$. Since $\bar H_k$ is a vector space, $span(F_k) \subseteq \bar H_k$.

We prove now that $\bar H_k(\theta, z) \subseteq span(F_k)$. The proof reduces to showing that $F_k^\perp \subseteq \bar H_k^\perp$. To see that, note that $F_k^\perp \subseteq \bar H_k^\perp$ implies that $span(F_k^\perp) \subseteq \bar H_k^\perp$, which implies that ¹ $\{ span(F_k) \}^\perp \subseteq \bar H_k^\perp$, which implies that $\bar H_k \subseteq span(F_k)$.

Take $\nu \in F_k^\perp$. We have ²
$$ \mbox{for any } f \mbox{ such that } supp(f) \subseteq I_\theta , \quad \langle \nu , f \rangle_{\theta z} = 0 ; \qquad (6.27) $$
and
$$ \mbox{for } i = 1, \ldots, k , \quad \langle \nu , f_i \chi_{I_\theta^c} - E_i \rangle = \langle \nu , f_i \chi_{I_\theta^c} \rangle = 0 . \qquad (6.28) $$
We check first that $supp(\nu) \subseteq I_\theta^c$. Using (6.27), for any $g \in L_0^2(P_{\theta z})$, $\langle \nu , g \chi_{I_\theta} \rangle_{\theta z} = 0$. In particular ν is orthogonal to any function in an orthogonal basis of the space of functions in $L_0^2(P_{\theta z})$ with support contained in $I_\theta$. We conclude that $\nu(\,\cdot\,) \chi_{I_\theta}(\,\cdot\,) \equiv 0$, and hence $supp(\nu) \subseteq I_\theta^c$. Combining this with (6.28) yields $\langle \nu , f_i \rangle = 0$ for i = 1, …, k, i.e. $\nu \in H_k^\perp$. We conclude that $F_k^\perp \subseteq \bar H_k^\perp$. □

¹ Note that $[span(A)]^\perp \subseteq span(A^\perp)$. For, $\{ a \in [span(A)]^\perp \} \Longrightarrow \{ \forall b \in span(A), a \perp b \} \Longrightarrow \{ \forall b \in A, a \perp b \} \Longrightarrow \{ a \in A^\perp \} \Longrightarrow \{ a \in span(A^\perp) \}$.

² Since $(A \cup B)^\perp \subseteq A^\perp \cap B^\perp$, for any A and B.

Lemma 16 Under a partial parametric model, for each (θ, z) ∈ Θ × Z,
$$ T_N^1(\theta, z) \subseteq \bar H_k^\perp(\theta, z) . $$
Proof: The lemma follows from an argument similar to the proof of lemma 9 in chapter 5. □

Lemma 17 Under a partial parametric model, for each (θ, z) ∈ Θ × Z,
$$ \bar H_k^\perp(\theta, z) \subseteq T_N^\infty(\theta, z) . $$
Proof: Take $\nu \in \bar H_k^\perp(\theta, z) \cap C_c$. We prove that $\nu \in T_N^{\infty, 0}(\theta, z)$. Define, for each t ∈ IR+ ,
$$ p_t(\,\cdot\,) = p(\,\cdot\,) + t\, p(\,\cdot\,)\, \nu(\,\cdot\,) . $$
We show that for t small enough $p_t \in \mathcal{P}_\theta^*$, which proves the lemma. The conditions (5.1)-(5.6) are satisfied by $p_t$, for t small enough, as shown in the proof of lemma 10. The additional condition (6.8) holds because $supp(\nu) \subseteq I_\theta^c$ (by definition of $\bar H_k^\perp$). Hence $\bar H_k^\perp \cap C_c \subseteq T_N^{\infty, 0}$ and the lemma follows by taking the closure. □

Theorem 27 (theorem 23) Consider a partial parametric model. Suppose that the function E is differentiable. Then we have, for any regular estimating function with representation (6.17) and for each θ ∈ Θ and z ∈ Z:
$$ J_\Psi(\theta, z) = J_\xi(\theta, z) + \{ \nabla E(\theta) \}^T \{ Cov_{\theta z}( f \chi_{I_\theta^c} ) \}^{-1} \{ \nabla E(\theta) \} . \qquad (6.29) $$


b) The Godambe information is maximized (at (θ, z)) over the class of regular estimating functions by taking ξ equivalent to the parametric partial score restricted to $I_\theta$.

Proof: The first part follows immediately from the definition of the Godambe information and the representation (6.17).

We prove the second part. Note that the second term on the right hand side of the expression (6.29) does not depend on the choice of Ψ, i.e. it depends only on the functions E and f, which are characteristics of the model and not of the estimating function. Hence, to maximize the Godambe information of a regular estimating function it is enough to maximize $J_\xi$, i.e. we can concentrate on the interval $I_\theta$ where the model is parametric. The rest of the proof follows from the classic theory of estimating functions for semiparametric models, for example theorem 4.9 in Jørgensen and Labouriau (1996, pages 150-152). The details are given below.

Let $U = (U_1, \ldots, U_q)^T$ be the parametric partial score restricted to $I_\theta$ and decompose
$$ \xi = \xi^u + \xi^a = \left( \xi_1^u , \ldots , \xi_q^u \right)^T + \left( \xi_1^a , \ldots , \xi_q^a \right)^T , $$
where the components of $\xi^u$ are the projections of the components of ξ onto the space spanned by the components of U, and $\xi^a = \xi - \xi^u$. We claim that $S_{\xi^a}(\theta, z) = 0$. The claim follows from the following. For i, j = 1, …, q,
$$ 0 = \frac{\partial}{\partial \theta_i} \int_{I_\theta} \xi_j^a(x)\, p(x; \theta, z)\, \lambda(dx) = \mbox{(differentiating under the integral sign)} = \int_{I_\theta} \frac{\partial}{\partial \theta_i} \{ \xi_j^a(x)\, p(x; \theta, z) \}\, \lambda(dx) $$
$$ = \int_{I_\theta} \frac{\partial}{\partial \theta_i} \{ \xi_j^a(x) \}\, p(x; \theta, z)\, \lambda(dx) + \int_{I_\theta} \xi_j^a(x)\, U_i(x)\, p(x; \theta, z)\, \lambda(dx) = \int_{I_\theta} \frac{\partial}{\partial \theta_i} \{ \xi_j^a(x) \}\, p(x; \theta, z)\, \lambda(dx) = [ S_{\xi^a}(\theta, z) ]_{ij} , $$
where the term involving $U_i$ vanishes because $\xi_j^a$ is orthogonal to the components of U. Note that the differentiation under the integral sign performed above is allowed because Ψ is a regular estimating function.

We have then $S_\xi(\theta, z) = S_{\xi^u}(\theta, z)$ and $V_\xi(\theta, z) = V_{\xi^u}(\theta, z) + V_{\xi^a}(\theta, z)$, so that
$$ J_\xi^{-1}(\theta, z) = S_{\xi^u}^{-1}(\theta, z)\, \{ V_{\xi^u}(\theta, z) + V_{\xi^a}(\theta, z) \}\, S_{\xi^u}^{-T}(\theta, z) = J_{\xi^u}^{-1}(\theta, z) + S_{\xi^u}^{-1}(\theta, z)\, V_{\xi^a}(\theta, z)\, S_{\xi^u}^{-T}(\theta, z) \ge J_{\xi^u}^{-1}(\theta, z) . $$
Hence
$$ J_\xi(\theta, z) \le J_{\xi^u}(\theta, z) \le J_U(\theta, z) . $$
The last inequality in the expression above is due to the fact that the components of $\xi^u$ are in the space spanned by the components of U. Obviously the equality in the last expression is obtained if ξ is equivalent to U. □


Bibliography

[1] Barndorff-Nielsen, O.E. (1978). Hyperbolic distributions and distributions on hyperbolae. Scand. J. Statist. 5, 151-157.

[2] Barndorff-Nielsen, O.E. ; Jensen, J.L. and Sørensen, M. (1990). Parametric modelling

of turbulence. Phil. Trans. R. Soc. Lond. A 332, 439-445.

[3] Begun, J.M.; Hall, W.J.; Huang, W.M. and Wellner, J.A. (1983). Information and asymptotic efficiency in parametric-nonparametric models. Ann. Statist. 11, 432-452.

[4] Bickel, P.J.; Klaassen, C.A.J.; Ritov, Y. and Wellner, J.A. (1993). Efficient and

Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press,

London.

[5] Billingsley, P. (1986). Probability and Measure. Second edition. John Wiley and Sons.

New York.

[6] Chow, Y.S. and Teicher, H. (1978). Probability Theory: Independence, Interchange-

ability, Martingales. Springer-Verlag, Heidelberg.

[7] Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press, Princeton.

[8] Dieudonne, J. (1960). Foundations of Modern Analysis. Academic Press, New York.

[9] Dunford, N. and Schwartz, J.T. (1958). Linear Operators, Part I . Interscience, New

York.

[10] Durbin, J. (1960). Estimation of parameters in time-series regression models. J. Roy. Statist. Soc. Ser. B 22, 139-153.

[11] Fisher, R.A. (1934). Two new properties of mathematical likelihood. Proc. Royal Soc.

London Ser. A 144, 285–307.



[12] Godambe, V.P. (1960). An optimum property of regular maximum likelihood estimation. Ann. Math. Statist. 31, 1208-1211.

[13] Godambe, V.P. (1976). Conditional likelihood and unconditional optimum estimating

equations. Biometrika 63, 277–284.

[14] Godambe, V.P. (1980). On sufficiency and ancillarity in the presence of a nuisance

parameter. Biometrika 67, 269–276.

[15] Godambe, V.P. (1984). On ancillarity and Fisher information in the presence of a

nuisance parameter. Biometrika 71, 626–629.

[16] Godambe, V.P. and Thompson, M.E. (1974). Estimating equations in the presence

of a nuisance parameter. Ann. Statist. 2, 568–571.

[17] Godambe, V.P. and Thompson, M.E. (1976). Some aspects of the theory of estimating

equations. J. Statist. Plann. Inference 2, 95–104.

[18] Hájek, J. (1962). Asymptotically most powerful rank-order tests. Ann. Math. Statist.

33, 1124-1147.

[20] Jørgensen, B. and Labouriau, R. (1995). Exponential Families and Theoretical Infer-

ence. Lecture notes at the University of British Columbia, Vancouver.

analysis based on exponential dispersion models. J. Roy. Statist. Soc. Ser. B 58 ,

573-592.

[22] Kendall, M.G. and Stuart, A. (1952). The Advanced Theory of Statistics. Vol. 1.

Charles Griffin, London.

[23] Kimball, B.K. (1946). Sufficient statistical estimation functions for the parameters

of the distribution of maximum values. Ann. Math. Statist. 17, 299–309.

IMPA.


quasi estimating functions. To appear.

[28] LeCam, L. (1966). Likelihood functions for large number of independent observations.

In: Research Papers in Statistics. Festschrift for J. Neyman (F.N. David, ed.), 167-

187, Wiley, London.

[29] Luenberger, D.G. (1969). Optimization by Vector Space Methods. John Wiley and Sons.

New York.

[30] Lukacs, E. (1975). Stochastic Convergence. Second Edition. Academic Press, Inc.,

New York.

[31] Lundbye-Christensen, S. (1991). A multivariate growth curve model for pregnancy.

Biometrics 47, 637-657.

[32] McLeish, D.L. and Small, C.G. (1987). The Theory and Applications of Statistical

Inference Functions. Lecture Notes in Statistics 44, Springer-Verlag, New York.

in Statistics 13. Springer-Verlag.

[34] Pfanzagl, J. (1985). Asymptotic Expansions for General Statistical Models. Lecture

Notes in Statistics 31. Springer-Verlag.

ments. Lecture Notes in Statistics 63. Springer-Verlag.

[36] Plausonio, A. (1996) De Re Ætiopia. 74th edition. Editora Rodeziana, São Paulo-

Barcelona.

[37] Rudin, W. (1966). Real and Complex Analysis. McGraw-Hill, New York.

New York.

[40] Stein, C. (1956). Efficient nonparametric testing and estimation. Proc. Third Berkeley

Symp. Math. Statist. Prob. 1, 187-195, Univ. of California Press, Berkeley.

[41] Vaart, A.W. van der (1988). Estimating a real parameter in a class of semiparametric

models. Ann. Statist. 16, 1450-1474.


[42] Vaart, A.W. van der (1991). Efficiency and Hadamard differentiability. Scand. J.

Statist. 18, 63-75.

[43] Vaart, A.W. van der (1988). Statistical Estimation in Large Parameter Spaces. CWI

Tracts 44, Amsterdam.
