
Statistiques et Analyse de Données

Statistics and Data Analysis

Julien Reygner

Academic year 2020–2021


Foreword

This year, the course takes place entirely by videoconference, on the Friday 13:00–15:00 slot, using the virtual classroom tool of Microsoft Teams.
The videoconferences are held as lectures ('amphis', with the whole cohort) for the course, and as exercise classes ('petites classes', in four groups) for the correction of the exercises and labs. The calendar of lectures and exercise classes is summarised below; the detailed programme of the sessions is given on p. v.

[Overview calendar (table): for each Friday, the 13:00–15:00 slot is split between lecture ('amphi'), exercise class ('petite classe') and, on the last date, the exam; the detailed programme of the sessions gives the exact split. Dates: Friday 25 September, 2 October, 9 October, 16 October, 23 October, 6 November, 13 November, 27 November, 11 December, 18 December, 8 January, 15 January, 22 January.]

All information related to the course is available on the Teams team STAT - Statistiques et Analyse de Données.

Preparation of the exercises and labs

The exercises and labs are to be done during the week between two sessions. Preparing them is mandatory: before the beginning of each session, you must upload a document on Teams, in the shared folder corresponding to the session of your exercise-class group. For the exercises, this document must be a PDF (a legible scan of your handwritten work, or a document typed directly on a computer); for the labs, you must upload your notebook in R Markdown format. Depending on the case, your document must be named nom_prenom.pdf or nom_prenom.rmd.

Cover illustration by Randall Munroe, http://xkcd.com/552.



Computer labs
The computer labs use the R software; all the information needed to install it is available on Teams. The main goal of these labs is to illustrate the notions of the course while familiarising students with this language; additional resources are provided on Teams for students who wish to deepen their practice.

Flipped classes
Part of the last three course sessions is devoted to flipped-class presentations: you will teach the course to your classmates. Six topics, described in Appendix B, p. 141, are to be presented by groups of 3 or 4 students, in a talk of about fifteen minutes followed by a discussion of about ten minutes with the rest of the class. Detailed instructions for the preparation of the talks are given in the introduction of Appendix B (to be read carefully, right now). The methods presented in these talks are part of the final exam programme: keep in mind that your classmates will expect from you the same pedagogical effort as the one you demand from your teachers!

Assessment
The module grade is composed of 75% of the final exam grade and 25% of the evaluation of the flipped-class presentation.

Course in English
The lecture notes and all the teaching material related to the course are written in English, in order to expose students to scientific English, which will be very useful for those who wish to spend a semester or a year abroad. The exercise classes, however, remain taught in French. We advise you to prepare the slides of your flipped-class presentations in English, but to keep French for the oral presentation. Finally, students are free to write their final exam in either French or English.
Some specific terms do not necessarily have a French/English correspondence: the exercise-class teachers are there to clarify the ambiguities this could create. As an example, let us point out right away a classical subtlety: the English terms positive and negative respectively refer to strictly positive and strictly negative numbers; for a number that is positive or zero, one uses the term nonnegative, and of course nonpositive for a number that is negative or zero. The same rule applies to the monotonicity of functions: increasing and decreasing respectively refer to strictly increasing and strictly decreasing functions; for functions that are increasing or decreasing in the weak sense, one uses nondecreasing and nonincreasing. Another notable difference between the English and French conventions concerns the use of a parenthesis to denote the open endpoint of an interval: what would be written [0, 1[ in French is written [0, 1) in English.

Lecture notes
These lecture notes are intended for the students of École des Ponts. An electronic version is available on Teams, and we are generally very happy to send it to any student who may find it useful. We thank you, however, for not posting the electronic version on public web pages.
'In-course' exercises are included in the body of the chapters: they are direct applications of the course and are not corrected. The end-of-chapter exercises are generally a bit more original or deeper; they are corrected in Appendix C (except for the mandatory exercises, which are corrected in class).
Finally, some passages of the notes are marked with an asterisk: they are outside the syllabus of the course, but you may find it useful to come back to them later in your studies.

Office hours
As far as the sanitary situation allows, an in-person office hour slot is planned to complement the remote course. The details will be decided during the first lecture.
Programme of the sessions

In the programme below, lectures ('amphis') and exercise classes ('petites classes') are labelled Lecture and Exercise class respectively.
For each lecture, the main notions to know are listed. At the end of the semester (or better, all along it), do not hesitate to consult this programme to make sure that you have a clear understanding of each of them.
The exercises and labs to prepare are labelled Preparation.
The preparation schedule for the flipped classes is labelled Flipped classes.

Session 1. Friday 25 September

Lecture, 13:00–13:30. Presentation of the organisation of the course.

Exercise class, 13:30–15:00.
– Reminders of linear algebra and probability theory (check-in list, p. 3).
– Complements on Gaussian statistics (Appendix A, p. 135).
– Installation of R and R Studio.
– Beginning of the introductory R lab.

Session 2. Friday 2 October

Preparation. Finish the introductory R lab.

Lecture, 13:00–14:15. Data analysis.
– General framework of data analysis. Empirical mean and covariance, properties of the empirical covariance matrix. Empirical correlation, interpretation.
– Principal Component Analysis: goal and practical implementation. Characterisation theorem of the optimal subspace (without proof). Interpretation of the notions of loadings and scores. Properties of the principal components. Correlation sphere and correlation circle, factorial planes.
– Goal of clustering methods. k-means: principle and monotonicity property, implementation on an example. Principle of agglomerative hierarchical clustering.

Exercise class, 14:15–15:00.
– Feedback on the introductory R lab.
– Beginning of the data analysis lab.

Session 3. Friday 9 October

Preparation. Finish the data analysis lab.

Flipped classes. Read the short summaries of each of the six topics (Appendix B, p. 141), form groups and choose a topic.

Exercise class, 13:00–13:45. Correction of the data analysis lab.

Lecture, 13:45–15:00. Principles of parametric estimation.
– General framework of parametric estimation. Notions of model and estimator. Bias, quadratic risk, bias–variance decomposition. Examples of the empirical mean and variance.
– Consistency and asymptotic normality of an estimator, Delta method (sketch of proof).
– Construction of estimators by the method of moments.

Session 4. Friday 16 October

Preparation. Exercises 2.A.2 (questions 1 and 2) and 2.A.3.

Exercise class, 13:00–13:45. Correction of the exercises.

Lecture, 13:45–15:00. The maximum likelihood principle.
– Definition of the likelihood. Maximum Likelihood Estimator (MLE). Examples of the exponential and uniform models.
– Regular model, Fisher information.
– Fréchet–Darmois–Cramér–Rao bound (sketch of proof), efficient estimator.
– Optimality of the MLE in regular models (sketch of proof).

Session 5. Friday 23 October

Preparation. Exercises 2.A.2 (question 3) and 2.A.5 (except question 5).

Exercise class, 13:00–13:45. Correction of the exercises.

Lecture, 13:45–15:00. Confidence intervals.
– Definition of exact and asymptotic confidence intervals.
– Free statistic, pivotal function, construction of an exact confidence interval.
– Construction of an asymptotic confidence interval from an asymptotically normal estimator (sketch of proof).

Session 6. Friday 6 November

Preparation.
– Exercise 2.A.6 (questions 1 and 2).
– Beginning of the parametric estimation lab (plot of the histogram of the sample and of the corresponding density).

Optional further work. Exercise 2.A.6 (question 3).

Exercise class, 13:00–15:00.
– Correction of the exercises.
– Beginning of the parametric estimation lab.

Session 7. Friday 13 November

Preparation.
– Exercise 2.A.5 (question 5).
– Finish the parametric estimation lab.

Exercise class, 13:00–13:45. Correction of the exercise and end of the parametric estimation lab.

Lecture, 13:45–15:00. Formalism of hypothesis testing.
– Goal and general formalism. Type I and type II errors and risks. Level, power and p-value.
– General procedure for the construction of a test. Choice of the test statistic according to the hypotheses. Duality with confidence intervals.

Session 8. Friday 27 November

Preparation. Exercises 3.A.1, 3.3.2 and 3.3.4.
Optional further work. Exercise 3.3.3.

Flipped classes. Send your exercise-class teacher an email describing the data set on which you have chosen to apply the method you are to present, and the questions you want to address.

Exercise class, 13:00–13:45. Correction of the exercises.

Lecture, 13:45–15:00. Hypothesis testing in the nonparametric setting.
– General principles of tests in the nonparametric setting.
– χ2 tests: asymptotic behaviour of the χ2 distance between the empirical measure and the law of the data (without proof), goodness-of-fit test to a given law, goodness-of-fit test to a family of laws.
– Kolmogorov tests: asymptotic behaviour of the empirical cumulative distribution function (Glivenko–Cantelli and Donsker theorems, without proof), asymptotic Kolmogorov test. Quantile function, freeness of the Kolmogorov statistic under H0 (proof), nonasymptotic Kolmogorov test. Lilliefors correction for the goodness-of-fit test to a family of laws.

Session 9. Friday 11 December

Preparation. Exercises 4.A.2 and 4.A.5 (read Section 4.2.4 before starting this exercise).

Exercise class, 13:00–13:45. Correction of the exercises.

Lecture, 13:45–15:00. Linear regression.
– General principle of regression, linear setting.
– Simple linear regression: computation of the least squares estimators, coefficient of determination.
– Multiple linear regression: geometric interpretation of the least squares estimator and of the coefficient of determination.
– Gaussian linear model: law of the estimators of β and σ2 (without proof), confidence interval for prediction. Student and Fisher tests for variable selection.

Session 10. Friday 18 December

Preparation. Do the linear regression lab.

Exercise class, 13:00–15:00. Correction of the linear regression lab.

Flipped classes.
– Topic 1: χ2 independence test.
– Topic 2: tests for comparing proportions.

Session 11. Friday 8 January

Preparation. Revision: Exercises 1 p. 131 and 2 p. 132.

Exercise class, 13:00–15:00. Correction of the exercises.

Flipped classes.
– Topic 3: Student and Fisher tests.
– Topic 4: nonparametric homogeneity tests.

Session 12. Friday 15 January

Preparation. Revision: Exercises 3 p. 132 and 5 p. 134.

Exercise class, 13:00–15:00. Correction of the check-out list.

Flipped classes.
– Topic 5: analysis of variance in the Gaussian model.
– Topic 6: logistic regression.

Session 13. Friday 22 January, final exam

Preparation. The exercises flagged with the dedicated pictogram throughout the notes, together with Exercise 4 p. 133, constitute an excellent revision programme. The exam papers of previous years, available on Educnet, are also recommended.

The exact format of the exam will probably depend on the sanitary situation...


Contents

Introduction 1

 Check-in list 3

1 Data analysis 7
1.1 Empirical correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3 Clustering methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.A Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.B Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2 Parametric estimation 27
2.1 General definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Maximum Likelihood Estimation and efficiency . . . . . . . . . . . . . . . . . . 33
2.3 * Sufficient statistics and the Rao–Blackwell Theorem . . . . . . . . . . . . . . 40
2.4 Confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5 * Kernel density estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.A Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.B Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

3 Hypothesis testing 65
3.1 General formalism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.2 General construction of a test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.3 Examples in the Gaussian model . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.4 * Multiple comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.A Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.B Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

4 Nonparametric tests 83
4.1 Models with a finite state space: the χ2 test . . . . . . . . . . . . . . . . . . . . 84
4.2 Continuous models on the line: the Kolmogorov test . . . . . . . . . . . . . . . 88
4.A Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.B Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

5 Linear and logistic regression 101


5.1 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.2 Logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.A Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.B Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
x Contents

6 Independence and homogeneity tests 115


6.1 Independence tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.2 Two-sample homogeneity tests . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.3 Many-sample homogeneity tests . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.A Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.B Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

 Check-out list 131

A Complements on Gaussian statistics 135


A.1 Gaussian-related distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
A.2 Gaussian vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
A.3 Multidimensional Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . 137
A.4 Cochran’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

B Flipped-class topics 141


B.1 χ2 independence test (Section 6.1.1) . . . . . . . . . . . . . . . . . . . . . . . 142
B.2 Tests for comparing proportions (Section 6.2.1) . . . . . . . . . . . . . . . . . . 142
B.3 Student and Fisher tests (Section 6.2.2) . . . . . . . . . . . . . . . . . . . . . . 143
B.4 Nonparametric homogeneity tests (Section 6.2.3) . . . . . . . . . . . . . . . . . 143
B.5 Analysis of variance in the Gaussian model (Section 6.3.2) . . . . . . . . . . . . 144
B.6 Logistic regression (Section 5.2) . . . . . . . . . . . . . . . . . . . . . . . . . . 144

C Correction of the exercises 145


C.1 Correction of the exercises of Chapter 1 . . . . . . . . . . . . . . . . . . . . . . 145
C.2 Correction of the exercises of Chapter 2 . . . . . . . . . . . . . . . . . . . . . . 151
C.3 Correction of the exercises of Chapter 3 . . . . . . . . . . . . . . . . . . . . . . 157
C.4 Correction of the exercises of Chapter 4 . . . . . . . . . . . . . . . . . . . . . . 162
C.5 Correction of the exercises of Chapter 5 . . . . . . . . . . . . . . . . . . . . . . 164
C.6 Correction of the exercises of Chapter 6 . . . . . . . . . . . . . . . . . . . . . . 167

Bibliography 171
Introduction

General purpose of these notes


Statistics are often presented as a dual version of probability theory: in probability theory, one
starts from a probability measure P and tries to establish some properties of samples X1 , . . . , Xn
from this distribution; while in statistics, only the values of X1 , . . . , Xn are observed, from which
one tries to infer some information about the unknown underlying probability measure P.
Although it is true that this so-called statistical inference problem remains at the heart of
the statistical analysis of any random experiment, and that probability theory was historically
developed as an instrument to provide this problem with a technical framework, modern statistics,
which in particular have to take into account large amounts of data, raise a much larger diversity
of methodological questions, and involve a much broader spectrum of mathematical tools (such as
linear algebra and optimisation), as well as other sciences, especially computer science.
The purpose of this course is to introduce students to a few of these questions and methods, which we believe constitute the necessary background for any engineer working with data. We aim to describe, on the one hand, the practical implementation of these methods and the contexts in which they are appropriate, and on the other hand the mathematical framework underlying their application. We thereby hope to
illustrate how abstract notions and statements, such as orthogonal projections, spectral decompo-
sition, the Law of Large Numbers or the Central Limit Theorem, may be fruitfully applied to very
practical situations.
However, we insist on the fact that this course is oriented toward the presentation of methods
rather than theorems: the latter shall always come as a theoretical support to the understanding
of the former. In this respect, at any point of the course, students should be able to stop, ask
themselves: ‘What is the practical use of what I am currently doing?’, and answer this question by
producing a simple example.

What to do with data?


The standard task of a statistician may be summarised in the following steps:

(i) collect and visualise some data in order to construct a model;

(ii) estimate the parameters of the model;

(iii) formulate and test hypotheses;

(iv) use the result of these tests to predict future outcomes.

For low-dimensional data, elementary tools such as histograms and scatter plots1 make it possible to address the first step by giving a rapid qualitative overview of a data set. However, when the number of variables collected during an experiment is large, visualising and extracting pertinent information from the data becomes more intricate. The data analysis techniques presented in Chapter 1 address these questions.
1 Nuages de points en français.
The collection and visualisation of data generally allows one to construct a model for the phenomenon under study, which is based on assumptions: for example, that certain variables are independent, or that the distributions of some variables belong to a given family, such as Gaussian or exponential laws. Once the model is constructed, its parameters (for instance, the mean and variance of the variables which are assumed to be Gaussian) have to be estimated from the observation of the data. The framework of parametric estimation, presented in Chapter 2, makes it possible to carry out this estimation procedure with quantitative error estimates, in particular through the notion of confidence intervals.
The first purpose of the construction of a model and of the estimation of its parameters is to improve the understanding of the phenomenon studied by the experiment. For example, when analysing the results of a survey2, one may define a model by assuming that picking n people at random and asking them 'Will you vote for Candidate X?' yields a sample of n independent and identically distributed Bernoulli variables, with an unknown parameter p which corresponds to the actual proportion of the population who will vote for Candidate X. It is then of interest to determine whether p is larger or smaller than 1/2 in order to know whether Candidate X will win the election. The theory of hypothesis testing introduced in Chapter 3 provides the framework to address such questions. Hypothesis testing also makes it possible to assess the validity of a model; in particular, the nonparametric tests discussed in Chapter 4 may be used to determine whether a sample of data is actually distributed according to a given probability law, while the independence and homogeneity tests presented in Chapter 6 are designed to study the dependence between variables.
Last, once a model is constructed and validated through parameter estimation and hypothesis testing, it can be employed for predictive purposes. For instance, assume that at each experiment, pairs of variables (xi, yi), i = 1, . . . , n, are collected, and that the analysis has validated a functional relation of the form yi ≃ f(xi), where the symbol ≃ denotes the fact that some fluctuations are not entirely captured by the function f. In this context, estimating the function f is a regression problem, and solving it may be expected to make it possible to predict the values of y corresponding to a set of variables x which has not been observed during the first n experiments. Two specific regression problems, namely linear and logistic regression, are presented in Chapter 5.

Afterword
These notes are based on the former polycopié of the course coordinated by Jean-François Delmas.
They benefited from many useful discussions with Cristina Butucea, Guillaume Obozinski and
Arnaud Guyader, as well as the lecturers (past and present) of the course: Christophe Denis,
Vincent Feuillard, Patrick Hoscheit, Guillaume Perrin and Gabriel Stoltz. I wish to warmly thank
all of them for their comments and help.
If you find any typo, mistake or imprecision in the text, or if you have any comment on its
contents, please let me know: julien.reygner@enpc.fr. Thank you!

2 Sondage en français.
 Check-in list

In this short preliminary chapter, we introduce some notation, in particular of linear algebra and
probability theory, which will be used throughout the notes. We also recall a few definitions and
leave to the reader a number of quick questions (and enough room to fill in the blanks), the answer
to which should be a good refresher before delving into the body of these notes.

Linear algebra
For n, p ≥ 1, we denote by Rn×p the space of matrices with n rows and p columns. The transpose
of a matrix A ∈ Rn×p is denoted by A⊤ ∈ Rp×n .

Symmetric matrices
We denote by ⟨·, ·⟩ the usual scalar product on Rp.
 What is the definition of the Euclidean norm ‖·‖ on Rp induced by ⟨·, ·⟩?

 For A ∈ Rp×p and u, v ∈ Rp, express ⟨Au, v⟩ in terms of A⊤.

The matrix A ∈ Rp×p is symmetric if A⊤ = A.


 What does the Spectral Theorem say about symmetric matrices?

A symmetric matrix A ∈ Rp×p is called nonnegative if, for all v ∈ Rp, ⟨Av, v⟩ ≥ 0.


 How to characterise the fact that a symmetric matrix is nonnegative in terms of its eigenval-
ues?

Orthogonal projections
Let E be a finite-dimensional linear space3, endowed with a scalar product ⟨·, ·⟩, and let H be a linear subspace4 of E.
3 Espace vectoriel en français.
4 Sous-espace vectoriel en français.

 What is the definition of the space H ⊥ ?

For x ∈ E, the orthogonal projection of x onto H is the unique xH ∈ H such that x − xH ∈ H ⊥ .

 Which equivalent definition(s) of xH do you know?

 If (e1 , . . . , ek ) is an orthonormal basis of H, give an explicit expression of xH .

 If H = Span(e) is the linear space generated by some vector e ∈ E such that ‖e‖ = 1, what is the operator e⊤e?

Probability theory
A probability space is a triple (Ω, A, P), where Ω is a set, A is a σ-algebra on Ω, and P is a
probability measure on (Ω, A).

Random variables
A random variable with values in a measurable space X is a measurable function X : Ω → X.

 What is the definition of the law PX of a random variable X?

 What is the definition of the Bernoulli and Binomial distributions?

When X = Rd, a random variable X is said to have a density pX : Rd → [0, +∞) if, for any measurable subset B ⊂ Rd,
$$\mathbb{P}(X \in B) = \int_{\mathbb{R}^d} \mathbf{1}_{\{x \in B\}}\, p_X(x)\,\mathrm{d}x,$$
or equivalently, for any measurable and bounded function f : Rd → R,
$$\mathbb{E}[f(X)] = \int_{\mathbb{R}^d} f(x)\, p_X(x)\,\mathrm{d}x.$$

 What are the properties of a density function?



 What is the density of the following distributions:

– uniform U[a, b],

– exponential E(λ),

– Gamma Γ(a, λ),

– Beta β(a, b),

– Gaussian N(µ, σ 2 )?

 If X ∼ N(µ, σ 2 ) and a, b ∈ R, what is the law of aX + b?

 If X ∼ E(λ) and a ∈ R, what is the law of aX?

 What is the definition of a Gaussian vector?

Variance and covariance


The variance of a random variable X ∈ R is defined by

Var(X) = E[(X − E[X])2 ] = E[X 2 ] − E[X]2 .

The covariance of two random variables X, Y ∈ R is defined by

Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])] = E[XY ] − E[X]E[Y ].

It is symmetric and bilinear.

 Expand Var(X + Y ).

The covariance matrix of a random vector X = (X1 , . . . , Xd ) ∈ Rd is the matrix with coefficients
Cov(Xi , Xj ).

 If X = (X1 , . . . , Xd ) is a random vector with covariance matrix K ∈ Rd×d , compute


Cov(⟨u, X⟩, ⟨v, X⟩) for u, v ∈ Rd.

The variance of a random vector X ∈ Rd is defined by

Var(X) = E[‖X − E[X]‖2] = E[‖X‖2] − ‖E[X]‖2.

 Express Var(X) in terms of K.

Independence
A family of random variables (Xi )i∈I , which take their respective values in some spaces Xi , is
called independent if, for any finite set of distinct indices i1 , . . . , ik ∈ I, for any measurable sets
B1 ⊂ Xi1 , . . . , Bk ⊂ Xik ,

P(Xi1 ∈ B1 , . . . , Xik ∈ Bk ) = P(Xi1 ∈ B1 ) · · · P(Xik ∈ Bk ),

or equivalently, for any measurable and bounded functions f1 : Xi1 → R, . . . fk : Xik → R,

E[f1 (Xi1 ) · · · fk (Xik )] = E[f1 (Xi1 )] · · · E[fk (Xik )].

A family of independent variables having the same law is called iid5 .

 If X, Y ∈ R are independent, what is the value of Cov(X, Y )? What about the converse
statement?

 If X1, . . . , Xn are iid, what are the expectation and the variance of $\frac{1}{n}\sum_{i=1}^n X_i$?

 If X and Y are independent Gaussian random variables, what is the law of X + Y ?

 If X1 , . . . , Xn are iid according to E(λ), what is the law of X1 + · · · + Xn ?

Convergence of random variables


A sequence of random variables (Xn )n≥1 is said to converge to some random variable X:

• almost surely if P(lim Xn = X) = 1;

• in probability if, for any ε > 0, lim P(‖Xn − X‖ ≥ ε) = 0;

• in Lp if lim E[‖Xn − X‖p] = 0;

• in distribution if, for any continuous and bounded function f , lim E[f (Xn )] = E[f (X)].
5 Independent and Identically Distributed.

 What are the relations between these notions of convergence?

 What does the notation ‘Xn → N(0, 1) in distribution’ mean?

 What does Slutsky’s Theorem say?

 State the (strong) Law of Large Numbers and the Central Limit Theorem.
Chapter 1

Data analysis

The expression 'data analysis' covers a set of techniques for extracting and summarising the information contained in data bases. In this chapter, we shall focus on data bases taking the form of a table with n rows and p columns:

• each row represents a data point, that is to say an individual in a population, or the result of a random experiment in a series of identically distributed experiments, etc.;

• each column represents a characteristic of the individual, which we shall also call a feature1.

We only address quantitative variables, which means that the cells of the table may only contain numbers, as opposed to categorical variables2. The features can represent quantities of a different nature, for instance physical variables that are not expressed in the same unit of measurement.
In this chapter we shall work with the example of the results of men's decathlon at the 2016 Summer Olympics, reproduced in Table 1.1. We may extract from this table two arrays with n = 23 rows and p = 10 columns: a first array with the raw results at each event (expressed in seconds or in metres), for which comparing the numbers in two different columns makes no sense; and a second array with the results converted into points, in which case the features are said to be homogeneous.
The table is seen as a matrix $x_n \in \mathbb{R}^{n \times p}$. Its rows are denoted by $x_1, \ldots, x_n$ and its columns by $x^1, \ldots, x^p$. The number at the $i$-th row and $j$-th column is $x_i^j$.
The usual scalar product on $\mathbb{R}^p$ (seen as a space of row vectors) is denoted by $\langle \cdot, \cdot \rangle$, and the associated Euclidean norm by $\|\cdot\|$. For any (row) vector $x \in \mathbb{R}^p$, $x^\top$ is the transposed (column) vector.

1.1 Empirical correlation


Definition 1.1.1 (Empirical mean and covariance). The empirical mean of $x_1, \ldots, x_n$ is the (row) vector of $\mathbb{R}^p$ defined by
$$\overline{x}_n = \frac{1}{n} \sum_{i=1}^n x_i.$$
Its coordinates are denoted by $\overline{x}_n = (\overline{x}_n^1, \ldots, \overline{x}_n^p)$.


1
Descripteur en français.
2
The term categorical variable refers to a variable which provides a qualitative information on the experiment, not
necessarily described by a numerical quantity; for instance a colour or the gender of an individual.

Name 100m L. jump Shot put H. jump 400m 110m h. Discus Pole vault Javelin 1500m
10.46 7.94 14.73 2.01 46.07 13.8 45.49 5.2 59.77 263.3
Eaton
985 1045 773 813 1005 1000 777 972 734 789
10.81 7.6 15.76 2.04 48.28 14.02 46.78 5.4 65.04 265.5
Mayer
903 960 836 840 896 972 804 1035 814 774
10.3 7.67 13.66 2.04 47.35 13.58 44.93 4.7 63.19 264.9
Warner
1023 977 708 840 941 1029 765 819 786 778
10.78 7.69 14.2 2.1 46.75 14.62 43.25 5 64.6 271.2
Kazmirek
910 982 741 896 971 896 731 910 807 736
10.75 7.52 13.78 2.1 47.98 14.15 42.39 4.6 66.49 254.6
Bourrada
917 940 715 896 910 955 713 790 836 849
11.21 7.14 14.27 2.07 48.15 14.48 47.07 4.9 72.32 268.3
Suarez
814 847 745 868 902 913 810 880 925 756
10.71 7.49 13.44 2.1 49.83 14.77 49.42 5.2 60.92 283
Ziemek
926 932 694 896 822 878 858 972 752 662
11.24 7.66 12.84 2.16 49.63 15.01 43.58 5.4 62.09 274.2
v. d. Plaetsen
808 975 657 953 832 848 738 1035 769 717
10.93 7.42 14.77 2.07 49.14 14.79 45.1 4.5 69.92 270.5
Felix
876 915 776 868 855 875 769 760 888 741
10.77 7.48 15.26 1.92 48.14 14.17 45.1 4.9 57.28 271.5
A. de Araujo
912 930 806 731 902 953 769 880 697 735
11.01 7.45 14.92 2.19 48.78 14.57 39.91 5 51.29 262
Taiwo
858 922 785 982 872 902 663 910 608 798
11.06 7.35 15.11 2.04 49.51 14.37 44.13 4.7 68.2 274.4
Helcelet
847 898 796 840 837 927 749 819 862 716
11.17 7.07 15.41 1.98 49.34 14.82 42.23 5.1 61.91 280.5
Auzeil
823 830 815 785 845 871 710 941 767 677
10.86 7.47 11.49 2.13 48.18 14.3 38.89 4.9 51.82 272.1
Dubler
892 927 575 925 900 936 642 880 616 731
10.87 6.97 15.03 1.98 49.02 14.12 44.66 4.5 64.13 293.1
Abele
890 807 792 785 860 959 760 760 800 600
10.83 7.11 14.8 1.98 49.8 15.74 53.24 4.4 63.54 284.7
Victor
899 840 777 785 824 762 938 731 791 651
11.32 7.33 13.69 2.01 50.81 14.99 46.31 5.2 60.15 286.3
Tonnesen
791 893 709 813 778 851 794 972 740 641
10.81 6.83 14.58 2.01 48.69 14.25 40.34 4.5 64.7 285
Garcia
903 774 764 813 876 942 671 760 809 649
10.84 7.33 13.4 1.89 48.61 14.39 38.09 4.8 61.83 273.5
Distelberger
897 893 692 705 880 925 626 849 765 722
11.3 6.83 14.14 1.98 50.43 15.09 49.9 4.9 66.63 286.3
Ushiro
795 774 737 785 795 839 868 880 838 641
10.88 6.73 14.17 2.01 50.18 15.09 48.32 4.5 56.68 282.3
Wiesiolek
888 750 739 813 806 839 835 760 688 666
11.04 7.13 12 1.92 48.93 14.57 34.91 4.7 51.24 258.4
Nakamura
852 845 606 731 865 902 562 819 607 823
10.82 7.02 13.88 1.77 50.32 16.51 42.96 4.5 46.42 279.4
Saluri
901 818 721 602 800 676 725 760 536 684

Table 1.1: Overall results for the 23 best ranked athletes at men’s decathlon during the 2016
Summer Olympics. Each cell contains the raw result (in seconds or metres according to the event)
on the first row, and the corresponding number of points on the second row. The cells with bold
text indicate the best performance in each event. Source: Wikipedia.

The empirical covariance matrix of $x_1, \ldots, x_n$ is the matrix $K_n \in \mathbb{R}^{p \times p}$ with coefficients
$$K_n^{jj'} = \frac{1}{n} \sum_{i=1}^n (x_i^j - \overline{x}_n^j)(x_i^{j'} - \overline{x}_n^{j'}).$$

Exercise 1.1.2. Show that
$$K_n = \frac{1}{n} \sum_{i=1}^n (x_i - \overline{x}_n)^\top (x_i - \overline{x}_n),$$
so that $K_n$ writes as a sum of $n$ symmetric matrices of rank 1. ◦

The empirical variance of $x_1, \ldots, x_n$ is the nonnegative real number
$$\sigma_n^2 = \frac{1}{n} \sum_{i=1}^n \|x_i - \overline{x}_n\|^2.$$
It naturally satisfies $\sigma_n^2 = \operatorname{tr} K_n = \sum_{j=1}^p K_n^{jj}$, where each diagonal coefficient $(\sigma_n^j)^2 = K_n^{jj}$ is the empirical variance of the feature $x^j$.
The empirical covariance matrix of $x_1, \ldots, x_n$ satisfies the following properties.

Proposition 1.1.3 (Properties of the empirical covariance matrix). The matrix $K_n$ is symmetric, it also writes
$$K_n = \frac{1}{n} \sum_{i=1}^n x_i^\top x_i - \overline{x}_n^\top \overline{x}_n,$$
and its eigenvalues3 are real and nonnegative.


Proof. That $K_n$ is symmetric and satisfies the claimed identity is an immediate consequence of Exercise 1.1.2. Since $K_n$ is symmetric, the Spectral Theorem ensures that its eigenvalues are real. Now let $\lambda$ be one of these eigenvalues, and $u \in \mathbb{R}^p$ an associated (row) eigenvector4, that is to say such that $u K_n = \lambda u$. Then on the one hand,
$$\langle u K_n, u \rangle = \langle \lambda u, u \rangle = \lambda \|u\|^2,$$
while on the other hand, following Exercise 1.1.2,
$$\langle u K_n, u \rangle = \frac{1}{n} \sum_{i=1}^n \langle u (x_i - \overline{x}_n)^\top (x_i - \overline{x}_n), u \rangle = \frac{1}{n} \sum_{i=1}^n |u (x_i - \overline{x}_n)^\top|^2 \geq 0.$$
We deduce that $\lambda \|u\|^2 \geq 0$, and since $\|u\|^2 > 0$ by the definition of an eigenvector, then $\lambda \geq 0$.

Remark 1.1.4. To highlight the connection with the usual notions of expectation and covariance, one may introduce the probability measure
$$\mu_n = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}$$
on $\mathbb{R}^p$, called the empirical distribution of the sample $x_1, \ldots, x_n$. Here, $\delta_x$ is the Dirac distribution at $x$, defined by the identity $\delta_x(A) = \mathbf{1}_{\{x \in A\}}$ for any Borel set $A$ of $\mathbb{R}^p$. It is easily checked that if $\xi$ is a random variable distributed according to $\mu_n$, its expectation $\mathbb{E}[\xi]$ is equal to $\overline{x}_n$, and its covariance matrix is equal to $K_n$. It is therefore not surprising that $\overline{x}_n$ and $K_n$ have the usual properties of an expectation and a covariance matrix. ◦
3 Valeurs propres en français.
4 Vecteur propre en français.
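As a quick numerical illustration of Remark 1.1.4, the following R sketch (a toy example, not tied to the decathlon data) draws a large iid sample from the empirical distribution µn of the rows of a small random matrix and checks that its mean and covariance are close to the empirical mean and covariance; note that R's cov() uses the 1/(n − 1) convention, hence the rescaling.

    set.seed(1)
    n <- 200; p <- 3
    y <- matrix(rnorm(n * p), nrow = n)      # toy data set with n rows in R^p

    ybar <- colMeans(y)                      # empirical mean
    Ky   <- cov(y) * (n - 1) / n             # empirical covariance matrix (1/n convention)

    # Sampling from mu_n amounts to drawing row indices uniformly with replacement.
    m  <- 1e5
    xi <- y[sample(n, size = m, replace = TRUE), ]

    max(abs(colMeans(xi) - ybar))            # close to 0
    max(abs(cov(xi) * (m - 1) / m - Ky))     # close to 0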

We now introduce the notion of empirical correlation, which plays an essential role in data analysis.

Definition 1.1.5 (Empirical correlation). For all $j, j' \in \{1, \ldots, p\}$, the empirical correlation between the features $x^j$ and $x^{j'}$ is defined by
$$\mathrm{Corr}(x^j, x^{j'}) = \frac{K_n^{jj'}}{\sqrt{K_n^{jj}} \sqrt{K_n^{j'j'}}} = \frac{K_n^{jj'}}{\sigma_n^j \sigma_n^{j'}}.$$
This number is sometimes referred to as the Bravais–Pearson correlation coefficient.

Remark 1.1.6 (Degenerate case). When $\sigma_n^j = 0$ or $\sigma_n^{j'} = 0$, we define $\mathrm{Corr}(x^j, x^{j'}) = 0$.

The empirical correlation coefficient possesses a first geometrical interpretation: indeed, with the notation
$$\mathbf{1}_n = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} \in \mathbb{R}^n,$$
it is easily observed that $\mathrm{Corr}(x^j, x^{j'})$ is the cosine of the angle made in $\mathbb{R}^n$ by the (column) vectors $x^j - \overline{x}_n^j \mathbf{1}_n$ and $x^{j'} - \overline{x}_n^{j'} \mathbf{1}_n$. In particular, the empirical correlation takes its values between $-1$ and $1$, and it is close to $1$ (respectively $-1$, and $0$) if the vectors are aligned (respectively aligned in opposite directions, and orthogonal).
These remarks also lead to the statistical interpretation of the empirical correlation: a positive correlation indicates that the features typically take simultaneously large or simultaneously small values; a negative correlation indicates that the features typically vary in opposite directions; and a correlation close to 0 indicates that the value of one feature has a weak influence on the value of the other.
Positive correlations are often interpreted as the trace of a causal relation between two phenomena: for instance, the release of carbon dioxide into the atmosphere is strongly correlated with global temperatures, and it is natural to assume that global warming is, at least partially, caused by the increase in the emission of polluting gases. However, basing this conclusion only on the observation of the correlation is a methodological error, for several reasons — among which is the fact that the correlation is symmetric, so that it would be equally legitimate to conclude that global warming is responsible for the increase in carbon dioxide emissions. Figure 1.1 depicts an example of two data sets with a strong empirical correlation but no actual causality.

[Figure 1.1 here: the yearly number of swimming pool drownings plotted together with the yearly number of films Nicolas Cage appeared in, 1999–2009.]

Figure 1.1: An example of two data sets with a strong correlation, taken from the website Spurious Correlations: http://www.tylervigen.com/spurious-correlations.

In order to visualise the pairwise correlations between the features, one may represent the scatter plots5 $(x_i^j, x_i^{j'})$, $1 \leq i \leq n$, for all pairs of indices $(j, j')$. The example of decathlon results is shown on Figure 1.2. Another representation of pairwise correlations is sometimes provided by heatmaps6, in which cells are coloured according to the value of the corresponding correlation.
Scatter plots and heatmaps are a first possible representation of the relations between the different features, but on the one hand they do not allow one to see the relations between groups of $k \geq 3$ features, and on the other hand their analysis may become difficult when the number $p$ of features is large. Principal Component Analysis, presented in the next section, provides a more global view of the set of points $x_1, \ldots, x_n$ in $\mathbb{R}^p$, which is also based on the analysis of empirical correlations.

[Figure 1.2 here: matrix of pairwise scatter plots for the features R100m, LongJump, ShotPut, HighJump, R400m, R110mH, Discus, PoleVault, Javelin and R1500m.]

Figure 1.2: Scatter plots of the results of Table 1.1 converted into points. The results for 400m and 110m hurdles appear to be very positively correlated, those for 100m and shot put look rather negatively correlated, and those for long jump and discus throw look rather uncorrelated.

5 Nuages de points en français.
6 Type help("heatmap") in R.

1.2 Principal Component Analysis


The purpose of Principal Component Analysis (PCA) is to look for orthogonal projections of a set of $n$ points $x_1, \ldots, x_n$ in $\mathbb{R}^p$ onto an affine subspace of dimension $k \leq p$ that best preserve the original shape of the set, in order, for example, to reduce the dimensionality of the data while losing as little information as possible. On the example of Figure 1.3, the projections of a set of $n = 3$ points in the plane $\mathbb{R}^2$ onto two different lines are shown. Intuitively, one would like to select the line on which the projected points seem farther away from each other. This notion of deformation will be measured through the empirical variance of the projected set7.

[Figure 1.3 here: three points in the plane and their projections onto the lines D1 and D2.]

Figure 1.3: A set of n = 3 points in the plane, projected onto the affine lines D1 and D2 . The
projection onto D1 seems to yield a better representation of the original set than the projection
onto D2 .

1.2.1 Normalising the PCA


Let us first remark that the shape of the set of points $x_1, \ldots, x_n$ in $\mathbb{R}^p$, and thus of its orthogonal projection onto any affine subspace $H$, does not change if all points are translated simultaneously in $\mathbb{R}^p$. It is therefore equivalent to work with $x_1, \ldots, x_n$ or with the centered set $x_1 - \overline{x}_n, \ldots, x_n - \overline{x}_n$.
When the features $x^1, \ldots, x^p$ are not homogeneous, the choice of their units has a dramatic influence on the shape of the set of points. Indeed, keeping for instance the first two columns of Table 1.1, the scatter plot obtained in $\mathbb{R}^2$ will look very different depending on whether the results of long jump are expressed in centimetres or in metres! In such a situation, it is recommended to remove the arbitrariness lying in the choice of units by normalising each feature, that is to say by replacing $x^j - \overline{x}_n^j$ with $(x^j - \overline{x}_n^j)/\sigma_n^j$, for all $j \in \{1, \ldots, p\}$. After this transformation, the values of different features can be compared. The PCA is then called normalised.
When the features $x^1, \ldots, x^p$ are homogeneous, for instance the results of decathlon converted into points, they can already be compared with each other, and the normalisation step is no longer necessary. On the contrary, in this case, normalising the data may result in a loss of information. It is thus preferable to keep the PCA unnormalised.
In the sequel, we shall always assume that the set of points is centered, so that $\overline{x}_n = 0$. However, we do not necessarily assume that the data are normalised.
7 Inertie en français — one may think about the connection with the notion of moment of inertia of a rigid body in mechanics.
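In R, the centering and optional normalisation described above can be performed with scale(); whether scale = TRUE is passed corresponds to the choice between normalised and unnormalised PCA. A minimal sketch, where raw_results denotes a (hypothetical) matrix of raw results and xn the matrix of points; note that scale() divides by the standard deviation computed with the 1/(n − 1) convention, which differs from σnj only by a constant factor:

    # Normalised PCA: appropriate for the raw results, whose units are heterogeneous.
    raw_norm <- scale(raw_results, center = TRUE, scale = TRUE)

    # Unnormalised PCA: appropriate for the results in points, which are homogeneous.
    pts_centered <- scale(xn, center = TRUE, scale = FALSE)

    colMeans(pts_centered)   # numerically zero: the set of points is now centered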

1.2.2 Quantifying the notion of deformation


It is obvious on Figure 1.3 that the shape of the projection of $x_1, \ldots, x_n$ onto the affine subspace $H$ does not change if $H$ is translated in $\mathbb{R}^p$. As a consequence, it is sufficient to restrict our study to affine subspaces containing the origin, that is to say, linear subspaces8.
Let $H$ be a linear subspace of $\mathbb{R}^p$, and for all $i \in \{1, \ldots, n\}$, let $x_i^H$ be the orthogonal projection of $x_i$ onto $H$. Since it is assumed that $\overline{x}_n = 0$, the empirical variance of $x_1, \ldots, x_n$ writes
$$\sigma_n^2 = \frac{1}{n} \sum_{i=1}^n \|x_i\|^2,$$
while the empirical variance of $x_1^H, \ldots, x_n^H$ writes
$$(\sigma_n^H)^2 = \frac{1}{n} \sum_{i=1}^n \|x_i^H\|^2.$$

Exercise 1.2.1. If the PCA is normalised, show that $\sigma_n^2 = p$. ◦

By Pythagoras' Theorem,
$$\sigma_n^2 = (\sigma_n^H)^2 + \frac{1}{n} \sum_{i=1}^n \|x_i - x_i^H\|^2,$$
see Figure 1.4.

[Figure 1.4 here.]

Figure 1.4: $x_i^H$ is the orthogonal projection of $x_i$ onto $H$.

PCA is based on the idea that the deformation of the set of points by the orthogonal projection is measured by the quantity $\sigma_n^2 - (\sigma_n^H)^2$. As a consequence, given $k \leq p$, we shall look for the linear subspace $H$, of dimension $k$, which maximises $(\sigma_n^H)^2$.

1.2.3 Computation of optimal subspaces


As a preliminary to the statement of the main result of PCA below, let us recall that, following the Spectral Theorem, the empirical covariance matrix $K_n$ is diagonalisable in an orthonormal basis $(e_1, \ldots, e_p)$ of $\mathbb{R}^p$. We denote by $\lambda_1 \geq \cdots \geq \lambda_p \geq 0$ the eigenvalues ranked nonincreasingly, which are nonnegative according to Proposition 1.1.3. The eigenvectors $e_j$ are considered as row vectors, and thus satisfy
$$e_j K_n = \lambda_j e_j, \qquad j \in \{1, \ldots, p\}.$$
8 Sous-espaces vectoriels en français.

Theorem 1.2.2 (Computation of optimal subspaces). For all $k \in \{1, \ldots, p\}$, let $H_k = \mathrm{Span}(e_1, \ldots, e_k)$ be the linear space spanned9 by $e_1, \ldots, e_k$. We have
$$(\sigma_n^{H_k})^2 = \sum_{j=1}^k \lambda_j,$$
and for all linear subspaces $H$ of dimension $k$,
$$(\sigma_n^H)^2 \leq (\sigma_n^{H_k})^2.$$

Hence, the computation of optimal subspaces is reduced to the diagonalisation of the empirical covariance matrix.

Proof. For all $i \in \{1, \ldots, n\}$, we have
$$\|x_i^{H_k}\|^2 = \sum_{j=1}^k \langle x_i, e_j \rangle^2 = \sum_{j=1}^k \langle e_j x_i^\top x_i, e_j \rangle,$$
therefore
$$(\sigma_n^{H_k})^2 = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^k \langle e_j x_i^\top x_i, e_j \rangle = \sum_{j=1}^k \langle e_j K_n, e_j \rangle = \sum_{j=1}^k \lambda_j,$$
whence the first identity.

We now let $H$ be a linear subspace of $\mathbb{R}^p$ of dimension $k$, and fix an orthonormal basis $(f_1, \ldots, f_k)$ of $H$. By definition,
$$(\sigma_n^H)^2 = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^k \langle x_i, f_j \rangle^2.$$
Writing $f_j = \sum_{l=1}^p \langle f_j, e_l \rangle e_l$ for all $j \in \{1, \ldots, k\}$, we get $\langle x_i, f_j \rangle = \sum_{l=1}^p \langle f_j, e_l \rangle \langle x_i, e_l \rangle$, whence
$$\begin{aligned}
(\sigma_n^H)^2 &= \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^k \sum_{l,l'=1}^p \langle f_j, e_l \rangle \langle x_i, e_l \rangle \langle f_j, e_{l'} \rangle \langle x_i, e_{l'} \rangle\\
&= \sum_{j=1}^k \sum_{l,l'=1}^p \langle f_j, e_l \rangle \langle f_j, e_{l'} \rangle \, \frac{1}{n} \sum_{i=1}^n \langle e_l x_i^\top x_i, e_{l'} \rangle\\
&= \sum_{j=1}^k \sum_{l,l'=1}^p \langle f_j, e_l \rangle \langle f_j, e_{l'} \rangle \langle e_l K_n, e_{l'} \rangle.
\end{aligned}$$
But for all $l, l' \in \{1, \ldots, p\}$,
$$\langle e_l K_n, e_{l'} \rangle = \begin{cases} 0 & \text{if } l \neq l',\\ \lambda_l & \text{if } l = l'. \end{cases}$$
As a consequence,
$$(\sigma_n^H)^2 = \sum_{j=1}^k \sum_{l=1}^p \lambda_l \langle f_j, e_l \rangle^2 = \sum_{l=1}^p \lambda_l \|e_l^H\|^2,$$
where $e_l^H$ is the orthogonal projection of $e_l$ onto $H$.
To complete the proof, it remains to show that $\sum_{l=1}^p \lambda_l \|e_l^H\|^2$ is bounded from above by $\sum_{l=1}^k \lambda_l$. To this aim, we rely on the following two remarks:
(i) by Pythagoras' Theorem, $\|e_l^H\|^2 \leq \|e_l\|^2 = 1$;
(ii) $\sum_{l=1}^p \|e_l^H\|^2 = \sum_{l=1}^p \sum_{j=1}^k \langle f_j, e_l \rangle^2 = \sum_{j=1}^k \sum_{l=1}^p \langle e_l, f_j \rangle^2 = \sum_{j=1}^k \|f_j\|^2 = k;$
which allow us to write
$$\begin{aligned}
\sum_{l=1}^p \lambda_l \|e_l^H\|^2 &= \sum_{l=1}^k \lambda_l \|e_l^H\|^2 + \sum_{l=k+1}^p \lambda_l \|e_l^H\|^2\\
&\leq \sum_{l=1}^k \lambda_l \|e_l^H\|^2 + \lambda_k \sum_{l=k+1}^p \|e_l^H\|^2\\
&= \sum_{l=1}^k \lambda_l \|e_l^H\|^2 + \lambda_k \left( k - \sum_{l=1}^k \|e_l^H\|^2 \right)\\
&= \sum_{l=1}^k \left( \lambda_l \|e_l^H\|^2 + \lambda_k (1 - \|e_l^H\|^2) \right)\\
&\leq \sum_{l=1}^k \left( \lambda_l \|e_l^H\|^2 + \lambda_l (1 - \|e_l^H\|^2) \right) = \sum_{l=1}^k \lambda_l,
\end{aligned}$$
where we have used (ii) at the third line, (i) at the fifth line, and the fact that $\lambda_1 \geq \cdots \geq \lambda_p$. This inequality completes the proof.

9 Sous-espace vectoriel engendré en français.
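Theorem 1.2.2 translates directly into a few lines of R: diagonalising the empirical covariance matrix yields the loadings e1, . . . , ep and the eigenvalues λ1 ≥ · · · ≥ λp. A minimal sketch, reusing the centered matrix pts_centered from Section 1.2.1; the built-in prcomp() performs the same computation, up to the 1/n versus 1/(n − 1) convention:

    n  <- nrow(pts_centered)
    Kn <- cov(pts_centered) * (n - 1) / n    # empirical covariance, 1/n convention

    eig    <- eigen(Kn, symmetric = TRUE)
    lambda <- eig$values                      # lambda_1 >= ... >= lambda_p >= 0
    E      <- eig$vectors                     # loadings e_1, ..., e_p, as columns

    # Empirical variance of the projection onto the optimal subspace H_k, e.g. k = 2.
    k <- 2
    sum(lambda[1:k])                          # (sigma_n^{H_k})^2 by Theorem 1.2.2

    # Scores of the individuals: column l of 'scores' is the principal component c^l.
    scores <- pts_centered %*% E

    # Cross-check against prcomp (loadings agree up to the sign of each column).
    pca <- prcomp(pts_centered, center = FALSE, scale. = FALSE)
    max(abs(abs(pca$rotation) - abs(E)))      # should be numerically close to 0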

1.2.4 Interpretation of PCA


The vectors $e_1, \ldots, e_p$ provided by Theorem 1.2.2 are called loadings, and must be understood as 'implicit features' of the model, aggregating the information of several original features. For all $i \in \{1, \ldots, n\}$, the coordinate $\langle x_i, e_l \rangle$ of $x_i$ on the axis $e_l$ is called the score of $x_i$ on the $l$-th principal component.
For all $l \in \{1, \ldots, p\}$, we denote by $(e_l^1, \ldots, e_l^p)$ the coordinates of the vector $e_l$ in the canonical basis of $\mathbb{R}^p$.

Definition 1.2.3 (Principal component). The $l$-th principal component is the (column) vector $c^l \in \mathbb{R}^n$ defined by
$$c^l = \sum_{j=1}^p e_l^j x^j.$$

The $l$-th principal component is the column vector of $\mathbb{R}^n$ which contains the scores of the $n$ individuals on the $l$-th loading.

Proposition 1.2.4 (Properties of principal components). The principal components are uncorrelated: $\mathrm{Corr}(c^l, c^{l'}) = 0$ if $l \neq l'$. Besides, for all $j \in \{1, \ldots, p\}$,
$$\forall l \in \{1, \ldots, p\}, \qquad \mathrm{Corr}(c^l, x^j) = \frac{\sqrt{\lambda_l}\, e_l^j}{\sigma_n^j},$$
and
$$\sum_{l=1}^p \mathrm{Corr}(c^l, x^j)^2 = 1. \tag{$*$}$$

Proof. We first let $l, l' \in \{1, \ldots, p\}$, and write
$$\frac{1}{n} \sum_{i=1}^n c_i^l c_i^{l'} = \frac{1}{n} \sum_{i=1}^n \sum_{j,j'=1}^p e_l^j x_i^j e_{l'}^{j'} x_i^{j'} = \sum_{j,j'=1}^p e_l^j e_{l'}^{j'} K_n^{jj'} = \langle e_l K_n, e_{l'} \rangle = \begin{cases} 0 & \text{if } l \neq l',\\ \lambda_l & \text{if } l = l'. \end{cases}$$
This implies that $\mathrm{Corr}(c^l, c^{l'}) = 0$ if $l \neq l'$. Notice that this result also shows that if $\lambda_l = 0$, then $c^l = 0$.
Now for $l, j \in \{1, \ldots, p\}$,
$$\frac{1}{n} \sum_{i=1}^n c_i^l x_i^j = \frac{1}{n} \sum_{i=1}^n \sum_{k=1}^p e_l^k x_i^k x_i^j = (e_l K_n)^j = \lambda_l e_l^j,$$
and by the first computation above, $\frac{1}{n} \sum_{i=1}^n (c_i^l)^2 = \lambda_l$ while $\frac{1}{n} \sum_{i=1}^n (x_i^j)^2 = K_n^{jj} = (\sigma_n^j)^2$. As a consequence,
$$\mathrm{Corr}(c^l, x^j) = \frac{\lambda_l e_l^j}{\sqrt{\lambda_l} \sqrt{K_n^{jj}}} = \frac{\sqrt{\lambda_l}\, e_l^j}{\sigma_n^j},$$
therefore
$$\sum_{l=1}^p \mathrm{Corr}(c^l, x^j)^2 = \sum_{l=1}^p \frac{\lambda_l (e_l^j)^2}{(\sigma_n^j)^2} = \frac{1}{K_n^{jj}} \sum_{l=1}^p (e_l K_n)^j e_l^j = \frac{1}{K_n^{jj}} \sum_{k,l=1}^p e_l^k K_n^{kj} e_l^j = \frac{1}{K_n^{jj}} \sum_{k=1}^p K_n^{kj} \sum_{l=1}^p e_l^k e_l^j.$$
Now, since the family $(e_1, \ldots, e_p)$ is orthonormal, the matrix with coefficients $\sum_{l=1}^p e_l^k e_l^j$ is the identity, so that
$$\sum_{l=1}^p e_l^k e_l^j = \begin{cases} 0 & \text{if } k \neq j,\\ 1 & \text{if } k = j. \end{cases}$$
We finally conclude that
$$\sum_{l=1}^p \mathrm{Corr}(c^l, x^j)^2 = \frac{1}{K_n^{jj}} K_n^{jj} = 1,$$
which is the formula of Equation ($*$).

Remark 1.2.5 (Correlation sphere). Notice that Equation ($*$) implies that, for all $j \in \{1, \ldots, p\}$, the point of $\mathbb{R}^p$ with coordinates $\mathrm{Corr}(c^1, x^j), \ldots, \mathrm{Corr}(c^p, x^j)$ is located on the unit sphere of $\mathbb{R}^p$. This sphere is called the correlation sphere, see Figure 1.5. ◦

The plane spanned by $(c^1, c^2)$ is called the first factorial plane. In order to visualise the correlations between the original features and the principal components, it is customary to represent the points of coordinates $(\mathrm{Corr}(c^1, x^j), \mathrm{Corr}(c^2, x^j))$, for $j \in \{1, \ldots, p\}$, in this plane. By Remark 1.2.5, these points are located within the unit circle, which is called the correlation circle of the first factorial plane.

[Figure 1.5 here.]

Figure 1.5: The correlation sphere for $p = 3$: the dot on the sphere is the point with coordinates $\mathrm{Corr}(c^1, x^j), \mathrm{Corr}(c^2, x^j), \mathrm{Corr}(c^3, x^j)$; its projection onto the first factorial plane gives the position of $x^j$ within the correlation circle.
[Figure 1.6 here: correlation circles in the planes $(c^1, c^2)$ and $(c^2, c^3)$, with the ten features placed according to their correlations with the principal components.]

Figure 1.6: Correlation circles of the first three principal components for the unnormalised results of decathlon.

The closer to the circle a point is, the better the feature $x^j$ is explained by the principal components $c^1$ and $c^2$. On the contrary, if a point is close to the origin, the corresponding feature is weakly correlated with the first two principal components, and it may be useful to plot the correlation circle in subsequent factorial planes so as to make the correlations with subsequent principal components appear. How to select the total number of principal components to take into account will be discussed in Section 1.2.5.
For the example of decathlon results, the correlation circles of the first two factorial planes are plotted on Figure 1.6. We observe that the first principal component is highly correlated with the results of long jump, 400m, 110m hurdles, high jump, 1500m and pole vault. We may therefore suggest interpreting the score of an athlete on this component as an indication of his ability to run relatively short distances and to jump. The second principal component is positively correlated with the results of the throwing events, so that the score of an athlete on this component quantifies his ability to throw things. The third principal component makes it possible to discriminate, among the athletes having the same score on the first component, between jumpers and runners.
The projections of the set of points $x_1, \ldots, x_n$ onto the first two factorial planes are plotted on Figure 1.7. Eaton, Mayer and Warner appear to be the best runners/jumpers (in the sense of the first principal component), but their scores on the third principal component allow us to refine their profiles by showing that Warner is more of a runner, Mayer a jumper, and Eaton in between. Likewise, the scores on the second principal component allow us to identify Suarez and Victor as having a thrower profile.
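The correlation circle and the projection onto the first factorial plane can be drawn with a few lines of base R. A minimal sketch, reusing Kn, lambda, E and scores from the previous sketch, and applying the formula Corr(c^l, x^j) = √λl e_l^j / σn^j of Proposition 1.2.4:

    sigma_j <- sqrt(diag(Kn))                          # sigma_n^j for each feature

    # corr_cx[j, l] = sqrt(lambda_l) * e_l^j / sigma_n^j
    corr_cx <- sweep(sweep(E, 2, sqrt(lambda), "*"), 1, sigma_j, "/")
    rowSums(corr_cx^2)                                 # all equal to 1: Equation (*)

    # Correlation circle in the first factorial plane (as in Figure 1.6, left).
    theta <- seq(0, 2 * pi, length.out = 200)
    plot(cos(theta), sin(theta), type = "l", asp = 1,
         xlab = "Correlation with c1", ylab = "Correlation with c2")
    text(corr_cx[, 1], corr_cx[, 2], labels = colnames(pts_centered))

    # Individuals in the first factorial plane (as in Figure 1.7, left).
    plot(scores[, 1], scores[, 2], asp = 1, xlab = "e1", ylab = "e2")
    text(scores[, 1], scores[, 2], labels = rownames(pts_centered), pos = 3)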
[Figure 1.7 here: the athletes plotted in the first two factorial planes.]

Figure 1.7: Projections of the set of points x1 , . . . , xn on the first two factorial planes. For the
sake of legibility, only the first 16 athletes are represented.

1.2.5 Proportion of variance explained by PCA


Definition 1.2.6 (Proportion of variance explained by axes). For all $k \in \{1, \ldots, p\}$, the proportion of variance explained by the first $k$ axes of the PCA is the ratio
$$\frac{(\sigma_n^{H_k})^2}{\sigma_n^2} = \frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^p \lambda_j}.$$

We call scree plot10 the graphical representation of the eigenvalues $\lambda_1, \ldots, \lambda_p$. It is plotted on Figure 1.8 for the example of decathlon. It is usually observed that this plot exhibits an inflection in the decrease after a few eigenvalues, which makes it possible to separate the axes containing the actual information on the data (those before the inflection) from the axes merely corresponding to statistical fluctuations (those after the inflection). Thus, only the factorial planes corresponding to principal components before the inflection need to be taken into account.
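In R, the scree plot and the proportions of variance of Definition 1.2.6 are obtained directly from the eigenvalues; a minimal sketch, reusing lambda from the PCA sketch above:

    # Scree plot: eigenvalues against their index, as in Figure 1.8.
    plot(lambda, type = "b", xlab = "Eigenvalue index", ylab = "Lambda")

    # Proportion of variance explained by the first k axes, for k = 1, ..., p.
    explained <- cumsum(lambda) / sum(lambda)
    round(explained, 3)
    # k is then typically chosen at the inflection of the scree plot.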
[Figure 1.8 here: two scree plots, eigenvalues (Lambda) against eigenvalue index.]

Figure 1.8: Scree plots for the PCA on the data of Table 1.1, with normalised raw results on the
left, and unnormalised results converted into points on the right. An inflection in the decrease
is observed at the fourth eigenvalue. The proportion of variance explained by the 4 first axes is
79.5% for raw results, and 81.0% for converted results.

1.3 Clustering methods


For the decathlon data, PCA defines scores of athletes on principal components, which are interpreted as quantitative indicators of their performance in various types of efforts (sprinting/jumping, throwing, etc.). Depending on the score of each athlete on the first few principal components, one may determine his profile: sprinter, thrower, jumper, endurance runner... Naturally, athletes with the same profile should have close features, and therefore the corresponding points $x_i$ should be close to each other in $\mathbb{R}^p$. The purpose of clustering methods is to determine such groups of individuals with similar features automatically, by constructing a partition of the set of individuals into classes inside which the features are similar.
The number of partitions of a set of $n$ points is called the Bell number and denoted by $B_n$.

Exercise 1.3.1. Show that the Bell numbers satisfy the recursive relation
$$B_{n+1} = \sum_{k=0}^n \binom{n}{k} B_k,$$
for all $n \geq 0$. Hint: fix one of the $n+1$ points of the set, and for all $k \in \{1, \ldots, n+1\}$, compute how many partitions are such that the class containing this given point has cardinality $k$. ◦
10 Diagramme d'éboulis en français.
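The recurrence of Exercise 1.3.1 is straightforward to implement; the following short R sketch computes the first Bell numbers and illustrates how fast they grow.

    # Bell numbers via B_{n+1} = sum_{k=0}^{n} choose(n, k) * B_k, with B_0 = 1.
    bell <- function(nmax) {
      B <- numeric(nmax + 1)      # B[i] stores B_{i-1}
      B[1] <- 1
      for (n in 0:(nmax - 1)) {
        B[n + 2] <- sum(choose(n, 0:n) * B[1:(n + 1)])
      }
      B
    }

    bell(10)
    # 1 1 2 5 15 52 203 877 4140 21147 115975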

[Figure 1.9 here.]

Figure 1.9: Clustering of 7 points of $\mathbb{R}^2$ in 3 classes.

The result of Exercise 1.3.1 indicates that $B_n$ grows very fast. In practice, it is therefore inconceivable to compare all possible partitions. We describe two families of clustering methods: the $k$-means algorithm and hierarchical clustering.

1.3.1 k-means algorithm


We assume that the number $k$ of classes is predetermined, and we look for the partition $S_1, \ldots, S_k$ which minimises the within-cluster Sum of Squares11
$$\mathrm{WCSS} = \sum_{j=1}^k \sum_{i \in S_j} \|x_i - \overline{x}_j\|^2, \qquad \overline{x}_j = \frac{1}{|S_j|} \sum_{i \in S_j} x_i.$$
Notice that, for each class $S_j$, the quantity $\sum_{i \in S_j} \|x_i - \overline{x}_j\|^2$ is nothing but $|S_j|$ times the empirical variance of the data set $\{x_i,\ i \in S_j\}$.
The k-means algorithm12, also known as Lloyd's algorithm, works as follows.

• Initialise the algorithm with a given partition S_1^{(0)}, ..., S_k^{(0)} of {1, ..., n}, with which the means x̄_1^{(0)}, ..., x̄_k^{(0)} are associated.

• Given a partition S_1^{(t)}, ..., S_k^{(t)} with associated means x̄_1^{(t)}, ..., x̄_k^{(t)}, the (t+1)-th iteration of the algorithm is divided into two steps:

  – Assignment step: for each i ∈ {1, ..., n}, find the index J(i) of the mean x̄_j^{(t)} which is closest to x_i (if there are several such indices, take the smallest).

  – Update step: define S_j^{(t+1)} = {i ∈ {1, ..., n} : J(i) = j}, and compute the associated means x̄_1^{(t+1)}, ..., x̄_k^{(t+1)}. If a class S_j^{(t+1)} is empty, simply remove it from the partition.

• Stop the algorithm when S_j^{(t+1)} = S_j^{(t)} for all j.

Notice that the algorithm actually returns a partition of {1, ..., n} into k' ≤ k classes because of the possibility to remove empty classes.
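A minimal R sketch of these two steps is given below; it only mirrors the pseudo-code above (random initialisation, no handling of ties beyond the smallest index, no handling of empty classes) and is not meant to replace the optimised implementations available in R.

```r
# Minimal sketch of Lloyd's algorithm (assumes a numeric matrix x with n rows and p columns)
lloyd <- function(x, k, max_iter = 100) {
  assign <- sample(1:k, nrow(x), replace = TRUE)      # initial partition, drawn at random
  for (t in 1:max_iter) {
    centers <- t(sapply(1:k, function(j) colMeans(x[assign == j, , drop = FALSE])))
    # assignment step: index of the closest mean for each point
    d2 <- sapply(1:k, function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_assign <- max.col(-d2, ties.method = "first")
    if (all(new_assign == assign)) break              # stop when the partition is unchanged
    assign <- new_assign                              # update step
  }
  list(assignment = assign, centers = centers)
}
```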
Exercise 1.3.2. Apply the algorithm on the set of points of Figure 1.9 with various choices for the initial classes S_1^{(0)}, S_2^{(0)}, S_3^{(0)}. ◦
11
Inertie intraclasse en français.
12
Centres mobiles ou nuées dynamiques en français.

Lemma 1.3.3 (Monotonicity of the WCSS). For all t ≥ 0, let WCSS^{(t)} denote the within-cluster Sum of Squares of the partition S_1^{(t)}, ..., S_k^{(t)}. Then WCSS^{(t+1)} ≤ WCSS^{(t)}.

Proof. The assignment step ensures that, for all j ∈ {1, ..., k}, for all i ∈ S_j^{(t)},
$$\|x_i - \bar{x}_j^{(t)}\|^2 \geq \|x_i - \bar{x}_{J(i)}^{(t)}\|^2.$$
As a consequence,
$$\mathrm{WCSS}^{(t)} = \sum_{j=1}^{k} \sum_{i \in S_j^{(t)}} \|x_i - \bar{x}_j^{(t)}\|^2 \geq \sum_{j=1}^{k} \sum_{i \in S_j^{(t)}} \|x_i - \bar{x}_{J(i)}^{(t)}\|^2 = \sum_{i=1}^{n} \|x_i - \bar{x}_{J(i)}^{(t)}\|^2 = \sum_{j=1}^{k} \sum_{i \in S_j^{(t+1)}} \|x_i - \bar{x}_j^{(t)}\|^2,$$
where we have used the definition of the classes S_j^{(t+1)} in the update step at the last line. For all j ∈ {1, ..., k}, for all i ∈ S_j^{(t+1)},
$$\|x_i - \bar{x}_j^{(t)}\|^2 = \|x_i - \bar{x}_j^{(t+1)}\|^2 + 2\left\langle x_i - \bar{x}_j^{(t+1)}, \bar{x}_j^{(t+1)} - \bar{x}_j^{(t)} \right\rangle + \|\bar{x}_j^{(t+1)} - \bar{x}_j^{(t)}\|^2 \geq \|x_i - \bar{x}_j^{(t+1)}\|^2 + 2\left\langle x_i - \bar{x}_j^{(t+1)}, \bar{x}_j^{(t+1)} - \bar{x}_j^{(t)} \right\rangle,$$
and the definition of x̄_j^{(t+1)} shows that
$$\sum_{i \in S_j^{(t+1)}} \left\langle x_i - \bar{x}_j^{(t+1)}, \bar{x}_j^{(t+1)} - \bar{x}_j^{(t)} \right\rangle = \left\langle \sum_{i \in S_j^{(t+1)}} \left( x_i - \bar{x}_j^{(t+1)} \right), \bar{x}_j^{(t+1)} - \bar{x}_j^{(t)} \right\rangle = 0.$$
As a consequence,
$$\sum_{j=1}^{k} \sum_{i \in S_j^{(t+1)}} \|x_i - \bar{x}_j^{(t)}\|^2 \geq \sum_{j=1}^{k} \sum_{i \in S_j^{(t+1)}} \|x_i - \bar{x}_j^{(t+1)}\|^2 = \mathrm{WCSS}^{(t+1)},$$
which completes the proof.

The statement of Lemma 1.3.3 is not sufficient to ensure that the number of iterations of the algorithm is finite, as one might imagine that after a certain index t_0, the sequence {S_1^{(t)}, ..., S_k^{(t)}; t ≥ t_0} be periodic, visiting partitions which all have the same WCSS. This point is addressed in Exercise 1.A.6, where it is proved that such periodic situations cannot occur.
In general, the outcome of the algorithm is not the optimal partition: it is only some kind of
‘local minimum’ of the WCSS, which heavily depends on the choice of the initial partition. In
order to find an acceptable solution, it is customary to draw several different initial configurations
(say at random), run the k-means algorithm in parallel, and select the final configuration with the
smallest WCSS.
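In R, this strategy of multiple random initialisations is already built into the stats::kmeans function through its nstart argument; a minimal sketch with placeholder data:

```r
# Minimal sketch: k-means with several random initialisations, keeping the best run
x <- matrix(rnorm(200), ncol = 2)          # placeholder data set
km <- kmeans(x, centers = 5, nstart = 25)  # 25 random initial configurations
km$tot.withinss                            # WCSS of the retained partition
km$cluster                                 # class of each individual
```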

[Figure: scatter plot of the athletes in the first factorial plane (axes e1 and e2); see the caption below.]

Figure 1.10: Outcome of one realisation of the clustering by the k-means algorithm for 5 classes
on the results of decathlon. The athletes are represented in the first factorial plane.

1.3.2 Hierarchical clustering


Unlike the k-means algorithm, for which the number of classes is predetermined, hierarchical
methods provide a sequence of partitions with increasing, or decreasing, number of classes. We
focus on the agglomerative algorithm, which is a ‘bottom up’ approach:
• start with n classes S_1^{(0)}, ..., S_n^{(0)} containing one element each,

• at the (t+1)-th step, select among the n − t remaining classes the pair (S_j^{(t)}, S_l^{(t)}) which minimises a given distance d(S_j^{(t)}, S_l^{(t)}), and aggregate the classes S_j^{(t)} and S_l^{(t)} into a single class.
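In R, agglomerative hierarchical clustering is provided by the stats::hclust function; a minimal sketch with Ward linkage on placeholder data reads as follows.

```r
# Minimal sketch: agglomerative clustering with Ward linkage (placeholder data)
x <- matrix(rnorm(200), ncol = 2)
hc <- hclust(dist(x), method = "ward.D2")  # Ward linkage on Euclidean distances
plot(hc)                                   # dendrogram, of the kind shown in Figure 1.11
cutree(hc, k = 5)                          # partition into 5 classes
```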

The algorithm depends on the choice of a notion of distance d(S, S') between two sets of points S and S'. Among the many possible choices of distances13, the Ward distance
$$d_W(S, S') = \frac{|S|\,|S'|}{|S| + |S'|} \|\bar{x}_S - \bar{x}_{S'}\|^2, \qquad \bar{x}_S := \frac{1}{|S|} \sum_{x \in S} x, \quad \bar{x}_{S'} := \frac{1}{|S'|} \sum_{x \in S'} x,$$
has convenient properties related to the WCSS.

Lemma 1.3.4 (Ward distance and WCSS). Let S and S' be two sets of points with respective means x̄_S and x̄_{S'}, and let x̄_{S∪S'} refer to the mean of S ∪ S'. We have
$$\sum_{x \in S \cup S'} \|x - \bar{x}_{S \cup S'}\|^2 = \sum_{x \in S} \|x - \bar{x}_S\|^2 + \sum_{x \in S'} \|x - \bar{x}_{S'}\|^2 + d_W(S, S').$$
13
See for example https://en.wikipedia.org/wiki/Hierarchical_clustering.

[Figure: cluster dendrogram produced by hclust with complete linkage; the vertical axis is the height at which classes are merged and the leaves are the athletes.]

Figure 1.11: Dendrogram of the agglomerative hierarchical clustering on the results of decathlon.

Proof. We first write
$$\sum_{x \in S \cup S'} \|x - \bar{x}_{S \cup S'}\|^2 = \sum_{x \in S} \|x - \bar{x}_S + \bar{x}_S - \bar{x}_{S \cup S'}\|^2 + \sum_{x \in S'} \|x - \bar{x}_{S'} + \bar{x}_{S'} - \bar{x}_{S \cup S'}\|^2 = \sum_{x \in S} \left( \|x - \bar{x}_S\|^2 + 2\langle x - \bar{x}_S, \bar{x}_S - \bar{x}_{S \cup S'} \rangle + \|\bar{x}_S - \bar{x}_{S \cup S'}\|^2 \right) + \sum_{x \in S'} \left( \|x - \bar{x}_{S'}\|^2 + 2\langle x - \bar{x}_{S'}, \bar{x}_{S'} - \bar{x}_{S \cup S'} \rangle + \|\bar{x}_{S'} - \bar{x}_{S \cup S'}\|^2 \right).$$
We now remark that
$$\sum_{x \in S} \langle x - \bar{x}_S, \bar{x}_S - \bar{x}_{S \cup S'} \rangle = \left\langle \sum_{x \in S} (x - \bar{x}_S), \bar{x}_S - \bar{x}_{S \cup S'} \right\rangle = 0,$$
and a similar identity holds for the sum over S', so that
$$\sum_{x \in S \cup S'} \|x - \bar{x}_{S \cup S'}\|^2 = \sum_{x \in S} \|x - \bar{x}_S\|^2 + |S|\,\|\bar{x}_S - \bar{x}_{S \cup S'}\|^2 + \sum_{x \in S'} \|x - \bar{x}_{S'}\|^2 + |S'|\,\|\bar{x}_{S'} - \bar{x}_{S \cup S'}\|^2.$$
On the other hand,
$$\bar{x}_{S \cup S'} = \frac{1}{|S \cup S'|} \sum_{x \in S \cup S'} x = \frac{|S|}{|S| + |S'|} \bar{x}_S + \frac{|S'|}{|S| + |S'|} \bar{x}_{S'},$$
therefore
$$|S|\,\|\bar{x}_S - \bar{x}_{S \cup S'}\|^2 + |S'|\,\|\bar{x}_{S'} - \bar{x}_{S \cup S'}\|^2 = \left( \frac{|S|\,|S'|^2}{(|S| + |S'|)^2} + \frac{|S|^2\,|S'|}{(|S| + |S'|)^2} \right) \|\bar{x}_S - \bar{x}_{S'}\|^2 = d_W(S, S'),$$
which completes the proof.

As a corollary of Lemma 1.3.4, it is an easy exercise to deduce that the Ward distance allows one to minimise the increase of the WCSS along the iterations of the agglomerative algorithm.

Exercise 1.3.5. For all t ∈ {0, ..., n−1}, let WCSS^{(t)} denote the WCSS of the partition S_1^{(t)}, ..., S_{n−t}^{(t)} returned by the algorithm. For all t ∈ {0, ..., n−2}, for any j ≠ l ∈ {1, ..., n−t}, we denote by WCSS_{jl}^{(t)} the WCSS of the partition in which the classes S_j^{(t)} and S_l^{(t)} have been merged. Show that
$$\mathrm{WCSS}^{(t+1)} = \min_{j \neq l} \mathrm{WCSS}_{jl}^{(t)}. \qquad \circ$$

Once again, the partitions of the sequence returned by the algorithm are generally not optimal. In contrast with the k-means algorithm, they do not depend on an arbitrary choice of the initial configuration. However, the algorithmic complexity of the method makes it not very useful for large data sets.

Exercise 1.3.6. Find an example of a data set where the k-means algorithm, or the agglomerative
hierarchical classification, do not provide optimal results. ◦

1.A Exercises
Exercise 1.A.1. Let K_n be the covariance matrix of x_1, ..., x_n. Show that the row space14 of K_n, namely the set of (row) vectors {u K_n : u ∈ R^p}, is the linear subspace of R^p spanned by x_1 − x̄_n, ..., x_n − x̄_n. ◦

Exercise 1.A.2. We observe the set of pairs (x1i , x2i ), i ∈ {1, . . . , n}, which are related by the
identity x2i = αx1i + β, where α, β ∈ R. What is the value of the empirical correlation between
the features x1 and x2 ? ◦

Exercise 1.A.3. We observe the set of pairs x1i , x2i , i ∈ {1, . . . , n}, which are related by the
identity x2i = sin(πkx1i /2). For k large, draw a scatter plot of the sample, using a statistical
software, and assuming that the x1i are randomly distributed over [0, 1]. Compute the empirical
correlation. What do you conclude about the interpretation of the empirical correlation? ◦

Exercise 1.A.4 (Spearman's coefficient). For data sets which are related by a monotonic but non-linear function, the Bravais–Pearson coefficient may be inadequate to represent the dependency between the features. Let us take a set of pairs (x_i^1, x_i^2), i ∈ {1, ..., n}, such that both vectors of features have pairwise distinct coordinates. For any index i ∈ {1, ..., n}, we let r_i^1 and r_i^2 be the respective ranks of x_i^1 and x_i^2 among the series x_1^1, ..., x_n^1 and x_1^2, ..., x_n^2 sorted in increasing order; in other words, if σ^1 and σ^2 are the permutations of {1, ..., n} such that
$$x^1_{\sigma^1(1)} < \cdots < x^1_{\sigma^1(n)}, \qquad x^2_{\sigma^2(1)} < \cdots < x^2_{\sigma^2(n)},$$
then for all i ∈ {1, ..., n},
$$\sigma^1(r_i^1) = \sigma^2(r_i^2) = i.$$
We now define the Spearman coefficient by
$$r_s = \mathrm{Corr}(r^1, r^2).$$

1. Compute r 1 , r 2 and the Spearman coefficient for the set (15, 9), (12, 7), (18, 8).
14
Image à droite en français.

2. If x_i^2 = f(x_i^1) with f increasing, what is the value of r_s? And if f is decreasing?

3. Show that
$$r_s = 1 - \frac{6}{n(n^2 - 1)} \sum_{i=1}^{n} (r_i^1 - r_i^2)^2. \qquad \circ$$

Exercise 1.A.5 (Dobinski's formula). Let X be a random variable with Poisson distribution of parameter 1. For all n ≥ 0, we define b_n = E[X^n].

1. Express b_{n+1} as a function of b_n, ..., b_0.

2. Deduce the Dobinski formula
$$B_n = \frac{1}{e} \sum_{k=0}^{+\infty} \frac{k^n}{k!}$$
for the n-th Bell number. ◦

Exercise 1.A.6. Recall the k-means algorithm from Section 1.3.1.

1. Take a square with consecutive corners x_1, x_2, x_3, x_4, and let S_1^{(0)} = {1, 3} and S_2^{(0)} = {2, 4}. Compute WCSS^{(0)} and WCSS^{(1)}.

2. In general, using the proof of Lemma 1.3.3, show that if WCSS^{(t+1)} = WCSS^{(t)}, then the partitions S_1^{(t+1)}, ..., S_k^{(t+1)} and S_1^{(t+2)}, ..., S_k^{(t+2)} necessarily coincide.

3. Conclude on the finiteness of the k-means algorithm. ◦

Exercise 1.A.7 (Mixing PCA and clustering). Let n ≥ 1, K ∈ {1, ..., n} and n_1, ..., n_K ≥ 1 be such that n_1 + ... + n_K = n. Let µ_1, ..., µ_K ∈ R^p be pairwise distinct, and ε_1, ..., ε_n ∈ R^p. Let us define
$$\rho = \max_{1 \leq i \leq n} \|\varepsilon_i\|.$$
For all k ∈ {1, ..., K}, we write I_k = {n_1 + ... + n_{k−1} + 1, ..., n_1 + ... + n_k} and for all i ∈ I_k, we define x_i = µ_k + ε_i.

1. When ρ ≪ min_{1 ≤ k < k' ≤ K} ‖µ_k − µ_{k'}‖, describe the general shape of the set of points (x_i)_{1 ≤ i ≤ n} in R^p.

2. Give an upper bound on the WCSS of the partition {I1 , . . . , IK } of {1, . . . , n} in terms of
ρ. Show that when ρ → 0, this partition has the smallest WCSS.
3. Express the vector x̄_n ∈ R^p in terms of µ_1, ..., µ_K and ε̄_n = \frac{1}{n} \sum_{i=1}^{n} ε_i.

4. Show that the empirical covariance matrix Kn satisfies Kn = Kn0 + O(ρ), and give the
expression of the matrix Kn0 .

5. Describe the shape of the scree plot of the Principal Component Analysis of this data set
when ρ → 0. You may admit that the function A 7→ Spectrum of A is Lipschitz continuous
on the space of symmetric matrices.

6. Interpreting, for each individual xi , the vector µi as ‘actual information’ and the vector ǫi
as ‘noise’, what do you conclude? ◦

1.B Summary
1.B.1 Statistical description of a data set
The data set is the n × p matrix x_n, with coefficients x_i^j, i ∈ {1, ..., n}, j ∈ {1, ..., p}.

• Empirical mean: row vector x̄_n = \frac{1}{n} \sum_{i=1}^{n} x_i in R^p.

• Empirical covariance: symmetric nonnegative matrix K_n = \frac{1}{n} \sum_{i=1}^{n} (x_i − x̄_n)^\top (x_i − x̄_n).

• Empirical correlation between features x^j and x^{j'}: Corr(x^j, x^{j'}) = K_n^{jj'} / \sqrt{K_n^{jj} K_n^{j'j'}} in [−1, 1].
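These quantities are obtained in R with colMeans, cov and cor; note that R's cov and cor use the factor 1/(n−1) rather than 1/n, a detail worth keeping in mind when comparing with the formulas above (the data below are placeholders).

```r
# Minimal sketch: empirical summaries of a data matrix x
x <- matrix(rnorm(50), nrow = 10, ncol = 5)
colMeans(x)                        # empirical mean (row vector)
cov(x) * (nrow(x) - 1) / nrow(x)   # empirical covariance with the 1/n normalisation
cor(x)                             # empirical correlations (the normalisation cancels out)
```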

1.B.2 Principal Component Analysis


Assume that the set of points is centered, that is to say xn = 0.

• Purpose: find the linear subspace H of dimension k ≤ p which maximises $(\sigma_n^H)^2 = \frac{1}{n} \sum_{i=1}^{n} \|x_i^H\|^2$, where x_i^H is the orthogonal projection of x_i onto H.

• Optimal choice: diagonalise Kn in an orthonormal basis (e1 , . . . , ep ) associated with the


eigenvalues λ1 ≥ · · · ≥ λp ≥ 0 and take H = Span(e1 , . . . , ek ).

• The loadings (e1 , . . . , ep ) are interpreted as ‘aggregated features’ of the model, their relation
with the original features is visualised in the correlation circles.

1.B.3 Clustering
• Purpose: find a partition into k classes with similar features.

• Measure of similarity within the classes: WCSS.

• k-means allows to find a partition into a fixed number k of classes.

• Hierarchical methods provide a sequence of partitions with varying number of classes.


Chapter 2

Parametric estimation

The basic problem of statistical inference can be summarised as follows: given independent realisations x_1, ..., x_n of a random variable X with unknown distribution P, how to reconstruct, or estimate, the probability measure P? The knowledge of P is useful to understand the physical origin of the random phenomenon under study, and to predict the probability of occurrence of its future outcomes.
The main technical issue of the estimation of P is that one has to search within the set of
probability measures on a given state space, which may be huge. In the context of parametric
estimation, this search is restricted to a low-dimensional space by assuming that P has a certain
predetermined shape, which only depends on a few parameters.

2.1 General definitions


2.1.1 Statistical model
We denote by X the space in which the data, namely the observed random variables X1 , . . . , Xn ,
take their values. Generally, X can be a finite set, the set of integers, [0, +∞), R, Rl , or even
an infinite-dimensional space such as C([0, +∞)) if one observes sample paths, which is a usual
situation for instance in finance.
Throughout these notes, we will always assume that X is endowed with a certain σ-algebra1, with respect to which the random variables X_1, ..., X_n are properly defined. In particular, when X is finite or countably infinite, this σ-algebra will be the set of all subsets of X (the discrete σ-algebra); when X = R^l, we shall work with the Borel σ-algebra. It will always be implicit that the probability measures P or P_θ which we manipulate are defined on this σ-algebra.

Definition 2.1.1 (Parametric model). A parametric model is a set of probability measures

P = {Pθ , θ ∈ Θ}

on the state space X, indexed by a set of parameters Θ ⊂ Rq .

Once a parametric model is fixed, the issue of estimating a probability measure P is thus
reduced to the estimation of a parameter θ ∈ Rq .

Example 2.1.2 (The Exponential model). The lifespan of a lightbulb is a positive random vari-
able X > 0. If one assumes that this variable has an exponential distribution E(λ) for some
1
Tribu en français.

unknown parameter λ > 0, then the statistician’s task amounts to estimating the value of λ from
the observation of the values of independent random variables X1 , . . . , Xn distributed according
to E(λ).

The choice of a model depends on the physical features of the phenomenon under study. Ta-
ble 2.1 depicts a few possible (and classical) choices, depending on the type of the data.

Type of data State space X Possible model P Parameter set Θ


Binary {0, 1} Bernoulli B(p) p ∈ [0, 1]
Life span, without ageing (0, +∞) Exponential E(λ) λ ∈ (0, +∞)
Life span, with ageing (0, +∞) Gamma Γ(a, λ) (a, λ) ∈ (0, +∞) × (0, +∞)
Real-valued variable R Gaussian N(µ, σ 2 ) (µ, σ 2 ) ∈ R × (0, +∞)
Vector-valued variable Rd Gaussian Nd (m, K) (m, K) ∈ Rd × Rd×d

Table 2.1: Possible parametric models for various types of data.

Throughout the chapter, a parametric model

P = {Pθ , θ ∈ Θ}, Θ ⊂ Rq ,

is fixed. For all θ ∈ Θ, it is convenient to denote by Pθ the probability measure under which, for
all n ≥ 1, the random variables X1 , . . . , Xn ∈ X are iid according to Pθ . The notation Eθ , Varθ ,
etc. is defined accordingly. We shall also write

Xn = (X1 , . . . , Xn ) ∈ Xn

the sample of size n.

2.1.2 Estimator
Definition 2.1.3 (Statistic). A statistic Tn is a random variable which writes

Tn = tn (Xn ),

where the function tn on Xn does not depend on θ.

In other words, a statistic is a random variable whose value can be computed from the obser-
vation of Xn without knowing θ.
As an example with X = R, the empirical mean
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$
and the empirical variance
$$V_n = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2 - \bar{X}_n^2$$
are statistics. On the contrary, the quantity $\frac{1}{n} \sum_{i=1}^{n} (X_i - E_\theta[X_1])^2$ is not a statistic, because E_θ[X_1] generally depends on the value of the parameter θ.
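As a quick illustration, these statistics are computed in R as follows (the exponential sample below is a placeholder).

```r
# Minimal sketch: empirical mean and variance as statistics of a sample
x <- rexp(100, rate = 2)        # placeholder sample, here from E(2)
xbar <- mean(x)                 # empirical mean
vn <- mean((x - xbar)^2)        # empirical variance (1/n normalisation)
sn2 <- var(x)                   # R's var uses the unbiased 1/(n-1) normalisation
c(xbar = xbar, vn = vn, sn2 = sn2)
```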
In spite of the reduction to the parametric framework, it may happen that the task of estimating θ remains difficult, or that one is only interested in partial information on θ. For instance, in the Gaussian model, one may be willing to estimate only µ and not σ², even if the latter is not known; or in a more general context, it may be interesting to estimate directly some quantities such as E_θ[X_1] or P_θ(X_1 ≥ x) without resorting to the estimation of the complete value of θ. For these reasons, we fix a function g : Θ → R^d and turn our interest to the estimation of g(θ). We denote by ‖·‖ and ⟨·,·⟩ the Euclidean norm and the scalar product in R^d.
Definition 2.1.4 (Estimator of g(θ)). An estimator of g(θ) is a statistic Zn which takes its values
in g(Θ).
Exercise 2.1.5. In the Gaussian model {N(µ, σ 2 ), µ ∈ R, σ 2 > 0}, we let g(µ, σ 2 ) = µ. Which
of the random variables 0, µ, X1 , X n , are estimators of g(µ, σ 2 )? ◦
The exercise above shows that several estimators can exist for g(θ), some of which are intu-
itively better than others. The quality of an estimator can be measured with several criteria.
Definition 2.1.6 (Bias of an estimator). Let Zn be an estimator of g(θ) such that, for all θ ∈ Θ,
Eθ [kZn k] < +∞. The bias of Zn is the function b(Zn ; ·) : Θ → Rd defined by
b(Zn ; θ) = Eθ [Zn ] − g(θ).
If b(Z_n; ·) ≡ 0, the estimator Z_n is said to be unbiased.
Example 2.1.7 (Empirical mean and variance). The empirical mean X̄_n defined above is easily seen to be an unbiased estimator of E_θ[X_1]. On the contrary, it is an elementary exercise to show that, for all θ ∈ Θ,
$$E_\theta[V_n] = \left( 1 - \frac{1}{n} \right) \mathrm{Var}_\theta[X_1],$$
so that as soon as Var_θ[X_1] > 0, the empirical variance is biased. This motivates the introduction of the unbiased estimator of the variance
$$S_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2.$$

In the latter example, the biased estimator of the variance is easily corrected into an unbiased
estimator. However, we shall see in Exercise 2.A.3 that there are situations in which there is no
unbiased estimator.
Definition 2.1.8 (Mean Squared Error). Let Zn be an estimator of g(θ) such that, for all θ ∈ Θ,
Eθ [kZn k2 ] < +∞. The Mean Squared Error2 (MSE) of Zn is the function R(Zn ; ·) : Θ →
[0, +∞) defined by
R(Zn ; θ) = Eθ [kZn − g(θ)k2 ].
The MSE naturally measures a certain distance between Zn and g(θ), and therefore is an
indication of the quality of the estimator Zn . It has the general shape
R(Zn ; θ) = Eθ [ℓ(Zn ; g(θ))],
with ℓ : Rd × Rd → [0, +∞) given by ℓ(z, z ′ ) = kz − z ′ k2 . In this formulation, ℓ is called a loss
function. By taking loss functions that are not quadratic, one can construct different risk functions
R(Zn ; ·), which provide other measures of the accuracy of the estimator Zn .
The choice of a quadratic loss function has the advantage to entail a decomposition of the MSE
into a term of bias and a term of variance. We recall that the variance of a random vector Z ∈ Rd
is defined by Var(Z) = E[kZ − E[Z]k2 ].
2
Risque quadratique en français.

Proposition 2.1.9 (Bias and variance decomposition of the MSE). Let Zn be an estimator of g(θ)
such that, for all θ ∈ Θ, Eθ [kZn k2 ] < +∞. For all θ ∈ Θ,

R(Zn ; θ) = kb(Zn ; θ)k2 + Varθ [Zn ].

Proof. Starting from the definition of the MSE, we write
$$R(Z_n; \theta) = E_\theta\left[ \|(Z_n - E_\theta[Z_n]) - (g(\theta) - E_\theta[Z_n])\|^2 \right] = E_\theta\left[ \|Z_n - E_\theta[Z_n]\|^2 - 2\langle Z_n - E_\theta[Z_n], g(\theta) - E_\theta[Z_n] \rangle + \|g(\theta) - E_\theta[Z_n]\|^2 \right] = E_\theta\left[ \|Z_n - E_\theta[Z_n]\|^2 \right] + \|g(\theta) - E_\theta[Z_n]\|^2,$$
since the cross term has zero expectation, which leads to the expected decomposition.
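This decomposition can be checked numerically; the following R sketch estimates by Monte Carlo simulation the bias, variance and MSE of the (biased) empirical variance V_n in a Gaussian model, with placeholder parameter values.

```r
# Minimal sketch: Monte Carlo check of MSE = bias^2 + variance for V_n
set.seed(1)
n <- 20; mu <- 0; sigma2 <- 4; M <- 10000
vn <- replicate(M, { x <- rnorm(n, mu, sqrt(sigma2)); mean((x - mean(x))^2) })
bias <- mean(vn) - sigma2            # approximately -sigma2 / n
variance <- var(vn)
mse <- mean((vn - sigma2)^2)
c(bias^2 + variance, mse)            # the two values should be close
```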

Exercise 2.1.10 (Comparison of two estimators). Assume that X = R and, for all θ ∈ Θ, Eθ [|X1 |] <
+∞. Let g(θ) = Eθ [X1 ]. Compute the MSE of the estimators X1 and X n of g(θ). ◦

In the previous exercise, X n has a lower MSE than X1 , uniformly with respect to the parameter
θ. We now present an example of a family of estimators for which it is not possible to minimise
the MSE uniformly over the parameter.

Exercise 2.1.11 (On the bias-variance tradeoff). In the Bernoulli model {B(p), p ∈ [0, 1]}, we consider the 'natural estimator'
$$\hat{p}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$
of p, and introduce the family of estimators
$$\hat{p}_n^h = (1 - h)\hat{p}_n + \frac{h}{2}, \qquad h \in [0, 1].$$

1. Express p̂_n^h in the extremal cases h = 0 and h = 1.

2. Compute b(p̂_n^h; p) and Var_p[p̂_n^h].

3. For which value of h is b(p̂_n^h; p)² minimal? And the variance?

4. For p ∈ [0, 1], compute the value h_n^*(p) of h which minimises the MSE R(p̂_n^h; p). ◦

Exercise 2.1.11 shows that in general, the bias and the variance cannot be simultaneously
minimised, and a trade-off between the bias and the variance must be considered.

2.1.3 Asymptotic properties


Definition 2.1.12 (Consistency). The estimator Z_n of g(θ) is consistent if, for all θ ∈ Θ, Z_n converges to g(θ) in P_θ-probability, that is to say
$$\forall \epsilon > 0, \qquad \lim_{n \to +\infty} P_\theta\left( \|Z_n - g(\theta)\| \geq \epsilon \right) = 0.$$
It is strongly consistent if the convergence holds P_θ-almost surely, that is to say
$$P_\theta\left( \lim_{n \to +\infty} Z_n = g(\theta) \right) = 1.$$

Remark 2.1.13 (Consistency and MSE). If R(Zn ; θ) converges to 0 for all θ ∈ Θ, then Zn is
consistent. ◦
Exercise 2.1.14. If X = R and Eθ [X1 ] and Varθ [X1 ] exist, show that the empirical mean and
variance are strongly consistent. ◦
More generally, it is often the case that an estimator can be written as a continuous function
of the empirical mean of a sequence of iid random variables, in which case consistency properties
can be deduced from the Law of Large Numbers.
Definition 2.1.15 (Asymptotic normality). A consistent estimator Z_n of g(θ) is asymptotically normal if, for all θ ∈ Θ, there exists a symmetric and nonnegative matrix K(θ) ∈ R^{d×d} such that √n (Z_n − g(θ)) converges in distribution, under P_θ, to the d-dimensional Gaussian measure N_d(0, K(θ)). The matrix-valued function θ ↦ K(θ) is called the asymptotic covariance of Z_n.
When d = 1, K(θ) is a nonnegative scalar, called the asymptotic variance of Zn . In the sequel
of the course, the notion of asymptotic normality shall play a central role in the construction of
asymptotic confidence intervals and tests.
Asymptotic normality results are almost always obtained by applying the Central Limit The-
orem, possibly in its multidimensional version recalled in Appendix A.3, p. 137. Sometimes, it
needs to be combined with the Delta Method.
Theorem 2.1.16 (Delta Method). Let (ζ_n)_{n≥1} be a sequence of R^k-valued random variables and a ∈ R^k such that ζ_n → a in probability and √n (ζ_n − a) converges in distribution to some random vector G ∈ R^k. Let φ : R^k → R^d be a C^1 function. Then
$$\lim_{n \to +\infty} \sqrt{n}\, (\phi(\zeta_n) - \phi(a)) = \phi'(a) G, \quad \text{in distribution},$$
where φ'(a) is the d × k matrix with coefficients
$$\left( \phi'(a) \right)_{ij} = \frac{\partial \phi_i}{\partial x_j}(a), \qquad i = 1, \ldots, d, \quad j = 1, \ldots, k.$$

Proof. In order to avoid technical arguments, we assume that φ is C^2, with globally bounded second derivatives3.
Let i ∈ {1, ..., d}. For all n ≥ 1,
$$\sqrt{n}\, (\phi_i(\zeta_n) - \phi_i(a)) = \sqrt{n} \int_{t=0}^{1} \frac{d}{dt} \phi_i((1-t)a + t\zeta_n)\, dt = \left\langle \sqrt{n}\, (\zeta_n - a), \int_{t=0}^{1} \nabla \phi_i((1-t)a + t\zeta_n)\, dt \right\rangle.$$
Owing to our regularity assumption, there exists a constant C ≥ 0 such that
$$\left\| \int_{t=0}^{1} \nabla \phi_i((1-t)a + t\zeta_n)\, dt - \nabla \phi_i(a) \right\| \leq C \int_{t=0}^{1} \|(1-t)a + t\zeta_n - a\|\, dt = \frac{C}{2} \|\zeta_n - a\|.$$
As a consequence,
$$\lim_{n \to +\infty} \int_{t=0}^{1} \nabla \phi_i((1-t)a + t\zeta_n)\, dt = \nabla \phi_i(a), \quad \text{in probability},$$
and the conclusion follows from Slutsky's Theorem.


3
For a proof in a general framework, we refer to [6, Theorem 3.1, p. 26].

Examples of applications of this theorem are given in the next section. In particular, we shall generally use the combination of the Central Limit Theorem and the Delta Method under the following form: if X_1, ..., X_n ∈ R are iid with finite variance and φ is C^1, then
$$\sqrt{n}\, \left( \phi(\bar{X}_n) - \phi(E[X_1]) \right) \to N\left( 0, \phi'(E[X_1])^2\, \mathrm{Var}(X_1) \right), \quad \text{in distribution}.$$

2.1.4 The method of moments


The method of moments is a natural procedure to construct estimators. We start by detailing the example of the Exponential model {E(λ), λ > 0}, in which we look for an estimator of λ. By the Law of Large Numbers, for all λ > 0,
$$\lim_{n \to +\infty} \bar{X}_n = E_\lambda[X_1] = \frac{1}{\lambda}, \quad P_\lambda\text{-almost surely},$$
so that
$$\lim_{n \to +\infty} \frac{1}{\bar{X}_n} = \lambda, \quad P_\lambda\text{-almost surely},$$
by continuity of the function x ↦ 1/x. As a consequence, λ̃_n = 1/X̄_n is a strongly consistent estimator4 of the parameter λ. Combining the Central Limit Theorem with the Delta Method, it can furthermore be checked that this estimator is asymptotically normal. Indeed, the Central Limit Theorem first asserts that
$$\sqrt{n}\, \left( \bar{X}_n - \frac{1}{\lambda} \right) \to G \sim N(0, 1/\lambda^2) \quad \text{in distribution};$$
second, applying the Delta Method with ζ_n = X̄_n, φ(x) = 1/x and a = 1/λ yields
$$\sqrt{n}\, \left( \tilde{\lambda}_n - \lambda \right) = \sqrt{n}\, \left( \phi(\bar{X}_n) - \phi(1/\lambda) \right) \to \phi'(1/\lambda) G = -\lambda^2 G \quad \text{in distribution},$$
from which we conclude that the estimator λ̃_n is asymptotically normal, with asymptotic variance φ'(1/λ)²/λ² = λ².

The abstract generalisation of this procedure is called the method of moments. For the estimation of g(θ) ∈ R^d in the model {P_θ, θ ∈ Θ}, it consists in finding functions φ and m such that, for all θ ∈ Θ,
$$E_\theta[\varphi(X_1)] = m(g(\theta)).$$
Then the Law of Large Numbers allows to approximate m(g(θ)) with $\frac{1}{n} \sum_{i=1}^{n} \varphi(X_i)$, so that as soon as m has a continuous inverse function m^{-1},
$$Z_n = m^{-1}\left( \frac{1}{n} \sum_{i=1}^{n} \varphi(X_i) \right)$$
is a strongly consistent estimator of g(θ). Under further regularity assumptions on m^{-1}, the Central Limit Theorem and the Delta Method show that Z_n is asymptotically normal.
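This asymptotic normality is easy to illustrate by simulation; the R sketch below repeats the construction of λ̃_n = 1/X̄_n in the Exponential model and compares the empirical variance of √n(λ̃_n − λ) with the asymptotic variance λ² derived above (sample sizes and parameter values are placeholders).

```r
# Minimal sketch: moment estimator of lambda in the Exponential model
set.seed(2)
lambda <- 3; n <- 500; M <- 5000
lambda_tilde <- replicate(M, 1 / mean(rexp(n, rate = lambda)))
var(sqrt(n) * (lambda_tilde - lambda))   # should be close to lambda^2 = 9
hist(sqrt(n) * (lambda_tilde - lambda))  # approximately Gaussian
```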
Exercise 2.1.17. Write the general expression of the asymptotic variance of Zn in terms of Eθ [ϕ(X1 )]
and Varθ [ϕ(X1 )]. ◦
When constructing an estimator with the methods of moments, one usually tries the ‘simplest’
functions ϕ, for which the computation of Eθ [ϕ(X1 )] is possible or easy: typically, ϕ(x) = x or
ϕ(x) = |x|2 , |x|3 ... are natural candidates. The method can also be employed with functions of
the form ϕ(x) = 1{x≤x0 } for given x0 , see Exercise 2.A.8 for instance.
4
But it is an instructive exercise to check that it is biased.

2.2 Maximum Likelihood Estimation and efficiency


2.2.1 The Maximum Likelihood Estimator
In this section, we assume that:
• either X ⊂ Rl , l ≥ 1, and for all θ ∈ Θ, the probability measure Pθ possesses a density
p(x; θ) with respect to the Lebesgue measure on Rl ;

• or X is a countable space, and we define p(x; θ) = Pθ ({x}) = Pθ (X1 = x) for all x ∈ X.


Furthermore, for the sake of simplicity we take g(θ) = θ ∈ Rq .
Definition 2.2.1 (Likelihood of an observation). Let x_n = (x_1, ..., x_n) ∈ X^n be a possible value of the sample X_n = (X_1, ..., X_n). The likelihood5 of this realisation is the function L_n(x_n; ·) : Θ → [0, +∞) defined by
$$L_n(x_n; \theta) = \prod_{i=1}^{n} p(x_i; \theta).$$

The Maximum Likelihood Estimation relies on the principle of selecting the parameter which
makes the observed realisation of the sample the most likely.
Definition 2.2.2 (Maximum Likelihood Estimator). Assume that, for all xn = (x1 , . . . , xn ) ∈ Xn ,
the function θ 7→ Ln (xn ; θ) reaches a global maximum at θ = θn (xn ). The Maximum Likelihood
Estimator (MLE) of θ is the statistic defined by

θbn = θn (Xn ).

Remark 2.2.3 (On the notation). In this definition, the notation θn refers to the deterministic
function Xn → Θ, while θbn denotes the random variable obtained by applying the function θn to
the random vector Xn . ◦
When the function θ ↦ L_n(x_n; θ) is differentiable, θ_n(x_n) can be computed by looking for the points at which the derivative of L_n(x_n; θ) vanishes (without forgetting to check that these points actually correspond to a maximum!). In this perspective, it may be more convenient to take the derivative of the log-likelihood
$$\ell_n(x_n; \theta) = \log L_n(x_n; \theta),$$
rather than L_n(x_n; θ), because the product over i in the definition of the likelihood is turned into a sum. Since the logarithm is increasing, both approaches are equivalent.
Remark 2.2.4 (Notions of Z- and M-estimator). Moment estimators are defined as the solution to
an algebraic equation: such estimators are called Z-estimators (because they are Zeroes of a func-
tion). In contrast, the MLE is given as the solution to an optimisation problem: such estimators
are called M-estimators (because they are Maxima of a function). The Least Square Estimator in-
troduced in the context of linear regression detailed in Section 5.1 of Chapter 5 is another example
of an M-estimator.
In the academic examples of the present notes, the resolution of algebraic equations or op-
timisation problem can generally be made analytically (to the notable exception of the logistic
regression described in Chapter 5). This is however not possible in general, and computing Z- or
M-estimators often requires to employ a numerical method. ◦
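When no closed-form solution is available, the log-likelihood can be maximised numerically; the following R sketch does so for a Gamma model Γ(a, λ), whose MLE has no simple expression. The data, the starting point and the parametrisation are placeholders, and since optim performs minimisation, we minimise the negative log-likelihood.

```r
# Minimal sketch: numerical MLE in the Gamma model via optim
set.seed(3)
x <- rgamma(200, shape = 2, rate = 1.5)                    # placeholder sample
negloglik <- function(par) -sum(dgamma(x, shape = par[1], rate = par[2], log = TRUE))
fit <- optim(par = c(1, 1), fn = negloglik,
             method = "L-BFGS-B", lower = c(1e-6, 1e-6))   # numerical minimisation
fit$par                                                    # estimates of (a, lambda)
```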
5
Vraisemblance en français.

Example 2.2.5 (The Exponential model). The likelihood of a realisation x_n = (x_1, ..., x_n) ∈ (0, +∞)^n in the Exponential model {E(λ), λ > 0} writes
$$L_n(x_n; \lambda) = \prod_{i=1}^{n} \lambda \exp(-\lambda x_i) = \lambda^n \exp\left( -\lambda \sum_{i=1}^{n} x_i \right).$$
The log-likelihood is
$$\ell_n(x_n; \lambda) = n \log \lambda - \lambda \sum_{i=1}^{n} x_i,$$
it is C^1 on (0, +∞) and
$$\frac{d}{d\lambda} \ell_n(x_n; \lambda) = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i$$
vanishes if and only if λ is equal to
$$\lambda_n(x_n) = \frac{n}{\sum_{i=1}^{n} x_i}.$$
Since
$$\frac{d}{d\lambda} \ell_n(x_n; \lambda) > 0 \ \text{if } \lambda < \lambda_n(x_n), \qquad \frac{d}{d\lambda} \ell_n(x_n; \lambda) < 0 \ \text{if } \lambda > \lambda_n(x_n),$$
the log-likelihood, and therefore the likelihood, actually attains its maximum at λ_n(x_n). As a consequence, the MLE of λ is
$$\hat{\lambda}_n = \lambda_n(X_n) = \frac{1}{\bar{X}_n}.$$
The MLE coincides with the moment estimator obtained in Section 2.1.4. We shall see in Example 2.2.9 that this is not always the case.
Exercise 2.2.6 (The Bernoulli model). Compute the MLE of p in the Bernoulli model {B(p), p ∈
[0, 1]}. Show that it is strongly consistent and asymptotically normal. ◦
Example 2.2.7 (The Gaussian model). The likelihood of a realisation x_n = (x_1, ..., x_n) ∈ R^n in the Gaussian model {N(µ, σ²), µ ∈ R, σ² > 0} writes
$$L_n(x_n; \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right) = (2\pi\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \right).$$
The log-likelihood is
$$\ell_n(x_n; \mu, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2,$$
it is C^1 on R × (0, +∞) and satisfies
$$\frac{\partial}{\partial \mu} \ell_n(x_n; \mu, \sigma^2) = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu), \qquad \frac{\partial}{\partial \sigma^2} \ell_n(x_n; \mu, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (x_i - \mu)^2.$$
Both derivatives vanish if and only if
$$\mu = \mu_n(x_n) = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \sigma^2 = \sigma_n^2(x_n) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_n(x_n))^2,$$
and the fact that the log-likelihood actually attains its maximum at the point (µ_n(x_n), σ_n²(x_n)) can be checked by studying the sign of the Hessian matrix of ℓ_n(x_n; µ, σ²) at this point, which is left as an exercise to the reader. As a conclusion, the MLE of (µ, σ²) is given by
$$\hat{\mu}_n = \mu_n(X_n) = \bar{X}_n, \qquad \hat{\sigma}_n^2 = \sigma_n^2(X_n) = V_n,$$
with the notation of p. 28.

Remark 2.2.8. By Proposition A.4.3, p. 139, the estimators µ̂_n and σ̂_n² are independent. ◦

Example 2.2.9 (The Uniform model). We consider the set of uniform distributions {U([0, θ]), θ > 0}. For a given θ > 0, the random variables X_1, ..., X_n take their values in [0, θ], P_θ-almost surely; but since θ can a priori take any positive value, we have to work with the state space X = [0, +∞). For any x_n = (x_1, ..., x_n) ∈ [0, +∞)^n, the likelihood of the realisation x_n writes
$$L_n(x_n; \theta) = \prod_{i=1}^{n} \frac{1}{\theta} \mathbf{1}_{\{x_i \leq \theta\}} = \theta^{-n} \mathbf{1}_{\{\max_{1 \leq i \leq n} x_i \leq \theta\}}.$$
As a function of θ > 0, the log-likelihood is not differentiable at the point max_{1≤i≤n} x_i, therefore the method of the previous examples cannot be applied. On the other hand, the study of the variations of the function θ ↦ L_n(x_n; θ) is straightforward (see Figure 2.1) and shows that the latter attains its maximum for θ = θ_n(x_n) = max_{1≤i≤n} x_i. As a consequence, the MLE is given by
$$\hat{\theta}_n = \max_{1 \leq i \leq n} X_i.$$

[Figure: the function θ ↦ L_n(x_n; θ), which vanishes for θ < max_{1≤i≤n} x_i and then decreases as θ^{-n}.]

Figure 2.1: The likelihood of the Uniform model.

As a comparison, using the remark that E_θ[X_1] = θ/2, we immediately obtain the moment estimator θ̃_n = 2X̄_n, which is very different from the MLE.
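The contrast between the two estimators is easy to visualise by simulation; in the R sketch below (placeholder values of θ and n), the MLE always underestimates θ, while the moment estimator fluctuates around it.

```r
# Minimal sketch: MLE versus moment estimator in the Uniform model
set.seed(4)
theta <- 2; n <- 50; M <- 5000
x <- matrix(runif(n * M, min = 0, max = theta), nrow = n)
theta_mle <- apply(x, 2, max)        # maximum of each sample
theta_mom <- 2 * colMeans(x)         # moment estimator 2 * Xbar_n
c(mean(theta_mle), mean(theta_mom))  # the MLE is biased downwards, the moment estimator is unbiased
c(var(theta_mle), var(theta_mom))    # but the MLE has a much smaller variance
```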

Exercise 2.2.10. Consider the MLE θbn in the Uniform model of Example 2.2.9.

1. Show that θbn is consistent.

2. Using the remark that the sequence (θbn )n≥1 is monotonic, show that consistency holds in
the strong sense.

3. For all q ≥ 0, compute Pθ (n(θ − θbn ) ≥ q).



4. Deduce that n(θ − θbn ) converges in distribution6 to E(1/θ).

5. Is θbn asymptotically normal? ◦

In the Uniform model of Example 2.2.9, the support of the law of X1 depends on the param-
eter θ, which makes the likelihood not differentiable with respect to θ. Trying to differentiate
the likelihood, or the log-likelihood, with respect to θ, then leads to wrong results. Models for
which the support depends on the parameter generally belong to the class of nonregular models,
as introduced in the next paragraph.

2.2.2 Regular models and Fisher information


We recall that L_1(x_1; θ) = p(x_1; θ) (respectively, ℓ_1(x_1; θ) = log p(x_1; θ)) denotes the likelihood (respectively, the log-likelihood) of a single observation x_1. We denote by (θ_1, ..., θ_q) the coordinates of the parameter θ ∈ Θ ⊂ R^q, and for any smooth function φ : Θ → R, we shall write
$$\nabla_\theta \varphi(\theta) = \left( \frac{\partial}{\partial \theta_1} \varphi(\theta), \ldots, \frac{\partial}{\partial \theta_q} \varphi(\theta) \right) \in \mathbb{R}^q.$$

Definition 2.2.11 (Regular model). A parametric model is regular if:

(i) Θ is an open subset of Rq ,

(ii) for all x1 ∈ X, for all θ ∈ Θ, L1 (x1 ; θ) > 0,

(iii) for all x1 ∈ X, the function θ 7→ ℓ1 (x1 ; θ) is C 1 on Θ,

(iv) for all θ ∈ Θ, Eθ [k∇θ ℓ1 (X1 ; θ)k2 ] < +∞.

Example 2.2.12 (Regular models). The Bernoulli, Exponential and Gaussian models are regular,
but the Uniform model of Example 2.2.9 is not (why?).

Definition 2.2.13 (Score). The score of a regular model is the random vector
$$\nabla_\theta \ell_1(X_1; \theta) \in \mathbb{R}^q,$$
with coordinates
$$\frac{\partial}{\partial \theta_i} \ell_1(X_1; \theta) = \frac{1}{p(X_1; \theta)} \frac{\partial}{\partial \theta_i} p(X_1; \theta), \qquad i \in \{1, \ldots, q\}.$$

In the case where P_θ possesses a density p(x; θ) on R^l, assuming the validity of the computation
$$\frac{\partial}{\partial \theta_i} \underbrace{\int_{x \in \mathbb{R}^l} p(x; \theta)\, dx}_{=1} = \int_{x \in \mathbb{R}^l} \frac{\partial}{\partial \theta_i} p(x; \theta)\, dx = 0 \qquad (*)$$
provides the identity
$$E_\theta[\nabla_\theta \ell_1(X_1; \theta)] = 0, \qquad (**)$$
which we will admit in general.
6
You may admit that if a sequence of random variables Xn ∈ R with cumulative distribution function Fn is such
that Fn (x) → F (x) for all x ∈ R at which F is continuous, then Xn converges in distribution to the random variable
X ∈ R with cumulative distribution function F .

Definition 2.2.14 (Fisher information). For a regular model, the Fisher information I(θ) is the covariance matrix of the score, with coefficients
$$I_{ij}(\theta) = E_\theta\left[ \frac{\partial}{\partial \theta_i} \ell_1(X_1; \theta)\, \frac{\partial}{\partial \theta_j} \ell_1(X_1; \theta) \right].$$

Exercise 2.2.15. Taking the derivative with respect to θ_j of Equation (∗), show that (whenever the computation is justified) the coefficients of the Fisher information also write
$$I_{ij}(\theta) = -E_\theta\left[ \frac{\partial^2}{\partial \theta_i \partial \theta_j} \ell_1(X_1; \theta) \right]. \qquad \circ$$

Example 2.2.16. In the Exponential model, we have

∀x1 > 0, ℓ1 (x1 ; λ) = log λ − λx1 ,

so that the score writes λ−1 − X1 . As a consequence, I(λ) = 1/λ2 .

2.2.3 Efficient estimator

Throughout this paragraph, we take g : Θ → R. In this context, Exercise 2.1.11 shows that, in general, an estimator cannot minimise the MSE uniformly in θ, because of the bias-variance tradeoff. However, if one restricts oneself to the class of unbiased estimators, then the search for an estimator with minimal MSE is reduced to a problem of variance minimisation. Exercise 2.1.10 provides an example of two unbiased estimators, one of which has a lower MSE uniformly in θ. For regular models, the Fisher information provides an interesting lower bound on the variance of an unbiased estimator.

Theorem 2.2.17 (Fréchet–Darmois–Cramér–Rao (FDCR) bound). For a regular model such that I(θ) is positive definite for all θ ∈ Θ, let g : Θ → R be smooth, and let Z_n = z_n(X_n) be an unbiased estimator of g(θ). For all θ ∈ Θ,
$$\mathrm{Var}_\theta[Z_n] \geq \frac{1}{n} \langle \nabla g(\theta), I^{-1}(\theta) \nabla g(\theta) \rangle,$$
where I^{-1}(θ) is the inverse of the Fisher information.

Proof. We assume that we are in the case where X is a subset of R^l and for all θ ∈ Θ, the probability measure P_θ possesses a density with respect to the Lebesgue measure on R^l. The adaptation of the proof to the case of discrete random variables is straightforward.
Since Z_n is unbiased,
$$g(\theta) = E_\theta[Z_n] = \int_{x_n \in \mathsf{X}^n} z_n(x_n) L_n(x_n; \theta)\, dx_n,$$
so that
$$\nabla g(\theta) = \int_{x_n \in \mathsf{X}^n} z_n(x_n) \nabla_\theta L_n(x_n; \theta)\, dx_n = \int_{x_n \in \mathsf{X}^n} z_n(x_n) \nabla_\theta\left( \prod_{i=1}^{n} p(x_i; \theta) \right) dx_n = \int_{x_n \in \mathsf{X}^n} z_n(x_n) \sum_{i=1}^{n} \left( \nabla_\theta p(x_i; \theta) \prod_{j \neq i} p(x_j; \theta) \right) dx_n = \int_{x_n \in \mathsf{X}^n} z_n(x_n) \left( \sum_{i=1}^{n} \nabla_\theta \ell_1(x_i; \theta) \right) \prod_{j=1}^{n} p(x_j; \theta)\, dx_n = E_\theta\left[ Z_n \sum_{i=1}^{n} \nabla_\theta \ell_1(X_i; \theta) \right].$$
On the other hand, Equation (∗∗) on p. 36 implies that
$$0 = E_\theta\left[ g(\theta) \sum_{i=1}^{n} \nabla_\theta \ell_1(X_i; \theta) \right],$$
so that
$$\nabla g(\theta) = E_\theta\left[ (Z_n - g(\theta)) \sum_{i=1}^{n} \nabla_\theta \ell_1(X_i; \theta) \right].$$
Thus,
$$\langle \nabla g(\theta), I^{-1}(\theta) \nabla g(\theta) \rangle = \left\langle E_\theta\left[ (Z_n - g(\theta)) \sum_{i=1}^{n} \nabla_\theta \ell_1(X_i; \theta) \right], I^{-1}(\theta) \nabla g(\theta) \right\rangle = E_\theta\left[ (Z_n - g(\theta)) \left\langle \sum_{i=1}^{n} \nabla_\theta \ell_1(X_i; \theta), I^{-1}(\theta) \nabla g(\theta) \right\rangle \right],$$
and by the Cauchy–Schwarz inequality, we get
$$\langle \nabla g(\theta), I^{-1}(\theta) \nabla g(\theta) \rangle^2 \leq E_\theta\left[ (Z_n - g(\theta))^2 \right] E_\theta\left[ \left\langle \sum_{i=1}^{n} \nabla_\theta \ell_1(X_i; \theta), I^{-1}(\theta) \nabla g(\theta) \right\rangle^2 \right].$$
Since Z_n is unbiased,
$$E_\theta\left[ (Z_n - g(\theta))^2 \right] = \mathrm{Var}_\theta[Z_n],$$
while using Equation (∗∗) on p. 36 again together with the Definition 2.2.14 of I(θ) shows that
$$E_\theta\left[ \left\langle \sum_{i=1}^{n} \nabla_\theta \ell_1(X_i; \theta), I^{-1}(\theta) \nabla g(\theta) \right\rangle^2 \right] = \mathrm{Var}_\theta\left[ \left\langle \sum_{i=1}^{n} \nabla_\theta \ell_1(X_i; \theta), I^{-1}(\theta) \nabla g(\theta) \right\rangle \right] = n\, \mathrm{Var}_\theta\left[ \left\langle \nabla_\theta \ell_1(X_1; \theta), I^{-1}(\theta) \nabla g(\theta) \right\rangle \right] = n\, \left\langle I^{-1}(\theta) \nabla g(\theta), I(\theta) I^{-1}(\theta) \nabla g(\theta) \right\rangle = n\, \left\langle I^{-1}(\theta) \nabla g(\theta), \nabla g(\theta) \right\rangle,$$
which completes the proof.



Definition 2.2.18 (Efficient estimator). An unbiased estimator Z_n of g(θ) such that Var_θ[Z_n] = ⟨∇g(θ), I^{-1}(θ)∇g(θ)⟩/n is called efficient7.

Theorem 2.2.17 shows that efficient estimators have the smallest possible MSE among the
class of unbiased estimators.

Exercise 2.2.19. Check that in the Bernoulli model, the estimator X n of p is efficient. ◦

There is also an asymptotic notion of efficiency.

Definition 2.2.20 (Asymptotically efficient estimator). A consistent estimator Z_n of g(θ) such that √n (Z_n − g(θ)) converges in distribution to a random variable with variance ⟨∇g(θ), I^{-1}(θ)∇g(θ)⟩ is called asymptotically efficient.

2.2.4 Asymptotic efficiency of the MLE


In regular models, the following statement is generally true:

The MLE θbn is consistent and asymptotically normal, with asymptotic covariance matrix I −1 (θ).

This fact is most easily observed case-by-case by:

(i) computing the asymptotic covariance K(θ) of θbn (often using the Central Limit Theorem
and the Delta Method) on the one hand;

(ii) computing the Fisher information I(θ) on the other hand;

(iii) finally checking that K(θ) = I −1 (θ).

A heuristic justification of why this assertion holds in general is sketched in Remark 2.2.23 below.

Exercise 2.2.21. Compute I(µ, σ 2 ) in the Gaussian model. Compare the results with Exam-
ple 2.2.7. ◦

We deduce the following striking statement.

Proposition 2.2.22 (Asymptotic efficiency of the MLE). Assume that θbn is consistent and asymp-
totically normal, with asymptotic covariance matrix I −1 (θ). Then for any C 1 function g : Θ → R,
the estimator Zn = g(θbn ) of g(θ) is asymptotically efficient.

The proof is a straightforward application of the Delta Method and is left as an exercise.
This proposition emphasises the interest of Maximum Likelihood Estimation, especially in regular
models.

Remark 2.2.23. The fact that, in general, θ̂_n is consistent and asymptotically normal with asymptotic covariance matrix I^{-1}(θ), can be derived thanks to the following heuristic computation8. For the sake of simplicity, we assume that q = 1 so that Θ ⊂ R.
By the definition of θ_n(x_n) and the regularity assumption on ℓ_1(x_1; ·), we have
$$\frac{d}{d\theta} \ell_n(x_n; \theta_n(x_n)) = 0.$$
7
Efficace en français.
8
We refer to [6, Theorem 5.39, p. 65] for a complete proof.

Using the Taylor approximation
$$\frac{d}{d\theta} \ell_n(x_n; \theta_n(x_n)) \simeq \frac{d}{d\theta} \ell_n(x_n; \theta) + \frac{d^2}{d\theta^2} \ell_n(x_n; \theta)\, (\theta_n(x_n) - \theta),$$
we thus deduce that
$$\sqrt{n}\, (\hat{\theta}_n - \theta) \simeq -\sqrt{n}\, \frac{\dfrac{d}{d\theta} \ell_n(X_n; \theta)}{\dfrac{d^2}{d\theta^2} \ell_n(X_n; \theta)} = -\frac{\dfrac{1}{\sqrt{n}} \dfrac{d}{d\theta} \ell_n(X_n; \theta)}{\dfrac{1}{n} \dfrac{d^2}{d\theta^2} \ell_n(X_n; \theta)}.$$
On the one hand, Equation (∗∗) p. 36 shows that
$$\frac{1}{\sqrt{n}} \frac{d}{d\theta} \ell_n(X_n; \theta) = \sqrt{n} \left( \frac{1}{n} \sum_{i=1}^{n} \frac{d}{d\theta} \ell_1(X_i; \theta) - E_\theta\left[ \frac{d}{d\theta} \ell_1(X_1; \theta) \right] \right),$$
which by the Central Limit Theorem converges in distribution, under P_θ, to N(0, I(θ)). On the other hand, by the Law of Large Numbers,
$$\frac{1}{n} \frac{d^2}{d\theta^2} \ell_n(X_n; \theta) = \frac{1}{n} \sum_{i=1}^{n} \frac{d^2}{d\theta^2} \ell_1(X_i; \theta)$$
converges P_θ-almost surely to
$$E_\theta\left[ \frac{d^2}{d\theta^2} \ell_1(X_1; \theta) \right] = -I(\theta).$$
By Slutsky's Theorem, we conclude that √n(θ̂_n − θ) converges in distribution to N(0, 1/I(θ)). ◦

2.3 * Sufficient statistics and the Rao–Blackwell Theorem


2.3.1 Conditional expectation
The contents of this section rely on the notion of conditional expectation, which is introduced
and studied in detail in the course Stochastic Processes and their Applications9 . We recall its
definition. Let (Ω, F, P) be a probability space and S be a random variable taking its values in
some measurable space S.

Definition 2.3.1 (Conditional expectation). For any random variable X ∈ R such that E[|X|] <
+∞, there exists an almost surely unique random variable Z ∈ R such that:

(i) E[|Z|] < +∞;

(ii) there exists a measurable function ϕ : S → R such that Z = ϕ(S);

(iii) for any bounded and measurable function ψ : S → R, E[ψ(S)X] = E[ψ(S)Z].

The random variable Z is called the conditional expectation of X given S and it is denoted by
E[X|S].

It is a nice exercise to check that the conditional expectation enjoys the following properties.
9
http://cermics.enpc.fr/~delmas/Enseig/proba2.html

Proposition 2.3.2 (Properties of conditional expectation). (i) For any measurable function ρ :
S → R such that E[|ρ(S)X|] < +∞, E[ρ(S)X|S] = ρ(S)E[X|S].

(ii) Total expectation formula: E[E[X|S]] = E[X].

(iii) Linearity: E[λX + µY |S] = λE[X|S] + µE[Y |S].

(iv) Positivity: if X ≥ 0 then E[X|S] ≥ 0.

(v) Jensen’s inequality: for any convex function f : R → R, f (E[X|S]) ≤ E[f (X)|S].

Exercise 2.3.3 (Further properties). 1. Show that if X is independent from S, E[X|S] = E[X].

2. Let ρ : S → R be such that E[|ρ(S)|] < +∞. Compute E[ρ(S)|S]. ◦

In the particular case where S is a finite or countably infinite space, the function ϕ appearing
in Definition 2.3.1 has an elementary expression.

Proposition 2.3.4 (Conditional expectation in the discrete case). Assume that S is a finite or countably infinite space. Let X ∈ R be a random variable such that E[|X|] < +∞, and let φ : S → R be defined by
$$\varphi(s) = \begin{cases} \dfrac{E[X \mathbf{1}_{\{S = s\}}]}{P(S = s)} & \text{if } P(S = s) > 0, \\ 0 & \text{otherwise.} \end{cases}$$
Then E[X|S] = φ(S), almost surely.

Proof. Let Z = φ(S). By construction, the random variable Z satisfies the point (ii) of Definition 2.3.1. Besides,
$$E[|Z|] = E[|\varphi(S)|] = \sum_{s \in \mathsf{S}} |\varphi(s)|\, P(S = s) = \sum_{s \in \mathsf{S}} \left| E[X \mathbf{1}_{\{S = s\}}] \right| \leq \sum_{s \in \mathsf{S}} E[|X| \mathbf{1}_{\{S = s\}}] = E\left[ |X| \sum_{s \in \mathsf{S}} \mathbf{1}_{\{S = s\}} \right] = E[|X|] < +\infty,$$
which proves the point (i) of Definition 2.3.1. Finally, we let ψ : S → R be a bounded and measurable function, and compute
$$E[\psi(S) Z] = E[\psi(S) \varphi(S)] = \sum_{s \in \mathsf{S}} \psi(s) \varphi(s)\, P(S = s) = \sum_{s \in \mathsf{S}} \psi(s) E[X \mathbf{1}_{\{S = s\}}] = E[X \psi(S)],$$
which completes the proof.



2.3.2 Sufficient statistic and Halmos–Savage Theorem


We come back to the framework of statistical inference and fix a parametric model P = {Pθ , θ ∈
Θ}. Let Sn = sn (Xn ) be a statistic taking its values in a space denoted by S. Then, for any
(measurable and) bounded function f : Xn → R and for all θ ∈ Θ, by Definition 2.3.1 there exists
a function ϕθ : S → R such that

Eθ [f (Xn )|Sn ] = ϕθ (Sn ), Pθ -almost surely.

Definition 2.3.5 (Sufficient statistic). The statistic Sn is called sufficient10 for θ if, for any (mea-
surable and) bounded function f : Xn → R, the function ϕθ introduced above does not depend on
θ; in other words, if the conditional distribution of the sample Xn given Sn does not depend on
the parameter.

Intuitively, this definition means that the knowledge of the whole sample Xn does not bring
more information on the parameter θ than the mere knowledge of the value of Sn . Usually, Sn has
a low dimension, and therefore is easy to store, while the size of the sample Xn basically depends
on n and may be huge. In other words, S_n provides an exhaustive summary of the data, as far as the estimation of θ is concerned.
Example 2.3.6. In the Bernoulli model, we show that S_n = \sum_{i=1}^{n} X_i is a sufficient statistic for the parameter p. This statistic takes its values in the finite set S = {0, ..., n}. Fixing f : {0, 1}^n → R and s ∈ S, we first use Proposition 2.3.4 to write
$$E_p[f(X_n) | S_n] = \varphi_p(S_n),$$
where, for all s ∈ S,
$$\varphi_p(s) = \frac{E_p\left[ f(X_n) \mathbf{1}_{\{S_n = s\}} \right]}{P_p(S_n = s)}.$$
On the one hand, S_n = X_1 + ... + X_n ∼ B(n, p) under P_p, so that
$$P_p(S_n = s) = \binom{n}{s} p^s (1 - p)^{n-s}.$$
On the other hand,
$$E_p\left[ f(X_n) \mathbf{1}_{\{S_n = s\}} \right] = \sum_{x_n \in \mathsf{X}^n} f(x_n) \mathbf{1}_{\{s_n(x_n) = s\}} P_p(X_n = x_n) = \sum_{x_n \in \mathsf{X}^n} f(x_n) \mathbf{1}_{\{x_1 + \cdots + x_n = s\}} \prod_{i=1}^{n} p^{x_i} (1 - p)^{1 - x_i} = \sum_{x_n \in \mathsf{X}^n} f(x_n) \mathbf{1}_{\{x_1 + \cdots + x_n = s\}} p^s (1 - p)^{n-s}.$$
As a consequence,
$$\varphi_p(s) = \frac{\displaystyle \sum_{x_n \in \mathsf{X}^n} f(x_n) \mathbf{1}_{\{x_1 + \cdots + x_n = s\}}}{\dbinom{n}{s}},$$
which does not depend on p.
10
Statistique exhaustive en français.


Remark 2.3.7 (On the conditional distribution of X_n given S_n). In the example above, $\binom{n}{s}$ is exactly the number of elements x_n ∈ {0, 1}^n such that x_1 + ... + x_n = s, so that the conditional distribution of X_n given S_n can be interpreted as the uniform distribution in the set of possible realisations of the sample which are compatible with the prescribed value of S_n. ◦

In general, the computation of the conditional distribution of a random variable is a nice exer-
cise of probability theory, but may happen to be tedious. The next result allows to find sufficient
statistics without resorting to such computations.

Theorem 2.3.8 (Halmos–Savage factorisation Theorem). The statistic Sn = sn (Xn ) is sufficient


for θ if and only if there exist functions φ : S × Θ → R, ψ : Xn → R such that

Ln (xn ; θ) = φ(sn (xn ); θ)ψ(xn ).

Proof. We detail the proof in the case where X is a finite or countably infinite set. Then so is S = s_n(X^n), and the conditional expectation with respect to S_n is described by Proposition 2.3.4. The case where X ⊂ R^l follows from similar arguments, expressed in an appropriate formalism.
Necessary condition: let us assume that S_n is sufficient. Then
$$L_n(x_n; \theta) = \prod_{i=1}^{n} p(x_i; \theta) = P_\theta(X_n = x_n) = P_\theta(X_n = x_n, S_n = s_n(x_n)) = P_\theta(X_n = x_n | S_n = s_n(x_n))\, P_\theta(S_n = s_n(x_n)).$$
Since S_n is sufficient,
$$P_\theta(X_n = x_n | S_n = s_n(x_n)) = E_\theta\left[ \mathbf{1}_{\{X_n = x_n\}} | S_n = s_n(x_n) \right] = \psi(x_n)$$
does not depend on θ, while
$$P_\theta(S_n = s_n(x_n)) = \varphi(s_n(x_n); \theta)$$
only depends on x_n through s_n(x_n).


Sufficient condition: let us assume that L_n(x_n; θ) = φ(s_n(x_n); θ)ψ(x_n). For all s ∈ S,
$$P_\theta(S_n = s) = \sum_{\substack{x_n \in \mathsf{X}^n \\ s_n(x_n) = s}} P_\theta(X_n = x_n) = \sum_{\substack{x_n \in \mathsf{X}^n \\ s_n(x_n) = s}} L_n(x_n; \theta) = \varphi(s; \theta) \sum_{\substack{x_n \in \mathsf{X}^n \\ s_n(x_n) = s}} \psi(x_n).$$
On the other hand, for any bounded function f : X^n → R,
$$E_\theta\left[ f(X_n) \mathbf{1}_{\{s_n(X_n) = s\}} \right] = \sum_{x_n \in \mathsf{X}^n} f(x_n) \mathbf{1}_{\{s_n(x_n) = s\}} P_\theta(X_n = x_n) = \sum_{x_n \in \mathsf{X}^n} f(x_n) \mathbf{1}_{\{s_n(x_n) = s\}} L_n(x_n; \theta) = \varphi(s; \theta) \sum_{\substack{x_n \in \mathsf{X}^n \\ s_n(x_n) = s}} f(x_n) \psi(x_n),$$
so that
$$E_\theta[f(X_n) | S_n = s] = \frac{\displaystyle \varphi(s; \theta) \sum_{\substack{x_n \in \mathsf{X}^n \\ s_n(x_n) = s}} f(x_n) \psi(x_n)}{\displaystyle \varphi(s; \theta) \sum_{\substack{x_n \in \mathsf{X}^n \\ s_n(x_n) = s}} \psi(x_n)} = \frac{\displaystyle \sum_{\substack{x_n \in \mathsf{X}^n \\ s_n(x_n) = s}} f(x_n) \psi(x_n)}{\displaystyle \sum_{\substack{x_n \in \mathsf{X}^n \\ s_n(x_n) = s}} \psi(x_n)},$$
which does not depend on θ.

Exercise 2.3.9. Show that, in the Exponential model, X n is a sufficient statistic for λ. ◦
Exercise 2.3.10. Show that, in the Gaussian model, (X n , Vn ) is a sufficient statistic for (µ, σ 2 ). ◦
Remark 2.3.11. If the likelihood possesses the factorisation of Theorem 2.3.8, then the functions
θ 7→ Ln (xn ; θ) and θ 7→ φ(sn (xn ); θ) reach their maximum at the same points. As a consequence,
the MLE necessarily depends on xn only through sufficient statistics. ◦

2.3.3 The Rao–Blackwell Theorem


Let S_n be a sufficient statistic for θ, and let Z_n = z_n(X_n) be an estimator of g(θ), where g : Θ → R. Assume that, for any θ ∈ Θ, E_θ[|Z_n|] < +∞, and define
$$Z_n^{S_n} = E_\theta[Z_n | S_n].$$
Since S_n is sufficient for θ, the random variable Z_n^{S_n} is a statistic, and under mild assumptions on g(Θ) (for instance, if g(Θ) is an interval), Z_n^{S_n} takes its values in g(Θ), so that Z_n^{S_n} is also an estimator of g(θ).
Proposition 2.3.12 (Rao–Blackwell Theorem). The estimators Z_n and Z_n^{S_n} satisfy
$$E_\theta[Z_n^{S_n}] = E_\theta[Z_n], \qquad \mathrm{Var}_\theta(Z_n^{S_n}) \leq \mathrm{Var}_\theta(Z_n).$$
As a consequence,
$$R(Z_n^{S_n}; \theta) \leq R(Z_n; \theta).$$
Proof. By (ii) in Proposition 2.3.2,
$$E_\theta[Z_n^{S_n}] = E_\theta[E_\theta[Z_n | S_n]] = E_\theta[Z_n],$$
and by (v) in Proposition 2.3.2,
$$E_\theta\left[ (Z_n^{S_n})^2 \right] = E_\theta\left[ (E_\theta[Z_n | S_n])^2 \right] \leq E_\theta\left[ E_\theta[Z_n^2 | S_n] \right] = E_\theta\left[ Z_n^2 \right],$$
so that
$$\mathrm{Var}_\theta(Z_n^{S_n}) = E_\theta\left[ (Z_n^{S_n})^2 \right] - E_\theta[Z_n^{S_n}]^2 \leq E_\theta\left[ Z_n^2 \right] - E_\theta[Z_n]^2 = \mathrm{Var}_\theta(Z_n).$$

As a practical consequence of Proposition 2.3.12, it is always a good idea to take the condi-
tional expectation of an estimator given a sufficient statistic: this procedure, which improves the
MSE, is sometimes called Rao–Blackwellisation. Of course, if Zn already depends on the sam-
ple through a sufficient statistic, which by Remark 2.3.11 is in particular the case for the MLE,
then this procedure has no effect: indeed, in this case Zn writes ρ(Sn ) for some function ρ, and
therefore by (i) in Proposition 2.3.2, ZnSn = Eθ [ρ(Sn )|Sn ] = ρ(Sn ) = Zn .

Example 2.3.13. In the Bernoulli model, X_1 is an estimator of p, and it was proved in Example 2.3.6 that S_n = X_1 + ... + X_n is a sufficient statistic. The Rao–Blackwellisation of X_1 is then the statistic E_p[X_1 | S_n], which by Example 2.3.6 with f(x_n) = x_1 writes
$$E_p[X_1 | S_n] = \varphi(S_n), \qquad \varphi(s) = \frac{1}{\dbinom{n}{s}} \sum_{x_n \in \{0,1\}^n} x_1 \mathbf{1}_{\{x_1 + \cdots + x_n = s\}}.$$
Among the $\binom{n}{s}$ elements (x_1, ..., x_n) ∈ {0, 1}^n such that x_1 + ... + x_n = s, there are $\binom{n-1}{s-1}$ elements for which x_1 = 1. Therefore
$$\varphi(s) = \frac{\dbinom{n-1}{s-1}}{\dbinom{n}{s}} = \frac{s}{n},$$
and we conclude that
$$E_p[X_1 | S_n] = \frac{S_n}{n} = \bar{X}_n.$$

2.4 Confidence intervals


In the previous sections, we constructed estimators Zn of some parameter g(θ), which to a given
realisation xn of the sample associate a value zn (xn ). However, the mere knowledge of this value
is not very useful, as it does not provide any indication of the accuracy of the estimation. In this
section, we introduce the notion of confidence interval which provides such an indication.
Throughout this section, a parametric model P = {Pθ , θ ∈ Θ} is given. Our purpose is to
estimate the quantity g(θ), which will always be assumed to take scalar values.

2.4.1 General definitions


Let α ∈ (0, 1) be the desired level of precision for our confidence intervals. The classical values
for α are 10%, 5% and 1%.

Definition 2.4.1 (Confidence interval). A confidence interval with level 1−α for g(θ) is an interval
In = [In− , In+ ] such that:

• the boundaries In− and In+ are statistics,

• for all θ ∈ Θ, Pθ (g(θ) ∈ In ) = 1 − α.

It is to be emphasised that in the event {g(θ) ∈ In }, it is the interval In which is random, and
not the quantity g(θ).
Sometimes it is difficult, tedious or not possible to construct confidence intervals in the sense
of Definition 2.4.1, which are called exact, and one has to resort to weaker notions. Hence, an
interval In whose boundaries In− , In+ are statistics is said to be:

• an approximate confidence interval11 if for all θ ∈ Θ, Pθ (g(θ) ∈ In ) ≥ 1 − α,

• an asymptotic confidence interval if for all θ ∈ Θ, limn→+∞ Pθ (g(θ) ∈ In ) = 1 − α.


11
Intervalle de confiance par excès en français.

Of course, both definitions can be combined to lead to the notion of asymptotic approximate
confidence interval, such that limn→+∞ Pθ (g(θ) ∈ In ) ≥ 1 − α.
We recall the definition of a quantile, which is essential in the construction of confidence
intervals.

Definition 2.4.2 (Quantile). Let X be a real-valued random variable. For any r ∈ (0, 1), a
quantile of order r for X is a number qr such that

P(X ≤ qr ) = r.

In general a quantile need not exist, and it need not be unique either. However in most cases
of interest, the variable X possesses a density which is positive on R (or on [0, +∞)), in which
case qr exists and is unique.
Since qr only depends on the law of X, we shall often directly speak of the quantile of a
distribution (for instance the standard Gaussian distribution N(0, 1)).

2.4.2 Construction of exact confidence intervals


When one possesses an estimator Zn of g(θ) whose law under Pθ is explicit, it is generally possible
to construct exact confidence intervals for g(θ). We first introduce a few notions, then detail the
method on the example of the Gaussian model.

Definition 2.4.3 (Free random variable). A random variable Q is free if its law under Pθ does not
depend on θ.

Definition 2.4.3 does not require the random variable Q to be a statistic, and in general we use free random variables Q which depend on the parameter θ. For example, in the Gaussian model, the random variables
$$X_i' = \frac{X_i - \mu}{\sigma}, \qquad i = 1, \ldots, n,$$
are iid according to the standard Gaussian distribution N(0, 1), which does not depend on (µ, σ²): they are free random variables. This example is an instance of a pivotal function.

Definition 2.4.4 (Pivotal function). A pivotal function for g(θ) is a function πn : Xn × g(Θ) → R
such that πn (Xn ; g(θ)) is free.

We now detail the construction of a confidence interval for the mean µ in the Gaussian model {N(µ, σ²), (µ, σ²) ∈ R × (0, +∞)}.
The MLE of µ is X̄_n, whose law under P_{µ,σ²} is N(µ, σ²/n). As a consequence, the random variable
$$\zeta_n = \frac{\bar{X}_n - \mu}{\sqrt{\sigma^2 / n}}$$
is free, with law N(0, 1). Still, this does not mean that the function
$$\pi_n(x_n; \mu) = \frac{\bar{x}_n - \mu}{\sqrt{\sigma^2 / n}}$$
is pivotal, because following Definition 2.4.4 it should only depend on the parameter (µ, σ²) through µ.

Let us momentarily assume that σ² is known. Then π_n is a pivotal function. As a consequence, for all a, b ∈ R such that a < b, we have
$$P_{\mu,\sigma^2}(\zeta_n \in [a, b]) = \frac{1}{\sqrt{2\pi}} \int_{x=a}^{b} \exp\left( -\frac{x^2}{2} \right) dx,$$
which rewrites
$$P_{\mu,\sigma^2}\left( \bar{X}_n - b\sqrt{\frac{\sigma^2}{n}} \leq \mu \leq \bar{X}_n - a\sqrt{\frac{\sigma^2}{n}} \right) = \frac{1}{\sqrt{2\pi}} \int_{x=a}^{b} \exp\left( -\frac{x^2}{2} \right) dx.$$
For any choice of a and b such that
$$\frac{1}{\sqrt{2\pi}} \int_{x=a}^{b} \exp\left( -\frac{x^2}{2} \right) dx = 1 - \alpha, \qquad (*)$$
the interval [X̄_n − b√(σ²/n), X̄_n − a√(σ²/n)] is a confidence interval with level 1 − α for µ.
Thus, we have constructed infinitely many confidence intervals, parametrised by a, b satisfying the constraint (∗). Among these intervals, it is customary to select the smallest one, as it will provide the most precise information on the parameter. In the present case, it can be checked (see Exercise 2.4.5 below) that the optimal choice is
$$a = \phi_{\alpha/2} = -\phi_{1-\alpha/2}, \qquad b = \phi_{1-\alpha/2},$$
where φ_r denotes the quantile of order r of the standard Gaussian distribution N(0, 1). The usual values of φ_{1−α/2} are gathered in Figure 2.2. More generally, φ_r is obtained with the command qnorm(r,mean=0,sd=1) in R12.

1 − α     φ_{1−α/2}
90%       1.65
95%       1.96
99%       2.58

[Figure: density of N(0, 1) with the quantiles φ_{α/2} = −φ_{1−α/2} and φ_{1−α/2} marked on the horizontal axis.]

Figure 2.2: Quantiles of the standard Gaussian distribution. The hatched area on the figure is equal to 1 − α.
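In practice, the resulting interval is computed in R as in the following sketch; the data and the value of σ² are placeholders, and qnorm is the quantile function mentioned above.

```r
# Minimal sketch: confidence interval for mu with known variance sigma2
x <- rnorm(30, mean = 5, sd = 2)          # placeholder sample
sigma2 <- 4; alpha <- 0.05
xbar <- mean(x); n <- length(x)
xbar + c(-1, 1) * qnorm(1 - alpha / 2) * sqrt(sigma2 / n)  # level 95% interval
```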

Exercise 2.4.5. Let α ∈ (0, 1/2).

1. Show that a pair (a, b) is a solution to the minimisation problem
$$\min\ b - a \quad \text{such that} \quad \frac{1}{\sqrt{2\pi}} \int_{x=a}^{b} \exp(-x^2/2)\, dx = 1 - \alpha \qquad (1)$$
if and only if a = φ_u and b = φ_{u+1−α}, where u is a solution to the minimisation problem
$$\min\ \phi_{u+1-\alpha} - \phi_u \quad \text{such that} \quad 0 < u < \alpha. \qquad (2)$$
12
In R, classical probability distributions are associated with a keyword (unif, exp and norm for uniform, expo-
nential and Gaussian, respectively). In front of these keywords, the respective prefixes d, p and q return the density,
cumulative distribution function and quantile of the associated distribution, while the prefix r allows to generate samples
from this distribution.

2. Show that if u is a solution to (2), then φ′u+1−α = φ′u .

3. Show that there exists a unique such u and give the unique solution to (1). ◦

If we assume that σ² is not known, then we have to find another pivotal function, which will no longer depend on σ². A simple idea consists in replacing σ² by an estimator of the variance, and checking whether the resulting random variable is free.

Lemma 2.4.6 (Freeness of ξ_n). The random variable
$$\xi_n = \frac{\bar{X}_n - \mu}{\sqrt{S_n^2 / n}}, \qquad S_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2,$$
is free, and its law is the Student distribution t(n − 1).

Proof. Let us recall the notation
$$X_i' = \frac{X_i - \mu}{\sigma} \sim N(0, 1),$$
and define accordingly
$$\bar{X}_n' = \frac{1}{n} \sum_{i=1}^{n} X_i', \qquad S_n'^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i' - \bar{X}_n')^2.$$
We first write
$$\xi_n = \frac{\dfrac{\bar{X}_n - \mu}{\sigma}}{\sqrt{\dfrac{1}{n} \cdot \dfrac{1}{n-1} \displaystyle\sum_{i=1}^{n} \frac{(X_i - \bar{X}_n)^2}{\sigma^2}}} = \frac{\bar{X}_n'}{\sqrt{S_n'^2 / n}},$$
which already shows that ξ_n is free since the right-hand side no longer depends on (µ, σ²). The exact law of ξ_n is given by Proposition A.4.3 in Appendix A.

Thus, the function
$$\varrho_n(x_n; \mu) = \frac{\bar{x}_n - \mu}{\sqrt{s_n^2 / n}}, \qquad s_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2,$$
which no longer depends on σ² but in turn requires to estimate the variance of the sample, is pivotal. As a consequence, for all a, b ∈ R such that a < b,
$$P_{\mu,\sigma^2}(\xi_n \in [a, b]) = P_{\mu,\sigma^2}\left( \bar{X}_n - b\sqrt{\frac{S_n^2}{n}} \leq \mu \leq \bar{X}_n - a\sqrt{\frac{S_n^2}{n}} \right) = \int_{x=a}^{b} p_{n-1}(x)\, dx,$$
where p_{n−1} is the density of the law t(n − 1). Once again, as soon as a and b are chosen so that
$$\int_{x=a}^{b} p_{n-1}(x)\, dx = 1 - \alpha,$$
we get a confidence interval with level 1 − α for µ. The smallest such interval is obtained by taking
$$a = t_{n-1,\alpha/2} = -t_{n-1,1-\alpha/2}, \qquad b = t_{n-1,1-\alpha/2},$$



where t_{n,r} denotes the quantile of order r of the Student distribution t(n), which is obtained with the command qt(r,n) in R.
As a conclusion, the confidence interval with level 1 − α for µ is
" r r #
σ2 σ2
In = X n − φ1−α/2 , X n + φ1−α/2
n n

if σ 2 is known, and
" r r #
Sn2 Sn2
In = X n − tn−1,1−α/2 , X n + tn−1,1−α/2
n n

if σ 2 is not known. Notice that tn−1,1−α/2 is larger than φ1−α/2 , so that (up to the fluctuations in
the estimation of σ 2 by Sn2 ) the latter confidence interval is larger than the former. This is natural:
less information is available in the second case, so that there is more uncertainty on the parameter
µ.
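Both intervals are easily computed in R; the following short illustrative sketch uses a simulated sample with hypothetical parameter values:

    # Illustrative sketch: exact confidence intervals for the mean of a Gaussian sample
    set.seed(1)
    n <- 50; mu <- 2; sigma2 <- 4; alpha <- 0.05       # hypothetical true parameters
    x <- rnorm(n, mean = mu, sd = sqrt(sigma2))
    xbar <- mean(x); s2 <- var(x)                      # var() uses the 1/(n-1) normalisation, i.e. S_n^2
    # Known variance: Gaussian quantile
    xbar + c(-1, 1) * qnorm(1 - alpha/2) * sqrt(sigma2 / n)
    # Unknown variance: Student t(n-1) quantile
    xbar + c(-1, 1) * qt(1 - alpha/2, df = n - 1) * sqrt(s2 / n)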
Exercise 2.4.7. The purpose of this exercise is to prove that for all n ≥ 2, for all r ∈ (1/2, 1),
tn−1,r ≥ φr .
1. Show that this is equivalent to an inequality on the cumulative distribution functions of
t(n − 1) and N(0, 1) on [0, +∞).
2. Show that if x > 0 and T_n ∼ t(n), then P(T_n ≤ x) = E[\Phi(x\sqrt{Y_n/n})], where Φ is the cumulative distribution function of N(0, 1) and Y_n ∼ χ²(n).
3. Conclude using Jensen’s inequality (twice). ◦
We now summarise the method to construct an exact confidence interval for g(θ).
(i) Find a pivotal function πn (xn ; g(θ)); denote by Qn the free random variable πn (Xn ; g(θ)).
(ii) Fix a, b ∈ R which minimise b − a under the constraint that P(Qn ∈ [a, b]) = 1 − α.
(iii) Rewrite the condition Qn ∈ [a, b] as g(θ) ∈ In , where the boundaries of In are statistics.
Denoting by q_{n,r} the quantile of order r of Q_n, it is often the case that the optimal choice for a, b is the pair q_{n,α/2}, q_{n,1−α/2}, or 0, q_{n,1−α} if Q_n is a nonnegative random variable.
Exercise 2.4.8. In the Gaussian model, find a confidence interval with level 1 − α for σ 2 . ◦
Exercise 2.4.9. In the Exponential model, find a confidence interval with level 1 − α for λ. ◦

2.4.3 * Construction of approximate confidence intervals


The construction of exact confidence intervals presented in the previous paragraph requires finding a pivotal function whose law is exactly known. In some cases, this is not possible or too complicated, and one may rely on merely partial knowledge of the law of some estimators, for instance bounds on the moments.
Example 2.4.10 (Beta model). In the Beta model {β(a, b), (a, b) ∈ (0, +∞)²}, we want to construct a confidence interval for

    g(a, b) = E_{a,b}[X_1] = \frac{a}{a+b}.

Of course, a straightforward estimator of g(θ) is \overline{X}_n. However, the computation of the law of \overline{X}_n under P_{a,b} is tedious and we want to avoid it.

In such a case, one can construct approximate confidence intervals, thanks to concentration
inequalities.

Definition 2.4.11 (Concentration inequality). A concentration inequality for a random variable Y


is an inequality of the form
P(|Y − E[Y ]| ≥ r) ≤ cY (r),
for some function cY converging to 0 when r → +∞.

A famous concentration inequality for random variables Y such that E[|Y|²] < +∞ is the Bienaymé–Chebychev inequality

    P(|Y - E[Y]| \ge r) \le \frac{\mathrm{Var}(Y)}{r^2},

which follows from Markov's inequality. We first explain how to obtain an approximate confidence interval from such an inequality, for the Beta model of Example 2.4.10. For all r > 0, the Bienaymé–Chebychev inequality yields

    P_{a,b}\left( \left| \overline{X}_n - \frac{a}{a+b} \right| \ge \frac{r}{\sqrt{n}} \right) \le \frac{\mathrm{Var}_{a,b}[\overline{X}_n]}{r^2/n} = \frac{\mathrm{Var}_{a,b}[X_1]}{r^2}.

As a consequence, taking r such that

    \frac{\mathrm{Var}_{a,b}[X_1]}{r^2} \le \alpha    (∗)

ensures that

    P_{a,b}\left( \frac{a}{a+b} \in \left[ \overline{X}_n - \frac{r}{\sqrt{n}},\ \overline{X}_n + \frac{r}{\sqrt{n}} \right] \right) \ge 1 - \alpha.
But since Vara,b [X1 ] depends on the parameters a and b, the condition on r prevents the bounds of
the interval from being statistics. In the case of the Beta model, and more generally for bounded
random variables, the next lemma provides a universal bound on the variance.

Lemma 2.4.12 (Universal bound on the variance). Let Y be a random variable taking its values in [0, 1]. Then

    \mathrm{Var}(Y) \le \frac{1}{4}.
Exercise 2.4.13. Find a variable Y ∈ [0, 1] such that Var(Y ) = 1/4. ◦

Proof of Lemma 2.4.12. Since Y takes its values in [0, 1], we have Y² ≤ Y, hence E[Y²] ≤ E[Y]. Writing a = E[Y] ∈ [0, 1], we deduce

    \mathrm{Var}(Y) = E[Y^2] - E[Y]^2 \le a - a^2 = a(1-a) \le \frac{1}{4},

which completes the proof.

Exercise 2.4.14. Let Y be a random variable such that E[|Y|²] < +∞.

1. Show that Var(Y ) = mina∈R E[|Y − a|2 ].

2. Deduce an alternative proof of Lemma 2.4.12. ◦

As a consequence of Lemma 2.4.12, it suffices to take

    r = \frac{1}{2\sqrt{\alpha}} = r^{\mathrm{Cheb}}(\alpha)

to ensure that (∗) holds. Thus, we get the approximate confidence interval

    I_n^{\mathrm{Cheb}} = \left[ \overline{X}_n - \frac{r^{\mathrm{Cheb}}(\alpha)}{\sqrt{n}},\ \overline{X}_n + \frac{r^{\mathrm{Cheb}}(\alpha)}{\sqrt{n}} \right].

As soon as E[|Y|^p] < +∞ for some p ≥ 1, Chebychev's inequality can be immediately generalised into

    P(|Y - E[Y]| \ge r) \le \frac{E[|Y - E[Y]|^p]}{r^p}.

More generally, the faster c_Y converges to 0, the more useful the inequality is, in the sense that it provides sharper confidence intervals. In this respect, the Hoeffding inequality is often employed for bounded random variables.

Lemma 2.4.15 (Hoeffding's inequality). Let X_1, . . . , X_n be iid random variables taking their values in [0, 1]. For all n ≥ 1, for all r ≥ 0,

    P\left( \sum_{i=1}^n (X_i - E[X_i]) \ge r\sqrt{n} \right) \le \exp(-2r^2).

Proof. We first define Y_1 = X_1 − E[X_1] and show that, for all λ ≥ 0,

    E[\exp(\lambda Y_1)] \le \exp(\lambda^2/8).

To this aim, let F(λ) = log E[exp(λY_1)]. Then

    F'(\lambda) = \frac{E[Y_1 \exp(\lambda Y_1)]}{E[\exp(\lambda Y_1)]},

and

    F''(\lambda) = \frac{E[Y_1^2 \exp(\lambda Y_1)]\,E[\exp(\lambda Y_1)] - E[Y_1 \exp(\lambda Y_1)]^2}{E[\exp(\lambda Y_1)]^2}
                 = \frac{1}{E[\exp(\lambda Y_1)]}\,E\left[ \left( Y_1 - \frac{E[Y_1 \exp(\lambda Y_1)]}{E[\exp(\lambda Y_1)]} \right)^2 \exp(\lambda Y_1) \right]
                 = \frac{1}{E[\exp(\lambda Y_1)]}\,E\left[ \left( X_1 - \frac{E[X_1 \exp(\lambda Y_1)]}{E[\exp(\lambda Y_1)]} \right)^2 \exp(\lambda Y_1) \right].

Using the same arguments as in the proof of Lemma 2.4.12, we get F''(λ) ≤ 1/4. Noting that F'(0) = E[Y_1] = 0 and F(0) = log 1 = 0, and integrating twice yields F(λ) ≤ λ²/8, whence the claimed result.

To complete the proof, we now write, for all λ > 0,

    P\left( \sum_{i=1}^n (X_i - E[X_i]) \ge r\sqrt{n} \right)
      = P\left( \exp\left( \lambda \sum_{i=1}^n (X_i - E[X_i]) \right) \ge \exp(\lambda r \sqrt{n}) \right)
      \le E\left[ \exp\left( \lambda \sum_{i=1}^n (X_i - E[X_i]) \right) \right] \exp(-\lambda r \sqrt{n})
      = E[\exp(\lambda Y_1)]^n \exp(-\lambda r \sqrt{n})
      \le \exp\left( \frac{\lambda^2 n}{8} - \lambda r \sqrt{n} \right),

where we have used the fact that y ↦ exp(λy) is increasing at the first line and the Markov inequality at the second line. The minimum of the quantity λ²n/8 − λr√n over λ > 0 is reached at λ = 4r/√n, at which point it equals −2r², which completes the proof.

Remark 2.4.16. The proof of Lemma 2.4.15 conveys two remarks.

• It is not surprising that the bound on F''(λ) follows from the same arguments as in the proof of Lemma 2.4.12. Indeed, one can define a probability measure P_λ by the identity

    P_\lambda(A) = \frac{E[\mathbf{1}_A \exp(\lambda Y_1)]}{E[\exp(\lambda Y_1)]},

for all events A, and then check that F'(λ) = E_λ[Y_1] and F''(λ) = Var_λ[Y_1] = Var_λ[X_1]. Since X_1 takes its values in [0, 1], the bound on F''(λ) is in fact a direct consequence of Lemma 2.4.12.

• The combination of the application of the increasing function y ↦ exp(λy), the use of Markov's inequality, and the optimisation of the result over the values of λ > 0 is a classical method, called Chernoff's method. It is as simple as it is powerful. ◦
Lemma 2.4.15 is often applied in the form of the following corollary.

Corollary 2.4.17 (Application of Hoeffding's inequality). Let X_1, . . . , X_n be iid random variables taking their values in [0, 1]. For all n ≥ 1, for all r ≥ 0,

    P\left( \left| \overline{X}_n - E[X_1] \right| \ge r/\sqrt{n} \right) \le 2\exp(-2r^2).

Proof. The proof follows from the union bound

    P\left( \left| \overline{X}_n - E[X_1] \right| \ge r/\sqrt{n} \right)
      = P\left( \left\{ \overline{X}_n - E[X_1] \ge r/\sqrt{n} \right\} \cup \left\{ \overline{X}_n - E[X_1] \le -r/\sqrt{n} \right\} \right)
      \le P\left( \overline{X}_n - E[X_1] \ge r/\sqrt{n} \right) + P\left( (1 - \overline{X}_n) - E[1 - X_1] \ge r/\sqrt{n} \right)
      \le 2\exp(-2r^2),

where we have applied Hoeffding's inequality to the random variables X_1, . . . , X_n as well as to the random variables 1 − X_1, . . . , 1 − X_n.

Applying Corollary 2.4.17 to the Beta model presented in Example 2.4.10 yields, for all r ≥ 0,

    P_{a,b}\left( \left| \overline{X}_n - \frac{a}{a+b} \right| \ge \frac{r}{\sqrt{n}} \right) \le 2\exp(-2r^2).

As a consequence, taking r such that 2 exp(−2r²) = α, that is

    r = \sqrt{\frac{1}{2}\log\frac{2}{\alpha}} = r^{\mathrm{Hoeff}}(\alpha),

we get

    P_{a,b}\left( \left| \overline{X}_n - \frac{a}{a+b} \right| \le \frac{r}{\sqrt{n}} \right) \ge 1 - \alpha.

In conclusion,

    I_n^{\mathrm{Hoeff}} = \left[ \overline{X}_n - \frac{r^{\mathrm{Hoeff}}(\alpha)}{\sqrt{n}},\ \overline{X}_n + \frac{r^{\mathrm{Hoeff}}(\alpha)}{\sqrt{n}} \right]

is an approximate confidence interval for a/(a + b).
The comparison of the widths of both intervals, as a function of α, is plotted on Figure 2.3. As
is expected, Hoeffding’s inequality provides narrower, and thus more precise, confidence intervals.

Figure 2.3: Log-log plot of the functions r^Cheb(α) and r^Hoeff(α), for α ranging from 10⁻⁴ to 10⁻¹.
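This comparison can be reproduced with a few lines of R (an illustrative sketch using base graphics only):

    # Radii of the Chebychev and Hoeffding approximate confidence intervals
    r_cheb  <- function(alpha) 1 / (2 * sqrt(alpha))
    r_hoeff <- function(alpha) sqrt(log(2 / alpha) / 2)
    alpha <- 10^seq(-4, -1, length.out = 100)
    plot(alpha, r_cheb(alpha), type = "l", log = "xy", xlab = "alpha", ylab = "radius")
    lines(alpha, r_hoeff(alpha), lty = 2)
    legend("topright", legend = c("Chebychev", "Hoeffding"), lty = 1:2)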

2.4.4 Construction of asymptotic confidence intervals


In this section, we address the case where asymptotic properties of an estimator Zn of g(θ) are
available, in particular consistency and an expression for asymptotic variance.
We first state a general result for asymptotically normal estimators. We recall that φr denotes
the quantile of order r of the standard Gaussian distribution N(0, 1).
Proposition 2.4.18 (Asymptotic confidence intervals). Let Z_n be a consistent and asymptotically normal estimator of g(θ), with asymptotic variance V(θ). Assume that a consistent estimator \widehat{V}_n of the variance is available. Then, for all α ∈ (0, 1),

    I_n = \left[ Z_n - \phi_{1-\alpha/2}\sqrt{\frac{\widehat{V}_n}{n}},\ Z_n + \phi_{1-\alpha/2}\sqrt{\frac{\widehat{V}_n}{n}} \right]

is an asymptotic confidence interval with level 1 − α for g(θ).

In general, it is not difficult to find a consistent estimator for V (θ): as soon as V is continuous
and θbn is a consistent estimator of θ, one can take Vbn = V (θbn ). If there is no such estimator
available, the procedure of variance stabilisation described in Exercise 2.A.9 can be applied.

Proof of Proposition 2.4.18. We start from the asymptotic normality of Z_n, which writes

    \sqrt{\frac{n}{V(\theta)}}\,(Z_n - g(\theta)) \to N(0, 1), \quad \text{in distribution.}

Since \widehat{V}_n converges to V(θ) in probability, Slutsky's Theorem implies

    \sqrt{\frac{n}{\widehat{V}_n}}\,(Z_n - g(\theta)) = \sqrt{\frac{V(\theta)}{\widehat{V}_n}} \times \sqrt{\frac{n}{V(\theta)}}\,(Z_n - g(\theta)) \to N(0, 1), \quad \text{in distribution,}

as well. As a consequence¹³, for all a, b ∈ R such that a ≤ b,

    \lim_{n \to +\infty} P\left( \sqrt{\frac{n}{\widehat{V}_n}}\,(Z_n - g(\theta)) \in [a, b] \right) = \frac{1}{\sqrt{2\pi}} \int_{x=a}^{b} \exp\left( -\frac{x^2}{2} \right) dx.

Following the same arguments as in the construction of exact confidence intervals for the Gaussian
model, we take b = −a = φ1−α/2 , which results in the expected confidence interval.
Example 2.4.19 (The Bernoulli model). We want to employ the estimator \widehat{p}_n = \frac{1}{n}\sum_{i=1}^n X_i to construct an asymptotic confidence interval for p. This estimator is (strongly) consistent, and asymptotically normal with asymptotic variance V(p) = p(1 − p). As a consequence, a consistent estimator of the asymptotic variance is given by \widehat{p}_n(1 − \widehat{p}_n), and we get that

    I_n = \left[ \widehat{p}_n - \phi_{1-\alpha/2}\sqrt{\frac{\widehat{p}_n(1-\widehat{p}_n)}{n}},\ \widehat{p}_n + \phi_{1-\alpha/2}\sqrt{\frac{\widehat{p}_n(1-\widehat{p}_n)}{n}} \right]

is an asymptotic confidence interval with level 1 − α for p.
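The interval is immediate to compute in R; the following illustrative sketch uses a simulated Bernoulli sample with a hypothetical value of p:

    # Illustrative sketch: asymptotic confidence interval for p in the Bernoulli model
    set.seed(2)
    n <- 200; p <- 0.3; alpha <- 0.05            # hypothetical true parameter
    x <- rbinom(n, size = 1, prob = p)
    phat <- mean(x)
    phat + c(-1, 1) * qnorm(1 - alpha/2) * sqrt(phat * (1 - phat) / n)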

Remark 2.4.20. For the Bernoulli model, when the size n of the sample is too small for the asymp-
totic approach to be used, approximate confidence intervals can be constructed based on Cheby-
chev’s or Hoeffding’s inequality. However, finding a pivotal function to obtain exact confidence
intervals is more difficult. ◦

The sketch of the proof of Proposition 2.4.18 allows one to construct asymptotic confidence intervals even when the estimator Z_n is not asymptotically normal. As an example, recall that for the Uniform model of Exercise 2.2.10, the MLE \widehat{\theta}_n = \max_{1 \le i \le n} X_i is strongly consistent, and satisfies

    n(\theta - \widehat{\theta}_n) \to E(1/\theta), \quad \text{in distribution.}

As a consequence,

    n\left( 1 - \frac{\widehat{\theta}_n}{\theta} \right) \to E(1), \quad \text{in distribution,}
13
We recall that if the random variable ζn converges in distribution to a random variable ζ which possesses a density,
then for any interval [a, b], P(ζn ∈ [a, b]) converges to P(ζ ∈ [a, b]), see [2, Remarque 5.3.11, p. 86].

which implies, for all a, b ≥ 0 such that a ≤ b,

    \lim_{n \to +\infty} P\left( n\left( 1 - \frac{\widehat{\theta}_n}{\theta} \right) \in [a, b] \right) = \int_{x=a}^{b} \exp(-x)\,dx = \exp(-a) - \exp(-b).

We then have

    n\left( 1 - \frac{\widehat{\theta}_n}{\theta} \right) \in [a, b] \quad \text{if and only if} \quad \theta \in \left[ \frac{\widehat{\theta}_n}{1 - a/n},\ \frac{\widehat{\theta}_n}{1 - b/n} \right],

and the choice of a and b such that exp(−a) − exp(−b) = 1 − α for which b − a is minimal is a = 0, b = −log α. As a conclusion, an asymptotic confidence interval for θ is

    I_n = \left[ \widehat{\theta}_n,\ \frac{\widehat{\theta}_n}{1 + (\log \alpha)/n} \right].
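A minimal R sketch of this interval, assuming (as in the Uniform model referred to above) that the observations are uniform on [0, θ] with a hypothetical value of θ:

    # Illustrative sketch: asymptotic confidence interval for theta in the Uniform model
    set.seed(3)
    n <- 100; theta <- 2; alpha <- 0.05          # hypothetical true parameter
    x <- runif(n, min = 0, max = theta)
    theta_hat <- max(x)
    c(theta_hat, theta_hat / (1 + log(alpha) / n))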

2.5 * Kernel density estimation


In some cases, the assumption that the law P of X_1, . . . , X_n belongs to a parametric model P = {P_θ, θ ∈ Θ} may not be realistic, and one may be interested in the direct estimation of the probability distribution P, without any 'shape' assumption. This is the problem of nonparametric estimation¹⁴.
In the sequel, we assume that the variables X_1, . . . , X_n take their values in the space X = R, and we denote by P_L(R) the space of probability densities with respect to the Lebesgue measure on R. We also assume that the law P of X_1, . . . , X_n has a density p ∈ P_L(R). Our purpose is thus to construct an estimator \widehat{p}_n of p, that is to say a random element of P_L(R) which depends on X_n = (X_1, . . . , X_n).
Let us define the empirical cumulative distribution function of the sample X_n by

    \forall x \in \mathbb{R}, \quad \widehat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{\{X_i \le x\}}.

By the strong Law of Large Numbers, for all x ∈ R, \widehat{F}_n(x) converges to F(x) almost surely¹⁵, so that \widehat{F}_n(x) is a strongly consistent (and unbiased) estimator of F(x). Since p = F', it is natural to try to estimate p by the derivative of \widehat{F}_n. However, since \widehat{F}_n is piecewise constant, it is not differentiable. To overcome this difficulty, one may replace the derivative with a finite difference approximation. The centered version of such an approximation leads to the following estimator of p.

Definition 2.5.1 (Rosenblatt estimator). For any h > 0, the Rosenblatt estimator¹⁶ of p is the random density \widehat{p}_{n,h} defined by

    \forall x \in \mathbb{R}, \quad \widehat{p}_{n,h}(x) = \frac{\widehat{F}_n(x+h) - \widehat{F}_n(x-h)}{2h}.
14
This section is based on the first chapter of [5].
15
More properties of the random function Fbn will be studied in Section 4.2.
16
On parle aussi d’estimateur à fenêtre glissante en français.

Exercise 2.5.2. Check that \widehat{p}_{n,h} rewrites

    \widehat{p}_{n,h}(x) = \frac{1}{2nh}\sum_{i=1}^n \mathbf{1}_{\{x-h < X_i \le x+h\}},

and deduce that \widehat{p}_{n,h} \in P_L(\mathbb{R}). ◦

Notice that the expression obtained in Exercise 2.5.2 writes under the abstract form

    \widehat{p}_{n,h}(x) = \frac{1}{nh}\sum_{i=1}^n K\left( \frac{x - X_i}{h} \right),

where K(z) = \frac{1}{2}\mathbf{1}_{\{-1 \le z < 1\}} is the density of the uniform distribution on [−1, 1]. This remark allows the Rosenblatt estimator to be generalised as follows.

Definition 2.5.3 (Kernel Density Estimator). A kernel K on R is a probability density on R. The associated Kernel Density Estimator (KDE) of p is the function \widehat{p}_{n,h} defined by

    \forall x \in \mathbb{R}, \quad \widehat{p}_{n,h}(x) = \frac{1}{nh}\sum_{i=1}^n K\left( \frac{x - X_i}{h} \right).

The KDE \widehat{p}_{n,h} associated with a given kernel K is also called the Parzen–Rosenblatt estimator of p. This estimator depends on the parameter h > 0, which is called the bandwidth¹⁷. We have thus naturally constructed a one-parameter family of estimators for p, similarly to (but perhaps less artificially than) the example of Exercise 2.1.11. And just like in this example, the 'good' choice of h results from a bias-variance tradeoff.
Let us fix x_0 ∈ R. The MSE of \widehat{p}_{n,h}(x_0) possesses the bias-variance decomposition

    E\left[ (\widehat{p}_{n,h}(x_0) - p(x_0))^2 \right] = \left( E[\widehat{p}_{n,h}(x_0)] - p(x_0) \right)^2 + \mathrm{Var}\left( \widehat{p}_{n,h}(x_0) \right),

both terms of which are studied in the next proposition.

Proposition 2.5.4 (Bias and variance of the KDE). Assume that:

(i) the density p is C² on R and the functions p and p'' are bounded on R,

(ii) the kernel K satisfies \int_{z \in \mathbb{R}} zK(z)\,dz = 0, \int_{z \in \mathbb{R}} z^2 K(z)\,dz < +\infty, \int_{z \in \mathbb{R}} K(z)^2\,dz < +\infty.

Then there exist C_b, C_v ∈ [0, +∞) depending only on p and K such that, for all x_0 ∈ R, the bias of the KDE satisfies

    \left( E[\widehat{p}_{n,h}(x_0)] - p(x_0) \right)^2 \le C_b h^4,

while the variance of the KDE satisfies

    \mathrm{Var}\left( \widehat{p}_{n,h}(x_0) \right) \le \frac{C_v}{nh}.
17
Largeur de bande en français.

Proof. We first address the bias and compute

    E[\widehat{p}_{n,h}(x_0)] = E\left[ \frac{1}{nh}\sum_{i=1}^n K\left( \frac{x_0 - X_i}{h} \right) \right]
      = \frac{1}{h}\int_{x \in \mathbb{R}} K\left( \frac{x_0 - x}{h} \right) p(x)\,dx
      = \int_{z \in \mathbb{R}} K(z)\,p(x_0 + hz)\,dz.

By the Taylor–Lagrange Theorem,

    \left| p(x_0 + hz) - p(x_0) - hz\,p'(x_0) \right| \le \frac{h^2 z^2}{2}\sup_{x \in \mathbb{R}} |p''(x)|,

so that, recalling that

    \int_{z \in \mathbb{R}} K(z)\,dz = 1, \qquad \int_{z \in \mathbb{R}} zK(z)\,dz = 0,

we get

    \left| E[\widehat{p}_{n,h}(x_0)] - p(x_0) \right| \le \frac{h^2}{2}\sup_{x \in \mathbb{R}} |p''(x)| \int_{z \in \mathbb{R}} z^2 K(z)\,dz,

which yields the first part of the proposition, with

    C_b = \left( \frac{1}{2}\sup_{x \in \mathbb{R}} |p''(x)| \int_{z \in \mathbb{R}} z^2 K(z)\,dz \right)^2.

We now address the variance and write

    \mathrm{Var}\left( \widehat{p}_{n,h}(x_0) \right) = \frac{1}{n}\mathrm{Var}\left( \frac{1}{h} K\left( \frac{x_0 - X_1}{h} \right) \right)
      \le \frac{1}{n} E\left[ \frac{1}{h^2} K\left( \frac{x_0 - X_1}{h} \right)^2 \right]
      = \frac{1}{nh^2}\int_{x \in \mathbb{R}} K\left( \frac{x_0 - x}{h} \right)^2 p(x)\,dx
      = \frac{1}{nh}\int_{z \in \mathbb{R}} K(z)^2\,p(x_0 + hz)\,dz,

which yields the second part of the proposition with

    C_v = \sup_{x \in \mathbb{R}} p(x) \int_{z \in \mathbb{R}} K(z)^2\,dz.

An elementary computation shows that, given the size n of the sample, the optimal value h_n^* minimising the bound C_b h^4 + C_v/(nh) on the MSE is

    h_n^* = \left( \frac{C_v}{4nC_b} \right)^{1/5},

in which case the MSE is of order n^{-4/5} and thus converges to 0 more slowly than the usual rate n^{-1} obtained for asymptotically normal estimators in the parametric framework! Observe however that the value of h_n^* depends on the constants C_b and C_v, which themselves depend on the density p, which is the quantity we are trying to estimate. As a consequence, in practice one cannot compute h_n^*, and selecting a 'good' value of the bandwidth is part of the problem of nonparametric estimation.

Remark 2.5.5. Computing the MSE at a given point x_0 is not the only possible criterion for assessing the quality of the estimator \widehat{p}_{n,h}. For instance, another popular choice is the Mean Integrated Squared Error, defined by

    \mathrm{MISE}(\widehat{p}_{n,h}; p) = E\left[ \int_{x \in \mathbb{R}} (\widehat{p}_{n,h}(x) - p(x))^2\,dx \right]. ◦

Let us conclude this brief introduction to nonparametric estimation with an empirical observation of the under- and oversmoothing phenomena. Assume that the size n of the sample is fixed. If h is too small, then each function x ↦ (1/h)K((x − X_i)/h) is very peaked around X_i, and the resulting KDE looks very irregular. Besides, if one draws another sample X_n, the corresponding KDE will look very different; in other words, the variance of the KDE is large when h is small. Thus, the KDE is not informative on the actual density p because it is not smooth enough. On the contrary, if h is taken too large, then all functions x ↦ (1/h)K((x − X_i)/h) look the same, and are just small translations of the function x ↦ (1/h)K(x/h). In this case, the KDE just looks like a dilation of the kernel K, and does not provide any information on the density p either. These phenomena are illustrated on Figure 2.4. They are to be compared with the over- and underfitting phenomena which will be discussed in Chapter 5 in the context of linear regression.

Figure 2.4: Kernel Density Estimation: the density p and the points of the sample are plotted in red. Three KDEs, corresponding to the same (Gaussian) kernel but with different bandwidths (h too small, a good h, h too large), are superposed.
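In R, kernel density estimates are provided by the built-in density function, whose bw argument is the bandwidth h. The following illustrative sketch reproduces the qualitative behaviour of Figure 2.4 on a simulated sample (the sample and the bandwidth values are hypothetical):

    # Illustrative sketch: effect of the bandwidth on a Gaussian-kernel KDE
    set.seed(4)
    x <- rnorm(200, mean = 1, sd = 1)                  # hypothetical sample
    plot(density(x, bw = 0.02, kernel = "gaussian"))   # h too small: very irregular
    lines(density(x, bw = 0.3, kernel = "gaussian"))   # reasonable h
    lines(density(x, bw = 3, kernel = "gaussian"))     # h too large: dilated kernel
    curve(dnorm(x, mean = 1, sd = 1), add = TRUE, col = "red")  # true density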

2.A Exercises
Exercise 2.A.1 (Asymptotic variance of the empirical variance). Let X_1, . . . , X_n be iid random variables, such that E[X_1^4] < +∞ and E[X_1] = 0. We write

    \overline{X}_n = \frac{1}{n}\sum_{i=1}^n X_i, \qquad V_n = \frac{1}{n}\sum_{i=1}^n (X_i - \overline{X}_n)^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \overline{X}_n^2.

The purpose of the exercise is to show that \sqrt{n}(V_n − Var(X_1)) converges to a Gaussian distribution, and to compute the associated variance. For k = 2, 3, 4, we write ρ_k = E[X_1^k].

1. Let us define the vectors Yn = (Xn , Xn2 ) and y = (0, ρ2 ) in R2 . Compute the covariance
matrix of Yn .

2. Let ϕ(x1 , x2 ) = x2 − x21 . Compute ∇ϕ(y).


3. Express V_n as a function of \overline{Y}_n = \frac{1}{n}\sum_{i=1}^n Y_i. Deduce that \sqrt{n}(V_n − ρ_2) converges to N(0, ρ_4 − ρ_2^2). ◦

↸ Exercise 2.A.2 (Poisson model). The Poisson model is the set {P(λ), λ > 0}, under which we recall that, for all λ > 0,

    \forall k \in \mathbb{N}, \quad P_\lambda(X_1 = k) = \exp(-\lambda)\frac{\lambda^k}{k!}.

1. A first moment estimator.
   (a) Compute E_λ[X_1] and deduce a moment estimator \widetilde{\lambda}_n^{(1)}.
   (b) What is the bias of this estimator?
   (c) Compute E_λ[X_1^2] and deduce the asymptotic variance of \widetilde{\lambda}_n^{(1)}.

2. A second moment estimator.
   (a) Deduce another moment estimator \widetilde{\lambda}_n^{(2)} from the expression of E_λ[X_1^2].
   (b) Using Jensen's inequality, show that this estimator is biased.
   (c) Compute E_λ[X_1^3], E_λ[X_1^4], and deduce the asymptotic variance of \widetilde{\lambda}_n^{(2)}.

3. Maximum Likelihood Estimator.
   (a) Write the likelihood of the model.
   (b) Compute the MLE of λ.
   (c) Compute the Fisher information of the model.
   (d) Show that the MLE is efficient.
   (e) Use the MLE to construct an asymptotic confidence interval with level 1 − α for λ. ◦

↸ Exercise 2.A.3 (Nonexistence of an unbiased estimator). Let p ∈ (0, 1) and X1 , . . . , Xn inde-


pendent Bernoulli variables with parameter p. The purpose of this exercise is to show that there
does not exist an unbiased estimator Zn of g(p) = 1/p.

1. For all xn ∈ {0, 1}n , express Pp (Xn = xn ) as a function of k = x1 + · · · + xn .



2. We assume that there exists an unbiased estimator Z_n of g(p). Show that there exist real numbers a_0, . . . , a_n such that

       \sum_{k=0}^{n} a_k\,p^k (1-p)^{n-k} = \frac{1}{p},

   for all p ∈ (0, 1).

3. Conclude that there cannot exist such an estimator. ◦

Exercise 2.A.4 (Nonuniqueness of MLE). For θ ∈ R, we denote by Pθ the uniform distribution


on the interval [θ, θ + 1].

1. Construct a moment estimator for θ.

2. Write the likelihood Ln (xn ; θ) of the model. What do you observe?

3. For all t ∈ [0, 1], we introduce the estimator

       \widehat{\theta}_n^t = (1-t)\left( \max_{1 \le i \le n} X_i - 1 \right) + t \min_{1 \le i \le n} X_i.

   Show that \widehat{\theta}_n^t is strongly consistent.

4. Compute the marginal distributions of U = min1≤i≤n Xi − θ and V = max1≤i≤n Xi − θ,


and deduce the bias of θbnt .

5. Write the MSE R(θbnt ; θ) as a function of t, α = E[U 2 ] and γ = E[(1 − V )U ]. Which value
of t minimises this quantity?

6. Compute the joint density of (U, V ) and deduce an expression for the MSE of the optimal
choice of θbnt . ◦

↸ Exercise 2.A.5 (Estimation in the geometric model). We consider the geometric model {Geo(p), p ∈
(0, 1)}.

1. Write the likelihood of the model.

2. Compute the Maximum Likelihood Estimator pbn of p, and show that it is strongly consistent.

3. Show that pbn is asymptotically normal and compute its asymptotic variance.

4. Is pbn asymptotically efficient?

5. Construct an asymptotic confidence interval for p with level 95%. ◦

↸ Exercise 2.A.6 (MLE in the Weibull model). A random variable X > 0 is said to follow the
Weibull distribution with parameter m > 0 if

∀x > 0, P(X > x) = exp (−xm ) .

1. Warm-up.

(a) Do you know a particular case of Weibull distribution?


(b) What is the law of X k for k > 0?

2. Maximum Likelihood Estimation.


(a) Write the likelihood and the log-likelihood of a realisation xn = (x1 , . . . , xn ).
(b) Let us define

        \varphi(m; x_n) = \frac{1}{m} + \frac{1}{n}\sum_{i=1}^n (1 - x_i^m)\log x_i.

    Show that m ↦ ϕ(m; x_n) is decreasing and that, if x_n ≠ (1, . . . , 1), there exists a unique m_n(x_n) > 0 such that ϕ(m_n(x_n); x_n) = 0. What about the case x_n = (1, . . . , 1)?
(c) Conclude that \widehat{m}_n = m_n(X_n) is the MLE of m.
3. Consistency.
(a) For all m > 0, compute the limit, when n → +∞, of ϕ(m; X_n). You may admit the identity

        \int_{y=0}^{+\infty} \big( (1 - y)\log y \big) \exp(-y)\,dy = -1.

For a ∈ R, we define [a]_+ = max(a, 0), [a]_− = max(−a, 0), and recall that |a| = [a]_+ + [a]_−. To prove that \widehat{m}_n is a consistent estimator of m, we study [\widehat{m}_n − m]_− and [\widehat{m}_n − m]_+ separately.

(b) Study of [\widehat{m}_n − m]_−.
    i. Check that ϕ'(m; x_n) ≤ −1/m² for all m > 0.
    ii. If m_n(x_n) ≤ m, show that m − m_n(x_n) ≤ m²|ϕ(m; x_n)|.
    iii. Deduce that [\widehat{m}_n − m]_− converges almost surely to 0.
(c) Study of [\widehat{m}_n − m]_+.
    i. Let ε > 0. Show that if m_n(x_n) − m ≥ ε, then

           0 \le \frac{1}{m+\varepsilon} + \frac{1}{n}\sum_{i=1}^n (1 - x_i^m)\log x_i.

    ii. Deduce that [\widehat{m}_n − m]_+ converges in probability to 0.
(d) What do you conclude? ◦
Exercise 2.A.7 (An example of a free statistic). In the Exponential model, show that, for any k ∈ {1, . . . , n},

    \eta_{k,n} = \frac{X_1 + \cdots + X_k}{X_1 + \cdots + X_n}

is a free statistic. ◦
1 Exercise 2.A.8 (Translation of Cauchy distributions). For all θ ∈ R, we denote by Pθ the
probability measure with density
    p(x; \theta) = \frac{1}{\pi\left( (x-\theta)^2 + 1 \right)}.

We furthermore recall that

    \forall x \in \mathbb{R}, \quad \int_{y=-\infty}^{x} \frac{dy}{y^2 + 1} = \arctan(x) + \frac{\pi}{2}.

1. What is the name of Pθ when θ = 0?

2. Let X1 , . . . , Xn iid random variables with law Pθ . For all i ∈ {1, . . . , n}, we define Ui =
1{Xi ≤0} . Compute Eθ [U1 ].

3. Deduce a moment estimator θen of θ. Show that this estimator is strongly consistent.

4. Show that θen is asymptotically normal, and compute its asymptotic variance. Hint: use the
relation tan′ = 1 + tan2 .

5. For α ∈ (0, 1), deduce an asymptotic confidence interval for θ, with level 1 − α. ◦

1 Exercise 2.A.9 (Variance stabilisation). Let Zn be a consistent estimator of g(θ), asymptoti-


cally normal with asymptotic variance V (θ). We assume that a consistent estimator Vbn of V (θ) is
available. The level α is fixed throughout the exercise.

1. Recall the width of the asymptotic confidence interval In for g(θ) given by Proposition 2.4.18.

2. Let Φ : g(Θ) → R be a strictly monotonic and C 1 function. Show that Φ(Zn ) is a consistent
and asymptotically normal estimator of Φ(g(θ)), and compute its asymptotic variance.

3. Using Proposition 2.4.18, construct an asymptotic confidence interval for Φ(g(θ)). Using
the fact that Φ is strictly monotonic, deduce an asymptotic confidence interval InΦ for g(θ).

4. Express the width of InΦ in terms of the function ϕn defined by


 
    \varphi_n(s) = \Phi^{-1}\big( \Phi(Z_n) + s\,\Phi'(Z_n) \big) - \Phi^{-1}\big( \Phi(Z_n) - s\,\Phi'(Z_n) \big),

and compute ϕn (0) and ϕ′n (0). If the function ϕn is either convex or concave, which of the
intervals In and InΦ is the smallest?
5. Assume that Φ satisfies the relation Φ'(g(θ)) = 1/\sqrt{V(\theta)}. What is the expression for the corresponding confidence interval I_n^Φ? Such a choice of Φ is called variance stabilisation.

6. Consider the MLE \widehat{\lambda}_n = 1/\overline{X}_n of λ in the Exponential model. We recall from Section 2.1.4 that this estimator is asymptotically normal, with asymptotic variance V(λ) = λ². Find a function Φ such that Φ'(λ) = 1/\sqrt{V(\lambda)}, and compute the associated asymptotic confidence interval for λ. Does variance stabilisation reduce the width of the confidence interval? ◦

2.B Summary
2.B.1 Basic definitions
• Parametric model: P = {Pθ , θ ∈ Θ}, with parameter set Θ ⊂ Rq .

• Sample: Xn = (X1 , . . . , Xn ) ∈ Xn vector of iid random variables with law Pθ under Pθ .

• Estimator of g(θ): function of the sample Zn = zn (Xn ), where zn : Xn → g(Θ) does not
depend on θ.

2.B.2 Quality criteria for estimators


Nonasymptotic criteria:

• Bias b(Zn ; θ) = Eθ [Zn ] − g(θ).

• MSE R(Z_n; θ) = E_θ[‖Z_n − g(θ)‖²] = ‖b(Z_n; θ)‖² + Var_θ[Z_n].

• FDCR bound: for any unbiased estimator Z_n, Var_θ[Z_n] ≥ ⟨∇g(θ), I^{-1}(θ)∇g(θ)⟩/n.
I(θ) is the Fisher information.
An unbiased estimator reaching this bound is efficient.

Asymptotic criteria:

• Consistency: Zn → g(θ) in probability


can be checked by weak Law of Large Numbers.

• Strong consistency: Zn → g(θ) almost surely


can be checked by strong Law of Large Numbers.

• Asymptotic normality: √n(Z_n − g(θ)) → N_d(0, K(θ)) in distribution, where K(θ) is the asymptotic variance
  can be checked by Central Limit Theorem and Delta Method.

2.B.3 Construction of estimators


• Moment estimators: strongly consistent, asymptotically normal.

• Maximum Likelihood Estimator: strongly consistent, and


regular model: asymptotically normal, variance I −1 (θ), asymptotically efficient;

nonregular model: rate of convergence can be different from √n.

• Rao–Blackwell Theorem, based on sufficient statistics, allows to improve the MSE by con-
ditioning.

2.B.4 Confidence intervals


Interval In whose bounds are statistics, and such that Pθ (g(θ) ∈ In ) ≃ 1 − α.

• Exact confidence interval: Pθ (g(θ) ∈ In ) = 1 − α


method of pivotal functions.

• Approximate confidence interval: Pθ (g(θ) ∈ In ) ≥ 1 − α


concentration inequalities.

• Asymptotic confidence interval: Pθ (g(θ) ∈ In ) → 1 − α


asymptotically normal estimators.

2.B.5 Kernel Density Estimation


Nonparametric estimation: we assume that X1 , . . . , Xn are iid according to a density p and wish
to construct an estimator of p.
• Parzen–Rosenblatt estimator: \widehat{p}_{n,h}(x) = \frac{1}{nh}\sum_{i=1}^n K\left(\frac{x - X_i}{h}\right), where K is a kernel.

• h is the bandwidth.
too small: undersmoothing, large variance;
too large: oversmoothing, large bias.

• Optimal choice for h of order n−1/5 , leads to MSE of order n−4/5 .


Chapter 3

Hypothesis testing

Introduction: the Lady Tasting Tea experiment1


In 1919, Ronald Fisher worked as a statistician at Rothamsted Research, an agricultural research
institute in the UK. He once proposed a cup of tea to his colleague Muriel Bristol, who was a
phycologist. She declined and argued that she preferred having the milk poured into the cup
before the tea. Fisher did not believe that whether the milk was poured before or after the tea
could affect the flavour, and thus designed the following experiment: he prepared eight cups, four
with the milk poured first and four with the tea poured first, and had Bristol blindly taste the eight
cups. She was able to decide correctly in which order the milk and the tea had been poured for all
cups.
Fisher then developed the following argument. Assume that the lady is not actually able to tell
whether the milk or the tea have been poured first, and thus answers at random. Fisher called this
assumption the null hypothesis, because it assumes the absence of the effect that the experiment
is trying to evidence. Since Bristol was aware that four cups of each type had been prepared, the
probability under the null hypothesis that she give all correct answers is
    p = \frac{1}{\binom{8}{4}} = \frac{1}{70} \simeq 1.4\%,

which is called the p-value. Since this probability is small, Fisher declared that the result of
the experiment was too unlikely under the null hypothesis, and thus rejected the latter, hence
concluding that Bristol was actually able to tell whether the milk or the tea was poured first.
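As a side note, this probability can be checked in one line of R (illustrative):

    1 / choose(8, 4)   # 0.01428571, i.e. about 1.4%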
Fisher’s argument is considered as one of the first attempts to formalise the design and statisti-
cal analysis of scientific experiments. It is the basis of the theory of hypothesis testing, which was
then developed in particular by Neyman and Pearson in the 1930’s. In this chapter, we present this
theory in the framework of parametric estimation as introduced in Chapter 2. Hence, throughout
the chapter, a parametric model P = {Pθ , θ ∈ Θ} is fixed, with Θ ⊂ Rq . The state space for the
sample Xn = (X1 , . . . , Xn ) is denoted by Xn .

3.1 General formalism


3.1.1 Null and alternative hypotheses, test
Let H0 , H1 be a partition of the set of parameters Θ into two subsets:
1
This experiment is reported in Fisher’s book The design of experiments (1935).

• H0 is the null hypothesis,

• H1 is the alternative hypothesis.

An hypothesis is called simple if it contains a single element, otherwise it is called composite.

Definition 3.1.1 (Test). A test of H0 against H1 is a decision rule determining, given an observa-
tion xn ∈ Xn , whether θ ∈ H0 or θ ∈ H1 .

In other words, a test is a function Xn → {H0 , H1 }. It is characterised by its region of


rejection, or critical region.

Definition 3.1.2 (Region of rejection). The region of rejection of a test is the set Wn of realisations
xn ∈ Xn for which H0 is rejected.

Example 3.1.3 (Bernoulli model). A sequence of coin flips is modelled by Bernoulli random vari-
ables with parameter p ∈ [0, 1]. The experimenter wants to know whether the coin is biased or
not. She sets:

• H0 = {p = 1/2}: the coin is not biased,

• H1 = {p ≠ 1/2}: the coin is biased.

Notice that the null hypothesis is simple, while the alternative hypothesis is composite.
For a sufficiently large sample size n, the Law of Large Numbers asserts that X n is close to
p. As a consequence, an intuitive test consists in rejecting H0 as soon as X n is ‘far enough’ from
1/2. Formally, this amounts to taking

Wn = {xn ∈ {0, 1}n : |xn − 1/2| ≥ a},

for some a ∈ (0, 1/2), which has to be determined in order to control the risk of taking a wrong
decision — this notion will be made precise below.

Remark 3.1.4. In Example 3.1.3, the null hypothesis is that under which there is no bias. This is
a general fact: when one wants to check whether a certain effect is present in the data, one defines
the null hypothesis as the absence of this effect. ◦

3.1.2 Type I and type II errors, level and statistical power


Since the sample xn is the realisation of a random variable Xn , it may happen that the test returns
an incorrect result. Two types of errors are distinguished. We recall that Wn is the region of
rejection defined in Definition 3.1.2.

Definition 3.1.5 (Type I and type II errors). A type I error2 is the incorrect rejection of H0 . It is
measured by the type I risk θ ∈ H0 7→ Pθ (Xn ∈ Wn ).
A type II error3 is the incorrect acceptance of H0 . It is measured by the type II risk θ ∈ H1 7→
Pθ (Xn 6∈ Wn ).

Remark 3.1.6. With the interpretation that the null hypothesis is the absence of the effect that one
wants to identify, the type I error corresponds to a false positive, as the test incorrectly concludes
to the presence of the effect. On the contrary, the type II error corresponds to a false negative. ◦
2
Erreur de première espèce en français.
3
Erreur de seconde espèce en français.

In Example 3.1.3, taking a very small leads the test to accept H0 only when X n is very close to
1/2, and therefore increases the type I risk (the test is called conservative). On the contrary, taking
a close to 1/2 makes the test accept H0 for many values of the sample set, and thus increases the
type II risk. This example shows that one cannot minimise both risks simultaneously. In order to
select a, the Neyman–Pearson approach consists in:
(i) fix a level α ∈ (0, 1), usually 1%, 5% or 10%;
(ii) define the rejection region which minimises the type II risk under the constraint that the type
I risk be lower than α.
Definition 3.1.7 (Level and statistical power). The level, or size of a test is
α = sup Pθ (Xn ∈ Wn ).
θ∈H0

The statistical power of a test is the function


θ ∈ H1 7→ Pθ (Xn ∈ Wn ) = 1 − type II error.
Remark 3.1.8. This procedure induces a dissymmetry between type I and type II errors: fixing a
level first shows that, in the experimenter’s eyes, it is more important to control the type I error
than the type II error. This dissymmetry must be taken into account when the null and alternative
hypotheses are defined. For instance, if one looks for the presence of Higgs’ boson at CERN,
defining H0 as the absence of the particle allows to control the risk of being wrong when claiming
that the particle has been discovered, at the price of accepting a possibly large type II error, that is
to say not being able to claim that the particle has been detected with sufficient significance. On
the contrary, if one constructs a medical test aimed at detecting a disease, it may be preferable to
define H0 as the presence of the disease, in order to avoid not detecting the disease as a priority. ◦
To compute the type I and type II risks in the Bernoulli model of Example 3.1.3, we use the approximation

    \overline{X}_n \simeq p + \sqrt{\frac{p(1-p)}{n}}\, G, \qquad G \sim N(0, 1),

based on the Central Limit Theorem. Then the type I risk is

    P_{1/2}(X_n \in W_n) \simeq P(|G| \ge 2a\sqrt{n}),

so that the level of the test is α if and only if a is chosen so that 2\sqrt{n}\,a = \phi_{1-\alpha/2}, with φ_r the quantile of order r of the standard Gaussian distribution. With this choice, the statistical power of the test is plotted for several values of n on Figure 3.1. Obviously, its minimum value is α because the function p ↦ P_p(X_n ∈ W_n) is continuous on [0, 1] and the test is customised for this function to take the value α at p = 1/2. However, for any p ≠ 1/2, that is to say for any p ∈ H_1, the statistical power increases with n.
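Under this Gaussian approximation, the threshold a and the power curves of Figure 3.1 can be computed explicitly; a possible R sketch (illustrative, with hypothetical sample sizes):

    # Power of the two-sided test |xbar - 1/2| >= a in the Bernoulli model,
    # under the Gaussian approximation of the empirical mean
    alpha <- 0.05
    power <- function(p, n) {
      a <- qnorm(1 - alpha/2) / (2 * sqrt(n))
      s <- sqrt(p * (1 - p) / n)
      pnorm((0.5 - a - p) / s) + 1 - pnorm((0.5 + a - p) / s)
    }
    curve(power(x, n = 10), from = 0, to = 1, xlab = "p", ylab = "power")
    for (n in c(20, 30, 40, 50)) curve(power(x, n), add = TRUE)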
That the test behaves better when the size of the sample increases is a natural requirement, and
motivates the following definition.
Definition 3.1.9 (Asymptotic properties). A test with rejection region Wn is consistent if
∀θ ∈ H1 , lim Pθ (Xn ∈ Wn ) = 1.
n→+∞

The asymptotic level of a test is4


α = lim sup sup Pθ (Xn ∈ Wn ).
n→+∞ θ∈H0
4
We recall that the limsup and liminf of a sequence (an )n≥1 are defined by lim supn→+∞ an =
limn→+∞ supk≥n ak and lim inf n→+∞ an = limn→+∞ inf k≥n ak .

Figure 3.1: Statistical power of the test of level α = 5% in the Bernoulli model, for n = 10, 20, 30, 40, 50. The larger n, the more peaked the curve.

Exercise 3.1.10. Show that, in the Bernoulli model of Example 3.1.3, the test with rejection region W_n = {x_n ∈ {0, 1}^n : |\bar{x}_n − 1/2| ≥ φ_{1−α/2}/(2\sqrt{n})} is consistent. ◦

Remark 3.1.11. Tests with a poor statistical power face the risk of returning a type II error with a
large probability. As Figure 3.1 shows, increasing the size of the sample allows to reduce this risk.
A standard value for the acceptable power of a test is 80%. Yet, Figure 3.1 also shows that this
value can generally not be reached uniformly over H1 . For hypotheses of the form H0 = {θ =
θ0}, H1 = {θ ≠ θ0}, a possible approach consists in fixing a power level ρ (say ρ = 0.8) and a
threshold δ > 0, and looking for n such that

∀θ ∈ H1δ = {|θ − θ0 | ≥ δ}, Pθ (Xn ∈ Wn ) ≥ ρ. ◦

3.1.3 p-value
In Example 3.1.3, the null hypothesis H0 = {1/2} is simple. One may also consider a composite hypothesis H0 = [0, 1/2], in which case H1 = (1/2, 1] and a natural rejection region has the form

Wn = {xn ∈ {0, 1}n : xn − 1/2 ≥ a},

with a ∈ (0, 1/2) to be selected.



Exercise 3.1.12. Show that the choice a = φ_{1−α}/(2\sqrt{n}) yields a consistent test with asymptotic level α. ◦

In both cases (simple and composite null hypothesis), the rejection region has the generic form

Wn = {xn ∈ Xn : ζn (xn ) ≥ a}, ζn : Xn → R. (∗)



In the first case, ζn (xn ) = |xn − 1/2| and the test is called two-sided5 because both the events
xn − 1/2 ≤ −a and xn − 1/2 ≥ a must be taken into account to compute the type I error; while
in the second case, ζn (xn ) = xn − 1/2 and the test is called one-sided6 .
When the rejection region of a test has the form (∗), the random variable ζn (Xn ) is called the
test statistic.
Definition 3.1.13 (p-value). Consider a test with rejection region Wn of the form (∗). For all
xn ∈ Xn , the p-value of an observation xn is
p-value = sup Pθ (ζn (Xn ) ≥ ζn (xn ));
θ∈H0

in other words, it is the probability, under H0 , that the test statistic takes values more unfavourable
for the acceptance of H0 than the observed value in the data.
In the (two-sided) test for the Bernoulli model of Example 3.1.3, under H0 the empirical mean
X n should take values concentrated around 1/2. However, due to the randomness of the sample, it
may happen that X n takes values which are far from 1/2. For a given value xn of the test statistic,
the p-value assesses how likely it is that this value be due to random fluctuations of the sample: the
smaller the p-value, the more unlikely the realisation xn under H0 , and the more the experimenter
is encouraged to reject H0 . Indeed, it is easily checked that a test with determined level α rejects
H0 if and only if the p-value of the observation is smaller than α. As a consequence, the p-value
indicates all levels at which H0 will be rejected.
Example 3.1.14. In the Bernoulli model considered in Example 3.1.3, we give the p-values of the observation \bar{x}_n = 0.6 for different values of n in Table 3.1. For small values of n, the approximation \overline{X}_n ≃ 1/2 + G/(2\sqrt{n}) under H_0 is not valid and we rather used exact computations with the Binomial distribution.

    n          1    10     100      1000
    p-value    1    0.75   0.046    2.5 × 10⁻¹⁰

Table 3.1: p-values of the observation \bar{x}_n = 0.6 in the Bernoulli model, for various values of n.
At the level α = 5%, H0 is rejected for n = 100, n = 1000 but accepted for n = 1, n = 10.
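For moderate to large n, the approximate p-value based on the Gaussian approximation can be computed directly in R; a short illustrative sketch (for small n, the exact Binomial computation should be preferred):

    # Approximate p-value of the two-sided test, Gaussian approximation of the mean
    pvalue_approx <- function(xbar, n) 2 * (1 - pnorm(2 * sqrt(n) * abs(xbar - 0.5)))
    pvalue_approx(0.6, 100)    # about 0.046
    pvalue_approx(0.6, 1000)   # about 2.5e-10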

Exercise 3.1.15. Assume that H0 = {θ0 } is a simple hypothesis, and that under Pθ0 , the test
statistic ζn (Xn ) has a continuous cumulative distribution function. Show that under H0 , the p-
value is a statistic uniformly distributed on [0, 1]. ◦
In general, to compute the exact p-value of an observation, it is necessary to compute the value of the cumulative distribution function of ζ_n(X_n) at the point ζ_n(x_n). This is not always possible analytically, and one may resort to scientific computing software, Monte Carlo simulations, or statistical tables, in which case only lower and upper bounds on the p-value might be available.
Remark 3.1.16 (The ‘5σ-rule’). Levels of 1%, 5% or 10% are considered standard in many fields
of applications, such as biology, medicine, and social sciences. In particle physics, the standard
rule, called the 5σ-rule, is much more conservative: the null hypothesis, namely the nonexistence
of the sought particle, is usually rejected if the observed value of the test statistic is larger than 5
times its standard deviation. For a one-sided test with a Gaussian test statistic, this rule leads to rejecting H0 if the p-value is lower than P(G ≥ 5) ≃ 3 × 10⁻⁷. ◦
5
Bilatéral en français.
6
Unilatéral en français.

3.2 General construction of a test


3.2.1 General procedure to construct a test
Classically, the construction of a test is made through the following steps.
(i) Specify the model P = {Pθ , θ ∈ Θ}.
(ii) Define the null hypothesis H0 and the alternative hypothesis H1 .
Do not forget that H0 and H1 do not play symmetric roles!
H0 must be chosen so that type I errors be avoided as a priority.
(iii) Find a test statistic ζn (Xn ) which typically takes larger values under H1 than under H0 .
(iv) Define the rejection region Wn = {ζn (Xn ) ≥ a} and select the value of a which maximises
the power of the test, under the constraint that the level remain below α.
This step requires to know the law of ζn (Xn ) under H0 .
If only the asymptotic distribution of ζn (Xn ) under H0 is known, asymptotic tests can
be considered.
(v) Observe the data xn and reject H0 if xn ∈ Wn .
An alternative to the last two steps, sometimes presented as the Anglo-Saxon approach, con-
sists in returning the p-value of the observation xn rather than the binary information ‘accept H0 ’
or ‘reject H0 ’. This approach is standard in most experimental sciences, as it is more quantitative
(see for instance [4, Chapter 1]).
The crucial point in the design of a test is the choice of the test statistic ζn (Xn ). Let us assume
that the null and alternative hypothesis write
H0 = {g(θ) ≤ g0 }, H1 = {g(θ) > g0 },
for some function g : Θ → R and g0 ∈ R, so that the test is one-sided. Let Zn be an estimator
of g(θ). If either the bias of Zn is small, or Zn is consistent, then Zn should be close to g(θ)
(possibly in the asymptotic regime n → +∞), so that it should take larger values under H1 that
under H0 . Therefore one may consider a rejection region of the form Wn = {Zn ≥ a}, namely
take ζn (Xn ) = Zn − g0 as the test statistic. If the test is two-sided and the null and alternative
hypothesis write
H0 = {g(θ) = g0}, H1 = {g(θ) ≠ g0},
then for the same reasons, one may take a rejection region of the form Wn = {|Zn − g0 | ≥ a},
which amounts to choosing ζn (Xn ) = |Zn − g0 | as the test statistic.
Sections 3.2.2 and 3.2.3 provide more general rules that automatically provide test statistics.
Example 3.2.1 (The Exponential model). A smartphone manufacturer claims that the average
lifespan of its products is at least of 3 years. A consumer organisation carries out a study over
1000 devices and observes an average lifespan of 2.8 years. At which level can it be concluded
that the manufacturer is lying?
We answer this question by following the guideline described above.
(i) In order to keep the computations simple, we assume that the lifespan of a device is expo-
nentially distributed, and take as a model
P = {E(λ), λ > 0}.
Recall that Eλ [X1 ] = 1/λ, so that the manufacturer’s claim rewrites λ ≤ λ0 , with λ0 =
1/3.

(ii) To respect the presumption of innocence, we define the null hypothesis as that under which
the manufacturer does not lie, and therefore set

H0 = {λ ≤ λ0 }, H1 = {λ > λ0 }.

(iii) Since the typical lifespan of a smartphone is larger under H0 than under H1 , we take a
rejection region of the form

Wn = {xn ∈ [0, +∞)n : xn ≤ a}.

(iv) Let us compute the type I error. For all λ ≤ λ0 , for all a ≥ 0,

Pλ (Xn ∈ Wn ) = Pλ (X n ≤ a) = P(Sn /λ ≤ a) = P(Sn ≤ λa),

where the random variable Sn = λX n is free in the sense of Definition 2.4.3, with law
Γ(n, n). As a consequence, the type I error λ 7→ Pλ (Xn ∈ Wn ) is nondecreasing, so that

sup Pλ (Xn ∈ Wn ) = Pλ0 (Xn ∈ Wn ) = P(Sn ≤ λ0 a).


λ≤λ0

Denoting by γ_{n,r} the quantile of order r of the law Γ(n, n), we deduce that the level of the test is lower than α if and only if

        a \le \frac{\gamma_{n,\alpha}}{\lambda_0}.
For all λ ∈ H1 , the power of the test now writes

Pλ (Xn ∈ Wn ) = P(Sn ≤ λa),

which is maximal for the largest allowed value of a. As a conclusion, we finally take the
rejection region
Wn = {xn ∈ [0, +∞)n : xn ≤ γn,α /λ0 }.
With the values n = 1000, α = 0.05 and λ0 = 1/3, we get γn,α /λ0 = 2.85.

(v) Since the observed value xn = 2.8 is lower than 2.85, we are in the rejection region and H0
is rejected at the level 5%. Alternatively, the p-value of the observation is equal to 0.016,
which shows that at the level 1%, H0 would not be rejected.
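The numerical values quoted in steps (iv)–(v) can be recovered in R, using the fact that S_n ∼ Γ(n, n) (illustrative sketch):

    # Example 3.2.1: rejection threshold and p-value in the Exponential model
    n <- 1000; alpha <- 0.05; lambda0 <- 1/3; xbar <- 2.8
    qgamma(alpha, shape = n, rate = n) / lambda0   # rejection threshold, about 2.85
    pgamma(lambda0 * xbar, shape = n, rate = n)    # p-value, about 0.016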
Exercise 3.2.2. Using the Central Limit Theorem for X n in place of the free random variable Sn
in Example 3.2.1, construct a consistent test with asymptotic level α. Compare the p-value for this
asymptotic test with the p-value found above. ◦

3.2.2 Duality with confidence intervals


The use of a free random variable in the construction of the test detailed in Example 3.2.1 is
reminiscent of the method of the pivotal function to construct confidence intervals, presented in
Section 2.4.2. The next result shows that the construction of tests and confidence intervals relies
on the same computations. In its statement, we define a confidence region Cn = cn (Xn ) for g(θ)
with level 1 − α as a subset of g(Θ) which is described by statistics and such that

∀θ ∈ Θ, Pθ (g(θ) ∈ Cn ) = 1 − α.

With this definition, a confidence interval is a confidence region which is an interval. An approxi-
mate confidence region is a subset such that Pθ (g(θ) ∈ Cn ) ≥ 1 − α.

Proposition 3.2.3 (Duality between tests and confidence intervals). Let Cn = cn (Xn ) be a confi-
dence region for g(θ) with level 1 − α. For all g0 ∈ g(Θ), the test with rejection region

Wn (g0 ) = {xn ∈ Xn : g0 6∈ cn (xn )}

has level α for the hypotheses

H0 = {g(θ) = g0}, H1 = {g(θ) ≠ g0}.

Reciprocally, assume that for all g0 ∈ g(Θ), a test with level α and rejection region Wn (g0 )
is available for the hypotheses H0 and H1 defined above. The random region

Cn = {g ∈ g(Θ) : Xn 6∈ Wn (g)}

is an approximate confidence region for g(θ) with level 1 − α.


The proof of Proposition 3.2.3 is straightforward and left as an exercise. It can also be stated
for asymptotic confidence intervals and asymptotic tests, as illustrated in Exercise 3.2.2.

3.2.3 * The likelihood ratio test


We now present a systematic method to derive a test statistic with a certain optimality property de-
scribed below. We restrict ourselves to the case where both the null and the alternative hypotheses
are simple: Θ = {θ0 , θ1 } and H0 = {θ = θ0 }, H1 = {θ = θ1 }.
Example 3.2.4 (Detection of an infected email address). A given email account usually sends
a number of messages per day distributed according to a Poisson distribution of parameter λ0 .
When the email account is infected by a virus, the parameter of the Poisson distribution becomes
λ1 > λ0 . After an epidemic event, we want to check whether an account has been infected and
observe the number of messages X1 , . . . , Xn it sends over a period of n days.
For this example, the intuitive approach would be to reject H0 when X n ≥ a, with a chosen
to ensure that the test reaches a certain level. We now describe a generic approach.
Definition 3.2.5 (Likelihood ratio). With simple hypotheses H_0 = {θ = θ_0} and H_1 = {θ = θ_1}, the likelihood ratio is the statistic ζ_n^{LR}(X_n), where

    \zeta_n^{LR}(x_n) = \frac{L_n(x_n; \theta_1)}{L_n(x_n; \theta_0)}.
By the definition of the likelihood of a realisation, the likelihood ratio typically takes larger
values under H1 than under H0 , which leads to define a test with rejection region

WnLR = {ζnLR (Xn ) ≥ a}.

This test is called the likelihood ratio test.


Example 3.2.6 (Poisson model). In the problem of Example 3.2.4, the likelihood ratio writes

    \zeta_n^{LR}(x_n) = \frac{\prod_{i=1}^n \exp(-\lambda_1)\lambda_1^{x_i}/x_i!}{\prod_{i=1}^n \exp(-\lambda_0)\lambda_0^{x_i}/x_i!} = \exp(-n(\lambda_1 - \lambda_0)) \left( \frac{\lambda_1}{\lambda_0} \right)^{n\bar{x}_n}.

For all a ≥ 0,

    \zeta_n^{LR}(x_n) \ge a \quad \text{if and only if} \quad \bar{x}_n \ge \frac{n(\lambda_1 - \lambda_0) + \log a}{n(\log \lambda_1 - \log \lambda_0)},

which shows that the rejection region actually takes the form \bar{x}_n ≥ a', so that the likelihood ratio test coincides with the intuitive approach mentioned above. Since n\overline{X}_n ∼ P(nλ_0) under P_{λ_0}, we deduce that the level of the test is lower than α if and only if

    \frac{n(\lambda_1 - \lambda_0) + \log a}{\log \lambda_1 - \log \lambda_0} \ge q_{n\lambda_0, 1-\alpha},

where q_{λ,r} is the quantile of order r of the Poisson distribution with parameter λ.
Besides providing a test statistic automatically, the likelihood ratio test has the following opti-
mality property.
Proposition 3.2.7 (Neyman–Pearson Lemma). Among all tests of level α for the simple hypotheses
H0 = {θ = θ0 } and H1 = {θ = θ1 }, the likelihood ratio test is the most powerful.
The likelihood ratio test is said to be Uniformly Most Powerful (UMP)⁷.

Proof. Let W_n be a subset of X_n such that P_{θ_0}(X_n ∈ W_n) ≤ α. We want to show that

    P_{\theta_1}(X_n \in W_n) \le P_{\theta_1}(X_n \in W_n^{LR}).

To this aim, we first write

    P_{\theta_1}(X_n \in W_n^{LR}) - P_{\theta_1}(X_n \in W_n)
      = \int_{x_n \in W_n^{LR}} L_n(x_n; \theta_1)\,dx_n - \int_{x_n \in W_n} L_n(x_n; \theta_1)\,dx_n
      = \int_{x_n \in W_n^{LR} \setminus W_n} L_n(x_n; \theta_1)\,dx_n - \int_{x_n \in W_n \setminus W_n^{LR}} L_n(x_n; \theta_1)\,dx_n.

Since

    x_n \in W_n^{LR} \quad \text{if and only if} \quad \zeta_n^{LR}(x_n) = \frac{L_n(x_n; \theta_1)}{L_n(x_n; \theta_0)} \ge a,

we get

    \int_{x_n \in W_n^{LR} \setminus W_n} L_n(x_n; \theta_1)\,dx_n \ge a \int_{x_n \in W_n^{LR} \setminus W_n} L_n(x_n; \theta_0)\,dx_n,
    \int_{x_n \in W_n \setminus W_n^{LR}} L_n(x_n; \theta_1)\,dx_n \le a \int_{x_n \in W_n \setminus W_n^{LR}} L_n(x_n; \theta_0)\,dx_n,

so that

    P_{\theta_1}(X_n \in W_n^{LR}) - P_{\theta_1}(X_n \in W_n)
      \ge a \left( \int_{x_n \in W_n^{LR} \setminus W_n} L_n(x_n; \theta_0)\,dx_n - \int_{x_n \in W_n \setminus W_n^{LR}} L_n(x_n; \theta_0)\,dx_n \right)
      = a \left( \int_{x_n \in W_n^{LR}} L_n(x_n; \theta_0)\,dx_n - \int_{x_n \in W_n} L_n(x_n; \theta_0)\,dx_n \right)
      = a \left( P_{\theta_0}(X_n \in W_n^{LR}) - P_{\theta_0}(X_n \in W_n) \right).

Since the level of the likelihood ratio test is α while P_{θ_0}(X_n ∈ W_n) ≤ α, we deduce that the right-hand side above is nonnegative, which proves the claimed inequality.
7
Uniformément Plus Puissant (UPP) en français.

In order to implement the construction of the likelihood ratio test, it is necessary to know the
distribution of the test statistic ζnLR (Xn ) under H0 . Exercise 3.A.6 provides an asymptotic answer.
The likelihood ratio test can be extended to composite hypotheses, by defining the likelihood ratio

    \zeta_n^{LR}(x_n) = \frac{\sup_{\theta_1 \in H_1} L_n(x_n; \theta_1)}{\sup_{\theta_0 \in H_0} L_n(x_n; \theta_0)}.

3.3 Examples in the Gaussian model


We refer to Appendix A for a reminder on the definitions on χ2 , Student and Fisher distributions.

Definition 3.3.1 (Z-, t-, F-, and χ2 -tests). A Z-test is a test where the law of the test statistic
under H0 is a Gaussian distribution.
A t-test is a test where the law of the test statistic under H0 is a Student distribution.
A F-test is a test where the law of the test statistic under H0 is a Fisher distribution.
A χ2 -test is a test where the law of the test statistic under H0 is a χ2 distribution.

We consider the Gaussian model

{N(µ, σ 2 ), (µ, σ 2 ) ∈ R × (0, +∞)}

for a sample Xn = (X1 , . . . , Xn ), and detail the construction of various tests for µ and σ 2 .

↸ Exercise 3.3.2 (Test for the mean with known variance). We fix µ0 ∈ R and construct a test for
the hypotheses
H0 = {µ = µ0}, H1 = {µ ≠ µ0},
assuming that the variance σ 2 is known.

1. Under H0 , what is the law of X n ?

2. Deduce the rejection region of a Z-test with level α. ◦

↸ Exercise 3.3.3 (Test for the mean with unknown variance). We fix µ0 ∈ R and construct a test
for the hypotheses
H0 = {µ = µ0}, H1 = {µ ≠ µ0},
assuming that the variance σ 2 is unknown.

1. Why is the test of Exercise 3.3.2 no longer valid?


2. Under H0, what is the law of (\overline{X}_n − µ_0)/\sqrt{S_n^2/n}? Hint: look at Proposition A.4.3.

3. Deduce the rejection region of a t-test with level α. ◦

↸ Exercise 3.3.4 (Test for the variance). We fix σ02 > 0 and construct a test for the hypotheses

H0 = {σ 2 ≤ σ02 }, H1 = {σ 2 > σ02 },

without assuming that the mean is known.

1. Looking at Proposition A.4.3 again, find a free random variable with χ2 distribution.

2. Construct a χ2 -test with level α. ◦



3.4 * Multiple comparisons


3.4.1 The look-elsewhere effect
Assume that one wants to study whether the population in the IMI and SEGF departments of
École des Ponts are homogeneous or not, in terms of academic performance. To this aim, one can
conduct a series of tests, comparing the grades of the two populations over the courses of the first
year. For each of these courses, Student’s test of Section 6.2.2 can be applied (see Example 6.2.9),
yielding a certain p-value. If one denotes by m the total number of tests performed, under the
null hypothesis that there is no difference between the two samples, the p-values p1 , . . . , pm are
m independent realisations of uniform random variables on [0, 1] (see Exercise 3.1.15). For any
level α ∈ (0, 1), the probability that at least one of these p-values be lower than α is thus
 
    P_{H_0}\left( \min_{1 \le k \le m} p_k \le \alpha \right) = 1 - (1 - \alpha)^m,

which becomes larger and larger as m grows. As a consequence, even if there is no difference
between the two populations, the more experiments are conducted, the more likely it is that at
least one of these experiments will return a false positive and conclude to a difference between the
two populations. This is the look-elsewhere effect, which is also illustrated on Figure 3.2 and can
be considered as a central issue in modern statistics, as the large size of available datasets enables
multiple comparisons.

3.4.2 Family-wise error rate and the Bonferroni method


The framework of multiple comparisons can be formalised by considering, for a given choice of
null and alternative hypotheses H0 and H1 , a family of m rejection regions Wn1 , . . . , Wnm , such
that the test with rejection region Wnk has level αk . The sample set is Xn = (X1 , . . . , Xn ), but
each realisation takes its values in a possibly high-dimensional state space X. In the example of
the previous section, the events {Xn ∈ Wnk }, k ∈ {1, . . . , m} were assumed to be independent,
in which case we shall call the tests independent, but this assumption may be relaxed.
For multiple comparisons, the type I error can be measured by the Family-Wise Error Rate.

Definition 3.4.1 (Family-Wise Error Rate). The Family-Wise Error Rate (FWER) of the family of
tests Wn1 , . . . , Wnm is defined by
 
    \mathrm{FWER} = \sup_{\theta \in H_0} P_\theta\left( \exists k \in \{1, . . . , m\} : X_n \in W_n^k \right).

Exercise 3.4.2. If the tests are independent and H0 is simple, compute FWER in terms of the
individual levels α1 , . . . , αm . ◦

The Bonferroni method consists in rejecting H0 at the level α ∈ (0, 1) as soon as at least one
of the p-values p1 , . . . , pm is lower than α/m. It is based on the next result, which does not require
the tests to be independent.

Lemma 3.4.3 (Bonferroni correction). We have

    \mathrm{FWER} \le \sum_{k=1}^{m} \alpha_k.

As a consequence, if one takes αk = α/m for all k ∈ {1, . . . , m}, then FWER is lower than α.

Figure 3.2: An illustration of the look-elsewhere effect. Taken from XKCD by Randall Munroe:
http://www.xkcd.com/882.

Proof. Using the union bound P(\cup_{k=1}^n A_k) \le \sum_{k=1}^n P(A_k), we have, for all θ ∈ H_0,

    P_\theta\left( \exists k \in \{1, . . . , m\} : X_n \in W_n^k \right) \le \sum_{k=1}^m P_\theta\left( X_n \in W_n^k \right) \le \sum_{k=1}^m \alpha_k,

which leads to the expected bound by taking the supremum of the left-hand side over θ ∈ H_0.

The union bound employed in the proof of Lemma 3.4.3 can be very rough, and as a conse-
quence the Bonferroni method has the counterpart of generally increasing the type II risk severely.

3.4.3 False Discovery Rate and the Benjamini–Hochberg procedure


The use of the False Discovery Rate, defined below, is an alternative approach to the control of
the FWER for multiple comparisons. It is concerned with the expected number of false positives
rather than the probability of returning at least one false positive, and therefore provides methods
which have a larger type I risk than FWER-based procedures, but in turn have a better statistical
power.
We extend the framework of the previous sections by assuming that the m tests may have
different null and alternative hypotheses, respectively denoted by H0k , H1k , k ∈ {1, . . . , m}. We
introduce the notation of Table 3.2, where for instance V is the number of tests for which the null
hypothesis is rejected while true, that is to say V is the number of false positives.

                          H_0 is true    H_0 is false    Total
    H_0 is rejected       V              S               R
    H_0 is accepted       U              T               m − R
    Total                 m_0            m − m_0         m

Table 3.2: Notations for multiple comparisons. V, S, R, U, T are random variables, among which only R is a statistic. m_0 is not random but depends on θ.

Definition 3.4.4 (False Discovery Rate). The False Discovery Rate (FDR) is

FDR(θ) = Eθ [V /R],

with the convention that V /R = 0 when V = R = 0.


The Benjamini–Hochberg procedure8 makes it possible to control the FDR. It was published in 1995 and has since become a standard in the analysis of large data sets. It works as follows (a short R sketch is given after the procedure):
(i) Perform the m tests and let p1 , . . . , pm be the associated p-values.

(ii) Denote by p(1) ≤ · · · ≤ p(m) the increasing reordering of the p-values.

(iii) For α ∈ (0, 1), define

J = max{ j ∈ {1, . . . , m} : p(j) ≤ αj/m }.

(iv) Reject H0k for all k such that pk ≤ p(J) .


8
Y. Benjamini and Y. Hochberg, Controlling the False Discovery Rate: A Practical and Powerful Approach to
Multiple Testing, Journal of the Royal Statistical Society. Series B (Methodological), vol. 57(1), 1995, p. 289–300.
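A minimal R sketch of the procedure, with steps (i)–(iv) written out explicitly (the function name is ours); the commented line checks the decisions against R's built-in p.adjust.

bh_reject <- function(pvals, alpha) {
  m  <- length(pvals)
  ps <- sort(pvals)                          # ordered p-values p_(1) <= ... <= p_(m)
  ok <- which(ps <= alpha * (1:m) / m)       # indices j such that p_(j) <= alpha*j/m
  if (length(ok) == 0) return(rep(FALSE, m)) # no discovery
  J  <- max(ok)
  pvals <= ps[J]                             # reject H0^k whenever p_k <= p_(J)
}
# all(bh_reject(pvals, 0.05) == (p.adjust(pvals, method = "BH") <= 0.05))   # sanity check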

Exercise 3.4.5. Show that with this procedure, the number of discoveries is R = J. ◦
Theorem 3.4.6 (Control of the FDR by the Benjamini–Hochberg procedure). Assume that the
tests are independent. With the Benjamini–Hochberg procedure,

∀θ ∈ Θ, FDR(θ) = α m0/m ≤ α.
Proof. Let θ ∈ Θ, let I be the set of indices k for which θ ∈ H0k , and for all k ∈ {1, . . . , m}, let

Vk = 1{H0k is rejected} = 1{pk ≤ p(J)},

so that V = Σ_{k∈I} Vk and

FDR(θ) = Eθ[V/R] = Σ_{k∈I} Eθ[Vk/J] = Σ_{k∈I} Σ_{j=1}^m (1/j) Eθ[Vk 1{J=j}].

For all k ∈ I, let Jk denote the number of discoveries of the procedure if the p-value of the k-th
test is replaced with the value 0. Then it follows from the definition of the Benjamini–Hochberg
procedure that, for all j ∈ {1, . . . , m},

Vk 1{J=j} = Vk 1{Jk =j} = 1{pk ≤ αj/m} 1{Jk =j},

so that

FDR(θ) = Σ_{k∈I} Σ_{j=1}^m (1/j) Eθ[1{pk ≤ αj/m} 1{Jk =j}].

Since Jk only depends on (p1 , . . . , pk−1 , pk+1 , . . . , pm ), it is independent of pk so that

Eθ[1{pk ≤ αj/m} 1{Jk =j}] = Eθ[1{pk ≤ αj/m}] Eθ[1{Jk =j}],

and the fact that k ∈ I implies that under Pθ , the p-value pk is uniformly distributed on [0, 1], therefore

Eθ[1{pk ≤ αj/m}] = αj/m.

We deduce that

FDR(θ) = Σ_{k∈I} Σ_{j=1}^m (1/j)(αj/m) Eθ[1{Jk =j}] = α m0/m,

which completes the proof.

3.A Exercises
↸ Exercise 3.A.1 (Power and sample size). The probability for an individual to be infected by
a virus is denoted by p0 , and assumed to be known. A new vaccine is tested on a sample of
n individuals. We denote by p ∈ [0, p0 ] the probability to be infected after the vaccine (the
probability to be infected cannot be increased by the vaccine). For all i ∈ {1, . . . , n}, we define
Xi = 1 if the i-th individual is infected by the virus after the vaccine and Xi = 0 otherwise, so
that Xi ∼ B(p). We introduce the hypotheses

H0 = {p = p0 }, H1 = {p < p0 }.

It is recommended to use R to perform the numerical computations.



1. Using the approximation of X n by a Gaussian variable, construct a consistent test with asymptotic level α for this model.

2. For p0 = 18%, an experiment carried out on n = 100 individuals yields xn = 16%. At the level α = 5%, what is the conclusion of the experiment regarding the efficiency of the vaccine?

3. It turns out that p = 15%, so that the vaccine actually reduces the number of infected individuals by 1/6. What is the probability that an experiment carried out on n = 100 individuals succeeds in detecting the efficiency of the vaccine at the level α = 5%? What do you think of this result?

4. What should be the minimum size of the sample in order to detect, with probability at least 80%, a reduction of the number of infected people by 1/6? By 1/3? ◦

Exercise 3.A.2 (Nonasymptotic test for the Bernoulli model). The purpose of this exercise is to construct a nonasymptotic test, with level lower than α, for the hypotheses

H0 = {p = p0 }, H1 = {p ≠ p0 }, p0 ∈ [0, 1],

in the Bernoulli model, without resorting to the asymptotic approximation of X n by a Gaussian random variable.

1. Using Hoeffding’s inequality, construct an approximate confidence interval for p (see Sec-
tion 2.4).

2. Deduce a nonasymptotic test from the duality between tests and confidence intervals. ◦

Exercise 3.A.3 (A simple quiz). This exercise is taken from A. Reinhart’s book Statistics Done
Wrong: the Woefully Complete Guide [4], which is a very good reference concerning the practical
applications of hypothesis testing in experimental sciences.
A 2002 study found that an overwhelming majority of statistics students — and instructors
— failed a simple quiz about p-values. Try the quiz (slightly adapted for this book) for yourself
to see how well you understand what the p-value really means. Suppose you are testing two
medications, Fixitol and Solvix. You have two treatment groups, one that takes Fixitol and one
that takes Solvix, and you measure their performance on some standard task (a fitness test, for
instance) afterward. You compare the mean score of each group using a simple significance test,
and you obtain p-value = 0.01, indicating there is a statistically significant difference between
means.
Based on this, decide whether each of the following statements is true or false:

1. You have absolutely disproved the null hypothesis: there is no difference between means.

2. There is a 1% probability that the null hypothesis is true.

3. You have absolutely proved the alternative hypothesis: there is a difference between means.

4. You can deduce the probability that the alternative hypothesis is true.

5. You know, if you decide to reject the null hypothesis, the probability that you are making
the wrong decision.

6. You have a reliable experimental finding, in the sense that if your experiment were repeated
many times, you would obtain a significant result in 99% of trials. ◦

Exercise 3.A.4 (The Student–Wald test). Let Zn be a consistent and asymptotically normal estima-
tor of g(θ) ∈ R, with asymptotic variance V (θ). Assume that a consistent estimator Vbn of the vari-
ance is available. For g0 ∈ g(Θ), construct a consistent test for the hypotheses H0 = {g(θ) = g0 },
H1 = {g(θ) ≠ g0 }, with asymptotic level α. ◦
Exercise 3.A.5 (From the 2015-2016 final exam9 ). Let θ > 0 and α > 0. We consider the probability density

fθ,α (x) = (C(θ, α)/x^{α+3}) 1{x∈[θ,+∞)} ,

and an iid sample X1 , . . . , Xn from this distribution.

1. Compute C(θ, α), E[X] and Var(X) if X has the density fθ,α .

2. We assume that θ is known and α is unknown. We shall admit that

E[log X] = log θ + 1/(α + 2), and Var(log X) = 1/(α + 2)^2 .

(a) Compute the MLE αbn of α and show that it is strongly consistent.

(b) Use E[X] to compute a strongly consistent estimator αen of α by the method of moments.

(c) Show that the estimators αbn and αen are asymptotically normal and compute their asymptotic variance. Which estimator do you prefer?

(d) Let α0 > 0. Construct an asymptotic test for H0 = {α = α0 }, H1 = {α > α0 }, with level 5%, based on the better estimator of α.

3. We now assume that α is known and θ is unknown.

(a) Compute the MLE θbn of θ.

(b) Compute the bias of θbn . Hint: for a nonnegative random variable Y , E[Y ] = ∫_{x=0}^{+∞} P(Y ≥ x) dx.

(c) Compute the cumulative distribution function of n(θbn − θ) and determine its limit. Deduce that n(θbn − θ) converges in distribution and give the law of the limit.

(d) Construct an asymptotic confidence interval for θ, with level 95%. You may look for an interval of the form [θbn /(1 + b/n), θbn ] with b > 0 to be determined. ◦
Exercise 3.A.6 (Asymptotics of the likelihood ratio statistic). We consider simple hypotheses H0 = {θ = θ0 } and H1 = {θ = θ1 }. We recall the Definition 3.2.5 of the likelihood ratio ζnLR (Xn ) and assume that the quantity

h = Eθ1 [φ(L1 (X1 ; θ0 )/L1 (X1 ; θ1 ))], φ(u) = u log u,

where L1 (x1 ; θ) is the likelihood of a sample with a single value, is well-defined.

1. Using Jensen's inequality, show that h ≥ 0. In the sequel, we shall assume that the model is chosen so that h > 0.

2. Show that under H0 , (1/n) log ζnLR (Xn ) converges almost surely to −h.

3. Using the Central Limit Theorem, construct a consistent test with asymptotic level α based on the likelihood ratio. ◦
9
Written by C. Butucea.

3.B Summary
3.B.1 Vocabulary
• Null and alternative hypotheses: partition H0 , H1 of the parameter set Θ.
• Test: procedure accepting or rejecting H0 depending on the value of the data.
• Rejection region: set of values of the sample for which H0 is rejected.
• Type I error: reject H0 while actually θ ∈ H0 , measured by type I risk.
• Type II error: accept H0 while actually θ ∈ H1 , measured by type II risk.
• Level: maximum of type I risk.
• Power: 1 − type II risk. A test is consistent if the power converges to 1 when the size of the
sample goes to +∞.
• p-value: probability under H0 that the test statistic takes worse values for the acceptance of
H0 than the observed value.

Golden rule: the smaller the p-value, the more unlikely the observation under H0 , so that
reject H0 at level α ⇐⇒ p-value ≤ α.

3.B.2 Procedure to construct a test


(i) Specify the model P = {Pθ , θ ∈ Θ}.
(ii) Define the null hypothesis H0 and the alternative hypothesis H1 .
(iii) Find a test statistic ζn (Xn ) which typically takes larger values under H1 than under H0 .
Several possible approaches:
Find a free random variable under H0 based on an estimator of θ.
Rely on the duality with confidence interval, either exact or asymptotic.
Use the likelihood ratio.
(iv) Define the rejection region Wn = {ζn (Xn ) ≥ a} and select the value of a which maximises
the power of the test, under the constraint that the level remain below α.
(v) Observe the data xn and reject H0 if xn ∈ Wn .

3.B.3 Tests in the Gaussian model


• Tests for the mean: Z-test if the variance is known, t-test if the variance is unknown.
• Test for the variance: χ2 -test.

3.B.4 Multiple comparisons


• For multiple tests, correction methods must be applied to avoid the look-elsewhere effect.
• The Bonferroni procedure makes it possible to control the FWER, at the price of a large type II risk.
• The Benjamini–Hochberg procedure has a better statistical power but only controls the FDR.
Chapter 4

Nonparametric tests

Nonparametric statistics refer to the framework where no parametric assumption is made on the
common law P of the iid sample X1 , . . . , Xn which we want to estimate. Theoretically, it is still
possible to employ the formalism of parametric estimation, and look for P within the set
P = {Pθ : θ ∈ Θ},
where the parameter set Θ is the set of all probability measures on the space X in which X1 , . . . , Xn
take their values, and Pθ = θ, but the interest of the dimension reduction of the parametric frame-
work is lost. Thus, specific tools need to be introduced. An overview of such tools for the problem of nonparametric estimation is given in Section 2.5 of Chapter 2. In the present chapter, we address the problem of nonparametric tests.
The basic idea of nonparametric methods consists in approximating the law P of the variables
X1 , . . . , Xn by the empirical distribution Pbn defined by
Pbn (A) = (1/n) Σ_{i=1}^n 1{Xi ∈A} ,

for any (measurable) subset A ⊂ X. Notice that Pbn is a random probability measure, as it depends
on the value of the sample Xn = (X1 , . . . , Xn ).
Exercise 4.0.1 (Warm-up). Show that Pbn (A) converges almost surely to P (A) = P(X1 ∈ A). ◦
In this chapter, we shall focus on two classes of tests.
• Goodness-of-fit tests1 , where a specific probability measure P0 on X is given and the hy-
potheses are
H0 = {P = P0 }, H1 = {P ≠ P0 }.
An example of such a test could be: ‘are the grades of an exam distributed according to
the binomial distribution with parameters N = 20, p = 0.5 (which would mean that the
students have answered the questions at random)?’.
• Goodness-of-fit tests to a family of distributions, where a subset P0 of the set of probability
measures on X (usually, a parametric model) is given, and the hypotheses are
H0 = {P ∈ P0 }, H1 = {P ∉ P0 }.
An example of such a test could be: ‘do the logarithms of the daily variations of the price
of a stock follow a Gaussian distribution?’.
1
Test de conformité, d’ajustement ou d’adéquation en français.

In any case, we shall construct tests by measuring a certain distance between the empirical
distribution Pbn and either P0 or P0 , and reject H0 as soon as this distance is larger than a certain
threshold a. The choice of a ‘good’ notion of distance (in the space of probability measures on X)
is therefore crucial. We shall discuss two particular cases: the case where X is a finite set, and the
case where X = R and P is assumed to have a continuous cumulative distribution function.

4.1 Models with a finite state space: the χ2 test


In this section, we assume that the data X1 , . . . , Xn take their values in a finite state space X with
cardinality m. We recall that in this case, a probability measure P on X is characterised by its
probability mass function (px )x∈X defined by px = P ({x}). We shall often identify P with the
vector of Rm with coefficients (px )x∈X .

4.1.1 χ2 distance and empirical distribution


As is indicated in the introduction, in order to construct a nonparametric test we have to choose
a distance on the space of probability measures on X. It will be convenient to work with the χ2
distance defined below.
Let P and Q be two probability measures on X, with respective probability mass functions
(px )x∈X and (qx )x∈X . We say that P is absolutely continuous with respect to Q, and we write
P ≪ Q, when for any x ∈ X, the condition qx = 0 implies px = 0.

Definition 4.1.1 (χ2 distance). The χ2 distance between P and Q is defined by

χ2 (P |Q) = Σ_{x∈X} (px − qx )^2 /qx if P ≪ Q, and χ2 (P |Q) = +∞ otherwise.

When px = qx = 0, we take the convention that (px − qx )^2 /qx = 0.

Remark 4.1.2. The χ2 distance is not a distance, because it is not symmetric. However, it can be
checked that χ2 (P |Q) = 0 if and only if P = Q, so that χ2 (P |Q) can still be understood as a
measure of how close P and Q are. ◦

The probability mass function of the empirical distribution Pbn of a sample X1 , . . . , Xn iid according to P is the random vector (pbn,x )x∈X defined by

∀x ∈ X, pbn,x = (1/n) Σ_{i=1}^n 1{Xi =x} .

Remark 4.1.3. If there exists x ∈ X such that px = 0, then almost surely, pbn,x = 0 for all n ≥ 1.
Therefore up to removing x from X, we shall always assume that px > 0 for all x ∈ X in the
sequel. ◦

Proposition 4.1.4 (Asymptotic behaviour of Pbn ). Let X1 , . . . , Xn ∈ X be independent random


variables, identically distributed according to P . When n → +∞,

(i) the empirical distribution Pbn converges almost surely to P ,



(ii) √n(Pbn − P ) (seen as a random vector of Rm ) converges in distribution to a Gaussian vector Nm (0, K) with covariance matrix K = (Kx,y )x,y∈X given by

Kx,y = px (1 − px ) if x = y, and Kx,y = −px py if x ≠ y.

Proof. We first notice that

Pbn = (1/n) Σ_{i=1}^n Ri , Ri = (ri,x )x∈X ,

where R1 , . . . , Rn are the iid vectors of Rm such that ri,x = 1{Xi =x} . By the strong Law of Large Numbers, we deduce that

lim_{n→+∞} Pbn = E[R1 ] = P, almost surely.

Now by the multidimensional Central Limit Theorem (see Theorem A.3.1 in Appendix A),

lim_{n→+∞} √n(Pbn − P ) = Nm (0, K), in distribution,

where K is the covariance matrix of R1 . Its coefficients (Kx,y )x,y∈X write

Kx,y = Cov(r1,x , r1,y ) = E[1{X1 =x} 1{X1 =y} ] − E[1{X1 =x} ]E[1{X1 =y} ],

and we have E[1{X1 =x} ]E[1{X1 =y} ] = px py while

E[1{X1 =x} 1{X1 =y} ] = E[1{X1 =x} ] = px if x = y, and 0 if x ≠ y,

which yields the claimed expression for Kx,y .

Corollary 4.1.5 (Asymptotic behaviour of χ2 (Pbn |Q)). Under the assumptions of Proposition 4.1.4,
(i) for any probability measure Q on X, with a probability mass function (qx )x∈X such that
qx > 0 for all x ∈ X, χ2 (Pbn |Q) converges to χ2 (P |Q),
(ii) nχ2 (Pbn |P ) converges in distribution to a random variable with distribution χ2 (m − 1).
Proof. The first point follows from the continuity of P ↦ χ2 (P |Q) and Proposition 4.1.4 (i). In order to prove the second point, let us introduce the diagonal matrix M ∈ Rm×m with diagonal coefficients (1/√px )x∈X , so that

nχ2 (Pbn |P ) = n Σ_{x∈X} ((pbn,x − px )/√px )^2 = ‖M √n(Pbn − P )‖^2 .

By Proposition 4.1.4 (ii), nχ2 (Pbn |P ) converges in distribution to ‖M U ‖^2 , where U ∼ Nm (0, K). Following Proposition A.2.3, M U ∼ Nm (0, Π) with Π = M KM ⊤ . The coefficients (Πx,y )x,y∈X are easy to compute and write

Πx,y = Kx,y /(√px √py ) = 1 − px if x = y, and −√px √py if x ≠ y.

It is now straightforward to check that

Π = Im − ee⊤ , e = (√px )x∈X , ‖e‖ = 1,

so that Π is the orthogonal projection of Rm onto the m − 1 dimensional space e⊥ . Therefore by Proposition A.4.1, ‖M U ‖^2 ∼ χ2 (m − 1).

4.1.2 Goodness-of-fit χ2 test


We now fix a probability measure P0 = (p0,x )x∈X and address the goodness-of-fit test for the
hypotheses
H0 = {P = P0 }, H1 = {P ≠ P0 }.
Following Remark 4.1.3, we assume, up to removing some elements from X, that p0,x > 0 for all
x ∈ X.
Example 4.1.6 (Uniform number generator). In order to assess the quality of a random number
generator, we draw n = 1000 digits at random between 0 and 9 (for instance, letting Xi = ⌊10Ui ⌋
where (Ui )1≤i≤n are independent uniform variables on [0, 1]) and obtain the following results.
Digit 0 1 2 3 4 5 6 7 8 9
Number of occurrences 90 90 100 98 104 107 111 84 102 114
We would like to know whether the distribution is uniform in {0, . . . , 9} or not.
Definition 4.1.7 (Pearson’s statistic). The statistic
X (b
pn,x − p0,x )2
dn = nχ2 (Pbn |P0 ) = n
p0,x
x∈X

is called Pearson’s statistic.


For all ℓ ≥ 1 and r ∈ (0, 1), we denote by χ2ℓ,r the quantile of order r of the χ2 (ℓ) distribution.
The χ2 test is presented in the next proposition.
Proposition 4.1.8 (χ2 goodness-of-fit test). For all α ∈ (0, 1), the test with rejection region

Wn = {dn ≥ χ2m−1,1−α }

is consistent and has asymptotic level α.


Proof. We first check that this test is consistent, and therefore assume that P 6= P0 . Then by
Corollary 4.1.5 (i), χ2 (Pbn |P0 ) converges almost surely to χ2 (P |P0 ) > 0, therefore dn → +∞.
As a consequence, P(Wn ) converges to 1 and the test is consistent.
We now compute the asymptotic level, and therefore take P = P0 . By Corollary 4.1.5 (ii), dn
converges in distribution to a random variable ξ ∼ χ2 (m − 1), so that

lim P(Wn ) = P(ξ ≥ χ2m−1,1−α ) = α,


n→+∞

which completes the proof.
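As an illustration, here is a hedged R sketch of the χ2 goodness-of-fit test on a simulated sample over a finite state space (all names and data below are ours, not those of Example 4.1.6); the built-in chisq.test function returns the same statistic and p-value.

set.seed(1)
m    <- 6
p0   <- rep(1 / m, m)                          # H0: uniform distribution over m classes
x    <- sample(1:m, size = 300, replace = TRUE, prob = p0)
nobs <- tabulate(x, nbins = m)                 # observed counts
phat <- nobs / sum(nobs)                       # empirical distribution
dn   <- sum(nobs) * sum((phat - p0)^2 / p0)    # Pearson's statistic
dn >= qchisq(0.95, df = m - 1)                 # rejection at asymptotic level 5%?
1 - pchisq(dn, df = m - 1)                     # asymptotic p-value
# chisq.test(nobs, p = p0)                     # same statistic and p-value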

Remark 4.1.9 (Validity of the asymptotic approximation). The χ2 test is based on the approxi-
mation of the law of Pearson’s statistic under H0 by the χ2 distribution, which theoretically only
holds when n goes to +∞. In practice, this approximation is considered to be legitimate when the
property
∀x ∈ X, np0,x (1 − p0,x ) ≥ 5
holds. Notice that this property holds if and only if

n (min_{x∈X} p0,x )(1 − min_{x∈X} p0,x ) ≥ 5 and n (max_{x∈X} p0,x )(1 − max_{x∈X} p0,x ) ≥ 5,

so that only 2 computations are necessary to check this property. ◦



Exercise 4.1.10. Apply the χ2 test to answer the question of Example 4.1.6 (with the R command
qchisq(.95,9), one obtains χ29,.95 = 16.9). ◦

Remark 4.1.11 (Extension to infinite state spaces). When the state space X is infinite, the χ2
test cannot be applied as the number of degrees of freedom m − 1 of the χ2 statistic should
be infinite. However, it can be adapted by partitioning the state space X in a finite number of classes Xe = {A1 , . . . , Am } and testing whether the random variables Xe1 , . . . , Xen defined from the sample X1 , . . . , Xn by

Xei = Aj if Xi ∈ Aj

are distributed according to the probability measure Pe0 on Xe defined by

Pe0 (Aj ) = P0 ({x ∈ X : x ∈ Aj }).

The result of the test will of course depend on the choice of the partition. In practice, the latter should be chosen so that under P0 , all classes A1 , . . . , Am have approximately the same probability 1/m; in this case, following Remark 4.1.9, the number m of classes must be chosen such that

n (1/m)(1 − 1/m) ≥ 5.

This procedure is particularly adapted to countably infinite state spaces, in which case P0 remains represented by its probability mass function (p0,x )x∈X and

Pe0 (Aj ) = Σ_{x∈Aj} p0,x .

For continuous probability measures on R, nonparametric tests such as those introduced in Sec-
tion 4.2 should be preferred. ◦

When the condition of Remark 4.1.9 ensuring the validity of the χ2 approximation is not
satisfied because px takes too small values, it is a common practice to aggregate the classes for
which px is small into larger classes, in the spirit of Remark 4.1.11.

4.1.3 χ2 test of goodness-of-fit to a family of distributions


Let P0 be a subset of the set of probability measures on X. We are now interested in the null and
alternative hypotheses
H0 = {P ∈ P0 }, H1 = {P ∉ P0 }.

Example 4.1.12 (Binomial model). In a group of N students, the number of students missing the
i-th session of the course is denoted by Xi , i = 1, . . . , n, where n is the total number of ses-
sions. If each student chooses to skip a class independently of each other, the variables Xi should
follow a binomial distribution B(N, p) where p is unknown. To test whether this independence
property holds (null hypothesis), or whether there is a contagion effect in absenteeism (alternative
hypothesis), one may set P0 = {B(N, p), p ∈ [0, 1]}.

Notice that if P0 = {P0 } is a singleton, then we are in the framework of the previous para-
graph. If this is not the case, then Pearson’s statistic is ill-defined as one does not know a priori
which of the measures P0 ∈ P0 should be compared with Pbn . In order to circumvent this issue,
we shall assume that P0 is a parametric family with low dimension, which writes

P0 = {P0,θ , θ ∈ Θ},

where Θ ⊂ Rq with q < m. Example 4.1.12 fulfills this assumption, with q = 1 (the unknown
parameter p has dimension 1) and m = N + 1 (a B(N, p) variable can take the N + 1 values
0, . . . , N ).

Proposition 4.1.13 (Pearson’s statistic for the goodness-of-fit to a family of distributions). Assume
that Θ has a nonempty interior in Rq , and that the mapping θ 7→ P0,θ is injective2 . Let θbn be a
consistent estimator of θ. Consider the statistic

X (b
pn,x − p0,θb n ,x
)2
d′n = nχ2 (Pbn |P0,θbn ) = n .
p0,θbn ,x
x∈X

Under H0 , d′n converges in distribution to χ2 (m − q − 1), while under H1 , d′n → +∞ almost


surely.

We omit the proof of Proposition 4.1.13 (and refer for example to [6, Section 17.5]) but insist
on the fact that the number of degrees of freedom of the limiting χ2 distribution under H0 is not
the same as in the case of a simple null hypothesis: the larger the dimension q of the parameter
set Θ, the lower the number of degrees of freedom!

Corollary 4.1.14 (χ2 goodness-of-fit test for a family of distributions). The test with rejection
region
Wn = {d′n ≥ χ2m−q−1,1−α }
is consistent and has asymptotic level α.
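A minimal R sketch of this procedure in the binomial setting of Example 4.1.12, on simulated counts (all names are ours); note the m − q − 1 degrees of freedom, and that in practice classes with small probability should be aggregated as in Remark 4.1.9.

set.seed(1)
N    <- 20                                   # number of students
x    <- rbinom(50, size = N, prob = 0.15)    # simulated absence counts over n = 50 sessions
n    <- length(x)
phat <- mean(x) / N                          # MLE of p under H0 (binomial model)
p0   <- dbinom(0:N, size = N, prob = phat)   # fitted probability mass function
pn   <- tabulate(x + 1, nbins = N + 1) / n   # empirical distribution of the values 0, ..., N
dnp  <- n * sum((pn - p0)^2 / p0)            # Pearson's statistic with estimated parameter
m <- N + 1; q <- 1
1 - pchisq(dnp, df = m - q - 1)              # asymptotic p-value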

Exercise 4.1.15 (Continuation of Example 4.1.12). The number of students missing each session
of the course during the year 2017/2018 is reported below, for a group of N = 20 students.

Number of absent students for each of the 12 sessions3 : 3, 3, 0, 4, 3, 0, 1, 1, 0, 1, 0, 2.

Was there a contagion effect in absenteeism? ◦

4.2 Continuous models on the line: the Kolmogorov test


In this section, we assume that the space X in which the data X1 , . . . , Xn take their values is
R. We recall that a probability distribution P on R is characterised by its cumulative distribution
function (CDF) F : R → [0, 1] defined by

∀x ∈ R, F (x) = P ((−∞, x]) = P(X ≤ x)

if X is distributed according to P . CDFs are nondecreasing, right-continuous with left limits4 ,


and satisfy limx→−∞ F (x) = 0, limx→+∞ F (x) = 1.

4.2.1 Asymptotic behaviour of the empirical CDF


Let X1 , . . . , Xn be iid random variables in R, with common CDF F . The empirical CDF of the sample Xn = (X1 , . . . , Xn ) is the random function Fbn : R → [0, 1] defined by

∀x ∈ R, Fbn (x) = (1/n) Σ_{i=1}^n 1{Xi ≤x} .


Figure 4.1: Plot of the empirical CDF Fbn of a sample, superposed with its actual CDF F .

The empirical CDF of a sample Xn is plotted on Figure 4.1, together with the actual CDF F
of the sample.
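In R, the empirical CDF is computed by the built-in ecdf function. The following hedged sketch, on simulated Gaussian data, reproduces a figure of the same kind as Figure 4.1.

set.seed(1)
x  <- rnorm(50)
Fn <- ecdf(x)                    # empirical CDF, a right-continuous step function
plot(Fn, main = "")              # plot of Fn
curve(pnorm(x), add = TRUE)      # actual CDF F superposed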
As is intuitively expected, Fbn approximates F when the size of the sample increases. A first
justification of this approximation is that, for a fixed x ∈ R, the strong Law of Large Numbers
yields
lim_{n→+∞} Fbn (x) = E[1{X1 ≤x} ] = F (x), almost surely. (LLN)

This result is strengthened in the next theorem.

Theorem 4.2.1 (Glivenko–Cantelli Theorem). Let F be a CDF on R and (Xi )i≥1 be a family of
independent random variables with CDF F . We have

lim_{n→+∞} sup_{x∈R} |Fbn (x) − F (x)| = 0, almost surely.

Before starting the proof, the reader should wonder in which sense the statement of the Glivenko–Cantelli Theorem is stronger than the convergence result (LLN) asserted above.

Proof. For any x ∈ R, we denote by Fbn (x− ) and F (x− ) the respective left limits of Fbn and F at
x; since these functions are right continuous, there is no need to introduce a notation for the right
limits. By the strong Law of Large Numbers, for all x ∈ R, in addition to (LLN) we also get

lim_{n→+∞} Fbn (x− ) = lim_{n→+∞} (1/n) Σ_{i=1}^n 1{Xi <x} = E[1{X1 <x} ] = F (x− ), almost surely. (LLN-2)

Let ǫ > 0. Since F is nondecreasing and bounded, there is only a finite number of points x ∈ R such that F (x) − F (x− ) > ǫ. Thus, there exist k ≥ 1 and −∞ = x0 < x1 < · · · < xk = +∞ such that F (x−ℓ ) − F (xℓ−1 ) ≤ ǫ for all ℓ ∈ {1, . . . , k}. Therefore, using the fact that Fbn and F are nondecreasing, for all x ∈ [xℓ−1 , xℓ ), it holds

Fbn (x) ≤ Fbn (x−ℓ ) and F (x) ≥ F (xℓ−1 ) ≥ F (x−ℓ ) − ǫ,

so that

Fbn (x) − F (x) ≤ Fbn (x−ℓ ) − F (x−ℓ ) + ǫ.

With similar arguments, we get

Fbn (xℓ−1 ) − F (xℓ−1 ) − ǫ ≤ Fbn (x) − F (x).

By (LLN) and (LLN-2) for x = x0 , . . . , xk , we deduce that almost surely,

lim sup_{n→+∞} sup_{x∈R} |Fbn (x) − F (x)| ≤ ǫ,

which completes the proof since the left-hand side does not depend on ǫ.

2 Such a model is called identifiable.
3 Do not take this as a challenge to beat...
4 Continu à droite, avec une limite à gauche en français.

It should be clear that the Glivenko–Cantelli Theorem is a result of the same nature as the
strong Law of Large Numbers, stated in a functional space. It is accompanied by a Central Limit
Theorem, which relies on a random process5 (β(t))t∈[0,1] called the Brownian bridge.

Definition 4.2.2 (Brownian bridge). The Brownian bridge is the unique (in law) random process
(β(t))t∈[0,1] such that:

(i) almost surely, the mapping t 7→ β(t) is continuous on [0, 1], and β(0) = β(1) = 0;

(ii) for any d ≥ 1 and t1 , . . . , td ∈ [0, 1], the random vector (β(t1 ), . . . , β(td )) is Gaussian,
centered, with covariance matrix given by Cov(β(ti ), β(tj )) = min{ti , tj } − ti tj .

A proof of why the two conditions in Definition 4.2.2 ensure the existence and uniqueness of
(β(t))t∈[0,1] is beyond the scope of these notes. At a heuristic level, the Brownian bridge must be
understood as ‘the Brownian motion on [0, 1] conditioned to take the value 0 at the points 0 and 1’,
see Figure 4.2 and the course Stochastic Processes and their Applications6 .

Theorem 4.2.3 (Donsker Theorem). Let F be a CDF on R and (Xi )i≥1 be a family of independent
random variables with CDF F . The random function

Gn : x ∈ R ↦ √n (Fbn (x) − F (x))

converges in distribution (in an appropriate functional space) to the random function

G : x ∈ R ↦ β(F (x)),

where β : [0, 1] → R is the Brownian bridge.

Donsker’s Theorem is presented because of its importance in nonparametric statistic, but we


shall not elaborate on its contents (the interested reader may have a look at Exercise 4.A.3). Still,
an important consequence of Donsker’s Theorem is the following asymptotic behaviour of the
supremum of Gn .
5
That is to say, a random variable in the space of functions.
6
http://cermics.enpc.fr/~delmas/Enseig/proba2.html


Figure 4.2: A realisation of the Brownian bridge.

Corollary 4.2.4 (Asymptotic of the supremum). Under the assumptions of Theorem 4.2.3, the random variable

gn = √n sup_{x∈R} |Fbn (x) − F (x)|

converges in distribution to the random variable g∞ = sup_{x∈R} |β(F (x))|. When F is continuous, then

g∞ = sup_{t∈[0,1]} |β(t)|,

and its CDF writes

∀y > 0, P(g∞ ≤ y) = Σ_{k=−∞}^{+∞} (−1)^k exp(−2k^2 y^2 ).

The explicit computation of the CDF of g∞ is due to Kolmogorov7 .

4.2.2 Goodness-of-fit: the asymptotic Kolmogorov test


We now fix a probability measure P0 on R, with CDF F0 , and address the test of the hypotheses

H0 = {F = F0 }, H1 = {F ≠ F0 }.

Definition 4.2.5 (Kolmogorov’s statistic). The Kolmogorov statistic is the statistic




ζn = sup Fbn (x) − F0 (x) .
x∈R

Its asymptotic behaviour is described by the Glivenko–Cantelli and Donsker Theorems, which
provide a natural asymptotic test.
7
A. N. Kolmogorov, Sulla Determinazione Empirica di una Legge di Distribuzione, Giornale dell’Istituto Italiano
degli Attuari, vol. 4, 1933, p. 83–91.

Proposition 4.2.6 (Asymptotic Kolmogorov test). If P0 has a continuous CDF F0 on R, the test with rejection region

Wn = {√n ζn ≥ a},

where a > 0 is defined by the relation

Σ_{k=−∞}^{+∞} (−1)^k exp(−2k^2 a^2 ) = 1 − α,

is consistent and has asymptotic level α.


Exercise 4.2.7. Prove Proposition 4.2.6. Hint: the scheme of proof is exactly the same as for
Proposition 4.1.8. ◦
Remark 4.2.8 (Computation of the Kolmogorov statistic). To implement Kolmogorov's test, one needs to compute the value of the statistic ζn , which a priori requires evaluating F0 (x) at all points x ∈ R. However, the monotonicity of F0 together with the fact that Fbn is piecewise constant show that the supremum is necessarily reached at one of the n points X1 , . . . , Xn (see Figure 4.3), so that

ζn = max_{1≤k≤n} max{ |(k − 1)/n − F0 (X(k) )|, |k/n − F0 (X(k) )| },
where X(1) ≤ · · · ≤ X(n) denotes the nondecreasing reordering of X1 , . . . , Xn . Therefore, only
n evaluations of F0 are necessary. ◦
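A hedged R sketch of this computation (the function name is ours; F0 is any vectorised CDF, for instance pnorm or punif):

kolmogorov_stat <- function(x, F0) {
  n  <- length(x)
  Fo <- F0(sort(x))                # F0 evaluated at the ordered sample X_(1), ..., X_(n)
  max(pmax(abs((1:n) / n - Fo), abs((0:(n - 1)) / n - Fo)))
}
# e.g. x <- runif(100); kolmogorov_stat(x, punif)
# ks.test(x, punif)$statistic      # same value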


Figure 4.3: The maximum of |Fbn − F | is reached at the point X(2) .

Exercise 4.2.9 (Cramér–von Mises test). As an alternative to the Kolmogorov test, the Cramér–von Mises test is based on the statistic

ξn = ∫_{x∈R} (Fbn (x) − F0 (x))^2 F0′ (x) dx,

where F0 is assumed to be C 1 . This statistic is another measure of the distance between Fbn and F0 .
Thanks to heuristic computations based on the Donsker Theorem, describe the region of rejection
of this test. ◦

4.2.3 Goodness-of-fit: the nonasymptotic Kolmogorov test


Unlike the case of Pearson’s statistic for finite state space models, the Kolmogorov statistic enjoys
a peculiar property allowing to derive nonasymptotic tests: it is free8 under H0 .
8
We recall the Definition 2.4.3 in Chapter 2 of the freeness of a statistic in the parametric context.

Lemma 4.2.10 (Freeness of Kolmogorov’s statistic). If P0 has a continuous CDF F0 on R, the


law of Kolmogorov’s statistic ζn under H0 only depends on n and not on F0 .

The proof of Lemma 4.2.10 is postponed below. It relies on the notion of pseudo-inverse of a
CDF.

Definition 4.2.11 (Pseudo-inverse). Let F : R → [0, 1] be a CDF. The pseudo-inverse F −1 :


[0, 1] → [−∞, +∞] is defined by

∀u ∈ [0, 1], F −1 (u) = inf{x ∈ R : F (x) ≥ u},

where we take the conventions that inf R = −∞ and inf ∅ = +∞.

This notion, which does not suffer from any issue of existence or uniqueness, can be seen as a
precision of the Definition 2.4.2 of a quantile given Chapter 2.

Exercise 4.2.12 (Some precautions to be taken). Construct:

1. a CDF F such that there exists u ∈ (0, 1) for which F (F −1 (u)) ≠ u,

2. a CDF F such that there exists x ∈ R for which F −1 (F (x)) ≠ x. ◦

Lemma 4.2.13 (Properties of the pseudo-inverse). Let F : R → [0, 1] be a CDF.

(i) For all u ∈ (0, 1), x ∈ R, F −1 (u) ≤ x if and only if u ≤ F (x).

(ii) Let U be a uniform random variable on [0, 1]. Then F is the CDF of the random variable
F −1 (U ).

Proof. Since F is right-continuous, for any u ∈]0, 1[, the set {x ∈ R : F (x) ≥ u} is closed,
so that F (F −1 (u)) ≥ u. Since F is nondecreasing, we deduce that if F −1 (u) ≤ x, then u ≤
F (F −1 (u)) ≤ F (x). Reciprocally, if u ≤ F (x), then by the definition of F −1 , F −1 (u) ≤ x: this
proves the first point.
We check the second point by writing
P(F −1 (U ) ≤ x) = P(U ≤ F (x)) = ∫_{u=0}^{F (x)} du = F (x),

where the first identity follows from the first part of the lemma.

We may now detail the proof of Lemma 4.2.10.

Proof of Lemma 4.2.10. Let U1 , . . . , Un be independent uniform random variables on [0, 1]. By Lemma 4.2.13, under H0 , ζn has the same law as

sup_{x∈R} |(1/n) Σ_{i=1}^n 1{F0−1 (Ui )≤x} − F0 (x)| = sup_{x∈R} |(1/n) Σ_{i=1}^n 1{Ui ≤F0 (x)} − F0 (x)| = sup_{u∈(0,1)} |(1/n) Σ_{i=1}^n 1{Ui ≤u} − u|,

where we have set u = F0 (x) and used the continuity of F0 to ensure that F0 (x) takes all values u ∈ (0, 1) in the second equality. The law of the right-hand side does not depend on F0 , which is the announced statement.

Lemma 4.2.10 motivates the following definition.

Definition 4.2.14 (Kolmogorov’s law). Let (Ui )i≥1 be a sequence of independent random vari-
ables uniformly distributed on [0, 1]. For all n ≥ 1, the law of the random variable

Zn = sup_{u∈(0,1)} |(1/n) Σ_{i=1}^n 1{Ui ≤u} − u|

is called the Kolmogorov law with parameter n.

As a conclusion, we may thus derive a nonasymptotic test, based on the statistic ζn . In R, this
test is performed with the command ks.test.

Corollary 4.2.15 (Nonasymptotic Kolmogorov test). Under the assumptions of Lemma 4.2.10,
the test with rejection region
Wn = {ζn ≥ zn,1−α }
where zn,r is the quantile of order r of Kolmogorov’s law with parameter n, has level α.
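For instance, one may test whether a simulated sample is distributed according to the standard Gaussian law as follows (the p-value returned by ks.test is then compared with α; by default, R uses the exact Kolmogorov law for small samples and the asymptotic approximation otherwise).

set.seed(1)
x <- rnorm(100)
ks.test(x, "pnorm", mean = 0, sd = 1)    # Kolmogorov test of H0: F = N(0, 1), fully specified
# H0 is rejected at level alpha as soon as the reported p-value is lower than alpha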

4.2.4 Goodness-of-fit to a family of distributions: the Lilliefors correction


Let P0 be a subset of the space of probability measures (with a continuous CDF) on R. Similarly to
the χ2 test in Section 4.1.3, the Kolmogorov test may generally be adapted to the set of hypotheses

H0 = {P ∈ P0 }, H1 = {P ∉ P0 }.

We shall study the specific case (once again, similar to that of Section 4.1.3) where P0 is a para-
metric family, which thus writes

P0 = {P0,θ , θ ∈ Θ}, with Θ ⊂ Rq .

Example 4.2.16 (Goodness-of-fit to the Exponential model). We observe the lifespans X1 , . . . , Xn


of n lightbulbs, assumed to be iid, and want to test whether these random variables are exponen-
tially distributed.

Following the lines of Section 4.1.3, a natural approach consists in:

(i) finding a consistent estimator θbn of θ under H0 ;

(ii) comparing the distance between the empirical CDF Fbn of the sample and the CDF F0,θbn of
the probability measure in P0 corresponding to the estimated value θbn of θ.

In some cases, it may then be proved that the law of the statistic

ζn′ = sup_{x∈R} |Fbn (x) − F0,θbn (x)|

is free under H0 , that is to say that it depends on n (and on the model P0 ), but non on the under-
lying value θ of the parameter. In order to check this property, it is useful to mimic the proof of
Lemma 4.2.10 and to express both Fbn (x) and θbn in terms of independent uniform random vari-
ables U1 , . . . , Un . Just like the estimation of θ by θbn changes the number of degrees of freedom
of the limiting χ2 distribution in the context of Section 4.1.3 (see Proposition 4.1.13), the law
of ζn′ under H0 will generally not be the Kolmogorov law from Section 4.2.3, but a certain

probability measure, depending on the model P0 , whose quantiles z′n,r need to be computed, for
instance by numerical simulation.
The example of the case where P0 is the Gaussian model is detailed below, and we refer
to Exercise 4.A.5 for the case of the Exponential model (hinted at in Example 4.2.16). In the
Gaussian case, the resulting test is called the Lilliefors test9 . By extension, the two-step approach
described above is sometimes referred to as the Lilliefors correction to the Kolmogorov test.

Example 4.2.17 (Lilliefors test). A series of 1000 values X1 , . . . , Xn ∈ R is measured. The


empirical mean and variance are

x = 1.05, v = 1.42^2 .

The corresponding histogram is plotted on Figure 4.4, together with the density of the N(x, v) distribution. We want to know whether the series is normally distributed. We employ Lilliefors' test, and follow the steps sketched above.

Figure 4.4: Histogram of the sample, and density with estimated parameters, for the Lilliefors test.

(i) We use (X n , Vn ) as consistent estimators of (µ, σ 2 ).

(ii) To check the freeness, under H0 , of ζn′ , we first write, for all (µ, σ 2 ) ∈ R × (0, +∞),

F0,(µ,σ2 ) (x) = Φ((x − µ)/σ),

where Φ is the CDF of the standard Gaussian distribution, so that

F0,(µ,σ2 )−1 (u) = µ + σΦ−1 (u).
9
H. W. Lilliefors, On the Kolmogorov–Smirnov Test for Normality with Mean and Variance Unknown, Journal of
the American Statistical Association, vol. 62(318), 1967, p. 399–402.

As a consequence, letting Xi = F0,(µ,σ2 )−1 (Ui ), where U1 , . . . , Un are independent uniform random variables on [0, 1], we get

X n = (1/n) Σ_{i=1}^n (µ + σΦ−1 (Ui )) = µ + (σ/n) Σ_{i=1}^n Φ−1 (Ui ),

Vn = (1/n) Σ_{i=1}^n (Xi − X n )^2 = (σ^2 /n) Σ_{i=1}^n (Φ−1 (Ui ) − (1/n) Σ_{j=1}^n Φ−1 (Uj ))^2 ,

so that, for all x ∈ R,

F0,(X n ,Vn ) (x) = Φ( (Φ−1 (u) − (1/n) Σ_{i=1}^n Φ−1 (Ui )) / √((1/n) Σ_{i=1}^n (Φ−1 (Ui ) − (1/n) Σ_{j=1}^n Φ−1 (Uj ))^2 ) ),

where u = F0,(µ,σ2 ) (x). We denote by ΦUn (u) the right-hand side, and finally obtain that

ζn′ = sup_{u∈(0,1)} |(1/n) Σ_{i=1}^n 1{Ui ≤u} − ΦUn (u)|,

the law of which does not depend on the parameters µ and σ 2 but only on n.
(iii) With our data, Kolmogorov’s statistic takes the value ζn′ = 0.05. To compute the p-value
(m)
of the test, we take Nsim ≫ 1 and, for all m ∈ {1, . . . , Nsim }, draw a sample Un =
(m) (m)
(U1 , . . . , Un ) of independent uniform random variables on [0, 1] and compute the value
of n
1 X
(m)
Zn = sup 1{U (m) ≤u} − ΦU(m) (u) .
u∈(0,1) n i n
i=1
We then approximate the p-value corresponding to the realisation with ζn′ = 0.05 by
  1 X
Nsim
p-value = P Zn(1) ≥ 0.05 ≃ 1 (m) .
Nsim m=1 {Zn ≥0.05}

We obtain p = 3.2 10−6 , therefore H0 is rejected at all usual levels.
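The following hedged R sketch implements this simulation (the function name, n and Nsim are ours; by freeness, ζn′ has the same law as the Kolmogorov distance between the empirical CDF of qnorm(U) and the Gaussian CDF with estimated mean and variance, which is what lillie_stat computes). A p-value as small as the one above would require a much larger Nsim to be resolved accurately.

lillie_stat <- function(u) {
  x  <- qnorm(u)                                     # sample with mu = 0, sigma = 1
  n  <- length(x)
  Fo <- pnorm(sort(x), mean = mean(x), sd = sqrt(mean((x - mean(x))^2)))  # fitted CDF, variance Vn with 1/n
  max(pmax(abs((1:n) / n - Fo), abs((0:(n - 1)) / n - Fo)))
}
n <- 1000; Nsim <- 10000
Z <- replicate(Nsim, lillie_stat(runif(n)))
mean(Z >= 0.05)                                      # Monte-Carlo approximation of the p-value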


Due to the importance of the Gaussian model, other normality tests are available. Among the most famous, let us mention the Shapiro–Wilk test and d'Agostino's K-squared test; see the references in [3, Section 14.8].
Remark 4.2.18 (A wrong approach). Make sure that you understand why, in the example above, applying the Kolmogorov test from Section 4.2.3 with P0 = N(1.05, 1.42^2 ) is not correct. ◦

4.A Exercises
Exercise 4.A.1 (χ2 test in the parametric framework). When the state space X is finite with cardi-
nality m, one can actually resort to the parametric framework by introducing the notation
Θ = {(px )x∈X ∈ [0, 1]^m : Σ_{x∈X} px = 1},

and consider that a probability measure on X is parametrised by its probability mass function
θ = (px )x∈X . In this context, show that the MLE of θ is Pbn . ◦
↸ Exercise 4.A.2. During 300 minutes, the number of clients entering a shop per minute is
recorded. The results are reported below:
Number of clients 0 1 2 3 4 5
Number of minutes 23 75 68 51 53 30
Does the number of clients entering the shop per minute follow a Poisson distribution? You may
use the R command 1-pchisq(x,n) to compute the probability that a χ2 (n)-distributed variable
takes values larger than x, and thus return a p-value. ◦
Exercise 4.A.3 (On Donsker’s Theorem). We recall the Definition 4.2.2 of the Brownian bridge.
The purpose of this exercise is to give a proof of a weak form of the convergence stated in Theo-
rem 4.2.3.
1. For all t ∈ [0, 1], compute the variance of β(t).

2. Let F be a CDF on R, and let x1 ≤ · · · ≤ xd in R. What is the law of the random vector
G = (β(F (x1 )), . . . , β(F (xd )))?

3. With the notation of Theorem 4.2.3, show that when n → +∞, the random vector Gn =
(Gn (x1 ), . . . , Gn (xd )) converges in distribution to G.
The property which we established is called the convergence in finite-dimensional distribution. ◦
Exercise 4.A.4 (Nonasymptotic Kolmogorov test). Let (Xi )i≥1 be a sequence of iid random vari-
ables with CDF F on R.
1. Using Hoeffding’s inequality, show that for all a > 0,
√ 
sup P n|Fbn (x) − F (x)| ≥ a ≤ 2 exp(−2a2 ).
x∈R

In 1956, Dvoretzky, Kiefer and Wolfowitz10 proved that there exists a constant C such that

P(sup_{x∈R} √n |Fbn (x) − F (x)| ≥ a) ≤ C exp(−2a^2 ).

In 1958, Birnbaum and McCarty11 conjectured that the best constant was C = 2, and this
conjecture was proved by Massart12 in 1990.

2. Deduce a nonasymptotic version of the Kolmogorov test, with level at most α, which does
not require to compute the quantiles of the Kolmogorov statistic. What is another benefit of
this test? ◦
↸ Exercise 4.A.5 (Lilliefors correction for the Exponential model). Let X1 , . . . , Xn be iid positive
random variables, with common but unknown distribution P . We wish to test whether P is an
exponential distribution, and therefore write H0 = {there exists λ > 0 such that P = E(λ)}.
10
A. Dvoretzky, J. Kiefer and J. Wolfowitz, Asymptotic minimax character of the sample distribution function and
of the classical multinomial estimator, Annals of Mathematical Statistics, vol. 27, 1956, p. 642–669.
11
Z. W. Birnbaum and R. McCarty, A distribution-free upper confidence bound for P(Y < X), based on independent
samples of X and Y , Annals of Mathematical Statistics, vol. 29, 1958, p. 558–562.
12
P. Massart, The tight constant in the Dvoretzky–Kiefer–Wolfowitz inequality, The Annals of Probability, vol. 18,
1990, p. 1269–1283.

1. Under H0 , recall the expression of the MLE λbn of λ.

2. Let Fbn be the empirical CDF of the sample, and for all λ > 0, let F0,λ be the CDF of the
distribution E(λ). Show that under H0 , the law of the statistic


ζn′ = sup_{x∈R} |Fbn (x) − F0,λbn (x)|

depends on n but not on λ. Hint: you may refer to Example 4.2.17.

3. We consider two samples of n = 200 values each, whose histograms superposed with the density of the associated E(λbn ) distribution are reported below.

(Histogram of the first sample and histogram of the second sample, each superposed with the density of the fitted exponential distribution.)

The associated respective values of the statistic ζn′ are 0.066 and 0.089. Using Monte-Carlo
simulations, in R, of the law of ζn′ under H0 , compute the p-values associated with each
sample. What do you conclude? ◦

4.B Summary
4.B.1 General framework
• No shape assumption is made on the common law P of the observed data X1 , . . . , Xn .

• P is estimated by the empirical distribution Pbn of X1 , . . . , Xn .

• Goodness-of-fit test:
a probability measure P0 on X is fixed;
H0 = {P = P0 }, H1 = {P 6= P0 };
H0 is rejected as soon as some ‘distance’ between Pbn and P0 is too large.

• Goodness-of-fit to a family of distributions:


a parametric family P0 = {P0,θ , θ ∈ Θ} of probability measures on X is fixed;
under H0 , an estimator θbn of θ is given;
H0 is rejected as soon as some ‘distance’ between Pbn and P0,θbn is too large.
Golden rule: The law of the test statistic under H0 will not be the same as for the
goodness-of-fit test!

4.B.2 χ2 tests
• X is a finite space with cardinality m.

• Probability measures P are represented by their probability mass functions (px )x∈X .

• Test based on the χ2 distance χ2 (P |Q).

• Goodness-of-fit test is based on Pearson’s statistic dn = nχ2 (Pbn |P0 ).


Under H0 , dn → χ2 (m − 1) in distribution.
Under H1 , dn → +∞.
The (asymptotic) test rejects H0 as soon as dn ≥ χ2m−1,1−α .

• Goodness-of-fit to a parametric family of distributions with Θ ⊂ Rq :


based on d′n = nχ2 (Pbn |P0,θbn ) where θbn is a consistent estimator of θ;
d′n → χ2 (m − q − 1) under H0 and d′n → +∞ under H1 so that the (asymptotic) test
rejects H0 as soon as d′n ≥ χ2m−q−1,1−α .
Instance of the golden rule: notice the decrease in the number of degrees of freedom!

4.B.3 Kolmogorov tests


• X = R and P is assumed to have a continuous CDF.

• Asymptotic behaviour of empirical CDF Fbn described by Glivenko–Cantelli (≃ Law of


Large Numbers) and Donsker (≃ Central Limit Theorem) Theorems.

• Goodness-of-fit tests based on Kolmogorov’s statistic ζn = supx∈R |Fbn (x) − F0 (x)|.



Asymptotic test rejects H0 as soon as √n ζn is larger than the quantile of order 1 − α of
the law of the supremum of the modulus of the Brownian bridge.
Nonasymptotic test is based on the remark that the law of ζn is free under H0 and thus
rejects H0 as soon as ζn is larger than the quantile of order 1 − α of Kolmogorov’s law.

• Goodness-of-fit to a parametric family of distributions:


Lilliefors correction is employed when ζn′ = supx∈R |Fbn (x) − F0,θbn (x)| is proven to be
free under H0 ;
instance of the golden rule: in general, the law of ζn′ under H0 is not Kolmogorov’s law;
its quantiles can be computed by numerical simulation.
Chapter 5

Linear and logistic regression

In the framework of regression, two observed variables x and y are assumed to be related by the
identity
y = f (x, ǫ), (Reg)
where the function f is unknown and ǫ is an unobserved random term, often called noise or error,
accounting either for the inaccuracy of the model (for example, the dependence of y upon other
variables than x which are not taken into account), or for the natural variability of y. Starting from
a sample composed of pairs (x1 , y1 ), . . . , (xn , yn ), where the variables x1 , . . . , xn may be either
determined by the experimenter (fixed design) or randomly observed (random design), the purpose
of regression is to:
• reconstruct the function f ;
• estimate parameters of the law of ǫ;
in order to be able to predict, with a controlled degree of accuracy, the value of an outcome y for
a new value of x.
In general, x is multivariate, and its coordinates x1 , . . . , xp are called explanatory variables,
also features, regressors, or independent variables, although this designation may be confusing
because these variables need not be statistically independent. The variable y is the explained
variable, also response or dependent variable1 .
The term regression is usually employed when y takes continuous — or at least numerical
— values. If y (or equivalently f ) takes values in a categorical set, those values may be seen as
classes, and the problem is recast as finding the class to which a realisation of x belongs. In this
context, the term classification is preferred.
Example 5.0.1 (Regression and classification). An Internet search engine records, in n regions,
the number of queries for a certain number p of keywords associated to flu, for instance ‘flu
symptoms’, ‘fever’ and ‘sneezing’. For each region i, a vector xi = (x1i , . . . , xpi ) of Rp is thus
obtained. Certainly, the number yi of people actually infected by flu in the i-th region depends on
xi , and it is reasonable to postulate a relation of the type of Equation (Reg), where f is increasing
with respect to each variable xj , while ǫ encodes the natural variability of the model. Estimating
f from the n samples is a regression problem, and allows to predict, for a new region, the number
yn+1 of infections from the observation of queries xn+1 in this region. If one is not interested
in the precise number of infections, but only in determining whether a region is undergoing an
epidemic or not, then one faces a classification problem.
1
In econometrics, the coordinates of x are called the exogenous variables and y the endogenous variable; in English, the terminologies independent and dependent are used.

Remark 5.0.2 (Supervised and unsupervised learning). Regression and classification methods
base their predictions of y on a dataset where the pairs (xi , yi ) are available. These methods
are called supervised, in the sense they start by a learning phase on a training dataset. In contrast,
data analysis methods of Chapter 1 are called unsupervised, as they retrieve information in the
data without any preliminary training. ◦

5.1 Linear regression


In the context of linear regression, the function f of Equation (Reg) is assumed to be affine, and to write

f (x, ǫ) = β0 + Σ_{j=1}^p βj x^j + ǫ,

with x = (x1 , . . . , xp ) ∈ Rp and β = (β0 , . . . , βp ) ∈ Rp+1 — in the sequel, β shall be considered


as a column vector. This parametrisation of f reduces the issue of estimating f to that of estimat-
ing the vector β. We denote by (xi , yi )1≤i≤n the pairs of observations in Rp × R, and introduce
the notation yn = (y1 , . . . , yn )⊤ ∈ Rn , ǫn = (ǫ1 , . . . , ǫn )⊤ ∈ Rn , and xn ∈ Rn×(p+1) for the matrix whose i-th row is (1, x1i , . . . , xpi ), so that

yn = xn β + ǫn . (LinReg)
The parameter β0 is called the intercept. It may be removed from the model, in which case the
first column of the matrix xn must be removed. All subsequent computations then remain valid,
up to replacing p + 1 with p. In the sequel we always include the intercept in the model.

5.1.1 Least square estimator for simple linear regression


When p = 1, the problem of estimating β = (β0 , β1 ) from the identity (LinReg) amounts to
finding the ‘best line’ fitting the pairs of points (xi , yi )1≤i≤n in the plane, see Figure 5.1. This is
called the simple linear regression.

Figure 5.1: Linear regression when p = 1.



A natural approach consists in finding coefficients β0 , β1 which minimise a certain distance


between the observed values yi and the values β0 + β1 xi predicted by the model — in other words,
to minimise the errors ǫi . A computationally convenient choice of the notion of distance is the sum
of squares of the errors
Σ_{i=1}^n |ǫi |^2 = Σ_{i=1}^n |yi − β0 − β1 xi |^2 ,

which leads to the Least Square Estimator (LSE) problem

min_{β0 ,β1 ∈R} Σ_{i=1}^n |yi − β0 − β1 xi |^2 ,

the solution of which is called the LSE of (β0 , β1 ).


In the sequel of this section, given two vectors an , bn ∈ Rn , we shall use the shortcut notation
an = (1/n) Σ_{i=1}^n ai ,   Var(an ) = (1/n) Σ_{i=1}^n (ai − an )^2 ,

Cov(an , bn ) = (1/n) Σ_{i=1}^n (ai − an )(bi − bn ),   Corr(an , bn ) = Cov(an , bn )/(√Var(an ) √Var(bn )).

Proposition 5.1.1 (LSE for the simple linear regression). Assume that n ≥ 2 and that there exist
i, j such that xi ≠ xj . The LSE of (β0 , β1 ) is given by

βb0 = y n − βb1 xn ,   βb1 = Cov(xn , yn )/Var(xn ).
Exercise 5.1.2. Prove Proposition 5.1.1. ◦
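These formulas are straightforward to check numerically in R on simulated data (the built-in lm function fits the same model; note that R's cov and var use a 1/(n − 1) normalisation, which cancels in the ratio).

set.seed(1)
n  <- 100
x  <- runif(n)
y  <- 1 + 2 * x + rnorm(n, sd = 0.5)
b1 <- cov(x, y) / var(x)          # slope estimate
b0 <- mean(y) - b1 * mean(x)      # intercept estimate
c(b0, b1)
coef(lm(y ~ x))                   # same estimates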
Once the LSE (βb0 , βb1 ) is estimated, it may be interesting to quantify the quality of the linear
model by measuring whether the points (xi , yi ) are far from the line with equation y = βb0 + βb1 x
or not. To do so, one may define ybn = (yb1 , . . . , ybn ) ∈ Rn by ybi = βb0 + βb1 xi , and define the residual error between yn (the data) and ybn (the model) by

(1/n)‖yn − ybn ‖^2 = (1/n) Σ_{i=1}^n (yi − βb0 − βb1 xi )^2
= (1/n) Σ_{i=1}^n (yi − y n + βb1 xn − βb1 xi )^2
= (1/n) Σ_{i=1}^n ((yi − y n )^2 − 2βb1 (yi − y n )(xi − xn ) + βb1^2 (xi − xn )^2 )
= Var(yn ) − 2βb1 Cov(xn , yn ) + βb1^2 Var(xn )
= Var(yn ) − Cov(xn , yn )^2 /Var(xn )
= Var(yn )(1 − Corr(xn , yn )^2 ).

This computation motivates the following definition.


Definition 5.1.3 (Coefficient of determination). The coefficient of determination R2 ∈ [0, 1] is
defined by
R2 = Corr(xn , yn )2 .

The computation above yields the identity

(1/n)‖yn − ybn ‖^2 = Var(yn )(1 − R^2 ),
so that the closer R2 is to 1, the better the linear model fits the data.

Remark 5.1.4 (Geometric interpretation of LSE). In its geometric interpretation, the LSE problem
amounts to minimising the Euclidean distance, in the vector space Rn , between the vector yn and
the linear subspace spanned by the vectors 1n = (1, . . . , 1) and xn — in other words, finding the
orthogonal projection of yn onto this linear subspace. The assumption made in Proposition 5.1.1
that there exist i, j such that xi 6= xj ensures that this subspace has dimension 2. ◦

This geometric interpretation is the cornerstone of the estimation of β in the general case
p ≥ 2, which is developed in the next paragraph.

5.1.2 Least square estimator for multiple linear regression


When p ≥ 2, the linear regression is called multiple. The LSE of β remains defined as the solution
to the minimisation of the sum of squares of the errors

Σ_{i=1}^n |ǫi |^2 = ‖yn − xn β‖^2 ,

where ‖ · ‖ denotes the Euclidean norm in Rn . Thus, following the geometric interpretation of
Remark 5.1.4 leads to the next result.

Proposition 5.1.5 (LSE for the multiple linear regression). Assume that p ≤ n − 1, and that xn
has full rank2 p + 1. The LSE of β is the unique vector βb ∈ Rp+1 such that ybn = xn βb is the
orthogonal projection of yn onto the range of xn , and it writes

βb = (x⊤n xn )−1 x⊤n yn .

Proof. Let ybn ∈ Rn denote the orthogonal projection of yn onto the range of xn . By the definition of ybn , there exists βb ∈ Rp+1 such that ybn = xn βb; besides, for all u ∈ Rp+1 , the vectors ybn − yn and xn u are orthogonal in Rn , which implies that

x⊤n (ybn − yn ) = 0,

so that3

x⊤n xn βb = x⊤n yn .

Since xn has full rank, by Exercise 5.1.6 below the matrix x⊤n xn is invertible and therefore βb = (x⊤n xn )−1 x⊤n yn .

Exercise 5.1.6. Let n ≥ p + 1 and xn ∈ Rn×(p+1) .

1. Assume that there is u ∈ Rp+1 \ {0} such that xn u = 0. Show that the rank of xn is at most
p. Hint: think of the Rank-nullity Theorem4 .
2
Est de rang complet en français.
3
You may also observe that this is the first-order optimality condition associated with the Least Square problem.
4
Théorème du rang en français.

2. Deduce that under the assumptions of Proposition 5.1.5, x⊤n xn is invertible. ◦

Notice that if p + 1 ≤ n but the rank of xn is lower than p + 1, then at least one of the columns
of xn is a linear combination of the other ones, so that it may be removed from xn without loss
of information, and the process can be iterated until the matrix xn recovers full rank. Still, this
method requires the number of features p to be lower than the number of observations n.

Remark 5.1.7 (Sparsity, penalisation and bias-variance tradeoff). In high-dimensional statistics, it is often the case that p > n, so that the matrix x⊤n xn may be very degenerate. In this context, it is
generally assumed in addition that the vector β is sparse5 , that is to say that many of its coordinates
are 0. The underlying paradigmatic idea is that at each experiment or observation, many data are
collected, but only a few of these data are actually relevant. To compute an estimator of β in such
problems, it is customary to penalise the sum of squares of the errors by a term taking the sparsity
constraint into account. For instance, one may try to solve the minimisation problem

min_{β∈Rp+1} ‖yn − xn β‖^2 + λ‖β‖0 ,

where λ > 0 is a given numerical parameter, and ‖β‖0 denotes the number of nonzero coordinates of β: an optimal solution should then have a low ‖ · ‖0 -norm. This problem is actually computationally difficult, and it may be fruitfully relaxed to the minimisation problem

min_{β∈Rp+1} ‖yn − xn β‖^2 + λ‖β‖1 ,

where ‖β‖1 = Σ_{j=0}^p |βj |. The obtained estimator is called LASSO (for Least Absolute Shrinkage
and Selection Operator), and was introduced by Tibshirani6 in 1996. It is nowadays considered as
a standard variable selection tool.
When using such a penalisation method, the value of the estimator depends on the preliminary
choice of the parameter λ. If λ is small, then βb is close to the LSE but not very sparse. On
the contrary, if λ is large, then the estimator may be far from the LSE but very sparse. This
phenomenon is related to the bias-variance tradeoff discussed in Exercise 2.1.11, p. 30: if λ is
small, then the value of βb highly depends on the data, so that it is expected to have a small bias but
a large variance; in the context of regression, it is an example of overfitting. On the contrary, if λ
is large, then βb only weakly depends on the data, so that it has a small variance but a large bias:
this is a situation of underfitting. ◦
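In R, the LASSO is implemented for instance in the external glmnet package (assuming it is installed); a minimal sketch on simulated sparse data, where cross-validation is used to choose λ:

# install.packages("glmnet")   # if needed
library(glmnet)
set.seed(1)
n <- 50; p <- 200
X    <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, rep(0, p - 2))          # sparse true coefficient vector
y    <- drop(X %*% beta) + rnorm(n)
cvfit <- cv.glmnet(X, y, alpha = 1)      # alpha = 1 corresponds to the l1 penalty
coef(cvfit, s = "lambda.min")            # estimated coefficients, most of them exactly 0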

As in the case of simple regression, once the LSE of β is computed, it is useful to assess the
quality of the linear fit on the data. Therefore, we would like to define a coefficient of determina-
tion R2 such that the residual error writes
    (1/n) ‖yn − ŷn‖² = Var(yn)(1 − R²).
n
However in the present case, the expression of Definition 5.1.3 may not be straightforwardly ex-
tended, because the fact that each xi is multivariate prevents the correlation between xn and yn
from being well-defined. We thus provide a more general definition by writing first

    yn − ȳn 1n = (yn − ŷn) + (ŷn − ȳn 1n).
⁶ R. Tibshirani, Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society. Series B
(Methodological), vol. 58(1), 1996, p. 267–288.

By construction, yn − ŷn is orthogonal to the range of xn . On the other hand, since 1n is the first
column of xn , both ŷn and ȳn 1n belong to the range of xn . Therefore, by Pythagoras' Theorem,

    (1/n)‖yn − ȳn 1n‖²  =  (1/n)‖yn − ŷn‖²  +  (1/n)‖ŷn − ȳn 1n‖² ,
         Var(yn)              residual error         Var(ŷn)

which brings forth the following definition.


Definition 5.1.8 (Coefficient of determination). The coefficient of determination R2 ∈ [0, 1] is
defined by
Var(b
yn ) yn − y n 1n k2
kb
R2 = = ,
Var(yn ) kyn − y n 1n k2
and it satisfies n1 kyn − y
bn k2 = Var(yn )(1 − R2 ).
The identity (1/n)‖yn − ŷn‖² = Var(yn)(1 − R²) implies that Definition 5.1.8 extends Defini-
tion 5.1.3 to the multiple regression case. Once again, this identity shows that the closer R² is to
1, the better the linear model fits the data. Another formulation of this fact is provided in the next
exercise.

Exercise 5.1.9 (Another formulation of R²). Show that R² = Corr(yn , ŷn)². ◦
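The identities of Definition 5.1.8 and Exercise 5.1.9 can be checked numerically in R on simulated data:

    # Numerical check of Definition 5.1.8 and Exercise 5.1.9
    set.seed(3)
    n <- 100
    x1 <- rnorm(n); x2 <- rnorm(n)
    y  <- 1 + x1 + 0.3 * x2 + rnorm(n)
    fit   <- lm(y ~ x1 + x2)
    y_hat <- fitted(fit)

    R2_def  <- sum((y_hat - mean(y))^2) / sum((y - mean(y))^2)   # Var(y_hat)/Var(y)
    R2_corr <- cor(y, y_hat)^2                                   # Exercise 5.1.9
    c(R2_def, R2_corr, summary(fit)$r.squared)                   # three identical values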

5.1.3 Statistics in the Gaussian model


The computation of the LSE described in the previous sections is purely deterministic, in the
sense that no assumption on the statistical distribution of the errors has been made. However, it
is generally convenient to endow the data with a probabilistic model, as it allows to apply the
tools of statistical inference developed in the previous chapters. The most common choice of
model is to assume that ǫ1 , . . . , ǫn are independent N(0, σ 2 ) random variables, which induces a
supplementary unknown parameter σ 2 in the model.
In this context, the likelihood of the pairs of observation (xi , yi ) ∈ Rp × R, 1 ≤ i ≤ n, writes

    Ln(xn , yn ; β, σ²) = Π_{i=1}^n (1/√(2πσ²)) exp( −(1/(2σ²)) ( yi − β0 − Σ_{j=1}^p βj xji )² ) .

The relevance of the choice of the LSE as an estimator of β is justified by the following result.
Exercise 5.1.10. Under the assumptions of Proposition 5.1.5, show that the LSE βb of β is the
Maximum Likelihood Estimator of β, and that it is unbiased. ◦
A further justification is provided by the Gauss–Markov Theorem studied in Exercise 5.A.1. In
this Gaussian setting, the law of the LSE β̂ is explicit, and an estimator σ̂² of σ² is also available.

Proposition 5.1.11 (Law of (β̂, σ̂²)). Assume that ǫ1 , . . . , ǫn are independent N(0, σ²) random
variables, that p ≤ n − 1, and that the matrix xn has full rank p + 1.

(i) The LSE β̂ satisfies β̂ ∼ Np+1(β, σ² (xn⊤ xn)−1).

(ii) The estimator σ̂² of σ² defined by

    σ̂² = ‖yn − ŷn‖² / (n − p − 1)

is such that (n − p − 1) σ̂²/σ² ∼ χ²(n − p − 1); in particular it is unbiased.

(iii) The estimators β̂ and σ̂² are independent.

Proof. The first point follows from the series of identities

    β̂ = (xn⊤ xn)−1 xn⊤ yn = (xn⊤ xn)−1 xn⊤ (xn β + ǫn) = β + (xn⊤ xn)−1 xn⊤ ǫn ,

which by Proposition A.2.3 in Appendix A shows that β̂ is a Gaussian vector in Rp+1 with mean
β and covariance matrix

    (xn⊤ xn)−1 xn⊤ (σ² In) ( (xn⊤ xn)−1 xn⊤ )⊤ = σ² (xn⊤ xn)−1 .

We now denote by Π = xn (xn⊤ xn)−1 xn⊤ the orthogonal projection of Rn onto the range of
xn , and Π⊥ = In − Π. By definition,

    yn − ŷn = Π⊥ yn = Π⊥ (xn β + ǫn) .

Since xn β is in the range of xn , Π⊥ xn β = 0, so that yn − ŷn = Π⊥ ǫn . We deduce from the
definition of ǫn and Proposition A.2.3 in Appendix A again that (yn − ŷn)/σ ∼ Nn(0, Π⊥), which
by Proposition A.4.1 in Appendix A shows that ‖yn − ŷn‖²/σ² ∼ χ²(n − p − 1) and yields the
second statement of the Proposition.

Finally, the third statement follows from the observation that xn(β̂ − β) = Π ǫn , while σ̂² is a
deterministic function of Π⊥ ǫn . By Cochran's Theorem A.4.2, the random vectors Π ǫn and Π⊥ ǫn
are independent, therefore since xn is assumed to be injective, we deduce that the estimators β̂
and σ̂² are independent.

Proposition 5.1.11 allows for instance to construct confidence regions for β and σ², see an
example in Exercise 5.A.2. It may also be applied to the prediction of the outcome y corresponding
to a value of x which has not been observed yet. Indeed, for a new vector xn+1 ∈ Rp , the LSE
provides a natural predictor

    ŷn+1 = β̂0 + Σ_{j=1}^p β̂j xjn+1

of the corresponding value

    yn+1 = β0 + Σ_{j=1}^p βj xjn+1 + ǫn+1 .

In the Gaussian model, the precision of this predictor can be measured by a confidence interval.
Proposition 5.1.12 (Prediction with linear regression). Given xn+1 = (x1n+1 , . . . , xpn+1 ) ∈ Rp ,
let us define the row vector x′n+1 ∈ Rp+1 by

x′n+1 = (1, x1n+1 , . . . , xpn+1 ),

and write

    κ = 1 + x′n+1 (xn⊤ xn)−1 (x′n+1)⊤ ≥ 1.

A confidence interval of level 1 − α for yn+1 is given by

    [ ŷn+1 − tn−p−1,1−α/2 √(σ̂² κ) ,  ŷn+1 + tn−p−1,1−α/2 √(σ̂² κ) ] ,

where tm,r denotes the quantile of order r of Student’s distribution with m degrees of freedom.

Proof. Writing ŷn+1 = x′n+1 β̂ and using Proposition 5.1.11, we get

    ŷn+1 ∼ N( x′n+1 β , σ² x′n+1 (xn⊤ xn)−1 (x′n+1)⊤ ) .

Since ǫn+1 ∼ N(0, σ²) and is independent of ǫ1 , . . . , ǫn , we deduce that

    yn+1 − ŷn+1 = x′n+1 β + ǫn+1 − ŷn+1 ∼ N(0, σ² κ).

By Proposition 5.1.11 again, we deduce that

    (yn+1 − ŷn+1) / √(σ̂² κ) ∼ t(n − p − 1),

which completes the proof.

Remark 5.1.13. One may also be interested in the prediction of x′n+1 β, without taking into
account the noise ǫn+1 associated with the (n + 1)-th observation. In this case, the very same
proof as in Proposition 5.1.12 shows that a confidence interval for x′n+1 β with level 1 − α is given
by

    [ ŷn+1 − tn−p−1,1−α/2 √(σ̂² λ) ,  ŷn+1 + tn−p−1,1−α/2 √(σ̂² λ) ] ,

with λ = x′n+1 (xn⊤ xn)−1 (x′n+1)⊤ = κ − 1. ◦
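In R, both intervals are returned by predict applied to an lm fit: interval = "prediction" corresponds to Proposition 5.1.12 and interval = "confidence" to Remark 5.1.13. A sketch on simulated data, with a hypothetical new point:

    set.seed(4)
    n <- 60
    x1 <- rnorm(n); x2 <- rnorm(n)
    y  <- 2 + x1 - x2 + rnorm(n, sd = 0.5)
    fit <- lm(y ~ x1 + x2)

    new_point <- data.frame(x1 = 0.3, x2 = -1.2)   # hypothetical x_{n+1}
    predict(fit, newdata = new_point, interval = "prediction", level = 0.95)
    predict(fit, newdata = new_point, interval = "confidence", level = 0.95)
    # The prediction interval is wider: it also accounts for the noise eps_{n+1} (kappa = lambda + 1)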

5.1.4 Variable selection


An important topic in the analysis of the results of a linear regression is variable selection, namely
deciding which of the features x1 , . . . , xp actually have an influence on y. Sparsity and penalisa-
tions methods mentioned in Remark 5.1.7 are a possible approach to variable selection. Other ap-
proaches include the use of criteria such as the Akaike information criterion (AIC), or the Bayesian
information criterion (BIC), see [1, Section 7] for a detailed account. In this section, we work un-
der the Gaussian assumption of Proposition 5.1.11 and describe Student’s and Fisher’s tests for
linear regression.
If a feature xj has no influence on y, then the corresponding coefficient βj must be 0. The
corresponding Student test therefore has the null and alternative hypotheses

H0 = {βj = 0}, H1 = {βj 6= 0}.

We denote by ρj the j-th diagonal coefficient of the matrix (xn⊤ xn)−1 (the coefficients being
indexed from 0 to p).

Lemma 5.1.14 (Student's test). The test rejecting H0 as soon as

    | β̂j | / √(σ̂² ρj) ≥ tn−p−1,1−α/2 ,

where we write β̂ = (β̂0 , . . . , β̂p ), has level α.

Proof. Let ej = (0, 0, . . . , 1, . . . , 0), seen as a row vector of Rp+1 , where the j-th coefficient is
equal to 1 and the coefficients are indexed from 0 to p. By Proposition 5.1.11,

    β̂j = ej β̂ ∼ N(βj , σ² ρj),

and this variable is independent from σ̂². As a consequence, under H0 we have

    Y = ej β̂ / √(σ² ρj) ∼ N(0, 1),   independent from   Z = (n − p − 1) σ̂²/σ² ∼ χ²(n − p − 1),

so that

    ej β̂ / √(σ̂² ρj) = Y / √( Z/(n − p − 1) ) ∼ t(n − p − 1),

from which the construction of the Student test follows.

Fisher’s test allows to test the joint influence of several features. Up to applying a permutation
to the indices of the features, the corresponding null and alternative hypotheses may be written

H0 = {βq+1 = · · · = βp = 0}, H1 = {there exists j ≥ q + 1 such that βj 6= 0},

where q ∈ {1, . . . , p − 1}. Denoting by ŷn0 the orthogonal projection of ŷn onto the range of the
matrix x0n ∈ Rn×(q+1) whose i-th row is (1, x1i , . . . , xqi ), we introduce the statistic

    F = ( ‖ŷn − ŷn0‖² / (p − q) ) / ( ‖yn − ŷn‖² / (n − p − 1) ) .

Lemma 5.1.15 (Fisher’s test). The test rejecting H0 as soon as

F ≥ fp−q,n−p−1,1−α,

where fp−q,n−p−1,1−α is the quantile of order 1 − α of the Fisher distribution F(p − q, n − p − 1),
has level α.

As for Student’s test above, the proof of Lemma 5.1.15 reduces to showing that under H0 ,
F ∼ F(p − q, n − p − 1). It is left as an exercise.
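In R, the Student statistics of Lemma 5.1.14 are reported by summary applied to an lm fit (column t value), and the Fisher statistic of Lemma 5.1.15 is obtained by comparing two nested fits with anova. A sketch on simulated data where x2 and x3 have no influence on y:

    set.seed(5)
    n <- 80
    x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
    y  <- 1 + 2 * x1 + rnorm(n)                 # beta_2 = beta_3 = 0 in the true model

    fit_full <- lm(y ~ x1 + x2 + x3)
    summary(fit_full)            # t value = beta_hat_j / sqrt(sigma_hat^2 * rho_j), with its p-value

    fit_reduced <- lm(y ~ x1)    # model under H0 = {beta_2 = beta_3 = 0}
    anova(fit_reduced, fit_full) # Fisher statistic F and its p-value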

5.2 Logistic regression


Logistic regression addresses situations where the variable y is binary, and we take the convention to
denote by 0 and 1 its possible values. The probability of y to take the value 1 may depend on a set
of numerical parameters x = (x1 , . . . , xp ) ∈ Rp , and it is denoted by p(1|x). Logistic regression
is a method designed in order to estimate the function x 7→ p(1|x) from a set of observations
(x1 , y1 ), . . . , (xn , yn ) ∈ Rp × {0, 1}.

Example 5.2.1 (Spam detection). Let x1 , . . . , xp denote the frequency of occurrences of p key-
words (such as ‘Viagra’, ‘Drug’, etc.) contained in an email, and y = 1 if this email is a spam:
the larger the number of occurrences in a given email, the larger the probability that the email
be a spam. After training a spam detector on a dataset, say of n = 10000 emails, 500 of which
are actually spams, then an email client may adopt the rule to reject an email as soon as the
estimated probability that it is a spam is larger than a given threshold, say 0.7.

5.2.1 Logistic parametrisation


The basis of logistic regression is the assumption that p(1|x) takes the form

    p(1|x) = Ψ( β0 + Σ_{j=1}^p βj xj ) ,

where β = (β0 , . . . , βp ) ∈ Rp+1 and Ψ is the logistic function

    ∀u ∈ R,   Ψ(u) = exp(u)/(1 + exp(u)) = 1/(1 + exp(−u)) ,

see Figure 5.2.

Figure 5.2: The logistic function Ψ, which takes its values between 0 and 1.

In this parametric context, the estimation of x 7→ p(1|x) is reduced to the estimation of the
parameter β ∈ Rp+1 , similarly to linear regression.
Considering y1 , . . . , yn as independent realisations of Bernoulli random variables with respec-
tive parameters p(1|x1 ), . . . , p(1|xn ), we write the likelihood of the model
    Ln(yn , xn ; β) = Π_{i=1}^n p(1|xi)^{yi} (1 − p(1|xi))^{1−yi}
                    = Π_{i=1}^n Ψ(x′i β)^{yi} (1 − Ψ(x′i β))^{1−yi}
                    = Π_{i=1}^n exp(yi x′i β) / (1 + exp(x′i β)) ,

with the same notation x′i = (1, x1i , . . . , xpi ) as for linear regression. The MLE of β is denoted
by β̂. It does not possess an analytic formula, but can be computed by statistical software using
numerical approximations for the system of optimality conditions (see for instance [1, Section 4.4.1]):

    ∂ℓn/∂β0 = Σ_{i=1}^n ( yi − Ψ(x′i β) ) = 0,     ∂ℓn/∂βj = Σ_{i=1}^n xji ( yi − Ψ(x′i β) ) = 0,   j = 1, . . . , p.

The corresponding estimator of p(1|x) is denoted by p̂(1|x) = Ψ(x′ β̂).

Once the parameter β is estimated, the prediction of the value yn+1 associated with a new
vector of parameters xn+1 ∈ Rp can be made following the rule

    ŷn+1 = 1 if p̂(1|xn+1) ≥ p0 ,   and   ŷn+1 = 0 otherwise,
where p0 ∈ [0, 1] is a given threshold. Such a rule, assigning a class y ∈ {0, 1} to a vector of
parameters x ∈ Rp , is called a classifier. As is the case for hypothesis testing, the value of p0 can
be tuned if one wants the classifier to be more or less conservative.
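In R, the MLE of the logistic model is computed by glm with family = binomial, and predict(..., type = "response") returns p̂(1|x), from which the classifier is obtained by thresholding. A sketch on simulated data:

    set.seed(6)
    n <- 200
    x1 <- rnorm(n); x2 <- rnorm(n)
    p_true <- 1 / (1 + exp(-(-1 + 2 * x1 - x2)))
    y <- rbinom(n, size = 1, prob = p_true)

    fit <- glm(y ~ x1 + x2, family = binomial)     # numerical MLE of beta
    coef(fit)

    new_point <- data.frame(x1 = 0.5, x2 = -0.2)   # hypothetical new observation
    p_hat <- predict(fit, newdata = new_point, type = "response")   # estimate of p(1|x)
    as.numeric(p_hat >= 0.7)                       # classifier with threshold p0 = 0.7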

5.2.2 Likelihood ratio test


For the sake of simplicity, we assume that p = 1 here — this is the equivalent situation to the
simple linear regression. Our purpose is to test whether y actually depends on x, so that we
introduce the null and alternative hypotheses
H0 = {β1 = 0}, H1 = {β1 6= 0}.
Under H0 , the model is reduced to a simple Bernoulli model: p(1|x) does not depend on x
and its MLE is

    p̂∗ = n1/n ,       where n1 = Σ_{i=1}^n yi .
Let us introduce the statistic

    Λ(yn |xn) = 2 log [ (maximum likelihood of the model) / (maximum likelihood of the model under H0) ]

              = 2 log [ Π_{i=1}^n p̂(1|xi)^{yi} (1 − p̂(1|xi))^{1−yi} / Π_{i=1}^n (p̂∗)^{yi} (1 − p̂∗)^{1−yi} ]

              = 2 ( Σ_{i=1}^n ( yi log p̂(1|xi) + (1 − yi) log(1 − p̂(1|xi)) )
                    + n log n − n1 log(n1) − (n − n1) log(n − n1) ) .

The following theoretical result is admitted (it is based on [3, Theorem 12.4.2, p. 515] which
is sometimes called Wilks’ Theorem7 ).
Lemma 5.2.2 (Behaviour of the likelihood ratio). Under H0 and suitable conditions on xn ,
Λ(yn |xn ) converges in distribution to the χ2 (1) distribution when n goes to infinity.
Lemma 5.2.2 ensures that the test rejecting H0 if Λ(yn |xn ) ≥ χ21,1−α has asymptotic level α.
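In R, the statistic Λ(yn|xn) can be computed from the log-likelihoods of the two fitted models, or obtained directly with anova; a sketch on simulated data (with p = 1):

    set.seed(7)
    n <- 150
    x <- rnorm(n)
    y <- rbinom(n, size = 1, prob = 1 / (1 + exp(-(0.5 + x))))

    fit1 <- glm(y ~ x, family = binomial)    # full logistic model
    fit0 <- glm(y ~ 1, family = binomial)    # model under H0: constant p(1|x)

    Lambda <- as.numeric(2 * (logLik(fit1) - logLik(fit0)))
    1 - pchisq(Lambda, df = 1)               # asymptotic p-value given by Lemma 5.2.2
    anova(fit0, fit1, test = "Chisq")        # same likelihood ratio test, done by R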

5.A Exercises
Exercise 5.A.1 (Gauss–Markov Theorem). We assume that we observe realisations (xi , yi ) ∈
Rp × R, i ∈ {1, . . . , n}, associated with the linear model
    yi = β0 + Σ_{j=1}^p βj xji + ǫi ,
⁷ S. S. Wilks, The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses, The Annals
of Mathematical Statistics, vol. 9(1), 1938, p. 60–62.

where ǫ1 , . . . , ǫn are supposed to be random variables such that E[ǫi ] = 0 and E[ǫi ǫj ] = σ 2 1{i=j} .
An estimator β̃ of β is called linear if there exists a matrix an ∈ R(p+1)×n such that

    β̃ = an yn ,       where yn = (y1 , . . . , yn)⊤ .

Because βe is an estimator of β, the matrix an may depend on xn but not on β.


The purpose of this exercise is to show that, under the assumptions of Proposition 5.1.5, the
LSE βb is the Best Linear Unbiased Estimator of β, which is known as the Gauss–Markov Theorem.
Here, ‘Best’ means that it has the smallest covariance matrix, in the sense of symmetric matrices8

1. If βe is a linear unbiased estimator of β, with associated matrix an ∈ R(p+1)×n , show that


an xn is the identity of Rp+1 .

2. Express the covariance matrix of the estimator βe in terms of the matrix an .

3. Show that this covariance matrix is larger, in the sense of symmetric matrices, than the
   covariance matrix of β̂. Hint: introduce the matrix dn = an − (xn⊤ xn)−1 xn⊤ and compute
   dn dn⊤ . ◦

Exercise 5.A.2 (Confidence ellipsoids for β in the linear regression). Under the assumptions of
Proposition 5.1.11, let λ1 ≥ · · · ≥ λp+1 > 0 be the eigenvalues of the matrix xn⊤ xn , and let
(e1 , . . . , ep+1 ) be an associated orthonormal basis.

1. Show that the random variables √λj ⟨β̂ − β, ej⟩ are iid according to N(0, σ²).

2. If σ² is assumed to be known, find a > 0 so that the ellipsoid

       C(a) = { β ∈ Rp+1 : Σ_{j=1}^{p+1} (λj/σ²) ⟨β̂ − β, ej⟩² ≤ a }

   be such that

       P(β ∈ C(a)) = 1 − α,

   for a given level α > 0.

3. If σ² is no longer assumed to be known and has to be estimated by σ̂², how should the
   definition of C(a) be modified for the identity P(β ∈ C(a)) = 1 − α to remain valid? ◦

1 Exercise 5.A.3 (Principal Component Regression). The purpose of this exercise is to study
linear regression on the principal components associated with the original explanatory variables
x1 , . . . , xp , which is particularly useful and entails dimensionality reduction when the explanatory
variables are correlated. Throughout the text, we fix n vectors x1 , . . . , xn ∈ Rp . For the sake of
simplicity, we do not include intercepts so that the matrix xn simply writes
 1 
x1 · · · xp1
xn =  ... ..  ∈ Rn×p .

. 
xn · · · xpn
1

⁸ We recall that if A and B are symmetric matrices of size q × q, A is said to be larger than B if ⟨u, (A − B)u⟩ ≥ 0
for all u ∈ Rq .

We also assume that the data are centered, and introduce the matrix cn ∈ Rn×p whose i-th row is
(c1i , . . . , cpi ), where c1 , . . . , cp are the principal components of xn . We denote by e1 , . . . , ep the
orthonormal basis associated with the eigenvalues λ1 ≥ · · · ≥ λp of Kn = (1/n) xn⊤ xn . We assume
that λp > 0.

1. Show that xn has full rank.

2. Show that, for any β ∈ Rp , there exists γ ∈ Rp such that xn β = cn γ. Express the
coordinates of β in terms of γ.
3. Let yn ∈ Rn be the corresponding explained variables, and let β̂, γ̂ respectively denote the
   LSE associated with the regression of yn on xn and on cn . Show that xn β̂ = cn γ̂.

4. We now fix k ≤ p and consider the linear regression of yn on the k first principal components.
   To this aim, we introduce the matrix ckn ∈ Rn×k whose i-th row is (c1i , . . . , cki ), and denote by

       γ̂k = (ckn⊤ ckn)−1 ckn⊤ yn ∈ Rk

   the corresponding LSE. Show that the computation of γ̂k reduces to k simple linear regressions.

5. From the computation of γ̂k , we define the PCR estimator of β by

       β̂jk = Σ_{l=1}^k γ̂lk ejl ,       j = 1, . . . , p.

   (Notice the similarity of this definition with the result of Question 2.) Show that, in the
   Gaussian model, Cov[β̂k] ≤ Cov[β̂], where the inequality has to be understood in the sense
   of symmetric matrices. ◦

5.B Summary
5.B.1 Regression and classification
Regression and classification aim to describe a functional relation between explanatory variables
x1 , . . . , xp and an explained variable y.
• Regression: y takes numerical values.

• Classification: y takes categorical values.

• Both methods are examples of supervised learning: the parameters are estimated on a
dataset containing pairs of explanatory and explained variables, and the estimation is then
employed to predict the value of y given a new value of x.

5.B.2 Linear regression


• The model writes y = β0 + Σ_{j=1}^p βj xj + ǫ.
β = (β0 , . . . , βp ) has to be estimated.
ǫ is a noise or error term.
• Least Square Estimator β̂: defined through orthogonal projection.

• In the Gaussian framework, β̂ is the MLE and it is the Best Linear Unbiased Estimator.

• Tests of independence and confidence intervals for the prediction of y may be constructed.

5.B.3 Logistic regression


• The explained variable is binary: y ∈ {0, 1}.
Its law is characterised by p(1|x).
The purpose is to estimate p(1|x).
• It is assumed that p(1|x) = Ψ(β0 + Σ_{j=1}^p βj xj).
Ψ is the logistic function.
β = (β0 , . . . , βp ) has to be estimated.

• The MLE β̂ is computed numerically; it may then be employed for classification or inde-
pendence tests.
Chapter 6

Independence and homogeneity tests

In Chapters 3 and 4, hypothesis testing is made on data which take the form of a sample Xn =
(X1 , . . . , Xn ) of iid random variables. In practice, the formalism of hypothesis testing can be
expanded to address more general situations, the following two of which will be discussed in the
present chapter:

• independence tests: one observes iid realisations (X1 , Y1 ), . . . , (Xn , Yn ) and wishes to
know whether the random variables Xi and Yi are independent;

• homogeneity tests: one observes two (or more) samples X1,n1 = (X1,1 , . . . , X1,n1 ) and
X2,n2 = (X2,1 , . . . , X2,n2 ) and wishes to know whether these samples have the same dis-
tribution.

The practical applications of such tests are obvious and ubiquitous:

• to determine whether life expectancy1 depends on the wealth of a country, you may collect
the life expectancy Xi and the GDP2 Yi of a panel of n countries and test whether these
variables are dependent or not;

• to determine whether a vaccine is efficient, you may take two groups of patients, respectively
with n1 and n2 people, administer the vaccine only to the first group, then measure the
concentration of antibodies X1,i1 and X2,i2 in the blood of the patients for both groups,
and finally test whether the samples X1,1 , . . . , X1,n1 and X2,1 , . . . , X2,n2 are identically
distributed.

The present chapter describes several independence and homogeneity tests, adapted either to
a parametric framework or to the nonparametric case. A summary of all these tests is provided in
Section 6.B at the end of the chapter.

6.1 Independence tests


6.1.1 χ2 tests
In this section, we assume that we observe independent realisations (X1 , Y1 ), . . . , (Xn , Yn ) of
pairs of random variables taking their values in the product space X × Y, where X and Y are finite
spaces, with respective cardinality m and l. In other words, for each experiment i ∈ {1, . . . , n},
¹ Espérance de vie en français.
² PIB en français.

two features Xi ∈ X and Yi ∈ Y are collected. A natural question is whether these features are
independent or not.

Example 6.1.1 (Is your second child more likely to be a boy when the first one is a boy?). Over a
population of 1000 families with two children, the gender of both children is recorded. The results
are represented in the contingency table of Table 6.1. We want to check whether the gender of the
first child has an influence on the gender of the second child.

                         Second child: male    Second child: female    Total
  First child: male             240                    260              500
  First child: female           228                    272              500
  Total                         468                    532             1000

Table 6.1: Contingency table for the study of Example 6.1.1. The cells contain the numbers of
individuals with the corresponding features.

The distribution of the pair (X1 , Y1 ) is denoted by P . It is a probability measure on the finite
space X × Y, with probability mass function (px,y )(x,y)∈X×Y . The marginal distributions of X1 and
Y1 are respectively denoted by P X and P Y , and their probability mass functions (pXx )x∈X and
(pYy )y∈Y satisfy

    ∀x ∈ X, pXx = Σ_{y∈Y} px,y ,        ∀y ∈ Y, pYy = Σ_{x∈X} px,y .

The variables X1 and Y1 are independent if and only if

    ∀(x, y) ∈ X × Y,   px,y = pXx pYy .                (I)

We denote by P0 the set of all probability measures on X × Y satisfying the condition (I). Notice
that each element P of P0 is characterised by the pair of vectors P X and P Y , therefore because
of the constraints Σ_{x∈X} pXx = 1 and Σ_{y∈Y} pYy = 1, P0 is parametrised by a subset Θ of
R(m−1)+(l−1) with nonempty interior.

On the other hand, the quantities px,y , pXx and pYy are respectively estimated by

    p̂n,x,y = (1/n) Σ_{i=1}^n 1{Xi =x, Yi =y} ,     p̂Xn,x = (1/n) Σ_{i=1}^n 1{Xi =x} ,     p̂Yn,y = (1/n) Σ_{i=1}^n 1{Yi =y} ,

and we denote by P̂n and P̂nX ⊗ P̂nY the random probability measures on X × Y with respective
probability mass functions (p̂n,x,y )(x,y)∈X×Y and (p̂Xn,x p̂Yn,y )(x,y)∈X×Y .
n,x p
With this notation, Proposition 4.1.13 in Chapter 4 allows to construct a test of independence,
for the hypotheses

H0 = {X and Y are independent} = {P ∈ P0 }, H1 = {P 6∈ P0 }.

The test statistic writes

    d′n = n χ²(P̂n | P̂nX ⊗ P̂nY ) = n Σ_{x∈X, y∈Y} (p̂n,x,y − p̂Xn,x p̂Yn,y)² / (p̂Xn,x p̂Yn,y) ,
x∈X,y∈Y

and under H0 , it converges in distribution to the χ² distribution with

    (ml − 1) − ((m − 1) + (l − 1)) = (m − 1)(l − 1)

degrees of freedom (the first term is the cardinality of X × Y minus 1, the second term is the
dimension of P0 ), which yields the following result.

Proposition 6.1.2 (χ² test of independence). The test rejecting H0 as soon as

    d′n ≥ χ²(m−1)(l−1),1−α

is consistent and has asymptotic level α.

In order to apply this test to the case of Example 6.1.1, we take X = Y = {m, f} to encode
the gender of the first and second child, respectively, and compute the empirical frequencies in
Table 6.2.

                         Second child: male      Second child: female     Total
  First child: male      p̂(m,m) = 0.24           p̂(m,f) = 0.26           p̂Xm = 0.5
  First child: female    p̂(f,m) = 0.228          p̂(f,f) = 0.272          p̂Xf = 0.5
  Total                  p̂Ym = 0.468             p̂Yf = 0.532             1

Table 6.2: Empirical frequencies for the study of Example 6.1.1. For the sake of legibility, we
omit the dependence upon n = 1000 in the notation of the empirical frequencies.

The value of the test statistic in this example is then

    d′1000 = 1000 × ( (0.24 − 0.5 × 0.468)²/(0.5 × 0.468) + (0.26 − 0.5 × 0.532)²/(0.5 × 0.532)
                      + (0.228 − 0.5 × 0.468)²/(0.5 × 0.468) + (0.272 − 0.5 × 0.532)²/(0.5 × 0.532) ) ≃ 0.578.

Using the R command qchisq(.9,1), we obtain that the quantile of order 0.9 of the χ2 (1)
distribution is 2.7, so that H0 cannot be rejected at the level α = 10% (and consequently at any
lower level): we conclude that the gender of the second child is independent of the gender of the
first child. Using the command 1-pchisq(.578,1), we actually obtain that the p-value of the
test is approximately 0.45.
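The same computation can be done directly in R with chisq.test applied to the contingency table of Table 6.1; the option correct = FALSE disables the continuity correction so that the statistic coincides with d′n:

    tab <- matrix(c(240, 260,
                    228, 272), nrow = 2, byrow = TRUE,
                  dimnames = list(first = c("male", "female"),
                                  second = c("male", "female")))
    chisq.test(tab, correct = FALSE)   # statistic ~ 0.578, df = 1, p-value ~ 0.45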

6.1.2 Regression-based tests


The purpose of both linear and logistic regression is to describe the dependency between the
explained variable Y and the explanatory variables x1 , . . . , xp quantitatively. Thus, the Student
and Fisher tests (for linear regression) and the likelihood ratio test (for logistic regression) can be
seen as tests of independence, in the sense that if a parameter βj is equal to 0, then Y does not
depend on the corresponding variable xj . However, just like any parametric test, these tests only
make sense if the data are well represented by linear or logistic models.

6.2 Two-sample homogeneity tests


6.2.1 In the Bernoulli model: comparison of proportions
In this section, two Bernoulli samples X1,n1 = (X1,1 , . . . , X1,n1 ) ∈ {0, 1}n1 and X2,n2 =
(X2,1 , . . . , X2,n2 ) ∈ {0, 1}n2 are given. They are sometimes represented in a 2 × 2 contingency
table, as is described in Table 6.3.

                   First sample                       Second sample                      Total
  Number of 1      Y1 = Σ_{i1=1}^{n1} X1,i1           Y2 = Σ_{i2=1}^{n2} X2,i2           Y = Y1 + Y2
  Number of 0      n1 − Y1                            n2 − Y2                            n1 + n2 − Y
  Total            n1                                 n2                                 n1 + n2

Table 6.3: Representation of the samples X1,n1 and X2,n2 in a 2 × 2 contingency table.

It is assumed that the two samples are independent, and

∀i1 ∈ {1, . . . , n1 }, X1,i1 ∼ B(p1 ), ∀i2 ∈ {1, . . . , n2 }, X2,i2 ∼ B(p2 ).

We want to test whether the two samples have the same distribution, so that we set

H0 = {p1 = p2 }, H1 = {p1 6= p2 }.

Example 6.2.1 (Decrease in Captain Haddock’s alcoholism after meeting Tintin3 ). In a very se-
rious paper4 , the health issues of Captain Haddock in The Adventures of Tintin are studied. In
particular, it is shown that before meeting Tintin, Captain Haddock sustains n1 = 24 health im-
pairments, 58.3% of which are due to alcohol, while after meeting Tintin, this proportion drops to
10.7% of his n2 = 225 health impairments (see Table 6.4). The test of comparison of proportions
allows to decide whether this decrease is statistically significant.

                            Before meeting Tintin    After meeting Tintin    Total
  Alcohol-related HI                 14                       24               38
  Non alcohol-related HI             10                      201              211
  Total                              24                      225              249

Table 6.4: Contingency table for Captain Haddock’s health impairments (HI).

An estimator of the parameter (p1 , p2 ) is (X 1,n1 , X 2,n2 ), so that it is natural to reject H0 when
|X 1,n1 − X 2,n2 | takes large values. In order to ensure that the level of the test be lower than α,
one should thus select a > 0 such that

    sup_{p∈[0,1]} P(p,p)( |X̄1,n1 − X̄2,n2| ≥ a ) ≤ α.

The type I error appearing in the left-hand side is not easily computable, which prevents from
obtaining a satisfactory threshold a. We hereby present an asymptotic and a nonasymptotic test
circumventing this difficulty.
³ Suggested by K. Jean.
⁴ E. Caumes, L. Epelboin, G. Guermonprez, F. Leturcq and P. Clarke, Captain Haddock's health issues in the adven-
tures of Tintin. Comparison with Tintin's health issues, La Presse Médicale, vol. 45(7-8), 2016, p. 225-232.

Asymptotic test: Z-test


When n1 and n2 are large, the Central Limit Theorem yields the approximations

    X̄1,n1 ≃ Z1 ∼ N( p1 , p1(1 − p1)/n1 ),       X̄2,n2 ≃ Z2 ∼ N( p2 , p2(1 − p2)/n2 ),

and Z1 , Z2 are independent. Thus, under H0 , with p = p1 = p2 ,

    X̄1,n1 − X̄2,n2 ≃ Z1 − Z2 ∼ N( 0 , p(1 − p)(1/n1 + 1/n2) ),

and

    X̄n = (1/(n1 + n2)) ( Σ_{i1=1}^{n1} X1,i1 + Σ_{i2=1}^{n2} X2,i2 ) → p,   almost surely,

so that

    ζn1,n2 = √( n1 n2/(n1 + n2) ) × (X̄1,n1 − X̄2,n2) / √( X̄n(1 − X̄n) ) → N(0, 1),   in distribution.

We therefore deduce the following result.
Lemma 6.2.2 (Asymptotic Z-test). The test rejecting H0 when
|ζn1 ,n2 | ≥ φ1−α/2
is consistent and has asymptotic level α.
With the data of Example 6.2.1, ζn1 ,n2 = 6.16, so that the p-value of the observation, obtained
with the R command 2*pnorm(-6.16,mean=0,sd=1) (why?), is less than 10−9 : H0 is rejected
at all usual levels.
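The computation of ζn1,n2 for Example 6.2.1 can be reproduced in R by hand, or with prop.test (whose statistic, without continuity correction, is ζ²n1,n2):

    y1 <- 14; n1 <- 24      # alcohol-related health impairments before meeting Tintin
    y2 <- 24; n2 <- 225     # after meeting Tintin

    p1 <- y1 / n1; p2 <- y2 / n2
    p  <- (y1 + y2) / (n1 + n2)
    zeta <- sqrt(n1 * n2 / (n1 + n2)) * (p1 - p2) / sqrt(p * (1 - p))
    zeta                               # ~ 6.16
    2 * pnorm(-abs(zeta))              # two-sided p-value, smaller than 1e-9

    prop.test(c(y1, y2), c(n1, n2), correct = FALSE)   # same test, statistic = zeta^2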

Fisher’s exact test


If n1 and n2 are not large, a nonasymptotic test is given by Fisher’s exact test, which relies on the
following conditioning result. As in Table 6.3, we write Y1 = n1 X 1,n1 and Y2 = n2 X 2,n2 , so that
Y = Y1 + Y2 is the total number of 1 in the union of the samples X1,n1 and X2,n2 . Notice that
under H0 , denoting p = p1 = p2 , we have Y ∼ B(n1 + n2 , p).
Lemma 6.2.3 (Conditioning by the sum). Under H0 , for any y1 ∈ {0, . . . , n1 } and y2 ∈ {0, . . . , n2 },
for all p ∈ [0, 1],

    P(p,p)(Y1 = y1 , Y2 = y2 | Y = y1 + y2) = C(n1, y1) C(n2, y2) / C(n1 + n2, y1 + y2) ,

where C(n, k) denotes the binomial coefficient.
Proof. The computation is straightforward: on the one hand,

    P(p,p)(Y1 = y1 , Y2 = y2 , Y = y1 + y2) = P(p,p)(Y1 = y1 , Y2 = y2)
                                            = C(n1, y1) p^{y1} (1 − p)^{n1−y1} C(n2, y2) p^{y2} (1 − p)^{n2−y2} ,

while on the other hand,

    P(p,p)(Y = y1 + y2) = C(n1 + n2, y1 + y2) p^{y1+y2} (1 − p)^{n1+n2−y1−y2} ,

so that the terms depending on p cancel in the definition of the conditional probability.

Remark 6.2.4 (Hypergeometric distribution). An integer random variable Z is said to have the
hypergeometric distribution with parameters N ≥ 0, K ∈ {0, . . . , N } and n ∈ {0, . . . , N } if for
all k ∈ {0, . . . , N },

    P(Z = k) = C(K, k) C(N − K, n − k) / C(N, n) .

Thus, Lemma 6.2.3 can be rephrased as stating that the conditional distribution of Y1 given Y is
the hypergeometric distribution with parameters n1 + n2 , n1 and Y . ◦

Assume that the observed values of x1 = y1 /n1 and x2 = y2 /n2 are such that x1 < x2 . To
know whether this difference is statistically significant, one may try to compute the probability
for X 1,n1 to take even smaller values than x1 under H0 , and reject H0 if this probability is small
enough. Writing

P(p,p)(Y1 ≤ y1 ) = P(p,p)(Y1 ≤ y1 |Y = y1 + y2 )P(p,p) (Y = y1 + y2 )

allows to bring the quantity P(p,p) (Y1 ≤ y1 |Y = y1 + y2 ) out, which by Lemma 6.2.3 does
not depend on p and therefore can be computed explicitly. This conditioning procedure is the
cornerstone of Fisher’s exact test.

Definition 6.2.5 (Fisher’s exact test). Fisher’s exact test is defined by the following procedure.
Denote by y1 and y2 the observed value of Y1 and Y2 , and write x1 = y1 /n1 , x2 = y2 /n2 .

• If x1 = x2 , accept H0 .

• If x1 < x2 : compute ρ = P(Y1′ ≤ y1 ), where Y1′ is a hypergeometric random variable with


parameters n1 + n2 , n1 and y1 + y2 , and reject H0 if ρ ≤ α/2.

• If x1 > x2 : compute ρ = P(Y1′ ≥ y1 ), where Y1′ is a hypergeometric random variable with


parameters n1 + n2 , n1 and y1 + y2 , and reject H0 if ρ ≤ α/2.

Proposition 6.2.6 (Fisher’s exact test). Fisher’s exact test has a level lower than α.

Proof. Let Wn1,n2 refer to the event 'H0 is rejected'. For all p ∈ [0, 1], we first write

    P(p,p)(Wn1,n2) = Σ_{y=0}^{n1+n2} P(p,p)(Wn1,n2 | Y = y) P(p,p)(Y = y),

and for all y ∈ {0, . . . , n1 + n2 },

P(p,p)(Wn1 ,n2 |Y = y) = P(p,p)(Wn1 ,n2 , X 1,n1 = X 2,n2 |Y = y)


+ P(p,p)(Wn1 ,n2 , X 1,n1 < X 2,n2 |Y = y)
+ P(p,p)(Wn1 ,n2 , X 1,n1 > X 2,n2 |Y = y).

By the construction of the test, P(p,p)(Wn1 ,n2 , X 1,n1 = X 2,n2 |Y = y) = 0. We now write

P(p,p)(Wn1 ,n2 , X 1,n1 < X 2,n2 |Y = y) = P(p,p)(ρy (Y1 ) ≤ α/2, X 1,n1 < X 2,n2 |Y = y)
≤ P(p,p)(ρy (Y1 ) ≤ α/2|Y = y),

where ρy is the cumulative distribution function of the hypergeometric distribution with parame-
ters n1 + n2 , n1 and y; in other words, ρy is the cumulative distribution function of the conditional
distribution of Y1 on the event Y = y. Thus, Exercise 6.2.8 below yields

P(p,p) (ρy (Y1 ) ≤ α/2|Y = y) ≤ α/2.

By similar arguments, we get

P(p,p)(Wn1 ,n2 , X 1,n1 > X 2,n2 |Y = y) ≤ α/2,

and finally deduce

    P(p,p)(Wn1,n2) ≤ Σ_{y=0}^{n1+n2} ( α/2 + α/2 ) P(p,p)(Y = y) = α,

which completes the proof.

Remark 6.2.7. The denomination ‘exact’ for this test has to be understood as ‘nonasymptotic’.
However, it is important to observe that Proposition 6.2.6 only provides an upper bound on the
level of the test. ◦
Exercise 6.2.8. Let Z be a random variable with cumulative distribution function ρ. Show that,
for all r ∈ [0, 1],
P(ρ(Z) ≤ r) ≤ r.
Hint: recall that in the proof of Lemma 4.2.13, p. 93, we showed that for any u ∈ (0, 1),
ρ(ρ−1 (u)) ≥ u. ◦
Applying Fisher’s exact test on the data of Example 6.2.1 with the R command fisher.test
yields a p-value of the order of 10−7 , so that we reject H0 at all usual levels.
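In R, the test is applied to the contingency table of Table 6.4 as follows (note that fisher.test implements a two-sided rule slightly different from Definition 6.2.5, but it leads to the same conclusion here):

    haddock <- matrix(c(14,  24,
                        10, 201), nrow = 2, byrow = TRUE,
                      dimnames = list(HI = c("alcohol", "other"),
                                      period = c("before", "after")))
    fisher.test(haddock)   # p-value of the order of 1e-7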

6.2.2 In the Gaussian model: Fisher and Student tests


In this section, we assume that we observe two independent samples X1,n1 = (X1,1 , . . . , X1,n1 )
and X2,n2 = (X2,1 , . . . , X2,n2 ) of Gaussian random variables:

∀i1 ∈ {1, . . . , n1 }, X1,i1 ∼ N(µ1 , σ12 ), ∀i2 ∈ {1, . . . , n2 }, X2,i2 ∼ N(µ2 , σ22 ).

We shall study two tests:


• Fisher’s test of homoscedasticity with null and alternative hypotheses

H0 = {σ12 = σ22 }, H1 = {σ12 6= σ22 };

• Student’s test of homogeneity, in which it is assumed that σ12 = σ22 and

H0 = {µ1 = µ2 }, H1 = {µ1 6= µ2 }.

Example 6.2.9 (Grades of IMI and SEGF students in 20185 ). The statistics associated with the
grades at the final exam of the course Statistics and Data Analysis in 2018 are reported in Ta-
ble 6.5. Assuming that these samples are Gaussian, we want to know whether there is a statisti-
cally significant difference between IMI and SEGF students.
⁵ Thanks to P. Gréaume for this study.

                          IMI          SEGF
  Number of students      n1 = 43      n2 = 35
  Average                 14.08        12.86
  Standard deviation      1.9          1.84

Table 6.5: Statistics of the grades at the final exam of the course Statistics and Data Analysis in
2018.

Student’s test of homogeneity


Here, we suppose that it is already known that the two samples have the same variance. The latter
is denoted by σ 2 , however its value is not assumed to be known. Then we have

∀i1 ∈ {1, . . . , n1 }, X1,i1 ∼ N(µ1 , σ 2 ), ∀i2 ∈ {1, . . . , n2 }, X2,i2 ∼ N(µ2 , σ 2 ),

and will work with the hypotheses

H0 = {µ1 = µ2 }, H1 = {µ1 6= µ2 }.

Lemma 6.2.10 (Student's test). Under H0 , the statistic

    Tn1,n2 = √( (n1 + n2 − 2) / (1/n1 + 1/n2) ) × (X̄1,n1 − X̄2,n2) / √( (n1 − 1) S²1,n1 + (n2 − 1) S²2,n2 )

is distributed according to the Student distribution t(n1 + n2 − 2).


As a consequence, the test rejecting H0 as soon as

|Tn1 ,n2 | ≥ tn1 +n2 −2,1−α/2

has level α.

Proof. By Proposition A.4.3, under H0 ,

    X̄1,n1 − X̄2,n2 ∼ N(0, σ²/n1 + σ²/n2),

while

    ( (n1 − 1) S²1,n1 + (n2 − 1) S²2,n2 ) / σ² ∼ χ²(n1 + n2 − 2),

and these two variables are independent. As a consequence,

    Tn1,n2 = [ (X̄1,n1 − X̄2,n2) / √( σ²(1/n1 + 1/n2) ) ] / √( ( (n1 − 1) S²1,n1 + (n2 − 1) S²2,n2 ) / ( (n1 + n2 − 2) σ² ) ) ∼ t(n1 + n2 − 2),

and the proof is completed.

Remark 6.2.11 (Adaptation when σ12 6= σ22 ). Student’s test can be adapted when the assumption
that the two samples have the same variance is not fulfilled. The resulting test is called Welch’s
t-test [3, Section 11.3.1]. ◦

Fisher’s test of homoscedasticity


We no longer assume that the two samples have the same variance, and work with the hypotheses

    H0 = {σ²1 = σ²2},       H1 = {σ²1 ≠ σ²2}.

By Proposition A.4.3, under H0 we have

    Fn1,n2 = S²1,n1 / S²2,n2 ∼ F(n1 − 1, n2 − 1).

Therefore Fisher's two-sided test consists in rejecting H0 as soon as

    Fn1,n2 ∉ [fn1−1,n2−1,α/2 , fn1−1,n2−1,1−α/2],

where fn1−1,n2−1,r is the quantile of order r of F(n1 − 1, n2 − 1). This ensures that this test has
level α.
Remark 6.2.12 (Other tests). Fisher’s test is known to be very sensitive to the assumption that the
sample be Gaussian. More robust tests of homoscedasticity are reviewed in [3, Section 11.6]. ◦
With the data of Example 6.2.9, the application of Fisher’s test yields a p-value of 0.966, which
allows to accept the hypothesis that the two samples have the same variance. Therefore we may
perform Student’s test and obtain a p-value of 0.005, which indicates that there is a statistically
significant difference between IMI and SEGF students: it is up to SEGF students to show that
2018 was an outlier!
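The raw grades of Example 6.2.9 are not reproduced in these notes, so the sketch below illustrates the two tests on simulated Gaussian samples with the same sizes, means and standard deviations as in Table 6.5 (the p-values will therefore differ slightly from those reported above):

    set.seed(11)
    grades_imi  <- rnorm(43, mean = 14.08, sd = 1.90)
    grades_segf <- rnorm(35, mean = 12.86, sd = 1.84)

    var.test(grades_imi, grades_segf)                   # Fisher's test of H0: sigma1^2 = sigma2^2
    t.test(grades_imi, grades_segf, var.equal = TRUE)   # Student's test of H0: mu1 = mu2
    # t.test(..., var.equal = FALSE) would perform Welch's t-test of Remark 6.2.11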

6.2.3 In the nonparametric framework


In this section, we consider nonparametric tests of homogeneity, aiming to check whether two
independent samples X1,n1 = (X1,1 , . . . , X1,n1 ) and X2,n2 = (X2,1 , . . . , X2,n2 ) have the same
distribution, without making any parametric assumption on this distribution (unlike in the previous
section, where the samples were assumed to be Gaussian).
Example 6.2.13 (Efficiency of a vaccine). In order to study the efficiency of a vaccine, a group of
200 people is split into two groups of n1 = n2 = 100 people. The first group is treated with the
vaccine while the other group receives a placebo. One week after, the concentration of antibodies
in the patients’ blood is measured for both groups. The corresponding histograms are plotted on
Figure 6.1.
Clearly, the distribution of the concentration in antibodies is not Gaussian. Therefore Student's
and Fisher's tests are not appropriate, and a nonparametric homogeneity test must be employed.

Graphical visualisation: the Q-Q plot


Before introducing quantitative methods, we present a popular graphical method which provides
a first idea of the closeness between two distributions. Let us denote by F̂1,n1 and F̂2,n2 the
respective empirical CDFs of the two samples. For all u ∈ (0, 1), the empirical quantiles of order
u in each sample are given by F̂1,n1^{-1}(u) and F̂2,n2^{-1}(u), where we recall the Definition 4.2.11 of the
quantile function.
quantile function.
Definition 6.2.14 (Q-Q plot). The Q-Q plot of the samples X1,n1 and X2,n2 is the parametric
curve

    u ∈ [0, 1] ↦ ( F̂1,n1^{-1}(u), F̂2,n2^{-1}(u) ) ∈ R² .

Histogram of Group1 Histogram of Group2

12

15
10
8
Frequency

Frequency

10
6
4

5
2
0

0
5 6 7 8 9 10 11 5 6 7 8 9 10 11

Group1 Group2

Figure 6.1: Concentration in antibodies for both groups.

Three examples are plotted on Figure 6.2 below: one where the samples have the same dis-
tribution, one where the laws of the two samples are linearly related, one where the samples have
distinct laws.
Figure 6.2: Q-Q plots of: (i) two samples with identical N(0, 1) distribution; (ii) two samples
with respective distributions N(0, 1) and N(3, 2); (iii) two samples with respective distributions
N(0, 1) and E(1).

The interpretation is that the closer the scatter plot is to the diagonal, the more similar both
distributions are. Notice that when the scatter plot has a linear shape, but distinct from the di-
agonal, it expresses the fact that the two CDFs F1 and F2 have a linear relation of the form
F2 (x) = F1 ((x − b)/a). In particular, if the Q-Q plot of a single sample with the standard Gaus-
sian distribution is linear, then it is likely that the sample be Gaussian, and normality tests such as
the Lilliefors test may be applied.
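The three situations of Figure 6.2 can be reproduced in R with the function qqplot; the diagonal is added with abline(0, 1):

    set.seed(12)
    x1 <- rnorm(100)                           # N(0, 1)
    x2 <- rnorm(100)                           # same distribution
    x3 <- rnorm(100, mean = 3, sd = sqrt(2))   # N(3, 2): linear Q-Q plot, off the diagonal
    x4 <- rexp(100)                            # E(1): curved Q-Q plot

    par(mfrow = c(1, 3))
    qqplot(x1, x2, main = "N(0,1) vs N(0,1)"); abline(0, 1)
    qqplot(x1, x3, main = "N(0,1) vs N(3,2)"); abline(0, 1)
    qqplot(x1, x4, main = "N(0,1) vs E(1)");   abline(0, 1)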

The Kolmogorov–Smirnov test


Let us denote by F1 and F2 the respective CDFs of the samples X1,n1 and X2,n2 . The null and
alternative hypotheses for homogeneity tests are
H0 = {F1 = F2 }, H1 = {F1 6= F2 }.
The Kolmogorov–Smirnov test for these hypotheses is a variation of the Kolmogorov test studied
in Section 4.2. It is based on the Kolmogorov–Smirnov statistic

    ξn1,n2 = sup_{x∈R} | F̂1,n1(x) − F̂2,n2(x) | ,

which can be computed with similar arguments as those detailed in Remark 4.2.8. The test is
based on the following result.
Lemma 6.2.15 (Freeness of the Kolmogorov–Smirnov statistic). Assume that F1 and F2 are con-
tinuous. Under H0 , the statistic ξn1 ,n2 is free: its law only depends on n1 and n2 .
The proof of Lemma 6.2.15 is postponed to Exercise 6.A.1. The law of ξn1 ,n2 under H0 is
called the Kolmogorov–Smirnov distribution with parameters n1 and n2 , its quantile of order r is
denoted by xn1 ,n2 ,r .
Corollary 6.2.16 (Kolmogorov–Smirnov test). The test rejecting H0 when ξn1,n2 ≥ xn1,n2,1−α
has level α.
The application of this test to the data of Example 6.2.13, thanks to the R command ks.test,
yields a p-value of 0.7, which allows to accept H0 . This result is confirmed by the linear shape
of the Q-Q plot of these samples (Group2 against Group1).
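The antibody data of Example 6.2.13 are not reproduced here, so the following sketch applies ks.test to simulated samples drawn from a common non-Gaussian distribution, mimicking the situation of the example:

    set.seed(13)
    group1 <- 5 + rexp(100, rate = 1)   # hypothetical antibody concentrations, vaccine group
    group2 <- 5 + rexp(100, rate = 1)   # placebo group, same distribution

    ks.test(group1, group2)             # large p-value: H0 (same distribution) is not rejected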

Based on the results of Section 4.2, it is also possible to design an asymptotic version of the
Kolmogorov–Smirnov test. Another popular nonparametric test of homogeneity is the Mann–
Whitney U -test [6, Example 12.7, p. 66]. Wilcoxon’s test, which addresses matched samples, is
studied in Exercise 6.A.2.

6.3 Many-sample homogeneity tests


6.3.1 χ2 homogeneity tests
Let X1,n1 , . . . , Xk,nk be independent samples taking their values in a finite state space X, with re-
spective distributions P1 , . . . , Pk . To test the homogeneity of these samples, that is to say whether
P1 = · · · = Pk , a possible approach consists in applying a χ2 test of independence to the k × |X|
contingency table which, at the ℓ-th row and x-th column, contains the number
nℓ
X
Nℓ,x = 1{Xℓ,i =x}
i=1

of occurrences of x within the ℓ-th sample. Hence, in this case, the homogeneity test reduces to
the independence test of Section 6.1.1.

6.3.2 Analysis of variance in the Gaussian model


The technique of Analysis of Variance (ANOVA) allows to test the homogeneity of k ≥ 3 samples,
and therefore goes beyond the scope of Student’s test of Section 6.2.2. Let us be given k samples
X1,n1 = (X1,1 , . . . , X1,n1 ), . . . , Xk,nk = (Xk,1 , . . . , Xk,nk ), which are assumed to be Gaussian
and have the same variance:
∀ℓ ∈ {1, . . . , k}, ∀i ∈ {1, . . . , nℓ }, Xℓ,i ∼ N(µℓ , σ 2 ).
The purpose of ANOVA is to construct a test for the null and alternative hypotheses
H0 = {µ1 = · · · = µk }, H1 = {∃ℓ, m : µℓ 6= µm }.
Example 6.3.1 (Influence of environment on physical development). In order to measure the
influence of the environment on physical characteristics, three groups of 30-year-old men are
considered.
• The first group contains n1 = 35 inhabitants of Paris.
• The second group contains n2 = 22 inhabitants of Chamonix.
• The third group contains n3 = 29 inhabitants of Hendaye.
The associated empirical means and variances are reported in Table 6.6. We want to know whether
there is a statistically significant difference between these distributions, assuming that they are
Gaussian.
                          Paris        Chamonix      Hendaye
  Size of the sample      n1 = 35      n2 = 22       n3 = 29
  Average                 178.2        176.5         179.4
  Standard deviation      1.55         1.80          1.42

Table 6.6: Statistics of heights for Example 6.3.1.

A somewhat intuitive approach consists in computing the empirical averages

    X̄ℓ,· = (1/nℓ) Σ_{i=1}^{nℓ} Xℓ,i

of each sample, and to check whether they are highly scattered (dispersé in French) around the
global empirical average

    X̄·,· = ( Σ_{ℓ=1}^k Σ_{i=1}^{nℓ} Xℓ,i ) / ( Σ_{ℓ=1}^k nℓ ) .

This scattering is measured by the Sum of Squares for the Model

    SSM = Σ_{ℓ=1}^k nℓ ( X̄ℓ,· − X̄·,· )² .

We also introduce the Sum of Squares for the Total

    SST = Σ_{ℓ=1}^k Σ_{i=1}^{nℓ} ( Xℓ,i − X̄·,· )² ,

and the Sum of Squares for the Error

    SSE = Σ_{ℓ=1}^k Σ_{i=1}^{nℓ} ( Xℓ,i − X̄ℓ,· )² .

Remark 6.3.2 (Geometric interpretation). Let us define

    n = Σ_{ℓ=1}^k nℓ ,       Xn = (X1,n1 , . . . , Xk,nk ) ∈ Rn ,

and denote by E the k-dimensional linear subspace of Rn spanned by the vectors

(1n1 , 0n2 , . . . , 0nk ), (0n1 , 1n2 , . . . , 0nk ), . . . , (0n1 , 0n2 , . . . , 1nk ),

where for all ℓ ∈ {1, . . . , k},

1nℓ = (1, . . . , 1) ∈ Rnℓ , 0nℓ = (0, . . . , 0) ∈ Rnℓ .

We also denote by H the one-dimensional linear subspace of Rn spanned by the vector

1n = (1n1 , . . . , 1nk ) ∈ Rn ,

and notice that H ⊂ E. The orthogonal projections of Xn onto E and H are respectively denoted
by XEn and XHn . Then it is easily checked that

    SSM = ‖XEn − XHn‖² ,       SST = ‖Xn − XHn‖² ,       SSE = ‖Xn − XEn‖² .

Besides, since Xn − XEn ∈ E⊥ while XEn − XHn ∈ E, Pythagoras' Theorem yields

    ‖Xn − XHn‖² = ‖Xn − XEn‖² + ‖XEn − XHn‖² ,

which rewrites
SST = SSE + SSM. ◦

Following our intuitive approach, the test should reject H0 when SSM takes large values.
However, the magnitude of these values depends on the unknown parameter σ 2 , which in turn is
estimated by SSE (suitably renormalised). The next proposition clarifies these heuristic arguments
and allows to construct a rigorous test.

Proposition 6.3.3 (Variance decomposition). Under H0 , the statistic

    Fn = ( SSM/(k − 1) ) / ( SSE/(n − k) )

is distributed according to the Fisher distribution with k − 1 and n − k degrees of freedom.



Proof. Under H0 , let µn ∈ Rn be the vector µ1n , where µ is the common value of µ1 , . . . , µk .
Then
Xn = µ n + ǫ n , ǫn ∼ Nn (0, σ 2 In ).
Let us denote by ǫEn and ǫHn the respective orthogonal projections of ǫn onto E and H, so that

    XEn = µn + ǫEn ,       XHn = µn + ǫHn .

As a consequence,

    SSM = ‖XEn − XHn‖² = ‖ǫEn − ǫHn‖² ,       SSE = ‖Xn − XEn‖² = ‖ǫn − ǫEn‖² .

Combining Cochran's Theorem A.4.2 with the orthogonal decomposition

    ǫn = (ǫn − ǫEn) + (ǫEn − ǫHn) + ǫHn ,     where ǫn − ǫEn ∈ E⊥, ǫEn − ǫHn ∈ H′, ǫHn ∈ H,

and H′ denotes the orthogonal of H in E, we deduce that SSM and SSE are independent, with

    (1/σ²) SSM ∼ χ²(k − 1),       (1/σ²) SSE ∼ χ²(n − k).

It follows that Fn ∼ F(k − 1, n − k).

Corollary 6.3.4 (F-test for ANOVA). The test rejecting H0 as soon as Fn ≥ fk−1,n−k,1−α has
level α.

In practice, statistical softwares return an ANOVA table such as depicted in Table 6.7, in which
all useful quantities are summarised.

               Df        Sum Sq    Mean Sq         F value    Pr(>F)
  group        k − 1     SSM       SSM/(k − 1)     Fn         p-value
  Residuals    n − k     SSE       SSE/(n − k)

Table 6.7: The contents of an ANOVA table returned by R.

With the data of Example 6.3.1, where k = 3 and n = 86, the ANOVA table reads

               Df    Sum Sq    Mean Sq    F value    Pr(>F)
  group         2     105.7      52.8       20.51    5.7 × 10⁻⁸
  Residuals    83     213.8       2.58

In particular, H0 is rejected at all usual levels.
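In R, the ANOVA table is produced by aov followed by summary. The sketch below uses simulated heights with the group sizes, means and standard deviations of Table 6.6, so the numbers differ slightly from the table above:

    set.seed(14)
    heights <- c(rnorm(35, mean = 178.2, sd = 1.55),
                 rnorm(22, mean = 176.5, sd = 1.80),
                 rnorm(29, mean = 179.4, sd = 1.42))
    city <- factor(rep(c("Paris", "Chamonix", "Hendaye"), times = c(35, 22, 29)))

    summary(aov(heights ~ city))   # Df, Sum Sq, Mean Sq, F value and Pr(>F), as in Table 6.7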

6.A Exercises
Exercise 6.A.1. Prove Lemma 6.2.15. ◦

1 Exercise 6.A.2 (Wilcoxon’s test). The Wilcoxon signed-rank test is adapted to a slightly differ-
ent framework than homogeneity tests described so far, as it addresses matched samples7 . In this
context, it is assumed that the sample is a series of independent pairs (X1,i , X2,i )1≤i≤n , such that
⁷ Échantillons appariés en français.

the differences Zi = X1,i − X2,i are identically distributed. We recall that the law of a random
variable ζ is said to be symmetric if ζ and −ζ have the same law, and define the following null and
alternative hypotheses:

H0 = {the law of Z1 is symmetric}, H1 = {the law of Z1 is not symmetric}.

An example. The papers of n students at an exam are graded successively by two professors,
yielding the grades X1,i and X2,i for the i-th student. We want to know whether there is a ‘profes-
sor effect’ resulting from different grading methods. We assume that each paper has an intrinsic
value xi , and that the grade given by each professor is a random fluctuation around this intrinsic
value, so that
X1,i = xi + ǫ1,i , X2,i = xi + ǫ2,i ,
where the sequences (ǫ1,i )1≤i≤n and (ǫ2,i )1≤i≤n are independent, and within each of these se-
quences, the variables are iid with respective distributions P1 and P2 . In this context, what is the
relation between the null hypothesis H0 introduced above and the ‘natural’ homogeneity hypoth-
esis H0′ = {P1 = P2 }?
We shall make the technical assumption:

∀z ∈ R, P(Z1 = z) = 0. (∗)

This ensures that, almost surely, the values of Z1 , . . . , Zn are pairwise distinct, and therefore there
is a unique permutation π of {1, . . . , n}, the set of which is denoted by Sn , such that |Zπ(1) | <
· · · < |Zπ(n) |, and thus allows us to introduce the statistic

    T+ = Σ_{k=1}^n k 1{Zπ(k) > 0} ,

on which Wilcoxon’s test for H0 and H1 is based. We insist on the fact that the permutation π
depends on the realisation of the sample (Z1 , . . . , Zn ), and therefore is random.

1. Let ζ be a random variable with symmetric law, satisfying (∗). Show that the random vari-
ables8 sign(ζ) ∈ {−1, 1} and |ζ| > 0 are independent, and that sign(ζ) is a Rademacher
variable (that is to say that P(sign(ζ) = −1) = P(sign(ζ) = 1) = 1/2).

2. Let ζ1 , . . . , ζn be iid random variables, with symmetric law and satisfying (∗). We define
the random permutation π ∈ Sn to be such that |ζπ(1) | < · · · < |ζπ(n) |. Show that the
random variables sign(ζπ(1) ), . . . , sign(ζπ(n) ) are independent Rademacher variables.

3. Deduce that under H0 , the statistic T + is free, and describe a nonasymptotic two-sided test
for H0 .

4. Compute the expectation tn and the variance σn2 of T + under H0 .

5. Show that, under H0 , (T + − tn )/σn converges in distribution to N(0, 1). Deduce an asymp-
totic test for H0 . ◦

6.B Summary
We summarise the various tests which have been presented in these notes.
⁸ The condition (∗) ensures that ζ ≠ 0, almost surely, so that there is no need to take a convention to define the value
of sign(0).

6.B.1 Goodness-of-fit tests for one sample


Goodness-of-fit for a single distribution P∗

  Parametric model                                       Nonparametric model
  General model           Gaussian model                 Discrete model      Continuous model
  Nonasymptotic test      Tests for µ:                   χ² test             Kolmogorov test,
  (Section 3.2)             • σ² known: Z-test           (Section 4.1.2)       • asymptotic (Section 4.2.2)
  Student–Wald test         • σ² unknown: t-test                               • nonasymptotic (Section 4.2.3)
  (Exercise 3.A.4)        Test for σ²: F-test
                          (Section 3.3)

Goodness-of-fit for a family of distributions P∗

  Discrete model: χ² test (Section 4.1.3)          Continuous model: Lilliefors correction (Section 4.2.4)

6.B.2 Homogeneity tests

Homogeneity tests for two samples

  Independent samples                                                          Matched samples
  Bernoulli model            Gaussian model             Nonparametric model    Nonparametric model
  • Asymptotic Z-test        • µ1 = µ2: t-test          Kolmogorov–Smirnov     Wilcoxon test
  • Fisher's exact test      • σ²1 = σ²2: F-test        (Section 6.2.3)        (Exercise 6.A.2)
  (Section 6.2.1)            (Section 6.2.2)

Homogeneity tests for k samples

  Discrete model: χ² test (Section 6.3.1)          Gaussian model: F-test for ANOVA (Section 6.3.2)

6.B.3 Independence tests


• For discrete models: χ2 test, Section 6.1.1.

• Regression-based independence tests (Section 6.1.2):


Linear: t- and F-tests, Section 5.1.4
Logistic: likelihood ratio test, Section 5.2.2
Check-out list

The following exercises cover the main points of the course and are typical examples of what you
should be able to solve (almost) instantaneously at the end of the semester. The questions for
which a numerical computation is required are marked with the symbol [R].

Exercise 1 (Principal Component Analysis¹). We consider n = 30 observation stations along


the Doubs river, the location of which is represented on Figure (a) (the stations are numbered
according to the direction of the current). Each station collects p = 7 chemical data: pH (the
less pH, the more acid the water), water hardness (the concentration in mineral components),
the concentration in phosphate, nitrate, ammonia, oxygen, and the biological demand in oxygen.
These features are respectively denoted by pH, har, pho, nit, amm, oxy, bdo below.

1. Should PCA be normalised here?

2. The correlation circle of the first two principal components is plotted on Figure (b). Recall
how this figure is plotted, and give an interpretation of the score of a station on the first two
principal components.

3. The projection of the n stations in the first factorial plane is plotted on Figure (c), and the
scores of these stations on the first principal component is plotted on Figure (d). What do
you conclude?

(a) The location of the n stations along the Doubs river.
(b) The correlation circle (cercle des corrélations) for the first two principal components.
¹ Inspired from https://fr.wikipedia.org/wiki/Analyse_en_composantes_principales.

(c) The projection of the n stations in the first factorial plane.
(d) Scores of the stations on the first principal component.
Exercise 2 (Parametric estimation²). The Rayleigh distribution with parameter θ > 0 is the
probability distribution with density

    p(x; θ) = √(2/(πθ)) 1{x>0} exp( −x²/(2θ) ) .
We remark that under Pθ , X1 has the same law as |Z|, where Z ∼ N(0, θ), and recall that E[Z 2 ] =
θ, E[Z 4 ] = 3θ 2 .
1. Compute Eθ [X1 ] and deduce a moment estimator θen of θ.
2. Show that θen is asymptotically normal and compute its asymptotic variance.
3. Compute the MLE θbn of θ.
4. Show that θ̂n is unbiased and strongly consistent.
5. Show that θbn is asymptotically normal and compute its asymptotic variance.
6. Which of the estimators θen and θbn do you prefer?
7. Show that the MLE θbn is asymptotically efficient.
8. Construct an asymptotic confidence interval with level 1 − α for θ.
9. Show that under Pθ , the random variable θbn /θ is free and deduce an exact confidence inter-
val with level 1 − α for θ.
10. With the same arguments, construct a test with level α for the hypotheses H0 = {θ ≤ θ0 },
H1 = {θ > θ0 }. What is the p-value of an observation θbn = θ obs ?
Exercise 3 (Hypothesis testing). In order to be labelled as 'organic food', a food producer has
to ensure that each of his products contains less than 1% of GMO3 . He takes a sample of n = 25
products and computes the percentage of GMO in each of these products. We denote by Xi the
logarithm of this percentage for the i-th product and assume that X1 , . . . , Xn are iid under the law
N(θ, 1).
² Exercises 2 and 3 were proposed by Christophe Denis.
³ OGM en français.

1. For the producer, the products do not contain GMO unless the contrary is proved. He there-
fore sets
H0 = {θ ≤ 0}, H1 = {θ > 0}.

Construct a test with level α = 0.05 for these hypotheses.

2. [R] An environmental organisation wants to make sure that the products do not contain
GMO. In particular, they worry about the ability of the test to detect products with a quantity
of GMO which is 50% larger than what is authorised. Compute the probability that the test
concludes to the absence of GMO when the actual percentage of GMO is 1.5% (that is to
say with θ = log 1.5).

3. [R] Scandalised by this result, the organisation wants to modify the producer’s test. To
them, the products do contain GMO unless the contrary is proved. Which test should be
constructed? With n = 25 and α = 0.05, what is now the probability to conclude to the
absence of GMO when θ = log 1.5?

Exercise 4 (Nonparametric, independence and homogeneity tests).

1. [R] A type of plant has two genetic characters, each of which can take two forms:

• the first character can take the forms A or a;


• the second character can take the forms B or b.

We take a population of first generation plants in which the phenotypes AB, aB, Ab, ab
are equally distributed. In order to test the hypothesis H0 that A and B are dominant and
a and b are recessive, we perform random interbreedings in the population and obtain a
second generation. Under H0 , the theoretical probability of each phenotype in the second
generation, predicted by Mendel’s theory, is given below. For a population of n = 160
plants in the second population, the actual number of occurrences of each phenotype are
also given. What do you conclude?

  Phenotype                    AB      aB      Ab      ab
  Theoretical probability      9/16    3/16    3/16    1/16
  Experimental observation     100     18      24      18

2. Which test should you use in the following situations?

(a) You want to know if the time between two consecutive buses at a bus stop has an
exponential distribution.
(b) You want to know if eye colour and hair colour are independent.
(c) You want to know if the number of earthquakes per year follows a Poisson distribution.
(d) You want to know if men between age 20 and 30 have the same size in France and in
Italy.
(e) You want to know if men between age 20 and 30 have the same size all across Europe.
(f) You want to know if having young children has an influence on having a professional
activity, to this aim you compute the unemployment rate among 100 women with
children and among 100 women without children.

Exercise 5 (Linear regression⁴). [R] The Cobb–Douglas model in economics relates the total
production Y of a company (the real value of all goods produced) in a year with the labour input
L (the total number of person-hours worked) and the capital input K (a measure of all machinery,
equipment, and buildings) through the formula

Y = ALβ1 K β2 .

The variables Y , L and K are expressed in M$.


The constant A is called the total factor productivity. We let y = log Y , x1 = log L, x2 =
log K and assume the linear model

y = log A + β1 x1 + β2 x2 + ǫ, ǫ ∼ N(0, σ 2 ).

Applying linear regression on a database of n = 1658 companies for which Y , L and K are given,
we obtain the following results:

    β̂0 = 3.136        β̂1 = 0.738        β̂2 = 0.282
    R² = 0.945         ‖yn − ŷn‖² = 148.27

    (xn⊤ xn)−1 = [  0.0288   0.0012  −0.0034
                    0.0012   0.0016   0.0010
                   −0.0034   0.0010   0.0009 ]

b2 of σ 2 .
1. Compute the value of the unbiased estimator σ

2. Give a confidence interval with level 95% for the total factor productivity.

3. Give a prediction interval with level 95% for the total production of a company with labour
input L = 100 M$ and capital input K = 50 M$.

4. Determine, at the level 5%, if the capital input is a significant factor.

5. The model is said to display constant returns to scale⁵ if multiplying L and K by λ > 0
results in multiplying Y by λ.

(a) Express this condition in terms of β1 and β2 .


(b) Construct and apply a t-test deciding if the model displays constant returns to scale.

⁴ Taken from the lecture notes Régression linéaire by Arnaud Guyader: http://www.lpsm.paris/pageperso/guyader.
⁵ « Rendements d'échelle constants » in French.
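For reference, a minimal R sketch of how such a regression could be fitted if the data were available in a data frame (the data frame companies and its column names Y, L, K are hypothetical; lm fits the model on the log scale used in the exercise):

# companies <- read.csv("companies.csv")   # hypothetical file with columns Y, L, K (in M$)
fit <- lm(log(Y) ~ log(L) + log(K), data = companies)

summary(fit)   # estimates of log(A), beta1, beta2, standard errors, R^2, t-tests
confint(fit)   # 95% confidence intervals (exponentiate the one for the intercept to get one for A)

# 95% prediction interval for log(Y) of a company with L = 100 and K = 50
predict(fit, newdata = data.frame(L = 100, K = 50),
        interval = "prediction", level = 0.95)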
Appendix A

Complements on Gaussian statistics

This appendix contains a short introduction to several distributions related to Gaussian variables or
vectors, which were used in the previous chapters. We refer to [2] for a more detailed exposition.

A.1 Gaussian-related distributions


Definition A.1.1 (χ2 distribution). For all n ≥ 1, the χ2 distribution with n degrees of freedom is
the law of the random variable
\[ Y_n = \sum_{i=1}^{n} X_i^2, \]
where X1 , . . . , Xn are independent N(0, 1) random variables. We write Yn ∼ χ2 (n).
It is immediate from the definition of the χ2 distribution that

E[Yn ] = n, Var(Yn ) = 2n.

Besides, by the strong Law of Large Numbers, Yn /n converges almost surely to 1.


It is a classical exercise [2, Proposition 3.4.6, p. 53] to check that χ2 (n) = Γ(n/2, 1/2). The
χ2 distribution is also called Pearson distribution.
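As a quick sanity check, these moments can be compared with simulated values in R (a small illustrative sketch, with arbitrary n and number of replications):

set.seed(1)
n <- 5
y <- rchisq(1e5, df = n)   # or equivalently: rowSums(matrix(rnorm(1e5 * n), ncol = n)^2)
mean(y)                     # close to n
var(y)                      # close to 2n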
Definition A.1.2 (Student distribution). Let X ∼ N(0, 1) and Yn ∼ χ2 (n) be independent random
variables. The law of the random variable
\[ T_n = \frac{X}{\sqrt{Y_n/n}} \]
is called the Student distribution with n degrees of freedom. It is denoted by t(n).
Exercise A.1.3. For all n ≥ 1, let Tn ∼ t(n). Show that Tn converges in distribution to a random
variable T ∼ N(0, 1). ◦
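This convergence can be visualised in R by comparing quantiles (illustrative sketch):

sapply(c(5, 30, 100, 1000), function(n) qt(0.975, df = n))   # t(n) quantiles of order 97.5%
qnorm(0.975)                                                  # the N(0,1) limit, about 1.96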
Definition A.1.4 (Fisher distribution). Let Yn1 ∼ χ2 (n1 ) and Yn2 ∼ χ2 (n2 ) be independent
random variables. The law of the random variable
\[ F_{n_1,n_2} = \frac{Y_{n_1}/n_1}{Y_{n_2}/n_2} \]
is called the Fisher distribution with n1 and n2 degrees of freedom. It is denoted by F(n1 , n2 ).
The Fisher distribution is also called Fisher–Snedecor, or Snedecor distribution.

A.2 Gaussian vectors


Definition A.2.1 (Gaussian vector). A random vector X = (X1 , . . . , Xd ) ∈ Rd is called a gaus-
sian vector if, for all u ∈ Rd , the random variable hu, Xi is a Gaussian random variable, that is
to say that there exist µ ∈ R and σ 2 ≥ 0 such that
\[ \langle u, X \rangle = \sum_{i=1}^{d} u_i X_i \sim N(\mu, \sigma^2). \]

In the previous definition, the variance σ 2 is allowed to vanish so that we take the convention
that deterministic random variables are Gaussian random variables with zero variance.
If X = (X1 , . . . , Xd ) ∈ Rd is a random vector, then for all u ∈ Rd ,

µ = E[hu, Xi] = hu, mi, σ 2 = Var(hu, Xi) = hu, Kui,

where m = E[X] and K is the covariance matrix of X. The next result then follows from the
expression of the characteristic function of Gaussian random variables.

Lemma A.2.2 (Characteristic function of Gaussian vectors). If X ∈ Rd is a Gaussian vector, then


its characteristic function writes
 
\[ \forall u \in \mathbb{R}^d, \qquad \Phi_X(u) = E[\exp(i\langle u, X\rangle)] = \exp\Big( i\langle u, m\rangle - \frac{1}{2}\langle u, Ku\rangle \Big), \]

where m = E[X] and K is the covariance matrix of X.

As a consequence, the law of a Gaussian vector is characterised by its expectation m ∈ Rd


and its covariance matrix K ∈ Rd×d . Hence, it is denoted by

X ∼ Nd (m, K).

The proof of the next proposition is left as an exercise.

Proposition A.2.3 (Affine transformation of Gaussian vectors). Let X ∼ Nd (m, K) and A ∈


Rk×d , b ∈ Rk . Then
AX + b ∼ Nk (Am + b, AKA⊤ ).

A very useful property of Gaussian vectors is that independence can be checked from the
covariance matrix.

Proposition A.2.4 (Gaussian vectors and independence). Let X ∈ Rl and Y ∈ Rd be two ran-
dom vectors such that the vector (X, Y ) ∈ Rl+d is Gaussian. Then the following conditions are
equivalent.

(i) X and Y are independent.

(ii) The covariance matrix of (X, Y ) admits the block decomposition
\[ \begin{pmatrix} K_X & 0 \\ 0 & K_Y \end{pmatrix}, \]

where KX and KY are the respective covariance matrices of X and Y .

(iii) For all u ∈ Rl , v ∈ Rd , Cov(hu, Xi, hv, Y i) = 0.



Proof. The implications (i) ⇒ (ii) ⇒ (iii) are straightforward. Let us assume (iii) and prove (i).
To this aim, let us fix u ∈ Rl and v ∈ Rd . The characteristic function of (X, Y ) evaluated at (u, v)
writes
Φ(X,Y ) (u, v) = E [exp (i(hu, Xi + hv, Y i))] .
Since the vector (X, Y ) is Gaussian, the random variable hu, Xi + hv, Y i is also Gaussian, with
mean
E [hu, Xi + hv, Y i] = hu, mX i + hv, mY i
and variance

Var (hu, Xi + hv, Y i) = Var (hu, Xi) + 2 Cov (hu, Xi, hv, Y i) + Var (hv, Y i)
= hu, KX ui + hv, KY vi,

where we have used the condition (iii) here. As a consequence, using the expression of the characteristic functions of real-valued Gaussian variables, we get
\[ E\big[\exp\big(i(\langle u, X\rangle + \langle v, Y\rangle)\big)\big] = \exp\Big( i(\langle u, m_X\rangle + \langle v, m_Y\rangle) - \frac{1}{2}(\langle u, K_X u\rangle + \langle v, K_Y v\rangle) \Big) = \Phi_X(u)\,\Phi_Y(v),
\]

which shows that X and Y are independent.

A.3 Multidimensional Central Limit Theorem


Theorem A.3.1 (Multidimensional Central Limit Theorem). Let (Xi )i≥1 be a sequence of iid
random vectors in Rd , with well-defined expectation m ∈ Rd and covariance matrix K ∈ Rd×d .
We have
\[ \sqrt{n}\left( \frac{1}{n}\sum_{i=1}^{n} X_i - m \right) \to G \sim N_d(0, K), \quad \text{in distribution.} \]

As in the scalar case, the proof of Theorem A.3.1 follows from the computation of the limit,
when n → +∞, of the characteristic function of the random vector in the left-hand side.

A.4 Cochran’s Theorem


Let E be a linear subspace of Rd , and let Π be the matrix of the orthogonal projection of Rd
onto E in a given orthonormal basis of Rd . We recall that Π is symmetric and has nonnegative
eigenvalues; in particular, it is a covariance matrix.
Proposition A.4.1 (Gaussian vectors with orthogonal projections as covariance). Let E be a linear
subspace of Rd , with dimension k ≤ d, and let Π ∈ Rd×d denote the matrix of the orthogonal
projection of Rd onto E, in a given orthonormal basis of Rd . If X ∼ Nd (0, Π), then kXk2 ∼
χ2 (k).
Proof. Let (e1 , . . . , ek ) be an orthonormal basis of E and (ek+1 , . . . , ed ) be an orthonormal basis
of E ⊥ , so that
\[ \|X\|^2 = \sum_{i=1}^{d} |\langle X, e_i\rangle|^2. \]

Let us write ζi = hX, ei i and ζ = (ζ1 , . . . , ζd ) ∈ Rd .



For any u = (u1 , . . . , ud ) ∈ Rd ,


\[ \langle u, \zeta\rangle = \sum_{i=1}^{d} u_i \langle X, e_i\rangle = \Big\langle X, \sum_{i=1}^{d} u_i e_i \Big\rangle. \]

By Definition A.2.1, the right-hand side is a Gaussian variable, therefore ζ is a Gaussian vector.
Since E[X] = 0, it is immediate that E[ζ] = 0. Now, by the definition of Π, for all i ∈
{1, . . . , d},
\[ \mathrm{Var}(\zeta_i) = \mathrm{Var}(\langle X, e_i\rangle) = \langle e_i, \Pi e_i\rangle = \begin{cases} 1 & \text{if } i \le k, \\ 0 & \text{if } i \ge k+1, \end{cases} \]
so that ζ1 , . . . , ζk are identically distributed according to N(0, 1), while ζk+1 = · · · = ζd = 0.
Therefore, to conclude, it remains to check that ζ1 , . . . , ζk are independent. Since the vector ζ
is Gaussian, it is sufficient to check that Cov(ζi , ζj ) = 0 for 1 ≤ i < j ≤ k. But by the definition
of Π again, and the fact that e1 , . . . , ek are orthogonal, we have

Cov(ζi , ζj ) = Cov(hX, ei i, hX, ej i) = hei , Πej i = hei , ej i = 0,

which completes the proof.

Theorem A.4.2 (Cochran’s Theorem). Let X ∼ Nd (0, Id ) and E1 , E2 be two orthogonal sub-
spaces of Rd , with respective dimensions d1 , d2 . The orthogonal projections X 1 and X 2 of X,
respectively on E1 and E2 , are independent Gaussian vectors, and kX 1 k2 ∼ χ2 (d1 ), kX 2 k2 ∼
χ2 (d2 ).
Proof. Let Π1 and Π2 denote the matrices of the orthogonal projection of Rd onto E1 and E2 ,
respectively. For any u ∈ Rd ,

hu, X 1 i = hu, Π1 Xi = hΠ1 u, Xi,

so that by Definition A.2.1, X 1 is a Gaussian vector. The same arguments also apply to X 2 .
To show that X 1 and X 2 are independent, we compute the characteristic function of the pair
of vectors (X 1 , X 2 ). For any (u, v) ∈ Rd × Rd ,
 
\[ \begin{aligned} \Phi_{(X^1,X^2)}(u,v) &= E\big[\exp\big(i(\langle u, X^1\rangle + \langle v, X^2\rangle)\big)\big] \\ &= E\big[\exp\big(i(\langle \Pi_1 u, X\rangle + \langle \Pi_2 v, X\rangle)\big)\big] \\ &= \Phi_X(\Pi_1 u + \Pi_2 v) \\ &= \exp\Big( -\frac{1}{2}\langle \Pi_1 u + \Pi_2 v, \Pi_1 u + \Pi_2 v\rangle \Big), \end{aligned} \]

where we have used Lemma A.2.2 and the fact that X ∼ Nd (0, Id ) at the last line. Since Π1 u ∈
E1 , Π2 v ∈ E2 , and E1 and E2 are orthogonal, Pythagoras’ Theorem yields

hΠ1 u + Π2 v, Π1 u + Π2 vi = kΠ1 u + Π2 vk2 = kΠ1 uk2 + kΠ2 vk2 ,

so that
\[ \Phi_{(X^1,X^2)}(u,v) = \exp\Big( -\frac{1}{2}\|\Pi_1 u\|^2 \Big) \exp\Big( -\frac{1}{2}\|\Pi_2 v\|^2 \Big). \]
This shows that X 1 and X 2 are independent, with respective distributions Nd (0, Π1 ) and Nd (0, Π2 ).
By Proposition A.4.1, we deduce that kX 1 k2 ∼ χ2 (d1 ), kX 2 k2 ∼ χ2 (d2 ), which completes the
proof.

Applying Cochran’s Theorem with E1 = Span(1), 1 = (1, . . . , 1) and E2 = E1⊥ , we obtain


the following result, which is useful in the study of the Gaussian model.

Proposition A.4.3 (Empirical mean and variance). Let X1 , . . . , Xn be independent N(0, 1) ran-
dom variables. Let us write
\[ \overline{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \overline{X}_n)^2. \]

The random variables X n and Sn2 are independent, and

\[ (n-1)S_n^2 \sim \chi^2(n-1), \qquad \frac{\overline{X}_n}{\sqrt{S_n^2/n}} \sim t(n-1). \]
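A small simulation in R illustrating this proposition (an illustrative sketch; the sample size and the number of replications are arbitrary):

set.seed(2)
n <- 10
samples <- matrix(rnorm(1e4 * n), ncol = n)   # 10^4 iid N(0,1) samples of size n
xbar <- rowMeans(samples)
s2   <- apply(samples, 1, var)                # var() uses the 1/(n-1) normalisation

cor(xbar, s2)                          # close to 0, consistent with independence
quantile(xbar / sqrt(s2 / n), 0.975)   # close to qt(0.975, df = n - 1)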
Appendix B

Flipped-classroom topics

Each of the six topics proposed below describes a statistical method for performing a hypothesis test or a regression in a particular setting. It must be presented in a talk of about fifteen minutes, with slides, comprising two parts:

• a quick theoretical presentation of the method, guided by the indications given below; do not hesitate to do your own research beyond the lecture notes;

• a demonstration of the application of the method to an example different from the one(s) treated in the lecture notes, taken from a real or simulated database, using R: some commands are indicated below, and a good reflex is to start by reading the documentation of these commands.

The second part must not be limited to a mere statement of the result of the test (for instance: "We compared the December temperatures in Oslo and Algiers, and we find that they do not have the same distribution."), but must come with a discussion; possible directions for this discussion are suggested below for each topic.

Here are a few pieces of advice for the preparation of the talks.

• In the theoretical presentation, give an overview of the method (what it is used for, in which setting, how it is applied) rather than insisting on the mathematical details: these take a very long time to present, and it is preferable to refer your audience precisely to the lecture notes.

• When you present the result of a hypothesis test, it is much more telling to give the p-value (which is immediately interpretable) than the value of the test statistic (which is very test-dependent). Whenever you can, also discuss the power of the test you perform.

• Do not hesitate to use artificial data, simulated by yourselves, to explore the limits of the methods you present. For instance, if you describe a method which relies on the assumption that the data are Gaussian, it may be interesting to show that the method "works" on data that you generated from a Gaussian distribution, then to simulate "less and less Gaussian" data and to see whether the method still gives correct results or not.

• Repositories of databases are listed on Teams. Depending on the topic you treat, it may also be interesting to look for scientific studies using the method to which your talk is devoted, and to discuss their results: is the sample size large enough to guarantee that the test is sufficiently powerful? Do the data seem to satisfy the assumptions under which the method is valid?

We recall the deadlines for the preparation of the talks.

• By Friday 9 October, you must have formed a group and chosen a topic, on the basis of the "Quick summaries" given below.

• By Friday 27 November, you must send your tutorial class teacher an email describing:

1. the database on which you have chosen to apply your method,
2. the hypotheses that you will seek to test, or the relations that you will seek to exhibit, on this database.

Moreover, before each of the three sessions dedicated to the talks, you must read the sections of the lecture notes corresponding to the topics treated during the session, and prepare one question to be asked to each group, for each topic.

B.1 χ² independence test (Section 6.1.1)

B.1.1 Quick summary
We observe realisations of a random pair (X, Y) with values in a finite space X × Y. The χ² independence test answers the question: "are the random variables X and Y independent?". See Example 6.1.1 on the independence of the sex of children within a set of siblings.

B.1.2 Theoretical presentation
You will present the null and alternative hypotheses of this test and show that the independence test can be reformulated as a goodness-of-fit test to a family of distributions, which is treated in the lecture on the χ² test. You will deduce the expression of the test statistic and explain the principle of the independence test. You may present the notion of contingency table.

B.1.3 Application to an example
You will apply the χ² independence test to a dataset of your choice, and discuss its limits, for instance regarding the number of classes or the number of observations. You may use the command chisq.test in R, as in the sketch below.
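A minimal sketch on an artificial contingency table (the counts are made up for illustration):

# Artificial 2 x 3 contingency table, e.g. eye colour (rows) versus hair colour (columns)
counts <- matrix(c(20, 15, 10,
                   25, 30, 40), nrow = 2, byrow = TRUE)
chisq.test(counts)   # called on a matrix, chisq.test performs the independence test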

B.2 Tests for the comparison of proportions (Section 6.2.1)

B.2.1 Quick summary
We observe two samples X1,1, . . . , X1,n1 and X2,1, . . . , X2,n2 of Bernoulli variables, with respective parameters p1 and p2. Proportion comparison tests allow one to assess whether p1 = p2. See Example 6.2.1 on the number of Captain Haddock's alcohol-related accidents before and after his meeting with Tintin.

B.2.2 Theoretical presentation
You will present the general framework of proportion comparison tests. You will briefly describe the Gaussian approximations leading to the construction of an asymptotic Z-test. Without going into the mathematical details, you will then describe the implementation of Fisher's exact test.

B.2.3 Application to an example
You will apply both tests (the asymptotic one and Fisher's exact test) to a dataset of your choice. It will then be very relevant to compare the level and the power of these tests as functions of the size of your samples. You may use the command fisher.test in R, as in the sketch below.
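A minimal sketch on artificial counts (the numbers of successes and the sample sizes are arbitrary; prop.test implements an asymptotic test in its χ² form, which is one possible counterpart of the Z-test of the notes):

successes <- c(12, 22)
sizes     <- c(40, 45)

prop.test(successes, sizes, correct = FALSE)                      # asymptotic test of p1 = p2
fisher.test(matrix(c(successes, sizes - successes), nrow = 2))    # Fisher's exact test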

B.3 Student's and Fisher's tests (Section 6.2.2)

B.3.1 Quick summary
We observe two samples X1,1, . . . , X1,n1 and X2,1, . . . , X2,n2 of Gaussian random variables, with respective parameters (µ1, σ1²) and (µ2, σ2²). Student's test allows one, under the homoscedasticity assumption σ1² = σ2², to test whether µ1 = µ2 (that is to say, whether the samples have the same distribution). The assumption σ1² = σ2² is itself the object of Fisher's test. See Example 6.2.9 on the grades of the IMI and SEGF students at the 2018 exam of this course.

B.3.2 Theoretical presentation
You will present the assumptions underlying Fisher's and Student's tests. You will give the expression of the test statistics associated with these two tests, and describe their behaviour under H0. You may also discuss their behaviour under H1, in particular when the size of both samples tends to +∞.

B.3.3 Application to an example
You will apply both tests to a dataset of your choice. You will also discuss the soundness of the Gaussian assumption on your dataset. You may use the commands t.test (Student) and var.test (Fisher) in R, as in the sketch below.
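A minimal sketch on simulated Gaussian samples (the parameters are arbitrary):

set.seed(3)
x1 <- rnorm(40, mean = 10, sd = 2)
x2 <- rnorm(60, mean = 11, sd = 2)

var.test(x1, x2)                   # Fisher's test of sigma1^2 = sigma2^2
t.test(x1, x2, var.equal = TRUE)   # Student's test of mu1 = mu2 under homoscedasticity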

B.4 Nonparametric homogeneity tests (Section 6.2.3)

B.4.1 Quick summary
We observe two samples X1,1, . . . , X1,n1 and X2,1, . . . , X2,n2 of real-valued random variables, but without any particular assumption on the form of their distribution. Nonparametric homogeneity tests allow one to test whether these two samples are distributed under the same law. See Example 6.2.13 on the efficacy of a vaccine.

B.4.2 Theoretical presentation
You will present the principle and the interpretation of the Q-Q plot. You will then describe the Kolmogorov–Smirnov statistic, as well as its behaviour under the null hypothesis, and make the link with the Kolmogorov test seen in the lecture on nonparametric tests.

B.4.3 Application to an example
You will study the Q-Q plot and apply the Kolmogorov–Smirnov test to a dataset of your choice. You may also compare your results with those of a parametric approach, if the form of your data allows it. You may use the command ks.test in R, as in the sketch below.
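A minimal sketch on simulated samples (the distributions and sample sizes are arbitrary):

set.seed(4)
x1 <- rexp(100, rate = 1)
x2 <- rexp(150, rate = 1.3)

qqplot(x1, x2)    # Q-Q plot of the two samples
abline(0, 1)      # points close to this line would indicate a common distribution
ks.test(x1, x2)   # two-sample Kolmogorov-Smirnov test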

B.5 Analysis of variance in the Gaussian model (Section 6.3.2)

B.5.1 Quick summary
We observe k ≥ 2 independent samples (X1,i1)1≤i1≤n1, . . . , (Xk,ik)1≤ik≤nk of Gaussian variables, with respective parameters (µ1, σ²), . . . , (µk, σ²). The analysis of variance allows one to determine whether these samples are homogeneously distributed, that is to say whether µ1 = · · · = µk. See Example 6.3.1 on the influence of the environment on the height of inhabitants.

B.5.2 Theoretical presentation
You will present the statistics SSM, SST and SSE and explain the construction of the test statistic from these quantities. You will describe their geometric interpretation and may make the link with linear regression. Finally, you will present Fisher's test for the analysis of variance.

B.5.3 Application to an example
You will apply the analysis of variance to a dataset of your choice. You will also discuss the soundness of the Gaussian assumption on your dataset. You may use the command aov in R, as in the sketch below.
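A minimal sketch on simulated groups (the group means and the standard deviation are arbitrary):

set.seed(5)
group  <- factor(rep(c("A", "B", "C"), each = 30))
height <- rnorm(90, mean = c(170, 172, 171)[group], sd = 6)

fit <- aov(height ~ group)
summary(fit)   # Fisher's F-test of mu_A = mu_B = mu_C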

B.6 Logistic regression (Section 5.2)

B.6.1 Quick summary
We observe a binary random variable Y, whose distribution depends on a vector of numerical predictors x = (x1, . . . , xp). Logistic regression allows one to quantify this dependence by estimating the probability that Y = 1 as a function of the value of x. See Example 5.2.1 on spam detection.

B.6.2 Theoretical presentation
You will present the logistic model for p(1|x) and describe the estimation procedure for the parameter β. You will mention the likelihood ratio test for the usefulness of the regressors.

B.6.3 Application to an example
You will apply logistic regression to a dataset of your choice, and check the usefulness of the regressors by applying the likelihood ratio test. You may use the command glm, with the parameter family=binomial(logit), in R, as in the sketch below.
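A minimal sketch on simulated data (the generating model is arbitrary and x2 is useless by construction):

set.seed(6)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
p  <- 1 / (1 + exp(-(-1 + 2 * x1)))   # logistic model in which x2 plays no role
y  <- rbinom(n, size = 1, prob = p)

fit0 <- glm(y ~ x1,      family = binomial(logit))
fit1 <- glm(y ~ x1 + x2, family = binomial(logit))
summary(fit1)
anova(fit0, fit1, test = "LRT")       # likelihood ratio test for the usefulness of x2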
Appendix C

Correction of the exercises

C.1 Correction of the exercises of Chapter 1


Correction of Exercise 1.A.1 For the sake of convenience, we write yi = xi − xn and let
I = {uKn : u ∈ Rp }, J = Span(y1 , . . . , yn ).
For all u ∈ Rp , we have
\[ uK_n = \frac{1}{n}\sum_{i=1}^{n} u y_i^\top y_i = \sum_{i=1}^{n} \frac{1}{n}\langle u, y_i\rangle\, y_i \in J, \]

so that I ⊂ J.
To prove the converse inclusion, we show that I ⊥ ⊂ J ⊥ . Let v ∈ I ⊥ : by definition, for all
u ∈ Rp ,
0 = huKn , vi = hu, vKn⊤ i = hu, vKn i,

so that vKn = 0. As a consequence,


\[ 0 = \langle vK_n, v\rangle = \frac{1}{n}\sum_{i=1}^{n} \langle v y_i^\top y_i, v\rangle = \frac{1}{n}\sum_{i=1}^{n} \langle v, y_i\rangle^2, \]

therefore hv, yi i = 0 for all i ∈ {1, . . . , n} and v ∈ J ⊥ .

Correction of Exercise 1.A.2 We have


\[ \frac{1}{n}\sum_{i=1}^{n} (x_i^1 - \overline{x}_n^1)(x_i^2 - \overline{x}_n^2) = \alpha\, \frac{1}{n}\sum_{i=1}^{n} (x_i^1 - \overline{x}_n^1)^2, \]

and
\[ \frac{1}{n}\sum_{i=1}^{n} (x_i^2 - \overline{x}_n^2)^2 = \alpha^2\, \frac{1}{n}\sum_{i=1}^{n} (x_i^1 - \overline{x}_n^1)^2. \]

As a consequence, the empirical correlation is equal to 1 if α > 0, to 0 if α = 0 and to −1 if


α < 0.

Correction of Exercise 1.A.3 We employ the R script below.



n = 100

# Generation of the sample x1
x1 = runif(n)

# Plot of one realisation with k = 1
x2a = sin(pi * x1 / 2)
plot(x1, x2a)

# Plot of one realisation with k = 50
x2b = sin(50 * pi * x1 / 2)
plot(x1, x2b)

# Evolution of the empirical correlation with k
correl = c()
for (k in 1:50) {
  correl[k] = cor(x1, sin(k * pi * x1 / 2))
}
plot(1:50, correl, type = "l")

We obtain the following figures.

[Figures: scatter plots of (x1, x2a) for k = 1 and of (x1, x2b) for k = 50, and the empirical correlation correl as a function of k = 1, . . . , 50.]

The first and second figures show two sets of points (x_i^1, x_i^2), i ∈ {1, . . . , 100}, with respective parameters k = 1 and k = 50. The third picture shows the evolution of the empirical correlation between x1 and x2 for k ranging from 1 to 50. In the present example, x1 and x2 are related by a deterministic function, but when k is large, the nonlinearity of this function makes the correlation between x1 and x2 close to 0, as if these variables were statistically independent.

Correction of Exercise 1.A.4

1. We have x11 = 15, x12 = 12 and x13 = 18, so that x12 < x11 < x13 . As a consequence, r11 = 2
(because x11 is ranked second), r21 = 1 and r31 = 3. Likewise, x21 = 9, x22 = 7 and x23 = 8,
so that x22 < x23 < x21 and r12 = 3, r22 = 1 and r32 = 2. In the definition

\[ r_s = \frac{\frac{1}{3}\sum_{i=1}^{3} (r_i^1 - \overline{r}^1)(r_i^2 - \overline{r}^2)}{\sqrt{\frac{1}{3}\sum_{i=1}^{3} (r_i^1 - \overline{r}^1)^2}\; \sqrt{\frac{1}{3}\sum_{i=1}^{3} (r_i^2 - \overline{r}^2)^2}}, \]

we immediately have that


\[ \overline{r}^1 = \overline{r}^2 = \frac{1+2+3}{3} = 2, \]

and
\[ \frac{1}{3}\sum_{i=1}^{3} (r_i^1 - \overline{r}^1)^2 = \frac{1}{3}\sum_{i=1}^{3} (r_i^2 - \overline{r}^2)^2 = \frac{1^2 + 0^2 + 1^2}{3} = \frac{2}{3}, \]
while the numerator writes
\[ \frac{1}{3}\sum_{i=1}^{3} (r_i^1 - \overline{r}^1)(r_i^2 - \overline{r}^2) = \frac{1}{3}\big[(2-2)(3-2) + (1-2)(1-2) + (3-2)(2-2)\big] = \frac{1}{3}. \]

Therefore rs = 1/2.
We may already notice for further purpose that since, in the general case, r 1 and r 2 are
permutations of {1, . . . , n}, it always holds
\[ \overline{r}^1 = \overline{r}^2 = \frac{1 + \cdots + n}{n} = \frac{n+1}{2}, \]
and
\[ \frac{1}{n}\sum_{i=1}^{n} (r_i^1 - \overline{r}^1)^2 = \frac{1}{n}\sum_{i=1}^{n} (r_i^2 - \overline{r}^2)^2 = \frac{1}{n}\sum_{k=1}^{n} \Big(k - \frac{n+1}{2}\Big)^2 = \frac{n^2-1}{12}. \]

2. Since f is increasing, the series x11 , . . . , x1n and x21 , . . . , x2n are ranked in the same order,
therefore r 1 = r 2 and it is immediate that rs = 1. On the contrary, if f is decreasing, the
series x11 , . . . , x1n and x21 , . . . , x2n are ranked in the opposite order, therefore ri1 = n + 1 − ri2
for any i ∈ {1, . . . , n}. As a consequence, the numerator in the definition of Spearman’s
coefficient writes
n n   
1X 1 1 2 2 1 X 1 n+1 1 n+1
(ri − r )(ri − r ) = ri − n + 1 − ri −
n n 2 2
i=1 i=1
n    
1X n+1 n+1
= k− n+1−k−
n 2 2
k=1
n  
1X n+1 2
=− k−
n 2
k=1
n2 − 1
=− ,
12
where we have used the computation made at the end of the previous question. We conclude
that rs = −1.
3. In general, the numerator in the definition of rs writes
n n   
1X 1 1 X 1 n+1 n+1
(ri − r 1 )(ri2 − r 2 ) = ri − ri2 −
n n 2 2
i=1 i=1
n  
1X 1 2 1 2 n+1 (n + 1)2
= ri ri − (ri + ri ) + .
n 2 4
i=1

Let us compute separately the sums of the three terms appearing in the right-hand side,
starting from the last two. We get
n
1 X (n + 1)2 (n + 1)2
= ,
n 4 4
i=1

and
n n
1X 1 n+1 n+12 X (n + 1)2
(ri + ri2 ) = k= .
n 2 2 n 2
i=1 k=1

Last,
n n
1X 1 2 1 X 
ri ri = (ri1 )2 + (ri2 )2 − (ri1 − ri2 )2
n 2n
i=1 i=1
n n
!
1 X X
2 1 2 2
= 2 k − (ri − ri )
2n
k=1 i=1
n
(n + 1)(2n + 1) 1 X 1
= − (ri − ri2 )2 .
6 2n
i=1

Putting these results together, we get

\[ r_s = \frac{\dfrac{(n+1)(2n+1)}{6} - \dfrac{1}{2n}\sum_{i=1}^{n} (r_i^1 - r_i^2)^2 - \dfrac{(n+1)^2}{2} + \dfrac{(n+1)^2}{4}}{\dfrac{n^2-1}{12}}, \]

which simplifies into


\[ r_s = 1 - \frac{6}{n(n^2-1)}\sum_{i=1}^{n} (r_i^1 - r_i^2)^2. \]

Correction of Exercise 1.A.5

1. For all n ≥ 0,

\[ \begin{aligned} b_{n+1} = E[X^{n+1}] &= \frac{1}{e}\sum_{k=0}^{+\infty} k^{n+1}\,\frac{1}{k!} \\ &= \frac{1}{e}\sum_{k=1}^{+\infty} k^{n}\,\frac{1}{(k-1)!} \\ &= \frac{1}{e}\sum_{l=0}^{+\infty} (l+1)^{n}\,\frac{1}{l!} \\ &= \frac{1}{e}\sum_{l=0}^{+\infty} \left[ \sum_{m=0}^{n}\binom{n}{m} l^{m} \right]\frac{1}{l!} \\ &= \sum_{m=0}^{n}\binom{n}{m}\, \frac{1}{e}\sum_{l=0}^{+\infty} l^{m}\,\frac{1}{l!} \\ &= \sum_{m=0}^{n}\binom{n}{m}\, b_{m}. \end{aligned} \]

2. By Exercise 1.3.1, the sequences (bn )n≥0 and (Bn )n≥0 satisfy the same recursive relation,
and it is immediate that b0 = B0 = 1. By induction, we deduce that bn = Bn for all n ≥ 0.

Correction of Exercise 1.A.6


1. The square and the associated partitions are depicted below.
[Figure: the four points x1, x2, x3, x4 at the corners of the square, and the initial partition into the clusters S1(0) and S2(0).]
(0) (0)
Both means x_1^{(0)} and x_2^{(0)} lie at the center of the square, and are represented by the + sign. If the square is chosen to be of unit side, then WCSS(0) = 4 × (√2/2)² = 2. In the first step of the algorithm, J(i) = 1 for all i, so that S_1^{(1)} = {1, 2, 3, 4}, S_2^{(1)} = ∅, and the mean x_1^{(1)} remains at the center of the square. As a consequence, WCSS(1) = 2 = WCSS(0).

2. To show that WCSS(t+1) ≤ WCSS(t) in the proof of Lemma 1.3.3, we have used two inequalities: first, that for all j ∈ {1, . . . , k}, for all i ∈ S_j^{(t)}, ‖x_i − x_j^{(t)}‖ ≥ ‖x_i − x_{J(i)}^{(t)}‖; second, that for all j ∈ {1, . . . , k}, ‖x_j^{(t+1)} − x_j^{(t)}‖ ≥ 0. If WCSS(t+1) = WCSS(t), then it is necessary that both inequalities actually be equalities. In particular, the second one implies that for all j ∈ {1, . . . , k}, x_j^{(t)} = x_j^{(t+1)}.
In the (t+1)-th iteration, the assignment step computes, for each i ∈ {1, . . . , n}, the index J′(i), which is the index j of the point x_j^{(t+1)} which is the closest to x_i. But since x_j^{(t+1)} = x_j^{(t)}, then J′(i) = J(i). Therefore, it is now clear that the update step of the (t+1)-th iteration does not modify the partition.

3. Lemma 1.3.3 shows that the sequence WCSS(t) is nonincreasing. Since there is only a
finite number of partitions, this sequence is necessarily stationary: there exists t0 ≥ 0 such
that WCSS(t0 ) = WCSS(t0 +1) = · · · . The counterexample of the first question of this
exercise shows that it is not necessary that the partitions obtained at the t0 -th and t0 + 1-th
iterations of the algorithm coincide, however the second question ensures that it is the case
for the partitions obtained at the t0 + 1-th and t0 + 2-th iterations. Therefore the algorithm
is stopped after the t0 + 2-th iteration.

Correction of Exercise 1.A.7


1. The assumption on ρ ensures that for all k ∈ {1, . . . , K}, the points {xi , i ∈ Ik } are close
to µk . The set of points is therefore divided into clusters grouped in the neighbourhood of
each of the points µ1 , . . . , µK .

2. The WCSS of the partition {I1 , . . . , IK } writes


\[ \sum_{k=1}^{K} \sum_{i \in I_k} \|x_i - \overline{x}_k\|^2 = \sum_{k=1}^{K} \sum_{i \in I_k} \|\epsilon_i - \overline{\epsilon}_k\|^2, \]
with the notations x̄_k = (1/n_k) Σ_{i∈I_k} x_i and ε̄_k = (1/n_k) Σ_{i∈I_k} ε_i. Since the points ε_i, i ∈ I_k, are located in the convex ball of radius ρ, so is their centre of mass ε̄_k, therefore ‖ε_i − ε̄_k‖ ≤ 2ρ and finally
\[ \sum_{k=1}^{K} \sum_{i \in I_k} \|x_i - \overline{x}_k\|^2 \le 4 n \rho^2. \]

Take a different partition {J1 , . . . , JK }: then there exists l0 ∈ {1, . . . , K} such that Jl0
contains two indices i ∈ Ik and i′ ∈ Ik′ with k 6= k′ . The WCSS of this partition satisfies
K X
X
kxi − xl k2 ≥ kxi − xl0 k2 + kxi′ − xl0 k2 ,
l=1 i∈Jl

with xl now defined with respect to the class Jl .


By the triangle inequality,

kxi − xi′ k ≤ kxi − xl0 k + kxi′ − xl0 k,

but on the other hand kxi − µk k ≤ ρ and kxi′ − µk′ k ≤ ρ, so that the triangle inequality
again yields
kµk − µk′ k − 2ρ ≤ kxi − xi′ k.
The classical inequality 12 (a + b)2 ≤ a2 + b2 finally yields
1
kxi − xi′ k2 ≤ kxi − xl0 k2 + kxi′ − xl0 k2 ,
2
so that the WCSS of the partition {J1 , . . . , JK } is bounded from below by
1
(kµk − µk′ k − 2ρ)2 .
2
When ρ → 0, this bound becomes larger than 4nρ2 , which proves the optimality of the
partition {I1 , . . . , IK }.

3. The vector xn writes


K K
1 XX X nk
xn = (µk + ǫi ) = µ k + ǫn .
n n
k=1 i∈Ik k=1

4. The empirical covariance matrix writes


n
1X
Kn = (xi − xn )⊤ (xi − xn ),
n
i=1

where we recall that vectors of Rp are considered as row vectors. For all k ∈ {1, . . . , K},
for all i ∈ Ik ,
K
X nk ′
xi − xn = µ k + ǫ i − µ k ′ − ǫn = νk + ǫi − ǫn ,

n
k =1

where
K
X nk ′
νk = µ k − µk ′ .

n
k =1

Since kǫi − ǫn k ≤ ρ, we deduce that

1 XX ⊤ 
K
Kn = νk νk + O(ρ) = Kn0 + O(ρ),
n
k=1 i∈Ik

with
K
X nk
Kn0 = νk⊤ νk .
n
k=1

Notice that Kn0is the covariance matrix of a random vector X ∈ Rp taking the value µk
with probability nk /n.

5. When ρ = 0, we have Kn = Kn0 . Since the latter matrix is a covariance matrix, it is


symmetric and nonnegative, so that it has eigenvalues λ01 ≥ · · · ≥ λ0p ≥ 0 and associated
left eigenvectors e01 , . . . , e0p . For each j ∈ {1, . . . , K}, the definition of the pair (λ0j , e0j )
writes
K
X nk ⊤
e0j ν νk = λ0j e0j ,
n k
k=1

and multiplying by (e0j )⊤ on the right yields

K
X nk
he0j , νk i2 = λ0j ke0j k2 .
n
k=1

As a consequence, λ0j = 0 if and only if e0j is orthogonal to the subspace of Rp spanned by


ν1 , . . . , νK . Denoting by K ′ the dimension of this subspace, we conclude that λ0j > 0 for
j ≤ K ′ , and λ0j = 0 for j ≥ K ′ + 1. Notice that the colinearity relation

K
X nk
νk = 0
n
k=1

implies that K ′ ≤ K − 1.
Since the eigenvalues of symmetric matrices are Lipschitz continuous, the spectrum of Kn
is a perturbation of order ρ of the spectrum of Kn0 . Therefore, when ρ → 0, the scree plot
of the Principal Component Analysis shall exhibit:

• K ′ eigenvalues λ1 ≥ . . . ≥ λK ′ with an order of magnitude which does not depend


on 0;
• the remaining eigenvalues λK ′ +1 , . . . , λp which are of order of magnitude ρ.

Thus, an inflection point shall occur between the K ′ -th and (K ′ + 1)-th eigenvalues.

6. The ‘actual information’ contained in the data is encoded in a K ′ -dimensional vector, which
is exactly recovered in the ρ → 0 limit by keeping only the K ′ first principal components.

C.2 Correction of the exercises of Chapter 2


Correction of Exercise 2.A.1

1. The covariance matrix of Y1 writes
\[ K = \begin{pmatrix} \mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_1^2) \\ \mathrm{Cov}(X_1, X_1^2) & \mathrm{Var}(X_1^2) \end{pmatrix}, \]

and we have
Var(X1 ) = E[X12 ] − E[X1 ]2 = ρ2 ,
Cov(X1 , X12 ) = E[X13 ] − E[X1 ]E[X12 ] = ρ3 ,
Var(X12 ) = E[X14 ] − E[X12 ]2 = ρ4 − ρ22 .

2. For all x1 , x2 ∈ R, we have
\[ \nabla\varphi(x_1, x_2) = \begin{pmatrix} \dfrac{\partial \varphi}{\partial x_1}(x_1, x_2) \\[1mm] \dfrac{\partial \varphi}{\partial x_2}(x_1, x_2) \end{pmatrix} = \begin{pmatrix} -2x_1 \\ 1 \end{pmatrix}. \]
As a consequence,
\[ \nabla\varphi(y) = \begin{pmatrix} 0 \\ 1 \end{pmatrix}. \]

3. We have Vn = ϕ(Y n) and ρ2 = ϕ(y), so that by the Delta Method, √n (Vn − ρ2) converges in distribution to N(0, v) with v = ∇ϕ(y)⊤ K∇ϕ(y) = ρ4 − (ρ2)².

Correction of Exercise 2.A.4


1. We have Eθ [X1 ] = θ + 1/2, so that θen = X n − 1/2 is a moment estimator. It is unbiased,
strongly consistent, asymptotically normal with asymptotic variance 1/12.
2. The likelihood of a realisation xn ∈ Rn writes
\[ L_n(x_n; \theta) = \prod_{i=1}^{n} \mathbf{1}_{\{\theta \le x_i \le \theta+1\}} = \mathbf{1}_{\{\max_{1\le i\le n} x_i - 1 \,\le\, \theta \,\le\, \min_{1\le i\le n} x_i\}}, \]

which is plotted on Figure C.1. We observe that any choice of θ between max1≤i≤n xi − 1
and min1≤i≤n xi maximises this likelihood: the MLE is not unique.

[Figure: the likelihood as a function of θ; it is positive and constant on the interval [max_{1≤i≤n} x_i − 1, min_{1≤i≤n} x_i] and vanishes outside.]

Figure C.1: The likelihood in Exercise 2.A.4.

3. For all ǫ > 0, the strong Law of Large Numbers implies that (1/n) Σ_{i=1}^{n} 1_{X_i ≤ θ+ǫ} converges to ǫ, Pθ-almost surely. As a consequence, 1_{X_i ≤ θ+ǫ} = 1 for infinitely many values of i, which implies that min_{1≤i≤n} X_i converges to θ, Pθ-almost surely. By the same arguments, max_{1≤i≤n} X_i converges to θ + 1, Pθ-almost surely. As a consequence,
\[ \lim_{n \to +\infty} \hat\theta_n^t = (1-t)(\theta + 1 - 1) + t\theta = \theta, \quad P_\theta\text{-almost surely,} \]
so that θ̂_n^t is strongly consistent.
so that θbnt is strongly consistent.



4. Let x ∈ [0, 1]. For all n ≥ 1,


   
Pθ min Xi − θ ≤ x = 1 − Pθ min Xi − θ > x
1≤i≤n 1≤i≤n

= 1 − Pθ (X1 − θ > x)n


= 1 − (1 − x)n ,
so that under Pθ , U = min1≤i≤n Xi −θ ∼ β(1, n). By symmetry, V = max1≤i≤n Xi −θ ∼
β(n, 1). Recalling that the expectation of a β(a, b) random variable is a/(a + b), we deduce
that    
bt n 1 2t − 1
Eθ [θn ] = (1 − t) θ + −1 +t θ+ =θ+ .
n+1 n+1 n+1
As a consequence, the estimator θbt is unbiased for t = 1/2.
n

5. The MSE writes


R(θbnt ; θ) = Eθ [(θbnt − θ)2 ]
= E[((1 − t)(V − 1) + tU )2 ]
= (1 − t)2 E[(V − 1)2 ] + 2t(1 − t)E[(V − 1)U ] + t2 E[U 2 ].
Since V ∼ β(n, 1), we have 1 − V ∼ β(1, n) and therefore E[(1 − V )2 ] = E[U 2 ] = α, so
that
R(θ̂_n^t ; θ) = α((1 − t)² + t²) − 2γ t(1 − t),
which is easily seen to reach its minimum over t for t = 1/2.
6. For t = 1/2, the value of the MSE is
\[ R(\hat\theta_n^{1/2}; \theta) = \frac{\alpha - \gamma}{2}. \]
On the one hand, a straightforward computation for Beta distributions shows that
\[ \alpha = E[U^2] = \int_{u=0}^{1} u^2\, n(1-u)^{n-1}\, du = \frac{2}{(n+1)(n+2)}. \]
On the other hand, to compute γ we need to compute the joint distribution of (U, V ). To
this aim, we write, for any bounded function f : [0, 1]2 → R,
Z 1  
E[f (U, V )] = f min ui , max ui du1 · · · dun
u1 ,...,un =0 1≤i≤n 1≤i≤n
Z
= n! f (u1 , un )du1 · · · dun
0≤u1 ≤···≤un ≤1
Z 1 Z 1
= n! f (u1 , un )In (u1 , un )du1 dun ,
u1 =0 un =0

where
Z 1
In (u1 , un ) = 1{u1 ≤u2 ≤···≤un −1≤un } du2 · · · dun−1
u2 ,...,un−1 =0
Z un Z un Z un
= 1{u1 ≤un } ··· dun−1 · · · du2
u2 =u1 u3 =u2 un−1 =un−2
(un − u1 )n−2
= 1{u1 ≤un } .
(n − 2)!

We deduce that the density of the pair (U, V ) writes

qn (u, v) = 1{u≤v} n(n − 1)(v − u)n−2

on [0, 1]2 . We may now compute


Z 1
1
γ = E[(1 − V )U ] = (1 − v)uqn (u, v)dudv = ,
u,v=0 (n + 1)(n + 2)

from which we finally deduce that


1
R(θbn1/2 ; θ) = .
2(n + 1)(n + 2)

We notice that the MSE is of order 1/n2 , while for regular model, it is generally expected
to be of order 1/n.

Correction of Exercise 2.A.7 It is clear that ηk,n is a statistic. Using properties of the exponen-
tial distribution, we rewrite
Z1
ηk,n = ,
Z1 + Z2
where Z1 = X1 + · · · + Xk ∼ Γ(k, λ) and Z2 = Xk+1 + · · · + Xn ∼ Γ(n − k, λ) are independent.
Following [2, Proposition 3.4.4, p. 52], we deduce that ηk,n ∼ β(k, n − k), which does not depend
on the value of λ.

Correction of Exercise 2.A.8

1. P0 is the Cauchy distribution with parameter 1.

2. By definition,

Eθ [U1 ] = Pθ (X1 ≤ 0)
Z 0
dx
= 2
x=−∞ π((x − θ) + 1)
Z −θ
dy
= 2
letting y = x − θ
y=−∞ π(y + 1)
1 π 1 π
= arctan(−θ) + = − arctan(θ) + .
π 2 π 2

3. The identity obtained in Question 2 rewrites


 
1
arctan(θ) = π − Eθ [U1 ] .
2

Since Eθ [U1 ] ∈ (0, 1), the right-hand side belongs to the interval (− π2 , π2 ), therefore

θ = g(Eθ [U1 ]),

where g is defined on (0, 1) by


  
1
g(u) = tan π −u .
2

1 Pn
Approximating Eθ [U1 ] by n i=1 Ui ,we get the estimator
n
!
1 X
θen = g Ui .
n
i=1

Since the function


P g is continuous on (0, 1) on the one hand, and by the strong Law of Large
Numbers, n1 ni=1 Ui converges almost surely to Eθ [U1 ] on the other hand, we deduce that
this estimator is strongly consistent.
4. By the Central Limit Theorem,
n
!
√ 1X
n Ui − Eθ [U1 ] → N(0, Var θ [U1 ]), in distribution,
n
i=1

and since U1 is a Bernoulli random variable,


Varθ [U1 ] = Eθ [U1 ](1 − Eθ [U1 ])
1  π π
= 2 − arctan(θ) + arctan(θ) +
π  2 2
1 π2
= 2 − arctan(θ)2 .
π 4
Besides, g is C 1 on (0, 1), and for all u ∈ (0, 1),
   "   2 #
1 1
g′ (u) = −π tan′ π −u = −π 1 + tan π −u ,
2 2

so that
"    #
′ 1 1 π 2
g (Eθ [U1 ]) = −π 1 + tan π − − arctan(θ) +
2 π 2
h i
= −π 1 + tan (arctan(θ))2
= −π(1 + θ 2 ).
Applying the Delta Method, we get
! !
√   √ 1X
n
n θen − θ = n g Ui − g (Eθ [U1 ])
n
i=1

converges in distribution to a centered Gaussian variable, with variance


 2 
′ 2 2 2 π 2
v(θ) = g (Eθ [U1 ]) Varθ [U1 ] = (1 + θ ) − arctan(θ) .
4

5. The function v introduced in Question 4 is continuous, therefore


q v(θen ) converges to v(θ),
Pθ -almost surely. By Slutsky’s Theorem, we deduce that n/v(θen )(θen − θ) converges in
distribution to N(0, 1). In particular, if φ1−α/2 denotes the quantile of order 1 − α/2 of the
standard Gaussian distribution, we have for all θ ∈ R,
s 
n e
lim Pθ  |θn − θ| ≥ φ1−α/2  = α,
n→+∞ v(θen )

whence  s s 
θen − φ1−α/2 v(θen ) e e
v(θn ) 
, θn + φ1−α/2
n n

is an asymptotic confidence interval with level 1 − α for θ.

Correction of Exercise 2.A.9


q
1. The confidence interval In provided by Proposition 2.4.18 has width 2φ1−α/2 Vbn /n.

2. By the Delta Method, Φ(Zn ) is a consistent and asymptotically normal estimator of Φ(g(θ)),
with asymptotic variance Φ′ (g(θ))2 V (θ).

3. Since Φ is C 1 , Φ′ (Zn ) converges to Φ′ (g(θ)) in probability, therefore Φ′ (Zn )2 Vbn is a con-


sistent estimator of Φ′ (g(θ))2 V (θ). As a consequence, by Proposition 2.4.18 an asymptotic
confidence interval for Φ(g(θ)) is
 s s 
′ 2 b
Φ (Zn ) Vn ′ 2 b
Φ (Zn ) Vn 
Jn = Φ(Zn ) − φ1−α/2 , Φ(Zn ) + φ1−α/2 .
n n

Using the monotonicity of Φ, we have


s s
′ 2 b
Φ (Zn ) Vn Φ′ (Zn )2 Vbn
Φ(Zn ) − φ1−α/2 ≤ Φ(g(θ)) ≤ Φ(Zn ) + φ1−α/2
n n
if and only if
 s   s 
Vbn  Vbn 
Φ−1 Φ(Zn ) − Φ′ (Zn )φ1−α/2 ≤ g(θ) ≤ Φ−1 Φ(Zn ) + Φ′ (Zn )φ1−α/2 ,
n n

so that
  s   s 
b
Vn  −1  Vbn 
InΦ = Φ−1 Φ(Zn ) − Φ′ (Zn )φ1−α/2 ,Φ Φ(Zn ) + Φ′ (Zn )φ1−α/2
n n

is a second approximate confidence interval for g(θ).

4. Let us define s
Vbn
sn = φ1−α/2 ,
n
so that the width of the interval InΦ writes ϕn (sn ), while the width of In is 2sn .
It is obvious that ϕn (0) = 0, and
 
ϕ′n (s) = Φ′ (Zn )(Φ−1 )′ Φ(Zn ) + Φ′ (Zn )s + Φ′ (Zn )(Φ−1 )′ Φ(Zn ) − Φ′ (Zn )s .

As a consequence,
ϕ′n (0) = 2Φ′ (Zn )(Φ−1 )′ (Φ(Zn )) = 2.
We deduce that the width of In rewrites ϕn (0) + ϕ′n (0)s, from which we conclude that:

– if ϕn is convex, then In is smaller than InΦ ;


– if ϕn is concave, then InΦ is smaller than In .
p
5. If Φ′ (g(θ)) = 1/ V (θ), then the asymptotic variance of Φ(Zn ) is 1 and no longer depends
on the parameter θ. As a consequence, there is no need to estimate the variance and we get
the asymptotic confidence interval
 
φ1−α/2 φ1−α/2
Jn = Φ(Zn ) − √ , Φ(Zn ) + √
n n
for Φ(g(θ)), from which we deduce the asymptotic confidence interval
    
Φ −1 φ1−α/2 −1 φ1−α/2
In = Φ Φ(Zn ) − √ ,Φ Φ(Zn ) + √
n n
for g(θ).

6. As an example of a function Φ such that


1 1
Φ′ (λ) = p = ,
V (λ) λ

one can take Φ(λ) = log λ. With this choice, the function ϕn introduced in Question 4
writes
     
bn + s − exp log λ
ϕn (s) = exp log λ bn − s = 2λ bn sinh s ,
bn
λ bn
λ bn
λ
which is convex on [0, +∞). As a consequence, the confidence interval obtained with
variance stabilisation is larger than the original one.

C.3 Correction of the exercises of Chapter 3


Correction of Exercise 3.A.2
1. By Corollary 2.4.17, for any p ∈ [0, 1] and r > 0,
\[ P_p\big( |\overline{X}_n - p| \ge r/\sqrt{n} \big) \le 2\exp(-2r^2). \]
The right-hand side of this inequality is equal to α if one takes r = \sqrt{-\log(\alpha/2)/2}, therefore
\[ I_n = \left[ \overline{X}_n - \sqrt{-\frac{\log(\alpha/2)}{2n}},\ \overline{X}_n + \sqrt{-\frac{\log(\alpha/2)}{2n}} \right] \]
is an approximate confidence interval for p.

2. Following the approach of Section 3.2.2, we consider the test with rejection region
\[ W_n = \{p_0 \not\in I_n\} = \left\{ |\overline{X}_n - p_0| > \sqrt{-\frac{\log(\alpha/2)}{2n}} \right\}. \]

Then by Question 1,
Pp0 (Wn ) = Pp0 (p0 6∈ In ) ≤ α,
so that the level of this test is lower than α.

Correction of Exercise 3.A.3 You should have figured out that none of the proposed statements
are correct!

Correction of Exercise 3.A.4 By Slutsky’s Theorem,


r
n
(Zn − g0 ) → N(0, 1), in distribution,
Vbn
under H0 . As a consequence, the test which rejects H0 as soon as
r
n
|Zn − g0 | ≥ φ1−α/2
Vbn
is of asymptotic level α. Let us show that this test is consistent: if g(θ) 6= g0 , then
r r
n n
|Zn − g0 | = |(Zn − g(θ)) + (g(θ) − g0 )|
b
Vn b
Vn
diverges to +∞ when n → +∞, since Zn − g(θ) → 0 and g(θ) − g0 6= 0. As a consequence,
r 
n
lim Pθ |Zn − g0 | ≥ φ1−α/2 = 1.
n→+∞ Vbn
This test, called the Student–Wald test, is an asymptotic Z-test.

Correction of Exercise 3.A.5


1. Since fθ,α is a probability density, we have
Z +∞
dx
C(θ, α) =1
x=θ xα+3
so that
C(θ, α) = (α + 2)θ α+2 .
We then compute
α+2 α+2 2 α+2
E[X] = θ, E[X 2 ] = θ , Var(X) = θ2.
α+1 α α(α + 1)2

2.a The likelihood of a realisation xn = (x1 , . . . , xn ) writes


n
!−(α+3)
Y
Ln (xn ; θ, α) = (α + 2)n θ n(α+2) xi 1{min1≤i≤n xi ≥θ} .
i=1

Fix θ > 0 such that min1≤i≤n xi ≥ θ. Then the log-likelihood writes


n
X
ℓn (xn ; θ, α) = n log(α + 2) + n(α + 2) log θ − (α + 3) log xi ,
i=1

and the function α 7→ ℓn (xn ; θ, α) reaches its maximum at the point


1
αn (xn ) = −2 + n x .
1 X i
log
n θ
i=1

As a consequence, the MLE of α is


1
α
bn = αn (Xn ) = −2 + n  .
1 X Xi
log
n θ
i=1

By the strong Law of Large Numbers,


n  
1X Xi 1
lim log = , almost surely,
n→+∞ n θ α+2
i=1

which shows that α


bn is strongly consistent.

2.b By the strong Law of Large Numbers and the result of Question 1,
α+2
lim X n = E[X] = θ, almost surely.
n→+∞ α+1
Since
α+2 2 − x/θ
θ=x if and only if α = ,
α+1 x/θ − 1
we deduce that
2 − X n /θ
α
en =
X n /θ − 1
is a strongly consistent estimator of α.

bn and α
2.c To prove the asymptotic normality of α en , we use the Delta method.
bn . We write
Asymptotic normality of α
n  !  !
√ √ 1X Xi 1 1
n (b
αn − α) = n g log −g , g(y) = −2 + .
n θ α+2 y
i=1

On the one hand, the function g is differentiable in y = 1/(α + 2) and


 
′ 1
g = −(α + 2)2 .
α+2

On the other hand, the Central Limit Theorem asserts that


n   !  
√ 1X Xi 1 1
lim n log − = N 0, , in distribution.
n→+∞ n θ α+2 (α + 2)2
i=1

As a consequence, α
bn is asymptotically normal, with asymptotic variance
1
[−(α + 2)2 ]2 = (α + 2)2 .
(α + 2)2

en . We write
Asymptotic normality of α
  
√ √  α+2 2 − x/θ
n (e
αn − α) = n h X n − h θ , h(x) = .
α+1 x/θ − 1

On the one hand, the function h is differentiable in x = θ(α + 2)/(α + 1) and


 
′ α+2 (α + 1)2
h θ =− .
α+1 θ

On the other hand, the Central Limit Theorem asserts that


   
√ α+2 α+2 2
lim n Xn − θ = N 0, θ , in distribution.
n→+∞ α+1 α(α + 1)2

As a consequence, α
en is asymptotically normal, with asymptotic variance
 2
(α + 1)2 α+2 2 (α + 1)2 (α + 2)
− θ = .
θ α(α + 1)2 α

Comparison. The difference between the asymptotic variances writes

(α + 1)2 (α + 2) α+2
(α + 2)2 − =− < 0,
α α
so that α
bn has a smaller asymptotic variance than α
en .

2.d Since αbn is a consistent estimator of α, it takes larger values under H1 than under H0 .
Therefore we shall look for a rejection region of the form Wn = {b αn ≥ a}. To compute the
value of a, we write the type I error
√ √   √ 
n(bαn − α0 ) n(a − α0 ) n(a − α0 )
Pα0 (bαn ≥ a) = Pα0 ≥ ≃P Z≥ ,
α0 + 2 α0 + 2 α0 + 2

with Z ∼ N(0, 1). For the right-hand side to be equal to 5%, n(a − α0 )/(α0 + 2) must
be equal to the quantile φ0.95 ≃ 1.65 of order 95% of the N(0, 1) distribution, so that one
has to take  
α0 + 2
Wn = α bn ≥ α0 + 1.65 √ .
n
This is an example of the Student–Wald asymptotic Z-test studied in Exercise 3.A.4, from
which we immediately deduce that the test is consistent.

3.a We recall that the likelihood writes


n
!−(α+3)
Y
n n(α+2)
Ln (xn ; θ, α) = (α + 2) θ xi 1{min1≤i≤n xi ≥θ} .
i=1

It is easy to see that, for a fixed realisation xn and α > 0,

– θ 7→ Ln (xn ; θ, α) is positive and increasing on (0, min1≤i≤n xi ),


– Ln (xn ; θ, α) = 0 for θ ≥ min1≤i≤n xi .

Therefore the likelihood has a unique maximum for θ = θn (x) = min1≤i≤n xi . As a


consequence, the MLE of θ is
θbn = min Xi .
1≤i≤n

Notice that the model is not regular here.



3.b Using the hint, we first write


Z +∞  
Eθ [θbn ] = Pθ min Xi ≥ x dx.
x=0 1≤i≤n

Since the variables X1 , . . . , Xn are iid,


   n(α+2)
n θ
Pθ min Xi ≥ x = Pθ (X1 ≥ x) = ,
1≤i≤n max{θ, x}
which yields
n(α + 2)
Eθ [θbn ] = θ .
n(α + 2) − 1
We notice that the bias converges to 0 when n → +∞.
3.c By the same computations as above, the CDF Fn of n(θbn − θ) writes
(
0 if x ≤ 0,
Fn (x) = x
 n(α+2)
1 − 1 + nθ if x > 0.
When n → +∞,   
α+2
Fn (x) → 1{x>0} 1 − exp − x ,
θ
so that n(θbn − θ) converges in distribution to the exponential distribution with parameter
(α + 2)/θ.
3.d We deduce from the previous question that
α+2 b
n(θn − θ) → T ∼ E(1), in distribution.
θ
The prelimit plays the role (asymptotically) of a pivotal function. Thus, for any b > 0,
 
α+2 b
lim Pθ 0 ≤ n(θn − θ) ≤ b = P(0 ≤ T ≤ b) = 1 − exp(−b).
n→+∞ θ
The right-hand side is equal to 0.95 for the choice b = − log(0.05), in which case the
corresponding confidence interval writes
" #
θbn b
b
, θn .
1 + n(α+2)

Correction of Exercise 3.A.6


1. The function φ is convex, so that Jensen’s inequality yields
  
L1 (X1 ; θ0 )
h ≥ φ E θ1 .
L1 (X1 ; θ1 )
Recall the definition of the likelihood and assume for instance that for both θ = θ0 and
θ = θ1 , X1 possesses the density p(x; θ) under Pθ . Then L1 (x; θ) = p(x; θ) and
  Z Z
L1 (X1 ; θ0 ) p(x; θ0 )
E θ1 = p(x; θ1 )dx = p(x; θ0 )dx = 1,
L1 (X1 ; θ1 ) p(x; θ1 )
therefore we get h ≥ φ(1) = 0.

2. With the same hypothesis, under H0 ,


n  
1 LR 1X p(Xi ; θ1 )
log ζn (Xn ) = log
n n p(Xi ; θ0 )
i=1
  
p(X1 ; θ1 )
→ Eθ0 log
p(X1 ; θ0 )
Z  
p(x; θ1 )
= log p(x; θ0 )dx
p(x; θ0 )
Z  
p(x; θ0 ) p(x; θ0 )
= − log p(x; θ1 )dx
p(x; θ1 ) p(x; θ1 )
= −h.

3. By the Central Limit Theorem, under H0 ,


 
√ 1
n log ζn (Xn ) + h → N(0, σ 2 ),
LR
n

with   
p(X1 ; θ1 )
σ 2 = Varθ0 log .
p(X1 ; θ0 )

Notice that both h and σ 2 are virtually computable, although depending on the model, these
computations may be more or less straightforward. With the same arguments as in Exer-
cise 3.A.4, we thus deduce that the test with rejection region
 
1 σ
LR
Wn = log ζn (Xn ) + h ≥ φ1−α/2 √
n n

is consistent and has asymptotic level α.

C.4 Correction of the exercises of Chapter 4


Correction of Exercise 4.A.1 With θ = (px )x∈X , the likelihood of a realisation xn = (x1 , . . . , xn ) ∈
Xn writes
Yn
Ln (xn ; θ) = pxi .
i=1

Let Pbn be the empirical measure associated with the realisation xn . In the product in the right-hand
above, each x ∈ X appears nb pn,x times, so that the likelihood rewrites
Y nb
pn,x
Ln (xn ; θ) = px ,
x∈X

and the associated log-likelihood is


X
ℓn (xn ; θ) = nb
pn,x log(px ).
x∈X

We now check that this function reaches its maximum on Θ for θ = Pbn , by writing, for any θ ∈ Θ,
X X
ℓn (xn ; Pbn ) − ℓn (xn ; θ) = nb
pn,x log(b
pn,x ) − nb
pn,x log(px )
x∈X x∈X
X  
pbn,x
=n pbn,x log
px
x∈X
X  
pbn,x
=n px φ ,
px
x∈X

where φ(u) = u log u. Since this function is convex, Jensen’s inequality yields
  !
X pbn,x X pbn,x
px φ ≥φ px = φ(1) = 0,
px px
x∈X x∈X

which completes the computation.

Correction of Exercise 4.A.3

1. By Definition 4.2.2, for any t ∈ [0, 1], β(t) is a centered Gaussian variable, with variance
t − t2 = t(1 − t).

2. By Definition 4.2.2, G is a centered Gaussian vector with covariance matrix given by

Cov(β(F (xi )), β(F (xj ))) = min{F (xi ), F (xj )} − F (xi )F (xj )
= F (min{xi , xj }) − F (xi )F (xj ),

since F is nondecreasing.

3. The random vector Gn writes


n
!
√ 1X
Gn = n Si − E[S1 ] ,
n
i=1

where S1 , . . . , Sn are iid random vectors in Rd defined by



Si = 1{Xi ≤x1 } , . . . , 1{Xi ≤xd } .

By the multidimensional Central Limit Theorem (see Theorem A.3.1 in Appendix A), Gn
converges in distribution to Nd (0, K) where K is the covariance matrix of S1 . The coeffi-
cients of this matrix are given by
h i   h i
E 1{X1 ≤xi } 1{X1 ≤xj } − E 1{X1 ≤xi } E 1{X1 ≤xj }
h i   h i
= E 1{X1 ≤min{xi ,xj }} − E 1{X1 ≤xi } E 1{X1 ≤xj }
= F (min{xi , xj }) − F (xi )F (xj ),

so that Gn converges in distribution to G.



Correction of Exercise 4.A.4

1. Let x ∈ R. For all n ≥ 1, the iid random variables 1{X1 ≤x} , . . . , 1{Xn ≤x} take their values
in [0, 1] and satisfy E[1{X1 ≤x} ] = F (x), therefore by Corollary 2.4.17 p. 52,
√ 
P n|Fbn (x) − F (x)| ≥ a ≤ 2 exp(−2a2 ).

2. For α ∈ (0, 1), we let


r
1 α
a= − log .
2 2
Then the test rejecting H0 as soon as

sup n|Fbn (x) − F0 (x)| ≥ a
x∈R

has level  

PH0 sup n|Fbn (x) − F0 (x)| ≥ a ≤ 2 exp(−2a2 ) = α.
x∈R

In addition to provide an easily computable threshold a, this test does not require the CDF
F0 to be continuous.

C.5 Correction of the exercises of Chapter 5


Correction of Exercise 5.A.1
e = β for any β ∈ Rp+1 . But on the other
1. Since the estimator is unbiased, we have E[β]
hand,
e = E[an yn ] = E[an (xn β + ǫn )] = an xn β,
E[β]
where we have used the fact that E[ǫn ] = 0. As a consequence, the (p + 1) × (p + 1) matrix
an xn satisfies an xn β = β for all β ∈ Rp+1 , so that it is equal to the identity Ip+1 of Rp+1 .

2. The covariance matrix of βe = an (xn β+ǫn ) is the covariance matrix of the vector an ǫn , and
since ǫn has covariance matrix σ 2 Ip+1 , we deduce that the covariance matrix of βe writes

an σ 2 Ip+1 a⊤ 2 ⊤
n = σ an an .

3. We recall that the covariance matrix of βb writes σ 2 (x⊤ −1


n xn ) . As a consequence, to answer
the question it suffices to check that for all u ∈ R , hu, (an a⊤
p+1 ⊤ −1
n − (xn xn ) )ui ≥ 0.
Following the hint, we write

an a⊤ ⊤ −1 ⊤ ⊤ −1 ⊤
n = ((xn xn ) xn + dn )((xn xn ) xn + dn )

= ((x⊤ −1 ⊤ ⊤
n xn ) xn + dn )(xn (xn xn )
−1
+ d⊤
n)
= (x⊤
n xn )
−1
+ (x⊤ −1 ⊤ ⊤ ⊤
n xn ) xn dn + dn xn (xn xn )
−1
+ dn d⊤
n.

By Question 1,

Ip+1 = an xn = ((x⊤ −1 ⊤
n xn ) xn + dn )xn = Ip+1 + dn xn ,

so that dn xn = 0 and (dn xn )⊤ = x⊤ ⊤


n dn = 0, therefore the series of identities above
simplifies to
an a⊤ ⊤
n = (xn xn )
−1
+ dn d⊤
n.

As a conclusion, for all u ∈ Rp+1 ,

hu, (an a⊤ ⊤ −1 ⊤ ⊤ 2
n − (xn xn ) )ui = hu, dn dn ui = kdn uk ≥ 0,

which completes the proof.

Correction of Exercise 5.A.2


1. For all j ∈ {1, . . . , p + 1}, let us write ζ_j = √λ_j ⟨β̂ − β, e_j⟩. Any linear combination of the variables ζ_j is a linear combination of the coefficients of the Gaussian vector β̂ − β, therefore the vector with coordinates ζ_j is Gaussian. By Proposition 5.1.11, it is immediate that E[ζ_j] = 0 and
\[ E[\zeta_i \zeta_j] = \sqrt{\lambda_i \lambda_j}\; E\big[\langle \hat\beta - \beta, e_i\rangle \langle \hat\beta - \beta, e_j\rangle\big] = \sqrt{\lambda_i \lambda_j}\; \langle e_i, \sigma^2 (x_n^\top x_n)^{-1} e_j\rangle = \sigma^2\, \mathbf{1}_{\{i=j\}}, \]

which shows the claimed result.

2. As a consequence of the previous question, we deduce that


p+1
X p+1 2
X
λj b 2
ζj
2
hβ − β, ej i = ∼ χ2 (p + 1),
σ σ2
j=1 j=1

so that taking a to be the quantile χ2p+1,1−α of order 1 − α of the χ2 (p + 1) distribution


yields the expected identity.

3. If σ 2 is estimated by σ
b2 , we now write
 
 p+1
X λj b 
p+1 2
C(a) = β ∈ R : hβ − β, ej i ≤ a
 b2
σ 
j=1

and look for the value of a which makes the identity P(β ∈ C(a)) = 1 − α remain valid.
We rewrite to this aim
p+1
X
λj hβb − β, ej i2 = σ 2 Z1 , Z1 ∼ χ2 (p + 1),
j=1

and
Z2
b2 = σ2
σ , Z2 ∼ χ2 (n − p − 1),
n−p−1
where Proposition 5.1.11 ensures that Z1 and Z2 are independent. As a consequence,
p+1
X λj b Z1 /(p + 1)
2
hβ − β, ej i2 = (p + 1) .
σ
b Z2 /(n − p − 1)
j=1

The law of the ratio Z2Z/(n−p−1)


1 /(p+1)
is thus the Fisher distribution with degrees of freedom p + 1
and n − p − 1, whose quantile of order 1 − α is denoted by fp+1,n−p−1,1−α. Then we
conclude that a must take the value fp+1,n−p−1,1−α/(p + 1).

Correction of Exercise 5.A.3


1. Assume that the rank of xn is strictly smaller than p. This implies that one of its columns is
a linear combination of the other, so that there exists u ∈ Rp \ {0} such that xn u = 0. As
a consequence Kn u = n1 x⊤ n xn u = 0, so that 0 is an eigenvalue of Kn , which prevents the
smallest eigenvalue λp from being positive.
2. By Proposition 1.2.4, the subspaces of Rn spanned by x1 , . . . , xp and c1 , . . . , cp are the
same, so that the matrices xn and cn have the same range. As a consequence, for all β ∈ Rp ,
there exists γ ∈ Rp such that
p
X p
X
j
xn β = βj x = γl cl = cn γ.
j=1 l=1

By the definition of the principal components, the third term in the sequence of inequalities
above rewrites !
Xp X p p
X X p Xp
l j j j
γl c = γl el x = γl el xj ,
l=1 l=1 j=1 j=1 l=1
therefore the identification of the coefficients yields
p
X
βj = γl ejl .
l=1

3. The orthogonal projections of yn on the range of xn and on the range of cn are the same,
because these two spaces are identical.
4. By Proposition 1.2.4, the principal components c1 , . . . , ck are pairwise orthogonal in Rn .
As a consequence, the orthogonal projection of yn onto the range of ckn writes
k
X
bnk =
y blk cl ,
γ
l=1

blk is computed from the orthogonal projection of y


where γ bnk onto cl , which is exactly the
purpose of simple linear regression.
5. Let u ∈ Rp . We aim to show that
Var(hβbk , ui) ≤ Var(hβ,
b ui).

The definition of βbk rewrites βbk = Pk γ bk , where Pk is the p × k matrix whose columns
bk is a Gaussian vector with covariance
are the vectors e1 , . . . , ek . By Proposition 5.1.11, γ
⊤ k −1 ⊤
matrix σ (cn cn ) . On the other hand, Proposition 1.2.4 shows that the matrix ckn ckn is
2 k

diagonal with coefficients nλ1 , . . . , nλk . As a consequence, the covariance of the vector βbk
2
writes σn Pk diag(λ−1 −1 ⊤
1 , . . . , λk )Pk , therefore
k
bk σ 2 X hu, el i2
Var(hβ , ui) = .
n λl
l=1

We finally remark that when k = p, γ b and βbk = γ


bk = γ b. As a consequence,
p k
b ui) = σ 2 X hu, el i2 σ 2 X hu, el i2
Var(hβ, ≥ ,
n λl n λl
l=1 l=1
which completes the proof.

C.6 Correction of the exercises of Chapter 6


Correction of Exercise 6.A.1 We argue as in Section 4.2 and call F the common CDF to both
samples under H0 . Then, under H0 , ξn1 ,n2 has the same distribution as

1 X n1
1 Xn2

Xn1 ,n2 = sup 1{F −1 (U1,i1 )≤x} − 1{F −1 (U2,i2 )≤x} ,
x∈R n 1 n 2
i1 =1 i2 =1

where (U1,1 , . . . , U1,n1 ) and (U2,1 , . . . , U2,n2 ) are independent samples of uniformly distributed
random variables on [0, 1]. Performing the change of variable u = F (x) and using the continuity
of F allows to rewrite

1 X n1
1 X
n2

Xn1 ,n2 = sup 1{U1,i1 ≤u} − 1{U2,i2 ≤u} ,
u∈(0,1) n1 i1 =1
n2
i2 =1

which no longer depends on F .

Correction of Exercise 6.A.2 In the example, Zi = X1,i − X2,i = ǫ1,i − ǫ2,i . Under H0′ , the
pairs (ǫ1,i , ǫ2,i ) and (ǫ2,i , ǫ1,i ) have the same distribution, so that ǫ1,i − ǫ2,i and ǫ2,i − ǫ1,i have the
same distribution, which implies that the law of Zi is symmetric. Therefore, H0′ ⊂ H0 , so that if
Wn is the rejection region for a test of level α for the null hypothesis H0 , then
sup P(Wn ) ≤ sup P(Wn ) = α.
H0′ H0

In other words, the test rejecting H0′ on the event Wn has level at most α.

1. Let f : {−1, 1} → R and g : (0, +∞) → R be bounded functions. The symmetry of the
law of ζ yields
E[f (sign(ζ))g(|ζ|)] = E[f (sign(−ζ))g(| − ζ|)] = E[f (− sign(ζ))g(|ζ|)],
so that
1
E[f (sign(ζ))g(|ζ|)] = (E[f (sign(ζ))g(|ζ|)] + E[f (− sign(ζ))g(|ζ|)])
2 
1 
= E  (f (sign(ζ)) + f (− sign(ζ))) g(|ζ|)
2| {z }
=f (1)+f (−1)
f (1) + f (−1)
= E[g(|ζ|)],
2
which shows that sign(ζ) and |ζ| are independent, with P(sign(ζ) = 1) = P(sign(ζ) =
−1) = 1/2. (Notice that we have used the assumption that P(ζ = 0) = 0, which follows
from (∗), to ensure that f (sign(ζ)) + f (− sign(ζ)) = f (1) + f (−1), almost surely.)
2. Let ǫ1 , . . . , ǫn ∈ {−1, 1}. We write

P sign(ζπ(1) ) = ǫ1 , . . . , sign(ζπ(n) ) = ǫn
X 
= P sign(ζπ(1) ) = ǫ1 , . . . , sign(ζπ(n) ) = ǫn , π = σ
σ∈Sn
X 
= P sign(ζσ(1) ) = ǫ1 , . . . , sign(ζσ(n) ) = ǫn , |ζσ(1) | < · · · < |ζσ(n) | .
σ∈Sn

By the result of Question 1, the vectors (sign(ζ1 ), . . . , sign(ζn )) and (|ζ1 |, . . . , |ζn |) are
independent, therefore for any σ ∈ Sn ,

P sign(ζσ(1) ) = ǫ1 , . . . , sign(ζσ(n) ) = ǫn , |ζσ(1) | < · · · < |ζσ(n) |
 
= P sign(ζσ(1) ) = ǫ1 , . . . , sign(ζσ(n) ) = ǫn P |ζσ(1) | < · · · < |ζσ(n) |
1
= n P(π = σ),
2
where at the last line, we have used the fact that the variables sign(ζ1 ), . . . , sign(ζn ) are
independent Rademacher variables. We deduce that
 1 X 1
P sign(ζπ(1) ) = ǫ1 , . . . , sign(ζπ(n) ) = ǫn = n P(π = σ) = n ,
2 2
σ∈Sn

which shows that sign(ζπ(1) ), . . . , sign(ζπ(n) ) are independent Rademacher variables.


3. According to the previous question, under H0 , T + has the same law as the random variable
n
X
+
τ = kℓk ,
k=1

where ℓ1 , . . . , ℓn are independent B(1/2) variables. This variable does not depend on the
law of Z1 , so that the statistic T + is free under H0 . Denoting by t+
n,r the quantile of order r
of τ , we deduce that the test rejecting H0 as soon as T 6∈ [tn,α/2 , t+
+ + +
n,1−α/2 ] has level α.
These quantiles may be computed by numerical simulation.
4. Under H0 , the expectation of T + is equal to
n
X n
1X n(n + 1)
tn = E[τ + ] = kE[ℓk ] = k= ,
2 4
k=1 k=1

and the variance of T+ is equal to


n
X n
1 X 2 n(n + 1)(2n + 1)
σn2 +
= Var(τ ) = 2
k Var(ℓk ) = k = .
4 24
k=1 k=1

5. We proceed as for the proof of the Central Limit Theorem, and compute the characteristic
function Φwn of the reduced variable
τ + − tn
wn = .
σn
For all u ∈ R,
Φwn (u) = E [exp(iuwn )]
  
τ + − tn
= E exp iu
σn
" n
!#
iu X
= E exp k(ℓk − 1/2)
σn
k=1
Yn   
iu
= E exp k(ℓk − 1/2)
σn
k=1
Yn     
1 iuk 1 iuk
= exp + exp − .
2 2σn 2 2σn
k=1


The expression of σn2 computed above shows that k/σn = O(1/ n), uniformly over k ≤ n,
so that for any k ∈ {1, . . . , n},
   
1 iuk 1 iuk
exp + exp −
2 2σn 2 2σn
 2 2
    
1 iuk u k 1 1 iuk u2 k2 1
= 1+ − +o + exp 1 − − +o
2 2σn 8σn2 n 2 2σn 8σn2 n
 
u2 k2 1
=1− 2
+o .
8σn n

Pretending not to see that we are taking the logarithm of a complex number, we then write
n
X   
u2 k 2 1
log Φwn (u) = log 1 − +o
8σn2 n
k=1
n
X u2 k 2
= − + o(1)
8σn2
k=1
u2 n(n + 1)(2n + 1)
=− + o(1)
8σn2 6
u2
= − + o(1).
2
We deduce that Φwn (u) converges to the characteristic function exp(−u2 /2) of the N(0, 1)
distribution. This allows us to construct an asymptotic test, which rejects H0 as soon as
|wn | ≥ φ1−α/2 .
Bibliography

[1] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning. Springer
Series in Statistics. Springer, New York, second edition, 2009. Data mining, inference, and
prediction.

[2] B. Jourdain. Probabilités et Statistiques. Ellipses, 2016.

[3] E. L. Lehmann and J. P. Romano. Testing statistical hypotheses. Springer Texts in
Statistics. Springer, New York, third edition, 2005.

[4] A. Reinhart. Statistics Done Wrong: The woefully complete guide. No Starch Press, 2015.

[5] A. Tsybakov. Introduction to nonparametric estimation. Springer, 2009.

[6] A. W. van der Vaart. Asymptotic statistics, volume 3 of Cambridge Series in Statistical and
Probabilistic Mathematics. Cambridge University Press, Cambridge, 1998.
