Julien Reygner
Session dates (Fridays): 25 September, 2 October, 9 October, 16 October, 23 October, 6 November, 13 November, 27 November, 11 December, 18 December, 8 January, 15 January, 22 January. The sessions comprise lectures (amphi), exercise classes (petite classe), and the exam (examen).
All information about the course is available on the Teams team STAT - Statistiques et Analyse de Données.

Exercises and labs are to be done during the week between two sessions. Preparing them is mandatory: before the beginning of each session, you must upload a document on Teams, in the shared folder corresponding to the session of your exercise-class group. For exercises, this document must be a PDF file (a legible scan of your handwritten work, or a document typed directly on a computer); for labs, you must upload your notebook in R Markdown format. Depending on the case, your document must be named nom_prenom.pdf or nom_prenom.rmd.
Computer labs

The computer labs use the R software; all information needed to install it is available on Teams. The main purpose of these labs is to illustrate the notions of the course while familiarising students with this language; additional resources are provided on Teams for students who wish to deepen their practice.
Flipped classrooms

Part of the last three lecture sessions is devoted to flipped-classroom presentations: you will teach the course to your classmates. Six topics, described in Appendix B, p. 141, are to be presented by groups of 3 or 4 students, in a talk of about fifteen minutes followed by a discussion of about ten minutes with the rest of the class. Detailed guidelines for preparing the talks are given in the introduction of Appendix B (to be read without fail, starting now). The methods presented in these talks are part of the final exam syllabus: keep in mind that your classmates will expect from you the same pedagogical effort that you demand from your teachers!
Assessment

The module grade is composed of 75% for the final exam and 25% for the evaluation of the flipped-classroom talk.
Course in English

The lecture notes and all teaching material for the course are written in English, in order to expose students to scientific English, which will be very useful for those who wish to spend a semester or a year abroad. The exercise classes, however, are still taught in French. We advise preparing the slides of the flipped-classroom talks in English, but keeping French for the oral presentation. Finally, students are free to write their final exam in French or in English.

Some specific terms have no exact French/English correspondence: the exercise-class teachers are there to clear up any ambiguity this may cause. As an example, let us point out right away a classical subtlety: in English, the terms positive and negative denote strictly positive and strictly negative numbers respectively; a number that is positive or zero is called nonnegative, and, of course, nonpositive for a number that is negative or zero. The same rule applies to the monotonicity of functions: increasing and decreasing denote strictly increasing and strictly decreasing functions respectively; for monotonicity in the weak sense, one uses nondecreasing and nonincreasing. Another notable difference between the English and French conventions concerns the use of a parenthesis to denote the open endpoint of an interval: what would be written [0, 1[ in French is written [0, 1) in English.
Lecture notes

These lecture notes are intended for the students of École des Ponts. An electronic version is available on Teams, and we are generally very happy to send it to any student who may find it useful. We ask you, however, not to post the electronic version on public web pages.

Exercises 'along the way' are included in the body of the chapters; they are direct applications of the course and are not corrected. The end-of-chapter exercises are generally somewhat more original or deeper; they are corrected in Appendix C (except for the mandatory exercises, which are corrected in class).

Finally, some passages of the notes are marked with an asterisk: they are outside the course syllabus, but you may find it useful to come back to them later in your studies.
Office hours

As far as the sanitary situation allows, an in-person office hour slot is planned to complement the distance course. The details will be decided during the first lecture.
Programme of sessions

Lectures (amphis) and exercise classes (petites classes) are indicated by their respective pictograms in the programme below.

For each lecture session, a box lists the main notions to know. At the end of the semester (or better, all along it), do not hesitate to consult this programme to make sure that you have a clear understanding of each of them.

The exercises and labs to prepare are indicated with the pictogram ↸.

The preparation schedule for the flipped classrooms is indicated with the pictogram :.

: Flipped classrooms. Read the short summaries of each of the six topics (Appendix B, p. 141), form groups and choose a topic.

↸ Preparatory work. Exercises 2.A.2 (question 3), 2.A.5 (except question 5).

↸ Preparatory work.

↸ Preparatory work.

: Flipped classrooms. Send your exercise-class teacher an email describing the data set on which you have chosen to apply the method you are to present, and the questions you wish to address.

↸ Preparatory work. Exercises 4.A.2 and 4.A.5 (read Section 4.2.4 before starting this exercise).

: Flipped classrooms.

: Flipped classrooms.

: Flipped classrooms.

↸ The exercises marked with the dedicated pictogram throughout the notes, as well as Exercise 4 p. 133, constitute an excellent revision programme. The exam papers of previous years, available on Educnet, are also recommended.
Contents

Introduction
Check-in list
1 Data analysis
  1.1 Empirical correlation
  1.2 Principal Component Analysis
  1.3 Clustering methods
  1.A Exercises
  1.B Summary
2 Parametric estimation
  2.1 General definitions
  2.2 Maximum Likelihood Estimation and efficiency
  2.3 * Sufficient statistics and the Rao–Blackwell Theorem
  2.4 Confidence intervals
  2.5 * Kernel density estimation
  2.A Exercises
  2.B Summary
3 Hypothesis testing
  3.1 General formalism
  3.2 General construction of a test
  3.3 Examples in the Gaussian model
  3.4 * Multiple comparisons
  3.A Exercises
  3.B Summary
4 Nonparametric tests
  4.1 Models with a finite state space: the χ² test
  4.2 Continuous models on the line: the Kolmogorov test
  4.A Exercises
  4.B Summary
Bibliography
Introduction
For low-dimensional data, elementary tools such as histograms and scatter plots (nuages de points in French) allow one to address the first step by giving a rapid qualitative overview of a data set. However, when the number
of variables collected during an experiment is large, visualising and extracting pertinent information from data becomes more intricate. The techniques of data analysis presented in Chapter 1 make it possible to address these questions.
The collection and visualisation of data generally make it possible to construct a model for the phenomenon under study, which is based on assumptions: for example, that certain variables are independent, or that the distributions of some variables belong to a given family, such as Gaussian or exponential laws. Once the model is constructed, its parameters (for instance, the mean and variance of the variables which are assumed to be Gaussian) have to be estimated from the observation of the data. The framework of parametric estimation, presented in Chapter 2, makes it possible to carry out this estimation procedure with quantitative error estimates, in particular through the notion of confidence intervals.
The first purpose of the construction of a model and of the estimation of its parameters is to improve the understanding of the phenomenon studied by the experiment. For example, when analysing the results of a survey (sondage in French), one may define a model by assuming that picking n people at random and asking them 'Will you vote for Candidate X?' yields a sample of n independent and identically distributed Bernoulli variables, with an unknown parameter p which corresponds to the actual proportion of the population who will vote for Candidate X. It is then of interest to determine whether p is larger or smaller than 1/2 in order to know whether Candidate X will win the election. The theory of hypothesis testing introduced in Chapter 3 provides the framework to address such questions. Hypothesis testing also makes it possible to assess the validity of a model; in particular, the nonparametric tests discussed in Chapter 4 may help determine whether a sample of data is actually distributed according to a given probability law, while the independence and homogeneity tests presented in Chapter 6 are designed to study the dependence between variables.
Last, once a model is constructed and validated through parameter estimation and hypothesis testing, it can be employed for predictive purposes. For instance, assume that at each experiment, pairs of variables (x_i, y_i), i = 1, …, n, are collected, and that the model has validated a functional relation of the form y_i ≃ f(x_i), where the symbol ≃ indicates that some fluctuations are not entirely captured by the function f. In this context, estimating the function f is a regression problem, and its resolution may be expected to allow the prediction of values of y corresponding to a set of variables x which has not been observed during the first n experiments. Two specific regression problems, namely linear and logistic regression, are presented in Chapter 5.
Afterword
These notes are based on the former polycopié of the course coordinated by Jean-François Delmas.
They benefited from many useful discussions with Cristina Butucea, Guillaume Obozinski and
Arnaud Guyader, as well as the lecturers (past and present) of the course: Christophe Denis,
Vincent Feuillard, Patrick Hoscheit, Guillaume Perrin and Gabriel Stoltz. I wish to warmly thank
all of them for their comments and help.
If you find any typo, mistake or imprecision in the text, or if you have any comment on its
contents, please let me know: julien.reygner@enpc.fr. Thank you!
Check-in list
In this short preliminary chapter, we introduce some notation, in particular of linear algebra and
probability theory, which will be used throughout the notes. We also recall a few definitions and
leave to the reader a number of quick questions (and enough room to fill in the blanks), the answer
to which should be a good refresher before delving into the body of these notes.
Linear algebra
For n, p ≥ 1, we denote by R^{n×p} the space of matrices with n rows and p columns. The transpose of a matrix A ∈ R^{n×p} is denoted by A^⊤ ∈ R^{p×n}.
Symmetric matrices
We denote by ⟨·, ·⟩ the usual scalar product on R^p.

What is the definition of the Euclidean norm ‖·‖ on R^p induced by ⟨·, ·⟩?
Orthogonal projections
Let E be a finite-dimensional linear space (espace vectoriel in French), endowed with a scalar product ⟨·, ·⟩, and let H be a linear subspace (sous-espace vectoriel in French) of E.
If H = Span(e) is the linear space generated by some vector e ∈ E such that ‖e‖ = 1, what is the operator ee^⊤?
Probability theory
A probability space is a triple (Ω, A, P), where Ω is a set, A is a σ-algebra on Ω, and P is a
probability measure on (Ω, A).
Random variables
A random variable with values in a measurable space 𝒳 is a measurable function X : Ω → 𝒳.

When 𝒳 = R^d, a random variable X is said to have a density p_X : R^d → [0, +∞) if, for any measurable subset B ⊂ R^d,

P(X ∈ B) = ∫_{x ∈ R^d} 1_{x ∈ B} p_X(x) dx.
– exponential E(λ),
– Gaussian N(µ, σ²)?
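As a quick sanity check of the density formula above, one can compare a Monte Carlo estimate of P(X ∈ B) with the closed-form integral for the exponential law. The sketch below is in Python with NumPy (the course labs use R, so this is only an illustration; the parameter values and tolerance are our own choices):

```python
import numpy as np

# For X ~ E(lam), integrating the density gives
# P(X <= t) = integral_0^t lam * exp(-lam * x) dx = 1 - exp(-lam * t).
rng = np.random.default_rng(0)
lam, t = 2.0, 1.0

# Monte Carlo estimate of P(X <= t) from a large sample
# (NumPy parametrises the exponential law by its scale 1/lam)
sample = rng.exponential(scale=1.0 / lam, size=200_000)
mc_estimate = np.mean(sample <= t)

# closed-form value from integrating the density
exact = 1.0 - np.exp(-lam * t)

print(abs(mc_estimate - exact) < 0.01)
```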
Expand Var(X + Y ).
The covariance matrix of a random vector X = (X_1, …, X_d) ∈ R^d is the matrix with coefficients Cov(X_i, X_j).
Independence
A family of random variables (X_i)_{i∈I}, which take their respective values in some spaces 𝒳_i, is called independent if, for any finite set of distinct indices i_1, …, i_k ∈ I, for any measurable sets B_1 ⊂ 𝒳_{i_1}, …, B_k ⊂ 𝒳_{i_k},

P(X_{i_1} ∈ B_1, …, X_{i_k} ∈ B_k) = P(X_{i_1} ∈ B_1) ⋯ P(X_{i_k} ∈ B_k).
If X, Y ∈ R are independent, what is the value of Cov(X, Y )? What about the converse
statement?
If X_1, …, X_n are iid (independent and identically distributed), what are the expectation and the variance of (1/n) ∑_{i=1}^n X_i?
• in distribution if, for any continuous and bounded function f, lim_{n→∞} E[f(X_n)] = E[f(X)].
State the (strong) Law of Large Numbers and the Central Limit Theorem.
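Both theorems are easy to visualise by simulation. The following Python/NumPy sketch (illustrative only; the course labs use R) checks the LLN and CLT scalings for iid uniform variables:

```python
import numpy as np

# Illustration of the LLN and CLT for iid Uniform(0, 1) variables,
# which have mean 1/2 and variance 1/12.
rng = np.random.default_rng(1)
n, reps = 10_000, 2_000

samples = rng.random((reps, n))
means = samples.mean(axis=1)  # `reps` independent sample means

# LLN: the sample mean concentrates around the true mean 1/2
assert abs(means.mean() - 0.5) < 1e-3

# CLT: sqrt(n) * (mean - 1/2) is approximately N(0, 1/12), so the empirical
# standard deviation of the rescaled means should be close to sqrt(1/12)
z = np.sqrt(n) * (means - 0.5)
print(abs(z.std() - np.sqrt(1 / 12)) < 0.02)
```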
Chapter 1
Data analysis
The expression 'data analysis' covers a set of techniques for extracting and summarising the information contained in databases. In this chapter, we shall focus on databases taking the form of a table with n rows and p columns:

• each row represents a data point, that is to say an individual among a population, the result of a random experiment among a series of identically distributed experiments, etc.;

• each column represents a characteristic of the individuals, which we shall also call a feature.

We only address quantitative variables, which means that the cells of the table may only contain numbers, as opposed to categorical variables. The features can represent quantities of a different nature, for instance physical variables that are not expressed in the same unit of measurement.
In this chapter we shall work with the example of the results of the men's decathlon at the 2016 Summer Olympics, reproduced in Table 1.1. We may extract from this table two arrays with n = 23 rows and p = 10 columns: a first array with the raw results of each event (expressed in seconds or in metres), for which comparing the numbers in two different columns makes no sense; and a second array with the results converted into points, in which case the features are said to be homogeneous.
The table is seen as a matrix x_n ∈ R^{n×p}. Its rows are denoted by x_1, …, x_n and its columns are denoted by x^1, …, x^p. The number in the i-th row and j-th column is x_i^j.

The usual scalar product on R^p (seen as a space of row vectors) is denoted by ⟨·, ·⟩, and the associated Euclidean norm is denoted by ‖·‖. For any (row) vector x ∈ R^p, x^⊤ is the transposed (column) vector.
Name            100m   L. jump  Shot put  H. jump  400m   110m h.  Discus  Pole vault  Javelin  1500m
Eaton           10.46  7.94     14.73     2.01     46.07  13.8     45.49   5.2         59.77    263.3
                985    1045     773       813      1005   1000     777     972         734      789
Mayer           10.81  7.6      15.76     2.04     48.28  14.02    46.78   5.4         65.04    265.5
                903    960      836       840      896    972      804     1035        814      774
Warner          10.3   7.67     13.66     2.04     47.35  13.58    44.93   4.7         63.19    264.9
                1023   977      708       840      941    1029     765     819         786      778
Kazmirek        10.78  7.69     14.2      2.1      46.75  14.62    43.25   5           64.6     271.2
                910    982      741       896      971    896      731     910         807      736
Bourrada        10.75  7.52     13.78     2.1      47.98  14.15    42.39   4.6         66.49    254.6
                917    940      715       896      910    955      713     790         836      849
Suarez          11.21  7.14     14.27     2.07     48.15  14.48    47.07   4.9         72.32    268.3
                814    847      745       868      902    913      810     880         925      756
Ziemek          10.71  7.49     13.44     2.1      49.83  14.77    49.42   5.2         60.92    283
                926    932      694       896      822    878      858     972         752      662
v. d. Plaetsen  11.24  7.66     12.84     2.16     49.63  15.01    43.58   5.4         62.09    274.2
                808    975      657       953      832    848      738     1035        769      717
Felix           10.93  7.42     14.77     2.07     49.14  14.79    45.1    4.5         69.92    270.5
                876    915      776       868      855    875      769     760         888      741
A. de Araujo    10.77  7.48     15.26     1.92     48.14  14.17    45.1    4.9         57.28    271.5
                912    930      806       731      902    953      769     880         697      735
Taiwo           11.01  7.45     14.92     2.19     48.78  14.57    39.91   5           51.29    262
                858    922      785       982      872    902      663     910         608      798
Helcelet        11.06  7.35     15.11     2.04     49.51  14.37    44.13   4.7         68.2     274.4
                847    898      796       840      837    927      749     819         862      716
Auzeil          11.17  7.07     15.41     1.98     49.34  14.82    42.23   5.1         61.91    280.5
                823    830      815       785      845    871      710     941         767      677
Dubler          10.86  7.47     11.49     2.13     48.18  14.3     38.89   4.9         51.82    272.1
                892    927      575       925      900    936      642     880         616      731
Abele           10.87  6.97     15.03     1.98     49.02  14.12    44.66   4.5         64.13    293.1
                890    807      792       785      860    959      760     760         800      600
Victor          10.83  7.11     14.8      1.98     49.8   15.74    53.24   4.4         63.54    284.7
                899    840      777       785      824    762      938     731         791      651
Tonnesen        11.32  7.33     13.69     2.01     50.81  14.99    46.31   5.2         60.15    286.3
                791    893      709       813      778    851      794     972         740      641
Garcia          10.81  6.83     14.58     2.01     48.69  14.25    40.34   4.5         64.7     285
                903    774      764       813      876    942      671     760         809      649
Distelberger    10.84  7.33     13.4      1.89     48.61  14.39    38.09   4.8         61.83    273.5
                897    893      692       705      880    925      626     849         765      722
Ushiro          11.3   6.83     14.14     1.98     50.43  15.09    49.9    4.9         66.63    286.3
                795    774      737       785      795    839      868     880         838      641
Wiesiolek       10.88  6.73     14.17     2.01     50.18  15.09    48.32   4.5         56.68    282.3
                888    750      739       813      806    839      835     760         688      666
Nakamura        11.04  7.13     12        1.92     48.93  14.57    34.91   4.7         51.24    258.4
                852    845      606       731      865    902      562     819         607      823
Saluri          10.82  7.02     13.88     1.77     50.32  16.51    42.96   4.5         46.42    279.4
                901    818      721       602      800    676      725     760         536      684

Table 1.1: Overall results for the 23 best ranked athletes in the men's decathlon at the 2016 Summer Olympics. For each athlete, the first row contains the raw result (in seconds or metres according to the event) and the second row the corresponding number of points. Source: Wikipedia.
We deduce that λ‖u‖² ≥ 0, and since ‖u‖² > 0 by the definition of an eigenvector, λ ≥ 0.
Remark 1.1.4. To highlight the connection with the usual notions of expectation and covariance, one may introduce the probability measure

µ_n = (1/n) ∑_{i=1}^n δ_{x_i}.
We now introduce the notion of empirical correlation, which plays an essential role in data
analysis.
Definition 1.1.5 (Empirical correlation). For all j, j′ ∈ {1, …, p}, the empirical correlation between the features x^j and x^{j′} is defined by

Corr(x^j, x^{j′}) = K_n^{jj′} / √(K_n^{jj} K_n^{j′j′}) = K_n^{jj′} / (σ_n^j σ_n^{j′}).
The empirical correlation coefficient possesses a first geometrical interpretation: indeed, with the notation

1_n = (1, …, 1)^⊤ ∈ R^n,

it is easily observed that Corr(x^j, x^{j′}) is the cosine of the angle made in R^n by the (column) vectors x^j − x̄_n^j 1_n and x^{j′} − x̄_n^{j′} 1_n. In particular, the empirical correlation takes its values between −1 and 1, and it is close to 1 (respectively −1, and 0) if the vectors are aligned (respectively aligned in opposite directions, and orthogonal).
These remarks also lead to a statistical interpretation of the empirical correlation: a positive correlation implies that the features typically take simultaneously large or small values; a negative correlation implies that the features typically vary in opposite directions; and a correlation close to 0 implies that the value of one feature has a weak influence on the value of the other.
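The definition of the empirical correlation is straightforward to implement. The following Python sketch (the toy data are invented for illustration, and the course labs use R) computes it from the centred features and checks it against NumPy's built-in Pearson correlation:

```python
import numpy as np

# Empirical correlation from the definition:
# Corr(x^j, x^j') = K_n^{jj'} / (sigma_n^j * sigma_n^{j'}).
# Toy data, loosely inspired by Table 1.1 (made up for illustration):
x1 = np.array([10.5, 10.8, 10.3, 11.2, 10.9])   # 100m times, in seconds
x2 = np.array([985., 903., 1023., 814., 876.])  # points on another event

def empirical_corr(a, b):
    ca, cb = a - a.mean(), b - b.mean()          # centred features
    return (ca * cb).mean() / np.sqrt((ca**2).mean() * (cb**2).mean())

r = empirical_corr(x1, x2)

# agrees with NumPy's built-in Pearson correlation coefficient
assert abs(r - np.corrcoef(x1, x2)[0, 1]) < 1e-12
print(r < 0)  # slower 100m times go with fewer points: negative correlation
```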
Positive correlations are often interpreted as the trace of a causal relation between two phenomena: for instance, the release of carbon dioxide into the atmosphere is strongly correlated with global temperatures, and it is natural to assume that global warming is, at least partially, caused by the increase in the emission of polluting gases. However, basing this conclusion only on the observation of the correlation is a methodological error, for several reasons, among which is the fact that the correlation is symmetric, so that it would be equally legitimate to conclude that global warming is responsible for the increase of carbon dioxide emissions. Figure 1.1 depicts an example of two data sets with a strong empirical correlation but no actual causality.
Figure 1.1: An example of two data sets with a strong correlation, taken from the website Spurious
Correlations: http://www.tylervigen.com/spurious-correlations.
In order to visualise the pairwise correlations between the features, one may represent all scatter plots (x_i^j, x_i^{j′}), 1 ≤ i ≤ n, for all pairs of indices (j, j′). The example of the decathlon results is shown in Figure 1.2. Another representation of pairwise correlations is sometimes provided by heatmaps (type help("heatmap") in R), in which cells are coloured according to the value of the corresponding correlation.
Scatter plots and heatmaps are a first possible representation of the relations between the different features, but on the one hand they do not show the relations between groups of k ≥ 3 features, and on the other hand their analysis may become difficult when the number p of features is large. Principal Component Analysis, presented in the next section, provides a more global view of the set of points x_1, …, x_n in R^p, which is also based on the analysis of empirical correlations.
Figure 1.2: Scatter plots of the results of Table 1.1 converted into points. The results for the 400m and 110m hurdles appear to be very positively correlated, those for the 100m and shot put look rather negatively correlated, and those for the long jump and discus throw look rather uncorrelated.
Figure 1.3: A set of n = 3 points in the plane, projected onto the affine lines D1 and D2 . The
projection onto D1 seems to yield a better representation of the original set than the projection
onto D2 .
(σ_n^H)² = (1/n) ∑_{i=1}^n ‖x_i^H‖².

By Pythagoras' Theorem,

σ_n² = (σ_n^H)² + (1/n) ∑_{i=1}^n ‖x_i − x_i^H‖²,
Figure 1.4: x_i^H is the orthogonal projection of x_i onto H.
PCA is based on the idea that the deformation of the set of points by the orthogonal projection is measured by the quantity σ_n² − (σ_n^H)². As a consequence, given k ≤ p, we shall look for the linear subspace H, with dimension k, which maximises (σ_n^H)².
Theorem 1.2.2 (Computation of optimal subspaces). For all k ∈ {1, …, p}, let H_k = Span(e_1, …, e_k) be the linear space spanned (sous-espace vectoriel engendré in French) by e_1, …, e_k. We have

(σ_n^{H_k})² = ∑_{j=1}^k λ_j,

and, for any linear subspace H with dimension k,

(σ_n^H)² ≤ (σ_n^{H_k})².
Hence, the computation of optimal subspaces is reduced to the diagonalisation of the empirical
covariance matrix.
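This diagonalisation is easy to carry out numerically. The Python/NumPy sketch below (toy data; the course labs use R, and NumPy works with column eigenvectors rather than the row conventions of the text) diagonalises the empirical covariance matrix K_n and checks that the variance captured by H_k equals λ_1 + ⋯ + λ_k:

```python
import numpy as np

# PCA as diagonalisation of the empirical covariance matrix: the optimal
# k-dimensional subspace is spanned by the eigenvectors associated with
# the k largest eigenvalues.
rng = np.random.default_rng(2)
n, p = 200, 4
x = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # correlated toy data
x = x - x.mean(axis=0)                                  # centre the columns

K = (x.T @ x) / n                  # empirical covariance matrix K_n
lam, e = np.linalg.eigh(K)         # eigh returns eigenvalues in increasing order
lam, e = lam[::-1], e[:, ::-1]     # sort into lambda_1 >= ... >= lambda_p

# variance captured by H_k = Span(e_1, ..., e_k) equals lambda_1 + ... + lambda_k
k = 2
scores = x @ e[:, :k]                   # scores on the first k loadings
captured = (scores**2).sum() / n        # (sigma_n^{H_k})^2
assert abs(captured - lam[:k].sum()) < 1e-8

# the principal components are uncorrelated (cf. Proposition 1.2.4 below)
c1, c2 = scores[:, 0], scores[:, 1]
assert abs(np.mean(c1 * c2)) < 1e-8
```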
therefore

(σ_n^{H_k})² = (1/n) ∑_{i=1}^n ∑_{j=1}^k ⟨e_j x_i^⊤ x_i, e_j⟩ = ∑_{j=1}^k ⟨e_j K_n, e_j⟩ = ∑_{j=1}^k λ_j,
where e_l^H is the orthogonal projection of e_l onto H.

To complete the proof, it remains to show that ∑_{l=1}^p λ_l ‖e_l^H‖² is bounded from above by ∑_{l=1}^k λ_l. To this aim, we rely on the following two remarks:

(i) by Pythagoras' Theorem, ‖e_l^H‖² ≤ ‖e_l‖² = 1;

(ii) ∑_{l=1}^p ‖e_l^H‖² = ∑_{l=1}^p ∑_{j=1}^k ⟨f_j, e_l⟩² = ∑_{j=1}^k ∑_{l=1}^p ⟨e_l, f_j⟩² = ∑_{j=1}^k ‖f_j‖² = k;

which allow us to write

∑_{l=1}^p λ_l ‖e_l^H‖²
  = ∑_{l=1}^k λ_l ‖e_l^H‖² + ∑_{l=k+1}^p λ_l ‖e_l^H‖²
  ≤ ∑_{l=1}^k λ_l ‖e_l^H‖² + λ_k (k − ∑_{l=1}^k ‖e_l^H‖²)
  = ∑_{l=1}^k (λ_l ‖e_l^H‖² + λ_k (1 − ‖e_l^H‖²))
  ≤ ∑_{l=1}^k (λ_l ‖e_l^H‖² + λ_l (1 − ‖e_l^H‖²)) = ∑_{l=1}^k λ_l,

where we have used (ii) at the third line, (i) at the fifth line, and the fact that λ_1 ≥ ⋯ ≥ λ_p. This inequality completes the proof.
The l-th principal component is the column vector of R^n which contains the scores of the n individuals on the l-th loading.
Proposition 1.2.4 (Properties of principal components). The principal components are uncorrelated: Corr(c^l, c^{l′}) = 0 if l ≠ l′. Besides, for all j ∈ {1, …, p},

∀l ∈ {1, …, p},   Corr(c^l, x^j) = √λ_l e_l^j / σ_n^j,

and

∑_{l=1}^p Corr(c^l, x^j)² = 1.   (∗)
This implies that Corr(c^l, c^{l′}) = 0 if l ≠ l′. Notice that this result also shows that if λ_l = 0, then c^l = 0.

Now, for l, j ∈ {1, …, p},

(1/n) ∑_{i=1}^n c_i^l x_i^j = (1/n) ∑_{i=1}^n ∑_{k=1}^p e_l^k x_i^k x_i^j = (e_l K_n)^j = λ_l e_l^j,
and by the first computation above, (1/n) ∑_{i=1}^n (c_i^l)² = λ_l while (1/n) ∑_{i=1}^n (x_i^j)² = K_n^{jj} = (σ_n^j)². As a consequence,

Corr(c^l, x^j) = λ_l e_l^j / (√λ_l √(K_n^{jj})) = √λ_l e_l^j / σ_n^j,

therefore

∑_{l=1}^p Corr(c^l, x^j)² = ∑_{l=1}^p λ_l (e_l^j)² / (σ_n^j)² = K_n^{jj} / (σ_n^j)² = 1.
Remark 1.2.5 (Correlation sphere). Notice that Equation (∗) implies that, for all j ∈ {1, …, p}, the point of R^p with coordinates (Corr(c^1, x^j), …, Corr(c^p, x^j)) is located on the unit sphere of R^p. This sphere is called the correlation sphere, see Figure 1.5. ◦
The plane spanned by (c^1, c^2) is called the first factorial plane. In order to visualise the correlations between the original features and the principal components, it is customary to represent the points of coordinates (Corr(c^1, x^j), Corr(c^2, x^j)), for j ∈ {1, …, p}, in this plane. By Remark 1.2.5, these points are located within the unit circle, which is called the correlation circle of
Figure 1.5: The correlation sphere for p = 3: the dot on the sphere is the point with coordinates (Corr(c^1, x^j), Corr(c^2, x^j), Corr(c^3, x^j)); its projection onto the first factorial plane gives the position of x^j within the correlation circle.
Figure 1.6: Correlation circles of the first three principal components for the unnormalised results
of decathlon.
the first factorial plane. The closer a point is to the circle, the better the feature x^j is explained by the principal components c^1 and c^2. On the contrary, if a point is close to the origin, the corresponding feature is weakly correlated with the first two principal components, and it may be useful to plot the correlation circle in subsequent factorial planes so as to reveal the correlations with subsequent principal components. The selection of the total number of principal components to take into account is discussed in Section 1.2.5.
For the example of the decathlon results, the correlation circles of the first two factorial planes are plotted in Figure 1.6. We observe that the first principal component is highly correlated with the results of the long jump, 400m, 110m hurdles, high jump, 1500m and pole vault. We may therefore suggest interpreting the score of an athlete on this component as an indication of his ability to run relatively short distances, and to jump. The second principal component is positively correlated with the results of the throwing events, so that the score of an athlete on this component quantifies his ability to throw. The third principal component makes it possible to discriminate, among the athletes having the same score on the first component, between jumpers and runners.
The projections of the set of points x_1, …, x_n onto the first two factorial planes are plotted in Figure 1.7. Eaton, Mayer and Warner appear to be the best runners/jumpers (in the sense of the first principal component), but their scores on the third principal component refine their profiles by showing that Warner is more of a runner, Mayer more of a jumper, and Eaton in between. Likewise, the scores on the second principal component identify Suarez and Victor as having a thrower profile.
Figure 1.7: Projections of the set of points x1 , . . . , xn on the first two factorial planes. For the
sake of legibility, only the first 16 athletes are represented.
Figure 1.8: Scree plots (diagrammes d'éboulis in French) for the PCA on the data of Table 1.1, with normalised raw results on the left, and unnormalised results converted into points on the right. An inflection in the decrease is observed at the fourth eigenvalue. The proportion of variance explained by the first 4 axes is 79.5% for raw results, and 81.0% for converted results.
Exercise 1.3.1. Show that the Bell numbers satisfy the recursive relation

B_{n+1} = ∑_{k=0}^n C(n, k) B_k,

for all n ≥ 0, where C(n, k) denotes the binomial coefficient. Hint: fix one of the n + 1 points of the set, and for all k ∈ {1, …, n + 1}, compute how many partitions are such that the class containing this given point has cardinality k. ◦
The result of Exercise 1.3.1 indicates that B_n grows very fast. In practice, it is therefore inconceivable to compare all possible partitions. We describe two families of clustering methods: the k-means algorithm, and hierarchical clustering.
Lemma 1.3.3 (Monotonicity of the WCSS). For all t ≥ 0, let WCSS^(t) denote the Within-Cluster Sum of Squares of the partition S_1^(t), …, S_k^(t). Then WCSS^(t+1) ≤ WCSS^(t).
Proof. The assignment step ensures that, for all j ∈ {1, …, k} and all i ∈ S_j^(t),

‖x_i − x̄_j^(t)‖² ≥ ‖x_i − x̄_{J(i)}^(t)‖².

As a consequence,

WCSS^(t) = ∑_{j=1}^k ∑_{i ∈ S_j^(t)} ‖x_i − x̄_j^(t)‖²
  ≥ ∑_{j=1}^k ∑_{i ∈ S_j^(t)} ‖x_i − x̄_{J(i)}^(t)‖²
  = ∑_{i=1}^n ‖x_i − x̄_{J(i)}^(t)‖²
  = ∑_{j=1}^k ∑_{i ∈ S_j^(t+1)} ‖x_i − x̄_j^(t)‖²,

where we have used the definition of the classes S_j^(t+1) in the update step at the last line. For all j ∈ {1, …, k} and all i ∈ S_j^(t+1),

‖x_i − x̄_j^(t)‖² = ‖x_i − x̄_j^(t+1)‖² + 2⟨x_i − x̄_j^(t+1), x̄_j^(t+1) − x̄_j^(t)⟩ + ‖x̄_j^(t+1) − x̄_j^(t)‖²
  ≥ ‖x_i − x̄_j^(t+1)‖² + 2⟨x_i − x̄_j^(t+1), x̄_j^(t+1) − x̄_j^(t)⟩,

and the definition of x̄_j^(t+1) shows that

∑_{i ∈ S_j^(t+1)} ⟨x_i − x̄_j^(t+1), x̄_j^(t+1) − x̄_j^(t)⟩ = ⟨∑_{i ∈ S_j^(t+1)} (x_i − x̄_j^(t+1)), x̄_j^(t+1) − x̄_j^(t)⟩ = 0.

As a consequence,

∑_{j=1}^k ∑_{i ∈ S_j^(t+1)} ‖x_i − x̄_j^(t)‖² ≥ ∑_{j=1}^k ∑_{i ∈ S_j^(t+1)} ‖x_i − x̄_j^(t+1)‖² = WCSS^(t+1),

which completes the proof.
The statement of Lemma 1.3.3 is not sufficient to ensure that the number of iterations of the algorithm is finite, as one might imagine that, after a certain index t_0, the sequence {S_1^(t), …, S_k^(t); t ≥ t_0} is periodic, visiting partitions which all have the same WCSS. This point is addressed in Exercise 1.A.6, where it is proved that such periodic situations cannot occur.
In general, the outcome of the algorithm is not the optimal partition: it is only some kind of
‘local minimum’ of the WCSS, which heavily depends on the choice of the initial partition. In
order to find an acceptable solution, it is customary to draw several different initial configurations
(say at random), run the k-means algorithm in parallel, and select the final configuration with the
smallest WCSS.
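The iteration just described, together with the random-restart strategy, can be sketched in a few lines of code. The TPs of this course use R; the following Python sketch (all function names are ours, not part of the course material) is only illustrative, and assumes numerical data stored as an n × p array.

```python
import numpy as np

def kmeans(x, k, n_iter=100, seed=None):
    """One run of the k-means iteration: assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Initialisation: k distinct points of the sample serve as centroids.
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: J(i) is the index of the closest centroid.
        dist = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Update step: each centroid becomes the mean of its class
        # (an empty class keeps its previous centroid).
        new_centroids = centroids.copy()
        for j in range(k):
            if np.any(labels == j):
                new_centroids[j] = x[labels == j].mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and within-cluster sum of squares.
    dist = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    wcss = (dist[np.arange(len(x)), labels] ** 2).sum()
    return labels, centroids, wcss

def best_of_kmeans(x, k, n_starts=10, seed=0):
    """Draw several initial configurations; keep the run with smallest WCSS."""
    runs = [kmeans(x, k, seed=seed + s) for s in range(n_starts)]
    return min(runs, key=lambda run: run[2])
```

With several random starts, only the run achieving the smallest WCSS is retained, as suggested above.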
Figure 1.10: Outcome of one realisation of the clustering by the k-means algorithm for 5 classes on the results of decathlon. The athletes are represented in the first factorial plane (axes $e_1$, $e_2$).
The algorithm depends on the choice of a notion of distance $d(S, S')$ between two sets of points S and S′. Among the many possible choices of distances¹³, the Ward distance
$$d_W(S, S') = \frac{|S||S'|}{|S| + |S'|} \|\bar{x}_S - \bar{x}_{S'}\|^2, \qquad \bar{x}_S := \frac{1}{|S|} \sum_{x \in S} x, \qquad \bar{x}_{S'} := \frac{1}{|S'|} \sum_{x \in S'} x,$$
is of particular interest, owing to its connection with the WCSS stated in Lemma 1.3.4.
Lemma 1.3.4 (Ward distance and WCSS). Let S and S′ be two sets of points with respective means $\bar{x}_S$ and $\bar{x}_{S'}$, and let $\bar{x}_{S \cup S'}$ refer to the mean of $S \cup S'$. We have
$$\sum_{x \in S \cup S'} \|x - \bar{x}_{S \cup S'}\|^2 = \sum_{x \in S} \|x - \bar{x}_S\|^2 + \sum_{x \in S'} \|x - \bar{x}_{S'}\|^2 + d_W(S, S').$$
13. See for example https://en.wikipedia.org/wiki/Hierarchical_clustering.
Figure 1.11: Dendrogram of the agglomerative hierarchical clustering on the results of decathlon (produced in R by hclust with complete linkage).
and a similar identity holds for the sum over S′, so that
$$\sum_{x \in S \cup S'} \|x - \bar{x}_{S \cup S'}\|^2 = \sum_{x \in S} \|x - \bar{x}_S\|^2 + |S| \|\bar{x}_S - \bar{x}_{S \cup S'}\|^2 + \sum_{x \in S'} \|x - \bar{x}_{S'}\|^2 + |S'| \|\bar{x}_{S'} - \bar{x}_{S \cup S'}\|^2.$$
On the other hand,
$$\bar{x}_{S \cup S'} = \frac{1}{|S \cup S'|} \sum_{x \in S \cup S'} x = \frac{|S|}{|S| + |S'|} \bar{x}_S + \frac{|S'|}{|S| + |S'|} \bar{x}_{S'},$$
therefore
$$|S| \|\bar{x}_S - \bar{x}_{S \cup S'}\|^2 + |S'| \|\bar{x}_{S'} - \bar{x}_{S \cup S'}\|^2 = \left( \frac{|S||S'|^2}{(|S| + |S'|)^2} + \frac{|S|^2|S'|}{(|S| + |S'|)^2} \right) \|\bar{x}_S - \bar{x}_{S'}\|^2 = d_W(S, S'),$$
which completes the proof.
As a corollary of Lemma 1.3.4, it is an easy exercise to deduce that the Ward distance makes it possible to minimise the increase of the WCSS at each iteration of the agglomerative algorithm.
Exercise 1.3.5. For all t ∈ {0, . . . , n − 1}, let $\mathrm{WCSS}^{(t)}$ denote the WCSS of the partition $S_1^{(t)}, \dots, S_{n-t}^{(t)}$ returned by the algorithm. For all t ∈ {0, . . . , n − 2}, for any $j \neq l \in \{1, \dots, n - t\}$, we denote by $\mathrm{WCSS}^{(t)}_{jl}$ the WCSS of the partition in which the classes $S_j^{(t)}$ and $S_l^{(t)}$ have been merged. Show that
$$\mathrm{WCSS}^{(t+1)} = \min_{j \neq l} \mathrm{WCSS}^{(t)}_{jl}. \qquad \circ$$
Once again, the partitions of the sequence returned by the algorithm are generally not optimal.
In contrast with the k-means algorithm, they do not depend on an arbitrary choice of the initial
configuration. However, the algorithmic complexity of the method makes it not very useful for large data sets.
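To make this complexity issue concrete, here is a naive sketch of the agglomerative algorithm with the Ward distance (Python, helper names ours; the course TPs use R). At each step, every pair of classes is examined before merging the closest two, so the overall cost grows at least cubically with n.

```python
import numpy as np

def ward_distance(S, T):
    """d_W(S, S') = |S||S'|/(|S|+|S'|) * ||mean(S) - mean(S')||^2."""
    xS, xT = S.mean(axis=0), T.mean(axis=0)
    return len(S) * len(T) / (len(S) + len(T)) * ((xS - xT) ** 2).sum()

def agglomerative_ward(x, k):
    """Naive agglomerative clustering: repeatedly merge the two classes
    at minimal Ward distance, until only k classes remain."""
    clusters = [[i] for i in range(len(x))]  # start from singletons
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = ward_distance(x[clusters[a]], x[clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge class b into class a
        del clusters[b]
    return clusters
```

Stopping the loop one merge at a time, and recording the merge heights, would produce the dendrogram of Figure 1.11; R's hclust implements the same idea with more efficient bookkeeping.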
Exercise 1.3.6. Find an example of a data set where the k-means algorithm, or the agglomerative
hierarchical classification, do not provide optimal results. ◦
1.A Exercises
Exercise 1.A.1. Let $K_n$ be the covariance matrix of $x_1, \dots, x_n$. Show that the row space¹⁴ of $K_n$, namely the set of (row) vectors $\{u K_n : u \in \mathbb{R}^p\}$, is the linear subspace of $\mathbb{R}^p$ spanned by $x_1 - \bar{x}_n, \dots, x_n - \bar{x}_n$. ◦
Exercise 1.A.2. We observe the set of pairs $(x_i^1, x_i^2)$, i ∈ {1, . . . , n}, which are related by the identity $x_i^2 = \alpha x_i^1 + \beta$, where α, β ∈ R. What is the value of the empirical correlation between the features $x^1$ and $x^2$? ◦
Exercise 1.A.3. We observe the set of pairs $(x_i^1, x_i^2)$, i ∈ {1, . . . , n}, which are related by the identity $x_i^2 = \sin(\pi k x_i^1 / 2)$. For k large, draw a scatter plot of the sample, using statistical software, and assuming that the $x_i^1$ are randomly distributed over [0, 1]. Compute the empirical correlation. What do you conclude about the interpretation of the empirical correlation? ◦
Exercise 1.A.4 (Spearman’s coefficient). For data sets which are related by a monotonic but nonlinear function, the Bravais–Pearson coefficient may be inadequate to represent the dependency between the features. Let us take a set of pairs $(x_i^1, x_i^2)$, i ∈ {1, . . . , n}, such that both vectors of features have pairwise distinct coordinates. For any index i ∈ {1, . . . , n}, we let $r_i^1$ and $r_i^2$ be the respective ranks of $x_i^1$ and $x_i^2$ among the series $x_1^1, \dots, x_n^1$ and $x_1^2, \dots, x_n^2$ sorted in increasing order; in other words, if $\sigma^1$ and $\sigma^2$ are the permutations of {1, . . . , n} such that
$$x^1_{\sigma^1(1)} < \cdots < x^1_{\sigma^1(n)}, \qquad x^2_{\sigma^2(1)} < \cdots < x^2_{\sigma^2(n)},$$
then $r_i^j = (\sigma^j)^{-1}(i)$. The Spearman coefficient is defined by
$$r_s = \mathrm{Corr}(r^1, r^2).$$
1. Compute $r^1$, $r^2$ and the Spearman coefficient for the set (15, 9), (12, 7), (18, 8).
14. Image à droite in French.
3. Show that
$$r_s = 1 - \frac{6}{n(n^2 - 1)} \sum_{i=1}^n (r_i^1 - r_i^2)^2. \qquad \circ$$
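The two expressions of $r_s$, as the Bravais–Pearson correlation of the rank vectors and via the closed form of question 3, can be compared numerically. A minimal Python sketch (function names are our own choices):

```python
import numpy as np

def ranks(v):
    """Rank of each coordinate in increasing order (ranks start at 1)."""
    order = np.argsort(v)  # order plays the role of the permutation sigma
    r = np.empty(len(v), dtype=float)
    r[order] = np.arange(1, len(v) + 1)
    return r

def spearman(x1, x2):
    """Spearman coefficient: Bravais-Pearson correlation of the rank vectors."""
    r1, r2 = ranks(np.asarray(x1)), ranks(np.asarray(x2))
    return np.corrcoef(r1, r2)[0, 1]

def spearman_formula(x1, x2):
    """Closed form: 1 - 6/(n(n^2-1)) * sum of squared rank differences."""
    r1, r2 = ranks(np.asarray(x1)), ranks(np.asarray(x2))
    n = len(r1)
    return 1 - 6 * ((r1 - r2) ** 2).sum() / (n * (n ** 2 - 1))
```

On a monotonically but nonlinearly related pair, such as $x_i^2 = \exp(x_i^1)$, both functions return 1, whereas the Bravais–Pearson coefficient of the raw features is smaller.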
Exercise 1.A.5 (Dobinski’s formula). Let X be a random variable with Poisson distribution of parameter 1. For all n ≥ 0, we define $b_n = \mathbb{E}[X^n]$.
2. In general, using the proof of Lemma 1.3.3, show that if $\mathrm{WCSS}^{(t+1)} = \mathrm{WCSS}^{(t)}$, then the partitions $S_1^{(t+1)}, \dots, S_k^{(t+1)}$ and $S_1^{(t+2)}, \dots, S_k^{(t+2)}$ necessarily coincide.
For all k ∈ {1, . . . , K}, we write $I_k = \{n_1 + \cdots + n_{k-1} + 1, \dots, n_1 + \cdots + n_k\}$ and for all $i \in I_k$, we define $x_i = \mu_k + \epsilon_i$.
1. When $\rho \ll \min_{1 \leq j < j' \leq K} \|\mu_j - \mu_{j'}\|$, describe the general shape of the set of points $(x_i)_{1 \leq i \leq n}$ in $\mathbb{R}^p$.
2. Give an upper bound on the WCSS of the partition {I1 , . . . , IK } of {1, . . . , n} in terms of
ρ. Show that when ρ → 0, this partition has the smallest WCSS.
3. Express the vector $\bar{x}_n \in \mathbb{R}^p$ in terms of $\mu_1, \dots, \mu_K$ and $\bar{\epsilon}_n = \frac{1}{n} \sum_{i=1}^n \epsilon_i$.
4. Show that the empirical covariance matrix $K_n$ satisfies $K_n = K_n^0 + O(\rho)$, and give the expression of the matrix $K_n^0$.
5. Describe the shape of the scree plot of the Principal Component Analysis of this data set
when ρ → 0. You may admit that the function A 7→ Spectrum of A is Lipschitz continuous
on the space of symmetric matrices.
6. Interpreting, for each individual $x_i$ with $i \in I_k$, the vector $\mu_k$ as ‘actual information’ and the vector $\epsilon_i$ as ‘noise’, what do you conclude? ◦
1.B Summary
1.B.1 Statistical description of a data set
The data set is the n × p matrix xn , with coefficients xji , i ∈ {1, . . . , n}, j ∈ {1, . . . , p}.
• Empirical mean: row vector $\bar{x}_n = \frac{1}{n} \sum_{i=1}^n x_i$ in $\mathbb{R}^p$.
• Empirical covariance: symmetric nonnegative matrix $K_n = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x}_n)^\top (x_i - \bar{x}_n)$.
• Empirical correlation between features $x^j$ and $x^{j'}$: $\mathrm{Corr}(x^j, x^{j'}) = K_n^{jj'} / \sqrt{K_n^{jj} K_n^{j'j'}}$, which lies in [−1, 1].
• The loadings $(e_1, \dots, e_p)$ are interpreted as ‘aggregated features’ of the model; their relation with the original features is visualised in the correlation circles.
1.B.3 Clustering
• Purpose: find a partition into k classes with similar features.
Parametric estimation
The basic problem of statistical inference can be summarised as follows: given independent realisations $x_1, \dots, x_n$ of a random variable X with unknown distribution P, how can one reconstruct, or estimate, the probability measure P? The knowledge of P is useful to understand the physical origin of the random phenomenon under study, and to predict the probability of occurrence of its future outcomes.
The main technical issue of the estimation of P is that one has to search within the set of
probability measures on a given state space, which may be huge. In the context of parametric
estimation, this search is restricted to a low-dimensional space by assuming that P has a certain
predetermined shape, which only depends on a few parameters.
P = {Pθ , θ ∈ Θ}
Once a parametric model is fixed, the issue of estimating a probability measure P is thus
reduced to the estimation of a parameter θ ∈ Rq .
Example 2.1.2 (The Exponential model). The lifespan of a lightbulb is a positive random variable X > 0. If one assumes that this variable has an exponential distribution E(λ) for some unknown parameter λ > 0, then the statistician’s task amounts to estimating the value of λ from the observation of the values of independent random variables $X_1, \dots, X_n$ distributed according to E(λ).
1. Tribu in French.
The choice of a model depends on the physical features of the phenomenon under study. Table 2.1 presents a few possible (and classical) choices, depending on the type of the data.
$$\mathcal{P} = \{P_\theta,\ \theta \in \Theta\}, \qquad \Theta \subset \mathbb{R}^q,$$
is fixed. For all θ ∈ Θ, it is convenient to denote by $\mathbf{P}_\theta$ the probability measure under which, for all n ≥ 1, the random variables $X_1, \dots, X_n \in \mathcal{X}$ are iid according to $P_\theta$. The notation $\mathbf{E}_\theta$, $\mathrm{Var}_\theta$, etc. is defined accordingly. We shall also write
$$\mathbf{X}_n = (X_1, \dots, X_n) \in \mathcal{X}^n.$$
2.1.2 Estimator
Definition 2.1.3 (Statistic). A statistic $T_n$ is a random variable that can be written as
$$T_n = t_n(\mathbf{X}_n),$$
for some measurable function $t_n$ which does not depend on θ.
In other words, a statistic is a random variable whose value can be computed from the observation of $\mathbf{X}_n$ without knowing θ.
As an example with $\mathcal{X} = \mathbb{R}$, the empirical mean
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$$
is a statistic.
Gaussian model, one may only be willing to estimate µ and not σ², even if the latter is not known;
or in a more general context, it may be interesting to estimate directly some quantities such as
Eθ [X1 ] or Pθ (X1 ≥ x) without resorting to the estimation of the complete value of θ. For these
reasons, we fix a function g : Θ → Rd and turn our interest to the estimation of g(θ). We denote
by $\|\cdot\|$ and $\langle\cdot,\cdot\rangle$ the Euclidean norm and the scalar product in $\mathbb{R}^d$.
Definition 2.1.4 (Estimator of g(θ)). An estimator of g(θ) is a statistic Zn which takes its values
in g(Θ).
Exercise 2.1.5. In the Gaussian model {N(µ, σ²), µ ∈ R, σ² > 0}, we let g(µ, σ²) = µ. Which of the random variables 0, µ, $X_1$, $\bar{X}_n$ are estimators of g(µ, σ²)? ◦
The exercise above shows that several estimators can exist for g(θ), some of which are intu-
itively better than others. The quality of an estimator can be measured with several criteria.
Definition 2.1.6 (Bias of an estimator). Let $Z_n$ be an estimator of g(θ) such that, for all θ ∈ Θ, $\mathbb{E}_\theta[\|Z_n\|] < +\infty$. The bias of $Z_n$ is the function $b(Z_n; \cdot) : \Theta \to \mathbb{R}^d$ defined by
$$b(Z_n; \theta) = \mathbb{E}_\theta[Z_n] - g(\theta).$$
If $b(Z_n; \cdot) \equiv 0$, the estimator $Z_n$ is said to be unbiased.
Example 2.1.7 (Empirical mean and variance). The empirical mean $\bar{X}_n$ defined above is easily seen to be an unbiased estimator of $\mathbb{E}_\theta[X_1]$. By contrast, it is an elementary exercise to show that, for all θ ∈ Θ,
$$\mathbb{E}_\theta[V_n] = \left(1 - \frac{1}{n}\right) \mathrm{Var}_\theta[X_1],$$
so that as soon as $\mathrm{Var}_\theta[X_1] > 0$, the empirical variance is biased. This motivates the introduction of the unbiased estimator of the variance
$$S_n^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X}_n)^2.$$
In the latter example, the biased estimator of the variance is easily corrected into an unbiased
estimator. However, we shall see in Exercise 2.A.3 that there are situations in which there is no
unbiased estimator.
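The identity $\mathbb{E}_\theta[V_n] = (1 - 1/n)\mathrm{Var}_\theta[X_1]$ can be observed in a short Monte Carlo experiment. In this Python sketch (the course TPs use R), the sample size n = 5 and the distribution N(0, 4) are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_rep = 5, 200_000
true_var = 4.0  # variance of N(0, 2^2)

samples = rng.normal(0.0, 2.0, size=(n_rep, n))
vn = samples.var(axis=1, ddof=0)    # empirical variance V_n (biased)
sn2 = samples.var(axis=1, ddof=1)   # corrected estimator S_n^2 (unbiased)

print(vn.mean())   # close to (1 - 1/n) * 4 = 3.2
print(sn2.mean())  # close to 4
```

The factor 1 − 1/n = 0.8 is clearly visible for such a small sample size, and vanishes as n grows.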
Definition 2.1.8 (Mean Squared Error). Let $Z_n$ be an estimator of g(θ) such that, for all θ ∈ Θ, $\mathbb{E}_\theta[\|Z_n\|^2] < +\infty$. The Mean Squared Error² (MSE) of $Z_n$ is the function $R(Z_n; \cdot) : \Theta \to [0, +\infty)$ defined by
$$R(Z_n; \theta) = \mathbb{E}_\theta[\|Z_n - g(\theta)\|^2].$$
The MSE naturally measures a certain distance between $Z_n$ and g(θ), and is therefore an indication of the quality of the estimator $Z_n$. It has the general shape
$$R(Z_n; \theta) = \mathbb{E}_\theta[\ell(Z_n; g(\theta))],$$
with $\ell : \mathbb{R}^d \times \mathbb{R}^d \to [0, +\infty)$ given by $\ell(z, z') = \|z - z'\|^2$. In this formulation, ℓ is called a loss function. By taking loss functions that are not quadratic, one can construct different risk functions $R(Z_n; \cdot)$, which provide other measures of the accuracy of the estimator $Z_n$.
The choice of a quadratic loss function has the advantage of entailing a decomposition of the MSE into a bias term and a variance term. We recall that the variance of a random vector $Z \in \mathbb{R}^d$ is defined by $\mathrm{Var}(Z) = \mathbb{E}[\|Z - \mathbb{E}[Z]\|^2]$.
2. Risque quadratique in French.
Proposition 2.1.9 (Bias and variance decomposition of the MSE). Let $Z_n$ be an estimator of g(θ) such that, for all θ ∈ Θ, $\mathbb{E}_\theta[\|Z_n\|^2] < +\infty$. For all θ ∈ Θ,
$$R(Z_n; \theta) = \|b(Z_n; \theta)\|^2 + \mathrm{Var}_\theta(Z_n).$$
Exercise 2.1.10 (Comparison of two estimators). Assume that X = R and, for all θ ∈ Θ, Eθ [|X1 |] <
+∞. Let g(θ) = Eθ [X1 ]. Compute the MSE of the estimators X1 and X n of g(θ). ◦
In the previous exercise, X n has a lower MSE than X1 , uniformly with respect to the parameter
θ. We now present an example of a family of estimators for which it is not possible to minimise
the MSE uniformly over the parameter.
Exercise 2.1.11 (On the bias–variance tradeoff). In the Bernoulli model {B(p), p ∈ [0, 1]}, we consider the ‘natural estimator’
$$\hat{p}_n = \frac{1}{n} \sum_{i=1}^n X_i$$
as well as, for all h ∈ [0, 1], the estimator
$$\hat{p}_n^h = (1 - h)\hat{p}_n + \frac{h}{2}.$$
4. For p ∈ [0, 1], compute the value $h_n^*(p)$ of h which minimises the MSE $R(\hat{p}_n^h; p)$. ◦
Exercise 2.1.11 shows that in general, the bias and the variance cannot be simultaneously
minimised, and a trade-off between the bias and the variance must be considered.
Remark 2.1.13 (Consistency and MSE). If R(Zn ; θ) converges to 0 for all θ ∈ Θ, then Zn is
consistent. ◦
Exercise 2.1.14. If X = R and Eθ [X1 ] and Varθ [X1 ] exist, show that the empirical mean and
variance are strongly consistent. ◦
More generally, it is often the case that an estimator can be written as a continuous function
of the empirical mean of a sequence of iid random variables, in which case consistency properties
can be deduced from the Law of Large Numbers.
Definition 2.1.15 (Asymptotic normality). A consistent estimator $Z_n$ of g(θ) is asymptotically normal if, for all θ ∈ Θ, there exists a symmetric and nonnegative matrix $K(\theta) \in \mathbb{R}^{d \times d}$ such that $\sqrt{n}(Z_n - g(\theta))$ converges in distribution, under $P_\theta$, to the d-dimensional Gaussian measure $\mathcal{N}_d(0, K(\theta))$. The matrix-valued function $\theta \mapsto K(\theta)$ is called the asymptotic covariance of $Z_n$.
When d = 1, K(θ) is a nonnegative scalar, called the asymptotic variance of Zn . In the sequel
of the course, the notion of asymptotic normality shall play a central role in the construction of
asymptotic confidence intervals and tests.
Asymptotic normality results are almost always obtained by applying the Central Limit The-
orem, possibly in its multidimensional version recalled in Appendix A.3, p. 137. Sometimes, it
needs to be combined with the Delta Method.
Theorem 2.1.16 (Delta Method). Let $(\zeta_n)_{n \geq 1}$ be a sequence of $\mathbb{R}^k$-valued random variables and $a \in \mathbb{R}^k$ such that $\zeta_n \to a$ in probability and $\sqrt{n}(\zeta_n - a)$ converges in distribution to some random vector $G \in \mathbb{R}^k$. Let $\varphi : \mathbb{R}^k \to \mathbb{R}^d$ be a $C^1$ function. Then
$$\lim_{n \to +\infty} \sqrt{n}\,(\varphi(\zeta_n) - \varphi(a)) = \varphi'(a) G, \quad \text{in distribution},$$
where $\varphi'(a) \in \mathbb{R}^{d \times k}$ denotes the Jacobian matrix of φ at a.
Proof. In order to avoid technical arguments, we assume that φ is $C^2$, with globally bounded second derivatives³.
Let i ∈ {1, . . . , d}. For all n ≥ 1,
$$\sqrt{n}(\varphi_i(\zeta_n) - \varphi_i(a)) = \sqrt{n} \int_{t=0}^1 \frac{\mathrm{d}}{\mathrm{d}t} \varphi_i((1 - t)a + t\zeta_n)\,\mathrm{d}t = \left\langle \sqrt{n}(\zeta_n - a),\, \int_{t=0}^1 \nabla\varphi_i((1 - t)a + t\zeta_n)\,\mathrm{d}t \right\rangle.$$
As a consequence,
$$\lim_{n \to +\infty} \int_{t=0}^1 \nabla\varphi_i((1 - t)a + t\zeta_n)\,\mathrm{d}t = \nabla\varphi_i(a), \quad \text{in probability},$$
and the conclusion then follows from Slutsky’s Lemma.
Examples of applications of this theorem are given in the next section. In particular, we shall generally use the combination of the Central Limit Theorem and the Delta Method in the following form: if $X_1, \dots, X_n \in \mathbb{R}$ are iid with finite variance and φ is $C^1$, then
$$\sqrt{n}\left(\varphi(\bar{X}_n) - \varphi(\mathbb{E}[X_1])\right) \to \mathcal{N}\left(0,\, \varphi'(\mathbb{E}[X_1])^2\,\mathrm{Var}(X_1)\right), \quad \text{in distribution}.$$
from which we conclude that the estimator $\tilde{\lambda}_n$ is asymptotically normal, with asymptotic variance $\varphi'(1/\lambda)^2 / \lambda^2 = \lambda^2$.
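This asymptotic variance can be checked by simulation. In the following Python sketch (the course TPs use R; the parameter values are arbitrary), the normalised error $\sqrt{n}(\tilde{\lambda}_n - \lambda)$ with $\tilde{\lambda}_n = 1/\bar{X}_n$ has empirical mean close to 0 and empirical variance close to λ².

```python
import numpy as np

rng = np.random.default_rng(1)
lam, n, n_rep = 2.0, 2000, 20_000

# n_rep independent Exponential(lambda) samples of size n.
x = rng.exponential(1.0 / lam, size=(n_rep, n))
lam_tilde = 1.0 / x.mean(axis=1)    # estimator 1 / empirical mean
z = np.sqrt(n) * (lam_tilde - lam)  # normalised error

print(z.mean(), z.var())  # approximately 0 and lambda^2 = 4
```

A histogram of z would exhibit the Gaussian shape predicted by the Central Limit Theorem combined with the Delta Method.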
The abstract generalisation of this procedure is called the method of moments. For the estimation of $g(\theta) \in \mathbb{R}^d$ in the model $\{P_\theta, \theta \in \Theta\}$, it consists in finding functions φ and m such that, for all θ ∈ Θ,
$$\mathbb{E}_\theta[\varphi(X_1)] = m(g(\theta)).$$
Then the Law of Large Numbers makes it possible to approximate m(g(θ)) with $\frac{1}{n} \sum_{i=1}^n \varphi(X_i)$, so that as soon as m has a continuous inverse function $m^{-1}$,
$$Z_n = m^{-1}\left(\frac{1}{n} \sum_{i=1}^n \varphi(X_i)\right)$$
is a strongly consistent estimator of g(θ). Under further regularity assumptions on m−1 , the
Central Limit Theorem and the Delta Method show that Zn is asymptotically normal.
Exercise 2.1.17. Write the general expression of the asymptotic variance of Zn in terms of Eθ [ϕ(X1 )]
and Varθ [ϕ(X1 )]. ◦
When constructing an estimator with the method of moments, one usually tries the ‘simplest’ functions φ, for which the computation of $\mathbb{E}_\theta[\varphi(X_1)]$ is possible or easy: typically, φ(x) = x or φ(x) = |x|², |x|³, . . . are natural candidates. The method can also be employed with functions of the form $\varphi(x) = \mathbb{1}_{\{x \leq x_0\}}$ for given $x_0$; see Exercise 2.A.8 for instance.
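As a concrete instance of the recipe $Z_n = m^{-1}(\frac{1}{n}\sum_i \varphi(X_i))$, consider the Exponential model with φ(x) = x, for which $\mathbb{E}_\lambda[X_1] = 1/\lambda$, i.e. m(λ) = 1/λ and $m^{-1}(y) = 1/y$. A minimal Python sketch (function names are ours):

```python
import numpy as np

def moment_estimator(x, phi, m_inv):
    """Generic method of moments: Z_n = m^{-1}( mean of phi(X_i) )."""
    return m_inv(np.mean(phi(x)))

# Exponential model: phi(x) = x, m(lambda) = 1/lambda, m^{-1}(y) = 1/y.
rng = np.random.default_rng(10)
true_lam = 1.5
x = rng.exponential(1.0 / true_lam, size=100_000)

lam_hat = moment_estimator(x, phi=lambda v: v, m_inv=lambda y: 1.0 / y)
print(lam_hat)  # close to true_lam = 1.5
```

Other choices of φ, for example indicator functions of half-lines, fit the same template with the corresponding m and $m^{-1}$.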
4. But it is an instructive exercise to check that it is biased.
The Maximum Likelihood Estimation relies on the principle of selecting the parameter which
makes the observed realisation of the sample the most likely.
Definition 2.2.2 (Maximum Likelihood Estimator). Assume that, for all $x_n = (x_1, \dots, x_n) \in \mathcal{X}^n$, the function $\theta \mapsto L_n(x_n; \theta)$ reaches a global maximum at $\theta = \theta_n(x_n)$. The Maximum Likelihood Estimator (MLE) of θ is the statistic defined by
$$\hat{\theta}_n = \theta_n(\mathbf{X}_n).$$
Remark 2.2.3 (On the notation). In this definition, the notation $\theta_n$ refers to the deterministic function $\mathcal{X}^n \to \Theta$, while $\hat{\theta}_n$ denotes the random variable obtained by applying the function $\theta_n$ to the random vector $\mathbf{X}_n$. ◦
When the function $\theta \mapsto L_n(x_n; \theta)$ is differentiable, $\theta_n(x_n)$ can be computed by looking for the points at which the derivative of $L_n(x_n; \theta)$ vanishes — without forgetting to check that these points actually correspond to a maximum! In this perspective, it may be more convenient to take the derivative of the log-likelihood
$$\ell_n(x_n; \theta) = \log L_n(x_n; \theta)$$
rather than $L_n(x_n; \theta)$, because the product over i in the definition of the likelihood is turned into a sum. Since the logarithm is increasing, both approaches are equivalent.
Remark 2.2.4 (Notions of Z- and M-estimator). Moment estimators are defined as the solution to
an algebraic equation: such estimators are called Z-estimators (because they are Zeroes of a func-
tion). In contrast, the MLE is given as the solution to an optimisation problem: such estimators
are called M-estimators (because they are Maxima of a function). The Least Square Estimator in-
troduced in the context of linear regression detailed in Section 5.1 of Chapter 5 is another example
of an M-estimator.
In the academic examples of the present notes, the resolution of algebraic equations or optimisation problems can generally be carried out analytically (with the notable exception of the logistic regression described in Chapter 5). This is however not possible in general, and computing Z- or M-estimators often requires a numerical method. ◦
5. Vraisemblance in French.
The log-likelihood is
$$\ell_n(x_n; \lambda) = n \log \lambda - \lambda \sum_{i=1}^n x_i,$$
whose derivative vanishes at the point $\lambda_n(x_n) = 1/\bar{x}_n$. Since
$$\frac{\mathrm{d}}{\mathrm{d}\lambda} \ell_n(x_n; \lambda) > 0 \text{ if } \lambda < \lambda_n(x_n), \qquad \frac{\mathrm{d}}{\mathrm{d}\lambda} \ell_n(x_n; \lambda) < 0 \text{ if } \lambda > \lambda_n(x_n),$$
the log-likelihood — and therefore the likelihood — actually attains its maximum at $\lambda_n(x_n)$. As a consequence, the MLE of λ is
$$\hat{\lambda}_n = \lambda_n(\mathbf{X}_n) = \frac{1}{\bar{X}_n}.$$
The MLE coincides with the moment estimator obtained in Section 2.1.4. We shall see in Example 2.2.9 that this is not always the case.
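The closed form $\hat{\lambda}_n = 1/\bar{X}_n$ can be cross-checked by maximising the log-likelihood numerically. The following Python sketch uses a crude grid search with an arbitrary true value λ = 3; in the R practicals, an optimiser would play the same role.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(1.0 / 3.0, size=1000)  # true lambda = 3

def log_likelihood(lam, x):
    """l_n(x_n; lambda) = n log(lambda) - lambda * sum(x_i)."""
    return len(x) * np.log(lam) - lam * x.sum()

# Crude grid maximisation of the log-likelihood...
grid = np.linspace(0.01, 10.0, 100_000)
lam_grid = grid[np.argmax(log_likelihood(grid, x))]

# ...agrees with the closed form lambda_n(x_n) = 1 / mean(x).
lam_mle = 1.0 / x.mean()
print(lam_grid, lam_mle)
```

Both values coincide up to the grid resolution, and are close to the true parameter.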
Exercise 2.2.6 (The Bernoulli model). Compute the MLE of p in the Bernoulli model {B(p), p ∈
[0, 1]}. Show that it is strongly consistent and asymptotically normal. ◦
Example 2.2.7 (The Gaussian model). The likelihood of a realisation $x_n = (x_1, \dots, x_n) \in \mathbb{R}^n$ in the Gaussian model {N(µ, σ²), µ ∈ R, σ² > 0} writes
$$L_n(x_n; \mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\right).$$
The log-likelihood is
$$\ell_n(x_n; \mu, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2,$$
and the fact that the log-likelihood actually attains its maximum at the point $(\mu_n(x_n), \sigma_n^2(x_n))$ can be checked by studying the sign of the Hessian matrix of $\ell_n(x_n; \mu, \sigma^2)$ at this point, which is left as an exercise to the reader. As a conclusion, the MLE of (µ, σ²) is given by
$$\hat{\mu}_n = \mu_n(\mathbf{X}_n) = \bar{X}_n, \qquad \hat{\sigma}_n^2 = \sigma_n^2(\mathbf{X}_n) = V_n.$$
Remark 2.2.8. By Proposition A.4.3, p. 139, the estimators $\hat{\mu}_n$ and $\hat{\sigma}_n^2$ are independent. ◦
Example 2.2.9 (The Uniform model). We consider the set of uniform distributions {U([0, θ]), θ >
0}. For a given θ > 0, the random variables X1 , . . . , Xn take their values in [0, θ], Pθ -almost
surely; but since θ can a priori take any positive value, we have to work with the state space
$\mathcal{X} = [0, +\infty)$. For any $x_n = (x_1, \dots, x_n) \in [0, +\infty)^n$, the likelihood of the realisation $x_n$ writes
$$L_n(x_n; \theta) = \prod_{i=1}^n \frac{1}{\theta} \mathbb{1}_{\{x_i \leq \theta\}} = \theta^{-n} \mathbb{1}_{\{\max_{1 \leq i \leq n} x_i \leq \theta\}}.$$
As a function of θ > 0, the log-likelihood is not differentiable at the point max1≤i≤n xi , therefore
the method of the previous examples cannot be applied. On the other hand, the study of the
variations of the function θ 7→ Ln (xn ; θ) is straightforward (see Figure 2.1) and shows that the
latter attains its maximum for θ = θn (xn ) = max1≤i≤n xi . As a consequence, the MLE is given
by
$$\hat{\theta}_n = \max_{1 \leq i \leq n} X_i.$$
Figure 2.1: The likelihood of the Uniform model.
As a comparison, using the remark that $\mathbb{E}_\theta[X_1] = \theta/2$, we immediately obtain the moment estimator $\tilde{\theta}_n = 2\bar{X}_n$, which is very different from the MLE.
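The two estimators can be compared by simulation. In the following Python sketch (arbitrary parameter values), the MLE, although biased, achieves a much smaller MSE than the moment estimator on this example.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, n_rep = 5.0, 50, 50_000

x = rng.uniform(0.0, theta, size=(n_rep, n))
theta_mle = x.max(axis=1)         # MLE: maximum of the sample (biased)
theta_mom = 2.0 * x.mean(axis=1)  # moment estimator 2 * X_bar (unbiased)

mse_mle = ((theta_mle - theta) ** 2).mean()
mse_mom = ((theta_mom - theta) ** 2).mean()
print(mse_mle, mse_mom)  # the MLE has a much smaller MSE here
```

This is an instance of the bias–variance tradeoff of Exercise 2.1.11: the MLE trades a small downward bias for a variance that decays much faster with n.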
Exercise 2.2.10. Consider the MLE $\hat{\theta}_n$ in the Uniform model of Example 2.2.9.
2. Using the remark that the sequence $(\hat{\theta}_n)_{n \geq 1}$ is monotonic, show that consistency holds in the strong sense.
In the Uniform model of Example 2.2.9, the support of the law of $X_1$ depends on the parameter θ, which makes the likelihood not differentiable with respect to θ. Trying to differentiate the likelihood, or the log-likelihood, with respect to θ then leads to wrong results. Models for which the support depends on the parameter generally belong to the class of nonregular models, as introduced in the next paragraph.
Example 2.2.12 (Regular models). The Bernoulli, Exponential and Gaussian models are regular,
but the Uniform model of Example 2.2.9 is not (why?).
Definition 2.2.13 (Score). The score of a regular model is the random vector
$$\nabla_\theta \ell_1(X_1; \theta) \in \mathbb{R}^q,$$
with coordinates
$$\frac{\partial}{\partial \theta_i} \ell_1(X_1; \theta) = \frac{1}{p(X_1; \theta)} \frac{\partial}{\partial \theta_i} p(X_1; \theta), \qquad i \in \{1, \dots, q\}.$$
In the case where $P_\theta$ possesses a density p(x; θ) on $\mathbb{R}^l$, assuming the validity of the computation
$$\frac{\partial}{\partial \theta_i} \underbrace{\int_{x \in \mathbb{R}^l} p(x; \theta)\,\mathrm{d}x}_{=1} = \int_{x \in \mathbb{R}^l} \frac{\partial}{\partial \theta_i} p(x; \theta)\,\mathrm{d}x = 0, \qquad (*)$$
the score is centred: $\mathbb{E}_\theta[\nabla_\theta \ell_1(X_1; \theta)] = 0$.
Definition 2.2.14 (Fisher information). For a regular model, the Fisher information I(θ) is the covariance matrix of the score, with coefficients
$$I_{ij}(\theta) = \mathbb{E}_\theta\left[\frac{\partial}{\partial \theta_i} \ell_1(X_1; \theta)\,\frac{\partial}{\partial \theta_j} \ell_1(X_1; \theta)\right].$$
Exercise 2.2.15. Taking the derivative with respect to $\theta_j$ of Equation (∗), show that (whenever the computation is justified) the coefficients of the Fisher information also write
$$I_{ij}(\theta) = -\mathbb{E}_\theta\left[\frac{\partial^2}{\partial \theta_i \partial \theta_j} \ell_1(X_1; \theta)\right]. \qquad \circ$$
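As an illustration, in the Exponential model one has $\ell_1(x; \lambda) = \log \lambda - \lambda x$, so the score is $1/\lambda - X_1$ and $\frac{\partial^2}{\partial \lambda^2} \ell_1 = -1/\lambda^2$; both formulas then give $I(\lambda) = 1/\lambda^2$. A minimal Python check by simulation:

```python
import numpy as np

rng = np.random.default_rng(4)
lam = 2.0
x = rng.exponential(1.0 / lam, size=500_000)

# Score of the Exponential model: d/dlambda log p(x; lambda) = 1/lambda - x.
score = 1.0 / lam - x
# Second derivative of the log-density: -1/lambda^2 (deterministic here).
second = -1.0 / lam ** 2 * np.ones_like(x)

print(np.mean(score))       # approximately 0: the score is centred
print(np.mean(score ** 2))  # approximately I(lambda) = 1/lambda^2 = 0.25
print(-np.mean(second))     # same value, via the formula of Exercise 2.2.15
```

Both empirical quantities agree with the exact value 1/λ² = 0.25, and the empirical mean of the score is close to 0, in accordance with Equation (∗).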
Throughout this paragraph, we take g : Θ → R. In this context, Exercise 2.1.11 shows that, in general, an estimator cannot minimise the MSE uniformly in θ, because of the bias–variance tradeoff. However, if one restricts oneself to the class of unbiased estimators, then the search for an estimator with minimal MSE is reduced to a problem of variance minimisation. Exercise 2.1.10 provides an example of two unbiased estimators, one of which has a lower MSE uniformly in θ.
For regular models, the Fisher information provides an interesting lower bound on the variance of
an unbiased estimator.
Theorem 2.2.17 (Fréchet–Darmois–Cramér–Rao (FDCR) bound). For a regular model such that I(θ) is positive definite for all θ ∈ Θ, let g : Θ → R be smooth, and let $Z_n = z_n(\mathbf{X}_n)$ be an unbiased estimator of g(θ). For all θ ∈ Θ,
$$\mathrm{Var}_\theta[Z_n] \geq \frac{1}{n} \langle \nabla g(\theta), I^{-1}(\theta) \nabla g(\theta) \rangle.$$
Proof. We assume that we are in the case where X is a subset of Rl and for all θ ∈ Θ, the
probability measure Pθ possesses a density with respect to the Lebesgue measure on Rl . The
adaptation of the proof to the case of discrete random variables is straightforward.
Since $Z_n$ is unbiased,
$$g(\theta) = \mathbb{E}_\theta[Z_n] = \int_{x_n \in \mathcal{X}^n} z_n(x_n) L_n(x_n; \theta)\,\mathrm{d}x_n,$$
so that
$$\nabla g(\theta) = \mathbb{E}_\theta\left[(Z_n - g(\theta)) \sum_{i=1}^n \nabla_\theta \ell_1(X_i; \theta)\right].$$
Thus,
$$\langle \nabla g(\theta), I^{-1}(\theta) \nabla g(\theta) \rangle = \mathbb{E}_\theta\left[\left\langle (Z_n - g(\theta)) \sum_{i=1}^n \nabla_\theta \ell_1(X_i; \theta),\, I^{-1}(\theta) \nabla g(\theta) \right\rangle\right] = \mathbb{E}_\theta\left[(Z_n - g(\theta)) \left\langle \sum_{i=1}^n \nabla_\theta \ell_1(X_i; \theta),\, I^{-1}(\theta) \nabla g(\theta) \right\rangle\right],$$
whence, by the Cauchy–Schwarz inequality,
$$\langle \nabla g(\theta), I^{-1}(\theta) \nabla g(\theta) \rangle^2 \leq \mathbb{E}_\theta\left[(Z_n - g(\theta))^2\right] \mathbb{E}_\theta\left[\left\langle \sum_{i=1}^n \nabla_\theta \ell_1(X_i; \theta),\, I^{-1}(\theta) \nabla g(\theta) \right\rangle^2\right].$$
Since $Z_n$ is unbiased,
$$\mathbb{E}_\theta\left[(Z_n - g(\theta))^2\right] = \mathrm{Var}_\theta[Z_n],$$
while using Equation (∗∗) on p. 36 again together with the Definition 2.2.14 of I(θ) shows that
$$\mathbb{E}_\theta\left[\left\langle \sum_{i=1}^n \nabla_\theta \ell_1(X_i; \theta),\, I^{-1}(\theta) \nabla g(\theta) \right\rangle^2\right] = \mathrm{Var}_\theta\left[\left\langle \sum_{i=1}^n \nabla_\theta \ell_1(X_i; \theta),\, I^{-1}(\theta) \nabla g(\theta) \right\rangle\right] = n\,\mathrm{Var}_\theta\left[\left\langle \nabla_\theta \ell_1(X_1; \theta),\, I^{-1}(\theta) \nabla g(\theta) \right\rangle\right] = n \left\langle I^{-1}(\theta) \nabla g(\theta),\, I(\theta) I^{-1}(\theta) \nabla g(\theta) \right\rangle = n \left\langle I^{-1}(\theta) \nabla g(\theta),\, \nabla g(\theta) \right\rangle.$$
Combining these identities yields the announced bound.
Definition 2.2.18 (Efficient estimator). An unbiased estimator $Z_n$ of g(θ) such that $\mathrm{Var}_\theta[Z_n] = \langle \nabla g(\theta), I^{-1}(\theta) \nabla g(\theta) \rangle / n$ is called efficient⁷.
Theorem 2.2.17 shows that efficient estimators have the smallest possible MSE among the
class of unbiased estimators.
Exercise 2.2.19. Check that in the Bernoulli model, the estimator $\bar{X}_n$ of p is efficient. ◦
Definition 2.2.20 (Asymptotically efficient estimator). A consistent estimator $Z_n$ of g(θ) such that $\sqrt{n}(Z_n - g(\theta))$ converges in distribution to a random variable with variance $\langle \nabla g(\theta), I^{-1}(\theta) \nabla g(\theta) \rangle$ is called asymptotically efficient.
The MLE $\hat{\theta}_n$ is consistent and asymptotically normal, with asymptotic covariance matrix $I^{-1}(\theta)$.
(i) computing the asymptotic covariance K(θ) of $\hat{\theta}_n$ (often using the Central Limit Theorem and the Delta Method) on the one hand;
A heuristic justification of why this assertion holds in general is sketched in Remark 2.2.23 below.
Exercise 2.2.21. Compute I(µ, σ²) in the Gaussian model. Compare the results with Example 2.2.7. ◦
Proposition 2.2.22 (Asymptotic efficiency of the MLE). Assume that $\hat{\theta}_n$ is consistent and asymptotically normal, with asymptotic covariance matrix $I^{-1}(\theta)$. Then for any $C^1$ function g : Θ → R, the estimator $Z_n = g(\hat{\theta}_n)$ of g(θ) is asymptotically efficient.
The proof is a straightforward application of the Delta Method and is left as an exercise.
This proposition emphasises the interest of Maximum Likelihood Estimation, especially in regular
models.
Remark 2.2.23. The fact that, in general, $\hat{\theta}_n$ is consistent and asymptotically normal with asymptotic covariance matrix $I^{-1}(\theta)$, can be derived thanks to the following heuristic computation⁸. For the sake of simplicity, we assume that q = 1, so that Θ ⊂ R.
By the definition of $\theta_n(x_n)$ and the regularity assumption on $\ell_1(x_1; \cdot)$, we have
$$\frac{\mathrm{d}}{\mathrm{d}\theta} \ell_n(x_n; \theta_n(x_n)) = 0.$$
7. Efficace in French.
8. We refer to [6, Theorem 5.39, p. 65] for a complete proof.
Using the first-order Taylor expansion
$$\frac{\mathrm{d}}{\mathrm{d}\theta} \ell_n(x_n; \theta_n(x_n)) \simeq \frac{\mathrm{d}}{\mathrm{d}\theta} \ell_n(x_n; \theta) + \frac{\mathrm{d}^2}{\mathrm{d}\theta^2} \ell_n(x_n; \theta)(\theta_n(x_n) - \theta),$$
we thus deduce that
$$\sqrt{n}(\hat{\theta}_n - \theta) \simeq -\sqrt{n}\,\frac{\dfrac{\mathrm{d}}{\mathrm{d}\theta} \ell_n(\mathbf{X}_n; \theta)}{\dfrac{\mathrm{d}^2}{\mathrm{d}\theta^2} \ell_n(\mathbf{X}_n; \theta)} = -\frac{\dfrac{1}{\sqrt{n}} \dfrac{\mathrm{d}}{\mathrm{d}\theta} \ell_n(\mathbf{X}_n; \theta)}{\dfrac{1}{n} \dfrac{\mathrm{d}^2}{\mathrm{d}\theta^2} \ell_n(\mathbf{X}_n; \theta)}.$$
On the one hand, Equation (∗∗) p. 36 shows that
$$\frac{1}{\sqrt{n}} \frac{\mathrm{d}}{\mathrm{d}\theta} \ell_n(\mathbf{X}_n; \theta) = \sqrt{n} \left( \frac{1}{n} \sum_{i=1}^n \frac{\mathrm{d}}{\mathrm{d}\theta} \ell_1(X_i; \theta) - \mathbb{E}_\theta\left[\frac{\mathrm{d}}{\mathrm{d}\theta} \ell_1(X_1; \theta)\right] \right),$$
which by the Central Limit Theorem converges in distribution, under $P_\theta$, to N(0, I(θ)). On the other hand, by the Law of Large Numbers,
$$\frac{1}{n} \frac{\mathrm{d}^2}{\mathrm{d}\theta^2} \ell_n(\mathbf{X}_n; \theta) = \frac{1}{n} \sum_{i=1}^n \frac{\mathrm{d}^2}{\mathrm{d}\theta^2} \ell_1(X_i; \theta)$$
converges $P_\theta$-almost surely to $-I(\theta)$, by the formula of Exercise 2.2.15.
Definition 2.3.1 (Conditional expectation). For any random variable X ∈ R such that E[|X|] <
+∞, there exists an almost surely unique random variable Z ∈ R such that:
The random variable Z is called the conditional expectation of X given S and it is denoted by
E[X|S].
It is a nice exercise to check that the conditional expectation enjoys the following properties.
9. http://cermics.enpc.fr/~delmas/Enseig/proba2.html
Proposition 2.3.2 (Properties of conditional expectation). (i) For any measurable function ρ :
S → R such that E[|ρ(S)X|] < +∞, E[ρ(S)X|S] = ρ(S)E[X|S].
(v) Jensen’s inequality: for any convex function f : R → R, f (E[X|S]) ≤ E[f (X)|S].
Exercise 2.3.3 (Further properties). 1. Show that if X is independent of S, then E[X|S] = E[X].
In the particular case where S is a finite or countably infinite space, the function ϕ appearing
in Definition 2.3.1 has an elementary expression.
Proposition 2.3.4 (Conditional expectation in the discrete case). Assume that $\mathcal{S}$ is a finite or countably infinite space. Let X ∈ R be a random variable such that E[|X|] < +∞, and let $\varphi : \mathcal{S} \to \mathbb{R}$ be defined by
$$\varphi(s) = \begin{cases} \dfrac{\mathbb{E}[X \mathbb{1}_{\{S=s\}}]}{\mathbb{P}(S=s)} & \text{if } \mathbb{P}(S=s) > 0, \\ 0 & \text{otherwise.} \end{cases}$$
Then E[X|S] = φ(S).
Proof. Let Z = φ(S). By construction, the random variable Z satisfies the point (ii) of Definition 2.3.1. Besides,
$$\mathbb{E}[|Z|] = \mathbb{E}[|\varphi(S)|] = \sum_{s \in \mathcal{S}} |\varphi(s)|\,\mathbb{P}(S=s) = \sum_{s \in \mathcal{S}} \left|\mathbb{E}[X \mathbb{1}_{\{S=s\}}]\right| \leq \sum_{s \in \mathcal{S}} \mathbb{E}[|X| \mathbb{1}_{\{S=s\}}] = \mathbb{E}\left[|X| \sum_{s \in \mathcal{S}} \mathbb{1}_{\{S=s\}}\right] = \mathbb{E}[|X|] < +\infty,$$
which proves the point (iii) of Definition 2.3.1. Finally, we let ψ : S → R be a bounded and
measurable function, and compute
$$\mathbb{E}[\psi(S) Z] = \mathbb{E}[\psi(S) \varphi(S)] = \sum_{s \in \mathcal{S}} \psi(s) \varphi(s)\,\mathbb{P}(S=s) = \sum_{s \in \mathcal{S}} \psi(s)\,\mathbb{E}[X \mathbb{1}_{\{S=s\}}] = \mathbb{E}[X \psi(S)],$$
Definition 2.3.5 (Sufficient statistic). The statistic $S_n$ is called sufficient¹⁰ for θ if, for any (measurable and) bounded function $f : \mathcal{X}^n \to \mathbb{R}$, the function $\varphi_\theta$ introduced above does not depend on θ; in other words, if the conditional distribution of the sample $\mathbf{X}_n$ given $S_n$ does not depend on the parameter.
Intuitively, this definition means that the knowledge of the whole sample Xn does not bring
more information on the parameter θ than the mere knowledge of the value of Sn . Usually, Sn has
a low dimension, and therefore is easy to store, while the size of the sample Xn basically depends
on n and may be huge. In other words, $S_n$ provides an exhaustive summary of the data, as far as the estimation of θ is concerned.
Example 2.3.6. In the Bernoulli model, we show that $S_n = \sum_{i=1}^n X_i$ is a sufficient statistic for the parameter p. This statistic takes its values in the finite set $\mathcal{S} = \{0, \dots, n\}$. Fixing $f : \{0,1\}^n \to \mathbb{R}$ and $s \in \mathcal{S}$, we first use Proposition 2.3.4 to write
As a consequence,
$$\varphi_p(s) = \frac{\displaystyle\sum_{x_n \in \mathcal{X}^n} f(x_n) \mathbb{1}_{\{x_1 + \cdots + x_n = s\}}}{\dbinom{n}{s}},$$
which does not depend on p.
10. Statistique exhaustive in French.
Remark 2.3.7 (On the conditional distribution of $\mathbf{X}_n$ given $S_n$). In the example above, $\binom{n}{s}$ is exactly the number of elements $x_n \in \{0,1\}^n$ such that $x_1 + \cdots + x_n = s$, so that the conditional distribution of $\mathbf{X}_n$ given $S_n$ can be interpreted as the uniform distribution in the set of possible realisations of the sample which are compatible with the prescribed value of $S_n$. ◦
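This interpretation can be observed numerically: conditionally on $S_n = s$, all compatible configurations should appear with the same frequency, whatever the value of p. A minimal Python sketch (the function name is ours):

```python
import numpy as np
from collections import Counter

def conditional_counts(p, n=4, s=2, n_rep=200_000, seed=0):
    """Empirical conditional distribution of X_n given S_n = s (Bernoulli model)."""
    rng = np.random.default_rng(seed)
    x = (rng.random((n_rep, n)) < p).astype(int)   # n_rep Bernoulli samples
    keep = x[x.sum(axis=1) == s]                   # condition on S_n = s
    return Counter(map(tuple, keep))

# For n = 4 and S_n = 2 there are C(4, 2) = 6 compatible configurations;
# each should have conditional frequency close to 1/6, whatever p.
for p in (0.3, 0.7):
    counts = conditional_counts(p)
    total = sum(counts.values())
    print(p, sorted(v / total for v in counts.values()))
```

The six frequencies are close to 1/6 for both values of p, illustrating that the conditional distribution carries no information on the parameter.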
In general, the computation of the conditional distribution of a random variable is a nice exercise of probability theory, but may happen to be tedious. The next result makes it possible to find sufficient statistics without resorting to such computations.
Proof. We detail the proof in the case where X is a finite or countably infinite set. Then so is
S = sn (Xn ), and the conditional expectation with respect to Sn is described by Proposition 2.3.4.
The case where X ⊂ Rl follows from similar arguments, expressed in an appropriate formalism.
Necessary condition: let us assume that $S_n$ is sufficient. Then
$$L_n(x_n; \theta) = \prod_{i=1}^n p(x_i; \theta) = \mathbb{P}_\theta(\mathbf{X}_n = x_n) = \mathbb{P}_\theta(\mathbf{X}_n = x_n, S_n = s_n(x_n)) = \mathbb{P}_\theta(\mathbf{X}_n = x_n \mid S_n = s_n(x_n))\,\mathbb{P}_\theta(S_n = s_n(x_n)).$$
Since $S_n$ is sufficient,
$$\mathbb{P}_\theta(\mathbf{X}_n = x_n \mid S_n = s_n(x_n)) = \mathbb{E}_\theta\left[\mathbb{1}_{\{\mathbf{X}_n = x_n\}} \mid S_n = s_n(x_n)\right] = \psi(x_n)$$
so that
$$\mathbb{E}_\theta[f(\mathbf{X}_n) \mid S_n = s] = \frac{\phi(s; \theta) \displaystyle\sum_{\substack{x_n \in \mathcal{X}^n \\ s_n(x_n) = s}} f(x_n) \psi(x_n)}{\phi(s; \theta) \displaystyle\sum_{\substack{x_n \in \mathcal{X}^n \\ s_n(x_n) = s}} \psi(x_n)} = \frac{\displaystyle\sum_{\substack{x_n \in \mathcal{X}^n \\ s_n(x_n) = s}} f(x_n) \psi(x_n)}{\displaystyle\sum_{\substack{x_n \in \mathcal{X}^n \\ s_n(x_n) = s}} \psi(x_n)},$$
which does not depend on θ.
Exercise 2.3.9. Show that, in the Exponential model, $\bar{X}_n$ is a sufficient statistic for λ. ◦
Exercise 2.3.10. Show that, in the Gaussian model, $(\bar{X}_n, V_n)$ is a sufficient statistic for (µ, σ²). ◦
Remark 2.3.11. If the likelihood possesses the factorisation of Theorem 2.3.8, then the functions
θ 7→ Ln (xn ; θ) and θ 7→ φ(sn (xn ); θ) reach their maximum at the same points. As a consequence,
the MLE necessarily depends on xn only through sufficient statistics. ◦
Since $S_n$ is sufficient for θ, the random variable $Z_n^{S_n}$ is a statistic, and under mild assumptions on g(Θ) (for instance, if g(Θ) is an interval), $Z_n^{S_n}$ takes its values in g(Θ), so that $Z_n^{S_n}$ is also an estimator of g(θ).
Proposition 2.3.12 (Rao–Blackwell Theorem). The estimators $Z_n$ and $Z_n^{S_n}$ satisfy
$$\mathbb{E}_\theta\left[Z_n^{S_n}\right] = \mathbb{E}_\theta[Z_n], \qquad \mathrm{Var}_\theta\left[Z_n^{S_n}\right] \leq \mathrm{Var}_\theta[Z_n].$$
As a consequence,
$$R(Z_n^{S_n}; \theta) \leq R(Z_n; \theta).$$
Proof. By (ii) in Proposition 2.3.2, $E_\theta[Z_n^{S_n}] = E_\theta[Z_n]$, and by the conditional Jensen inequality, $E_\theta[(Z_n^{S_n})^2] \le E_\theta[Z_n^2]$, so that
$$\mathrm{Var}_\theta\big(Z_n^{S_n}\big) = E_\theta\big[(Z_n^{S_n})^2\big] - E_\theta\big[Z_n^{S_n}\big]^2 \le E_\theta\big[Z_n^2\big] - E_\theta[Z_n]^2 = \mathrm{Var}_\theta(Z_n).$$
As a practical consequence of Proposition 2.3.12, it is always a good idea to take the conditional expectation of an estimator given a sufficient statistic: this procedure, which improves the MSE, is sometimes called Rao–Blackwellisation. Of course, if Zn already depends on the sample through a sufficient statistic, which by Remark 2.3.11 is in particular the case for the MLE, then this procedure has no effect: indeed, in this case Zn writes as ρ(Sn) for some function ρ, and therefore, by (i) in Proposition 2.3.2, $Z_n^{S_n} = E_\theta[\rho(S_n) \mid S_n] = \rho(S_n) = Z_n$.
Example 2.3.13. In the Bernoulli model, X1 is an estimator of p, and it was proved in Example 2.3.6 that Sn = X1 + · · · + Xn is a sufficient statistic. The Rao–Blackwellisation of X1 is then the statistic Ep[X1|Sn], which by Example 2.3.6 with f(xn) = x1 writes
$$E_p[X_1 \mid S_n] = \varphi(S_n), \qquad \varphi(s) = \frac{1}{\binom{n}{s}} \sum_{x_n \in \{0,1\}^n} x_1\, 1_{\{x_1+\cdots+x_n=s\}}.$$
Among the $\binom{n}{s}$ elements (x1, . . . , xn) ∈ {0, 1}^n such that x1 + · · · + xn = s, there are $\binom{n-1}{s-1}$ elements for which x1 = 1. Therefore
$$\varphi(s) = \frac{\binom{n-1}{s-1}}{\binom{n}{s}} = \frac{s}{n},$$
so that the Rao–Blackwellised estimator is φ(Sn) = Sn/n = X̄n. ◦
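As a quick numerical illustration, one can check by simulation that Rao–Blackwellising the crude estimator X1 reduces the MSE from p(1 − p) to p(1 − p)/n. This is a minimal sketch in Python with NumPy (the course TPs use R; the values of p, n and the seed below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, n_rep = 0.3, 10, 200_000

x = rng.binomial(1, p, size=(n_rep, n))   # n_rep Bernoulli samples of size n
z_raw = x[:, 0].astype(float)             # crude estimator Z_n = X_1
z_rb = x.mean(axis=1)                     # Rao-Blackwellised estimator E_p[X_1 | S_n] = S_n / n

mse_raw = np.mean((z_raw - p) ** 2)       # should be close to p(1 - p) = 0.21
mse_rb = np.mean((z_rb - p) ** 2)         # should be close to p(1 - p) / n = 0.021
```

The conditioning leaves the bias untouched (both estimators are unbiased) and divides the variance by n, in accordance with Proposition 2.3.12.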
2.4 Confidence intervals

Definition 2.4.1 (Confidence interval). A confidence interval with level 1 − α for g(θ) is an interval In = [In−, In+] such that:
(i) the boundaries In− and In+ are statistics;
(ii) for all θ ∈ Θ, Pθ(g(θ) ∈ In) = 1 − α.
It is to be emphasised that in the event {g(θ) ∈ In }, it is the interval In which is random, and
not the quantity g(θ).
Sometimes it is difficult, tedious or not possible to construct confidence intervals in the sense of Definition 2.4.1, which are called exact, and one has to resort to weaker notions. Hence, an interval In whose boundaries In−, In+ are statistics is said to be:
• an asymptotic confidence interval with level 1 − α if, for all θ ∈ Θ, $\lim_{n\to+\infty} P_\theta(g(\theta) \in I_n) = 1 - \alpha$;
• an approximate confidence interval with level 1 − α if, for all θ ∈ Θ, $P_\theta(g(\theta) \in I_n) \ge 1 - \alpha$.
Of course, both definitions can be combined to lead to the notion of asymptotic approximate confidence interval, such that $\lim_{n\to+\infty} P_\theta(g(\theta) \in I_n) \ge 1 - \alpha$.
We recall the definition of a quantile, which is essential in the construction of confidence
intervals.
Definition 2.4.2 (Quantile). Let X be a real-valued random variable. For any r ∈ (0, 1), a
quantile of order r for X is a number qr such that
P(X ≤ qr ) = r.
In general a quantile need not exist, and it need not be unique either. However in most cases
of interest, the variable X possesses a density which is positive on R (or on [0, +∞)), in which
case qr exists and is unique.
Since qr only depends on the law of X, we shall often directly speak of the quantile of a
distribution (for instance the standard Gaussian distribution N(0, 1)).
Definition 2.4.3 (Free random variable). A random variable Q is free if its law under Pθ does not
depend on θ.
Definition 2.4.3 does not require the random variable Q to be a statistic, and in general we use
free random variables Q which depend on the parameter θ. For example, in the Gaussian model,
the random variables
$$X_i' = \frac{X_i - \mu}{\sigma}, \qquad i = 1, \ldots, n,$$
are iid according to the standard Gaussian distribution N(0, 1), which does not depend on (µ, σ 2 ):
they are free random variables. This example is an instance of a pivotal function.
Definition 2.4.4 (Pivotal function). A pivotal function for g(θ) is a function πn : Xn × g(Θ) → R
such that πn (Xn ; g(θ)) is free.
We now detail the construction of a confidence interval for the mean µ in the Gaussian model
{N(µ, σ 2 ), (µ, σ 2 ) ∈ R × (0, +∞)}.
The MLE of µ is X̄n, whose law under Pµ,σ² is N(µ, σ²/n). As a consequence, the random variable
$$\zeta_n = \frac{\bar X_n - \mu}{\sqrt{\sigma^2/n}}$$
is free, with law N(0, 1). Still, this does not mean that the function
$$\pi_n(x_n; \mu) = \frac{\bar x_n - \mu}{\sqrt{\sigma^2/n}}$$
is pivotal, because following Definition 2.4.4 it should only depend on the parameter (µ, σ 2 )
through µ.
1 − α    φ_{1−α/2}
90%      1.65
95%      1.96
99%      2.58

Figure 2.2: Quantiles of the standard Gaussian distribution. The hatched area on the figure is equal to 1 − α.
3. Show that there exists a unique such u and give the unique solution to (1). ◦
If we assume that σ² is not known, then we have to find another pivotal function, which will no longer depend on σ². A simple idea consists in replacing σ² with an estimator of the variance, and checking whether the resulting random variable is free.
We first write
$$\xi_n = \frac{\bar X_n - \mu}{\sqrt{S_n^2/n}} = \frac{(\bar X_n - \mu)/\sigma}{\sqrt{\dfrac{1}{n}\,\dfrac{1}{n-1}\displaystyle\sum_{i=1}^n \frac{(X_i - \bar X_n)^2}{\sigma^2}}} = \frac{\bar X_n'}{\sqrt{S_n'^2/n}},$$
where X̄′n and S′n² denote the empirical mean and unbiased empirical variance of the free sample (X′1, . . . , X′n), which already shows that ξn is free, since the right-hand side no longer depends on (µ, σ²). The exact law of ξn is given by Proposition A.4.3 in Appendix A.
Hence the function
$$\pi_n(x_n; \mu) = \frac{\bar x_n - \mu}{\sqrt{s_n^2/n}},$$
which no longer depends on σ² but in turn requires one to estimate the variance of the sample, is pivotal. As a consequence, for all a, b ∈ R such that a < b,
$$P_{\mu,\sigma^2}(\xi_n \in [a,b]) = P_{\mu,\sigma^2}\left(\bar X_n - b\sqrt{\frac{S_n^2}{n}} \le \mu \le \bar X_n - a\sqrt{\frac{S_n^2}{n}}\right) = \int_{x=a}^b p_{n-1}(x)\,\mathrm{d}x,$$
where pn−1 is the density of the law t(n − 1). Once again, as soon as a and b are chosen so that
$$\int_{x=a}^b p_{n-1}(x)\,\mathrm{d}x = 1 - \alpha,$$
we get a confidence interval with level 1 − α for µ. The smallest such interval is obtained by taking b = −a = t_{n−1,1−α/2}, where t_{n,r} denotes the quantile of order r of the Student distribution t(n), which is obtained with the command qt(r,n) in R.
As a conclusion, the confidence interval with level 1 − α for µ is
$$I_n = \left[\bar X_n - \phi_{1-\alpha/2}\sqrt{\frac{\sigma^2}{n}},\; \bar X_n + \phi_{1-\alpha/2}\sqrt{\frac{\sigma^2}{n}}\right]$$
if σ² is known, and
$$I_n = \left[\bar X_n - t_{n-1,1-\alpha/2}\sqrt{\frac{S_n^2}{n}},\; \bar X_n + t_{n-1,1-\alpha/2}\sqrt{\frac{S_n^2}{n}}\right]$$
if σ 2 is not known. Notice that tn−1,1−α/2 is larger than φ1−α/2 , so that (up to the fluctuations in
the estimation of σ 2 by Sn2 ) the latter confidence interval is larger than the former. This is natural:
less information is available in the second case, so that there is more uncertainty on the parameter
µ.
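Both intervals are straightforward to compute from quantile functions. This is a minimal sketch in Python with NumPy/SciPy (the course TPs use R, where t.ppf plays the role of the qt command mentioned above; the sample, parameters and seed are arbitrary):

```python
import numpy as np
from scipy.stats import norm, t

rng = np.random.default_rng(1)
mu, sigma2, n, alpha = 2.0, 4.0, 50, 0.05
x = rng.normal(mu, np.sqrt(sigma2), size=n)

xbar = x.mean()
s2 = x.var(ddof=1)                    # unbiased empirical variance S_n^2 (divides by n - 1)

phi = norm.ppf(1 - alpha / 2)         # phi_{1-alpha/2}, about 1.96
tq = t.ppf(1 - alpha / 2, df=n - 1)   # t_{n-1,1-alpha/2}, the analogue of qt(0.975, 49) in R

ci_known = (xbar - phi * np.sqrt(sigma2 / n), xbar + phi * np.sqrt(sigma2 / n))
ci_unknown = (xbar - tq * np.sqrt(s2 / n), xbar + tq * np.sqrt(s2 / n))
```

As expected from Exercise 2.4.7 below, tq is always larger than phi, so the Student interval is wider up to the fluctuations of S_n².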
Exercise 2.4.7. The purpose of this exercise is to prove that for all n ≥ 2, for all r ∈ (1/2, 1),
tn−1,r ≥ φr .
1. Show that this is equivalent to an inequality on the cumulative distribution functions of
t(n − 1) and N(0, 1) on [0, +∞).
2. Show that if x > 0 and Tn ∼ t(n), then $P(T_n \le x) = E[\Phi(x\sqrt{Y_n/n})]$, where Φ is the cumulative distribution function of N(0, 1) and Yn ∼ χ²(n).
3. Conclude using Jensen’s inequality (twice). ◦
We now summarise the method to construct an exact confidence interval for g(θ).
(i) Find a pivotal function πn (xn ; g(θ)); denote by Qn the free random variable πn (Xn ; g(θ)).
(ii) Fix a, b ∈ R which minimise b − a under the constraint that P(Qn ∈ [a, b]) = 1 − α.
(iii) Rewrite the condition Qn ∈ [a, b] as g(θ) ∈ In , where the boundaries of In are statistics.
Denoting by qn,r the quantile of order r of Qn, it is often the case that the optimal choice for a, b is the pair (q_{n,α/2}, q_{n,1−α/2}), or (0, q_{n,1−α}) if Qn is a nonnegative random variable.
Exercise 2.4.8. In the Gaussian model, find a confidence interval with level 1 − α for σ 2 . ◦
Exercise 2.4.9. In the Exponential model, find a confidence interval with level 1 − α for λ. ◦
In such a case, one can construct approximate confidence intervals, thanks to concentration
inequalities.
A famous concentration inequality for random variables Y such that E[|Y |2 ] < +∞ is the
Bienaymé–Chebychev inequality
$$P(|Y - E[Y]| \ge r) \le \frac{\mathrm{Var}(Y)}{r^2},$$
which follows from Markov's inequality. We first explain how to obtain an approximate confidence interval from such an inequality, for the Beta model of Example 2.4.10. For all r > 0, the Bienaymé–Chebychev inequality yields
$$P_{a,b}\left(\left|\bar X_n - \frac{a}{a+b}\right| \ge \frac{r}{\sqrt n}\right) \le \frac{\mathrm{Var}_{a,b}[\bar X_n]}{r^2/n} = \frac{\mathrm{Var}_{a,b}[X_1]}{r^2}.$$
As a consequence, taking r such that
$$\frac{\mathrm{Var}_{a,b}[X_1]}{r^2} \le \alpha \qquad (*)$$
ensures that
$$P_{a,b}\left(\frac{a}{a+b} \in \left[\bar X_n - \frac{r}{\sqrt n},\, \bar X_n + \frac{r}{\sqrt n}\right]\right) \ge 1 - \alpha.$$
But since Vara,b [X1 ] depends on the parameters a and b, the condition on r prevents the bounds of
the interval from being statistics. In the case of the Beta model, and more generally for bounded
random variables, the next lemma provides a universal bound on the variance.
Lemma 2.4.12 (Universal bound on the variance). Let Y be a random variable taking its values in [0, 1]. Then
$$\mathrm{Var}(Y) \le \frac{1}{4}.$$
Exercise 2.4.13. Find a variable Y ∈ [0, 1] such that Var(Y ) = 1/4. ◦
Proof. Write a = E[Y] and p = P(Y ≤ a), and notice that on the event Y ≤ a, we have 0 ≤ a − Y ≤ a; while on the event Y > a, we have 0 < Y − a ≤ 1 − a. As a consequence, setting $m = E[(a - Y)1_{\{Y \le a\}}] = E[(Y - a)1_{\{Y > a\}}]$,
$$\mathrm{Var}(Y) = E\big[(a-Y)^2 1_{\{Y \le a\}}\big] + E\big[(Y-a)^2 1_{\{Y > a\}}\big] \le a\,m + (1-a)\,m = m,$$
and we finally note that m ≤ ap and m ≤ (1 − a)(1 − p), so that, for any p ∈ [0, 1], for any a ∈ [0, 1],
$$m \le \sqrt{a(1-a)}\,\sqrt{p(1-p)} \le \frac{1}{2}\cdot\frac{1}{2} = \frac{1}{4},$$
which completes the proof.
Exercise 2.4.14. Let Y be a random variable such that E[|Y|²] < +∞.
Coming back to the Beta model, Lemma 2.4.12 shows that it suffices to take
$$r = r^{\mathrm{Cheb}}(\alpha) := \frac{1}{2\sqrt\alpha}$$
to ensure that (∗) holds. Thus, we get the approximate confidence interval
$$I_n^{\mathrm{Cheb}} = \left[\bar X_n - \frac{r^{\mathrm{Cheb}}(\alpha)}{\sqrt n},\, \bar X_n + \frac{r^{\mathrm{Cheb}}(\alpha)}{\sqrt n}\right].$$
Lemma 2.4.15 (Hoeffding’s inequality). Let X1 , . . . , Xn be iid random variables taking their
values in [0, 1]. For all n ≥ 1, for all r ≥ 0,
$$P\left(\sum_{i=1}^n (X_i - E[X_i]) \ge r\sqrt n\right) \le \exp(-2r^2).$$
Proof. Write Yi = Xi − E[Xi], and let F(λ) = log E[exp(λY1)] for λ ≥ 0. Then
$$F'(\lambda) = \frac{E[Y_1 \exp(\lambda Y_1)]}{E[\exp(\lambda Y_1)]},$$
and
$$F''(\lambda) = \frac{E[Y_1^2 \exp(\lambda Y_1)]\,E[\exp(\lambda Y_1)] - E[Y_1 \exp(\lambda Y_1)]^2}{E[\exp(\lambda Y_1)]^2} = \frac{1}{E[\exp(\lambda Y_1)]}\, E\left[\left(Y_1 - \frac{E[Y_1 \exp(\lambda Y_1)]}{E[\exp(\lambda Y_1)]}\right)^2 \exp(\lambda Y_1)\right] = \frac{1}{E[\exp(\lambda Y_1)]}\, E\left[\left(X_1 - \frac{E[X_1 \exp(\lambda Y_1)]}{E[\exp(\lambda Y_1)]}\right)^2 \exp(\lambda Y_1)\right].$$
Using the same arguments as in the proof of Lemma 2.4.12, we get F″(λ) ≤ 1/4. Noting that F′(0) = E[Y1] = 0 and F(0) = log 1 = 0, and integrating twice yields F(λ) ≤ λ²/8. Now, for all λ > 0,
$$P\left(\sum_{i=1}^n Y_i \ge r\sqrt n\right) = P\left(\exp\left(\lambda \sum_{i=1}^n Y_i\right) \ge \exp(\lambda r \sqrt n)\right) \le \exp(-\lambda r \sqrt n)\, E\left[\exp\left(\lambda \sum_{i=1}^n Y_i\right)\right] = \exp\left(n F(\lambda) - \lambda r \sqrt n\right) \le \exp\left(\frac{\lambda^2 n}{8} - \lambda r \sqrt n\right),$$
where we have used the fact that y ↦ exp(λy) is increasing at the first line and the Markov inequality at the second line. The minimum of the quantity λ²n/8 − λr√n over λ > 0 is reached at the value λ = 4r/√n, at which point it equals −2r², which completes the proof.
Remark 2.4.16. • The quantities appearing in the proof may be interpreted as follows: define the probability measure Pλ by
$$P_\lambda(A) = \frac{E[1_A \exp(\lambda Y_1)]}{E[\exp(\lambda Y_1)]}$$
for all events A, and then check that F′(λ) = Eλ[Y1], F″(λ) = Varλ[Y1] = Varλ[X1]. Since X1 takes its values in [0, 1], the bound on F″(λ) is in fact a direct consequence of Lemma 2.4.12.
• The combination of: the application of the increasing function y ↦ exp(λy); the use of Markov's inequality; and the optimisation of the result over the values of λ > 0, is a classical method, called Chernoff's method. It is as simple as it is powerful. ◦
Lemma 2.4.15 is often applied in the form of the following corollary.
Corollary 2.4.17 (Application of Hoeffding’s inequality). Let X1 , . . . , Xn be iid random vari-
ables taking their values in [0, 1]. For all n ≥ 1, for all r ≥ 0,
$$P\left(\left|\bar X_n - E[X_1]\right| \ge \frac{r}{\sqrt n}\right) \le 2\exp(-2r^2).$$
Applying Corollary 2.4.17 to the Beta model presented in Example 2.4.10 yields, for all r ≥ 0,
$$P_{a,b}\left(\left|\bar X_n - \frac{a}{a+b}\right| \ge \frac{r}{\sqrt n}\right) \le 2\exp(-2r^2).$$
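This bound is easy to check by simulation. The following sketch in Python with NumPy (the TPs use R; the Beta parameters a = 2, b = 3, the radius r and the seed are arbitrary) estimates the deviation probability by Monte Carlo and compares it with the Hoeffding bound:

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, n, r, n_rep = 2.0, 3.0, 100, 1.0, 100_000
mean = a / (a + b)                     # E_{a,b}[X_1] = a / (a + b)

x = rng.beta(a, b, size=(n_rep, n))
dev = np.abs(x.mean(axis=1) - mean)
freq = np.mean(dev >= r / np.sqrt(n))  # Monte Carlo deviation probability
bound = 2 * np.exp(-2 * r ** 2)        # Hoeffding bound, about 0.27
```

For this model the true deviation probability is far below the bound: Hoeffding only uses the fact that the Xi are bounded, not their actual variance.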
Figure 2.3: Log-log plot of the functions r^Cheb(α) (Chebychev) and r^Hoeff(α) (Hoeffding), for α ranging from 10⁻⁴ to 10⁻¹.
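Both radii admit closed forms: solving (∗) with the universal bound Var ≤ 1/4 gives r^Cheb(α) = 1/(2√α), while, assuming r^Hoeff(α) denotes the smallest radius allowed by Corollary 2.4.17, solving 2 exp(−2r²) = α gives r^Hoeff(α) = √(log(2/α)/2). A small sketch in Python comparing them (the grid of values of α is arbitrary):

```python
import numpy as np

def r_cheb(alpha):
    # from (*) and Var <= 1/4: 1 / (4 r^2) <= alpha  <=>  r >= 1 / (2 sqrt(alpha))
    return 1.0 / (2.0 * np.sqrt(alpha))

def r_hoeff(alpha):
    # from Corollary 2.4.17: 2 exp(-2 r^2) <= alpha  <=>  r >= sqrt(log(2 / alpha) / 2)
    return np.sqrt(np.log(2.0 / alpha) / 2.0)

for alpha in (1e-1, 1e-2, 1e-3, 1e-4):
    # Hoeffding always yields a smaller radius, hence a shorter interval
    assert r_hoeff(alpha) < r_cheb(alpha)
```

The gap widens quickly as α decreases: r^Cheb grows like α^{−1/2} while r^Hoeff only grows like √(log(1/α)), which is what the log-log plot of Figure 2.3 displays.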
In general, it is not difficult to find a consistent estimator for V(θ): as soon as V is continuous and θ̂n is a consistent estimator of θ, one can take V̂n = V(θ̂n). If no such estimator is available, the procedure of variance stabilisation described in Exercise 2.A.9 can be applied.
Proof of Proposition 2.4.18. We start from the asymptotic normality of Zn, which writes
$$\sqrt{\frac{n}{V(\theta)}}\,(Z_n - g(\theta)) \to N(0, 1), \quad \text{in distribution.}$$
Following the same arguments as in the construction of exact confidence intervals for the Gaussian
model, we take b = −a = φ1−α/2 , which results in the expected confidence interval.
Example 2.4.19 (The Bernoulli model). We want to employ the estimator $\hat p_n = \frac{1}{n}\sum_{i=1}^n X_i$ to construct an asymptotic confidence interval for p. This estimator is (strongly) consistent, and asymptotically normal with asymptotic variance V(p) = p(1 − p). As a consequence, a consistent estimator of the asymptotic variance is given by p̂n(1 − p̂n), and we get that
$$I_n = \left[\hat p_n - \phi_{1-\alpha/2}\sqrt{\frac{\hat p_n(1-\hat p_n)}{n}},\; \hat p_n + \phi_{1-\alpha/2}\sqrt{\frac{\hat p_n(1-\hat p_n)}{n}}\right]$$
is an asymptotic confidence interval with level 1 − α for p. ◦
Remark 2.4.20. For the Bernoulli model, when the size n of the sample is too small for the asymp-
totic approach to be used, approximate confidence intervals can be constructed based on Cheby-
chev’s or Hoeffding’s inequality. However, finding a pivotal function to obtain exact confidence
intervals is more difficult. ◦
The sketch of the proof of Proposition 2.4.18 allows one to construct asymptotic confidence intervals even when the estimator Zn is not asymptotically normal. As an example, recall that for the Uniform model of Exercise 2.2.10, the MLE θ̂n = max_{1≤i≤n} Xi is strongly consistent, and satisfies
$$n(\theta - \hat\theta_n) \to \mathcal{E}(1/\theta), \quad \text{in distribution.}$$
As a consequence,
$$n\left(1 - \frac{\hat\theta_n}{\theta}\right) \to \mathcal{E}(1), \quad \text{in distribution,}$$
13 We recall that if the random variable ζn converges in distribution to a random variable ζ which possesses a density, then for any interval [a, b], P(ζn ∈ [a, b]) converges to P(ζ ∈ [a, b]); see [2, Remarque 5.3.11, p. 86].
We then have
$$n\left(1 - \frac{\hat\theta_n}{\theta}\right) \in [a, b] \quad \text{if and only if} \quad \theta \in \left[\frac{\hat\theta_n}{1 - a/n}, \frac{\hat\theta_n}{1 - b/n}\right],$$
and the choice of a and b such that exp(−a) − exp(−b) = 1 − α for which b − a is minimal is a = 0, b = − log α. As a conclusion, an asymptotic confidence interval for θ is
$$I_n = \left[\hat\theta_n,\; \frac{\hat\theta_n}{1 + (\log\alpha)/n}\right].$$

2.5 * Kernel density estimation
By the strong Law of Large Numbers, for all x ∈ R, F̂n(x) converges to F(x) almost surely¹⁵, so that F̂n(x) is a strongly consistent (and unbiased) estimator of F(x). Since p = F′, it is natural to try to estimate p by the derivative of F̂n. However, since F̂n is piecewise constant, it is not differentiable. To overcome this difficulty, one may replace the derivative with a finite difference approximation. The centered version of such an approximation leads to the following estimator of p.
Definition 2.5.1 (Rosenblatt estimator). For any h > 0, the Rosenblatt estimator16 of p is the
random density pbn,h defined by
$$\forall x \in \mathbb{R}, \quad \hat p_{n,h}(x) = \frac{\hat F_n(x+h) - \hat F_n(x-h)}{2h}.$$
14 This section is based on the first chapter of [5].
15 More properties of the random function F̂n will be studied in Section 4.2.
16 In French, one also speaks of an estimateur à fenêtre glissante (sliding-window estimator).
58 Parametric estimation
$$\hat p_{n,h}(x) = \frac{1}{2nh} \sum_{i=1}^n 1_{\{x-h < X_i \le x+h\}},$$
Notice that the expression obtained in Exercise 2.5.2 writes under the abstract form
$$\hat p_{n,h}(x) = \frac{1}{nh} \sum_{i=1}^n K\left(\frac{x - X_i}{h}\right),$$
where $K(z) = \frac{1}{2} 1_{\{-1 \le z < 1\}}$ is the density of the uniform distribution on [−1, 1]. This remark allows one to generalise the Rosenblatt estimator as follows.
Given a kernel K, that is to say a probability density on R, and h > 0, the Kernel Density Estimator (KDE) of p is defined by
$$\forall x \in \mathbb{R}, \quad \hat p_{n,h}(x) = \frac{1}{nh} \sum_{i=1}^n K\left(\frac{x - X_i}{h}\right).$$
The KDE p̂n,h associated with a given kernel K is also called the Parzen–Rosenblatt estimator of p. This estimator depends on the parameter h > 0, which is called the bandwidth¹⁷. We have thus naturally constructed a one-parameter family of estimators for p, similarly to (but perhaps less artificially than) the example of Exercise 2.1.11. And just like in this example, the 'good' choice of h results from a bias-variance tradeoff.
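A direct implementation of the KDE may help fix ideas. This is a sketch in Python with NumPy (the TPs use R, where the built-in density function plays this role; the Gaussian kernel, the sample size, the bandwidth and the seed below are arbitrary choices):

```python
import numpy as np

def gaussian_kernel(z):
    return np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi)

def kde(x, data, h, kernel=gaussian_kernel):
    # \hat p_{n,h}(x) = (1 / (n h)) * sum_i K((x - X_i) / h)
    x = np.asarray(x, dtype=float)
    return kernel((x[:, None] - data[None, :]) / h).mean(axis=1) / h

rng = np.random.default_rng(4)
data = rng.normal(0.0, 1.0, size=5000)   # sample from the (here known) density p = N(0, 1)
grid = np.linspace(-4.0, 4.0, 81)
est = kde(grid, data, h=0.35)
true = gaussian_kernel(grid)             # the density being estimated
```

Since the KDE is an average of densities, it is itself a probability density; on a simulated sample one can compare it pointwise with the true p.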
Let us fix x0 ∈ R. The MSE of p̂n,h(x0) possesses the bias-variance decomposition
$$E\left[(\hat p_{n,h}(x_0) - p(x_0))^2\right] = \left(E[\hat p_{n,h}(x_0)] - p(x_0)\right)^2 + \mathrm{Var}(\hat p_{n,h}(x_0)),$$
Proposition 2.5.4 (Bias and variance of the KDE). Assume that:
(i) the density p is C² on R and the functions p and p″ are bounded on R;
(ii) the kernel K satisfies $\int_{z\in\mathbb{R}} zK(z)\,\mathrm{d}z = 0$, $\int_{z\in\mathbb{R}} z^2K(z)\,\mathrm{d}z < +\infty$ and $\int_{z\in\mathbb{R}} K(z)^2\,\mathrm{d}z < +\infty$.
Then there exist Cb, Cv ∈ [0, +∞) depending only on p and K such that, for all x0 ∈ R, the bias of the KDE satisfies
$$\left(E[\hat p_{n,h}(x_0)] - p(x_0)\right)^2 \le C_b h^4,$$
and its variance satisfies
$$\mathrm{Var}(\hat p_{n,h}(x_0)) \le \frac{C_v}{nh}.$$
17 Largeur de bande in French.
An elementary computation shows that, given the size n of the sample, the optimal value h*n minimising the bound Cb h⁴ + Cv/(nh) on the MSE is
$$h_n^* = \left(\frac{C_v}{4nC_b}\right)^{1/5},$$
in which case the MSE is of order n^{−4/5}, and thus converges to 0 more slowly than the usual rate n^{−1} obtained for asymptotically normal estimators in the parametric framework! Observe however that the value of h*n depends on the constants Cb and Cv, which themselves depend on the density p, which is the quantity which we are trying to estimate. As a consequence, in practice one cannot compute h*n, and selecting a 'good' value of the bandwidth is part of the problem of nonparametric estimation.
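The closed form of h*n is easy to confirm numerically. A sketch in Python (with placeholder constants Cb = Cv = 1, since the true constants depend on the unknown density p):

```python
import numpy as np

def mse_bound(h, cb, cv, n):
    # upper bound C_b h^4 + C_v / (n h) on the MSE of the KDE at a point
    return cb * h ** 4 + cv / (n * h)

cb, cv, n = 1.0, 1.0, 1000                  # placeholder constants, arbitrary n
h_star = (cv / (4 * n * cb)) ** 0.2         # closed-form minimiser h_n^*
hs = np.linspace(h_star / 10, 10 * h_star, 100_001)
h_num = hs[np.argmin(mse_bound(hs, cb, cv, n))]
# the grid minimiser should agree with the closed form
```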
Remark 2.5.5. Computing the MSE at a given point x0 is not the only possible criterion for assessing the quality of the estimator p̂n,h. For instance, another popular choice is the Mean Integrated Squared Error, defined by
$$\mathrm{MISE}(\hat p_{n,h}; p) = E\left[\int_{x\in\mathbb{R}} (\hat p_{n,h}(x) - p(x))^2\,\mathrm{d}x\right]. \quad \circ$$
Let us conclude this brief introduction to nonparametric estimation with an empirical observation of the under- and oversmoothing phenomena. Assume that the size n of the sample is fixed. If h is too small, then each function $x \mapsto \frac{1}{h}K(\frac{x - X_i}{h})$ is very peaked around Xi, and the resulting KDE looks very irregular. Besides, if one draws another sample Xn, the corresponding KDE will look very different; in other words, the variance of the KDE is large when h is small. Thus, the KDE is not informative about the actual density p because it is not smooth enough. On the contrary, if h is taken too large, then all functions $x \mapsto \frac{1}{h}K(\frac{x - X_i}{h})$ look the same, and are just small translations of the function $x \mapsto \frac{1}{h}K(\frac{x}{h})$. In this case, the KDE just looks like a dilatation of the kernel K, and does not provide any information on the density p either. These phenomena are illustrated in Figure 2.4. They are to be compared with the over- and underfitting phenomena which will be discussed in Chapter 5 in the context of linear regression.
Figure 2.4: Kernel Density Estimation: the density p and the points of the sample are plotted in red. Three KDEs ('h too small', 'h good', 'h too large'), corresponding to the same (Gaussian) kernel but with different bandwidths, are superposed.
2.A Exercises
Exercise 2.A.1 (Asymptotic variance of the empirical variance). Let X1, . . . , Xn be iid random variables, such that E[X1⁴] < +∞ and E[X1] = 0. We write
$$\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i, \qquad V_n = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X_n)^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar X_n^2.$$
The purpose of the exercise is to show that $\sqrt n\,(V_n - \mathrm{Var}(X_1))$ converges to a Gaussian distribution, and to compute the associated variance. For k = 2, 3, 4, we write ρk = E[X1^k].
1. Let us define the vectors $Y_n = (\bar X_n, \frac{1}{n}\sum_{i=1}^n X_i^2)$ and y = (0, ρ2) in R². Compute the covariance matrix of Yn.
↸ Exercise 2.A.2 (Poisson model). The Poisson model is the set {P(λ), λ > 0}, under which we recall that, for all λ > 0,
$$\forall k \in \mathbb{N}, \quad P_\lambda(X_1 = k) = \exp(-\lambda)\frac{\lambda^k}{k!}.$$
1. A first moment estimator.
(a) Compute Eλ[X1] and deduce a moment estimator $\tilde\lambda_n^{(1)}$.
2. We assume that there exists an unbiased estimator Zn of g(p). Show that there exist real
numbers a0 , . . . , an such that
$$\sum_{k=0}^n a_k\, p^k (1-p)^{n-k} = \frac{1}{p},$$
5. Write the MSE R(θ̂n^t; θ) as a function of t, α = E[U²] and γ = E[(1 − V)U]. Which value of t minimises this quantity?
6. Compute the joint density of (U, V) and deduce an expression for the MSE of the optimal choice of θ̂n^t. ◦
↸ Exercise 2.A.5 (Estimation in the geometric model). We consider the geometric model {Geo(p), p ∈
(0, 1)}.
2. Compute the Maximum Likelihood Estimator p̂n of p, and show that it is strongly consistent.
3. Show that p̂n is asymptotically normal and compute its asymptotic variance.
↸ Exercise 2.A.6 (MLE in the Weibull model). A random variable X > 0 is said to follow the Weibull distribution with parameter m > 0 if
$$\forall x \ge 0, \quad P(X > x) = \exp(-x^m).$$
1. Warm-up.
For a ∈ R, we define [a]+ = max(a, 0), [a]− = max(−a, 0), and recall that |a| = [a]+ + [a]−. To prove that m̂n is a consistent estimator of m, we study [m̂n − m]− and [m̂n − m]+ separately.
(b) Study of [m̂n − m]−.
i. Check that ϕ′(m; xn) ≤ −1/m² for all m > 0.
ii. If m̂n(xn) ≤ m, show that m − m̂n(xn) ≤ m²|ϕ(m; xn)|.
iii. Deduce that [m̂n − m]− converges almost surely to 0.
(c) Study of [m̂n − m]+.
i. Let ε > 0. Show that if m̂n(xn) − m ≥ ε, then
$$0 \le \frac{1}{m+\varepsilon} + \frac{1}{n}\sum_{i=1}^n (1 - x_i^m)\log x_i.$$
2. Let X1, . . . , Xn be iid random variables with law Pθ. For all i ∈ {1, . . . , n}, we define $U_i = 1_{\{X_i \le 0\}}$. Compute Eθ[U1].
3. Deduce a moment estimator θ̃n of θ. Show that this estimator is strongly consistent.
4. Show that θ̃n is asymptotically normal, and compute its asymptotic variance. Hint: use the relation tan′ = 1 + tan².
5. For α ∈ (0, 1), deduce an asymptotic confidence interval for θ, with level 1 − α. ◦
↸ Exercise 2.A.9 (Variance stabilisation).
1. Recall the width of the asymptotic confidence interval In for g(θ) given by Proposition 2.4.18.
2. Let Φ : g(Θ) → R be a strictly monotonic and C 1 function. Show that Φ(Zn ) is a consistent
and asymptotically normal estimator of Φ(g(θ)), and compute its asymptotic variance.
3. Using Proposition 2.4.18, construct an asymptotic confidence interval for Φ(g(θ)). Using
the fact that Φ is strictly monotonic, deduce an asymptotic confidence interval InΦ for g(θ).
and compute ϕn (0) and ϕ′n (0). If the function ϕn is either convex or concave, which of the
intervals In and InΦ is the smallest?
5. Assume that Φ satisfies the relation $\Phi'(g(\theta)) = 1/\sqrt{V(\theta)}$. What is the expression for the corresponding confidence interval InΦ? Such a choice of Φ is called variance stabilisation.
6. Consider the MLE λ̂n = 1/X̄n of λ in the Exponential model. We recall from Section 2.1.4 that this estimator is asymptotically normal, with asymptotic variance V(λ) = λ². Find a function Φ such that $\Phi'(\lambda) = 1/\sqrt{V(\lambda)}$, and compute the associated asymptotic confidence interval for λ. Does variance stabilisation reduce the width of the confidence interval? ◦
2.B Summary
2.B.1 Basic definitions
• Parametric model: P = {Pθ , θ ∈ Θ}, with parameter set Θ ⊂ Rq .
• Estimator of g(θ): function of the sample Zn = zn (Xn ), where zn : Xn → g(Θ) does not
depend on θ.
• FDCR bound: for any unbiased estimator Zn, $\mathrm{Var}_\theta[Z_n] \ge \langle\nabla g(\theta), I^{-1}(\theta)\nabla g(\theta)\rangle/n$, where I(θ) is the Fisher information. An unbiased estimator reaching this bound is called efficient.
Asymptotic criteria:
• The Rao–Blackwell Theorem, based on sufficient statistics, allows one to improve the MSE by conditioning.
• h is the bandwidth:
– too small: undersmoothing, large variance;
– too large: oversmoothing, large bias.
Hypothesis testing
The probability, under the null hypothesis, of a result at least as extreme as the one actually observed is called the p-value. Since this probability is small, Fisher declared that the result of the experiment was too unlikely under the null hypothesis, and thus rejected the latter, hence concluding that Bristol was actually able to tell whether the milk or the tea was poured first.
Fisher’s argument is considered as one of the first attempts to formalise the design and statisti-
cal analysis of scientific experiments. It is the basis of the theory of hypothesis testing, which was
then developed in particular by Neyman and Pearson in the 1930’s. In this chapter, we present this
theory in the framework of parametric estimation as introduced in Chapter 2. Hence, throughout
the chapter, a parametric model P = {Pθ , θ ∈ Θ} is fixed, with Θ ⊂ Rq . The state space for the
sample Xn = (X1 , . . . , Xn ) is denoted by Xn .
3.1 General formalism
Definition 3.1.1 (Test). A test of H0 against H1 is a decision rule determining, given an observa-
tion xn ∈ Xn , whether θ ∈ H0 or θ ∈ H1 .
Definition 3.1.2 (Region of rejection). The region of rejection of a test is the set Wn of realisations
xn ∈ Xn for which H0 is rejected.
Example 3.1.3 (Bernoulli model). A sequence of coin flips is modelled by Bernoulli random variables with parameter p ∈ [0, 1]. The experimenter wants to know whether the coin is biased or not. She sets:
$$H_0 = \{p = 1/2\}, \qquad H_1 = \{p \ne 1/2\}.$$
Notice that the null hypothesis is simple, while the alternative hypothesis is composite.
For a sufficiently large sample size n, the Law of Large Numbers asserts that X̄n is close to p. As a consequence, an intuitive test consists in rejecting H0 as soon as X̄n is 'far enough' from 1/2. Formally, this amounts to taking
$$W_n = \{x_n \in \{0,1\}^n : |\bar x_n - 1/2| \ge a\}$$
for some a ∈ (0, 1/2), which has to be determined in order to control the risk of taking a wrong
decision — this notion will be made precise below.
Remark 3.1.4. In Example 3.1.3, the null hypothesis is that under which there is no bias. This is
a general fact: when one wants to check whether a certain effect is present in the data, one defines
the null hypothesis as the absence of this effect. ◦
Definition 3.1.5 (Type I and type II errors). A type I error2 is the incorrect rejection of H0 . It is
measured by the type I risk θ ∈ H0 7→ Pθ (Xn ∈ Wn ).
A type II error3 is the incorrect acceptance of H0 . It is measured by the type II risk θ ∈ H1 7→
Pθ (Xn 6∈ Wn ).
Remark 3.1.6. With the interpretation that the null hypothesis is the absence of the effect that one wants to identify, the type I error corresponds to a false positive, as the test incorrectly concludes that the effect is present. On the contrary, the type II error corresponds to a false negative. ◦
2 Erreur de première espèce in French.
3 Erreur de seconde espèce in French.
In Example 3.1.3, taking a very small leads the test to accept H0 only when X̄n is very close to 1/2, and therefore increases the type I risk. On the contrary, taking a close to 1/2 makes the test accept H0 for many values of the sample set, and thus increases the type II risk (such a test is called conservative). This example shows that one cannot minimise both risks simultaneously. In order to select a, the Neyman–Pearson approach consists in:
(i) fixing a level α ∈ (0, 1), usually 1%, 5% or 10%;
(ii) defining the rejection region which minimises the type II risk under the constraint that the type I risk be lower than α.
Definition 3.1.7 (Level and statistical power). The level, or size, of a test is
$$\alpha = \sup_{\theta\in H_0} P_\theta(X_n \in W_n).$$
The statistical power of the test is the function θ ∈ H1 ↦ Pθ(Xn ∈ Wn).
Figure 3.1: Statistical power of the test of level α = 5% in the Bernoulli model, for n = 10, 20, 30, 40, 50. The larger n, the more peaked the curve.
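The power curves of Figure 3.1 can be recomputed exactly from the Binomial distribution. A sketch in Python with SciPy (the TPs use R), for the test with rejection region $\{|\bar x_n - 1/2| \ge \phi_{1-\alpha/2}/(2\sqrt n)\}$:

```python
import numpy as np
from scipy.stats import binom, norm

def power(p, n, alpha=0.05):
    # power of the test rejecting H0: p = 1/2 when |xbar - 1/2| >= phi_{1-alpha/2} / (2 sqrt(n))
    a = norm.ppf(1 - alpha / 2) / (2 * np.sqrt(n))
    lo = np.floor(n * (0.5 - a))    # reject when S_n <= lo ...
    hi = np.ceil(n * (0.5 + a))     # ... or S_n >= hi
    return binom.cdf(lo, n, p) + binom.sf(hi - 1, n, p)

# at every fixed p != 1/2, the power increases with n, as in Figure 3.1
assert power(0.6, 50) < power(0.6, 200)
```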
Exercise 3.1.10. Show that, in the Bernoulli model of Example 3.1.3, the test with rejection region $W_n = \{x_n \in \{0,1\}^n : |\bar x_n - 1/2| \ge \phi_{1-\alpha/2}/(2\sqrt n)\}$ is consistent. ◦
Remark 3.1.11. Tests with a poor statistical power face the risk of returning a type II error with a large probability. As Figure 3.1 shows, increasing the size of the sample allows one to reduce this risk. A standard value for the acceptable power of a test is 80%. Yet, Figure 3.1 also shows that this value can generally not be reached uniformly over H1. For hypotheses of the form H0 = {θ = θ0}, H1 = {θ ≠ θ0}, a possible approach consists in fixing a power level ρ (say ρ = 0.8) and a threshold δ > 0, and looking for n such that
$$\forall \theta \text{ such that } |\theta - \theta_0| \ge \delta, \quad P_\theta(X_n \in W_n) \ge \rho. \quad \circ$$
3.1.3 p-value
In Example 3.1.3, the null hypothesis H0 = {1/2} is simple. One may also consider a composite hypothesis H0 = [0, 1/2], in which case H1 = (1/2, 1] and a natural rejection region has the form
$$W_n = \{x_n \in \{0,1\}^n : \bar x_n - 1/2 \ge a\}.$$
In both cases (simple and composite null hypothesis), the rejection region has the generic form
$$W_n = \{x_n \in X_n : \zeta_n(x_n) \ge a\}. \qquad (*)$$
In the first case, ζn(xn) = |x̄n − 1/2| and the test is called two-sided⁵ because both the events x̄n − 1/2 ≤ −a and x̄n − 1/2 ≥ a must be taken into account to compute the type I error; while in the second case, ζn(xn) = x̄n − 1/2 and the test is called one-sided⁶.
When the rejection region of a test has the form (∗), the random variable ζn (Xn ) is called the
test statistic.
Definition 3.1.13 (p-value). Consider a test with rejection region Wn of the form (∗). For all
xn ∈ Xn , the p-value of an observation xn is
p-value = sup Pθ (ζn (Xn ) ≥ ζn (xn ));
θ∈H0
in other words, it is the probability, under H0 , that the test statistic takes values more unfavourable
for the acceptance of H0 than the observed value in the data.
In the (two-sided) test for the Bernoulli model of Example 3.1.3, under H0 the empirical mean
X n should take values concentrated around 1/2. However, due to the randomness of the sample, it
may happen that X n takes values which are far from 1/2. For a given value xn of the test statistic,
the p-value assesses how likely it is that this value be due to random fluctuations of the sample: the
smaller the p-value, the more unlikely the realisation xn under H0 , and the more the experimenter
is encouraged to reject H0. Indeed, it is easily checked that a test with level α rejects H0 if and only if the p-value of the observation is smaller than α. As a consequence, the p-value indicates all levels at which H0 will be rejected.
Example 3.1.14. In the Bernoulli model considered in Example 3.1.3, we give the p-values of the observation x̄n = 0.6 for different values of n in Table 3.1. For small values of n, the approximation $\bar X_n \simeq 1/2 + G/(2\sqrt n)$ under H0 is not valid and we rather used exact computations with the Binomial distribution.
n        1    10    100     1000
p-value  1    0.75  0.046   2.5 · 10⁻¹⁰
Table 3.1: p-values of the observation xn = 0.6 in the Bernoulli model, for various values of n.
At the level α = 5%, H0 is rejected for n = 100, n = 1000 but accepted for n = 1, n = 10.
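The values of Table 3.1 can be reproduced as follows (a sketch in Python with SciPy; the TPs use R): the exact computation uses the Binomial distribution, the approximate one the Gaussian limit.

```python
import numpy as np
from scipy.stats import binom, norm

def p_value_exact(xbar, n):
    # exact p-value P_{1/2}(|Xbar_n - 1/2| >= |xbar - 1/2|) via the Binomial distribution
    s = round(n * xbar)
    d = abs(s - n / 2)
    return binom.cdf(n / 2 - d, n, 0.5) + binom.sf(n / 2 + d - 1, n, 0.5)

def p_value_gauss(xbar, n):
    # Gaussian approximation Xbar_n ~ 1/2 + G / (2 sqrt(n)) under H0
    return 2 * norm.sf(2 * np.sqrt(n) * abs(xbar - 0.5))
```

For instance, p_value_exact(0.6, 10) is about 0.75 and p_value_gauss(0.6, 100) is about 0.046, matching Table 3.1.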
Exercise 3.1.15. Assume that H0 = {θ0 } is a simple hypothesis, and that under Pθ0 , the test
statistic ζn (Xn ) has a continuous cumulative distribution function. Show that under H0 , the p-
value is a statistic uniformly distributed on [0, 1]. ◦
In general, to compute the exact p-value of an observation, it is necessary to compute the value of the cumulative distribution function of ζn(Xn) at the point ζn(xn). This is not always possible analytically, and one may resort to a scientific computing software, to Monte Carlo simulations, or to statistical tables, in which case only lower and upper bounds on the p-value might be available.
Remark 3.1.16 (The '5σ-rule'). Levels of 1%, 5% or 10% are considered standard in many fields of application, such as biology, medicine, and social sciences. In particle physics, the standard rule, called the 5σ-rule, is much more conservative: the null hypothesis, namely the nonexistence of the sought particle, is usually rejected only if the observed value of the test statistic is larger than 5 times its standard deviation. For a one-sided test with a Gaussian test statistic, this rule leads to rejecting H0 if the p-value is lower than P(G ≥ 5) ≈ 3 · 10⁻⁷. ◦
5 Bilatéral in French.
6 Unilatéral in French.
(ii) To respect the presumption of innocence, we define the null hypothesis as that under which
the manufacturer does not lie, and therefore set
H0 = {λ ≤ λ0 }, H1 = {λ > λ0 }.
(iii) Since the typical lifespan of a smartphone is larger under H0 than under H1, we take a rejection region of the form
$$W_n = \{x_n \in [0, +\infty)^n : \bar x_n \le a\}.$$
(iv) Let us compute the type I error. For all λ ≤ λ0, for all a ≥ 0,
$$P_\lambda(X_n \in W_n) = P_\lambda(\bar X_n \le a) = P(S_n \le \lambda a),$$
where the random variable Sn = λX̄n is free in the sense of Definition 2.4.3, with law Γ(n, n). As a consequence, the type I error λ ↦ Pλ(Xn ∈ Wn) is nondecreasing, so that
$$\sup_{\lambda \le \lambda_0} P_\lambda(X_n \in W_n) = P(S_n \le \lambda_0 a).$$
Denoting by γn,r the quantile of order r of the law Γ(n, n), we deduce that the level of the test is lower than α if and only if
$$a \le \frac{\gamma_{n,\alpha}}{\lambda_0}.$$
For all λ ∈ H1, the power of the test now writes
$$P_\lambda(X_n \in W_n) = P(S_n \le \lambda a),$$
which is maximal for the largest allowed value of a. As a conclusion, we finally take the rejection region
$$W_n = \{x_n \in [0, +\infty)^n : \bar x_n \le \gamma_{n,\alpha}/\lambda_0\}.$$
With the values n = 1000, α = 0.05 and λ0 = 1/3, we get γn,α /λ0 = 2.85.
(v) Since the observed value xn = 2.8 is lower than 2.85, we are in the rejection region and H0
is rejected at the level 5%. Alternatively, the p-value of the observation is equal to 0.016,
which shows that at the level 1%, H0 would not be rejected.
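The numerical values of this example can be checked with the Gamma quantile and cumulative distribution functions. A sketch in Python with SciPy (the TPs use R; in SciPy, the rate-n parametrisation of Γ(n, n) corresponds to scale = 1/n):

```python
from scipy.stats import gamma

n, alpha, lam0 = 1000, 0.05, 1 / 3

# S_n = lambda * Xbar_n follows the Gamma(n, n) law (shape n, rate n) under P_lambda
g_quantile = gamma.ppf(alpha, a=n, scale=1 / n)       # gamma_{n, alpha}
threshold = g_quantile / lam0                         # reject H0 when xbar_n <= threshold, about 2.85

xbar_obs = 2.8
p_val = gamma.cdf(lam0 * xbar_obs, a=n, scale=1 / n)  # p-value of the observation, about 0.016
```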
Exercise 3.2.2. Using the Central Limit Theorem for X n in place of the free random variable Sn
in Example 3.2.1, construct a consistent test with asymptotic level α. Compare the p-value for this
asymptotic test with the p-value found above. ◦
A confidence region with level 1 − α for g(θ) is a random subset Cn = cn(Xn) of g(Θ) such that
$$\forall \theta \in \Theta, \quad P_\theta(g(\theta) \in C_n) = 1 - \alpha.$$
With this definition, a confidence interval is a confidence region which is an interval. An approximate confidence region is a subset such that Pθ(g(θ) ∈ Cn) ≥ 1 − α.
Proposition 3.2.3 (Duality between tests and confidence intervals). Let Cn = cn(Xn) be a confidence region for g(θ) with level 1 − α. For all g0 ∈ g(Θ), the test with rejection region
$$W_n = \{x_n \in X_n : g_0 \not\in c_n(x_n)\}$$
has level α for the hypotheses
$$H_0 = \{g(\theta) = g_0\}, \qquad H_1 = \{g(\theta) \ne g_0\}.$$
Reciprocally, assume that for all g0 ∈ g(Θ), a test with level α and rejection region Wn(g0) is available for the hypotheses H0 and H1 defined above. The random region
$$C_n = \{g \in g(\Theta) : X_n \not\in W_n(g)\}$$
is a confidence region for g(θ) with level 1 − α.
For all a ≥ 0,
$$\zeta_n^{\mathrm{LR}}(x_n) \ge a \quad \text{if and only if} \quad \bar x_n \ge \frac{n(\lambda_1 - \lambda_0) + \log a}{n(\log\lambda_1 - \log\lambda_0)},$$
which shows that the rejection region actually takes the form xn ≥ a′ , so that the likelihood ratio
test coincides with the intuitive approached mentionned above. Since nX n ∼ P(nλ0 ) under Pλ0 ,
we deduce that the level of the test is lower than α if and only if
n(λ1 − λ0 ) + log a
≥ qnλ0 ,1−α ,
log λ1 − log λ0
where qλ,r is the quantile of order r of the Poisson distribution with parameter λ.
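The quantile condition can be explored numerically. The sketch below, in Python with scipy (the values of n, α and λ0 are hypothetical, not taken from the text), checks that rejecting when nX̄n strictly exceeds the quantile qnλ0 ,1−α keeps the type I risk below α.

```python
from scipy.stats import poisson

# Hypothetical values for illustration.
n, alpha, lam0 = 50, 0.05, 0.8

# q_{n lam0, 1 - alpha}: smallest integer k with P(N <= k) >= 1 - alpha
# for N ~ Poisson(n * lam0).
q = poisson.ppf(1 - alpha, n * lam0)

# Type I risk of the test rejecting when n * Xbar_n > q, i.e. P(N >= q + 1).
level = poisson.sf(q, n * lam0)
print(level <= alpha)   # True
```

Because the Poisson distribution is discrete, the level α is in general not attained exactly.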
Besides providing a test statistic automatically, the likelihood ratio test has the following opti-
mality property.
Proposition 3.2.7 (Neyman–Pearson Lemma). Among all tests of level α for the simple hypotheses
H0 = {θ = θ0 } and H1 = {θ = θ1 }, the likelihood ratio test is the most powerful.
The likelihood ratio test is said to be Uniformly Most Powerful (UMP)7 .
Proof. Let Wn be a subset of Xn such that Pθ0 (Xn ∈ Wn ) ≤ α. We want to show that

Pθ1 (Xn ∈ Wn ) ≤ Pθ1 (Xn ∈ WnLR ).

Since

xn ∈ WnLR if and only if ζnLR (xn ) = Ln (xn ; θ1 ) / Ln (xn ; θ0 ) ≥ a,

we get

∫_{WnLR \ Wn} Ln (xn ; θ1 ) dxn ≥ a ∫_{WnLR \ Wn} Ln (xn ; θ0 ) dxn ,

∫_{Wn \ WnLR} Ln (xn ; θ1 ) dxn ≤ a ∫_{Wn \ WnLR} Ln (xn ; θ0 ) dxn ,

so that

Pθ1 (Xn ∈ WnLR ) − Pθ1 (Xn ∈ Wn ) ≥ a ( Pθ0 (Xn ∈ WnLR ) − Pθ0 (Xn ∈ Wn ) ).

Since the level of the likelihood ratio test is α while Pθ0 (Xn ∈ Wn ) ≤ α, we deduce that the
right-hand side above is nonnegative, which proves the claimed inequality.
7 Uniformément Plus Puissant (UPP) en français.
In order to implement the construction of the likelihood ratio test, it is necessary to know the
distribution of the test statistic ζnLR (Xn ) under H0 . Exercise 3.A.6 provides an asymptotic answer.
The likelihood ratio test can be extended to composite hypotheses, by defining the likelihood
ratio

ζnLR (xn ) = sup_{θ1 ∈H1} Ln (xn ; θ1 ) / sup_{θ0 ∈H0} Ln (xn ; θ0 ).
Definition 3.3.1 (Z-, t-, F-, and χ2 -tests). A Z-test is a test where the law of the test statistic
under H0 is a Gaussian distribution.
A t-test is a test where the law of the test statistic under H0 is a Student distribution.
An F-test is a test where the law of the test statistic under H0 is a Fisher distribution.
A χ2 -test is a test where the law of the test statistic under H0 is a χ2 distribution.
In the remainder of this section, we consider the Gaussian model Xi ∼ N(µ, σ 2 ) for a sample
Xn = (X1 , . . . , Xn ), and detail the construction of various tests for µ and σ 2 .
↸ Exercise 3.3.2 (Test for the mean with known variance). We fix µ0 ∈ R and construct a test for
the hypotheses
H0 = {µ = µ0 }, H1 = {µ 6= µ0 },
assuming that the variance σ 2 is known.
↸ Exercise 3.3.3 (Test for the mean with unknown variance). We fix µ0 ∈ R and construct a test
for the hypotheses
H0 = {µ = µ0 }, H1 = {µ 6= µ0 },
assuming that the variance σ 2 is unknown.
↸ Exercise 3.3.4 (Test for the variance). We fix σ02 > 0 and construct a test for the hypotheses
H0 = {σ 2 = σ02 }, H1 = {σ 2 ≠ σ02 }.
1. Looking at Proposition A.4.3 again, find a free random variable with χ2 distribution.
which becomes larger and larger as m grows. As a consequence, even if there is no difference
between the two populations, the more experiments are conducted, the more likely it is that at
least one of these experiments will return a false positive and conclude that the two populations
differ. This is the look-elsewhere effect, which is also illustrated on Figure 3.2 and can
be considered as a central issue in modern statistics, as the large size of available datasets enables
multiple comparisons.
Definition 3.4.1 (Family-Wise Error Rate). The Family-Wise Error Rate (FWER) of the family of
tests Wn1 , . . . , Wnm is defined by
FWER = sup_{θ∈H0} Pθ (∃k ∈ {1, . . . , m} : Xn ∈ Wnk ).
Exercise 3.4.2. If the tests are independent and H0 is simple, compute FWER in terms of the
individual levels α1 , . . . , αm . ◦
The Bonferroni method consists in rejecting H0 at the level α ∈ (0, 1) as soon as at least one
of the p-values p1 , . . . , pm is lower than α/m. It is based on the next result, which does not require
the tests to be independent.
As a consequence, if one takes αk = α/m for all k ∈ {1, . . . , m}, then FWER is lower than α.
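Under the independence assumption of Exercise 3.4.2 with a simple H0 , each p-value is uniform on [0, 1] and the FWER has a closed form. A short numerical sketch (in Python, with hypothetical values m = 100 and α = 0.05) of the look-elsewhere effect and of the Bonferroni correction:

```python
m, alpha = 100, 0.05

# Each test run at level alpha: the probability of at least one false
# positive among m independent tests is 1 - (1 - alpha)^m.
fwer_uncorrected = 1 - (1 - alpha) ** m          # close to 1 for m = 100

# Bonferroni correction: each test run at level alpha / m.
fwer_bonferroni = 1 - (1 - alpha / m) ** m       # below alpha
print(round(fwer_uncorrected, 3), round(fwer_bonferroni, 3))
```

The uncorrected FWER is essentially 1, which is exactly the effect illustrated by Figure 3.2.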
Figure 3.2: An illustration of the look-elsewhere effect. Taken from XKCD by Randall Munroe:
http://www.xkcd.com/882.
3.4 * Multiple comparisons 79
Pn
Proof. Using the union bound P(∪nk=1 Ak ) ≤ k=1 P(Ak ), we have, for all θ ∈ H0 ,
X
m X
m
k k
Pθ ∃k ∈ {1, . . . , m} : Xn ∈ Wn ≤ P θ Xn ∈ W n ≤ αk ,
k=1 k=1
which leads to the expected bound by taking the supremum of the left-hand side over θ ∈ H0 .
The union bound employed in the proof of Lemma 3.4.3 can be very rough, and as a conse-
quence the Bonferroni method has the drawback of generally increasing the type II risk severely.
Table 3.2: Notations for multiple comparisons. V , S, R, U , T are random variables, among which
only R is a statistic. m0 is not random but depends on θ.
Definition 3.4.4 (False Discovery Rate). The False Discovery Rate (FDR) is
FDR(θ) = Eθ [V /R],

with the convention that V /R = 0 when R = 0.
The Benjamini–Hochberg procedure consists in sorting the p-values of the m tests in non-
decreasing order p(1) ≤ · · · ≤ p(m) , setting J = max{j : p(j) ≤ αj/m} (with J = 0 if this set is
empty), and rejecting the J null hypotheses with the smallest p-values.
Exercise 3.4.5. Show that with this procedure, the number of discoveries is R = J. ◦
Theorem 3.4.6 (Control of the FDR by the Benjamini–Hochberg procedure). Assume that the
tests are independent. With the Benjamini–Hochberg procedure,
∀θ ∈ Θ, FDR(θ) = (m0 /m) α ≤ α.
Proof. Let θ ∈ Θ, let I be the set of indices k for which θ ∈ H0k , and for all k ∈ {1, . . . , m},
let pk denote the p-value of the k-th test. For all k ∈ I, let Jk denote the number of discoveries
of the procedure if the p-value of the k-th test is replaced with the value 0. Then it follows from
the definition of the Benjamini–Hochberg procedure that, for all j ∈ {1, . . . , m}, the k-th null
hypothesis is rejected and J = j if and only if pk ≤ αj/m and Jk = j, so that

FDR(θ) = Eθ [ Σ_{k∈I} Σ_{j=1}^{m} (1/j) 1{pk ≤αj/m} 1{Jk =j} ].

We then use the independence of pk and Jk (the latter only depends on the p-values pl , l ≠ k),
and the fact that k ∈ I implies that under Pθ , the p-value pk is uniformly distributed on [0, 1],
therefore

Eθ [1{pk ≤αj/m} ] = αj/m.

We deduce that

FDR(θ) = Σ_{k∈I} Σ_{j=1}^{m} (1/j) (αj/m) Pθ (Jk = j) = (m0 /m) α,

since |I| = m0 and, for all k ∈ I, Σ_{j=1}^{m} Pθ (Jk = j) = 1 (replacing pk with 0 forces at least
one discovery), which concludes the proof.
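The procedure itself (sort the p-values, take J = max{j : p(j) ≤ αj/m}, and reject the hypotheses with the J smallest p-values) is short to implement. A sketch in Python, where the p-values are made up for illustration:

```python
import numpy as np

def benjamini_hochberg(p_values, alpha):
    """Indices of the hypotheses rejected by the Benjamini-Hochberg procedure."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                                 # sorts p-values increasingly
    below = p[order] <= alpha * np.arange(1, m + 1) / m   # p_(j) <= alpha * j / m
    if not below.any():
        return np.array([], dtype=int)                    # J = 0: no discovery
    J = np.max(np.nonzero(below)[0]) + 1                  # J = max{j : p_(j) <= alpha j / m}
    return np.sort(order[:J])                             # reject the J smallest p-values

# Made-up p-values.
p = [0.001, 0.008, 0.039, 0.041, 0.60, 0.042, 0.06]
print(benjamini_hochberg(p, alpha=0.05))   # indices of the rejected hypotheses
```

Note that a p-value may exceed αj/m for its own rank and still be rejected, provided a larger rank j satisfies p(j) ≤ αj/m.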
3.A Exercises
↸ Exercise 3.A.1 (Power and sample size). The probability for an individual to be infected by
a virus is denoted by p0 , and assumed to be known. A new vaccine is tested on a sample of
n individuals. We denote by p ∈ [0, p0 ] the probability to be infected after the vaccine (the
probability to be infected cannot be increased by the vaccine). For all i ∈ {1, . . . , n}, we define
Xi = 1 if the i-th individual is infected by the virus after the vaccine and Xi = 0 otherwise, so
that Xi ∼ B(p). We introduce the hypotheses
H0 = {p = p0 }, H1 = {p < p0 }.
3. It turns out that p = 15%, so that the vaccine actually reduces the number of infected indi-
viduals by 1/6. What is the probability that an experiment carried out on n = 100 individuals
succeeds in detecting the efficacy of the vaccine at the level α = 5%? What do you think
of this result?
4. What should be the minimum size of the sample in order to detect, with probability at least
80%, a reduction of the number of infected people by 1/6? By 1/3? ◦
Exercise 3.A.2 (Nonasymptotic test for the Bernoulli model). The purpose of this exercise is to
construct a nonasymptotic test, with level lower than α, for the hypotheses
H0 = {p = p0 }, H1 = {p ≠ p0 }, p0 ∈ [0, 1],
1. Using Hoeffding’s inequality, construct an approximate confidence interval for p (see Sec-
tion 2.4).
2. Deduce a nonasymptotic test from the duality between tests and confidence intervals. ◦
Exercise 3.A.3 (A simple quiz). This exercise is taken from A. Reinhart’s book Statistics Done
Wrong: the Woefully Complete Guide [4], which is a very good reference concerning the practical
applications of hypothesis testing in experimental sciences.
A 2002 study found that an overwhelming majority of statistics students — and instructors
— failed a simple quiz about p-values. Try the quiz (slightly adapted for this book) for yourself
to see how well you understand what the p-value really means. Suppose you are testing two
medications, Fixitol and Solvix. You have two treatment groups, one that takes Fixitol and one
that takes Solvix, and you measure their performance on some standard task (a fitness test, for
instance) afterward. You compare the mean score of each group using a simple significance test,
and you obtain p-value = 0.01, indicating there is a statistically significant difference between
means.
Based on this, decide whether each of the following statements is true or false:
1. You have absolutely disproved the null hypothesis: there is no difference between means.
2. You have found the probability of the null hypothesis being true.
3. You have absolutely proved the alternative hypothesis: there is a difference between means.
4. You can deduce the probability that the alternative hypothesis is true.
5. You know, if you decide to reject the null hypothesis, the probability that you are making
the wrong decision.
6. You have a reliable experimental finding, in the sense that if your experiment were repeated
many times, you would obtain a significant result in 99% of trials. ◦
Exercise 3.A.4 (The Student–Wald test). Let Zn be a consistent and asymptotically normal estima-
tor of g(θ) ∈ R, with asymptotic variance V (θ). Assume that a consistent estimator Vbn of the vari-
ance is available. For g0 ∈ g(Θ), construct a consistent test for the hypotheses H0 = {g(θ) = g0 },
H1 = {g(θ) ≠ g0 }, with asymptotic level α. ◦
Exercise 3.A.5 (From the 2015-2016 final exam9 ). Let θ > 0 and α > 0. We consider the
probability density

fθ,α (x) = C(θ, α) x−(α+3) 1{x∈[θ,+∞)} ,
and an iid sample X1 , . . . , Xn from this distribution.
1. Compute C(θ, α), E[X] and Var(X) if X has the density fθ,α(x).
2. We assume that θ is known and α is unknown. We shall admit that
E[log X] = log θ + 1/(α + 2), and Var(log X) = 1/(α + 2)2 .
(a) Compute the MLE αbn of α and show that it is strongly consistent.
(b) Use E[X] to compute a strongly consistent estimator αen of α by the method of moments.
(c) Show that the estimators αbn and αen are asymptotically normal and compute their
asymptotic variance. Which estimator do you prefer?
(d) Let α0 > 0. Construct an asymptotic test for H0 = {α = α0 }, H1 = {α > α0 }, with
level 5%, based on the better estimator of α.
3. We now assume that α is known and θ is unknown.
(a) Compute the MLE θbn of θ.
(b) Compute the bias of θbn .
Hint: for a nonnegative random variable Y , E[Y ] = ∫_{x=0}^{+∞} P(Y ≥ x) dx.
(c) Compute the cumulative distribution function of n(θbn − θ) and determine its limit.
Deduce that n(θbn − θ) converges in distribution and give the law of the limit.
(d) Construct an asymptotic confidence interval for θ, with level 95%. You may look for
an interval of the form [θbn /(1 + b/n), θbn ] with b > 0 to be determined. ◦
Exercise 3.A.6 (Asymptotics of the likelihood ratio statistic). We consider simple hypotheses
H0 = {θ = θ0 } and H1 = {θ = θ1 }. We recall the Definition 3.2.5 of the likelihood ratio
ζnLR (Xn ) and assume that the quantity
h = Eθ1 [ φ( L1 (X1 ; θ0 ) / L1 (X1 ; θ1 ) ) ], φ(u) = u log u,
where L1 (x1 ; θ) is the likelihood of a sample with a single value, is well-defined.
1. Using Jensen’s inequality, show that h ≥ 0. In the sequel, we shall assume that the model
is chosen so that h > 0.
2. Show that under H0 , (1/n) log ζnLR (Xn ) converges almost surely to −h.
3. Using the Central Limit Theorem, construct a consistent test with asymptotic level α based
on the likelihood ratio. ◦
9 Written by C. Butucea.
3.B Summary
3.B.1 Vocabulary
• Null and alternative hypotheses: partition H0 , H1 of the parameter set Θ.
• Test: procedure accepting or rejecting H0 depending on the value of the data.
• Rejection region: set of values of the sample for which H0 is rejected.
• Type I error: reject H0 while actually θ ∈ H0 , measured by type I risk.
• Type II error: accept H0 while actually θ ∈ H1 , measured by type II risk.
• Level: maximum of type I risk.
• Power: 1 − type II risk. A test is consistent if the power converges to 1 when the size of the
sample goes to +∞.
• p-value: probability, under H0 , that the test statistic takes values less favourable to H0 than
the value actually observed.
Golden rule: the smaller the p-value, the more unlikely the observation under H0 , so that
reject H0 at level α ⇐⇒ p-value ≤ α.
Nonparametric tests
Nonparametric statistics refer to the framework where no parametric assumption is made on the
common law P of the iid sample X1 , . . . , Xn which we want to estimate. Theoretically, it is still
possible to employ the formalism of parametric estimation, and look for P within the set
P = {Pθ : θ ∈ Θ},
where the parameter set Θ is the set of all probability measures on the space X in which X1 , . . . , Xn
take their values, and Pθ = θ, but the interest of the dimension reduction of the parametric
framework is lost. Thus, specific tools need to be introduced. An overview of such tools for the
problem of nonparametric estimation is given in Section 2.5 of Chapter 2. In the present chapter,
we address the problem of nonparametric tests.
The basic idea of nonparametric methods consists in approximating the law P of the variables
X1 , . . . , Xn by the empirical distribution Pbn defined by
Pbn (A) = (1/n) Σ_{i=1}^{n} 1{Xi ∈A} ,
for any (measurable) subset A ⊂ X. Notice that Pbn is a random probability measure, as it depends
on the value of the sample Xn = (X1 , . . . , Xn ).
Exercise 4.0.1 (Warm-up). Show that Pbn (A) converges almost surely to P (A) = P(X1 ∈ A). ◦
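The convergence of Exercise 4.0.1 can be observed numerically. A sketch in Python, where the law N(0, 1) of the sample and the set A = [0, 1] are arbitrary choices made for illustration:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

# Empirical measure of A = [0, 1] for an iid N(0, 1) sample of size n.
n = 100_000
x = rng.standard_normal(n)
phat = np.mean((0.0 <= x) & (x <= 1.0))   # Pbn(A)

# P(X1 in A) = Phi(1) - Phi(0) = erf(1 / sqrt(2)) / 2 for the standard Gaussian.
p_true = erf(1 / sqrt(2)) / 2
print(abs(phat - p_true))   # small for large n
```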
In this chapter, we shall focus on two classes of tests.
• Goodness-of-fit tests1 , where a specific probability measure P0 on X is given and the hy-
potheses are
H0 = {P = P0 }, H1 = {P 6= P0 }.
An example of such a test could be: ‘are the grades of an exam distributed according to
the binomial distribution with parameters N = 20, p = 0.5 (which would mean that the
students have answered the questions at random)?’.
• Goodness-of-fit tests to a family of distributions, where a subset P0 of the set of probability
measures on X (usually, a parametric model) is given, and the hypotheses are
H0 = {P ∈ P0 }, H1 = {P 6∈ P0 }.
An example of such a test could be: ‘do the logarithms of the daily variations of the price
of a stock follow a Gaussian distribution?’.
1 Test de conformité, d’ajustement ou d’adéquation en français.
In any case, we shall construct tests by measuring a certain distance between the empirical
distribution Pbn and either P0 or P0 , and reject H0 as soon as this distance is larger than a certain
threshold a. The choice of a ‘good’ notion of distance (in the space of probability measures on X)
is therefore crucial. We shall discuss two particular cases: the case where X is a finite set, and the
case where X = R and P is assumed to have a continuous cumulative distribution function.
Recall that the χ2 divergence between probability measures P = (px )x∈X and Q = (qx )x∈X on X
is defined by χ2 (P |Q) = Σ_{x∈X} (px − qx )2 /qx , with the convention that (px − qx )2 /qx = 0
when px = qx = 0.
Remark 4.1.2. The χ2 distance is not a distance, because it is not symmetric. However, it can be
checked that χ2 (P |Q) = 0 if and only if P = Q, so that χ2 (P |Q) can still be understood as a
measure of how close P and Q are. ◦
The probability mass function of the empirical distribution Pbn of a sample X1 , . . . , Xn iid
according to P is the random vector (pbn,x )x∈X defined by

∀x ∈ X, pbn,x = (1/n) Σ_{i=1}^{n} 1{Xi =x} .
Remark 4.1.3. If there exists x ∈ X such that px = 0, then almost surely, pbn,x = 0 for all n ≥ 1.
Therefore up to removing x from X, we shall always assume that px > 0 for all x ∈ X in the
sequel. ◦
(ii) √n(Pbn − P ) (seen as a random vector of Rm ) converges in distribution to a Gaussian vector
Nm (0, K) with covariance matrix K = (Kx,y )x,y∈X given by

Kx,y = px (1 − px ) if x = y, and Kx,y = −px py if x ≠ y.
Now by the multidimensional Central Limit Theorem (see Theorem A.3.1 in Appendix A),

lim_{n→+∞} √n(Pbn − P ) = Nm (0, K), in distribution,
Corollary 4.1.5 (Asymptotic behaviour of χ2 (Pbn |Q)). Under the assumptions of Proposition 4.1.4,
(i) for any probability measure Q on X, with a probability mass function (qx )x∈X such that
qx > 0 for all x ∈ X, χ2 (Pbn |Q) converges to χ2 (P |Q),
(ii) nχ2 (Pbn |P ) converges in distribution to a random variable with distribution χ2 (m − 1).
Proof. The first point follows from the continuity of P ↦ χ2 (P |Q) and Proposition 4.1.4 (i). In
order to prove the second point, let us introduce the diagonal matrix M ∈ Rm×m with diagonal
coefficients (1/√px )x∈X , so that

nχ2 (Pbn |P ) = n Σ_{x∈X} ( (pbn,x − px )/√px )2 = ‖M √n(Pbn − P )‖2 .
By Proposition 4.1.4 (ii), nχ2 (Pbn |P ) converges in distribution to ‖M U ‖2 , where U ∼ Nm (0, K).
Following Proposition A.2.3, M U ∼ Nm (0, Π) with Π = M KM ⊤ . The coefficients (Πx,y )x,y∈X
are easy to compute and write

Πx,y = Kx,y /(√px √py ), that is to say Πx,y = 1 − px if x = y, and Πx,y = −√(px py ) if x ≠ y.
It is now straightforward to check that

Π = Im − ee⊤ , e = (√px )x∈X , ‖e‖ = 1,

so that Π is the orthogonal projection of Rm onto the (m − 1)-dimensional space e⊥ . Therefore by
Proposition A.4.1, ‖M U ‖2 ∼ χ2 (m − 1).
The χ2 goodness-of-fit test is the test with rejection region

Wn = {dn ≥ χ2m−1,1−α },

where dn = nχ2 (Pbn |P0 ) is Pearson’s statistic and χ2m−1,1−α is the quantile of order 1 − α of
the χ2 (m − 1) distribution; it is consistent and has asymptotic level α.
Remark 4.1.9 (Validity of the asymptotic approximation). The χ2 test is based on the approxi-
mation of the law of Pearson’s statistic under H0 by the χ2 distribution, which theoretically only
holds when n goes to +∞. In practice, this approximation is considered to be legitimate when the
property
∀x ∈ X, np0,x (1 − p0,x ) ≥ 5
holds. Notice that this property holds if and only if

n min_{x∈X} p0,x (1 − min_{x∈X} p0,x ) ≥ 5 and n max_{x∈X} p0,x (1 − max_{x∈X} p0,x ) ≥ 5,

since the function p ↦ p(1 − p) is concave, so that its minimum over the range of (p0,x )x∈X is
attained at one of the two extreme values.
Exercise 4.1.10. Apply the χ2 test to answer the question of Example 4.1.6 (with the R command
qchisq(.95,9), one obtains χ29,0.95 = 16.9). ◦
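The computation of Pearson's statistic and of the rejection threshold can be sketched as follows, here in Python with scipy rather than R; the observed counts, the number of classes m = 4 and the uniform P0 are made up for illustration.

```python
import numpy as np
from scipy.stats import chi2

# Made-up counts over m = 4 classes, tested against the uniform P0.
counts = np.array([44, 56, 60, 40])
n, m = counts.sum(), len(counts)
p0 = np.full(m, 1 / m)

phat = counts / n
dn = n * np.sum((phat - p0) ** 2 / p0)   # Pearson's statistic n * chi2(Pbn | P0)

# Quantile chi2_{m-1, 1-alpha} with alpha = 0.05.
threshold = chi2.ppf(0.95, df=m - 1)
print(dn, threshold, dn >= threshold)    # H0 rejected if dn >= threshold
```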
Remark 4.1.11 (Extension to infinite state spaces). When the state space X is infinite, the χ2
test cannot be applied as the number of degrees of freedom m − 1 of the χ2 statistic should
be infinite. However, it can be adapted by partitioning the state space X into a finite number of
classes X̃ = {A1 , . . . , Am } and testing whether the random variables X̃1 , . . . , X̃n defined from
the sample X1 , . . . , Xn by

X̃i = Aj if Xi ∈ Aj

are distributed according to the probability measure P̃0 on X̃ defined by P̃0 (Aj ) = P0 (Aj ).
The result of the test will of course depend on the choice of the partition. In practice, the latter
should be chosen so that under P0 , all classes A1 , . . . , Am have approximately the same proba-
bility 1/m; in this case, following Remark 4.1.9, the number m of classes must be chosen such
that
n (1/m)(1 − 1/m) ≥ 5.
This procedure is particularly adapted to countably infinite state spaces, in which case P0 remains
represented by its probability mass function (p0,x )x∈X and
P̃0 (Aj ) = Σ_{x∈Aj} p0,x .
For continuous probability measures on R, nonparametric tests such as those introduced in Sec-
tion 4.2 should be preferred. ◦
When the condition of Remark 4.1.9 ensuring the validity of the χ2 approximation is not
satisfied because px takes too small values, it is a common practice to aggregate the classes for
which px is small into larger classes, in the spirit of Remark 4.1.11.
Example 4.1.12 (Binomial model). In a group of N students, the number of students missing the
i-th session of the course is denoted by Xi , i = 1, . . . , n, where n is the total number of ses-
sions. If each student chooses to skip a class independently of each other, the variables Xi should
follow a binomial distribution B(N, p) where p is unknown. To test whether this independence
property holds (null hypothesis), or whether there is a contagion effect in absenteeism (alternative
hypothesis), one may set P0 = {B(N, p), p ∈ [0, 1]}.
Notice that if P0 = {P0 } is a singleton, then we are in the framework of the previous para-
graph. If this is not the case, then Pearson’s statistic is ill-defined as one does not know a priori
which of the measures P0 ∈ P0 should be compared with Pbn . In order to circumvent this issue,
we shall assume that P0 is a parametric family with low dimension, which writes
P0 = {P0,θ , θ ∈ Θ},
where Θ ⊂ Rq with q < m. Example 4.1.12 fulfills this assumption, with q = 1 (the unknown
parameter p has dimension 1) and m = N + 1 (a B(N, p) variable can take the N + 1 values
0, . . . , N ).
Proposition 4.1.13 (Pearson’s statistic for the goodness-of-fit to a family of distributions). Assume
that Θ has a nonempty interior in Rq , and that the mapping θ 7→ P0,θ is injective2 . Let θbn be a
consistent estimator of θ. Consider the statistic

dn′ = nχ2 (Pbn |P0,θbn ) = n Σ_{x∈X} (pbn,x − p0,θbn ,x )2 / p0,θbn ,x .

Then, under H0 , dn′ converges in distribution to the χ2 (m − q − 1) distribution.
We omit the proof of Proposition 4.1.13 (and refer for example to [6, Section 17.5]) but insist
on the fact that the number of degrees of freedom of the limiting χ2 distribution under H0 is not
the same as in the case of a simple null hypothesis: the larger the dimension q of the parameter
set Θ, the lower the number of degrees of freedom!
Corollary 4.1.14 (χ2 goodness-of-fit test for a family of distributions). The test with rejection
region
Wn = {d′n ≥ χ2m−q−1,1−α }
is consistent and has asymptotic level α.
Exercise 4.1.15 (Continuation of Example 4.1.12). The number of students missing each session
of the course during the year 2017/2018 is reported below, for a group of N = 20 students.
Figure 4.1: Plot of the empirical CDF Fbn of a sample, superposed with its actual CDF F .
The empirical CDF of a sample Xn is plotted on Figure 4.1, together with the actual CDF F
of the sample.
As is intuitively expected, Fbn approximates F when the size of the sample increases. A first
justification of this approximation is that, for a fixed x ∈ R, the strong Law of Large Numbers
yields
lim_{n→+∞} Fbn (x) = E[1{X1 ≤x} ] = F (x), almost surely. (LLN)
Theorem 4.2.1 (Glivenko–Cantelli Theorem). Let F be a CDF on R and (Xi )i≥1 be a family of
independent random variables with CDF F . We have
lim_{n→+∞} sup_{x∈R} |Fbn (x) − F (x)| = 0, almost surely.
Before starting the proof, the reader should wonder in which sense the statement of the
Glivenko–Cantelli Theorem is stronger than the convergence result (LLN) asserted above.
Proof. For any x ∈ R, we denote by Fbn (x− ) and F (x− ) the respective left limits of Fbn and F at
x; since these functions are right continuous, there is no need to introduce a notation for the right
limits. By the strong Law of Large Numbers, for all x ∈ R, in addition to (LLN) we also get
lim_{n→+∞} Fbn (x− ) = lim_{n→+∞} (1/n) Σ_{i=1}^{n} 1{Xi <x} = E[1{X1 <x} ] = F (x− ), almost surely. (LLN-2)
Let ε > 0. Since F is nondecreasing and bounded, there is only a finite number of points x ∈ R
such that F (x) − F (x− ) > ε. Thus, there exist k ≥ 1 and −∞ = x0 < x1 < · · · < xk = +∞
2 Such a model is called identifiable.
3 Do not take this as a challenge to beat...
4 Continu à droite, avec une limite à gauche en français.
so that

Fbn (x) − F (x) ≤ Fbn (xℓ− ) − F (xℓ− ) + ε,

which completes the proof since the left-hand side does not depend on ε.
It should be clear that the Glivenko–Cantelli Theorem is a result of the same nature as the
strong Law of Large Numbers, stated in a functional space. It is accompanied by a Central Limit
Theorem, which relies on a random process5 (β(t))t∈[0,1] called the Brownian bridge.
Definition 4.2.2 (Brownian bridge). The Brownian bridge is the unique (in law) random process
(β(t))t∈[0,1] such that:
(i) almost surely, the mapping t 7→ β(t) is continuous on [0, 1], and β(0) = β(1) = 0;
(ii) for any d ≥ 1 and t1 , . . . , td ∈ [0, 1], the random vector (β(t1 ), . . . , β(td )) is Gaussian,
centered, with covariance matrix given by Cov(β(ti ), β(tj )) = min{ti , tj } − ti tj .
A proof of why the two conditions in Definition 4.2.2 ensure the existence and uniqueness of
(β(t))t∈[0,1] is beyond the scope of these notes. At a heuristic level, the Brownian bridge must be
understood as ‘the Brownian motion on [0, 1] conditioned to take the value 0 at the points 0 and 1’,
see Figure 4.2 and the course Stochastic Processes and their Applications6 .
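At the same heuristic level, the Brownian bridge can be simulated by setting β(t) = W(t) − tW(1), where W is a Brownian motion. A Python sketch checking the covariance formula of Definition 4.2.2 (the grid size, number of paths and the test points s = 0.3, t = 0.7 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

# Brownian motion on a grid of N steps, then beta(t) = W(t) - t * W(1).
N, n_paths = 200, 20_000
dW = rng.standard_normal((n_paths, N)) / np.sqrt(N)
W = np.cumsum(dW, axis=1)
t = np.arange(1, N + 1) / N
beta = W - t * W[:, -1:]                  # beta(1) = 0 by construction

# Empirical covariance at s = 0.3 and t = 0.7, to be compared with
# min(s, t) - s * t = 0.3 - 0.21 = 0.09.
i, j = int(0.3 * N) - 1, int(0.7 * N) - 1
cov = np.mean(beta[:, i] * beta[:, j])
print(abs(cov - 0.09))   # small
```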
Theorem 4.2.3 (Donsker Theorem). Let F be a CDF on R and (Xi )i≥1 be a family of independent
random variables with CDF F . The random function

Gn : x ∈ R ↦ √n (Fbn (x) − F (x))

converges in distribution, in a functional sense which we do not make precise here, to the random
function

G : x ∈ R ↦ β(F (x)),

where (β(t))t∈[0,1] is the Brownian bridge.
Figure 4.2: A realisation of the Brownian bridge.
Corollary 4.2.4 (Asymptotic of the supremum). Under the assumptions of Theorem 4.2.3, the
random variable

gn = √n sup_{x∈R} |Fbn (x) − F (x)|

converges in distribution to the random variable g∞ = sup_{x∈R} |β(F (x))|. When F is continuous,
then

g∞ = sup_{t∈[0,1]} |β(t)|,

the law of which does not depend on F .
H0 = {F = F0 }, H1 = {F ≠ F0 },

based on the Kolmogorov statistic ζn = sup_{x∈R} |Fbn (x) − F0 (x)|. Its asymptotic behaviour is
described by the Glivenko–Cantelli and Donsker Theorems, which provide a natural asymptotic
test.
7 A. N. Kolmogorov, Sulla Determinazione Empirica di una Legge di Distribuzione, Giornale dell’Istituto Italiano degli Attuari, vol. 4, 1933, p. 83–91.
Proposition 4.2.6 (Asymptotic Kolmogorov test). If P0 has a continuous CDF F0 on R, the test
with rejection region

Wn = {√n ζn ≥ a},

where a > 0 is defined by the relation

Σ_{k=−∞}^{+∞} (−1)k exp(−2k 2 a2 ) = 1 − α,

is consistent and has asymptotic level α.
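For α = 0.05, the relation above can be solved numerically: the law of sup_t |β(t)| is exposed in scipy as kstwobign, whose CDF is precisely K(a) = Σ_k (−1)^k exp(−2k²a²). A sketch:

```python
import numpy as np
from scipy.stats import kstwobign

alpha = 0.05

# Threshold a solving K(a) = 1 - alpha, where K is the CDF of sup |beta(t)|.
a = kstwobign.ppf(1 - alpha)   # ≈ 1.358, the classical 5% Kolmogorov threshold

# Sanity check: evaluate the series at a, truncated to |k| <= 100.
k = np.arange(-100, 101)
series = np.sum((-1.0) ** np.abs(k) * np.exp(-2 * k**2 * a**2))
print(round(a, 3), round(series, 4))
```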
Exercise 4.2.9 (Cramér–von Mises test). As an alternative to the Kolmogorov test, the Cramér–von
Mises test is based on the statistic
ξn = ∫_{x∈R} (Fbn (x) − F0 (x))2 F0′ (x) dx,
where F0 is assumed to be C 1 . This statistic is another measure of the distance between Fbn and F0 .
Thanks to heuristic computations based on the Donsker Theorem, describe the region of rejection
of this test. ◦
The proof of Lemma 4.2.10 is postponed below. It relies on the notion of pseudo-inverse of a
CDF.
This notion, which does not suffer from any issue of existence or uniqueness, can be seen as a
refinement of the Definition 2.4.2 of a quantile given in Chapter 2: for u ∈ (0, 1), the pseudo-
inverse of F is F −1 (u) = inf{x ∈ R : F (x) ≥ u}.
Lemma 4.2.13 (Pseudo-inverse). Let F be a CDF on R.
(i) For all u ∈ (0, 1) and x ∈ R, F −1 (u) ≤ x if and only if u ≤ F (x).
(ii) Let U be a uniform random variable on [0, 1]. Then F is the CDF of the random variable
F −1 (U ).
Proof. Since F is right-continuous, for any u ∈]0, 1[, the set {x ∈ R : F (x) ≥ u} is closed,
so that F (F −1 (u)) ≥ u. Since F is nondecreasing, we deduce that if F −1 (u) ≤ x, then u ≤
F (F −1 (u)) ≤ F (x). Reciprocally, if u ≤ F (x), then by the definition of F −1 , F −1 (u) ≤ x: this
proves the first point.
We check the second point by writing
P(F −1 (U ) ≤ x) = P(U ≤ F (x)) = ∫_{u=0}^{F (x)} du = F (x),
where the first identity follows from the first part of the lemma.
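Point (ii) of Lemma 4.2.13 is the basis of inverse-transform sampling. A sketch in Python for the exponential distribution E(λ) (the value λ = 2 is an arbitrary choice), whose pseudo-inverse is explicit:

```python
import numpy as np

rng = np.random.default_rng(1)

# For E(lam), F(x) = 1 - exp(-lam * x), hence F^{-1}(u) = -log(1 - u) / lam.
lam = 2.0
u = rng.uniform(size=200_000)
x = -np.log(1 - u) / lam          # by Lemma 4.2.13 (ii), x has CDF F

# The empirical CDF of x at a few points is close to F.
for s in (0.2, 0.5, 1.0):
    print(abs(np.mean(x <= s) - (1 - np.exp(-lam * s))))
```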
Proof of Lemma 4.2.10. Let U1 , . . . , Un be independent uniform random variables on [0, 1]. By
Lemma 4.2.13, under H0 , ζn has the same law as
sup_{x∈R} | (1/n) Σ_{i=1}^{n} 1{F0−1 (Ui )≤x} − F0 (x) | = sup_{x∈R} | (1/n) Σ_{i=1}^{n} 1{Ui ≤F0 (x)} − F0 (x) |
= sup_{u∈(0,1)} | (1/n) Σ_{i=1}^{n} 1{Ui ≤u} − u |,
where we have set u = F0 (x) and used the continuity of F0 to ensure that F0 (x) takes all values
u ∈ (0, 1) in the second equality. The law of the right-hand side does not depend on F0 , which
is the announced statement.
Definition 4.2.14 (Kolmogorov’s law). Let (Ui )i≥1 be a sequence of independent random vari-
ables uniformly distributed on [0, 1]. For all n ≥ 1, the law of the random variable

Zn = sup_{u∈(0,1)} | (1/n) Σ_{i=1}^{n} 1{Ui ≤u} − u |

is called Kolmogorov’s law with parameter n.
As a conclusion, we may thus derive a nonasymptotic test, based on the statistic ζn . In R, this
test is performed with the command ks.test.
Corollary 4.2.15 (Nonasymptotic Kolmogorov test). Under the assumptions of Lemma 4.2.10,
the test with rejection region
Wn = {ζn ≥ zn,1−α }
where zn,r is the quantile of order r of Kolmogorov’s law with parameter n, has level α.
H0 = {P ∈ P0 }, H1 = {P 6∈ P0 }.
We shall study the specific case (once again, similar to that of Section 4.1.3) where P0 is a para-
metric family, which thus writes P0 = {P0,θ , θ ∈ Θ}. The natural approach then consists of two
steps:
(i) estimate the parameter θ by some estimator θbn ;
(ii) compare the distance between the empirical CDF Fbn of the sample, and the CDF F0,θbn of
the probability measure in P0 corresponding to the estimated value θbn of θ.
In some cases, it may then be proved that the law of the statistic
ζn′ = sup_{x∈R} |Fbn (x) − F0,θbn (x)|
is free under H0 , that is to say that it depends on n (and on the model P0 ), but not on the underlying
value θ of the parameter. In order to check this property, it is useful to mimic the proof of
Lemma 4.2.10 and to express both Fbn (x) and θbn in terms of independent uniform random vari-
ables U1 , . . . , Un . Just like the estimation of θ by θbn changes the number of degrees of freedom
of the limiting χ2 distribution in the context of Section 4.1.3 (see Proposition 4.1.13), the law
of ζn′ under H0 will generally not be the Kolmogorov law from Section 4.2.3, but a certain
probability measure, depending on the model P0 , whose quantiles zn,r′ need to be computed, for
instance by numerical simulation.
The example of the case where P0 is the Gaussian model is detailed below, and we refer
to Exercise 4.A.5 for the case of the Exponential model (hinted at in Example 4.2.16). In the
Gaussian case, the resulting test is called the Lilliefors test9 . By extension, the two-step approach
described above is sometimes referred to as the Lilliefors correction to the Kolmogorov test.
x = 1.05, v = 1.422 .
The corresponding histogram is plotted on Figure 4.4, together with the density of the N(x, v)
distribution. We want to know whether the series is normally distributed. We employ Lilliefors’
Figure 4.4: Histogram of the sample, and density with estimated parameters, for the Lilliefors test.
(ii) To check the freeness, under H0 , of ζn′ , we first write, for all (µ, σ 2 ) ∈ R × (0, +∞),
F0,(µ,σ2 ) (x) = Φ((x − µ)/σ),

where Φ denotes the CDF of the N(0, 1) distribution. As a consequence, letting Xi =
F0,(µ,σ2 )−1 (Ui ), where U1 , . . . , Un are independent uniform random variables on [0, 1],
where u = F0,(µ,σ2 ) (x). We denote by ΦUn (u) the right-hand side, and finally obtain that
ζn′ = sup_{u∈(0,1)} | (1/n) Σ_{i=1}^{n} 1{Ui ≤u} − ΦUn (u) |,
the law of which does not depend on the parameters µ and σ 2 but only on n.
(iii) With our data, Kolmogorov’s statistic takes the value ζn′ = 0.05. To compute the p-value
of the test, we take Nsim ≫ 1 and, for all m ∈ {1, . . . , Nsim }, draw a sample Un(m) =
(U1(m) , . . . , Un(m) ) of independent uniform random variables on [0, 1] and compute the value
of

Zn(m) = sup_{u∈(0,1)} | (1/n) Σ_{i=1}^{n} 1{Ui(m) ≤u} − ΦUn(m) (u) |.

We then approximate the p-value corresponding to the realisation ζn′ = 0.05 by

p-value = P(Zn(1) ≥ 0.05) ≃ (1/Nsim ) Σ_{m=1}^{Nsim} 1{Zn(m) ≥0.05} .
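The Monte Carlo scheme of point (iii) can be sketched in Python as follows. Since ζn′ is free under H0 , one may equivalently draw standard Gaussian samples and refit (µ, σ²) on each of them; the sample size n = 100 and the observed value 0.05 below are placeholders, as the text does not restate its sample size here.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

def lilliefors_statistic(x):
    """KS distance between the empirical CDF of x and the fitted Gaussian CDF."""
    n = len(x)
    x = np.sort(x)
    u = norm.cdf(x, loc=x.mean(), scale=x.std())
    grid = np.arange(1, n + 1) / n
    # sup |Fbn - F| is attained on either side of a jump of Fbn
    return max(np.max(grid - u), np.max(u - (grid - 1 / n)))

# Placeholder values; the statistic is free under H0, so N(0, 1) samples suffice.
n, n_sim, zeta_obs = 100, 2000, 0.05
sims = np.array([lilliefors_statistic(rng.standard_normal(n))
                 for _ in range(n_sim)])
p_value = np.mean(sims >= zeta_obs)   # Monte Carlo approximation of the p-value
print(p_value)
```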
4.A Exercises
Exercise 4.A.1 (χ2 test in the parametric framework). When the state space X is finite with cardi-
nality m, one can actually resort to the parametric framework by introducing the notation
Θ = { (px )x∈X ∈ [0, 1]m : Σ_{x∈X} px = 1 },
and consider that a probability measure on X is parametrised by its probability mass function
θ = (px )x∈X . In this context, show that the MLE of θ is Pbn . ◦
↸ Exercise 4.A.2. During 300 minutes, the number of clients entering a shop per minute is
recorded. The results are reported below:
Number of clients 0 1 2 3 4 5
Number of minutes 23 75 68 51 53 30
Does the number of clients entering the shop per minute follow a Poisson distribution? You may
use the R command 1-pchisq(x,n) to compute the probability that a χ2 (n)-distributed variable
takes values larger than x, and thus return a p-value. ◦
Exercise 4.A.3 (On Donsker’s Theorem). We recall the Definition 4.2.2 of the Brownian bridge.
The purpose of this exercise is to give a proof of a weak form of the convergence stated in Theo-
rem 4.2.3.
1. For all t ∈ [0, 1], compute the variance of β(t).
2. Let F be a CDF on R, and let x1 ≤ · · · ≤ xd in R. What is the law of the random vector
G = (β(F (x1 )), . . . , β(F (xd )))?
3. With the notation of Theorem 4.2.3, show that when n → +∞, the random vector Gn =
(Gn (x1 ), . . . , Gn (xd )) converges in distribution to G.
The property which we established is called the convergence in finite-dimensional distribution. ◦
Exercise 4.A.4 (Nonasymptotic Kolmogorov test). Let (Xi )i≥1 be a sequence of iid random vari-
ables with CDF F on R.
1. Using Hoeffding’s inequality, show that for all a > 0,
sup_{x∈R} P( √n |Fbn (x) − F (x)| ≥ a ) ≤ 2 exp(−2a2 ).
In 1956, Dvoretzky, Kiefer and Wolfowitz10 proved that there exists a constant C such that
P( sup_{x∈R} √n |Fbn (x) − F (x)| ≥ a ) ≤ C exp(−2a2 ).
In 1958, Birnbaum and McCarty11 conjectured that the best constant was C = 2, and this
conjecture was proved by Massart12 in 1990.
2. Deduce a nonasymptotic version of the Kolmogorov test, with level at most α, which does
not require to compute the quantiles of the Kolmogorov statistic. What is another benefit of
this test? ◦
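For illustration, here is a sketch of the resulting nonasymptotic test (in Python; the function names are ours): the DKW inequality with Massart's constant C = 2 yields the explicit threshold a(α) = √(ln(2/α)/2), so no quantile of the Kolmogorov statistic is needed.

```python
import numpy as np

def dkw_threshold(alpha):
    # Solve 2 * exp(-2 a^2) = alpha for a: by the DKW-Massart inequality,
    # rejecting when the statistic exceeds a gives level at most alpha.
    return np.sqrt(np.log(2.0 / alpha) / 2.0)

def ks_statistic(sample, cdf):
    # sup_x sqrt(n) |F_n(x) - F(x)|; the supremum is attained at sample points.
    x = np.sort(np.asarray(sample))
    n = len(x)
    F = cdf(x)
    upper = np.max(np.arange(1, n + 1) / n - F)  # F_n(x_i) - F(x_i)
    lower = np.max(F - np.arange(0, n) / n)      # F(x_i) - F_n(x_i^-)
    return np.sqrt(n) * max(upper, lower)

rng = np.random.default_rng(0)
sample = rng.uniform(size=500)
stat = ks_statistic(sample, lambda t: np.clip(t, 0.0, 1.0))  # H0: U(0, 1)
reject = stat >= dkw_threshold(0.05)
```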
↸ Exercise 4.A.5 (Lilliefors correction for the Exponential model). Let X1 , . . . , Xn be iid positive
random variables, with common but unknown distribution P . We wish to test whether P is an
exponential distribution, and therefore write H0 = {there exists λ > 0 such that P = E(λ)}.
10. A. Dvoretzky, J. Kiefer and J. Wolfowitz, Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator, Annals of Mathematical Statistics, vol. 27, 1956, p. 642–669.
11. Z. W. Birnbaum and R. McCarty, A distribution-free upper confidence bound for P(Y < X), based on independent samples of X and Y, Annals of Mathematical Statistics, vol. 29, 1958, p. 558–562.
12. P. Massart, The tight constant in the Dvoretzky–Kiefer–Wolfowitz inequality, The Annals of Probability, vol. 18, 1990, p. 1269–1283.
1. Under H0, recall the expression of the MLE λ̂_n of λ.
2. Let F̂_n be the empirical CDF of the sample and, for all λ > 0, let F_{0,λ} be the CDF of the distribution E(λ). Show that under H0, the law of the statistic

    ζ'_n = sup_{x∈ℝ} | F̂_n(x) − F_{0,λ̂_n}(x) |

does not depend on λ.
3. We consider two samples of n = 200 values each, whose histograms, superposed with the density of the associated E(λ̂_n) distribution, are reported below.

[Figure: histograms of the two samples with the fitted E(λ̂_n) densities (axes: s, Density).]
The associated respective values of the statistic ζ'_n are 0.066 and 0.089. Using Monte-Carlo simulations in R of the law of ζ'_n under H0, compute the p-value associated with each sample. What do you conclude? ◦
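A possible Monte-Carlo scheme, sketched here in Python rather than R (the function names are ours), relies on the fact that under H0 the law of ζ'_n does not depend on λ, so it can be simulated with E(1) samples.

```python
import numpy as np

def lilliefors_exp_stat(sample):
    # zeta'_n = sup_x |F_n(x) - F_{0,lambda_hat}(x)|, with the MLE
    # lambda_hat = 1 / sample mean plugged into the exponential CDF.
    x = np.sort(np.asarray(sample))
    n = len(x)
    lam = 1.0 / x.mean()
    F0 = 1.0 - np.exp(-lam * x)
    upper = np.max(np.arange(1, n + 1) / n - F0)
    lower = np.max(F0 - np.arange(0, n) / n)
    return max(upper, lower)

def mc_pvalue(observed, n, n_sim=2000, seed=0):
    # Under H0 the law of zeta'_n is free of lambda: simulate with E(1).
    rng = np.random.default_rng(seed)
    sims = np.array([lilliefors_exp_stat(rng.exponential(size=n))
                     for _ in range(n_sim)])
    return float(np.mean(sims >= observed))

p1 = mc_pvalue(0.066, n=200)
p2 = mc_pvalue(0.089, n=200)
```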
4.B Summary
4.B.1 General framework
• No shape assumption is made on the common law P of the observed data X1 , . . . , Xn .
• Goodness-of-fit test:
a probability measure P0 on X is fixed;
H0 = {P = P0}, H1 = {P ≠ P0};
H0 is rejected as soon as some 'distance' between P̂_n and P0 is too large.
4.B.2 χ2 tests
• X is a finite space with cardinality m.
• Probability measures P are represented by their probability mass functions (px )x∈X .
In the framework of regression, two observed variables x and y are assumed to be related by the identity

    y = f(x, ε),    (Reg)

where the function f is unknown and ε is an unobserved random term, often called noise or error, accounting either for the inaccuracy of the model (for example, the dependence of y upon variables other than x which are not taken into account), or for the natural variability of y. Starting from
a sample composed of pairs (x1 , y1 ), . . . , (xn , yn ), where the variables x1 , . . . , xn may be either
determined by the experimenter (fixed design) or randomly observed (random design), the purpose
of regression is to:
• reconstruct the function f ;
• estimate parameters of the law of ε;
in order to be able to predict, with a controlled degree of accuracy, the value of an outcome y for
a new value of x.
In general, x is multivariate, and its coordinates x1 , . . . , xp are called explanatory variables,
also features, regressors, or independent variables, although this designation may be confusing
because these variables need not be statistically independent. The variable y is the explained variable, also response or dependent variable^1.
The term regression is usually employed when y takes continuous — or at least numerical
— values. If y (or equivalently f ) takes values in a categorical set, those values may be seen as
classes, and the problem is recast as finding the class to which a realisation of x belongs. In this
context, the term classification is preferred.
Example 5.0.1 (Regression and classification). An Internet search engine records, in n regions,
the number of queries for a certain number p of keywords associated to flu, for instance ‘flu
symptoms’, ‘fever’ and ‘sneezing’. For each region i, a vector xi = (x1i , . . . , xpi ) of Rp is thus
obtained. Certainly, the number yi of people actually infected by flu in the i-th region depends on
xi , and it is reasonable to postulate a relation of the type of Equation (Reg), where f is increasing
with respect to each variable x^j, while ε encodes the natural variability of the model. Estimating f from the n samples is a regression problem, and makes it possible to predict, for a new region, the number y_{n+1} of infections from the observation of queries x_{n+1} in this region. If one is not interested
in the precise number of infections, but only in determining whether a region is undergoing an
epidemic or not, then one faces a classification problem.
1
En économétrie, les coordonnées de x sont les variables exogènes, et y est la variable endogène. En anglais, les
terminologies independent et dependent sont employées.
Remark 5.0.2 (Supervised and unsupervised learning). Regression and classification methods base their predictions of y on a dataset where the pairs (x_i, y_i) are available. These methods are called supervised, in the sense that they start with a learning phase on a training dataset. In contrast, the data analysis methods of Chapter 1 are called unsupervised, as they retrieve information from the data without any preliminary training. ◦
so that

    y_n = x_n β + ε_n.    (LinReg)
The parameter β0 is called the intercept. It may be removed from the model, in which case the
first column of the matrix xn must be removed. All subsequent computations then remain valid,
up to replacing p + 1 with p. In the sequel we always include the intercept in the model.
Proposition 5.1.1 (LSE for the simple linear regression). Assume that n ≥ 2 and that there exist i, j such that x_i ≠ x_j. The LSE of (β0, β1) is given by

    β̂0 = ȳ_n − β̂1 x̄_n,    β̂1 = Cov(x_n, y_n) / Var(x_n).
Exercise 5.1.2. Prove Proposition 5.1.1. ◦
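As a quick numerical check of these formulas (a Python sketch on synthetic data; the course TPs use R):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=50)

# LSE of Proposition 5.1.1: beta1 = Cov(x, y) / Var(x), beta0 = ybar - beta1 * xbar.
beta1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
beta0 = y.mean() - beta1 * x.mean()

# Cross-check against numpy's least-squares line fit.
b1_ref, b0_ref = np.polyfit(x, y, 1)
```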
Once the LSE (β̂0, β̂1) is computed, it may be interesting to quantify the quality of the linear model by measuring whether the points (x_i, y_i) are far from the line with equation y = β̂0 + β̂1 x or not. To do so, one may define ŷ_n = (ŷ_1, ..., ŷ_n) ∈ ℝ^n by ŷ_i = β̂0 + β̂1 x_i, and define the residual error between y_n (the data) and ŷ_n (the model) by

    (1/n) ‖y_n − ŷ_n‖² = (1/n) ∑_{i=1}^n (y_i − β̂0 − β̂1 x_i)²
                       = (1/n) ∑_{i=1}^n (y_i − ȳ_n + β̂1 x̄_n − β̂1 x_i)²
                       = (1/n) ∑_{i=1}^n [ (y_i − ȳ_n)² − 2 β̂1 (y_i − ȳ_n)(x_i − x̄_n) + β̂1² (x_i − x̄_n)² ]
                       = Var(y_n) − 2 β̂1 Cov(x_n, y_n) + β̂1² Var(x_n)
                       = Var(y_n) − Cov(x_n, y_n)² / Var(x_n)
                       = Var(y_n) (1 − Corr(x_n, y_n)²).
Remark 5.1.4 (Geometric interpretation of LSE). In its geometric interpretation, the LSE problem
amounts to minimising the Euclidean distance, in the vector space Rn , between the vector yn and
the linear subspace spanned by the vectors 1n = (1, . . . , 1) and xn — in other words, finding the
orthogonal projection of y_n onto this linear subspace. The assumption made in Proposition 5.1.1 that there exist i, j such that x_i ≠ x_j ensures that this subspace has dimension 2. ◦
This geometric interpretation is the cornerstone of the estimation of β in the general case
p ≥ 2, which is developed in the next paragraph.
where ‖·‖ denotes the Euclidean norm in ℝ^n. Thus, following the geometric interpretation of Remark 5.1.4 leads to the next result.
Proposition 5.1.5 (LSE for the multiple linear regression). Assume that p ≤ n − 1, and that x_n has full rank^2 p + 1. The LSE of β is the unique vector β̂ ∈ ℝ^{p+1} such that ŷ_n = x_n β̂ is the orthogonal projection of y_n onto the range of x_n, and it is given by

    β̂ = (x_n^⊤ x_n)^{−1} x_n^⊤ y_n.
Proof. Let ŷ_n ∈ ℝ^n denote the orthogonal projection of y_n onto the range of x_n. By the definition of ŷ_n, there exists β̂ ∈ ℝ^{p+1} such that ŷ_n = x_n β̂; besides, for all u ∈ ℝ^{p+1}, the vectors ŷ_n − y_n and x_n u are orthogonal in ℝ^n, which implies that

    x_n^⊤ (ŷ_n − y_n) = 0,

so that^3

    x_n^⊤ x_n β̂ = x_n^⊤ y_n.
1. Assume that there is u ∈ ℝ^{p+1} \ {0} such that x_n u = 0. Show that the rank of x_n is at most p. Hint: think of the Rank-nullity Theorem^4.
2. 'Full rank' translates the French 'de rang complet'.
3. You may also observe that this is the first-order optimality condition associated with the least squares problem.
4. The Rank-nullity Theorem is the 'théorème du rang' in French.
Notice that if p + 1 ≤ n but the rank of xn is lower than p + 1, then at least one of the columns
of xn is a linear combination of the other ones, so that it may be removed from xn without loss
of information, and the process can be iterated until the matrix xn recovers full rank. Still, this
method requires the number of features p to be lower than the number of observations n.
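The closed-form expression of Proposition 5.1.5 can be checked numerically (a Python sketch on synthetic data; in practice one would rather use a QR-based solver, since the normal equations can be ill-conditioned):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept column
beta_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Proposition 5.1.5: beta_hat = (X^T X)^{-1} X^T y (X assumed of full rank p+1).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Same orthogonal projection computed by a numerically stabler routine.
beta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
```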
where λ > 0 is a given numerical parameter, and ‖β‖₀ denotes the number of nonzero coordinates of β: an optimal solution should then have a low ‖·‖₀-norm. This problem is actually computationally difficult, and it may be fruitfully relaxed to the minimisation problem

    min_{β∈ℝ^{p+1}} ‖y_n − x_n β‖² + λ ‖β‖₁,

where ‖β‖₁ = ∑_{j=0}^p |β_j|. The obtained estimator is called LASSO (for Least Absolute Shrinkage and Selection Operator), and was introduced by Tibshirani^6 in 1996. It is nowadays considered a standard variable selection tool.
When using such a penalisation method, the value of the estimator depends on the preliminary choice of the parameter λ. If λ is small, then β̂ is close to the LSE but not very sparse. On the contrary, if λ is large, then the estimator may be far from the LSE but very sparse. This phenomenon is related to the bias-variance tradeoff discussed in Exercise 2.1.11, p. 30: if λ is small, then the value of β̂ highly depends on the data, so that it is expected to have a small bias but a large variance; in the context of regression, this is an example of overfitting. On the contrary, if λ is large, then β̂ only weakly depends on the data, so that it has a small variance but a large bias: this is a situation of underfitting. ◦
As in the case of simple regression, once the LSE of β is computed, it is useful to assess the quality of the linear fit on the data. We would therefore like to define a coefficient of determination R² such that the residual error reads

    (1/n) ‖y_n − ŷ_n‖² = Var(y_n) (1 − R²).
However, in the present case, the expression of Definition 5.1.3 may not be straightforwardly extended, because the fact that each x_i is multivariate prevents the correlation between x_n and y_n from being well-defined. We thus provide a more general definition by writing first

    y_n − ȳ_n 1_n = (y_n − ŷ_n) + (ŷ_n − ȳ_n 1_n).
5. 'Sparse' translates the French 'parcimonieux'.
6. R. Tibshirani, Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society, Series B (Methodological), vol. 58(1), 1996, p. 267–288.
By construction, y_n − ŷ_n is orthogonal to the range of x_n. On the other hand, since 1_n is the first column of x_n, both ŷ_n and ȳ_n 1_n belong to the range of x_n. Therefore, by Pythagoras' Theorem,

    (1/n) ‖y_n − ȳ_n 1_n‖² = (1/n) ‖y_n − ŷ_n‖² + (1/n) ‖ŷ_n − ȳ_n 1_n‖²,

where the left-hand side is Var(y_n), the first term of the right-hand side is the residual error, and the second term is Var(ŷ_n).
The relevance of the choice of the LSE as an estimator of β is justified by the following result.

Exercise 5.1.10. Under the assumptions of Proposition 5.1.5, show that the LSE β̂ of β is the Maximum Likelihood Estimator of β, and that it is unbiased. ◦
A further justification is provided by the Gauss–Markov Theorem studied in Exercise 5.A.1. In this Gaussian setting, the law of the LSE β̂ is explicit, and an estimator σ̂² of σ² is also available.

Proposition 5.1.11 (Law of (β̂, σ̂²)). Assume that ε_1, ..., ε_n are independent N(0, σ²) random variables, that p ≤ n − 1, and that the matrix x_n has full rank p + 1.

(i) The LSE β̂ satisfies β̂ ∼ N_{p+1}(β, σ² (x_n^⊤ x_n)^{−1}).

(ii) The estimator σ̂² of σ² defined by

    σ̂² = ‖y_n − ŷ_n‖² / (n − p − 1)

is such that (n − p − 1) σ̂²/σ² ∼ χ²(n − p − 1); in particular it is unbiased.
which by Proposition A.2.3 in Appendix A shows that β̂ is a Gaussian vector in ℝ^{p+1} with mean β and covariance matrix

    (x_n^⊤ x_n)^{−1} x_n^⊤ (σ² I_n) x_n (x_n^⊤ x_n)^{−1} = σ² (x_n^⊤ x_n)^{−1}.

We now denote by Π = x_n (x_n^⊤ x_n)^{−1} x_n^⊤ the orthogonal projection of ℝ^n onto the range of x_n, and Π^⊥ = I_n − Π. By definition,

    y_n − ŷ_n = Π^⊥ y_n = Π^⊥ (x_n β + ε_n).
Proposition 5.1.11 makes it possible, for instance, to construct confidence regions for β and σ²; see an example in Exercise 5.A.2. It may also be applied to the prediction of the outcome y corresponding to a value of x which has not been observed yet. Indeed, for a new vector x_{n+1} ∈ ℝ^p, the LSE provides a natural predictor

    ŷ_{n+1} = β̂0 + ∑_{j=1}^p β̂_j x^j_{n+1}.
In the Gaussian model, the precision of this predictor can be measured by a confidence interval.
Proposition 5.1.12 (Prediction with linear regression). Given x_{n+1} = (x^1_{n+1}, ..., x^p_{n+1}) ∈ ℝ^p, let us define the row vector x'_{n+1} = (1, x^1_{n+1}, ..., x^p_{n+1}) ∈ ℝ^{p+1} and κ = 1 + x'_{n+1} (x_n^⊤ x_n)^{−1} (x'_{n+1})^⊤. Then

    [ ŷ_{n+1} − t_{n−p−1,1−α/2} √(σ̂² κ),  ŷ_{n+1} + t_{n−p−1,1−α/2} √(σ̂² κ) ]

is a confidence interval for y_{n+1} with level 1 − α, where t_{m,r} denotes the quantile of order r of Student's distribution with m degrees of freedom.
    (y_{n+1} − ŷ_{n+1}) / √(σ̂² κ) ∼ t(n − p − 1),

which completes the proof.
Remark 5.1.13. One may also be interested in the prediction of x'_{n+1} β, without taking into account the noise ε_{n+1} associated with the (n + 1)-th observation. In this case, the very same proof as in Proposition 5.1.12 shows that a confidence interval for x'_{n+1} β with level 1 − α is given by

    [ ŷ_{n+1} − t_{n−p−1,1−α/2} √(σ̂² λ),  ŷ_{n+1} + t_{n−p−1,1−α/2} √(σ̂² λ) ],

with λ = x'_{n+1} (x_n^⊤ x_n)^{−1} (x'_{n+1})^⊤ = κ − 1. ◦
Proof. Let e_j = (0, 0, ..., 1, ..., 0), seen as a row vector of ℝ^{p+1}, where the j-th coefficient is equal to 1 and the coefficients are indexed from 0 to p. By Proposition 5.1.11,

    β̂_j = e_j β̂ ∼ N(β_j, σ² ρ_j),

so that, under H0 = {β_j = 0},

    Y = e_j β̂ / √(σ² ρ_j) ∼ N(0, 1),    independent from    Z = (n − p − 1) σ̂² / σ² ∼ χ²(n − p − 1),

so that

    e_j β̂ / √(σ̂² ρ_j) = Y / √(Z/(n − p − 1)) ∼ t(n − p − 1),

from which the construction of the Student test follows.
Fisher's test makes it possible to test the joint influence of several features. Up to applying a permutation to the indices of the features, the corresponding null and alternative hypotheses may be written accordingly. The test rejecting H0 when

    F ≥ f_{p−q,n−p−1,1−α},

where f_{p−q,n−p−1,1−α} is the quantile of order 1 − α of the Fisher distribution F(p − q, n − p − 1), has level α.
As for Student’s test above, the proof of Lemma 5.1.15 reduces to showing that under H0 ,
F ∼ F(p − q, n − p − 1). It is left as an exercise.
Example 5.2.1 (Spam detection). Let x^1, ..., x^p denote the frequencies of occurrence of p keywords (such as 'Viagra', 'Drug', etc.) contained in an email, and y = 1 if this email is a spam: the larger the number of occurrences in a given email, the larger the probability that the email is a spam. After training a spam detector on a dataset, say of n = 10000 emails, 500 of which actually are spams, an email client may adopt the rule of rejecting an email as soon as the probability that it is a spam exceeds a given threshold, say 0.7.
    ∀u ∈ ℝ,  Ψ(u) = exp(u) / (1 + exp(u)) = 1 / (1 + exp(−u)),
Figure 5.2: The logistic function Ψ, which takes its values between 0 and 1.
In this parametric context, the estimation of x ↦ p(1|x) reduces to the estimation of the parameter β ∈ ℝ^{p+1}, similarly to linear regression.
Considering y_1, ..., y_n as independent realisations of Bernoulli random variables with respective parameters p(1|x_1), ..., p(1|x_n), we write the likelihood of the model

    L_n(y_n, x_n; β) = ∏_{i=1}^n p(1|x_i)^{y_i} (1 − p(1|x_i))^{1−y_i}
                     = ∏_{i=1}^n Ψ(x'_i β)^{y_i} (1 − Ψ(x'_i β))^{1−y_i}
                     = ∏_{i=1}^n exp(y_i x'_i β) / (1 + exp(x'_i β)),
with the same notation x'_i = (1, x^1_i, ..., x^p_i) as for linear regression. The MLE of β is denoted by β̂. It does not possess an analytic formula, but can be computed by statistical software using numerical approximations of the system of optimality conditions (see for instance [1, Section 4.4.1]):

    ∂ℓ_n/∂β_0 = ∑_{i=1}^n (y_i − Ψ(x'_i β)) = 0,    ∂ℓ_n/∂β_j = ∑_{i=1}^n x^j_i (y_i − Ψ(x'_i β)) = 0,  j = 1, ..., p.

The corresponding estimator of p(1|x) is denoted by p̂(1|x) = Ψ(x' β̂).
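These optimality conditions are typically solved by Newton's method (equivalently, iteratively reweighted least squares); the following Python sketch (synthetic data, names ours) implements it and checks that the score vanishes at β̂.

```python
import numpy as np

def logistic_mle(X, y, n_iter=25):
    # Newton-Raphson on the log-likelihood: solves the optimality conditions
    # sum_i (y_i - Psi(x_i' beta)) x_i^j = 0 (intercept included as x_i^0 = 1).
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))    # Psi(x_i' beta)
        W = mu * (1.0 - mu)                     # Bernoulli variances
        grad = X.T @ (y - mu)
        hess = X.T @ (X * W[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(3)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.5])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

beta_hat = logistic_mle(X, y)
score = X.T @ (y - 1.0 / (1.0 + np.exp(-X @ beta_hat)))  # should be ~0 at the MLE
```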
Once the parameter β is estimated, the prediction of the value y_{n+1} associated with a new vector of features x_{n+1} ∈ ℝ^p can be made following the rule

    ŷ_{n+1} = 1 if p̂(1|x_{n+1}) ≥ p_0,  and  ŷ_{n+1} = 0 otherwise,

where p_0 ∈ [0, 1] is a given threshold. Such a rule, assigning a class y ∈ {0, 1} to a vector of features x ∈ ℝ^p, is called a classifier. As is the case for hypothesis testing, the value of p_0 can be tuned if one wants the classifier to be more or less conservative.
The following theoretical result is admitted (it is based on [3, Theorem 12.4.2, p. 515], which is sometimes called Wilks' Theorem^7).

Lemma 5.2.2 (Behaviour of the likelihood ratio). Under H0 and suitable conditions on x_n, Λ(y_n|x_n) converges in distribution to the χ²(1) distribution when n goes to infinity.
Lemma 5.2.2 ensures that the test rejecting H0 if Λ(y_n|x_n) ≥ χ²_{1,1−α} has asymptotic level α.
5.A Exercises
Exercise 5.A.1 (Gauss–Markov Theorem). We assume that we observe realisations (x_i, y_i) ∈ ℝ^p × ℝ, i ∈ {1, ..., n}, associated with the linear model

    y_i = β_0 + ∑_{j=1}^p β_j x^j_i + ε_i,
7. S. S. Wilks, The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses, The Annals of Mathematical Statistics, vol. 9(1), 1938, p. 60–62.
where ε_1, ..., ε_n are supposed to be random variables such that E[ε_i] = 0 and E[ε_i ε_j] = σ² 1_{i=j}. An estimator β̃ of β is called linear if there exists a matrix a_n ∈ ℝ^{(p+1)×n} such that

    β̃ = a_n y_n,    y_n = (y_1, ..., y_n)^⊤.
3. Show that this covariance matrix is larger^8, in the sense of symmetric matrices, than the covariance matrix of β̂. Hint: introduce the matrix d_n = a_n − (x_n^⊤ x_n)^{−1} x_n^⊤ and compute d_n d_n^⊤. ◦
Exercise 5.A.2 (Confidence ellipsoids for β in the linear regression). Under the assumptions of Proposition 5.1.11, let λ_1 ≥ ... ≥ λ_{p+1} > 0 be the eigenvalues of the matrix x_n^⊤ x_n, and let (e_1, ..., e_{p+1}) be an associated orthonormal basis.

1. Show that the random variables √λ_j ⟨β̂ − β, e_j⟩ are iid according to N(0, σ²).

2. Determine a such that

    P(β ∈ C(a)) = 1 − α,

for a given level α > 0.

3. If σ² is no longer assumed to be known and has to be estimated by σ̂², how should the definition of C(a) be modified for the identity P(β ∈ C(a)) = 1 − α to remain valid? ◦
Exercise 5.A.3 (Principal Component Regression). The purpose of this exercise is to study linear regression on the principal components associated with the original explanatory variables x^1, ..., x^p, which is particularly useful, and entails dimensionality reduction, when the explanatory variables are correlated. Throughout the text, we fix n vectors x_1, ..., x_n ∈ ℝ^p. For the sake of simplicity, we do not include intercepts, so that the matrix x_n simply reads

    x_n = ( x^1_1 · · · x^p_1 )
          (  ...         ...  )
          ( x^1_n · · · x^p_n )  ∈ ℝ^{n×p}.

8. We recall that if A and B are symmetric matrices of size q × q, A is said to be larger than B if ⟨u, (A − B)u⟩ ≥ 0 for all u ∈ ℝ^q.
We also assume that the data are centered, and introduce the matrix

    c_n = ( c^1_1 · · · c^p_1 )
          (  ...         ...  )
          ( c^1_n · · · c^p_n )  ∈ ℝ^{n×p},
2. Show that, for any β ∈ ℝ^p, there exists γ ∈ ℝ^p such that x_n β = c_n γ. Express the coordinates of β in terms of γ.

3. Let y_n ∈ ℝ^n be the corresponding explained variables, and let β̂, γ̂ respectively denote the LSE associated with the regression of y_n on x_n and on c_n. Show that x_n β̂ = c_n γ̂.
4. We now fix k ≤ p and consider the linear regression of y_n on the k first principal components. To this aim, we introduce the matrix

    c^k_n = ( c^1_1 · · · c^k_1 )
            (  ...         ...  )
            ( c^1_n · · · c^k_n )  ∈ ℝ^{n×k},

and denote by

    γ̂_k = ((c^k_n)^⊤ c^k_n)^{−1} (c^k_n)^⊤ y_n ∈ ℝ^k

the corresponding LSE. Show that the computation of γ̂_k reduces to k simple linear regressions.
(Notice the similarity of this definition with the result of Question 2.) Show that, in the Gaussian model, Cov[β̂_k] ≤ Cov[β̂], where the inequality has to be understood in the sense of symmetric matrices. ◦
5.B Summary
5.B.1 Regression and classification
Regression and classification aim to describe a functional relation between explanatory variables
x1 , . . . , xp and an explained variable y.
• Regression: y takes numerical values.
• Both methods are examples of supervised learning: the parameters are estimated on a
dataset containing pairs of explanatory and explained variables, and the estimation is then
employed to predict the value of y given a new value of x.
• In the Gaussian framework, βb is the MLE and it is the Best Unbiased Linear Estimator.
• Tests of independence and confidence intervals for the prediction of y may be constructed.
• The MLE β̂ is computed numerically; it may then be employed for classification or independence testing.
Chapter 6
In Chapters 3 and 4, hypothesis testing is made on data which take the form of a sample Xn =
(X1 , . . . , Xn ) of iid random variables. In practice, the formalism of hypothesis testing can be
expanded to address more general situations, the following two of which will be discussed in the
present chapter:
• independence tests: one observes iid realisations (X1 , Y1 ), . . . , (Xn , Yn ) and wishes to
know whether the random variables Xi and Yi are independent;
• homogeneity tests: one observes two (or more) samples X1,n1 = (X1,1 , . . . , X1,n1 ) and
X2,n2 = (X2,1 , . . . , X2,n2 ) and wishes to know whether these samples have the same dis-
tribution.
• to determine whether life expectancy^1 depends on the wealth of a country, you may collect the life expectancy X_i and the GDP^2 Y_i of a panel of n countries and test whether these variables are dependent or not;
• to determine whether a vaccine is efficient, you may take two groups of patients, respectively
with n1 and n2 people, administer the vaccine only to the first group, then measure the
concentration of antibodies X1,i1 and X2,i2 in the blood of the patients for both groups,
and finally test whether the samples X1,1 , . . . , X1,n1 and X2,1 , . . . , X2,n2 are identically
distributed.
The present chapter describes several independence and homogeneity tests, adapted either to
a parametric framework or to the nonparametric case. A summary of all these tests is provided in
Section 6.B at the end of the chapter.
two features Xi ∈ X and Yi ∈ Y are collected. A natural question is whether these features are
independent or not.
Example 6.1.1 (Is your second child more likely to be a boy when the first one is a boy?). Over a
population of 1000 families with two children, the gender of both children is recorded. The results
are represented in the contingency table of Table 6.1. We want to check whether the gender of the
first child has an influence on the gender of the second child.
Table 6.1: Contingency table for the study of Example 6.1.1. The cells contain the numbers of
individuals with the corresponding features.
The distribution of the pair (X_1, Y_1) is denoted by P. It is a probability measure on the finite space X × Y, with probability mass function (p_{x,y})_{(x,y)∈X×Y}. The marginal distributions of X_1 and Y_1 are respectively denoted by P^X and P^Y, and their probability mass functions (p^X_x)_{x∈X} and (p^Y_y)_{y∈Y} satisfy

    ∀x ∈ X, p^X_x = ∑_{y∈Y} p_{x,y},        ∀y ∈ Y, p^Y_y = ∑_{x∈X} p_{x,y}.

The variables X_1 and Y_1 are independent if and only if

    ∀(x, y) ∈ X × Y,  p_{x,y} = p^X_x p^Y_y.    (I)
We denote by P_0 the set of all probability measures on X × Y satisfying the condition (I). Notice that each element P of P_0 is characterised by the pair of vectors P^X and P^Y; therefore, because of the constraints ∑_{x∈X} p^X_x = 1 and ∑_{y∈Y} p^Y_y = 1, P_0 is parametrised by a subset Θ of ℝ^{(m−1)+(l−1)} with nonempty interior.
On the other hand, the quantities p_{x,y}, p^X_x and p^Y_y are respectively estimated by

    p̂_{n,x,y} = (1/n) ∑_{i=1}^n 1_{X_i=x, Y_i=y},    p̂^X_{n,x} = (1/n) ∑_{i=1}^n 1_{X_i=x},    p̂^Y_{n,y} = (1/n) ∑_{i=1}^n 1_{Y_i=y},

and we denote by P̂_n and P̂^X_n ⊗ P̂^Y_n the random probability measures on X × Y with respective probability mass functions (p̂_{n,x,y})_{(x,y)∈X×Y} and (p̂^X_{n,x} p̂^Y_{n,y})_{(x,y)∈X×Y}.
n,x p
With this notation, Proposition 4.1.13 in Chapter 4 allows to construct a test of independence,
for the hypotheses
d′n ≥ χ2(m−1)(l−1),1−α
In order to apply this test to the case of Example 6.1.1, we take X = Y = {m, f} to encode
the gender of the first and second child, respectively, and compute the empirical frequencies in
Table 6.2.
Table 6.2: Empirical frequencies for the study of Example 6.1.1. For the sake of legibility, we
omit the dependence upon n = 1000 in the notation of the empirical frequencies.
Using the R command qchisq(.9,1), we obtain that the quantile of order 0.9 of the χ²(1) distribution is 2.7, so that H0 cannot be rejected at the level α = 10% (and consequently at any lower level): we conclude that the gender of the second child is independent of the gender of the first child. Using the command 1-pchisq(.578,1), we actually obtain that the p-value of the test is approximately 0.45.
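In R, this χ² independence test is performed directly on the contingency table by chisq.test(table, correct=FALSE); the equivalent Python sketch below runs it on a purely hypothetical 2 × 2 table (not the data of Example 6.1.1).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table (NOT the data of Example 6.1.1):
# rows: gender of the first child, columns: gender of the second child.
table = np.array([[265, 245],
                  [250, 240]])

# correction=False yields the plain chi^2 statistic d'_n, to be compared
# with the quantiles of chi^2((m-1)(l-1)) = chi^2(1).
stat, pvalue, dof, expected = chi2_contingency(table, correction=False)
```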
Table 6.3: Representation of the samples X1,n1 and X2,n2 in a 2 × 2 contingency table.
We want to test whether the two samples have the same distribution, so that we set
    H0 = {p_1 = p_2},    H1 = {p_1 ≠ p_2}.
Example 6.2.1 (Decrease in Captain Haddock's alcoholism after meeting Tintin^3). In a very serious paper^4, the health issues of Captain Haddock in The Adventures of Tintin are studied. In particular, it is shown that before meeting Tintin, Captain Haddock sustains n_1 = 24 health impairments, 58.3% of which are due to alcohol, while after meeting Tintin, this proportion drops to 10.7% of his n_2 = 225 health impairments (see Table 6.4). The test of comparison of proportions makes it possible to decide whether this decrease is statistically significant.
Table 6.4: Contingency table for Captain Haddock’s health impairments (HI).
An estimator of the parameter (p_1, p_2) is (X̄_{1,n_1}, X̄_{2,n_2}), so it is natural to reject H0 when |X̄_{1,n_1} − X̄_{2,n_2}| takes large values. In order to ensure that the level of the test is lower than α, one should thus select a > 0 ensuring that the corresponding type I error remains below α. This type I error is not easily computable, which prevents one from obtaining a satisfactory threshold a. We hereby present an asymptotic and a nonasymptotic test circumventing this difficulty.
3. Suggested by K. Jean.
4. E. Caumes, L. Epelboin, G. Guermonprez, F. Leturcq and P. Clarke, Captain Haddock's health issues in the adventures of Tintin. Comparison with Tintin's health issues, La Presse Médicale, vol. 45(7-8), 2016, p. 225–232.
Remark 6.2.4 (Hypergeometric distribution). An integer random variable Z is said to have the hypergeometric distribution with parameters N ≥ 0, K ∈ {0, ..., N} and n ∈ {0, ..., N} if for all k ∈ {0, ..., N},

    P(Z = k) = C(K, k) C(N − K, n − k) / C(N, n),

where C(a, b) denotes the binomial coefficient. Thus, Lemma 6.2.3 can be rephrased as stating that the conditional distribution of Y_1 given Y is the hypergeometric distribution with parameters n_1 + n_2, n_1 and Y. ◦
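As a Python aside, the formula can be cross-checked against scipy's implementation of the hypergeometric law (note that scipy's parameter order differs from the notation above: its arguments are the population size N, the number K of success states, and the number n of draws).

```python
from math import comb
from scipy.stats import hypergeom

# Check P(Z = k) = C(K, k) C(N-K, n-k) / C(N, n) against scipy.
N, K, n, k = 20, 7, 12, 5
p_formula = comb(K, k) * comb(N - K, n - k) / comb(N, n)
p_scipy = hypergeom.pmf(k, N, K, n)
```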
Assume that the observed values of x̄_1 = y_1/n_1 and x̄_2 = y_2/n_2 are such that x̄_1 < x̄_2. To know whether this difference is statistically significant, one may try to compute the probability for X̄_{1,n_1} to take even smaller values than x̄_1 under H0, and reject H0 if this probability is small enough. Conditioning on the event {Y = y_1 + y_2} brings out the quantity P_{(p,p)}(Y_1 ≤ y_1 | Y = y_1 + y_2), which by Lemma 6.2.3 does not depend on p and therefore can be computed explicitly. This conditioning procedure is the cornerstone of Fisher's exact test.
Definition 6.2.5 (Fisher's exact test). Fisher's exact test is defined by the following procedure. Denote by y_1 and y_2 the observed values of Y_1 and Y_2, and write x̄_1 = y_1/n_1, x̄_2 = y_2/n_2.

• If x̄_1 = x̄_2, accept H0.

Proposition 6.2.6 (Fisher's exact test). Fisher's exact test has a level lower than α.
Proof. Let W_{n_1,n_2} refer to the event 'H0 is rejected'. For all p ∈ [0, 1], we first decompose P_{(p,p)}(W_{n_1,n_2}) over the events {Y = y}, y = 0, ..., n_1 + n_2. By the construction of the test, P_{(p,p)}(W_{n_1,n_2}, X̄_{1,n_1} = X̄_{2,n_2} | Y = y) = 0. We now write

    P_{(p,p)}(W_{n_1,n_2}, X̄_{1,n_1} < X̄_{2,n_2} | Y = y) = P_{(p,p)}(ρ_y(Y_1) ≤ α/2, X̄_{1,n_1} < X̄_{2,n_2} | Y = y)
                                                          ≤ P_{(p,p)}(ρ_y(Y_1) ≤ α/2 | Y = y),
where ρ_y is the cumulative distribution function of the hypergeometric distribution with parameters n_1 + n_2, n_1 and y; in other words, ρ_y is the cumulative distribution function of the conditional distribution of Y_1 on the event {Y = y}. Thus, Exercise 6.2.8 below yields
Remark 6.2.7. The denomination ‘exact’ for this test has to be understood as ‘nonasymptotic’.
However, it is important to observe that Proposition 6.2.6 only provides an upper bound on the
level of the test. ◦
Exercise 6.2.8. Let Z be a random variable with cumulative distribution function ρ. Show that, for all r ∈ [0, 1],

    P(ρ(Z) ≤ r) ≤ r.

Hint: recall that in the proof of Lemma 4.2.13, p. 93, we showed that for any u ∈ (0, 1), ρ(ρ^{−1}(u)) ≥ u. ◦
Applying Fisher's exact test on the data of Example 6.2.1 with the R command fisher.test yields a p-value of the order of 10^{−7}, so that we reject H0 at all usual levels.
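The same computation can be sketched in Python; the counts below are reconstructed from the percentages of Example 6.2.1 (58.3% of 24 ≈ 14 alcohol-related impairments before meeting Tintin, 10.7% of 225 ≈ 24 after), so they are our approximation of the original data.

```python
from scipy.stats import fisher_exact

# Rows: before / after meeting Tintin; columns: alcohol-related HI / other HI.
table = [[14, 24 - 14],
         [24, 225 - 24]]
odds_ratio, pvalue = fisher_exact(table, alternative='two-sided')
```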
    ∀i_1 ∈ {1, ..., n_1}, X_{1,i_1} ∼ N(μ_1, σ_1²),    ∀i_2 ∈ {1, ..., n_2}, X_{2,i_2} ∼ N(μ_2, σ_2²).

    H0 = {μ_1 = μ_2},    H1 = {μ_1 ≠ μ_2}.
Example 6.2.9 (Grades of IMI and SEGF students in 2018^5). The statistics associated with the grades at the final exam of the course Statistics and Data Analysis in 2018 are reported in Table 6.5. Assuming that these samples are Gaussian, we want to know whether there is a statistically significant difference between IMI and SEGF students.

5. Thanks to P. Gréaume for this study.
                          IMI         SEGF
    Number of students    n_1 = 43    n_2 = 35
    Average               14.08       12.86
    Standard deviation    1.9         1.84

Table 6.5: Statistics of the grades at the final exam of the course Statistics and Data Analysis in 2018.
    H0 = {μ_1 = μ_2},    H1 = {μ_1 ≠ μ_2},

has level α.
while

    [ (n_1 − 1) S²_{1,n_1} + (n_2 − 1) S²_{2,n_2} ] / σ² ∼ χ²(n_1 + n_2 − 2),

and these two variables are independent. As a consequence,

    T_{n_1,n_2} = (X̄_{1,n_1} − X̄_{2,n_2}) / √(σ² (1/n_1 + 1/n_2))
                ÷ √( [ (n_1 − 1) S²_{1,n_1} + (n_2 − 1) S²_{2,n_2} ] / ((n_1 + n_2 − 2) σ²) ) ∼ t(n_1 + n_2 − 2),
Remark 6.2.11 (Adaptation when σ12 6= σ22 ). Student’s test can be adapted when the assumption
that the two samples have the same variance is not fulfilled. The resulting test is called Welch’s
t-test [3, Section 11.3.1]. ◦
where f_{n_1−1,n_2−1,r} is the quantile of order r of F(n_1 − 1, n_2 − 1). This ensures that this test has level α.
Remark 6.2.12 (Other tests). Fisher's test is known to be very sensitive to the assumption that the samples are Gaussian. More robust tests of homoscedasticity are reviewed in [3, Section 11.6]. ◦
With the data of Example 6.2.9, the application of Fisher's test yields a p-value of 0.966, which leads us to accept the hypothesis that the two samples have the same variance. We may therefore perform Student's test, which yields a p-value of 0.005, indicating a statistically significant difference between IMI and SEGF students: it is up to SEGF students to show that 2018 was an outlier!
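Since only summary statistics are reported in Table 6.5, the Student test can be reproduced directly from them; the Python sketch below (scipy's ttest_ind_from_stats, fed with the sample standard deviations of the table) recovers a p-value close to the reported 0.005.

```python
from scipy.stats import ttest_ind_from_stats

# Pooled-variance (equal_var=True) two-sample Student test computed
# from the summary statistics of Table 6.5.
stat, pvalue = ttest_ind_from_stats(
    mean1=14.08, std1=1.9, nobs1=43,
    mean2=12.86, std2=1.84, nobs2=35,
    equal_var=True)
```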
[Figure: histograms of the grades of Group1 and Group2 (Frequency against grade, from 5 to 11).]
Three examples are plotted in Figure 6.2 below: one where the samples have the same distribution, one where the laws of the two samples are linearly related, and one where the samples have distinct laws.
Figure 6.2: Q-Q plots of: (i) two samples with identical N(0, 1) distribution; (ii) two samples with respective distributions N(0, 1) and N(3, 2); (iii) two samples with respective distributions N(0, 1) and E(1).
The interpretation is that the closer the scatter plot is to the diagonal, the more similar the two distributions are. Notice that when the scatter plot has a linear shape distinct from the diagonal, it expresses the fact that the two CDFs F_1 and F_2 are related by an affine change of variable of the form F_2(x) = F_1((x − b)/a). In particular, if the Q-Q plot of a single sample against the standard Gaussian distribution is linear, then it is likely that the sample is Gaussian, and normality tests such as the Lilliefors test may be applied.
which can be computed with arguments similar to those detailed in Remark 4.2.8. The test is based on the following result.
Lemma 6.2.15 (Freeness of the Kolmogorov–Smirnov statistic). Assume that F_1 and F_2 are continuous. Under H0, the statistic ξ_{n_1,n_2} is free: its law only depends on n_1 and n_2.
The proof of Lemma 6.2.15 is postponed to Exercise 6.A.1. The law of ξn1 ,n2 under H0 is
called the Kolmogorov–Smirnov distribution with parameters n1 and n2 , its quantile of order r is
denoted by xn1 ,n2 ,r .
Corollary 6.2.16 (Kolmogorov–Smirnov test). The test rejecting H0 when $\xi_{n_1,n_2} \ge x_{n_1,n_2,1-\alpha}$ has level $\alpha$.
Based on the results of Section 4.2, it is also possible to design an asymptotic version of the Kolmogorov–Smirnov test. Another popular nonparametric homogeneity test is the Mann–Whitney $U$-test [6, Example 12.7, p. 66]. Wilcoxon's test, which addresses matched samples, is studied in Exercise 6.A.2.
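Both the two-sample Kolmogorov–Smirnov test and the Mann–Whitney U-test have scipy implementations; a quick sketch on simulated data in the spirit of case (iii) of Figure 6.2 (Python used here although the course's TP are in R):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x1 = rng.standard_normal(100)          # N(0, 1) sample
x2 = rng.exponential(1.0, size=120)    # E(1) sample

# Two-sample Kolmogorov-Smirnov statistic: sup over x of |F1_hat(x) - F2_hat(x)|
ks = stats.ks_2samp(x1, x2)
# Mann-Whitney U-test on the same data
mw = stats.mannwhitneyu(x1, x2)
print(ks.statistic, ks.pvalue, mw.pvalue)
```

With two such different laws, the null hypothesis of homogeneity is rejected by both tests.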
of occurrences of x within the ℓ-th sample. Hence, in this case, the homogeneity test reduces to
the independence test of Section 6.1.1.
of each sample, and to check whether they are highly scattered6 around the global empirical average
$$\bar X_{\cdot,\cdot} = \frac{\sum_{\ell=1}^k \sum_{i=1}^{n_\ell} X_{\ell,i}}{\sum_{\ell=1}^k n_\ell}.$$
This scattering is measured by the Sum of Squares for the Model
$$\mathrm{SSM} = \sum_{\ell=1}^k n_\ell \left(\bar X_{\ell,\cdot} - \bar X_{\cdot,\cdot}\right)^2.$$
6
Dispersé in French.
$$n = \sum_{\ell=1}^k n_\ell, \qquad X_n = (X_{1,n_1}, \dots, X_{k,n_k}) \in \mathbb{R}^n, \qquad \mathbf{1}_n = (\mathbf{1}_{n_1}, \dots, \mathbf{1}_{n_k}) \in \mathbb{R}^n,$$
and notice that $H \subset E$. The orthogonal projections of $X_n$ onto $E$ and $H$ are respectively denoted by $X_n^E$ and $X_n^H$. Then it is easily checked that
$$\mathrm{SSM} = \|X_n^E - X_n^H\|^2, \qquad \mathrm{SST} = \|X_n - X_n^H\|^2, \qquad \mathrm{SSE} = \|X_n - X_n^E\|^2.$$
Besides, since $X_n - X_n^E \in E^\perp$ while $X_n^E - X_n^H \in E$, Pythagoras' Theorem yields
$$\|X_n - X_n^H\|^2 = \|X_n - X_n^E\|^2 + \|X_n^E - X_n^H\|^2,$$
which rewrites as
$$\mathrm{SST} = \mathrm{SSE} + \mathrm{SSM}. \qquad \circ$$
Following our intuitive approach, the test should reject H0 when SSM takes large values. However, the magnitude of these values depends on the unknown parameter $\sigma^2$, which in turn is estimated by SSE (suitably renormalised). The next proposition clarifies these heuristic arguments and allows us to construct a rigorous test, based on the statistic
$$F_n = \frac{\mathrm{SSM}/(k-1)}{\mathrm{SSE}/(n-k)}.$$
Proof. Under H0, let $\mu_n \in \mathbb{R}^n$ be the vector $\mu\mathbf{1}_n$, where $\mu$ is the common value of $\mu_1, \dots, \mu_k$. Then
$$X_n = \mu_n + \epsilon_n, \qquad \epsilon_n \sim \mathcal{N}_n(0, \sigma^2 I_n).$$
Let us denote by $\epsilon_n^E$ and $\epsilon_n^H$ the respective orthogonal projections of $\epsilon_n$ onto $E$ and $H$, so that
$$X_n^E = \mu_n + \epsilon_n^E, \qquad X_n^H = \mu_n + \epsilon_n^H.$$
As a consequence,
$$\mathrm{SSM} = \|X_n^E - X_n^H\|^2 = \|\epsilon_n^E - \epsilon_n^H\|^2, \qquad \mathrm{SSE} = \|X_n - X_n^E\|^2 = \|\epsilon_n - \epsilon_n^E\|^2.$$
Writing
$$\epsilon_n = \underbrace{\epsilon_n - \epsilon_n^E}_{\in E^\perp} + \underbrace{\epsilon_n^E - \epsilon_n^H}_{\in H'} + \underbrace{\epsilon_n^H}_{\in H},$$
where $H'$ denotes the orthogonal complement of $H$ in $E$, we deduce that SSM and SSE are independent, with
$$\frac{1}{\sigma^2}\mathrm{SSM} \sim \chi^2(k-1), \qquad \frac{1}{\sigma^2}\mathrm{SSE} \sim \chi^2(n-k).$$
It follows that $F_n \sim F(k-1, n-k)$.
Corollary 6.3.4 (F-test for ANOVA). The test rejecting H0 as soon as Fn ≥ fk−1,n−k,1−α has
level α.
In practice, statistical software returns an ANOVA table such as the one depicted in Table 6.7, in which all useful quantities are summarised.
With the data of Example 6.3.1, where k = 3 and n = 86, the ANOVA table reads

               Df   Sum Sq   Mean Sq   F value    Pr(>F)
    group       2    105.7      52.8     20.51   5.7e-08
    Residuals  83    213.8       2.58

In particular, H0 is rejected at all usual levels.
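The same F-test can be run in one line; although the course works in R, a hedged Python sketch with scipy on simulated data (three groups with clearly different means) looks like this:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Three simulated samples (k = 3) with clearly different means
g1 = rng.normal(10.0, 1.6, size=30)
g2 = rng.normal(11.5, 1.6, size=28)
g3 = rng.normal(9.0, 1.6, size=28)

# F_n = (SSM / (k-1)) / (SSE / (n-k)), compared with the F(k-1, n-k) distribution
F, p = stats.f_oneway(g1, g2, g3)
print(F, p)
```

A p-value this small leads to rejecting the hypothesis that the three group means coincide, as in the course's example.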
6.A Exercises
Exercise 6.A.1. Prove Lemma 6.2.15. ◦
Exercise 6.A.2 (Wilcoxon's test). The Wilcoxon signed-rank test is adapted to a slightly different framework than the homogeneity tests described so far, as it addresses matched samples7. In this context, it is assumed that the sample is a series of independent pairs $(X_{1,i}, X_{2,i})_{1\le i\le n}$, such that
7
Échantillons appariés in French.
the differences Zi = X1,i − X2,i are identically distributed. We recall that the law of a random
variable ζ is said to be symmetric if ζ and −ζ have the same law, and define the following null and
alternative hypotheses:
An example. The papers of n students at an exam are graded successively by two professors,
yielding the grades X1,i and X2,i for the i-th student. We want to know whether there is a ‘profes-
sor effect’ resulting from different grading methods. We assume that each paper has an intrinsic
value xi , and that the grade given by each professor is a random fluctuation around this intrinsic
value, so that
X1,i = xi + ǫ1,i , X2,i = xi + ǫ2,i ,
where the sequences (ǫ1,i )1≤i≤n and (ǫ2,i )1≤i≤n are independent, and within each of these se-
quences, the variables are iid with respective distributions P1 and P2 . In this context, what is the
relation between the null hypothesis H0 introduced above and the ‘natural’ homogeneity hypoth-
esis H0′ = {P1 = P2 }?
We shall make the technical assumption:
∀z ∈ R, P(Z1 = z) = 0. (∗)
This ensures that, almost surely, the values of $Z_1, \dots, Z_n$ are pairwise distinct, and therefore there is a unique permutation $\pi$ of $\{1, \dots, n\}$, the set of which is denoted by $S_n$, such that $|Z_{\pi(1)}| < \cdots < |Z_{\pi(n)}|$; this allows us to introduce the statistic
$$T^+ = \sum_{k=1}^n k\,\mathbf{1}_{\{Z_{\pi(k)} > 0\}},$$
on which Wilcoxon’s test for H0 and H1 is based. We insist on the fact that the permutation π
depends on the realisation of the sample (Z1 , . . . , Zn ), and therefore is random.
1. Let $\zeta$ be a random variable with symmetric law, satisfying (∗). Show that the random variables8 $\mathrm{sign}(\zeta) \in \{-1, 1\}$ and $|\zeta|$ are independent, and that $\mathrm{sign}(\zeta)$ is a Rademacher variable (that is to say, $\mathbb{P}(\mathrm{sign}(\zeta) = -1) = \mathbb{P}(\mathrm{sign}(\zeta) = 1) = 1/2$).
2. Let $\zeta_1, \dots, \zeta_n$ be iid random variables, with symmetric law and satisfying (∗). We define the random permutation $\pi \in S_n$ to be such that $|\zeta_{\pi(1)}| < \cdots < |\zeta_{\pi(n)}|$. Show that the random variables $\mathrm{sign}(\zeta_{\pi(1)}), \dots, \mathrm{sign}(\zeta_{\pi(n)})$ are independent Rademacher variables.
3. Deduce that under H0 , the statistic T + is free, and describe a nonasymptotic two-sided test
for H0 .
5. Show that, under H0 , (T + − tn )/σn converges in distribution to N(0, 1). Deduce an asymp-
totic test for H0 . ◦
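The grading example above can be illustrated numerically with scipy's implementation of the signed-rank test; the numbers below are made up, and Python stands in for the course's R.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 30
intrinsic = rng.uniform(8, 14, size=n)            # intrinsic values x_i of the papers
x1 = intrinsic + rng.normal(0.0, 0.5, size=n)     # grades given by professor 1
x2 = intrinsic + rng.normal(0.8, 0.5, size=n)     # professor 2 grades more generously

# Wilcoxon signed-rank test on the differences Z_i = X_{1,i} - X_{2,i}
res = stats.wilcoxon(x1, x2)
print(res.statistic, res.pvalue)
```

A small p-value here reveals the simulated 'professor effect' even though the two marginal grade distributions overlap heavily.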
6.B Summary
We summarise the various tests which have been presented in these notes.
8
The condition (∗) ensures that $\zeta \neq 0$ almost surely, so that there is no need to adopt a convention for the value of $\mathrm{sign}(0)$.
132 Independence and homogeneity tests
The following exercises cover the main points of the course and are typical examples of what you
should be able to solve (almost) instantaneously at the end of the semester. The questions for
which a numerical computation is required are marked with the symbol [R].
2. The correlation circle of the first two principal components is plotted in Figure (b). Recall how this figure is plotted, and give an interpretation of the score of a station on the first two principal components.
3. The projection of the n stations in the first factorial plane is plotted in Figure (c), and the scores of these stations on the first principal component are plotted in Figure (d). What do you conclude?
(a) The location of the n stations along the Doubs river. (b) The correlation circle for the first two principal components.
1
Inspired by https://fr.wikipedia.org/wiki/Analyse_en_composantes_principales.
134 Check-out list
(c) The projection of the n stations in the first factorial plane. (d) Scores of the stations on the first principal component.
Exercise 2 (Parametric estimation2). The Rayleigh distribution with parameter $\theta > 0$ is the probability distribution with density
$$p(x; \theta) = \sqrt{\frac{2}{\pi\theta}}\,\mathbf{1}_{\{x>0\}} \exp\left(-\frac{x^2}{2\theta}\right).$$
We remark that under $P_\theta$, $X_1$ has the same law as $|Z|$, where $Z \sim N(0, \theta)$, and recall that $\mathbb{E}[Z^2] = \theta$, $\mathbb{E}[Z^4] = 3\theta^2$.
1. Compute $\mathbb{E}_\theta[X_1]$ and deduce a moment estimator $\tilde\theta_n$ of $\theta$.
2. Show that $\tilde\theta_n$ is asymptotically normal and compute its asymptotic variance.
3. Compute the MLE $\hat\theta_n$ of $\theta$.
4. Show that $\hat\theta_n$ is unbiased and strongly consistent.
5. Show that $\hat\theta_n$ is asymptotically normal and compute its asymptotic variance.
6. Which of the estimators $\tilde\theta_n$ and $\hat\theta_n$ do you prefer?
7. Show that the MLE $\hat\theta_n$ is asymptotically efficient.
8. Construct an asymptotic confidence interval with level $1-\alpha$ for $\theta$.
9. Show that under $P_\theta$, the random variable $\hat\theta_n/\theta$ is free, and deduce an exact confidence interval with level $1-\alpha$ for $\theta$.
10. With the same arguments, construct a test with level $\alpha$ for the hypotheses $H_0 = \{\theta \le \theta_0\}$, $H_1 = \{\theta > \theta_0\}$. What is the p-value of an observation $\hat\theta_n = \theta^{\mathrm{obs}}$?
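As a numerical sanity check (not part of the exercise), note that the two estimators have the closed forms $\tilde\theta_n = (\pi/2)\bar X_n^2$ (from $\mathbb{E}_\theta[X_1] = \sqrt{2\theta/\pi}$) and $\hat\theta_n = n^{-1}\sum_i X_i^2$ (from maximising the likelihood); if that algebra is right, both should converge to $\theta$ in a simulation:

```python
import numpy as np

rng = np.random.default_rng(5)
theta = 2.0
n = 100_000
# Rayleigh(theta) sample, using the representation X ~ |Z| with Z ~ N(0, theta)
x = np.abs(rng.normal(0.0, np.sqrt(theta), size=n))

theta_mom = (np.pi / 2) * np.mean(x) ** 2   # moment estimator (Question 1)
theta_mle = np.mean(x ** 2)                 # MLE (Question 3)
print(theta_mom, theta_mle)
```

With this sample size both estimates agree with the true value to roughly two decimal places, consistent with the asymptotic normality asked for in Questions 2 and 5.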
Exercise 3 (Hypothesis testing). In order to be labelled as ‘organic food’, a food producer has
to ensure that each of his products contains less than 1% of GMO3 . He takes a sample of n = 25
products and computes the percentage of GMO in each of these products. We denote by Xi the
logarithm of this percentage for the i-th product and assume that X1 , . . . , Xn are iid under the law
N(θ, 1).
2
Exercises 2 and 3 were proposed by Christophe Denis.
3
OGM in French.
1. For the producer, the products do not contain GMO unless the contrary is proved. He there-
fore sets
H0 = {θ ≤ 0}, H1 = {θ > 0}.
2. [R] An environmental organisation wants to make sure that the products do not contain GMO. In particular, they worry about the ability of the test to detect products with a quantity of GMO that is 50% larger than what is authorised. Compute the probability that the test concludes that GMO are absent when the actual percentage of GMO is 1.5% (that is to say, with θ = log 1.5).
3. [R] Scandalised by this result, the organisation wants to modify the producer's test. To them, the products do contain GMO unless the contrary is proved. Which test should be constructed? With n = 25 and α = 0.05, what is now the probability of concluding that GMO are absent when θ = log 1.5?
We take a population of first-generation plants in which the phenotypes AB, aB, Ab, ab are equally distributed. In order to test the hypothesis H0 that A and B are dominant and a and b are recessive, we perform random interbreedings in the population and obtain a second generation. Under H0, the theoretical probability of each phenotype in the second generation, predicted by Mendel's theory, is given below. For a population of n = 160 plants in the second generation, the observed numbers of occurrences of each phenotype are also given. What do you conclude?
Phenotype AB aB Ab ab
Theoretical probability 9/16 3/16 3/16 1/16
Experimental observation 100 18 24 18
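The numerical part of this goodness-of-fit question can be reproduced with scipy's chi-square test (Python used here in place of the course's R):

```python
import numpy as np
from scipy import stats

# Mendel's theoretical probabilities and the observed counts for n = 160 plants
probs = np.array([9, 3, 3, 1]) / 16
observed = np.array([100, 18, 24, 18])
expected = 160 * probs   # [90, 30, 30, 10]

chi2, p = stats.chisquare(observed, f_exp=expected)
print(chi2, p)
```

The statistic is compared with a chi-square distribution with 4 − 1 = 3 degrees of freedom; the resulting p-value is below 0.01, so H0 is rejected at the usual levels.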
(a) You want to know if the time between two consecutive buses at a bus stop has an
exponential distribution.
(b) You want to know if eye colour and hair colour are independent.
(c) You want to know if the number of earthquakes per year follows a Poisson distribution.
(d) You want to know if men between age 20 and 30 have the same height in France and in Italy.
(e) You want to know if men between age 20 and 30 have the same height all across Europe.
(f) You want to know if having young children has an influence on having a professional activity; to this aim, you compute the unemployment rate among 100 women with children and among 100 women without children.
Exercise 5 (Linear regression4). [R] The Cobb–Douglas model in economics relates the total production Y of a company (the real value of all goods produced) in a year to the labour input L (the total number of person-hours worked) and the capital input K (a measure of all machinery, equipment, and buildings) through the formula
$$Y = A L^{\beta_1} K^{\beta_2}.$$
Taking logarithms, with $y = \log Y$, $x_1 = \log L$ and $x_2 = \log K$, the model fits into the linear regression framework
$$y = \log A + \beta_1 x_1 + \beta_2 x_2 + \epsilon, \qquad \epsilon \sim N(0, \sigma^2).$$
Applying linear regression to a database of n = 1658 companies for which Y, L and K are given, we obtain the following results:
$$\hat\beta_0 = 3.136, \qquad \hat\beta_1 = 0.738, \qquad \hat\beta_2 = 0.282, \qquad R^2 = 0.945, \qquad \|y_n - \hat y_n\|^2 = 148.27,$$
$$(x_n^\top x_n)^{-1} = \begin{pmatrix} 0.0288 & 0.0012 & -0.0034 \\ 0.0012 & 0.0016 & 0.0010 \\ -0.0034 & 0.0010 & 0.0009 \end{pmatrix}.$$
1. Compute the value of the unbiased estimator $\hat\sigma^2$ of $\sigma^2$.
2. Give a confidence interval with level 95% for the total factor productivity.
3. Give a prediction interval with level 95% for the total production of a company with labour
input L = 100 M$ and capital input K = 50 M$.
5. The model is said to display constant returns to scale5 if multiplying L and K by λ > 0
results in multiplying Y by λ.
4
Taken from the lecture notes Régression linéaire by Arnaud Guyader: http://www.lpsm.paris/pageperso/guyader.
5
Rendements d'échelle constants in French.
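Questions 1 and 2 of Exercise 5 can be checked numerically from the reported quantities alone. The sketch below is a hedged Python computation (the course uses R); it assumes the Gaussian quantile 1.96 in place of the Student quantile, which is legitimate here since n − p − 1 = 1655 is large, and takes exp of the interval endpoints for log A to get an interval for the total factor productivity A:

```python
import numpy as np

n, p = 1658, 2
rss = 148.27          # ||y_n - y_hat_n||^2, reported in the exercise
beta0_hat = 3.136     # estimate of log A
xtx_inv_00 = 0.0288   # top-left entry of (x_n^T x_n)^{-1}

# Question 1: unbiased estimator of sigma^2
sigma2_hat = rss / (n - p - 1)

# Question 2: 95% confidence interval for log A, then for A = exp(log A)
se_beta0 = np.sqrt(sigma2_hat * xtx_inv_00)
ci_logA = (beta0_hat - 1.96 * se_beta0, beta0_hat + 1.96 * se_beta0)
ci_A = (np.exp(ci_logA[0]), np.exp(ci_logA[1]))
print(sigma2_hat, ci_logA, ci_A)
```

The unbiased variance estimate comes out close to 0.09, and the interval for log A is centred at 3.136.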
Appendix A
This appendix contains a short introduction to several distributions related to Gaussian variables or
vectors, which were used in the previous chapters. We refer to [2] for a more detailed exposition.
In the previous definition, the variance σ 2 is allowed to vanish so that we take the convention
that deterministic random variables are Gaussian random variables with zero variance.
If X = (X1 , . . . , Xd ) ∈ Rd is a random vector, then for all u ∈ Rd ,
where m = E[X] and K is the covariance matrix of X. The next result then follows from the
expression of the characteristic function of Gaussian random variables.
X ∼ Nd (m, K).
A very useful property of Gaussian vectors is that independence can be checked from the
covariance matrix.
Proposition A.2.4 (Gaussian vectors and independence). Let X ∈ Rl and Y ∈ Rd be two ran-
dom vectors such that the vector (X, Y ) ∈ Rl+d is Gaussian. Then the following conditions are
equivalent.
Proof. The implications (i) ⇒ (ii) ⇒ (iii) are straightforward. Let us assume (iii) and prove (i). To this aim, let us fix $u \in \mathbb{R}^l$ and $v \in \mathbb{R}^d$. The characteristic function of $(X, Y)$ evaluated at $(u, v)$ writes
$$\Phi_{(X,Y)}(u,v) = \mathbb{E}\left[\exp\left(i(\langle u, X\rangle + \langle v, Y\rangle)\right)\right].$$
Since the vector $(X, Y)$ is Gaussian, the random variable $\langle u, X\rangle + \langle v, Y\rangle$ is also Gaussian, with mean
$$\mathbb{E}[\langle u, X\rangle + \langle v, Y\rangle] = \langle u, m_X\rangle + \langle v, m_Y\rangle$$
and variance
$$\operatorname{Var}(\langle u, X\rangle + \langle v, Y\rangle) = \operatorname{Var}(\langle u, X\rangle) + 2\operatorname{Cov}(\langle u, X\rangle, \langle v, Y\rangle) + \operatorname{Var}(\langle v, Y\rangle) = \langle u, K_X u\rangle + \langle v, K_Y v\rangle,$$
where we have used condition (iii). As a consequence, using the expression of the characteristic function of real-valued Gaussian variables, we get
$$\mathbb{E}\left[\exp\left(i(\langle u, X\rangle + \langle v, Y\rangle)\right)\right] = \exp\left(i(\langle u, m_X\rangle + \langle v, m_Y\rangle) - \frac{1}{2}(\langle u, K_X u\rangle + \langle v, K_Y v\rangle)\right) = \Phi_X(u)\Phi_Y(v),$$
As in the scalar case, the proof of Theorem A.3.1 follows from the computation of the limit,
when n → +∞, of the characteristic function of the random vector in the left-hand side.
By Definition A.2.1, the right-hand side is a Gaussian variable, therefore ζ is a Gaussian vector.
Since $\mathbb{E}[X] = 0$, it is immediate that $\mathbb{E}[\zeta] = 0$. Now, by the definition of $\Pi$, for all $i \in \{1, \dots, d\}$,
$$\operatorname{Var}(\zeta_i) = \operatorname{Var}(\langle X, e_i\rangle) = \langle e_i, \Pi e_i\rangle = \begin{cases} 1 & \text{if } i \le k,\\ 0 & \text{if } i \ge k+1,\end{cases}$$
so that ζ1 , . . . , ζk are identically distributed according to N(0, 1), while ζk+1 = · · · = ζd = 0.
Therefore, to conclude, it remains to check that ζ1 , . . . , ζk are independent. Since the vector ζ
is Gaussian, it is sufficient to check that Cov(ζi , ζj ) = 0 for 1 ≤ i < j ≤ k. But by the definition
of Π again, and the fact that e1 , . . . , ek are orthogonal, we have
Theorem A.4.2 (Cochran's Theorem). Let $X \sim N_d(0, I_d)$ and $E_1, E_2$ be two orthogonal subspaces of $\mathbb{R}^d$, with respective dimensions $d_1, d_2$. The orthogonal projections $X^1$ and $X^2$ of $X$, respectively on $E_1$ and $E_2$, are independent Gaussian vectors, and $\|X^1\|^2 \sim \chi^2(d_1)$, $\|X^2\|^2 \sim \chi^2(d_2)$.
Proof. Let Π1 and Π2 denote the matrices of the orthogonal projection of Rd onto E1 and E2 ,
respectively. For any u ∈ Rd ,
so that by Definition A.2.1, X 1 is a Gaussian vector. The same arguments also apply to X 2 .
To show that X 1 and X 2 are independent, we compute the characteristic function of the pair
of vectors $(X^1, X^2)$. For any $(u, v) \in \mathbb{R}^d \times \mathbb{R}^d$,
$$\Phi_{(X^1,X^2)}(u,v) = \mathbb{E}\left[\exp\left(i(\langle u, X^1\rangle + \langle v, X^2\rangle)\right)\right] = \mathbb{E}\left[\exp\left(i(\langle \Pi_1 u, X\rangle + \langle \Pi_2 v, X\rangle)\right)\right] = \Phi_X(\Pi_1 u + \Pi_2 v) = \exp\left(-\frac{1}{2}\langle \Pi_1 u + \Pi_2 v, \Pi_1 u + \Pi_2 v\rangle\right),$$
where we have used Lemma A.2.2 and the fact that X ∼ Nd (0, Id ) at the last line. Since Π1 u ∈
E1 , Π2 v ∈ E2 , and E1 and E2 are orthogonal, Pythagoras’ Theorem yields
so that
$$\Phi_{(X^1,X^2)}(u,v) = \exp\left(-\frac{1}{2}\|\Pi_1 u\|^2\right)\exp\left(-\frac{1}{2}\|\Pi_2 v\|^2\right).$$
This shows that $X^1$ and $X^2$ are independent, with respective distributions $N_d(0, \Pi_1)$ and $N_d(0, \Pi_2)$. By Proposition A.4.1, we deduce that $\|X^1\|^2 \sim \chi^2(d_1)$, $\|X^2\|^2 \sim \chi^2(d_2)$, which completes the proof.
A.4 Cochran’s Theorem 141
Proposition A.4.3 (Empirical mean and variance). Let X1 , . . . , Xn be independent N(0, 1) ran-
dom variables. Let us write
$$\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i, \qquad S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X_n)^2.$$
Then
$$(n-1)S_n^2 \sim \chi^2(n-1), \qquad \frac{\bar X_n}{\sqrt{S_n^2/n}} \sim t(n-1).$$
142 Complements on Gaussian statistics
Appendix B: Flipped-classroom topics (Sujets de classe inversée)
Each of the six topics proposed below describes a statistical method for performing a hypothesis test or a regression in a particular setting. It must be presented in a talk of about fifteen minutes, with slides, comprising two parts:
• a quick theoretical presentation of the method, guided by the indications given below (do not hesitate to do your own research beyond these notes);
The second part must not be limited to a mere statement of the result of the test (for instance: 'We compared the December temperatures in Oslo and in Algiers, and we find that they do not have the same distribution.'), but must come with a discussion; possible directions for this discussion are suggested below for each topic.
• In the theoretical presentation, give an overview of the method (what it is for, in which setting it is used, how it is applied) rather than dwelling on the mathematical details: these take a great deal of time to present, and it is preferable to point your audience to the precise place in the lecture notes.
• When you present the result of a hypothesis test, it is much more telling to give the p-value (which is immediately interpretable) than the value of the test statistic (which depends heavily on the test). Whenever you can, do not hesitate to also discuss the power of the test you perform.
• Do not hesitate to use artificial data, simulated by yourselves, to explore the limits of the methods you present. For instance, if you describe a method which relies on the assumption that the data are Gaussian, it may be interesting to show that the method 'works' on data that you generated from a Gaussian distribution, and then to simulate data that are 'less and less Gaussian' and to see whether the method keeps producing correct results.
• Repositories of databases are indicated on Teams. Depending on your topic, it may also be interesting to look for scientific studies that use the method your talk is devoted to, and to discuss their results: is the sample size large enough to ensure that the test is powerful enough? Do the data seem to satisfy the assumptions under which the method is valid?
• By Friday 9 October, you must have formed a group and chosen a topic, on the basis of the 'Quick summaries' given below.
• By Friday 27 November, you must send your petite classe teacher an email describing:
1. the database on which you have chosen to apply your method,
2. the hypotheses you will seek to test, or the relations you will seek to bring out, on this database.
Moreover, before the three sessions devoted to the talks, you must read the sections of the lecture notes corresponding to the topics treated in each session, and prepare a question to ask each group, for each topic.
so that I ⊂ J.
To prove the converse inclusion, we show that $I^\perp \subset J^\perp$. Let $v \in I^\perp$: by definition, for all $u \in \mathbb{R}^p$,
$$0 = \langle uK_n, v\rangle = \langle u, vK_n^\top\rangle = \langle u, vK_n\rangle,$$
and
$$\frac{1}{n}\sum_{i=1}^n (x_i^2 - \bar x_n^2)^2 = \alpha^2\,\frac{1}{n}\sum_{i=1}^n (x_i^1 - \bar x_n^1)^2.$$
[Figure: for n = 100, two scatter plots of the points $(x_i^1, x_i^2)$ for k = 1 and k = 100, and the empirical correlation between x1 and x2 plotted against k.]
The first and second figures show two sets of points $(x_i^1, x_i^2)$, $i \in \{1, \dots, 100\}$, with respective parameters k = 1 and k = 100. The third figure shows the evolution of the empirical correlation between x1 and x2 as k ranges from 1 to 100. In the present example, x1 and x2 are related by a deterministic function, but when k is large, the nonlinearity of this function makes the correlation between x1 and x2 close to 0, as if these variables were statistically independent.
1. We have $x_1^1 = 15$, $x_2^1 = 12$ and $x_3^1 = 18$, so that $x_2^1 < x_1^1 < x_3^1$. As a consequence, $r_1^1 = 2$ (because $x_1^1$ is ranked second), $r_2^1 = 1$ and $r_3^1 = 3$. Likewise, $x_1^2 = 9$, $x_2^2 = 7$ and $x_3^2 = 8$, so that $x_2^2 < x_3^2 < x_1^2$ and $r_1^2 = 3$, $r_2^2 = 1$ and $r_3^2 = 2$. In the definition
$$r_s = \frac{\frac{1}{3}\sum_{i=1}^3 (r_i^1 - \bar r^1)(r_i^2 - \bar r^2)}{\sqrt{\frac{1}{3}\sum_{i=1}^3 (r_i^1 - \bar r^1)^2}\,\sqrt{\frac{1}{3}\sum_{i=1}^3 (r_i^2 - \bar r^2)^2}},$$
the means are $\bar r^1 = \bar r^2 = 2$, and
$$\frac{1}{3}\sum_{i=1}^3 (r_i^1 - \bar r^1)^2 = \frac{1}{3}\sum_{i=1}^3 (r_i^2 - \bar r^2)^2 = \frac{1^2 + 0^2 + 1^2}{3} = \frac{2}{3},$$
while the numerator writes
$$\frac{1}{3}\sum_{i=1}^3 (r_i^1 - \bar r^1)(r_i^2 - \bar r^2) = \frac{1}{3}\left[(2-2)(3-2) + (1-2)(1-2) + (3-2)(2-2)\right] = \frac{1}{3}.$$
Therefore $r_s = \frac{1/3}{2/3} = \frac{1}{2}$.
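This hand computation can be checked with scipy, which computes Spearman's coefficient directly from the raw values (Python used here in place of the course's R):

```python
from scipy import stats

x1 = [15, 12, 18]   # the values x_1^1, x_2^1, x_3^1
x2 = [9, 7, 8]      # the values x_1^2, x_2^2, x_3^2

rho, p = stats.spearmanr(x1, x2)
print(rho)   # 0.5, matching the computation above
```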
We may already notice, for later use, that since, in the general case, $r^1$ and $r^2$ are permutations of $\{1, \dots, n\}$, it always holds that
$$\bar r^1 = \bar r^2 = \frac{1 + \cdots + n}{n} = \frac{n+1}{2},$$
and
$$\frac{1}{n}\sum_{i=1}^n (r_i^1 - \bar r^1)^2 = \frac{1}{n}\sum_{i=1}^n (r_i^2 - \bar r^2)^2 = \frac{1}{n}\sum_{k=1}^n \left(k - \frac{n+1}{2}\right)^2 = \frac{n^2-1}{12}.$$
2. Since $f$ is increasing, the series $x_1^1, \dots, x_n^1$ and $x_1^2, \dots, x_n^2$ are ranked in the same order, therefore $r^1 = r^2$ and it is immediate that $r_s = 1$. On the contrary, if $f$ is decreasing, the series $x_1^1, \dots, x_n^1$ and $x_1^2, \dots, x_n^2$ are ranked in opposite orders, therefore $r_i^1 = n + 1 - r_i^2$ for any $i \in \{1, \dots, n\}$. As a consequence, the numerator in the definition of Spearman's coefficient writes
$$\frac{1}{n}\sum_{i=1}^n (r_i^1 - \bar r^1)(r_i^2 - \bar r^2) = \frac{1}{n}\sum_{i=1}^n \left(r_i^1 - \frac{n+1}{2}\right)\left(n + 1 - r_i^1 - \frac{n+1}{2}\right) = \frac{1}{n}\sum_{k=1}^n \left(k - \frac{n+1}{2}\right)\left(n + 1 - k - \frac{n+1}{2}\right) = -\frac{1}{n}\sum_{k=1}^n \left(k - \frac{n+1}{2}\right)^2 = -\frac{n^2-1}{12},$$
where we have used the computation made at the end of the previous question. We conclude that $r_s = -1$.
3. In general, the numerator in the definition of $r_s$ writes
$$\frac{1}{n}\sum_{i=1}^n (r_i^1 - \bar r^1)(r_i^2 - \bar r^2) = \frac{1}{n}\sum_{i=1}^n \left(r_i^1 - \frac{n+1}{2}\right)\left(r_i^2 - \frac{n+1}{2}\right) = \frac{1}{n}\sum_{i=1}^n \left(r_i^1 r_i^2 - \frac{n+1}{2}(r_i^1 + r_i^2) + \frac{(n+1)^2}{4}\right).$$
Let us compute separately the sums of the three terms appearing in the right-hand side, starting from the last two. We get
$$\frac{1}{n}\sum_{i=1}^n \frac{(n+1)^2}{4} = \frac{(n+1)^2}{4},$$
150 Correction of the exercises
and
$$\frac{1}{n}\sum_{i=1}^n \frac{n+1}{2}(r_i^1 + r_i^2) = \frac{n+1}{2}\cdot\frac{2}{n}\sum_{k=1}^n k = \frac{(n+1)^2}{2}.$$
Last,
$$\frac{1}{n}\sum_{i=1}^n r_i^1 r_i^2 = \frac{1}{2n}\sum_{i=1}^n \left[(r_i^1)^2 + (r_i^2)^2 - (r_i^1 - r_i^2)^2\right] = \frac{1}{2n}\left(2\sum_{k=1}^n k^2 - \sum_{i=1}^n (r_i^1 - r_i^2)^2\right) = \frac{(n+1)(2n+1)}{6} - \frac{1}{2n}\sum_{i=1}^n (r_i^1 - r_i^2)^2.$$
1. For all $n \ge 0$,
$$b_{n+1} = \mathbb{E}[X^{n+1}] = \frac{1}{e}\sum_{k=0}^{+\infty} k^{n+1}\frac{1}{k!} = \frac{1}{e}\sum_{k=1}^{+\infty} k^n \frac{1}{(k-1)!} = \frac{1}{e}\sum_{l=0}^{+\infty} (l+1)^n \frac{1}{l!} = \frac{1}{e}\sum_{l=0}^{+\infty}\left[\sum_{m=0}^n \binom{n}{m} l^m\right]\frac{1}{l!} = \sum_{m=0}^n \binom{n}{m}\frac{1}{e}\sum_{l=0}^{+\infty} l^m\frac{1}{l!} = \sum_{m=0}^n \binom{n}{m} b_m.$$
2. By Exercise 1.3.1, the sequences (bn )n≥0 and (Bn )n≥0 satisfy the same recursive relation,
and it is immediate that b0 = B0 = 1. By induction, we deduce that bn = Bn for all n ≥ 0.
C.1 Correction of the exercises of Chapter 1 151
2. To show that $\mathrm{WCSS}^{(t+1)} \le \mathrm{WCSS}^{(t)}$ in the proof of Lemma 1.3.3, we have used two inequalities: first, that for all $j \in \{1, \dots, k\}$, for all $i \in S_j^{(t)}$, $\|x_i - \bar x_j^{(t)}\| \ge \|x_i - \bar x_{J(i)}^{(t)}\|$; second, that for all $j \in \{1, \dots, k\}$, $\|\bar x_j^{(t+1)} - \bar x_j^{(t)}\| \ge 0$. If $\mathrm{WCSS}^{(t+1)} = \mathrm{WCSS}^{(t)}$, then both inequalities must actually be equalities. In particular, the second one implies that for all $j \in \{1, \dots, k\}$, $\bar x_j^{(t)} = \bar x_j^{(t+1)}$.
In the (t+1)-th iteration, the assignment step computes, for each $i \in \{1, \dots, n\}$, the index $J'(i)$ of the point $\bar x_j^{(t+1)}$ which is closest to $x_i$. But since $\bar x_j^{(t+1)} = \bar x_j^{(t)}$, we get $J'(i) = J(i)$. Therefore, the update step of the (t+1)-th iteration does not modify the partition.
3. Lemma 1.3.3 shows that the sequence WCSS(t) is nonincreasing. Since there is only a
finite number of partitions, this sequence is necessarily stationary: there exists t0 ≥ 0 such
that WCSS(t0 ) = WCSS(t0 +1) = · · · . The counterexample of the first question of this
exercise shows that it is not necessary that the partitions obtained at the t0 -th and t0 + 1-th
iterations of the algorithm coincide, however the second question ensures that it is the case
for the partitions obtained at the t0 + 1-th and t0 + 2-th iterations. Therefore the algorithm
is stopped after the t0 + 2-th iteration.
Take a different partition $\{J_1, \dots, J_K\}$: then there exists $l_0 \in \{1, \dots, K\}$ such that $J_{l_0}$ contains two indices $i \in I_k$ and $i' \in I_{k'}$ with $k \neq k'$. The WCSS of this partition satisfies
$$\sum_{l=1}^K \sum_{i \in J_l} \|x_i - \bar x_l\|^2 \ge \|x_i - \bar x_{l_0}\|^2 + \|x_{i'} - \bar x_{l_0}\|^2,$$
but on the other hand $\|x_i - \mu_k\| \le \rho$ and $\|x_{i'} - \mu_{k'}\| \le \rho$, so that the triangle inequality again yields
$$\|\mu_k - \mu_{k'}\| - 2\rho \le \|x_i - x_{i'}\|.$$
The classical inequality $\frac{1}{2}(a+b)^2 \le a^2 + b^2$ finally yields
$$\frac{1}{2}\|x_i - x_{i'}\|^2 \le \|x_i - \bar x_{l_0}\|^2 + \|x_{i'} - \bar x_{l_0}\|^2,$$
so that the WCSS of the partition $\{J_1, \dots, J_K\}$ is bounded from below by
$$\frac{1}{2}\left(\|\mu_k - \mu_{k'}\| - 2\rho\right)^2.$$
When $\rho \to 0$, this bound becomes larger than $4n\rho^2$, which proves the optimality of the partition $\{I_1, \dots, I_K\}$.
where we recall that vectors of $\mathbb{R}^p$ are considered as row vectors. For all $k \in \{1, \dots, K\}$, for all $i \in I_k$,
$$x_i - \bar x_n = \mu_k + \epsilon_i - \sum_{k'=1}^K \frac{n_{k'}}{n}\mu_{k'} - \bar\epsilon_n = \nu_k + \epsilon_i - \bar\epsilon_n,$$
where
$$\nu_k = \mu_k - \sum_{k'=1}^K \frac{n_{k'}}{n}\mu_{k'}.$$
$$K_n = \frac{1}{n}\sum_{k=1}^K\sum_{i\in I_k} \nu_k^\top \nu_k + O(\rho) = K_n^0 + O(\rho),$$
C.2 Correction of the exercises of Chapter 2 153
with
$$K_n^0 = \sum_{k=1}^K \frac{n_k}{n}\,\nu_k^\top \nu_k.$$
Notice that $K_n^0$ is the covariance matrix of a random vector $X \in \mathbb{R}^p$ taking the value $\mu_k$ with probability $n_k/n$.
$$\sum_{k=1}^K \frac{n_k}{n}\langle e_j^0, \nu_k\rangle^2 = \lambda_j^0\,\|e_j^0\|^2,$$
and the identity
$$\sum_{k=1}^K \frac{n_k}{n}\nu_k = 0$$
implies that $K' \le K - 1$.
Since the eigenvalues of symmetric matrices are Lipschitz continuous, the spectrum of Kn
is a perturbation of order ρ of the spectrum of Kn0 . Therefore, when ρ → 0, the scree plot
of the Principal Component Analysis shall exhibit:
Thus, an inflection point shall occur between the K ′ -th and (K ′ + 1)-th eigenvalues.
6. The 'actual information' contained in the data is encoded in a K′-dimensional vector, which is exactly recovered in the ρ → 0 limit by keeping only the first K′ principal components.
and we have
$$\operatorname{Var}(X_1) = \mathbb{E}[X_1^2] - \mathbb{E}[X_1]^2 = \rho_2, \qquad \operatorname{Cov}(X_1, X_1^2) = \mathbb{E}[X_1^3] - \mathbb{E}[X_1]\,\mathbb{E}[X_1^2] = \rho_3, \qquad \operatorname{Var}(X_1^2) = \mathbb{E}[X_1^4] - \mathbb{E}[X_1^2]^2 = \rho_4 - \rho_2^2.$$
As a consequence,
$$\nabla\varphi(y) = \begin{pmatrix} 0 \\ 1 \end{pmatrix}.$$
3. We have $V_n = \varphi(\bar Y_n)$ and $\rho_2 = \varphi(y)$, so that by the Delta Method, $\sqrt n\,(V_n - \rho_2)$ converges in distribution to $N(0, v)$ with $v = \nabla\varphi(y)^\top K \nabla\varphi(y) = \rho_4 - \rho_2^2$.
which is plotted in Figure C.1. We observe that any choice of $\theta$ between $\max_{1\le i\le n} x_i - 1$ and $\min_{1\le i\le n} x_i$ maximises this likelihood: the MLE is not unique.
3. For all $\epsilon > 0$, the strong Law of Large Numbers implies that $\frac{1}{n}\sum_{i=1}^n \mathbf{1}_{\{X_i \le \theta+\epsilon\}}$ converges to $\epsilon$, $\mathbb{P}_\theta$-almost surely. As a consequence, $\mathbf{1}_{\{X_i \le \theta+\epsilon\}} = 1$ for infinitely many values of $i$, which implies that $\min_{1\le i\le n} X_i$ converges to $\theta$, $\mathbb{P}_\theta$-almost surely. By the same arguments, $\max_{1\le i\le n} X_i$ converges to $\theta + 1$, $\mathbb{P}_\theta$-almost surely. As a consequence,
where
$$I_n(u_1, u_n) = \int_{u_2,\dots,u_{n-1}=0}^1 \mathbf{1}_{\{u_1 \le u_2 \le \cdots \le u_{n-1} \le u_n\}}\,\mathrm{d}u_2\cdots\mathrm{d}u_{n-1} = \mathbf{1}_{\{u_1 \le u_n\}} \int_{u_2=u_1}^{u_n}\int_{u_3=u_2}^{u_n}\cdots\int_{u_{n-1}=u_{n-2}}^{u_n} \mathrm{d}u_{n-1}\cdots\mathrm{d}u_2 = \mathbf{1}_{\{u_1 \le u_n\}}\,\frac{(u_n - u_1)^{n-2}}{(n-2)!}.$$
We notice that the MSE is of order $1/n^2$, while for regular models it is generally expected to be of order $1/n$.
Correction of Exercise 2.A.7 It is clear that $\eta_{k,n}$ is a statistic. Using properties of the exponential distribution, we rewrite
$$\eta_{k,n} = \frac{Z_1}{Z_1 + Z_2},$$
where $Z_1 = X_1 + \cdots + X_k \sim \Gamma(k, \lambda)$ and $Z_2 = X_{k+1} + \cdots + X_n \sim \Gamma(n-k, \lambda)$ are independent. Following [2, Proposition 3.4.4, p. 52], we deduce that $\eta_{k,n} \sim \beta(k, n-k)$, which does not depend on the value of $\lambda$.
2. By definition,
$$\mathbb{E}_\theta[U_1] = \mathbb{P}_\theta(X_1 \le 0) = \int_{-\infty}^0 \frac{\mathrm{d}x}{\pi((x-\theta)^2+1)} = \int_{-\infty}^{-\theta} \frac{\mathrm{d}y}{\pi(y^2+1)} = \frac{1}{\pi}\arctan(-\theta) + \frac{1}{2} = -\frac{1}{\pi}\arctan(\theta) + \frac{1}{2},$$
where we let $y = x - \theta$ in the third equality. Since $\mathbb{E}_\theta[U_1] \in (0,1)$, the quantity $\pi(\frac{1}{2} - \mathbb{E}_\theta[U_1]) = \arctan(\theta)$ belongs to the interval $(-\frac{\pi}{2}, \frac{\pi}{2})$. Approximating $\mathbb{E}_\theta[U_1]$ by $\frac{1}{n}\sum_{i=1}^n U_i$, we get the estimator
$$\tilde\theta_n = g\left(\frac{1}{n}\sum_{i=1}^n U_i\right).$$
so that
$$g'(\mathbb{E}_\theta[U_1]) = -\pi\left[1 + \tan\left(\pi\left(\frac{1}{2} - \left(-\frac{1}{\pi}\arctan(\theta) + \frac{1}{2}\right)\right)\right)^2\right] = -\pi\left[1 + \tan(\arctan(\theta))^2\right] = -\pi(1 + \theta^2).$$
Applying the Delta Method, we get
$$\sqrt n\left(\tilde\theta_n - \theta\right) = \sqrt n\left(g\left(\frac{1}{n}\sum_{i=1}^n U_i\right) - g\left(\mathbb{E}_\theta[U_1]\right)\right)$$
whence the interval
$$\left[\tilde\theta_n - \phi_{1-\alpha/2}\sqrt{\frac{v(\tilde\theta_n)}{n}},\ \tilde\theta_n + \phi_{1-\alpha/2}\sqrt{\frac{v(\tilde\theta_n)}{n}}\right].$$
2. By the Delta Method, Φ(Zn ) is a consistent and asymptotically normal estimator of Φ(g(θ)),
with asymptotic variance Φ′ (g(θ))2 V (θ).
so that
$$I_n^\Phi = \left[\Phi^{-1}\left(\Phi(Z_n) - \Phi'(Z_n)\,\phi_{1-\alpha/2}\sqrt{\frac{\hat V_n}{n}}\right),\ \Phi^{-1}\left(\Phi(Z_n) + \Phi'(Z_n)\,\phi_{1-\alpha/2}\sqrt{\frac{\hat V_n}{n}}\right)\right].$$
4. Let us define
$$s_n = \phi_{1-\alpha/2}\sqrt{\frac{\hat V_n}{n}},$$
so that the width of the interval $I_n^\Phi$ writes $\varphi_n(s_n)$, while the width of $I_n$ is $2s_n$. It is obvious that $\varphi_n(0) = 0$, and
$$\varphi_n'(s) = \Phi'(Z_n)\,(\Phi^{-1})'\!\left(\Phi(Z_n) + \Phi'(Z_n)s\right) + \Phi'(Z_n)\,(\Phi^{-1})'\!\left(\Phi(Z_n) - \Phi'(Z_n)s\right).$$
As a consequence,
$$\varphi_n'(0) = 2\,\Phi'(Z_n)\,(\Phi^{-1})'(\Phi(Z_n)) = 2.$$
We deduce that, at first order in $s_n$, the width of $I_n^\Phi$ writes $\varphi_n(0) + \varphi_n'(0)s_n = 2s_n$, from which we conclude that:
C.3 Correction of the exercises of Chapter 3 159
one can take $\Phi(\lambda) = \log\lambda$. With this choice, the function $\varphi_n$ introduced in Question 4 writes
$$\varphi_n(s) = \exp\left(\log\hat\lambda_n + \frac{s}{\hat\lambda_n}\right) - \exp\left(\log\hat\lambda_n - \frac{s}{\hat\lambda_n}\right) = 2\hat\lambda_n \sinh\left(\frac{s}{\hat\lambda_n}\right),$$
which is convex on $[0, +\infty)$. As a consequence, the confidence interval obtained with variance stabilisation is larger than the original one.
2. Following the approach of Section 3.2.2, we consider the test with rejection region
$$W_n = \{p_0 \notin I_n\} = \left\{\left|\bar X_n - p_0\right| > \sqrt{-\frac{\log(\alpha/2)}{2n}}\right\}.$$
Then, by Question 1,
$$\mathbb{P}_{p_0}(W_n) = \mathbb{P}_{p_0}(p_0 \notin I_n) \le \alpha,$$
so that the level of this test is at most $\alpha$.
Correction of Exercise 3.A.3 You should have figured out that none of the proposed statements
are correct!
2.b By the strong Law of Large Numbers and the result of Question 1,
$$\lim_{n\to+\infty} \bar X_n = \mathbb{E}[X] = \frac{\alpha+2}{\alpha+1}\,\theta, \qquad \text{almost surely.}$$
Since
$$\frac{\alpha+2}{\alpha+1}\,\theta = x \quad \text{if and only if} \quad \alpha = \frac{2 - x/\theta}{x/\theta - 1},$$
we deduce that
$$\tilde\alpha_n = \frac{2 - \bar X_n/\theta}{\bar X_n/\theta - 1}$$
is a strongly consistent estimator of $\alpha$.
2.c To prove the asymptotic normality of $\hat\alpha_n$ and $\tilde\alpha_n$, we use the Delta Method.
Asymptotic normality of $\hat\alpha_n$. We write
$$\sqrt n\,(\hat\alpha_n - \alpha) = \sqrt n\left(g\left(\frac{1}{n}\sum_{i=1}^n \log\frac{X_i}{\theta}\right) - g\left(\frac{1}{\alpha+2}\right)\right), \qquad g(y) = -2 + \frac{1}{y}.$$
As a consequence, $\hat\alpha_n$ is asymptotically normal, with asymptotic variance
$$\left[-(\alpha+2)^2\right]^2 \frac{1}{(\alpha+2)^2} = (\alpha+2)^2.$$
Asymptotic normality of $\tilde\alpha_n$. We write
$$\sqrt n\,(\tilde\alpha_n - \alpha) = \sqrt n\left(h\left(\bar X_n\right) - h\left(\frac{\alpha+2}{\alpha+1}\,\theta\right)\right), \qquad h(x) = \frac{2 - x/\theta}{x/\theta - 1}.$$
As a consequence, $\tilde\alpha_n$ is asymptotically normal, with asymptotic variance
$$\left(-\frac{(\alpha+1)^2}{\theta}\right)^2 \frac{(\alpha+2)\,\theta^2}{\alpha(\alpha+1)^2} = \frac{(\alpha+1)^2(\alpha+2)}{\alpha}.$$
Since
$$(\alpha+2)^2 - \frac{(\alpha+1)^2(\alpha+2)}{\alpha} = -\frac{\alpha+2}{\alpha} < 0,$$
$\hat\alpha_n$ has a smaller asymptotic variance than $\tilde\alpha_n$.
2.d Since $\hat\alpha_n$ is a consistent estimator of $\alpha$, it takes larger values under H1 than under H0. Therefore we shall look for a rejection region of the form $W_n = \{\hat\alpha_n \ge a\}$. To compute the value of $a$, we write the type I error
$$\mathbb{P}_{\alpha_0}(\hat\alpha_n \ge a) = \mathbb{P}_{\alpha_0}\left(\frac{\sqrt n\,(\hat\alpha_n - \alpha_0)}{\alpha_0 + 2} \ge \frac{\sqrt n\,(a - \alpha_0)}{\alpha_0 + 2}\right) \simeq \mathbb{P}\left(Z \ge \frac{\sqrt n\,(a - \alpha_0)}{\alpha_0 + 2}\right),$$
with $Z \sim N(0, 1)$. For the right-hand side to be equal to 5%, $\sqrt n\,(a - \alpha_0)/(\alpha_0 + 2)$ must be equal to the quantile $\phi_{0.95} \simeq 1.65$ of order 95% of the $N(0, 1)$ distribution, so that one has to take
$$W_n = \left\{\hat\alpha_n \ge \alpha_0 + 1.65\,\frac{\alpha_0 + 2}{\sqrt n}\right\}.$$
This is an example of the Student–Wald asymptotic Z-test studied in Exercise 3.A.4, from which we immediately deduce that the test is consistent.
with
$$\sigma^2 = \operatorname{Var}_{\theta_0}\left(\log\frac{p(X_1;\theta_1)}{p(X_1;\theta_0)}\right).$$
Notice that both $h$ and $\sigma^2$ are virtually computable, although, depending on the model, these computations may be more or less straightforward. With the same arguments as in Exercise 3.A.4, we thus deduce that the test with rejection region
$$W_n^{\mathrm{LR}} = \left\{\frac{1}{n}\log\zeta_n(X_n) + h \ge \phi_{1-\alpha/2}\,\frac{\sigma}{\sqrt n}\right\}$$
Let $\hat P_n$ be the empirical measure associated with the realisation $x_n$. In the product in the right-hand side above, each $x \in \mathcal{X}$ appears $n\hat p_{n,x}$ times, so that the likelihood rewrites as
$$L_n(x_n; \theta) = \prod_{x\in\mathcal{X}} p_x^{\,n\hat p_{n,x}},$$
We now check that this function reaches its maximum on $\Theta$ at $\theta = \hat P_n$, by writing, for any $\theta \in \Theta$,
$$\ell_n(x_n; \hat P_n) - \ell_n(x_n; \theta) = \sum_{x\in\mathcal{X}} n\hat p_{n,x}\log(\hat p_{n,x}) - \sum_{x\in\mathcal{X}} n\hat p_{n,x}\log(p_x) = n\sum_{x\in\mathcal{X}} \hat p_{n,x}\log\frac{\hat p_{n,x}}{p_x} = n\sum_{x\in\mathcal{X}} p_x\,\phi\!\left(\frac{\hat p_{n,x}}{p_x}\right),$$
where $\phi(u) = u\log u$. Since this function is convex, Jensen's inequality yields
$$\sum_{x\in\mathcal{X}} p_x\,\phi\!\left(\frac{\hat p_{n,x}}{p_x}\right) \ge \phi\!\left(\sum_{x\in\mathcal{X}} p_x\,\frac{\hat p_{n,x}}{p_x}\right) = \phi(1) = 0,$$
1. By Definition 4.2.2, for any t ∈ [0, 1], β(t) is a centered Gaussian variable, with variance
t − t2 = t(1 − t).
\[
\mathrm{Cov}(\beta(F(x_i)), \beta(F(x_j))) = \min\{F(x_i),F(x_j)\} - F(x_i)F(x_j) = F(\min\{x_i,x_j\}) - F(x_i)F(x_j),
\]
since F is nondecreasing.
By the multidimensional Central Limit Theorem (see Theorem A.3.1 in Appendix A), Gn
converges in distribution to Nd (0, K) where K is the covariance matrix of S1 . The coeffi-
cients of this matrix are given by
\[
E\left[1_{\{X_1\le x_i\}}1_{\{X_1\le x_j\}}\right] - E\left[1_{\{X_1\le x_i\}}\right]E\left[1_{\{X_1\le x_j\}}\right] = E\left[1_{\{X_1\le \min\{x_i,x_j\}\}}\right] - E\left[1_{\{X_1\le x_i\}}\right]E\left[1_{\{X_1\le x_j\}}\right] = F(\min\{x_i,x_j\}) - F(x_i)F(x_j),
\]
1. Let x ∈ R. For all n ≥ 1, the iid random variables 1{X1 ≤x} , . . . , 1{Xn ≤x} take their values
in [0, 1] and satisfy E[1{X1 ≤x} ] = F (x), therefore by Corollary 2.4.17 p. 52,
\[
P\left(\sqrt{n}\,|\hat F_n(x) - F(x)| \ge a\right) \le 2\exp(-2a^2).
\]
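A Monte Carlo sketch (uniform distribution and illustrative values of $x$, $a$ and the sample size) confirms that the empirical probability stays below the Hoeffding bound:

```python
import math
import random

random.seed(42)
n, x, a = 50, 0.3, 1.0
F_x = 0.3  # F(x) for the uniform distribution on [0, 1]

reps = 2000
count = 0
for _ in range(reps):
    sample = [random.random() for _ in range(n)]
    Fn_x = sum(1 for s in sample if s <= x) / n  # empirical CDF at x
    if math.sqrt(n) * abs(Fn_x - F_x) >= a:
        count += 1

prob = count / reps
bound = 2 * math.exp(-2 * a * a)  # Hoeffding bound, about 0.271 for a = 1
assert prob <= bound
```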
has level
\[
P_{H_0}\left(\sup_{x\in\mathbb{R}} \sqrt{n}\,|\hat F_n(x) - F_0(x)| \ge a\right) \le 2\exp(-2a^2) = \alpha.
\]
In addition to providing an easily computable threshold $a$, this test does not require the CDF $F_0$ to be continuous.
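Solving $2\exp(-2a^2) = \alpha$ for the threshold gives $a = \sqrt{\log(2/\alpha)/2}$; a two-line check:

```python
from math import exp, log, sqrt

def ks_threshold(alpha: float) -> float:
    """Threshold a solving 2 * exp(-2 a^2) = alpha."""
    return sqrt(log(2 / alpha) / 2)

a = ks_threshold(0.05)  # about 1.358 for alpha = 5%
assert abs(2 * exp(-2 * a * a) - 0.05) < 1e-12
```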
2. The covariance matrix of $\tilde\beta = a_n(x_n\beta + \epsilon_n)$ is the covariance matrix of the vector $a_n\epsilon_n$, and since $\epsilon_n$ has covariance matrix $\sigma^2 I_n$, we deduce that the covariance matrix of $\tilde\beta$ writes
\[
a_n\,\sigma^2 I_n\,a_n^\top = \sigma^2 a_n a_n^\top.
\]
We then compute
\[
a_n a_n^\top = \left((x_n^\top x_n)^{-1}x_n^\top + d_n\right)\left((x_n^\top x_n)^{-1}x_n^\top + d_n\right)^\top = \left((x_n^\top x_n)^{-1}x_n^\top + d_n\right)\left(x_n(x_n^\top x_n)^{-1} + d_n^\top\right) = (x_n^\top x_n)^{-1} + (x_n^\top x_n)^{-1}x_n^\top d_n^\top + d_n x_n (x_n^\top x_n)^{-1} + d_n d_n^\top.
\]
By Question 1,
\[
I_{p+1} = a_n x_n = \left((x_n^\top x_n)^{-1}x_n^\top + d_n\right)x_n = I_{p+1} + d_n x_n,
\]
so that $d_n x_n = 0$.
C.5 Correction of the exercises of Chapter 5
Therefore, for any $u \in \mathbb{R}^{p+1}$,
\[
\langle u, (a_n a_n^\top - (x_n^\top x_n)^{-1})u\rangle = \langle u, d_n d_n^\top u\rangle = \|d_n^\top u\|^2 \ge 0,
\]
so that the covariance matrix of $\tilde\beta$ dominates $\sigma^2(x_n^\top x_n)^{-1}$, the covariance matrix of the least squares estimator.
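The argument can be illustrated in the scalar case $p+1 = 1$, where all matrices above are numbers: the weights of any unbiased linear estimator are the least-squares weights plus a vector $d$ orthogonal to $x$, and the cross term vanishes exactly as in the display above. A sketch with illustrative values:

```python
# Toy check of the Gauss-Markov argument with a single covariate (p + 1 = 1).
x = [1.0, 2.0, 3.0]                # design "matrix" (one column)
xtx = sum(xi * xi for xi in x)     # x^T x = 14
a_ls = [xi / xtx for xi in x]      # least-squares weights (x^T x)^{-1} x^T

d = [1.0, 1.0, -1.0]               # any d with d . x = 0 (unbiasedness)
assert sum(di * xi for di, xi in zip(d, x)) == 0.0

a = [ai + di for ai, di in zip(a_ls, d)]   # another unbiased linear estimator
# Variances (up to the factor sigma^2): a a^T versus (x^T x)^{-1}.
var_ls = sum(ai * ai for ai in a_ls)       # = 1 / 14
var_a = sum(ai * ai for ai in a)           # = 1/14 + ||d||^2: cross term is 0
assert abs(var_ls - 1 / xtx) < 1e-12
assert var_a >= var_ls
```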
3. If $\sigma^2$ is estimated by $\hat\sigma^2$, we now write
\[
C(a) = \left\{\beta \in \mathbb{R}^{p+1} : \sum_{j=1}^{p+1} \frac{\lambda_j}{\hat\sigma^2}\,\langle\hat\beta - \beta, e_j\rangle^2 \le a\right\}
\]
and look for the value of $a$ for which the identity $P(\beta \in C(a)) = 1-\alpha$ remains valid. To this aim, we rewrite
\[
\sum_{j=1}^{p+1} \lambda_j\,\langle\hat\beta - \beta, e_j\rangle^2 = \sigma^2 Z_1, \qquad Z_1 \sim \chi^2(p+1),
\]
and
\[
\hat\sigma^2 = \sigma^2\,\frac{Z_2}{n-p-1}, \qquad Z_2 \sim \chi^2(n-p-1),
\]
where Proposition 5.1.11 ensures that $Z_1$ and $Z_2$ are independent. As a consequence,
\[
\sum_{j=1}^{p+1} \frac{\lambda_j}{\hat\sigma^2}\,\langle\hat\beta - \beta, e_j\rangle^2 = (p+1)\,\frac{Z_1/(p+1)}{Z_2/(n-p-1)},
\]
where the ratio in the right-hand side follows the Fisher distribution $F(p+1, n-p-1)$, so that one has to take $a = (p+1)f_{1-\alpha}$, with $f_{1-\alpha}$ the quantile of order $1-\alpha$ of this distribution.
By the definition of the principal components, the third term in the sequence of inequalities above rewrites
\[
\sum_{l=1}^p \gamma_l c^l = \sum_{l=1}^p \gamma_l \sum_{j=1}^p e_l^j x^j = \sum_{j=1}^p \left(\sum_{l=1}^p \gamma_l e_l^j\right) x^j,
\]
therefore the identification of the coefficients yields
\[
\beta_j = \sum_{l=1}^p \gamma_l e_l^j.
\]
3. The orthogonal projections of yn on the range of xn and on the range of cn are the same,
because these two spaces are identical.
4. By Proposition 1.2.4, the principal components $c^1,\dots,c^k$ are pairwise orthogonal in $\mathbb{R}^n$. As a consequence, the orthogonal projection of $y_n$ onto the range of $c_n^k$ writes
\[
\hat y_n^k = \sum_{l=1}^k \hat\gamma_l^k c^l.
\]
The definition of $\hat\beta^k$ rewrites $\hat\beta^k = P_k\hat\gamma^k$, where $P_k$ is the $p\times k$ matrix whose columns are the vectors $e_1,\dots,e_k$. By Proposition 5.1.11, $\hat\gamma^k$ is a Gaussian vector with covariance matrix $\sigma^2 (c_n^{k\top} c_n^k)^{-1}$. On the other hand, Proposition 1.2.4 shows that the matrix $c_n^{k\top} c_n^k$ is diagonal with coefficients $n\lambda_1,\dots,n\lambda_k$. As a consequence, the covariance matrix of the vector $\hat\beta^k$ writes $\frac{\sigma^2}{n} P_k\,\mathrm{diag}(\lambda_1^{-1},\dots,\lambda_k^{-1})\,P_k^\top$, therefore
\[
\mathrm{Var}(\langle\hat\beta^k, u\rangle) = \frac{\sigma^2}{n}\sum_{l=1}^k \frac{\langle u, e_l\rangle^2}{\lambda_l}.
\]
l=1
where $(U_{1,1},\dots,U_{1,n_1})$ and $(U_{2,1},\dots,U_{2,n_2})$ are independent samples of uniformly distributed random variables on $[0,1]$. Performing the change of variable $u = F(x)$ and using the continuity of $F$ allows us to rewrite
\[
X_{n_1,n_2} = \sup_{u\in(0,1)} \left|\frac{1}{n_1}\sum_{i_1=1}^{n_1} 1_{\{U_{1,i_1}\le u\}} - \frac{1}{n_2}\sum_{i_2=1}^{n_2} 1_{\{U_{2,i_2}\le u\}}\right|,
\]
Correction of Exercise 6.A.2 In the example, Zi = X1,i − X2,i = ǫ1,i − ǫ2,i . Under H0′ , the
pairs (ǫ1,i , ǫ2,i ) and (ǫ2,i , ǫ1,i ) have the same distribution, so that ǫ1,i − ǫ2,i and ǫ2,i − ǫ1,i have the
same distribution, which implies that the law of Zi is symmetric. Therefore, H0′ ⊂ H0 , so that if
Wn is the rejection region for a test of level α for the null hypothesis H0 , then
\[
\sup_{H_0'} P(W_n) \le \sup_{H_0} P(W_n) = \alpha.
\]
In other words, the test rejecting H0′ on the event Wn has level at most α.
1. Let f : {−1, 1} → R and g : (0, +∞) → R be bounded functions. The symmetry of the
law of ζ yields
E[f (sign(ζ))g(|ζ|)] = E[f (sign(−ζ))g(| − ζ|)] = E[f (− sign(ζ))g(|ζ|)],
so that
\[
E[f(\mathrm{sign}(\zeta))g(|\zeta|)] = \frac{1}{2}\big(E[f(\mathrm{sign}(\zeta))g(|\zeta|)] + E[f(-\mathrm{sign}(\zeta))g(|\zeta|)]\big) = \frac{1}{2}\,E\big[\underbrace{(f(\mathrm{sign}(\zeta)) + f(-\mathrm{sign}(\zeta)))}_{=f(1)+f(-1)}\,g(|\zeta|)\big] = \frac{f(1)+f(-1)}{2}\,E[g(|\zeta|)],
\]
which shows that sign(ζ) and |ζ| are independent, with P(sign(ζ) = 1) = P(sign(ζ) =
−1) = 1/2. (Notice that we have used the assumption that P(ζ = 0) = 0, which follows
from (∗), to ensure that f (sign(ζ)) + f (− sign(ζ)) = f (1) + f (−1), almost surely.)
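By way of illustration, a simulation sketch (standard Gaussian $\zeta$, an arbitrary symmetric law with $P(\zeta = 0) = 0$): conditionally on the magnitude being large, the sign remains a fair coin flip.

```python
import random

random.seed(7)
reps = 20000
big, big_pos = 0, 0
for _ in range(reps):
    z = random.gauss(0.0, 1.0)   # symmetric law with P(z = 0) = 0
    if abs(z) > 1.0:             # condition on the magnitude
        big += 1
        if z > 0:
            big_pos += 1

# Conditionally on |z| > 1, the sign is still +1 with probability close to 1/2.
p = big_pos / big
assert abs(p - 0.5) < 0.05
```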
2. Let $\epsilon_1,\dots,\epsilon_n \in \{-1,1\}$. We write
\[
P\big(\mathrm{sign}(\zeta_{\pi(1)}) = \epsilon_1,\dots,\mathrm{sign}(\zeta_{\pi(n)}) = \epsilon_n\big) = \sum_{\sigma\in S_n} P\big(\mathrm{sign}(\zeta_{\pi(1)}) = \epsilon_1,\dots,\mathrm{sign}(\zeta_{\pi(n)}) = \epsilon_n,\ \pi = \sigma\big) = \sum_{\sigma\in S_n} P\big(\mathrm{sign}(\zeta_{\sigma(1)}) = \epsilon_1,\dots,\mathrm{sign}(\zeta_{\sigma(n)}) = \epsilon_n,\ |\zeta_{\sigma(1)}| < \dots < |\zeta_{\sigma(n)}|\big).
\]
By the result of Question 1, the vectors $(\mathrm{sign}(\zeta_1),\dots,\mathrm{sign}(\zeta_n))$ and $(|\zeta_1|,\dots,|\zeta_n|)$ are independent, therefore for any $\sigma \in S_n$,
\[
P\big(\mathrm{sign}(\zeta_{\sigma(1)}) = \epsilon_1,\dots,\mathrm{sign}(\zeta_{\sigma(n)}) = \epsilon_n,\ |\zeta_{\sigma(1)}| < \dots < |\zeta_{\sigma(n)}|\big) = P\big(\mathrm{sign}(\zeta_{\sigma(1)}) = \epsilon_1,\dots,\mathrm{sign}(\zeta_{\sigma(n)}) = \epsilon_n\big)\,P\big(|\zeta_{\sigma(1)}| < \dots < |\zeta_{\sigma(n)}|\big) = \frac{1}{2^n}\,P(\pi = \sigma),
\]
where at the last line, we have used the fact that the variables sign(ζ1 ), . . . , sign(ζn ) are
independent Rademacher variables. We deduce that
\[
P\big(\mathrm{sign}(\zeta_{\pi(1)}) = \epsilon_1,\dots,\mathrm{sign}(\zeta_{\pi(n)}) = \epsilon_n\big) = \frac{1}{2^n}\sum_{\sigma\in S_n} P(\pi = \sigma) = \frac{1}{2^n},
\]
where $\ell_1,\dots,\ell_n$ are independent $\mathcal{B}(1/2)$ variables. This variable does not depend on the law of $Z_1$, so that the statistic $T^+$ is free under $H_0$. Denoting by $t^+_{n,r}$ the quantile of order $r$ of $\tau^+$, we deduce that the test rejecting $H_0$ as soon as $T^+ \not\in [t^+_{n,\alpha/2},\, t^+_{n,1-\alpha/2}]$ has level $\alpha$. These quantiles may be computed by numerical simulation.
4. Under $H_0$, the expectation of $T^+$ is equal to
\[
t_n = E[\tau^+] = \sum_{k=1}^n k\,E[\ell_k] = \frac{1}{2}\sum_{k=1}^n k = \frac{n(n+1)}{4},
\]
and, by independence of the $\ell_k$, its variance writes $\sigma_n^2 = \sum_{k=1}^n k^2\,\mathrm{Var}(\ell_k) = \frac{1}{4}\sum_{k=1}^n k^2 = \frac{n(n+1)(2n+1)}{24}$.
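The value of $t_n$ can be confirmed by exhaustive enumeration of the $2^n$ outcomes of $(\ell_1,\dots,\ell_n)$ for a small $n$ ($n = 8$ below):

```python
from itertools import product

n = 8
total = 0
for signs in product([0, 1], repeat=n):   # each l_k in {0, 1}, equally likely
    total += sum(k * l for k, l in zip(range(1, n + 1), signs))
mean = total / 2 ** n

assert mean == n * (n + 1) / 4   # = 18.0 for n = 8
```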
5. We proceed as for the proof of the Central Limit Theorem, and compute the characteristic function $\Phi_{w_n}$ of the reduced variable
\[
w_n = \frac{\tau^+ - t_n}{\sigma_n}.
\]
For all $u \in \mathbb{R}$,
\[
\Phi_{w_n}(u) = E[\exp(iuw_n)] = E\left[\exp\left(iu\,\frac{\tau^+ - t_n}{\sigma_n}\right)\right] = E\left[\exp\left(\frac{iu}{\sigma_n}\sum_{k=1}^n k(\ell_k - 1/2)\right)\right] = \prod_{k=1}^n E\left[\exp\left(\frac{iu}{\sigma_n}\,k(\ell_k - 1/2)\right)\right] = \prod_{k=1}^n \left(\frac{1}{2}\exp\left(\frac{iuk}{2\sigma_n}\right) + \frac{1}{2}\exp\left(-\frac{iuk}{2\sigma_n}\right)\right).
\]
C.6 Correction of the exercises of Chapter 6
The expression of $\sigma_n^2$ computed above shows that $k/\sigma_n = O(1/\sqrt{n})$, uniformly over $k \le n$, so that for any $k \in \{1,\dots,n\}$,
\[
\frac{1}{2}\exp\left(\frac{iuk}{2\sigma_n}\right) + \frac{1}{2}\exp\left(-\frac{iuk}{2\sigma_n}\right) = \frac{1}{2}\left(1 + \frac{iuk}{2\sigma_n} - \frac{u^2k^2}{8\sigma_n^2} + o\left(\frac{1}{n}\right)\right) + \frac{1}{2}\left(1 - \frac{iuk}{2\sigma_n} - \frac{u^2k^2}{8\sigma_n^2} + o\left(\frac{1}{n}\right)\right) = 1 - \frac{u^2k^2}{8\sigma_n^2} + o\left(\frac{1}{n}\right).
\]
Pretending not to see that we are taking the logarithm of a complex number, we then write
\[
\log\Phi_{w_n}(u) = \sum_{k=1}^n \log\left(1 - \frac{u^2k^2}{8\sigma_n^2} + o\left(\frac{1}{n}\right)\right) = -\sum_{k=1}^n \frac{u^2k^2}{8\sigma_n^2} + o(1) = -\frac{u^2}{8\sigma_n^2}\,\frac{n(n+1)(2n+1)}{6} + o(1) = -\frac{u^2}{2} + o(1).
\]
We deduce that Φwn (u) converges to the characteristic function exp(−u2 /2) of the N(0, 1)
distribution. This allows us to construct an asymptotic test, which rejects H0 as soon as
|wn | ≥ φ1−α/2 .
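The resulting asymptotic test can be sketched as follows (the data below are illustrative; the expressions for $t_n$ and $\sigma_n^2$ are those computed in this exercise):

```python
from math import sqrt
from statistics import NormalDist

def wilcoxon_signed_rank_test(z, alpha=0.05):
    """Asymptotic signed-rank test of H0: 'the law of Z is symmetric around 0'.
    Returns (w_n, reject). Assumes no ties and no zeros in z."""
    n = len(z)
    # Rank the |z_i| in increasing order; T+ sums the ranks of positive z_i.
    order = sorted(range(n), key=lambda i: abs(z[i]))
    t_plus = sum(rank + 1 for rank, i in enumerate(order) if z[i] > 0)
    t_n = n * (n + 1) / 4                           # E[T+] under H0
    sigma_n = sqrt(n * (n + 1) * (2 * n + 1) / 24)  # sd of T+ under H0
    w_n = (t_plus - t_n) / sigma_n
    reject = abs(w_n) >= NormalDist().inv_cdf(1 - alpha / 2)
    return w_n, reject

# Roughly symmetric illustrative data: H0 should not be rejected.
w, rej = wilcoxon_signed_rank_test([0.5, -0.4, 1.2, -1.1, 0.3, -0.2, 0.9, -1.0])
```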