
A short course on nonparametric curve estimation

MSc in Applied Mathematics at EAFIT University (Colombia)


Eduardo García-Portugués
2018-02-20, v1.8
Contents

1 Introduction
1.1 Course objectives and logistics
1.2 Background and notation
1.3 Nonparametric inference
1.4 Main references and credits

2 Density estimation
2.1 Histograms
2.2 Kernel density estimation
2.3 Asymptotic properties
2.4 Bandwidth selection
2.5 Confidence intervals
2.6 Practical issues
2.7 Exercises

3 Regression estimation
3.1 Review on parametric regression
3.2 Kernel regression estimation
3.3 Asymptotic properties
3.4 Bandwidth selection
3.5 Local likelihood
3.6 Exercises

A Installation of R and RStudio

B Introduction to RStudio

C Introduction to R

Chapter 1

Introduction

The animations of these notes will not be displayed the first time they are browsed¹. See for
example Figure 1.3. To see them, click on the caption's link "Application also available here".
You will get a warning from your browser saying that "Your connection is not private". Click on
"Advanced" and allow an exception in your browser (I guarantee you I will not do anything evil!).
The next time, the animations will show up correctly within the notes.

This course is intended to provide an introduction to nonparametric estimation of the density and
regression functions from, mostly, the perspective of kernel smoothing. The emphasis is placed on building
intuition behind the methods, gaining insights into their asymptotic properties, and showing their application
through the use of statistical software.

1.1 Course objectives and logistics


The software employed in the course is the statistical language R and its most common IDE (Integrated
Development Environment) nowadays, RStudio. A basic prior knowledge of both is assumed². The appendix
presents basic introductions to RStudio and R for those students lacking basic expertise in them.
The notes contain a substantial amount of snippets of code that are fully self-contained. Students are
encouraged to bring their own laptops to the lessons to practice with the code.
The required packages for the course are:
# Install packages
install.packages(c("ks", "nor1mix", "KernSmooth", "manipulate", "locfit"))

The code in the notes may assume that the packages have been loaded, so it is better to do it now:
# Load packages
library(ks)
library(nor1mix)
library(KernSmooth)
library(locfit)

The Shiny interactive apps in the notes can be downloaded and run locally. In particular, this allows
examining their code. Check this GitHub repository for the sources.

²Among others: basic programming in R, ability to work with objects and data structures, ability to produce graphics,
knowledge of the main statistical functions, ability to run scripts in RStudio.


Each topic of the course contains a mix of theoretical and practical exercises for grading. Groups of two students
must choose three exercises in total (at least one theoretical and one practical) from Sections 2.7 and 3.6
and turn them in to be graded. The group grade is weighted according to the difficulty of the exercises,
which is given by the number of stars: easy (⋆), medium (⋆⋆), and hard (⋆ ⋆ ⋆). The final grade (0 − 5) is

$$\frac{1}{3}\sum_{i=1}^{3}\frac{\mathrm{Score}_i}{5}(2+\star_i),$$

where Score_i is the score (0 − 5) for the i-th exercise and ⋆_i represents its number of stars (1 − 3). The deadline
for submission is 2017-10-13.

1.2 Background and notation


We begin by reviewing some elementary results that will be employed during the course, which will also
serve to introduce notation.

1.2.1 Basic probability review

A triple (Ω, A, P) is called a probability space. Ω represents the sample space, the set of all possible individual
outcomes of a random experiment. A is a σ-field, a class of subsets of Ω that is closed under complementation
and countable unions, and such that Ω ∈ A. A represents the collection of possible events (combinations
of individual outcomes) that are assigned a probability by the probability measure P. A random variable is
a map X : Ω −→ R such that {ω ∈ Ω : X(ω) ≤ x} ∈ A (the set is measurable).
The cumulative distribution function (cdf) of a random variable X is F (x) := P[X ≤ x]. When an independent
and identically distributed (iid) sample X_1, ..., X_n is given, the cdf can be estimated by the empirical
distribution function (ecdf)

$$F_n(x)=\frac{1}{n}\sum_{i=1}^{n}1_{\{X_i\leq x\}},\quad (1.1)$$

where $1_A:=\begin{cases}1, & A\ \text{is true},\\ 0, & A\ \text{is false}\end{cases}$ is an indicator function. Continuous random variables are either characterized
by the cdf F or the probability density function (pdf) f = F′, which represents the infinitesimal relative
probability of X per unit of length. We write X ∼ F (or X ∼ f) to denote that X has a cdf F (or a pdf f).
If two random variables X and Y have the same distribution, we write $X\stackrel{d}{=}Y$.
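As a side note on computation, the ecdf in (1.1) is directly available in R through the ecdf function. A minimal sketch (the simulated sample and the evaluation point are arbitrary choices):
# Ecdf of a simulated sample, following (1.1)
set.seed(42)
x <- rnorm(50)
Fn <- ecdf(x)                         # Fn is a function
Fn(0)                                 # Proportion of sample points below or equal to 0
mean(x <= 0)                          # Same value, computed directly from the indicators
plot(Fn)                              # Step function with jumps of size 1/n
curve(pnorm(x), add = TRUE, col = 2)  # True cdf, for comparison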
The expectation operator is constructed using the Lebesgue–Stieltjes "dF(x)" integral. Hence, for X ∼ F, the
expectation of g(X) is

$$\mathbb{E}[g(X)]:=\int g(x)\,dF(x)=\begin{cases}\int g(x)f(x)\,dx, & X\ \text{continuous},\\ \sum_{\{i:\,\mathbb{P}[X=x_i]>0\}}g(x_i)\mathbb{P}[X=x_i], & X\ \text{discrete}.\end{cases}$$

Unless otherwise stated, the integration limits of any integral are R or R^p. The variance operator is defined
as Var[X] := E[(X − E[X])²].

We employ boldfaces to denote vectors (assumed to be column matrices) and matrices. A p-random vector is
a map X : Ω −→ R^p, X(ω) := (X_1(ω), ..., X_p(ω)), such that each X_i is a random variable. The (joint) cdf of
X is F(x) := P[X ≤ x] := P[X_1 ≤ x_1, ..., X_p ≤ x_p] and, if X is continuous, its (joint) pdf is $f:=\frac{\partial^p}{\partial x_1\cdots\partial x_p}F$.
The marginals of F and f are the cdf and pdf of X_i, i = 1, ..., p, respectively. They are defined as:

$$F_{X_i}(x_i):=\mathbb{P}[X_i\leq x_i],\qquad f_{X_i}(x_i):=\frac{\partial}{\partial x_i}F_{X_i}(x_i)=\int_{\mathbb{R}^{p-1}}f(\mathbf{x})\,d\mathbf{x}_{-i},$$

where $\mathbf{x}_{-i}:=(x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_p)$. The definitions can be extended analogously to the marginals of the
cdf and pdf of different subsets of X.
The conditional cdf and pdf of X_1 | (X_2, ..., X_p) are defined, respectively, as

$$F_{X_1|\mathbf{X}_{-1}=\mathbf{x}_{-1}}(x_1):=\mathbb{P}[X_1\leq x_1|\mathbf{X}_{-1}=\mathbf{x}_{-1}],\qquad f_{X_1|\mathbf{X}_{-1}=\mathbf{x}_{-1}}(x_1):=\frac{f(\mathbf{x})}{f_{\mathbf{X}_{-1}}(\mathbf{x}_{-1})}.$$

The conditional expectation of Y | X is the following random variable³

$$\mathbb{E}[Y|X]:=\int y\,dF_{Y|X}(y|X).$$

The conditional variance of Y | X is defined as

$$\mathrm{Var}[Y|X]:=\mathbb{E}[(Y-\mathbb{E}[Y|X])^2|X]=\mathbb{E}[Y^2|X]-\mathbb{E}[Y|X]^2.$$
Proposition 1.1 (Laws of total expectation and variance). Let X and Y be two random variables.
• Total expectation: if E[|Y |] < ∞, then E[Y ] = E[E[Y |X]].
• Total variance: if E[Y 2 ] < ∞, then Var[Y ] = E[Var[Y |X]] + Var[E[Y |X]].
Exercise 1.1. Prove the law of total variance from the law of total expectation.
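Although Exercise 1.1 asks for a proof, the law of total variance can also be checked numerically. A minimal Monte Carlo sketch, under the arbitrary model X ∼ N(0, 1) and Y | X = x ∼ N(x, 1):
# Monte Carlo check of the law of total variance
set.seed(1)
M <- 1e5
x <- rnorm(M)                 # X ~ N(0, 1)
y <- rnorm(M, mean = x)       # Y | X = x ~ N(x, 1)
var(y)                        # Var[Y], close to 2
1 + var(x)                    # E[Var[Y|X]] + Var[E[Y|X]] = 1 + Var[X], also close to 2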
We conclude with some useful inequalities.
Proposition 1.2 (Markov's inequality). Let X be a non-negative random variable with E[X] < ∞. Then

$$\mathbb{P}[X>t]\leq\frac{\mathbb{E}[X]}{t},\quad\forall t>0.$$

Proposition 1.3 (Chebyshev's inequality). Let µ = E[X] and σ² = Var[X]. Then

$$\mathbb{P}[|X-\mu|\geq t]\leq\frac{\sigma^2}{t^2},\quad\forall t>0.$$
Exercise 1.2. Prove Markov’s inequality using X = X1{X>t} + X1{X≤t} . Then prove Chebyshev’s inequal-
ity using Markov’s. Hint: use the random variable (X − E[X])2 .
Proposition 1.4 (Cauchy–Schwarz inequality). Let X and Y be such that E[X²] < ∞ and E[Y²] < ∞. Then

$$\mathbb{E}[|XY|]\leq\sqrt{\mathbb{E}[X^2]\mathbb{E}[Y^2]}.$$
Proposition 1.5 (Jensen’s inequality). If g is a convex function, then g(E[X]) ≤ E[g(X)].
Example 1.1. Jensen’s inequality has interesting derivations. For example:
• Take h = −g. Then h is a concave function and we have that h(E[X]) ≥ E[h(X)].
• Take g(x) = x^r for r ≥ 1. Then E[X]^r ≤ E[X^r]. If 0 < r < 1, then E[X]^r ≥ E[X^r].
• Consider 0 ≤ r ≤ s. Then g(x) = x^{r/s} is concave and E[|X|^r] = E[g(|X|^s)] ≤ g(E[|X|^s]) = E[|X|^s]^{r/s}. As a
consequence, E[|X|^s] < ∞ =⇒ E[|X|^r] < ∞ for 0 ≤ r ≤ s. Finite moments of higher order imply
finite moments of lower order.
3 Recall that the X-part is random!

1.2.2 Some facts about distributions

We will make use of several parametric distributions. Some notation and facts are introduced as follows:
• N(µ, σ²) stands for the normal distribution with mean µ and variance σ². Its pdf is $\phi_\sigma(x-\mu):=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$,
x ∈ R, and satisfies that $\phi_\sigma(x-\mu)=\frac{1}{\sigma}\phi\left(\frac{x-\mu}{\sigma}\right)$ (if σ = 1 the dependence is omitted).
Its cdf is denoted as Φ_σ(x − µ). z_α denotes the upper α-quantile of a N(0, 1), i.e., z_α = Φ^{-1}(1 − α).
Some uncentered moments of X ∼ N(µ, σ²) are

$$\mathbb{E}[X]=\mu,\quad\mathbb{E}[X^2]=\mu^2+\sigma^2,\quad\mathbb{E}[X^3]=\mu^3+3\mu\sigma^2,\quad\mathbb{E}[X^4]=\mu^4+6\mu^2\sigma^2+3\sigma^4.$$

The multivariate normal is represented by N_p(µ, Σ), where µ is a p-vector and Σ is a p × p symmetric
and positive definite matrix. The pdf of a N_p(µ, Σ) is $\phi_{\boldsymbol\Sigma}(\mathbf{x}-\boldsymbol\mu):=\frac{1}{(2\pi)^{p/2}|\boldsymbol\Sigma|^{1/2}}e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol\mu)'\boldsymbol\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)}$, and satisfies
that $\phi_{\boldsymbol\Sigma}(\mathbf{x}-\boldsymbol\mu)=|\boldsymbol\Sigma|^{-1/2}\phi\left(\boldsymbol\Sigma^{-1/2}(\mathbf{x}-\boldsymbol\mu)\right)$ (if Σ = I the dependence is omitted).
• The lognormal distribution is denoted by LN(µ, σ²) and is such that $\mathcal{LN}(\mu,\sigma^2)\stackrel{d}{=}\exp(\mathcal{N}(\mu,\sigma^2))$. Its
pdf is $f(x;\mu,\sigma):=\frac{1}{\sqrt{2\pi}\sigma x}e^{-\frac{(\log x-\mu)^2}{2\sigma^2}}$, x > 0. Note that $\mathbb{E}[\mathcal{LN}(\mu,\sigma^2)]=e^{\mu+\frac{\sigma^2}{2}}$.
• The exponential distribution is denoted as Exp(λ) and has pdf f (x; λ) = λe−λx , λ, x > 0.
• The gamma distribution is denoted as Γ(a, p) and has pdf $f(x;a,p)=\frac{a^p}{\Gamma(p)}x^{p-1}e^{-ax}$, a, p, x > 0, where
$\Gamma(p)=\int_0^\infty x^{p-1}e^{-x}\,dx$. It is known that $\mathbb{E}[\Gamma(a,p)]=\frac{p}{a}$ and $\mathrm{Var}[\Gamma(a,p)]=\frac{p}{a^2}$.

• The inverse gamma distribution, $\mathrm{IG}(a,p)\stackrel{d}{=}\Gamma(a,p)^{-1}$, has pdf $f(x;a,p)=\frac{a^p}{\Gamma(p)}x^{-p-1}e^{-\frac{a}{x}}$, a, p, x > 0.
It is known that $\mathbb{E}[\mathrm{IG}(a,p)]=\frac{a}{p-1}$ and $\mathrm{Var}[\mathrm{IG}(a,p)]=\frac{a^2}{(p-1)^2(p-2)}$.
• The binomial distribution is denoted as B(n, p). Recall that E[B(n, p)] = np and Var[B(n, p)] =
np(1 − p). A B(1, p) is a Bernoulli distribution, denoted as Ber(p).
• The beta distribution is denoted as β(a, b) and its pdf is $f(x;a,b)=\frac{1}{\beta(a,b)}x^{a-1}(1-x)^{b-1}$, 0 < x < 1,
where $\beta(a,b)=\frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$. When a = b = 1, the uniform distribution U(0, 1) arises.
• The Poisson distribution is denoted as Pois(λ) and has probability mass function $\mathbb{P}[X=x]=\frac{\lambda^x e^{-\lambda}}{x!}$,
x = 0, 1, 2, .... Recall that E[Pois(λ)] = Var[Pois(λ)] = λ.
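The parametrizations above can be quickly cross-checked against the corresponding R simulation routines; note that, in the notation above, rgamma uses shape = p and rate = a. A minimal sketch with arbitrary parameter values:
# Monte Carlo check of some of the previous moments
set.seed(123)
M <- 1e5
x <- rgamma(M, shape = 3, rate = 2)       # Gamma(a = 2, p = 3): E = p/a, Var = p/a^2
c(mean(x), 3 / 2)
c(var(x), 3 / 2^2)
y <- rlnorm(M, meanlog = 1, sdlog = 0.5)  # LN(1, 0.25): E = exp(mu + sigma^2/2)
c(mean(y), exp(1 + 0.5^2 / 2))
z <- rpois(M, lambda = 4)                 # Pois(4): mean = variance = lambda
c(mean(z), var(z))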

1.2.3 Basic stochastic convergence review

Let Xn be a sequence of random variables defined in a common probability space (Ω, A, P). The four most
common types of convergence of Xn to a random variable in (Ω, A, P) are the following.
Definition 1.1 (Convergence in distribution). X_n converges in distribution to X, written $X_n\stackrel{d}{\longrightarrow}X$, if
lim_{n→∞} F_n(x) = F(x) for all x for which F is continuous, where X_n ∼ F_n and X ∼ F.

Definition 1.2 (Convergence in probability). X_n converges in probability to X, written $X_n\stackrel{P}{\longrightarrow}X$, if
lim_{n→∞} P[|X_n − X| > ε] = 0, ∀ε > 0.

Definition 1.3 (Convergence almost surely). X_n converges almost surely (as) to X, written $X_n\stackrel{as}{\longrightarrow}X$, if
P[{ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)}] = 1.

Definition 1.4 (Convergence in r-mean). X_n converges in r-mean to X, written $X_n\stackrel{r}{\longrightarrow}X$, if
lim_{n→∞} E[|X_n − X|^r] = 0.
Remark. The previous definitions can be extended to a sequence of p-random vectors Xn . For Definitions
1.2 and 1.4, replace | · | by the Euclidean norm || · ||. Alternatively, Definition 1.2 can be extended marginally:
P P
Xn −→ X : ⇐⇒ Xj,n −→ Xj , ∀j = 1, . . . , p. For Definition 1.1, replace Fn and F by the joint cdfs of Xn
and X, respectively. Definition 1.3 extends marginally as well.

The 2-mean convergence plays a remarkable role in defining a consistent estimator θ̂_n = T(X_1, ..., X_n) of a
parameter θ. We say that the estimator is consistent if its mean squared error (MSE),

$$\mathrm{MSE}[\hat\theta_n]:=\mathbb{E}[(\hat\theta_n-\theta)^2]=(\mathbb{E}[\hat\theta_n]-\theta)^2+\mathrm{Var}[\hat\theta_n]=:\mathrm{Bias}[\hat\theta_n]^2+\mathrm{Var}[\hat\theta_n],$$

goes to zero as n → ∞. Equivalently, if $\hat\theta_n\stackrel{2}{\longrightarrow}\theta$.

If $X_n\stackrel{d,P,r,as}{\longrightarrow}X$ and X is a degenerate random variable such that P[X = c] = 1, c ∈ R, then we write
$X_n\stackrel{d,P,r,as}{\longrightarrow}c$ (the list notation is used to condense four different convergence results in the same line).
The relations between the types of convergences are conveniently summarized in the next proposition.
Proposition 1.6. Let X_n be a sequence of random variables and X a random variable. Then the following
implication diagram is satisfied:

$$X_n\stackrel{r}{\longrightarrow}X\implies X_n\stackrel{P}{\longrightarrow}X\impliedby X_n\stackrel{as}{\longrightarrow}X,\qquad X_n\stackrel{P}{\longrightarrow}X\implies X_n\stackrel{d}{\longrightarrow}X.$$

None of the converses hold in general. However, there are some notable exceptions:

i. If $X_n\stackrel{d}{\longrightarrow}c$, then $X_n\stackrel{P}{\longrightarrow}c$.
ii. If $\forall\varepsilon>0$, $\sum_{n=1}^{\infty}\mathbb{P}[|X_n-X|>\varepsilon]<\infty$ (implies⁴ $X_n\stackrel{P}{\longrightarrow}X$), then $X_n\stackrel{as}{\longrightarrow}X$.
iii. If $X_n\stackrel{P}{\longrightarrow}X$ and P[|X_n| ≤ M] = 1, ∀n ∈ N and M > 0, then $X_n\stackrel{r}{\longrightarrow}X$ for r ≥ 1.
iv. If $S_n=\sum_{i=1}^nX_i$ with X_1, ..., X_n iid, then $S_n\stackrel{P}{\longrightarrow}S\iff S_n\stackrel{as}{\longrightarrow}S$.

Also, if s ≥ r ≥ 1, then $X_n\stackrel{s}{\longrightarrow}X\implies X_n\stackrel{r}{\longrightarrow}X$.
The cornerstone limit result in probability is the central limit theorem (CLT). In its simplest version, it entails
that:

Theorem 1.1 (CLT). Let X_n be a sequence of iid random variables with E[X_i] = µ and Var[X_i] = σ² < ∞,
i ∈ N. Then:

$$\frac{\sqrt{n}(\bar X-\mu)}{\sigma}\stackrel{d}{\longrightarrow}\mathcal{N}(0,1).$$
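A quick Monte Carlo visualization of the CLT; the Exp(1) distribution (for which µ = σ = 1) and the sample size are arbitrary choices in this sketch:
# CLT in action: standardized sample means of Exp(1) data
set.seed(1234)
n <- 50; M <- 5000
samples <- matrix(rexp(n * M, rate = 1), nrow = M, ncol = n)
z <- sqrt(n) * (rowMeans(samples) - 1) / 1   # mu = sigma = 1 for Exp(1)
hist(z, probability = TRUE, breaks = 40)
curve(dnorm(x), add = TRUE, col = 2)         # N(0, 1) limit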
We will use later the following CLT for random variables that are independent but not identically distributed
due to its easy-to-check moment conditions.
Theorem 1.2 (Lyapunov's CLT). Let X_n be a sequence of independent random variables with E[X_i] = µ_i
and Var[X_i] = σ_i² < ∞, i ∈ N, and such that for some δ > 0

$$\frac{1}{s_n^{2+\delta}}\sum_{i=1}^n\mathbb{E}\left[|X_i-\mu_i|^{2+\delta}\right]\longrightarrow0\quad\text{as }n\to\infty,$$

with $s_n^2=\sum_{i=1}^n\sigma_i^2$. Then:

$$\frac{1}{s_n}\sum_{i=1}^n(X_i-\mu_i)\stackrel{d}{\longrightarrow}\mathcal{N}(0,1).$$

Finally, the next results will be useful (′ denotes transposition).

⁴Intuitively: if convergence in probability is fast enough, then we have almost sure convergence.

Theorem 1.3 (Cramér–Wold device). Let X_n be a sequence of p-dimensional random vectors. Then:

$$\mathbf{X}_n\stackrel{d}{\longrightarrow}\mathbf{X}\iff\mathbf{c}'\mathbf{X}_n\stackrel{d}{\longrightarrow}\mathbf{c}'\mathbf{X},\quad\forall\mathbf{c}\in\mathbb{R}^p.$$
Theorem 1.4 (Continuous mapping theorem). If $X_n\stackrel{d,P,as}{\longrightarrow}X$, then $g(X_n)\stackrel{d,P,as}{\longrightarrow}g(X)$ for any continuous
function g.

Theorem 1.5 (Slutsky's theorem). Let X_n and Y_n be sequences of random variables.

i. If $X_n\stackrel{d}{\longrightarrow}X$ and $Y_n\stackrel{P}{\longrightarrow}c$, then $X_nY_n\stackrel{d}{\longrightarrow}cX$.
ii. If $X_n\stackrel{d}{\longrightarrow}X$ and $Y_n\stackrel{P}{\longrightarrow}c$, c ≠ 0, then $\frac{X_n}{Y_n}\stackrel{d}{\longrightarrow}\frac{X}{c}$.
iii. If $X_n\stackrel{d}{\longrightarrow}X$ and $Y_n\stackrel{P}{\longrightarrow}c$, then $X_n+Y_n\stackrel{d}{\longrightarrow}X+c$.
Theorem 1.6 (Limit algebra for (P, r, as)-convergence). Let X_n and Y_n be sequences of random variables
and a_n → a and b_n → b two deterministic sequences. Denote by $X_n\stackrel{P,r,as}{\longrightarrow}X$ that "X_n converges in probability (respectively
almost surely, respectively in r-mean) to X".

i. If $X_n\stackrel{P,r,as}{\longrightarrow}X$ and $Y_n\stackrel{P,r,as}{\longrightarrow}Y$, then $a_nX_n+b_nY_n\stackrel{P,r,as}{\longrightarrow}aX+bY$.
ii. If $X_n\stackrel{P,as}{\longrightarrow}X$ and $Y_n\stackrel{P,as}{\longrightarrow}Y$, then $X_nY_n\stackrel{P,as}{\longrightarrow}XY$.

Remark. Note the absence of results for convergence in distribution. They do not hold in general. Particularly:
$X_n\stackrel{d}{\longrightarrow}X$ and $Y_n\stackrel{d}{\longrightarrow}Y$ do not imply $X_n+Y_n\stackrel{d}{\longrightarrow}X+Y$.
Theorem 1.7 (Delta method). If $\sqrt{n}(X_n-\mu)\stackrel{d}{\longrightarrow}\mathcal{N}(0,\sigma^2)$, then $\sqrt{n}(g(X_n)-g(\mu))\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,(g'(\mu))^2\sigma^2\right)$
for any function g that is differentiable at µ with g′(µ) ≠ 0.
Example 1.2. It is well known that, given a parametric density f_θ, with parameter θ ∈ Θ, and iid
X_1, ..., X_n ∼ f_θ, the maximum likelihood (ML) estimator $\hat\theta_{\mathrm{ML}}:=\arg\max_{\theta\in\Theta}\sum_{i=1}^n\log f_\theta(X_i)$ (the
parameter that maximizes the probability of the data based on the model) converges to a normal under
certain regularity conditions:

$$\sqrt{n}(\hat\theta_{\mathrm{ML}}-\theta)\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,I(\theta)^{-1}\right),$$

where $I(\theta):=-\mathbb{E}_\theta\left[\frac{\partial^2\log f_\theta(x)}{\partial\theta^2}\right]$. Then, it is satisfied that:

$$\sqrt{n}(g(\hat\theta_{\mathrm{ML}})-g(\theta))\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,(g'(\theta))^2I(\theta)^{-1}\right).$$

If we had applied the continuous mapping theorem for g instead, we would have obtained

$$g(\sqrt{n}(\hat\theta_{\mathrm{ML}}-\theta))\stackrel{d}{\longrightarrow}g\left(\mathcal{N}\left(0,I(\theta)^{-1}\right)\right),$$

which is a different and much less informative statement.
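The delta method of Example 1.2 can be visualized by simulation. The sketch below uses the Exp(λ) model, for which λ̂_ML = X̄^{-1} and I(λ) = λ^{-2}, and the (arbitrary) transformation g(λ) = log λ, so the limit variance (g′(λ))²I(θ)^{-1} equals 1:
# Delta method for the MLE of an Exp(lambda): g(lambda) = log(lambda)
set.seed(42)
lambda <- 2; n <- 200; M <- 5000
lambda_hat <- replicate(M, 1 / mean(rexp(n, rate = lambda)))
z <- sqrt(n) * (log(lambda_hat) - log(lambda))   # Should be approximately N(0, 1)
hist(z, probability = TRUE, breaks = 40)
curve(dnorm(x), add = TRUE, col = 2)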

Theorem 1.8 (Weak and strong laws of large numbers). Let X_n be an iid sequence with E[X_i] = µ, i ≥ 1.
Then: $\frac{1}{n}\sum_{i=1}^nX_i\stackrel{P,as}{\longrightarrow}\mu$.

1.2.4 OP and oP notation

In computer science, the O notation is used to measure the complexity of algorithms. For example, when an
algorithm is O(n²), it is said that it is quadratic in time and we know that it is going to take on the order of
n² operations to process an input of size n. We do not care about the specific number of computations, but
focus on the big picture by looking for an upper bound for the sequence of computation times in terms of n.
This upper bound disregards constants. For example, the dot product between two vectors of size n is an
O(n) operation, even though it takes n multiplications and n − 1 sums, hence 2n − 1 operations.
In mathematical analysis, O-related notation is mostly used with the aim of bounding sequences that shrink
to zero. The technicalities are however the same.
Definition 1.5 (Big O). Given two strictly positive sequences a_n and b_n,

$$a_n=O(b_n):\iff\lim_{n\to\infty}\frac{a_n}{b_n}\leq C,\quad\text{for a }C>0.$$

If a_n = O(b_n), then we say that a_n is big-O of b_n. To indicate that a_n is bounded, we write a_n = O(1).

Definition 1.6 (Little o). Given two strictly positive sequences a_n and b_n,

$$a_n=o(b_n):\iff\lim_{n\to\infty}\frac{a_n}{b_n}=0.$$

If a_n = o(b_n), then we say that a_n is little-o of b_n. To indicate that a_n → 0, we write a_n = o(1).


The interpretation of these two definitions is simple:
• a_n = O(b_n) means that a_n is "not larger than" b_n asymptotically. If a_n, b_n → 0, then it means that
a_n "does not decrease more slowly" than b_n.
• a_n = o(b_n) means that a_n is "smaller than" b_n asymptotically. If a_n, b_n → 0, then it means that a_n
"decreases faster" than b_n.
Obviously, little-o implies big-O. Playing with limits we can get a list of useful facts:
Proposition 1.7. Consider three sequences a_n, b_n, c_n → 0. The following properties hold:
i. kO(a_n) = O(a_n), ko(a_n) = o(a_n), k ∈ R.
ii. o(a_n) + o(b_n) = o(a_n + b_n), O(a_n) + O(b_n) = O(a_n + b_n).
iii. o(a_n) + O(b_n) = O(a_n + b_n), o(a_n)o(b_n) = o(a_n b_n).
iv. O(a_n)O(b_n) = O(a_n b_n), o(a_n)O(b_n) = o(a_n b_n).
v. o(1)O(a_n) = o(a_n).
vi. $a_n^r=o(a_n^s)$, for r > s ≥ 0.
vii. $a_nb_n=o(a_n^2+b_n^2)$.
viii. $a_nb_n=o(a_n+b_n)$.
ix. $(a_n+b_n)^k=O(a_n^k+b_n^k)$.
The last result is a consequence of a useful inequality:
Lemma 1.1 (Cp inequality). Given a, b ∈ R and p > 0,

$$|a+b|^p\leq C_p(|a|^p+|b|^p),\qquad C_p=\begin{cases}1, & p\leq1,\\ 2^{p-1}, & p>1.\end{cases}$$

The previous notation is purely deterministic. Let's add some stochastic flavor by establishing the stochastic
analogue of little-o.
Definition 1.7 (Little o_P). Given a strictly positive sequence a_n and a sequence of random variables X_n,

$$X_n=o_P(a_n):\iff\frac{|X_n|}{a_n}\stackrel{P}{\longrightarrow}0\iff\lim_{n\to\infty}\mathbb{P}\left[\frac{|X_n|}{a_n}>\varepsilon\right]=0,\quad\forall\varepsilon>0.$$

If X_n = o_P(a_n), then we say that X_n is little-o_P of a_n. To indicate that $X_n\stackrel{P}{\longrightarrow}0$, we write X_n = o_P(1).
Therefore, little-o_P allows one to easily quantify the speed at which a sequence of random variables converges to
zero in probability.

Example 1.3. Let $Y_n=o_P(n^{-1/2})$ and $Z_n=o_P(n^{-1})$. Then Z_n converges faster to zero in probability
than Y_n. To visualize this, recall that the little-o_P and limit definitions entail that

$$\forall\varepsilon,\delta>0,\ \exists\,n_0=n_0(\varepsilon,\delta):\ \forall n\geq n_0,\ \mathbb{P}[|X_n|>a_n\varepsilon]<\delta.$$

Therefore, for fixed ε, δ > 0 and a fixed n ≥ max(n_{0,Y}, n_{0,Z}), we have $\mathbb{P}\left[Y_n\in(-n^{-1/2}\varepsilon,n^{-1/2}\varepsilon)\right]\geq1-\delta$ and
$\mathbb{P}\left[Z_n\in(-n^{-1}\varepsilon,n^{-1}\varepsilon)\right]\geq1-\delta$, but the latter interval is much shorter, hence Z_n is more tightly concentrated
around 0.

Definition 1.8 (Big O_P). Given a strictly positive sequence a_n and a sequence of random variables X_n,

$$X_n=O_P(a_n):\iff\forall\varepsilon>0,\ \exists\,C_\varepsilon>0,\ n_0=n_0(\varepsilon):\ \forall n\geq n_0,\ \mathbb{P}\left[\frac{|X_n|}{a_n}>C_\varepsilon\right]<\varepsilon\iff\lim_{C\to\infty}\lim_{n\to\infty}\mathbb{P}\left[\frac{|X_n|}{a_n}>C\right]=0.$$

If X_n = O_P(a_n), then we say that X_n is big-O_P of a_n.


Example 1.4. Chebyshev's inequality entails that $\mathbb{P}[|X-\mathbb{E}[X]|\geq t]\leq\frac{\mathrm{Var}[X]}{t^2}$, ∀t > 0. Setting $\varepsilon:=\frac{\mathrm{Var}[X]}{t^2}$
and $C_\varepsilon:=\frac{1}{\sqrt{\varepsilon}}$, then $\mathbb{P}\left[|X-\mathbb{E}[X]|\geq\sqrt{\mathrm{Var}[X]}\,C_\varepsilon\right]\leq\varepsilon$. Therefore,

$$X-\mathbb{E}[X]=O_P\left(\sqrt{\mathrm{Var}[X]}\right).$$

Exercise 1.3. Prove that if $X_n\stackrel{d}{\longrightarrow}X$, then X_n = O_P(1). Hint: use the double-limit definition and note
that X = O_P(1).
Example 1.5 (Example 1.18 in DasGupta (2008)). Suppose that $a_n(X_n-c_n)\stackrel{d}{\longrightarrow}X$ for deterministic
sequences a_n and c_n such that c_n → c. Then, if a_n → ∞, X_n − c = o_P(1). The argument is simple:

$$X_n-c=X_n-c_n+c_n-c=\frac{1}{a_n}a_n(X_n-c_n)+o(1)=\frac{1}{a_n}O_P(1)+o(1)=o(1)O_P(1)+o(1)=o_P(1).$$
Exercise 1.4. Using the previous example, derive the weak law of large numbers as a consequence of the
CLT, both for id and non-id independent random variables.
Proposition 1.8. Consider two strictly positive sequences an , bn → 0. The following properties hold:
i. oP (an ) = OP (an ) (little-oP implies big-OP ).
ii. o(1) = oP (1), O(1) = OP (1) (deterministic implies probabilistic).
iii. kOP (an ) = OP (an ), koP (an ) = oP (an ), k ∈ R.
iv. oP (an ) + oP (bn ) = oP (an + bn ), OP (an ) + OP (bn ) = OP (an + bn ).
v. oP (an ) + OP (bn ) = OP (an + bn ), oP (an )oP (bn ) = oP (an bn ).
vi. OP (an )OP (bn ) = OP (an bn ), oP (an )OP (bn ) = oP (an bn ).
vii. oP (1)OP (an ) = oP (an ).
viii. (1 + oP (1))−1 = OP (1).

1.2.5 Basic analytical tools

We will make use of the following well-known analytical results.


Theorem 1.9 (Mean value theorem). Let f : [a, b] −→ R be a continuous function and differentiable in
(a, b). Then there exists c ∈ (a, b) such that f (b) − f (a) = f ′ (c)(b − a).
Theorem 1.10 (Integral mean value theorem). Let f : [a, b] −→ R be a continuous function over (a, b).
Then there exists c ∈ (a, b) such that $\int_a^bf(x)\,dx=f(c)(b-a)$.

Theorem 1.11 (Taylor's theorem). Let f : R −→ R and x ∈ R. Assume that f has p continuous derivatives
in an interval (x − δ, x + δ) for some δ > 0. Then for any 0 < h < δ,

$$f(x+h)=\sum_{j=0}^{p}\frac{f^{(j)}(x)}{j!}h^j+R_n,\qquad R_n=o(h^p).$$

Remark. The remainder R_n depends on x ∈ R. Explicit control of R_n is possible if f is further assumed to be
(p + 1) times differentiable in (x − δ, x + δ). In this case, $R_n=\frac{f^{(p+1)}(\xi_x)}{(p+1)!}h^{p+1}=o(h^p)$, for some ξ_x ∈ (x − δ, x + δ).
Then, if f^{(p+1)} is bounded in (x − δ, x + δ), $\sup_{y\in(x-\delta,x+\delta)}\frac{R_n}{h^p}\to0$, i.e., the remainder is o(h^p) uniformly in
(x − δ, x + δ).
Theorem 1.12 (Dominated convergence theorem). Let f_n : S ⊂ R −→ R be a sequence of Lebesgue
measurable functions such that lim_{n→∞} f_n(x) = f(x) and |f_n(x)| ≤ g(x), ∀x ∈ S and ∀n ∈ N, where
$\int_S|g(x)|\,dx<\infty$. Then

$$\lim_{n\to\infty}\int_Sf_n(x)\,dx=\int_Sf(x)\,dx<\infty.$$

Remark. Note that if S is bounded and |f_n(x)| ≤ M, ∀x ∈ S and ∀n ∈ N, then the limit can always be
interchanged with the integral.

1.3 Nonparametric inference


The aim of statistical inference is to use data to infer an unknown quantity. In the game of inference, there
is usually a trade-off between efficiency and generality, and this trade-off is controlled by the strength of
assumptions that are made on the data generating process.
Parametric inference opts for favoring efficiency. Given a model (a strong assumption on the data generating
process), it provides a set of inferential methods (point estimation, confidence intervals, hypothesis testing,
etc.). All of them are the most efficient inferential procedures if the model matches the reality, in other words,
if the data generating process truly satisfies the assumptions. Otherwise the methods may be inconsistent.
On the other hand, nonparametric inference opts for favoring generality. Given a set of minimal and weak
assumptions (e.g. certain smoothness of a density), it provides inferential methods that are consistent in
broad situations, in exchange for losing efficiency for small or moderate sample sizes.
Hence, for any specific data generating process there is a parametric method that dominates its nonparametric
counterpart in efficiency. But knowledge of the data generating process is rarely available in practice.
That is the appeal of a nonparametric method: it will perform adequately no matter what the data
generating process is. For that reason, nonparametric methods are useful. . .
• . . . when we have no clue on what could be a good parametric model.
• . . . for creating goodness-of-fit tests employed for validating parametric models.
Example 1.6. Assume we have a sample X_1, ..., X_n from a random variable X and we want to estimate
its distribution function F. Without any assumption, we know that the ecdf in (1.1) is an estimate for
F(x) = P[X ≤ x]. It is indeed a nonparametric estimate for F. Its expectation and variance are

$$\mathbb{E}[F_n(x)]=F(x),\qquad\mathrm{Var}[F_n(x)]=\frac{F(x)(1-F(x))}{n}.$$

From the squared bias and variance, we can get the MSE:

$$\mathrm{MSE}[F_n(x)]=\frac{F(x)(1-F(x))}{n}.$$
Assume now that X ∼ Exp(λ). By maximum likelihood, it is possible to estimate λ as λ̂_ML = X̄^{-1}. Then,
we have the following estimate for F(x):

$$F(x;\hat\lambda_{\mathrm{ML}})=1-e^{-\hat\lambda_{\mathrm{ML}}x}.\quad (1.2)$$

It is not so simple to obtain the exact MSE for (1.2), even if it is easy to prove that λ̂_ML ∼ IG(λ^{-1}, n).
Approximations are possible using Exercise 1.2. However, the MSE can be easily approximated by Monte
Carlo.
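A minimal Monte Carlo sketch of that approximation, at a single point x and with arbitrary choices of λ, n, and x:
# Monte Carlo approximation of the MSE of (1.1) and (1.2) at a given x
set.seed(12345)
lambda <- 1; n <- 50; x <- 1; M <- 1e4
F_true <- pexp(x, rate = lambda)
err_ecdf <- err_par <- numeric(M)
for (i in 1:M) {
  samp <- rexp(n, rate = lambda)
  err_ecdf[i] <- (mean(samp <= x) - F_true)^2           # Ecdf (1.1)
  err_par[i] <- (1 - exp(-x / mean(samp)) - F_true)^2   # Parametric (1.2)
}
mean(err_ecdf)                  # Close to F(x) * (1 - F(x)) / n
F_true * (1 - F_true) / n
mean(err_par)                   # Smaller, since the parametric model is correct here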

What happens when the data is generated from an Exp(λ)? Then (1.2) uniformly dominates (1.1)
in performance. But, even for small deviations from Exp(λ) given by Γ(λ, p), p = 1 + δ, (1.2) shows major
problems in terms of bias, while (1.1) retains the same performance. The animation in Figure 1.3
illustrates precisely this behavior.

Figure 1.3: A simplified example of parametric and nonparametric estimation. The objective is to estimate the distribution
function F of a random variable. The data is generated from a Γ(λ, p). The parametric method
assumes that p = 1, that is, that the data comes from an Exp(λ). The nonparametric method does not
assume anything on the data generating process. The left plot shows the true distribution function and
ten estimates of each method from samples of size n. The right plot shows the MSE of each method on
estimating F(x). Application also available here.

1.4 Main references and credits


Several great reference books have been used for preparing these notes. The following list details the sections
in which each of them has been consulted:
• Fan and Gijbels (1996) (Sections 3.2, 3.3, 3.4).
• DasGupta (2008) (Sections 1.2.1, 1.2.3, 1.2.4).
• Loader (1999) (Section 3.5).
• Scott (2015) (Sections 2.1, 2.2, 2.4).
• Silverman (1986) (Sections 2.2, 2.4).
• van der Vaart (1998) (Sections 1.2.3, 1.2.4).
• Wand and Jones (1995) (Sections 2.2, 2.4, 2.6, 3.5).
• Wasserman (2004) (Sections 1.2.3, 3.1, 3.5).
• Wasserman (2006) (Sections 1.2.1, 2.5, 3.4).
In addition, these notes are possible due to the existence of these incredible pieces of software: Xie (2016),
Xie (2015), Allaire et al. (2017), and R Core Team (2017).
The icons used in the notes were designed by madebyoliver, freepik, and roundicons from Flaticon.
All material in these notes is licensed under CC BY-NC-SA 4.0.
Chapter 2

Density estimation

A random variable X is completely characterized by its cdf. Hence, an estimation of the cdf yields as a
side-product estimates for different characteristics of X by plugging F_n into the place of F. For example, the mean
$\mu=\mathbb{E}[X]=\int x\,dF(x)$ can be estimated by $\int x\,dF_n(x)=\frac{1}{n}\sum_{i=1}^nX_i=\bar X$. Despite their usefulness, cdfs are
hard to visualize and interpret.

Densities, on the other hand, are easy to visualize and interpret, making them ideal tools for data
exploration. They provide immediate graphical information about the most likely areas, modes, and
spread of X. A continuous random variable is also completely characterized by its pdf f = F′. Density
estimation does not follow trivially from the ecdf F_n, since this is not differentiable (not even continuous),
hence the need for the specific procedures we will see in this chapter.

2.1 Histograms

2.1.1 Histogram

The simplest method to estimate a density f from an iid sample X_1, ..., X_n is the histogram. From an
analytical point of view, the idea is to aggregate the data in intervals of the form [x_0, x_0 + h) and then use
their relative frequency to approximate the density at x ∈ [x_0, x_0 + h), f(x), by the estimate of

$$f(x_0)=F'(x_0)=\lim_{h\to0^+}\frac{F(x_0+h)-F(x_0)}{h}=\lim_{h\to0^+}\frac{\mathbb{P}[x_0<X<x_0+h]}{h}.$$

More precisely, given an origin t_0 and a bandwidth h > 0, the histogram builds a piecewise constant function
in the intervals {B_k := [t_k, t_{k+1}) : t_k = t_0 + hk, k ∈ Z} by counting the number of sample points inside each
of them. These constant-length intervals are also called bins. The fact that they are of constant length
h is important, since it allows standardizing by h in order to have relative frequencies in the bins. The
histogram at a point x is defined as

$$\hat f_H(x;t_0,h):=\frac{1}{nh}\sum_{i=1}^n1_{\{X_i\in B_k:\,x\in B_k\}}.\quad (2.1)$$


Equivalently, if we denote the number of points in B_k as v_k, then the histogram is $\hat f_H(x;t_0,h)=\frac{v_k}{nh}$ if x ∈ B_k
for a k ∈ Z.
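A direct implementation of (2.1) serves as a sanity check against R's hist function, which is used below; the sample, origin, and bandwidth in this sketch are arbitrary:
# Evaluate the histogram estimator (2.1) at a point x
hist_est <- function(x, samp, t0, h) {
  k <- floor((x - t0) / h)                                  # Bin B_k containing x
  vk <- sum(samp >= t0 + k * h & samp < t0 + (k + 1) * h)   # Bin count
  vk / (length(samp) * h)
}

set.seed(1)
samp <- rnorm(100)
h <- 0.5
t0 <- floor(min(samp))                # Origin chosen so the bins cover the sample
hist_est(x = 0.3, samp = samp, t0 = t0, h = h)
# Same value from hist(): density of the bin containing x = 0.3
breaks <- seq(t0, ceiling(max(samp)), by = h)
h_obj <- hist(samp, breaks = breaks, plot = FALSE)
h_obj$density[findInterval(0.3, h_obj$breaks)]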

The analysis of $\hat f_H(x;t_0,h)$ as a random variable is simple, once it is recognized that the bin counts v_k are
distributed as B(n, p_k), with $p_k=\mathbb{P}[X\in B_k]=\int_{B_k}f(t)\,dt$¹. If f is continuous, then by the mean value
theorem, p_k = hf(ξ_{k,h}) for a ξ_{k,h} ∈ (t_k, t_{k+1}). Therefore:

$$\mathbb{E}[\hat f_H(x;t_0,h)]=\frac{np_k}{nh}=f(\xi_{k,h}),\qquad\mathrm{Var}[\hat f_H(x;t_0,h)]=\frac{np_k(1-p_k)}{n^2h^2}=\frac{f(\xi_{k,h})(1-hf(\xi_{k,h}))}{nh}.$$

The above results show interesting insights:

1. If h → 0, then ξ_{k,h} → x, resulting in f(ξ_{k,h}) → f(x), and thus (2.1) becomes an asymptotically
unbiased estimator of f(x).
2. But if h → 0, the variance increases. To decrease the variance, nh → ∞ is required.
3. The variance is directly dependent on f(x)(1 − hf(x)) ≈ f(x), hence there is more variability at regions
with higher density.

A more detailed analysis of the histogram can be seen in Section 3.2.2 of Scott (2015). We skip it here since
the detailed asymptotic analysis for the more general kernel density estimator is given in Section 2.3.

Implementation of histograms is very simple in R. As an example, we consider the old-but-gold dataset


faithful. This dataset contains the duration of the eruption and the waiting time between eruptions for
the Old Faithful geyser in Yellowstone National Park (USA).
# The faithful dataset is included in R
head(faithful)
## eruptions waiting
## 1 3.600 79
## 2 1.800 54
## 3 3.333 74
## 4 2.283 62
## 5 4.533 85
## 6 2.883 55

# Duration of eruption
faithE <- faithful$eruptions

# Default histogram: automatically chosen bins and absolute frequencies!


histo <- hist(faithE)

1 Note that it is key that the {Bk } are deterministic (and not sample-dependent) for this result to hold.

[Plot: histogram of faithE with absolute frequencies (y-axis: Frequency, x-axis: faithE).]
# List that contains several objects
str(histo)
## List of 6
## $ breaks : num [1:9] 1.5 2 2.5 3 3.5 4 4.5 5 5.5
## $ counts : int [1:8] 55 37 5 9 34 75 54 3
## $ density : num [1:8] 0.4044 0.2721 0.0368 0.0662 0.25 ...
## $ mids : num [1:8] 1.75 2.25 2.75 3.25 3.75 4.25 4.75 5.25
## $ xname : chr "faithE"
## $ equidist: logi TRUE
## - attr(*, "class")= chr "histogram"

# With relative frequencies


hist(faithE, probability = TRUE)

[Plot: histogram of faithE with relative frequencies (y-axis: Density, x-axis: faithE).]
# Choosing the breaks
# t0 = min(faithE), h = 0.25
Bk <- seq(min(faithE), max(faithE), by = 0.25)
hist(faithE, probability = TRUE, breaks = Bk)
rug(faithE) # Plotting the sample

[Plot: histogram of faithE with chosen breaks and the sample rug (y-axis: Density, x-axis: faithE).]

The shape of the histogram depends on:

• t_0, since the separation between bins happens at t_0 + kh, k ∈ Z;

• h, which controls the bin size and the effective number of bins for aggregating the sample.

We focus first on exploring the dependence on t_0, as it serves to motivate the next density estimator.
# Uniform sample
set.seed(1234567)
u <- runif(n = 100)

# t0 = 0, h = 0.2
Bk1 <- seq(0, 1, by = 0.2)

# t0 = -0.1, h = 0.2
Bk2 <- seq(-0.1, 1.1, by = 0.2)

# Comparison
par(mfrow = c(1, 2))
hist(u, probability = TRUE, breaks = Bk1, ylim = c(0, 1.5),
main = "t0 = 0, h = 0.2")
rug(u)
abline(h = 1, col = 2)
hist(u, probability = TRUE, breaks = Bk2, ylim = c(0, 1.5),
main = "t0 = -0.1, h = 0.2")
rug(u)
abline(h = 1, col = 2)

[Plot: two histograms of the uniform sample u, with t0 = 0 (left) and t0 = -0.1 (right), both with h = 0.2; the true density is the horizontal line at 1.]
High dependence on t0 also happens when estimating densities that are not compactly supported. The next
snippet of code points towards it.

# Sample 100 points from a N(0, 1) and 50 from a N(3.25, 0.5)


set.seed(1234567)
samp <- c(rnorm(n = 100, mean = 0, sd = 1),
rnorm(n = 50, mean = 3.25, sd = sqrt(0.5)))

# min and max for choosing Bk1 and Bk2


range(samp)

# Comparison
Bk1 <- seq(-2.5, 5, by = 0.5)
Bk2 <- seq(-2.25, 5.25, by = 0.5)
par(mfrow = c(1, 2))
hist(samp, probability = TRUE, breaks = Bk1, ylim = c(0, 0.5),
main = "t0 = -2.5, h = 0.5")
rug(samp)
hist(samp, probability = TRUE, breaks = Bk2, ylim = c(0, 0.5),
main = "t0 = -2.25, h = 0.5")
rug(samp)

2.1.2 Moving histogram

An alternative to avoid the dependence on t_0 is the moving histogram, or naive density estimator. The idea is
to aggregate the sample X_1, ..., X_n in intervals of the form (x − h, x + h) and then use its relative frequency
in (x − h, x + h) to approximate the density at x:

$$f(x)=F'(x)=\lim_{h\to0^+}\frac{F(x+h)-F(x-h)}{2h}=\lim_{h\to0^+}\frac{\mathbb{P}[x-h<X<x+h]}{2h}.$$

Recall the differences with the histogram: the intervals depend on the evaluation point x and are centered
around it. That allows to directly estimate f(x) (without the proxy f(x_0)) by an estimate of the symmetric
derivative.

More precisely, given a bandwidth h > 0, the naive density estimator builds a piecewise constant function
by considering the relative frequency of X_1, ..., X_n inside (x − h, x + h):

$$\hat f_N(x;h):=\frac{1}{2nh}\sum_{i=1}^n1_{\{x-h<X_i<x+h\}}.\quad (2.2)$$

The function has 2n discontinuities that are located at X_i ± h.
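A direct implementation of (2.2) may help fix ideas; the sample, bandwidth, and evaluation grid in this sketch are arbitrary:
# Naive density estimator (2.2), evaluated on a grid
naive_den <- function(x, samp, h) {
  sapply(x, function(xi) sum(xi - h < samp & samp < xi + h)) / (2 * length(samp) * h)
}

set.seed(1)
samp <- rnorm(100)
x_grid <- seq(-4, 4, length.out = 400)
plot(x_grid, naive_den(x = x_grid, samp = samp, h = 0.5), type = "l",
     xlab = "x", ylab = "Density")
curve(dnorm(x), add = TRUE, col = 2)   # True density
rug(samp)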

Similarly to the histogram, the analysis of $\hat f_N(x;h)$ as a random variable follows from realizing that
$\sum_{i=1}^n1_{\{x-h<X_i<x+h\}}\sim B(n,p_{x,h})$, with $p_{x,h}=\mathbb{P}[x-h<X<x+h]=F(x+h)-F(x-h)$. Then:

$$\mathbb{E}[\hat f_N(x;h)]=\frac{F(x+h)-F(x-h)}{2h},\qquad\mathrm{Var}[\hat f_N(x;h)]=\frac{F(x+h)-F(x-h)}{4nh^2}-\frac{(F(x+h)-F(x-h))^2}{4nh^2}.$$

These two results provide interesting insights on the effect of h:

1. If h → 0, then $\mathbb{E}[\hat f_N(x;h)]\to f(x)$ and (2.2) is an asymptotically unbiased estimator of f(x). But also
$\mathrm{Var}[\hat f_N(x;h)]\approx\frac{f(x)}{2nh}-\frac{f(x)^2}{n}\to\infty$.
2. If h → ∞, then $\mathbb{E}[\hat f_N(x;h)]\to0$ and $\mathrm{Var}[\hat f_N(x;h)]\to0$.
3. The variance shrinks to zero if nh → ∞ (or, in other words, if h^{-1} = o(n), i.e. if h^{-1} grows slower than
n). So both the bias and the variance can be reduced if n → ∞, h → 0, and nh → ∞, simultaneously.
4. The variance is (almost) proportional to f(x).
The animation in Figure 2.1.2 illustrates the previous points and gives insight on how the performance of
(2.2) varies smoothly with h.
Bias and variance for the moving histogram. The animation shows how for small bandwidths the bias of
fˆN (x; h) on estimating f (x) is small, but the variance is high, and how for large bandwidths the bias is
large and the variance is small. The variance is represented by the asymptotic 95% confidence intervals for
fˆN (x; h). Application also available here.
Estimator (2.2) raises an important question: why give the same weight to all X_1, ..., X_n in (x −
h, x + h)? After all, we are estimating f(x) = F′(x) by estimating $\frac{F(x+h)-F(x-h)}{2h}$ through the relative
frequency of X_1, ..., X_n in (x − h, x + h). Should not the data points closer to x be more important than
the ones further away? The answer to this question shows that (2.2) is indeed a particular case of a wider
class of density estimators.

2.2 Kernel density estimation


The moving histogram (2.2) can be equivalently written as

$$\hat f_N(x;h)=\frac{1}{nh}\sum_{i=1}^n\frac{1}{2}1_{\left\{-1<\frac{x-X_i}{h}<1\right\}}=\frac{1}{nh}\sum_{i=1}^nK\left(\frac{x-X_i}{h}\right),\quad (2.3)$$

with $K(z)=\frac{1}{2}1_{\{-1<z<1\}}$. Interestingly, K is a uniform density in (−1, 1). This means that, when approximating

$$\mathbb{P}[x-h<X<x+h]=\mathbb{P}\left[-1<\frac{x-X}{h}<1\right]$$

by (2.3), we give equal weight to all the points X_1, ..., X_n. The generalization of (2.3) is now obvious:
replace K by an arbitrary density. Then K is known as a kernel, a density that is typically symmetric and
unimodal at 0. This generalization provides the definition of the kernel density estimator (kde)²:
2 Also known as the Parzen–Rosenblatt estimator to honor the proposals by Parzen (1962) and Rosenblatt (1956).

$$\hat f(x;h):=\frac{1}{nh}\sum_{i=1}^nK\left(\frac{x-X_i}{h}\right).\quad (2.4)$$

A common notation is $K_h(z):=\frac{1}{h}K\left(\frac{z}{h}\right)$, so $\hat f(x;h)=\frac{1}{n}\sum_{i=1}^nK_h(x-X_i)$.
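Before turning to the built-in density function used below, a direct implementation of (2.4) with a normal kernel may help fix ideas (a sketch; the sample, bandwidth, and grid are arbitrary):
# Kernel density estimator (2.4) with a normal kernel, coded directly
kde_manual <- function(x, samp, h) {
  sapply(x, function(xi) mean(dnorm((xi - samp) / h)) / h)
}

set.seed(1)
samp <- rnorm(100)
h <- 0.5
x_grid <- seq(-4, 4, length.out = 200)
manual <- kde_manual(x = x_grid, samp = samp, h = h)
built_in <- density(samp, bw = h, from = -4, to = 4, n = 200)
max(abs(manual - built_in$y))   # Small; density() bins the data, so the match is approximate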

Several types of kernels are possible. The most popular is the normal kernel K(z) = ϕ(z), although the
Epanechnikov kernel, $K(z)=\frac{3}{4}(1-z^2)1_{\{|z|<1\}}$, is the most efficient. The rectangular kernel $K(z)=\frac{1}{2}1_{\{|z|<1\}}$
yields the moving histogram as a particular case. The kde inherits the smoothness properties of the kernel.
That means, for example, that (2.4) with a normal kernel is infinitely differentiable. But with an Epanechnikov
kernel, (2.4) is not differentiable, and with a rectangular kernel it is not even continuous. However, if a certain
smoothness is guaranteed (continuity at least), the choice of the kernel has little importance in practice (at
least compared with the choice of h). Figure 2.2 illustrates the construction of the kde, and the bandwidth
and kernel effects.

It is useful to recall the kde with the normal kernel. If that is the case, then K_h(x − X_i) = ϕ_h(x − X_i) and
the kernel is the density of a N(X_i, h²). Thus the bandwidth h can be thought of as the standard deviation of
normal densities centered at the data points.

Construction of the kernel density estimator. The animation shows how the bandwidth and kernel affect the
density estimate, and how the kernels are rescaled densities with modes at the data points. Application also
available here.

Implementation of the kde in R is built-in through the density function. The function automatically chooses
the bandwidth h using a bandwidth selector which will be studied in detail in Section 2.4.
# Sample 100 points from a N(0, 1)
set.seed(1234567)
samp <- rnorm(n = 100, mean = 0, sd = 1)

# Quickly compute a kde and plot the density object


# Automatically chooses bandwidth and uses normal kernel
plot(density(x = samp))

# Select a particular bandwidth (0.5) and kernel (Epanechnikov)


lines(density(x = samp, bw = 0.5, kernel = "epanechnikov"), col = 2)

[Plot: kde of samp with automatically chosen bandwidth (black) and with bw = 0.5 and Epanechnikov kernel (red). N = 100, bandwidth = 0.3145.]

# density automatically chooses the interval for plotting the kde


# (observe that the black line goes to roughly between -3 and 3)
# This can be tuned using "from" and "to"
plot(density(x = samp, from = -4, to = 4), xlim = c(-5, 5))

[Plot: kde of samp computed with from = -4 and to = 4, displayed on (-5, 5). N = 100, bandwidth = 0.3145.]



# The density object is a list


kde <- density(x = samp, from = -5, to = 5, n = 1024)
str(kde)
## List of 7
## $ x : num [1:1024] -5 -4.99 -4.98 -4.97 -4.96 ...
## $ y : num [1:1024] 5.98e-17 3.46e-17 2.56e-17 3.84e-17 4.50e-17 ...
## $ bw : num 0.315
## $ n : int 100
## $ call : language density.default(x = samp, n = 1024, from = -5, to = 5)
## $ data.name: chr "samp"
## $ has.na : logi FALSE
## - attr(*, "class")= chr "density"
# Note that the evaluation grid "x" is not directly controlled, only through
# "from", "to", and "n" (better use powers of 2)
plot(kde$x, kde$y, type = "l")
curve(dnorm(x), col = 2, add = TRUE) # True density
rug(samp)
[Plot: kde$y against kde$x (black), with the true N(0, 1) density superimposed (red) and the sample rug.]
Exercise 2.1. Load the dataset faithful. Then:
• Estimate and plot the density of faithful$eruptions.
• Create a new plot and superimpose different density estimations with bandwidths equal to 0.1, 0.5,
and 1.
• Get the density estimate at exactly the point x = 3.1 using h = 0.15 and the Epanechnikov kernel.

2.3 Asymptotic properties

Asymptotic results give us insights on the large-sample (n → ∞) properties of an estimator. One might wonder
why they are useful, since in practice we only have finite sample sizes. Apart from purely theoretical
reasons, asymptotic results usually give highly valuable insights on the properties of the method, typically
simpler than finite-sample results (which might be analytically intractable).

Along this section we will make the following assumptions:

• A1. The density f is twice continuously differentiable and the second derivative is square integrable.
• A2. The kernel K is a symmetric and bounded pdf with finite second moment and square integrable.
• A3. h = h_n is a deterministic sequence of bandwidths³ such that, when n → ∞, h → 0 and nh → ∞.

We need to introduce some notation. From now on, the integrals are thought to be over R, if not stated
otherwise. The second moment of the kernel is denoted as $\mu_2(K):=\int z^2K(z)\,dz$. The squared integral of a
function f is denoted by $R(f):=\int f(x)^2\,dx$. The convolution between two real functions f and g, f ∗ g, is
the function

$$(f*g)(x):=\int f(x-y)g(y)\,dy=(g*f)(x).\quad (2.5)$$

We are now ready to obtain the bias and variance of $\hat f(x;h)$. Recall that it is not possible to apply the
"binomial trick" we used previously, since now the estimator is not piecewise constant. Instead, we
use the linearity of the kde and the convolution definition. For the bias, recall that

$$\mathbb{E}[\hat f(x;h)]=\frac{1}{n}\sum_{i=1}^n\mathbb{E}[K_h(x-X_i)]=\int K_h(x-y)f(y)\,dy=(K_h*f)(x).\quad (2.6)$$

Similarly, the variance is obtained as

$$\mathrm{Var}[\hat f(x;h)]=\frac{1}{n}\left((K_h^2*f)(x)-(K_h*f)^2(x)\right).\quad (2.7)$$

These two expressions are exact, but they are hard to interpret. Equation (2.6) indicates that the estimator
is biased (we will see an example of that bias in Section 2.5), but it does not explicitly separate the effects
of the kernel, the bandwidth, and the density on the bias. The same happens with (2.7), yet more emphasized. That is
why the following asymptotic expressions are preferred.

Theorem 2.1. Under A1–A3, the bias and variance of the kde at x are

$$\mathrm{Bias}[\hat f(x;h)]=\frac{1}{2}\mu_2(K)f''(x)h^2+o(h^2),\quad (2.8)$$

$$\mathrm{Var}[\hat f(x;h)]=\frac{R(K)}{nh}f(x)+o((nh)^{-1}).\quad (2.9)$$

Proof. For the bias we consider the change of variables $z=\frac{x-y}{h}$, y = x − hz, dy = −h dz. The integral limits
flip and we have:

$$\mathbb{E}[\hat f(x;h)]=\int K_h(x-y)f(y)\,dy=\int K(z)f(x-hz)\,dz.\quad (2.10)$$

³h = h_n always depends on n from now on, although the subscript is dropped for the ease of notation.

Since h → 0, an application of a second-order Taylor expansion gives

$$f(x-hz)=f(x)-f'(x)hz+\frac{f''(x)}{2}h^2z^2+o(h^2z^2).\quad (2.11)-(2.12)$$

Substituting (2.11)-(2.12) in (2.10), and bearing in mind that K is a symmetric density around 0, we have

$$\int K(z)f(x-hz)\,dz=\int K(z)\left\{f(x)-f'(x)hz+\frac{f''(x)}{2}h^2z^2+o(h^2z^2)\right\}dz=f(x)+\frac{1}{2}\mu_2(K)f''(x)h^2+o(h^2),$$

which provides (2.8).


For the variance, first note that

$$\mathrm{Var}[\hat f(x;h)]=\frac{1}{n^2}\sum_{i=1}^n\mathrm{Var}[K_h(x-X_i)]=\frac{1}{n}\left\{\mathbb{E}[K_h^2(x-X)]-\mathbb{E}[K_h(x-X)]^2\right\}.\quad (2.13)$$

The second term of (2.13) is already computed, so we focus on the first. Using the previous change of
variables and a first-order Taylor expansion, we have:

$$\mathbb{E}[K_h^2(x-X)]=\frac{1}{h}\int K^2(z)f(x-hz)\,dz=\frac{1}{h}\int K^2(z)\{f(x)+O(hz)\}\,dz=\frac{R(K)}{h}f(x)+O(1).\quad (2.14)$$

Plugging (2.14) into (2.13) gives

$$\mathrm{Var}[\hat f(x;h)]=\frac{1}{n}\left\{\frac{R(K)}{h}f(x)+O(1)-O(1)\right\}=\frac{R(K)}{nh}f(x)+O(n^{-1})=\frac{R(K)}{nh}f(x)+o((nh)^{-1}),$$

since $n^{-1}=o((nh)^{-1})$.


Remark. Integrating little-o's is a tricky issue. In general, integrating an o_x(1) quantity, possibly dependent
on x, does not yield an o(1). In other words: $\int o_x(1)\,dx\neq o(1)$ in general. If the previous held with equality, then
limits and integrals would be interchangeable, but this is not always true; it holds only if certain conditions are met
(recall the dominated convergence theorem, Theorem 1.12). If one wants to be completely rigorous on the
two implicit commutations of integrals and limits that took place in the proof, it is necessary to have explicit
control of the remainder via Taylor's theorem (Theorem 1.11) and then apply the dominated convergence
theorem. For simplicity in the exposition, we avoid this.
The bias and variance expressions (2.8) and (2.9) yield interesting insights (see Figure 2.1.2 for their visualization):
• The bias decreases with h quadratically. In addition, the bias at x is directly proportional to f″(x).
This has an interesting interpretation:
– The bias is negative in concave regions, i.e. {x ∈ R : f″(x) < 0}. These regions correspond to
peaks and modes of f, where the kde underestimates f (tends to be below f).
– Conversely, the bias is positive in convex regions, i.e. {x ∈ R : f″(x) > 0}. These regions
correspond to valleys and tails of f, where the kde overestimates f (tends to be above f).
– The wilder the curvature f″, the harder it is to estimate f. Flat density regions are easier to
estimate than wiggly regions with high curvature (several modes).
• The variance depends directly on f(x). The higher the density, the more variable the kde is.
Interestingly, the variance decreases as a factor of (nh)^{-1}, a consequence of nh playing the role of the
effective sample size for estimating f(x). The density at x is not estimated using all the sample points,
but only a fraction proportional to nh in a neighborhood of x.
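A Monte Carlo check of (2.8) and (2.9) is straightforward when f is known. The sketch below takes f = ϕ (so f″(x) = (x² − 1)ϕ(x)) and the normal kernel (µ2(K) = 1, R(K) = 1/(2√π)), with arbitrary choices of x, h, and n:
# Monte Carlo check of the asymptotic bias (2.8) and variance (2.9)
set.seed(123)
n <- 500; h <- 0.3; x <- 0; M <- 5000
kde_x <- replicate(M, {
  samp <- rnorm(n)
  mean(dnorm((x - samp) / h)) / h   # Kde (2.4) with normal kernel, evaluated at x
})
bias_asymp <- 0.5 * (x^2 - 1) * dnorm(x) * h^2   # (1/2) mu2(K) f''(x) h^2
var_asymp <- dnorm(x) / (2 * sqrt(pi) * n * h)   # R(K) f(x) / (n h)
c(mean(kde_x) - dnorm(x), bias_asymp)
c(var(kde_x), var_asymp)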

Figure 2.1: Illustration of the effective sample size for estimating f (x) at x = 2. In blue, the kernels with
contribution larger than 0.01 to the kde. In grey, rest of the kernels.

The MSE of the kde is trivial to obtain from the bias and variance:

Corollary 2.1. Under A1–A3, the MSE of the kde at x is

$$\mathrm{MSE}[\hat f(x;h)]=\frac{\mu_2^2(K)}{4}(f''(x))^2h^4+\frac{R(K)}{nh}f(x)+o(h^4+(nh)^{-1}).\quad (2.15)$$

Therefore, the kde is pointwise consistent in MSE, i.e., $\hat f(x;h)\stackrel{2}{\longrightarrow}f(x)$.

Proof. We apply that $\mathrm{MSE}[\hat f(x;h)]=(\mathbb{E}[\hat f(x;h)]-f(x))^2+\mathrm{Var}[\hat f(x;h)]$ and that $(O(h^2)+o(h^2))^2=O(h^4)+o(h^4)$. Since $\mathrm{MSE}[\hat f(x;h)]\to0$ when n → ∞, consistency follows.

Note that, due to the consistency, $\hat f(x;h)\stackrel{2}{\longrightarrow}f(x)\implies\hat f(x;h)\stackrel{P}{\longrightarrow}f(x)\implies\hat f(x;h)\stackrel{d}{\longrightarrow}f(x)$ under
A1–A3. However, these results are not useful for quantifying the randomness of $\hat f(x;h)$: the convergence
is towards the degenerate random variable f(x), for a given x ∈ R. For that reason, we focus now on the
asymptotic normality of the estimator.

Theorem 2.2. Assume that $\int K^{2+\delta}(z)\,dz<\infty$⁴ for some δ > 0. Then, under A1–A3,

$$\sqrt{nh}(\hat f(x;h)-\mathbb{E}[\hat f(x;h)])\stackrel{d}{\longrightarrow}\mathcal{N}(0,R(K)f(x)),\quad (2.16)$$

$$\sqrt{nh}\left(\hat f(x;h)-f(x)-\frac{1}{2}\mu_2(K)f''(x)h^2\right)\stackrel{d}{\longrightarrow}\mathcal{N}(0,R(K)f(x)).\quad (2.17)$$

Proof. First note that K_h(x − X_n) is a sequence of independent but not identically distributed random
variables, since h = h_n depends on n. Therefore, we look forward to applying Theorem 1.2.

We prove first (2.16). For simplicity, denote K_i := K_h(x − X_i), i = 1, ..., n. From the proof of Theorem 2.1
we know that $\mathbb{E}[K_i]=\mathbb{E}[\hat f(x;h)]=f(x)+o(1)$ and

$$s_n^2=\sum_{i=1}^n\mathrm{Var}[K_i]=n^2\mathrm{Var}[\hat f(x;h)]=n\frac{R(K)}{h}f(x)(1+o(1)).$$

An application of the C_p inequality (first) and Jensen's inequality (second) gives:

$$\mathbb{E}\left[|K_i-\mathbb{E}[K_i]|^{2+\delta}\right]\leq C_{2+\delta}\left(\mathbb{E}\left[|K_i|^{2+\delta}\right]+|\mathbb{E}[K_i]|^{2+\delta}\right)\leq2C_{2+\delta}\mathbb{E}\left[|K_i|^{2+\delta}\right]=O\left(\mathbb{E}\left[|K_i|^{2+\delta}\right]\right).$$


In addition, due to a Taylor expansion after the change of variables $z=\frac{x-y}{h}$, and using that $\int K^{2+\delta}(z)\,dz<\infty$:

$$\mathbb{E}\left[|K_i|^{2+\delta}\right]=\frac{1}{h^{2+\delta}}\int K^{2+\delta}\left(\frac{x-y}{h}\right)f(y)\,dy=\frac{1}{h^{1+\delta}}\int K^{2+\delta}(z)f(x-hz)\,dz=\frac{1}{h^{1+\delta}}\int K^{2+\delta}(z)(f(x)+o(1))\,dz=O\left(h^{-(1+\delta)}\right).$$

Then:
4 This is satisfied, for example, if the kernel decreases exponentially, i.e., if ∃ α, M > 0 : K(z) ≤ e−α|z| , ∀|z| > M .

$$\frac{1}{s_n^{2+\delta}}\sum_{i=1}^n\mathbb{E}\left[|K_i-\mathbb{E}[K_i]|^{2+\delta}\right]=\left(\frac{h}{nR(K)f(x)}\right)^{1+\frac{\delta}{2}}(1+o(1))\,O\left(nh^{-(1+\delta)}\right)=O\left((nh)^{-\frac{\delta}{2}}\right)$$

and Lyapunov's condition is satisfied. As a consequence, by Lyapunov's CLT and Slutsky's theorem:

$$\sqrt{\frac{nh}{R(K)f(x)}}\left(\hat f(x;h)-\mathbb{E}[\hat f(x;h)]\right)=(1+o(1))\frac{1}{s_n}\sum_{i=1}^n(K_i-\mathbb{E}[K_i])\stackrel{d}{\longrightarrow}\mathcal{N}(0,1)$$

and (2.16) is proved.


To prove (2.17), we consider:

$$\sqrt{nh}\left(\hat f(x;h)-f(x)-\frac{1}{2}\mu_2(K)f''(x)h^2\right)=\sqrt{nh}\left(\hat f(x;h)-\mathbb{E}[\hat f(x;h)]+o(h^2)\right).$$

Due to Example 1.5 and Proposition 1.8, we can prove that $\hat f(x;h)-\mathbb{E}[\hat f(x;h)]=o_P(1)$. Then, $\hat f(x;h)-\mathbb{E}[\hat f(x;h)]+o(h^2)=(\hat f(x;h)-\mathbb{E}[\hat f(x;h)])(1+o_P(h^2))$. Therefore, Slutsky's theorem and (2.16) give:

$$\sqrt{nh}\left(\hat f(x;h)-f(x)-\frac{1}{2}\mu_2(K)f''(x)h^2\right)=\sqrt{nh}\left(\hat f(x;h)-\mathbb{E}[\hat f(x;h)]\right)(1+o_P(h^2))\stackrel{d}{\longrightarrow}\mathcal{N}(0,R(K)f(x)).$$


Remark. Note the rate $\sqrt{nh}$ in the asymptotic normality results. It is different from the standard CLT rate
$\sqrt{n}$ (see Theorem 1.1) and slower: the variance of the limiting normal distribution decreases as O((nh)^{-1})
and not as O(n^{-1}). The phenomenon is related to the effective sample size used in the smoothing.
Exercise 2.2. Using (2.17) and Example 1.5, show that $\hat f(x;h)=f(x)+o_P(1)$.

2.4 Bandwidth selection


As we saw in the previous sections, bandwidth selection is a key issue in density estimation. The purpose
of this section is to introduce objective and automatic bandwidth selectors that attempt to minimize the
estimation error of the target density f .

The first step is to define a global, rather than local, error criterion. The Integrated Squared Error (ISE),

$$\mathrm{ISE}[\hat f(\cdot;h)]:=\int(\hat f(x;h)-f(x))^2\,dx,$$

is the squared distance between the kde and the target density. The ISE is a random quantity, since it
depends directly on the sample X_1, ..., X_n. As a consequence, looking for an optimal-ISE bandwidth is a
hard task, since the optimality depends on the sample itself and not only on the population and n.
To avoid this problem, it is usual to compute the Mean Integrated Squared Error (MISE):

$$\mathrm{MISE}[\hat f(\cdot;h)]:=\mathbb{E}\left[\mathrm{ISE}[\hat f(\cdot;h)]\right]=\mathbb{E}\left[\int(\hat f(x;h)-f(x))^2\,dx\right]=\int\mathbb{E}\left[(\hat f(x;h)-f(x))^2\right]dx=\int\mathrm{MSE}[\hat f(x;h)]\,dx.$$

The MISE is convenient due to its mathematical tractability and its natural relation with the MSE. There
are, however, other error criteria that present attractive properties, such as the Mean Integrated Absolute
Error (MIAE):

$$\mathrm{MIAE}[\hat f(\cdot;h)]:=\mathbb{E}\left[\int|\hat f(x;h)-f(x)|\,dx\right]=\int\mathbb{E}\left[|\hat f(x;h)-f(x)|\right]dx.$$

The MIAE, unlike the MISE, is invariant with respect to monotone transformations of the data. For example,
if g(x) = f(t^{-1}(x))(t^{-1})′(x) is the density of Y = t(X) and X ∼ f, then, making the change of variables y = t(x),

$$\int|\hat f(x;h)-f(x)|\,dx=\int|\hat f(t^{-1}(y);h)-f(t^{-1}(y))|(t^{-1})'(y)\,dy=\int|\hat g(y;h)-g(y)|\,dy.$$

Despite this appealing property, the analysis of the MIAE is substantially more complicated. We refer to Devroye
and Györfi (1985) for a comprehensive treatment of absolute value metrics for the kde.
Once the MISE is set as the error criterion to be minimized, our aim is to find

$$h_{\mathrm{MISE}}:=\arg\min_{h>0}\mathrm{MISE}[\hat f(\cdot;h)].$$

For that purpose, we need an explicit expression of the MISE that we can attempt to minimize. An asymptotic
expansion for the MISE can be easily derived from the MSE expression.
Corollary 2.2. Under A1–A3,

$$\mathrm{MISE}[\hat f(\cdot;h)]=\frac{1}{4}\mu_2^2(K)R(f'')h^4+\frac{R(K)}{nh}+o(h^4+(nh)^{-1}).\quad (2.18)$$

Therefore, $\mathrm{MISE}[\hat f(\cdot;h)]\to0$ when n → ∞.



Proof. Trivial.

The dominating part of the MISE is denoted as AMISE, which stands for Asymptotic MISE:

$$\mathrm{AMISE}[\hat f(\cdot;h)]=\frac{1}{4}\mu_2^2(K)R(f'')h^4+\frac{R(K)}{nh}.$$

Due to its closed expression, it is possible to obtain a bandwidth that minimizes the AMISE.

Corollary 2.3. The bandwidth that minimizes the AMISE is

$$h_{\mathrm{AMISE}}=\left[\frac{R(K)}{\mu_2^2(K)R(f'')n}\right]^{1/5}.$$

The optimal AMISE is:

$$\inf_{h>0}\mathrm{AMISE}[\hat f(\cdot;h)]=\frac{5}{4}\left(\mu_2^2(K)R(K)^4\right)^{1/5}R(f'')^{1/5}n^{-4/5}.$$

Proof. Solving $\frac{d}{dh}\mathrm{AMISE}[\hat f(\cdot;h)]=0$, i.e. $\mu_2^2(K)R(f'')h^3-R(K)n^{-1}h^{-2}=0$, yields h_{AMISE}, and evaluating
$\mathrm{AMISE}[\hat f(\cdot;h_{\mathrm{AMISE}})]$ gives the AMISE-optimal error.
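Although h_AMISE is unusable in practice (it depends on the unknown R(f″)), it can be evaluated numerically when f is known, which is handy in simulation studies. A sketch for f = ϕ and the normal kernel; in this particular case the result has the closed form (4/(3n))^{1/5}:
# Numerical evaluation of h_AMISE for f = dnorm and the normal kernel
n <- 100
mu2_K <- 1                                 # Second moment of the normal kernel
R_K <- 1 / (2 * sqrt(pi))                  # R(K) for the normal kernel
f_dd <- function(x) (x^2 - 1) * dnorm(x)   # f'' for f = dnorm
R_fdd <- integrate(function(x) f_dd(x)^2, lower = -Inf, upper = Inf)$value
(R_K / (mu2_K^2 * R_fdd * n))^(1 / 5)      # h_AMISE
(4 / (3 * n))^(1 / 5)                      # Closed form for this particular case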

The AMISE-optimal order deserves some further inspection. It can be seen in Section 3.2 of Scott (2015) that
the AMISE-optimal order for the histogram of Section 2.1 is $\left(\frac{3}{4}\right)^{2/3}R(f')^{1/3}n^{-2/3}$. Two facts are of interest.
First, the MISE of the histogram is asymptotically larger than the MISE of the kde (n^{-4/5} = o(n^{-2/3})).
This is a quantification of the quite apparent visual improvement of the kde over the histogram. Second,
R(f′) appears instead of R(f″), evidencing that the histogram is affected not only by the curvature of the
target density f but also by how fast it varies.

Unfortunately, the AMISE bandwidth depends on $R(f'')=\int(f''(x))^2\,dx$, which measures the curvature of
the density. As a consequence, it can not be readily applied in practice. In the next subsection we will see
how to plug in estimates for R(f″).

2.4.1 Plug-in rules

A simple solution to estimate R(f″) is to assume that f is the density of a N(µ, σ²), and then plug in the
curvature of such a density:

$$R(\phi_\sigma''(\cdot-\mu))=\frac{3}{8\pi^{1/2}\sigma^5}.$$

While doing so, we approximate the curvature of an arbitrary density by means of the curvature of a normal.
We have that

$$h_{\mathrm{AMISE}}=\left[\frac{8\pi^{1/2}R(K)}{3\mu_2^2(K)n}\right]^{1/5}\sigma.$$

Interestingly, the bandwidth is directly proportional to the standard deviation of the target density. Replacing
σ by an estimate yields the normal scale bandwidth selector, which we denote by $\hat h_{\mathrm{NS}}$ to emphasize
its randomness:

$$\hat h_{\mathrm{NS}}=\left[\frac{8\pi^{1/2}R(K)}{3\mu_2^2(K)n}\right]^{1/5}\hat\sigma.$$

The estimate σ̂ can be chosen as the standard deviation s, or, in order to avoid the effects of potential
outliers, as the standardized interquartile range

$$\hat\sigma_{\mathrm{IQR}}:=\frac{X_{([0.75n])}-X_{([0.25n])}}{\Phi^{-1}(0.75)-\Phi^{-1}(0.25)}.$$

Silverman (1986) suggests employing the minimum of both quantities,

$$\hat\sigma=\min(s,\hat\sigma_{\mathrm{IQR}}).\quad (2.19)$$

When combined with a normal kernel, for which µ_2(K) = 1 and $R(K)=\phi_{\sqrt{2}}(0)=\frac{1}{2\sqrt{\pi}}$ (check (2.26)), this
particularization of $\hat h_{\mathrm{NS}}$ gives the famous rule-of-thumb for bandwidth selection:

$$\hat h_{\mathrm{RT}}=\left(\frac{4}{3}\right)^{1/5}n^{-1/5}\hat\sigma\approx1.06\,n^{-1/5}\hat\sigma.$$

$\hat h_{\mathrm{RT}}$ is implemented in R through the function bw.nrd (not to be confused with bw.nrd0).
# Data
x <- rnorm(100)

# Rule-of-thumb
bw.nrd(x = x)
## [1] 0.3698529

# Same as
iqr <- diff(quantile(x, c(0.75, 0.25))) / diff(qnorm(c(0.75, 0.25)))
1.06 * length(x)^(-1/5) * min(sd(x), iqr)
## [1] 0.367391

The previous selector is an example of a zero-stage plug-in selector, a terminology that reflects the fact
that R(f″) was estimated by plugging in a parametric assumption at the "very beginning". Instead, we
could have opted to estimate R(f″) nonparametrically and then plug the estimate into h_{AMISE}. Let's
explore this possibility in more detail. But first, note the useful equality

$$\int f^{(s)}(x)^2\,dx=(-1)^s\int f^{(2s)}(x)f(x)\,dx.$$

This holds by an iterative application of integration by parts. For example, for s = 2, take u = f″(x) and
dv = f″(x)dx. This gives

$$\int f''(x)^2\,dx=[f''(x)f'(x)]_{-\infty}^{+\infty}-\int f'(x)f'''(x)\,dx=-\int f'(x)f'''(x)\,dx$$

under the assumption that the derivatives vanish at infinity. Applying again integration by parts with
u = f‴(x) and dv = f′(x)dx gives the result. This simple derivation has an important consequence: for
estimating the functionals R(f^{(s)}) it is only required to estimate, for r = 2s, the functionals

$$\psi_r:=\int f^{(r)}(x)f(x)\,dx=\mathbb{E}[f^{(r)}(X)].$$

Particularly, R(f″) = ψ_4.
Thanks to the previous expression, a possible way to estimate ψ_r nonparametrically is

$$\hat\psi_r(g)=\frac{1}{n}\sum_{i=1}^n\hat f^{(r)}(X_i;g)=\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^nL_g^{(r)}(X_i-X_j),\quad (2.20)$$

where $\hat f^{(r)}(\cdot;g)$ is the r-th derivative of a kde with bandwidth g and kernel L, i.e. $\hat f^{(r)}(x;g)=\frac{1}{ng^{r+1}}\sum_{i=1}^nL^{(r)}\left(\frac{x-X_i}{g}\right)$.
Note that g and L can be different from h and K, respectively. It turns out that
estimating ψ_r involves the adequate selection of a bandwidth g. The agenda is analogous to the one for
h_{AMISE}, but now taking into account that both $\hat\psi_r(g)$ and ψ_r are scalar quantities:
1. Under certain regularity assumptions (see Section 3.5 in Wand and Jones (1995) for full details), the
asymptotic bias and variance of ψ̂r (g) are obtained. With them, we can compute the asymptotic
expansion of the MSE and obtain the Asymptotic Mean Squared Error (AMSE):

$$\mathrm{AMSE}[\hat\psi_r(g)] = \left\{\frac{L^{(r)}(0)}{ng^{r+1}} + \frac{\mu_2(L)\psi_{r+2}g^2}{2}\right\}^2 + \frac{2R(L^{(r)})\psi_0}{n^2g^{2r+1}} + \frac{4}{n}\left\{\int f^{(r)}(x)^2f(x)\,dx - \psi_r^2\right\}.$$

Note: k denotes the order of the kernel L, that is, the smallest integer such that µk(L) ≠ 0. In these notes we have restricted ourselves to the case k = 2
for the kernels K, but there are theoretical gains if one allows higher-order kernels L, with vanishing moments
up to an order larger than 2, for estimating ψr.
2. Obtain the AMSE-optimal bandwidth:
$$g_\mathrm{AMSE} = \left[-\frac{k!\,L^{(r)}(0)}{\mu_k(L)\psi_{r+k}n}\right]^{1/(r+k+1)}.$$

The order of the optimal AMSE is


$$\inf_{g>0}\mathrm{AMSE}[\hat\psi_r(g)] = \begin{cases} O(n^{-(2k+1)/(r+k+1)}), & k < r,\\ O(n^{-1}), & k \geq r, \end{cases}$$

which shows that a parametric-like rate of convergence can be achieved with high-order kernels. If we
consider L = K and k = 2, then
$$g_\mathrm{AMSE} = \left[-\frac{2K^{(r)}(0)}{\mu_2(K)\psi_{r+2}n}\right]^{1/(r+3)}.$$

The above result has a major problem: it depends on ψr+2! Thus, if we want to estimate R(f ′′) = ψ4 by
ψ̂4(gAMSE) we will need to estimate ψ6. But ψ̂6(gAMSE) will depend on ψ8, and so on. The solution to this
convoluted problem is to stop estimating the functionals ψr after a given number ℓ of stages, hence the
terminology ℓ-stage plug-in selector. At the ℓ-th stage, the functional ψ2ℓ+4 appearing in the AMSE-optimal bandwidth
for estimating ψ2ℓ+2 is computed assuming that the density is a N(µ, σ²), for which

$$\psi_r = \frac{(-1)^{r/2}r!}{(2\sigma)^{r+1}(r/2)!\,\pi^{1/2}}, \quad \text{for } r \text{ even}.$$

Typically, two stages are considered a good trade-off between the bias (mitigated when ℓ increases) and the variance
(which augments with ℓ) of the plug-in selector. This is the method proposed by Sheather and Jones (1991), where
they consider L = K and k = 2, yielding what we call the direct plug-in (DPI). The algorithm is:
1. Estimate ψ8 using $\hat\psi_8^\mathrm{NS} := \frac{105}{32\pi^{1/2}\hat\sigma^9}$, where σ̂ is given in (2.19).
2. Estimate ψ6 using ψ̂6(g1) from (2.20), where

$$g_1 := \left[-\frac{2K^{(6)}(0)}{\mu_2(K)\hat\psi_8^\mathrm{NS}n}\right]^{1/9}.$$

3. Estimate ψ4 using ψ̂4(g2) from (2.20), where

$$g_2 := \left[-\frac{2K^{(4)}(0)}{\mu_2(K)\hat\psi_6(g_1)n}\right]^{1/7}.$$

4. The selected bandwidth is

$$\hat h_\mathrm{DPI} := \left[\frac{R(K)}{\mu_2^2(K)\hat\psi_4(g_2)n}\right]^{1/5}.$$
Remark. The derivatives K(r) for the normal kernel can be obtained using Hermite polynomials: φ(r)(x) =
φ(x)Hr(x) for r even. For r = 4, 6, H4(x) = x⁴ − 6x² + 3 and H6(x) = x⁶ − 15x⁴ + 45x² − 15.

ĥDPI is implemented in R through the function bw.SJ (use method = "dpi"). The package ks provides
hpi, which is a faster implementation that allows for more flexibility and has a somewhat more complete
documentation.
# Data
x <- rnorm(100)

# DPI selector
bw.SJ(x = x, method = "dpi")
## [1] 0.3538952

# Similar to
library(ks)
hpi(x) # Default is two-stages
## [1] 0.3531024
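To make the algorithm above concrete, the following is a minimal sketch of a two-stage DPI computation for the normal kernel, coded directly from steps 1–4. It is educational only: bw.SJ and ks::hpi remain the recommended implementations, and the function and helper names below are merely illustrative.
# Two-stage DPI for the normal kernel, following steps 1-4 above
hDPIsketch <- function(x) {

  n <- length(x)
  Rk <- 1 / (2 * sqrt(pi)) # R(K); mu2(K) = 1 for the normal kernel

  # sigma estimate (2.19)
  sigma <- min(sd(x), IQR(x) / diff(qnorm(c(0.25, 0.75))))

  # Even derivatives of the normal kernel via Hermite polynomials
  phi4 <- function(z) dnorm(z) * (z^4 - 6 * z^2 + 3)
  phi6 <- function(z) dnorm(z) * (z^6 - 15 * z^4 + 45 * z^2 - 15)

  # Kernel estimate of psi_r, equation (2.20), with L = K = dnorm
  psi <- function(g, phir, r) mean(phir(outer(x, x, "-") / g)) / g^(r + 1)

  # Step 1: normal-reference estimate of psi_8
  psi8NS <- 105 / (32 * sqrt(pi) * sigma^9)

  # Step 2: estimate psi_6 with bandwidth g1
  g1 <- (-2 * phi6(0) / (psi8NS * n))^(1/9)
  psi6 <- psi(g = g1, phir = phi6, r = 6) # typically negative

  # Step 3: estimate psi_4 with bandwidth g2
  g2 <- (-2 * phi4(0) / (psi6 * n))^(1/7)
  psi4 <- psi(g = g2, phir = phi4, r = 4)

  # Step 4: DPI bandwidth
  (Rk / (psi4 * n))^(1/5)

}

# Should be in the same ballpark as bw.SJ(x, method = "dpi") and ks::hpi(x)
hDPIsketch(x)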

2.4.2 Cross-validation

We turn now our attention to a different philosophy of bandwidth estimation. Instead of trying to minimize
the AMISE by plugging in estimates for the unknown curvature term, we directly attempt to minimize the
MISE by using the sample twice: once for computing the kde and once again for evaluating its performance on
estimating f. To avoid the clear dependence on the sample, we do this evaluation in a cross-validatory way:
the data used for computing the kde is not used for its evaluation.
We begin by expanding the square in the MISE expression:

$$\mathrm{MISE}[\hat f(\cdot; h)] = \mathrm{E}\left[\int (\hat f(x; h) - f(x))^2\,dx\right] = \mathrm{E}\left[\int \hat f(x; h)^2\,dx\right] - 2\mathrm{E}\left[\int \hat f(x; h)f(x)\,dx\right] + \int f(x)^2\,dx.$$

Since the last term does not depend on h, minimizing MISE[fˆ(·; h)] is equivalent to minimizing
$$\mathrm{E}\left[\int \hat f(x; h)^2\,dx\right] - 2\mathrm{E}\left[\int \hat f(x; h)f(x)\,dx\right].$$

This quantity is unknown, but it can be estimated unbiasedly (see Exercise 2.7) by

$$\mathrm{LSCV}(h) := \int \hat f(x; h)^2\,dx - 2n^{-1}\sum_{i=1}^n \hat f_{-i}(X_i; h), \tag{2.21}$$

where f̂−i(·; h) is the leave-one-out kde, based on the sample with Xi removed:

$$\hat f_{-i}(x; h) = \frac{1}{n-1}\sum_{\substack{j=1\\ j\neq i}}^n K_h(x - X_j).$$

The motivation for (2.21) is the following. The first term is unbiased by design. The second arises from
approximating ∫ f̂(x; h)f(x)dx by Monte Carlo or, in other words, by replacing f(x)dx = dF(x) with dFn(x).
This gives

$$\int \hat f(x; h)f(x)\,dx \approx \frac{1}{n}\sum_{i=1}^n \hat f(X_i; h)$$

and, in order to mitigate the dependence on the sample, we replace f̂(Xi; h) by f̂−i(Xi; h) above. In that
way, we use the sample for estimating the integral involving f̂(·; h), but for each Xi we compute the kde
on the remaining points. The least squares cross-validation (LSCV) selector, also denoted unbiased
cross-validation (UCV) selector, is defined as

$$\hat h_\mathrm{LSCV} := \arg\min_{h>0}\mathrm{LSCV}(h).$$

Numerical optimization is required for obtaining ĥLSCV, contrary to the previous plug-in selectors, and there
is little control on the shape of the objective function. This will also be the case for the forthcoming bandwidth
selectors. The following remark warns about the dangers of numerical optimization in this context.
Remark. Numerical optimization of the LSCV function can be challenging. In practice, several local minima
are possible, and the roughness of the objective function can vary notably depending on n and f . As a
consequence, optimization routines may get trapped in spurious solutions. To be on the safe side, it is
always advisable to check the solution by plotting LSCV(h) for a range of h, or to perform a search in a
bandwidth grid: $\hat h_\mathrm{LSCV} \approx \arg\min_{h\in\{h_1,\ldots,h_G\}}\mathrm{LSCV}(h)$.
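Following the remark, a minimal grid-based sketch of the LSCV objective (2.21) for the normal kernel is given next. The integral of the squared kde is computed exactly thanks to (2.24), and the function name is just illustrative.
# LSCV objective (2.21) for the normal kernel
LSCVobj <- function(h, x) {

  n <- length(x)
  D <- outer(x, x, "-")

  # int f-hat^2 = (1 / n^2) * sum_ij (K_h * K_h)(X_i - X_j), with
  # (K_h * K_h)(u) = phi_{sqrt(2) h}(u) for the normal kernel, by (2.24)
  intF2 <- mean(dnorm(D, sd = sqrt(2) * h))

  # Leave-one-out kdes evaluated at the sample points
  Kh <- dnorm(D, sd = h)
  loo <- (colSums(Kh) - dnorm(0, sd = h)) / (n - 1)

  intF2 - 2 * mean(loo)

}

# Grid minimizer of the LSCV objective; compare with bw.ucv below
set.seed(123456)
x <- rnorm(100)
hGrid <- seq(0.1, 1.5, l = 200)
hGrid[which.min(sapply(hGrid, LSCVobj, x = x))]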
ĥLSCV is implemented in R through the function bw.ucv. bw.ucv uses R’s optimize, which is quite sensible
to the selection of the search interval (long intervals containing the solution may lead to unsatisfactory
termination of the search; and short intervals might not contain the minimum). Therefore, some care is
needed and that is why the bw.ucv.mod function is presented.
# Data
set.seed(123456)
x <- rnorm(100)

# UCV gives a warning


bw.ucv(x = x)
## Warning in bw.ucv(x = x): minimum occurred at one end of the range
## [1] 0.4499177

# Extend search interval


args(bw.ucv)
## function (x, nb = 1000L, lower = 0.1 * hmax, upper = hmax, tol = 0.1 *
## lower)
## NULL
bw.ucv(x = x, lower = 0.01, upper = 1)
## [1] 0.5482419

# bw.ucv.mod replaces the optimization routine of bw.ucv by an exhaustive search on


# "h.grid" (chosen adaptatively from the sample) and optionally plots the LSCV curve
# with "plot.cv"
bw.ucv.mod <- function(x, nb = 1000L,
h.grid = diff(range(x)) * (seq(0.1, 1, l = 200))^2,
plot.cv = FALSE) {
if ((n <- length(x)) < 2L)
stop("need at least 2 data points")
n <- as.integer(n)
if (is.na(n))
stop("invalid length(x)")
if (!is.numeric(x))
stop("invalid 'x'")
nb <- as.integer(nb)
if (is.na(nb) || nb <= 0L)
stop("invalid 'nb'")
storage.mode(x) <- "double"
hmax <- 1.144 * sqrt(var(x)) * n^(-1/5)
Z <- .Call(stats:::C_bw_den, nb, x)
d <- Z[[1L]]
cnt <- Z[[2L]]
fucv <- function(h) .Call(stats:::C_bw_ucv, n, d, cnt, h)
# h <- optimize(fucv, c(lower, upper), tol = tol)$minimum
# if (h < lower + tol | h > upper - tol)
# warning("minimum occurred at one end of the range")
obj <- sapply(h.grid, function(h) fucv(h))
h <- h.grid[which.min(obj)]
if (plot.cv) {
plot(h.grid, obj, type = "o")
rug(h.grid)
abline(v = h, col = 2, lwd = 2)
}
h
}

# Compute the bandwidth and plot the LSCV curve


bw.ucv.mod(x = x, plot.cv = TRUE)
## [1] 0.5431732

# We can compare with the default bw.ucv output


abline(v = bw.ucv(x = x), col = 3)
## Warning in bw.ucv(x = x): minimum occurred at one end of the range

(Figure: the LSCV curve obj against h.grid, with the grid minimizer marked in red and bw.ucv's solution in green.)
The next cross-validation selector is based on biased cross-validation (BCV). The BCV selector presents
a hybrid strategy that combines plug-in and cross-validation ideas. It starts by considering the AMISE
expression in (2.18),

$$\mathrm{AMISE}[\hat f(\cdot; h)] = \frac{1}{4}\mu_2^2(K)R(f'')h^4 + \frac{R(K)}{nh},$$
and then plugs in an estimate for R(f ′′) based on a modification of R(f̂ ′′(·; h)). The modification is

$$\widehat{R(f'')} := R(\hat f''(\cdot; h)) - \frac{R(K'')}{nh^5} = \frac{1}{n^2}\sum_{i=1}^n\sum_{\substack{j=1\\ j\neq i}}^n (K_h'' * K_h'')(X_i - X_j), \tag{2.22}$$

a leave-out-diagonals estimate of R(f ′′). It is designed to reduce the bias of R(f̂ ′′(·; h)), since
$\mathrm{E}[R(\hat f''(\cdot; h))] = R(f'') + \frac{R(K'')}{nh^5} + O(h^2)$ (Scott and Terrell, 1987). Plugging (2.22) into the

AMISE expression yields the BCV objective function and BCV bandwidth selector:

$$\mathrm{BCV}(h) := \frac{1}{4}\mu_2^2(K)\widehat{R(f'')}h^4 + \frac{R(K)}{nh}, \qquad \hat h_\mathrm{BCV} := \arg\min_{h>0}\mathrm{BCV}(h).$$
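As an illustration, a minimal sketch of BCV(h) for the normal kernel follows. It relies on the identity (K_h′′ ∗ K_h′′)(u) = φ_{√2h}^{(4)}(u), which follows from (2.24) by differentiating under the convolution; the function name is merely illustrative.
# BCV objective for the normal kernel
BCVobj <- function(h, x) {

  n <- length(x)
  Rk <- 1 / (2 * sqrt(pi)) # R(K); mu2(K) = 1 for the normal kernel

  # (K_h'' * K_h'')(u) = phi_{sqrt(2) h}^{(4)}(u), the fourth derivative of a
  # normal density with standard deviation sqrt(2) * h
  s <- sqrt(2) * h
  Z <- outer(x, x, "-") / s
  conv <- dnorm(Z) * (Z^4 - 6 * Z^2 + 3) / s^5

  # Leave-out-diagonals estimate (2.22)
  diag(conv) <- 0
  Rf2 <- sum(conv) / n^2

  0.25 * Rf2 * h^4 + Rk / (n * h)

}

# Grid minimizer; should be close to the output of bw.bcv below
set.seed(123456)
x <- rnorm(100)
hGrid <- seq(0.1, 1.5, l = 200)
hGrid[which.min(sapply(hGrid, BCVobj, x = x))]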

The appealing property of ĥBCV is that it has a considerably smaller variance compared to ĥLSCV . This
reduction in variance comes at the price of an increased bias, which tends to make ĥBCV larger than hMISE .
ĥBCV is implemented in R through the function bw.bcv. Again, bw.bcv uses R’s optimize so the bw.bcv.mod
function is presented to have better guarantees on finding the adequate minimum.

# Data
set.seed(123456)
x <- rnorm(100)

# BCV gives a warning


bw.bcv(x = x)
## Warning in bw.bcv(x = x): minimum occurred at one end of the range
## [1] 0.4500924

# Extend search interval


args(bw.bcv)
## function (x, nb = 1000L, lower = 0.1 * hmax, upper = hmax, tol = 0.1 *
## lower)
## NULL
bw.bcv(x = x, lower = 0.01, upper = 1)
## [1] 0.5070129

# bw.bcv.mod replaces the optimization routine of bw.bcv by an exhaustive search on


# "h.grid" (chosen adaptatively from the sample) and optionally plots the BCV curve
# with "plot.cv"
bw.bcv.mod <- function(x, nb = 1000L,
h.grid = diff(range(x)) * (seq(0.1, 1, l = 200))^2,
plot.cv = FALSE) {
if ((n <- length(x)) < 2L)
stop("need at least 2 data points")
n <- as.integer(n)
if (is.na(n))
stop("invalid length(x)")
if (!is.numeric(x))
stop("invalid 'x'")
nb <- as.integer(nb)
if (is.na(nb) || nb <= 0L)
stop("invalid 'nb'")
storage.mode(x) <- "double"
hmax <- 1.144 * sqrt(var(x)) * n^(-1/5)
Z <- .Call(stats:::C_bw_den, nb, x)
d <- Z[[1L]]
cnt <- Z[[2L]]
fbcv <- function(h) .Call(stats:::C_bw_bcv, n, d, cnt, h)
# h <- optimize(fbcv, c(lower, upper), tol = tol)$minimum
# if (h < lower + tol | h > upper - tol)
# warning("minimum occurred at one end of the range")
obj <- sapply(h.grid, function(h) fbcv(h))
h <- h.grid[which.min(obj)]
if (plot.cv) {
plot(h.grid, obj, type = "o")
rug(h.grid)
abline(v = h, col = 2, lwd = 2)
}
h
}

# Compute the bandwidth and plot the BCV curve



bw.bcv.mod(x = x, plot.cv = TRUE)


## [1] 0.5130493

# We can compare with the default bw.bcv output


abline(v = bw.bcv(x = x), col = 3)
## Warning in bw.bcv(x = x): minimum occurred at one end of the range
(Figure: the BCV curve obj against h.grid, with the grid minimizer marked in red and bw.bcv's solution in green.)

2.4.3 Comparison of bandwidth selectors

We state next some insights from the convergence results of the DPI, LSCV, and BCV selectors. All of them
are based on results of the kind

$$n^\nu(\hat h/h_\mathrm{MISE} - 1) \stackrel{d}{\longrightarrow} N(0, \sigma^2),$$

where σ² depends only on K and f, and measures how variable the selector is. The rate n^ν quantifies
how fast the relative error ĥ/hMISE − 1 decreases (the larger the ν, the faster the convergence). Under certain
regularity conditions, we have:
regularity conditions, we have:
• $n^{1/10}(\hat h_\mathrm{LSCV}/h_\mathrm{MISE} - 1) \stackrel{d}{\longrightarrow} N(0, \sigma^2_\mathrm{LSCV})$ and $n^{1/10}(\hat h_\mathrm{BCV}/h_\mathrm{MISE} - 1) \stackrel{d}{\longrightarrow} N(0, \sigma^2_\mathrm{BCV})$. Both selectors
have a slow rate of convergence (compare it with the n^{1/2} of the CLT). Inspection of the variances of
both selectors reveals that, for the normal kernel, σ²LSCV/σ²BCV ≈ 15.7. Therefore, LSCV is considerably
more variable than BCV.
• $n^{5/14}(\hat h_\mathrm{DPI}/h_\mathrm{MISE} - 1) \stackrel{d}{\longrightarrow} N(0, \sigma^2_\mathrm{DPI})$. Thus, the DPI selector has a convergence rate much faster
than the cross-validation selectors. There is an appealing explanation for this phenomenon. Recall
that ĥBCV minimizes the slightly modified version of BCV(h) given by

$$\frac{1}{4}\mu_2^2(K)\tilde\psi_4(h)h^4 + \frac{R(K)}{nh},$$

where

$$\tilde\psi_4(h) := \frac{1}{n(n-1)}\sum_{i=1}^n\sum_{\substack{j=1\\ j\neq i}}^n (K_h'' * K_h'')(X_i - X_j) = \frac{n}{n-1}\widehat{R(f'')}.$$

ψ̃4 is a leave-out-diagonals estimate of ψ4. Despite being different from ψ̂4, it allows to build a DPI
analogue of BCV, and this analogy points towards the precise fact that drags down the performance of BCV. The
modified version of the DPI minimizes

$$\frac{1}{4}\mu_2^2(K)\tilde\psi_4(g)h^4 + \frac{R(K)}{nh},$$

where g is independent of h. The two methods differ in the way g is chosen: BCV sets g = h, whereas the
modified DPI looks for the best g in terms of AMSE[ψ̃4(g)]. It can be seen that gAMSE = O(n^{−2/13}),
whereas the h used in BCV is asymptotically O(n^{−1/5}). This suboptimality in the choice of g is the
reason for the asymptotic deficiency of BCV.

We focus now on exploring the empirical performance of bandwidth selectors. The workhorse for doing that
is simulation. A popular collection of simulation scenarios was given by Marron and Wand (1992) and is
conveniently available through the package nor1mix. They form a collection of normal r-mixtures of the form

$$f(x; \mu, \sigma, w) := \sum_{j=1}^r w_j\phi_{\sigma_j}(x - \mu_j),$$

where wj ≥ 0, j = 1, ..., r, and $\sum_{j=1}^r w_j = 1$. Densities of this form are especially attractive since they
allow for arbitrary flexibility and, if the normal kernel is employed, they admit explicit and exact MISE
expressions:

$$\mathrm{MISE}_r[\hat f(\cdot; h)] = (2\sqrt{\pi}nh)^{-1} + w'\{(1 - n^{-1})\Omega_2 - 2\Omega_1 + \Omega_0\}w,$$
$$(\Omega_a)_{ij} = \phi_{(ah^2 + \sigma_i^2 + \sigma_j^2)^{1/2}}(\mu_i - \mu_j), \quad i, j = 1, \ldots, r. \tag{2.23}$$
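A minimal sketch of how (2.23) could be coded for a generic normal mixture is given next. The function and argument names are illustrative only; in practice the mixture parameters would be extracted from a nor1mix object.
# Exact MISE (2.23) for a normal mixture and a normal kernel
miseNorMix <- function(h, n, mu, sigma, w) {

  # (Omega_a)_{ij} = phi_{(a * h^2 + sigma_i^2 + sigma_j^2)^{1/2}}(mu_i - mu_j)
  Omega <- function(a) {
    dnorm(outer(mu, mu, "-"), sd = sqrt(a * h^2 + outer(sigma^2, sigma^2, "+")))
  }

  1 / (2 * sqrt(pi) * n * h) +
    drop(t(w) %*% ((1 - 1/n) * Omega(2) - 2 * Omega(1) + Omega(0)) %*% w)

}

# Exact MISE curve and exact MISE-optimal bandwidth for a bimodal mixture
h <- seq(0.05, 1, l = 200)
mise <- sapply(h, miseNorMix, n = 100, mu = c(-1, 1), sigma = c(0.5, 0.5),
               w = c(0.5, 0.5))
plot(h, mise, type = "l")
h[which.min(mise)]
Exercise 2.14 builds on this kind of computation to compare the exact MISE with the AMISE.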

# Load package
library(nor1mix)

# Available models
?MarronWand

# Simulating
samp <- rnorMix(n = 500, obj = MW.nm9) # MW object in the second argument
hist(samp, freq = FALSE)

# Density evaluation
x <- seq(-4, 4, length.out = 400)
lines(x, dnorMix(x = x, obj = MW.nm9), col = 2)

(Figure: histogram of samp with the density of the MW.nm9 model overlaid.)
# Plot a MW object directly
# A normal with the same mean and variance is plotted in dashed lines
par(mfrow = c(2, 2))
plot(MW.nm5)
plot(MW.nm7)
plot(MW.nm10)
plot(MW.nm12)
lines(MW.nm10) # Also possible

(Figure: densities of the Marron–Wand models #5 Outlier, #7 Separated, #10 Claw, and #12 Asym Claw, with the normal density of matching mean and variance in dashed lines.)

Figure 2.4.3 presents a visualization of the performance of the kde with different bandwidth selectors, carried
out in the family of mixtures of Marron and Wand (1992).
Performance comparison of bandwidth selectors. The RT, DPI, LSCV, and BCV selectors are computed for each
sample from a normal mixture density. For each sample, the application computes the ISEs of the selectors and sorts them
from best to worst. Changing the scenarios gives insight into the adequacy of each selector to hard- and
simple-to-estimate densities. Application also available here.

2.5 Confidence intervals

Obtaining a confidence interval (CI) for f(x) is a hard task. Due to the bias results seen in Section 2.3, we
know that the kde is biased for finite sample sizes, E[f̂(x; h)] = (Kh ∗ f)(x), and it is only asymptotically
unbiased when h → 0. This bias is called the smoothing bias and in essence complicates the construction of
CIs for f(x), but not of (Kh ∗ f)(x). Let’s see with an illustrative example the differences between these two
objects.
Some well-known facts for normal densities (see Appendix C in Wand and Jones (1995)) are:

$$(\phi_{\sigma_1}(\cdot - \mu_1) * \phi_{\sigma_2}(\cdot - \mu_2))(x) = \phi_{(\sigma_1^2 + \sigma_2^2)^{1/2}}(x - \mu_1 - \mu_2), \tag{2.24}$$
$$\int \phi_{\sigma_1}(x - \mu_1)\phi_{\sigma_2}(x - \mu_2)\,dx = \phi_{(\sigma_1^2 + \sigma_2^2)^{1/2}}(\mu_1 - \mu_2), \tag{2.25}$$
$$\phi_\sigma(x - \mu)^r = \frac{(2\pi)^{(1-r)/2}}{\sigma^{r-1}r^{1/2}}\phi_{\sigma/r^{1/2}}(x - \mu). \tag{2.26}$$

As a consequence, if K = ϕ (i.e., Kh = ϕh ) and f (·) = ϕσ (· − µ), we have that


$$(K_h * f)(x) = \phi_{(h^2 + \sigma^2)^{1/2}}(x - \mu), \tag{2.27}$$
$$(K_h^2 * f)(x) = \frac{1}{2\pi^{1/2}h}(\phi_{h/2^{1/2}} * f)(x) = \frac{1}{2\pi^{1/2}h}\phi_{(h^2/2 + \sigma^2)^{1/2}}(x - \mu). \tag{2.28}$$
Thus, the exact expectation of the kde for estimating the density of a N(µ, σ²) is the density of a N(µ, σ² + h²).
Clearly, when h → 0, the bias disappears (at the expense of increasing the variance, of course). Removing this
finite-sample bias is not simple: if the bias is expanded, f ′′ appears. Thus, in order to unbias f̂(·; h),
we would have to estimate f ′′, which is much more complicated than estimating f! Taking second derivatives of
the kde does not simply work out-of-the-box, since the bandwidths for estimating f and f ′′ scale differently.
We refer the interested reader to Section 2.12 in Wand and Jones (1995) for a quick review of derivative kde.
The previous deadlock can be solved if we limit our ambitions. Rather than constructing a confidence interval
for f (x), we will do it for E[fˆ(x; h)] = (Kh ∗ f )(x). There is nothing wrong with this as long as we are careful
when we report the results to make it clear that the CI is for (Kh ∗ f )(x) and not f (x).
The building block for the CI for E[f̂(x; h)] = (Kh ∗ f)(x) is Theorem 2.2, which stated that

$$\sqrt{nh}(\hat f(x; h) - \mathrm{E}[\hat f(x; h)]) \stackrel{d}{\longrightarrow} N(0, R(K)f(x)).$$

Plugging in f̂(x; h) = f(x) + OP(h² + (nh)^{−1}) = f(x)(1 + oP(1)) (see Exercise 2.8) as an estimate for f(x)
in the variance, we have by Slutsky's theorem that

$$\sqrt{\frac{nh}{R(K)\hat f(x; h)}}(\hat f(x; h) - \mathrm{E}[\hat f(x; h)]) = \sqrt{\frac{nh}{R(K)f(x)}}(\hat f(x; h) - \mathrm{E}[\hat f(x; h)])(1 + o_P(1)) \stackrel{d}{\longrightarrow} N(0, 1).$$

Therefore, an asymptotic 100(1 − α)% confidence interval for E[f̂(x; h)] that can be straightforwardly
computed is

$$I = \left(\hat f(x; h) \pm z_{\alpha}\sqrt{\frac{R(K)\hat f(x; h)}{nh}}\right). \tag{2.29}$$

Recall that if we wanted to obtain a CI based on the second result in Theorem 2.2, we would need to estimate
f ′′(x).
Remark. Several points regarding the CI require proper awareness:
1. The CI is for E[f̂(x; h)] = (Kh ∗ f)(x), not f(x).
2. The CI is pointwise. That means that P[E[f̂(x; h)] ∈ I] ≈ 1 − α for each x ∈ R, but not simultaneously.
That is, P[E[f̂(x; h)] ∈ I, ∀x ∈ R] ≠ 1 − α.

3. We are approximating f(x) in the variance by f̂(x; h) = f(x) + OP(h² + (nh)^{−1}). Additionally, the
convergence to a normal distribution happens at rate √(nh). Hence h needs to be small and nh
large for a good coverage.
4. The CI is for a deterministic bandwidth h (i.e., not selected from the sample), which is not usually
the case in practice. If a bandwidth selector is employed, the coverage will be affected, especially for
small n.
We illustrate the construction of (2.29) for the situation where the reference density is a N(µ, σ²) and
the normal kernel is employed. This allows us to use (2.27) and (2.28) in combination with (2.6) and (2.7) to
obtain:

$$\mathrm{E}[\hat f(x; h)] = \phi_{(h^2 + \sigma^2)^{1/2}}(x - \mu),$$
$$\mathrm{Var}[\hat f(x; h)] = \frac{1}{n}\left(\frac{\phi_{(h^2/2 + \sigma^2)^{1/2}}(x - \mu)}{2\pi^{1/2}h} - (\phi_{(h^2 + \sigma^2)^{1/2}}(x - \mu))^2\right).$$

The following pieces of code construct the CI for a single sample and then evaluate the proportion of times
that E[f̂(x; h)] belongs to I for each x ∈ R, both estimating and knowing the variance in the asymptotic
distribution.
# R(K) for a normal
Rk <- 1 / (2 * sqrt(pi))

# Generate a sample from a N(mu, sigma^2)


n <- 100
mu <- 0
sigma <- 1
set.seed(123456)
x <- rnorm(n = n, mean = mu, sd = sigma)

# Compute the kde (NR bandwidth)


kde <- density(x = x, from = -4, to = 4, n = 1024, bw = "nrd")

# Selected bandwidth
h <- kde$bw

# Estimate the variance


hatVarKde <- kde$y * Rk / (n * h)

# True expectation and variance (because the density is a normal)


EKh <- dnorm(x = kde$x, mean = mu, sd = sqrt(sigma^2 + h^2))
varKde <- (dnorm(kde$x, mean = mu, sd = sqrt(h^2 / 2 + sigma^2)) /
(2 * sqrt(pi) * h) - EKh^2) / n

# CI with estimated variance


alpha <- 0.05
zalpha <- qnorm(1 - alpha/2)
ciLow1 <- kde$y - zalpha * sqrt(hatVarKde)
ciUp1 <- kde$y + zalpha * sqrt(hatVarKde)

# CI with known variance


ciLow2 <- kde$y - zalpha * sqrt(varKde)
ciUp2 <- kde$y + zalpha * sqrt(varKde)

# Plot estimate, CIs and expectation



plot(kde, main = "Density and CIs", ylim = c(0, 1))


lines(kde$x, ciLow1, col = "gray")
lines(kde$x, ciUp1, col = "gray")
lines(kde$x, ciUp2, col = "gray", lty = 2)
lines(kde$x, ciLow2, col = "gray", lty = 2)
lines(kde$x, EKh, col = "red")
legend("topright", legend = c("Estimate", "CI estimated var",
"CI known var", "Smoothed density"),
col = c("black", "gray", "gray", "red"), lwd = 2, lty = c(1, 1, 2, 1))

(Figure: "Density and CIs" — the kde with pointwise CIs (estimated and known variance) and the smoothed density (Kh ∗ f) in red; N = 100, bandwidth = 0.4193.)


The above code computes the CI. But it does not give any insight into the effective coverage of the CI. The
next simulation exercise deals with this issue.
# Simulation setting
n <- 100; h <- 0.2
mu <- 0; sigma <- 1 # Normal parameters
M <- 5e2 # Number of replications in the simulation
nGrid <- 512 # Number of x's for computing the kde
alpha <- 0.05; zalpha <- qnorm(1 - alpha/2) # alpha for CI

# Compute expectation and variance


kde <- density(x = 0, bw = h, from = -4, to = 4, n = nGrid) # Just to get kde$x
EKh <- dnorm(x = kde$x, mean = mu, sd = sqrt(sigma^2 + h^2))
varKde <- (dnorm(kde$x, mean = mu, sd = sqrt(h^2 / 2 + sigma^2)) /
(2 * sqrt(pi) * h) - EKh^2) / n

# For storing if the mean is inside the CI


insideCi1 <- insideCi2 <- matrix(nrow = M, ncol = nGrid)

# Simulation

set.seed(12345)
for (i in 1:M) {

# Sample & kde


x <- rnorm(n = n, mean = mu, sd = sigma)
kde <- density(x = x, bw = h, from = -4, to = 4, n = nGrid)
hatSdKde <- sqrt(kde$y * Rk / (n * h))

# CI with estimated variance


ciLow1 <- kde$y - zalpha * hatSdKde
ciUp1 <- kde$y + zalpha * hatSdKde

# CI with known variance


ciLow2 <- kde$y - zalpha * sqrt(varKde)
ciUp2 <- kde$y + zalpha * sqrt(varKde)

# Check if for each x the mean is inside the CI


insideCi1[i, ] <- EKh > ciLow1 & EKh < ciUp1
insideCi2[i, ] <- EKh > ciLow2 & EKh < ciUp2

}

# Plot results
plot(kde$x, colMeans(insideCi1), ylim = c(0.25, 1), type = "l",
main = "Coverage CIs", xlab = "x", ylab = "Coverage")
lines(kde$x, colMeans(insideCi2), col = 4)
abline(h = 1 - alpha, col = 2)
abline(h = 1 - alpha + c(-1, 1) * qnorm(0.975) *
sqrt(alpha * (1 - alpha) / M), col = 2, lty = 2)
legend(x = "bottom", legend = c("CI estimated var", "CI known var",
"Nominal level",
"95% CI for the nominal level"),
col = c(1, 4, 2, 2), lwd = 2, lty = c(1, 1, 1, 2))

(Figure: "Coverage CIs" — coverage along x of the CIs with estimated and known variance, compared against the nominal level 1 − α and its 95% CI.)
Exercise 2.3. Explore the coverage of the asymptotic CI for varying values of h. To that end, adapt the
previous code to work in a manipulate environment like the example given below.
# Load manipulate
# install.packages("manipulate")
library(manipulate)

# Sample
x <- rnorm(100)

# Simple plot of kde for varying h


manipulate({

kde <- density(x = x, from = -4, to = 4, bw = h)


plot(kde, ylim = c(0, 1), type = "l", main = "")
curve(dnorm(x), from = -4, to = 4, col = 2, add = TRUE)
rug(x)

}, h = slider(min = 0.01, max = 2, initial = 0.5, step = 0.01))

2.6 Practical issues


We discuss in this section several practical issues for kernel density estimation.

2.6.1 Boundary issues and transformations

In Section 2.3 we assumed certain regularity conditions for f . Assumption A1 stated that f should be twice
continuously differentiable (on R). It is simple to think of a counterexample: take any density with
bounded support, for example a LN(0, 1) in (0, ∞), as seen below. Then the kde will run into trouble.
# Sample from a LN(0, 1)
set.seed(123456)
samp <- rlnorm(n = 500)

# kde and density


plot(density(samp), ylim = c(0, 1))
curve(dlnorm(x), from = -2, to = 10, col = 2, add = TRUE)
rug(samp)

(Figure: kde of the log-normal sample, which spreads probability mass below 0, with the LN(0, 1) density overlaid; N = 500, bandwidth = 0.2936.)


What is happening is clear: the kde is spreading probability mass outside the support of the LN(0, 1), because
the kernels are functions defined on R. Since the kde places probability mass at negative values, it takes
it from the positive side, resulting in a severe negative bias around 0. As a consequence, the kde does not
integrate to one on the support of the data. No matter the sample size considered, the kde will always
have a negative bias of O(h) at the boundary, instead of the standard O(h²).
A simple approach to deal with the boundary bias is to map a compactly-supported density f into a real-
supported density g, which is simpler to estimate, by means of a transformation t:
f(x) = g(t(x))t′(x).
The transformation kde is obtained by replacing g by the usual kde:

$$\hat f_\mathrm{T}(x; h, t) := \frac{1}{n}\sum_{i=1}^n K_h(t(x) - t(X_i))t'(x). \tag{2.30}$$

Note that h is in the scale of t(Xi), not Xi. Hence another plus of this approach is that bandwidth selection
can be done transparently in terms of the previously seen bandwidth selectors applied to t(X1), ..., t(Xn).
Some common transformations are:
• Log: useful for data in (a, ∞), a ∈ R; t(x) = log(x − a) and t′(x) = 1/(x − a).
• Probit: useful for data in (a, b); t(x) = Φ⁻¹((x − a)/(b − a)) and t′(x) = [(b − a)φ(Φ⁻¹((x − a)/(b − a)))]⁻¹.
• Shifted power: useful for heavily skewed data above −λ₁; t(x) = (x + λ₁)^{λ₂} sign(λ₂) if λ₂ ≠ 0 and log(x + λ₁) if λ₂ = 0, with t′(x) = λ₂(x + λ₁)^{λ₂−1} sign(λ₂) if λ₂ ≠ 0 and 1/(x + λ₁) if λ₂ = 0.

Construction of the transformation kde for the log and probit transformations. The left panel shows the kde
(2.4) applied to the transformed data. The right plot shows the transformed kde (2.30). Application also
available here.
The code below illustrates how to compute a transformation kde in practice.
# kde with log-transformed data
kde <- density(log(samp))
plot(kde, main = "kde of transformed data")
rug(log(samp))

(Figure: kde of the log-transformed data; N = 500, bandwidth = 0.259.)

# Careful: kde$x is in the reals!


range(kde$x)
## [1] -4.542984 3.456035

# Untransform kde$x so the grid is in (0, infty)


kdeTransf <- kde
kdeTransf$x <- exp(kdeTransf$x)

# Transform the density using the chain rule


kdeTransf$y <- kdeTransf$y * 1 / kdeTransf$x

# Transformed kde
plot(kdeTransf, main = "Transformed kde")
curve(dlnorm(x), col = 2, add = TRUE)
rug(samp)

(Figure: the transformed kde compared with the LN(0, 1) density; N = 500, bandwidth = 0.259.)
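Analogously, the probit transformation in the list above handles densities supported on a bounded interval. A minimal sketch for data in (0, 1), here a hypothetical Beta sample, could be the following.
# Probit-transformation kde for data in (0, 1), i.e., a = 0 and b = 1
set.seed(123456)
sampBeta <- rbeta(n = 500, shape1 = 2, shape2 = 5)

# kde of the probit-transformed data
kdeT <- density(qnorm(sampBeta))

# Back-transform: f_T(x) = g(t(x)) * t'(x), with t = qnorm and
# t'(x) = 1 / dnorm(qnorm(x))
xSeq <- seq(0.001, 0.999, l = 500)
fT <- approx(kdeT$x, kdeT$y, xout = qnorm(xSeq), rule = 2)$y / dnorm(qnorm(xSeq))

# Compare with the true Beta(2, 5) density
plot(xSeq, fT, type = "l", xlab = "x", ylab = "Density")
curve(dbeta(x, shape1 = 2, shape2 = 5), col = 2, add = TRUE)
rug(sampBeta)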

2.6.2 Sampling

Sampling a kde is relatively straightforward. The trick is to recall

$$\hat f(x; h) = \frac{1}{n}\sum_{i=1}^n K_h(x - X_i)$$

as a mixture density made of n mixture components, each of them equally likely to be sampled. The
only part that might require special treatment is sampling from the density K, although for most of the
implemented kernels R contains specific sampling functions.
The algorithm for sampling N points goes as follows:
1. Choose i ∈ {1, . . . , n} at random.
2. Sample from $K_h(\cdot - X_i) = \frac{1}{h}K\left(\frac{\cdot - X_i}{h}\right)$.
3. Repeat the previous steps N times.
Let’s see a quick application.
# Sample the Claw
n <- 100
set.seed(23456)

samp <- rnorMix(n = n, obj = MW.nm10)

# Kde
h <- 0.1
plot(density(samp, bw = h))

# Naive sampling algorithm


sampKde <- numeric(1e6)
for (k in 1:1e6) {

i <- sample(x = 1:100, size = 1)


sampKde[k] <- rnorm(n = 1, mean = samp[i], sd = h)

}

# Add kde of the sampled kde - almost equal


lines(density(sampKde), col = 2)

# Sample 1e6 points from the kde


i <- sample(x = 100, size = 1e6, replace = TRUE)
sampKde <- rnorm(1e6, mean = samp[i], sd = h)

# Add kde of the sampled kde - almost equal


lines(density(sampKde), col = 3)

(Figure: kde of the original sample (bw = 0.1) together with the kdes of the two sampled datasets, which are almost indistinguishable.)



2.7 Exercises
This is the list of evaluable exercises for Chapter 2. The number of stars represents an estimate of their
difficulty: easy (⋆), medium (⋆⋆), and hard (⋆ ⋆ ⋆).
Exercise 2.4. (theoretical, ⋆) Prove that the histogram (2.1) is a proper density estimate (a nonnegative
density that integrates one). Obtain its associated distribution function. What is its difference with respect
to the ecdf (1.1)?
Exercise 2.5. (theoretical, ⋆, adapted from Exercise 2.1 in Wand and Jones (1995)) Derive the result (2.7).
Then obtain the exact MSE and MISE using (2.6) and (2.7).
Exercise 2.6. (theoretical, ⋆⋆) Conditionally on the sample X1 , . . . , Xn , compute the expectation and
variance of the kde (2.4) and compare them with the sample mean and variance. What is the effect of h in
them?
Exercise 2.7. (theoretical, ⋆⋆, Exercise 3.3 in Wand and Jones (1995)) Show that

E[LSCV(h)] = MISE[fˆ(·; h)] − R(f ).

Exercise 2.8. (theoretical, ⋆ ⋆ ⋆) Show that:


• $\hat f(x; h) = f(x) + o(h^2) + O_P((nh)^{-1/2}) = f(x)(1 + o_P(1))$.
• $\frac{1}{n}\sum_{i=1}^n (X_i - x)K_h(X_i - x) = \mu_2(K)f'(x)h^2 + o(h^2) + O_P(n^{-1/2}h^{1/2})$.

Hint: use the Chebyshev inequality.


Exercise 2.9. (theoretical, ⋆ ⋆ ⋆, Exercise 2.23 in Wand and Jones (1995)) Show that the bias and variance
for the transformation kde (2.30) are

$$\mathrm{Bias}[\hat f_\mathrm{T}(x; h, t)] = \frac{1}{2}\mu_2(K)g''(t(x))t'(x)h^2 + o(h^2),$$
$$\mathrm{Var}[\hat f_\mathrm{T}(x; h, t)] = \frac{R(K)}{nh}g(t(x))t'(x)^2 + o((nh)^{-1}),$$

where g is the density of t(X). Using these results, prove that

$$\mathrm{AMISE}[\hat f_\mathrm{T}(\cdot; h, t)] = \frac{1}{4}\mu_2^2(K)\left\{\int t'(t^{-1}(x))g''(x)^2\,dx\right\}h^4 + \frac{R(K)}{nh}\mathrm{E}[t'(X)].$$

Exercise 2.10. (practical, ⋆) The kde can be used to smoothly resample a dataset. To that end, first
compute the kde of the dataset and then employ the algorithm of Section 2.6. Implement this resampling
as a function that takes as arguments the dataset, the bandwidth h, and the number of sampled points M
wanted from the dataset. Use the normal kernel for simplicity. Test the implementation with the faithful
dataset and different bandwidths.
Exercise 2.11. (practical, ⋆⋆, Exercise 6.5 in Wasserman (2006)) Data on the salaries of the chief executive
officer of 60 companies is available at http://lib.stat.cmu.edu/DASL/Datafiles/ceodat.html (alternative link).
Investigate the distribution of salaries using a kde. Use ĥLSCV to choose the amount of smoothing. Also
consider ĥRT . There appear to be a few bumps in the density. Are they real? Use confidence bands to
address this question. Finally, comment on the resulting estimates.
Exercise 2.12. (practical, ⋆⋆) Implement your own version of the transformation kde (2.30) for the three
transformations given in Section 2.6. You can tweak the output of the density function in R and add an
extra argument for selecting the kind of transformation. Or you can implement it directly from scratch.
Exercise 2.13. (practical, ⋆ ⋆ ⋆) A bandwidth selector is a random variable. Visualizing its density can
help to understand its behavior, especially if it is compared with the asymptotic optimal bandwidth hAMISE .
Create a script that does the following steps:
1. For j = 1, . . . , M = 10000:

• Simulates a sample from a model mixture of nor1mix.


• Computes the bandwidth selectors ĥRT , ĥBCV , ĥUCV , and ĥDPI , and stores them.
2. Estimates the density of each bandwidth selector from its corresponding sample of size M . Use the
RT selector for estimating the density.
3. Plots the estimated densities together.
4. Draws a vertical line representing the hAMISE bandwidth.
Describe the results for the “Claw” and “Bimodal” densities in nor1mix, for sample sizes n = 100, 500.
Exercise 2.14. (practical, ⋆ ⋆ ⋆) Use (2.23) and the family of densities of Marron and Wand (1992) in
nor1mix to compare the MISE and AMISE criteria. To that purpose, code (2.23) and the AMISE expression
for the normal kernel and compare the two error curves and the two minimizers. Explore three models of
your choice from nor1mix for sample sizes n = 50, 100, 200. Describe in detail the results and the major
takeaways.
Exercise 2.15. (practical, ⋆ ⋆ ⋆) The kde can be extended to the multivariate setting by using product
kernels. For a sample X1 , . . . , Xn in Rp , the multivariate kde employing product kernels is

$$\hat f(x; h) = \frac{1}{n}\sum_{i=1}^n K_{h_1}(x_1 - X_{i,1})\times\cdots\times K_{h_p}(x_p - X_{i,p}),$$

where x = (x1 , . . . , xp ), Xi = (Xi,1 , . . . , Xi,p ), and h = (h1 , . . . , hp ) is a vector of bandwidths.


Do:
• Implement a function that computes the bivariate kde using normal kernels.
• Create a sample by simulating 500 points from a N ((−1, −1), diag(1, 2)) and 500 from a
N ((1, 1), diag(2, 1)).
• Estimate the unknown density from sample of size n = 1000.
• Check graphically the correct implementation by comparing the kde with the true density (use
?contour).
Chapter 3

Regression estimation

The relation of two random variables X and Y can be completely characterized by their joint cdf F , or
equivalently, by the joint pdf f if (X, Y ) is continuous, the case we will address. In the regression setting,
we are interested in predicting/explaining the response Y by means of the predictor X from a sample
(X1 , Y1 ), . . . , (Xn , Yn ). The role of the variables is not symmetric: X is used to predict/explain Y .
The complete knowledge of Y when X = x is given by the conditional pdf $f_{Y|X=x}(y) = \frac{f(x,y)}{f_X(x)}$. While this
pdf provides full knowledge about Y|X = x, it is also a challenging task to estimate it: for each x we have
to estimate a curve! A simpler approach, yet still challenging, is to estimate the conditional mean (a scalar)
for each x. This is the so-called regression function1

$$m(x) := \mathrm{E}[Y|X = x] = \int y\,dF_{Y|X=x}(y) = \int yf_{Y|X=x}(y)\,dy.$$

Thus we aim to provide information about the expectation of Y, not its whole distribution, by means of X.

Finally, recall that Y can be expressed in terms of m by means of the location-scale model:

Y = m(X) + σ(X)ε,

where σ 2 (x) := Var[Y |X = x] and ε is independent from X and such that E[ε] = 0 and Var[ε] = 1.

3.1 Review on parametric regression


We review now a couple of useful parametric regression models that will be used in the construction of
nonparametric regression models.

3.1.1 Linear regression

Model formulation and least squares

The multiple linear regression employs multiple predictors X1 , . . . , Xp 2 for explaining a single response Y by
assuming that a linear relation of the form

Y = β0 + β1 X1 + . . . + βp Xp + ε (3.1)
1 Recall that we assume that (X, Y ) is continuous.
2 Not to confuse with a sample!


holds between the predictors X1 , . . . , Xp and the response Y . In (3.1), β0 is the intercept and β1 , . . . , βp are
the slopes, respectively. ε is a random variable with mean zero and independent from X1 , . . . , Xp . Another
way of looking at (3.1) is

E[Y |X1 = x1 , . . . , Xp = xp ] = β0 + β1 x1 + . . . + βp xp , (3.2)

since E[ε|X1 = x1 , . . . , Xp = xp ] = 0. Therefore, the mean of Y is changing in a linear fashion with respect
to the values of X1 , . . . , Xp . Hence the interpretation of the coefficients:

• β0 : is the mean of Y when X1 = . . . = Xp = 0.


• βj , 1 ≤ j ≤ p: is the additive increment in mean of Y for an increment of one unit in Xj = xj ,
provided that the remaining variables do not change.

Figure 3.1 illustrates the geometrical interpretation of a multiple linear model: a plane in the (p + 1)-
dimensional space. If p = 1, the plane is the regression line for simple linear regression. If p = 2, then the
plane can be visualized in a three-dimensional plot. TODO: add another figure.

Figure 3.1: The regression plane (blue) of Y on X1 and X2 , and its relation with the regression lines (green
lines) of Y on X1 (left) and Y on X2 (right). The red points represent the sample for (X1 , X2 , Y ) and
the black points the subsamples for (X1 , X2 ) (bottom), (X1 , Y ) (left) and (X2 , Y ) (right). Note that the
regression plane is not a direct extension of the marginal regression lines.

The estimation of β0 , β1 , . . . , βp is done by minimizing the so-called Residual Sum of Squares (RSS). First
we need to introduce some helpful matrix notation:

• A sample of (X1 , . . . , Xp , Y ) is denoted by (X11 , . . . , X1p , Y1 ), . . . , (Xn1 , . . . , Xnp , Yn ), where Xij de-
notes the i-th observation of the j-th predictor Xj . We denote with Xi = (Xi1 , . . . , Xip ) to the i-th
observation of (X1 , . . . , Xp ), so the sample simplifies to (X1 , Y1 ), . . . , (Xn , Yn ).
• The design matrix contains all the information of the predictors and a column of ones:

$$\mathbf{X} = \begin{pmatrix} 1 & X_{11} & \cdots & X_{1p}\\ \vdots & \vdots & \ddots & \vdots\\ 1 & X_{n1} & \cdots & X_{np} \end{pmatrix}_{n\times(p+1)}.$$

• The vector of responses Y, the vector of coefficients β, and the vector of errors are, respectively,

$$\mathbf{Y} = \begin{pmatrix} Y_1\\ \vdots\\ Y_n \end{pmatrix}_{n\times 1}, \quad \boldsymbol\beta = \begin{pmatrix} \beta_0\\ \beta_1\\ \vdots\\ \beta_p \end{pmatrix}_{(p+1)\times 1}, \quad \text{and} \quad \boldsymbol\varepsilon = \begin{pmatrix} \varepsilon_1\\ \vdots\\ \varepsilon_n \end{pmatrix}_{n\times 1}.$$

Thanks to the matrix notation, we can turn the sample version of the multiple linear model, namely

Yi = β0 + β1 Xi1 + . . . + βp Xip + εi , i = 1, . . . , n,

into something as compact as

Y = Xβ + ε.

The RSS for the multiple linear regression is


$$\mathrm{RSS}(\boldsymbol\beta) := \sum_{i=1}^n (Y_i - \beta_0 - \beta_1X_{i1} - \ldots - \beta_pX_{ip})^2 = (\mathbf{Y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{Y} - \mathbf{X}\boldsymbol\beta). \tag{3.3}$$

The RSS aggregates the squared vertical distances from the data to a regression plane given by β. Note that
the vertical distances are considered because we want to minimize the error in the prediction of Y . Thus,
the treatment of the variables is not symmetrical 3 ; see Figure 3.1.1. The least squares estimators are the
minimizers of the RSS:

$$\hat{\boldsymbol\beta} := \arg\min_{\boldsymbol\beta\in\mathbb{R}^{p+1}}\mathrm{RSS}(\boldsymbol\beta).$$

Luckily, thanks to the matrix form of (3.3), it is simple to compute a closed-form expression for the least
squares estimates:

β̂ = (X′ X)−1 X′ Y. (3.4)


3 If that was the case, we would consider perpendicular distances, which lead to Principal Component Analysis (PCA).

Exercise 3.1. β̂ can be obtained by differentiating (3.3). Prove it using that $\frac{\partial \mathbf{A}\mathbf{x}}{\partial \mathbf{x}} = \mathbf{A}$ and
$\frac{\partial f(\mathbf{x})'g(\mathbf{x})}{\partial \mathbf{x}} = f(\mathbf{x})'\frac{\partial g(\mathbf{x})}{\partial \mathbf{x}} + g(\mathbf{x})'\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}$ for two vector-valued functions f and g.

The least squares regression plane y = β̂0 + β̂1 x1 + β̂2 x2 and its dependence on the kind of squared distance
considered. Application also available here.

Let’s check that indeed the coefficients given by R’s lm are the ones given by (3.4) in a toy linear model.
# Create the data employed in Figure 3.1

# Generates 50 points from a N(0, 1): predictors and error


set.seed(34567)
x1 <- rnorm(50)
x2 <- rnorm(50)
x3 <- x1 + rnorm(50, sd = 0.05) # Make variables dependent
eps <- rnorm(50)

# Responses
yLin <- -0.5 + 0.5 * x1 + 0.5 * x2 + eps
yQua <- -0.5 + x1^2 + 0.5 * x2 + eps
yExp <- -0.5 + 0.5 * exp(x2) + x3 + eps

# Data
dataAnimation <- data.frame(x1 = x1, x2 = x2, yLin = yLin,
yQua = yQua, yExp = yExp)

# Call lm
# lm employs formula = response ~ predictor1 + predictor2 + ...
# (names according to the data frame names) for denoting the regression
# to be done
mod <- lm(yLin ~ x1 + x2, data = dataAnimation)
summary(mod)
##
## Call:
## lm(formula = yLin ~ x1 + x2, data = dataAnimation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.37003 -0.54305 0.06741 0.75612 1.63829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5703 0.1302 -4.380 6.59e-05 ***
## x1 0.4833 0.1264 3.824 0.000386 ***
## x2 0.3215 0.1426 2.255 0.028831 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9132 on 47 degrees of freedom
## Multiple R-squared: 0.276, Adjusted R-squared: 0.2452
## F-statistic: 8.958 on 2 and 47 DF, p-value: 0.0005057

# mod is a list with a lot of information


# str(mod) # Long output

# Coefficients
mod$coefficients
## (Intercept) x1 x2
## -0.5702694 0.4832624 0.3214894

# Application of formula (3.4)

# Matrix X
X <- cbind(1, x1, x2)

# Vector Y
Y <- yLin

# Coefficients
beta <- solve(t(X) %*% X) %*% t(X) %*% Y
beta
## [,1]
## -0.5702694
## x1 0.4832624
## x2 0.3214894
Exercise 3.2. Compute β for the regressions yLin ~ x1 + x2, yQua ~ x1 + x2, and yExp ~ x2 + x3
using equation (3.4) and the function lm. Check that the fitted plane and the coefficient estimates are
coherent.
Once we have the least squares estimates β̂, we can define the next two concepts:
• The fitted values Ŷ1 , . . . , Ŷn , where

Ŷi := β̂0 + β̂1 Xi1 + · · · + β̂p Xip , i = 1, . . . , n.

They are the vertical projections of Y1 , . . . , Yn onto the fitted plane (see Figure 3.1.1). In matrix form,
plugging in (3.4),
Ŷ = Xβ̂ = X(X′ X)−1 X′ Y = HY,
where H := X(X′ X)−1 X′ is called the hat matrix because it “puts the hat on Y”. What it does is to
project Y onto the regression plane (see Figure 3.1.1); a numerical check is given right after this list.
• The residuals (or estimated errors) ε̂1 , . . . , ε̂n , where

ε̂i := Yi − Ŷi , i = 1, . . . , n.

They are the vertical distances between actual data and fitted data.
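As a quick numerical check, reusing the X, Y, and mod objects computed in the code above, the hat matrix indeed reproduces lm's fitted values and residuals:
# Fitted values and residuals via the hat matrix H = X (X'X)^{-1} X'
H <- X %*% solve(t(X) %*% X) %*% t(X)
max(abs(H %*% Y - mod$fitted.values)) # numerically zero
max(abs(Y - H %*% Y - mod$residuals)) # numerically zero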

Model assumptions

Up to now, we have not made any probabilistic assumption on the data generation process. β̂ was derived
from purely geometrical arguments, not probabilistic ones. However, some probabilistic assumptions are
required for inferring the unknown population coefficients β from the sample (X1 , Y1 ), . . . , (Xn , Yn ).
The assumptions of the multiple linear model are:
i. Linearity: E[Y |X1 = x1 , . . . , Xp = xp ] = β0 + β1 x1 + . . . + βp xp .
ii. Homoscedasticity: Var[εi ] = σ 2 , with σ 2 constant for i = 1, . . . , n.
iii. Normality: εi ∼ N (0, σ 2 ) for i = 1, . . . , n.
iv. Independence of the errors: ε1 , . . . , εn are independent (or uncorrelated, E[εi εj ] = 0, i ̸= j, since
they are assumed to be normal).

Figure 3.2: The key concepts of the simple linear model. The blue densities denote the conditional density
of Y for each cut in the X axis. The yellow band denotes where the 95% of the data is, according to the
model. The red points represent data following the model.

A good one-line summary of the linear model is the following (independence is assumed)

Y |(X1 = x1 , . . . , Xp = xp ) ∼ N (β0 + β1 x1 + . . . + βp xp , σ 2 ). (3.5)

Inference on the parameters β and σ can be done, conditionally4 on X1 , . . . , Xn , from (3.5). We do not
explore this further, referring the interested reader to these notes. Instead, we remark the connection
between least squares estimation and the maximum likelihood estimator derived from (3.5).

First, note that (3.5) is the population version of the linear model (it is expressed in terms of the random
variables, not in terms of their samples). The sample version that summarizes assumptions i–iv is

$$\mathbf{Y}|(\mathbf{X}_1,\ldots,\mathbf{X}_n) \sim \mathcal{N}_n(\mathbf{X}\boldsymbol\beta, \sigma^2\mathbf{I}).$$

Using this result, it is easy to obtain the log-likelihood function of Y1 , . . . , Yn conditionally5 on X1 , . . . , Xn as

4 We assume that the randomness is on the response only.


5 We assume that the randomness is on the response only.


$$\ell(\boldsymbol\beta) = \log\phi_{\sigma^2\mathbf{I}}(\mathbf{Y} - \mathbf{X}\boldsymbol\beta) = \sum_{i=1}^n \log\phi_\sigma(Y_i - (\mathbf{X}\boldsymbol\beta)_i). \tag{3.6}$$

Finally, the next result justifies the consideration of the least squares estimate: it equals the maximum
likelihood estimator derived under assumptions i–iv.
Theorem 3.1. Under assumptions i–iv, the maximum likelihood estimate of β is the least squares estimate
(3.4):

$$\hat{\boldsymbol\beta}_\mathrm{ML} = \arg\max_{\boldsymbol\beta\in\mathbb{R}^{p+1}}\ell(\boldsymbol\beta) = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}.$$

Proof. Expanding the first equality at (3.6) gives (since |σ²I|^{1/2} = σ^n)

$$\ell(\boldsymbol\beta) = -\log((2\pi)^{n/2}\sigma^n) - \frac{1}{2\sigma^2}(\mathbf{Y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{Y} - \mathbf{X}\boldsymbol\beta).$$

Optimizing ℓ does not require knowledge of σ², since differentiating with respect to β and equating to zero
gives (see Exercise 3.1) $\frac{1}{\sigma^2}(\mathbf{Y} - \mathbf{X}\boldsymbol\beta)'\mathbf{X} = 0$. The result follows from that.
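As a numerical illustration of (3.6) and Theorem 3.1, the log-likelihood can be evaluated at the least squares estimate; the sketch below reuses the X, Y, beta, and mod objects from the code above and replaces σ by its maximum likelihood estimate.
# Evaluate (3.6) at the least squares estimate; sigma replaced by its MLE
sigmaML <- sqrt(mean(mod$residuals^2))
sum(dnorm(Y - X %*% beta, mean = 0, sd = sigmaML, log = TRUE))

# Matches the maximized log-likelihood reported by lm
logLik(mod)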

3.1.2 Logistic regression

Model formulation

When the response Y can take only two values, codified for convenience as 1 (success) and 0 (failure), it
is called a binary variable. A binary variable, known also as a Bernoulli variable, is a B(1, p). Recall that
E[B(1, p)] = P[B(1, p) = 1] = p.
If Y is a binary variable and X1 , . . . , Xp are predictors associated to Y , the purpose in logistic regression is
to estimate

p(x1 , . . . , xp ) := P[Y = 1|X1 = x1 , . . . , Xp = xp ]


= E[Y |X1 = x1 , . . . , Xp = xp ], (3.7)

that is, how the probability of Y = 1 changes according to particular values, denoted by x1 , . . . , xp , of
the predictors X1 , . . . , Xp . A tempting possibility is to consider a linear model for (3.7), p(x1 , . . . , xp ) =
β0 +β1 x1 +. . .+βp xp . However, such a model will run into serious problems inevitably: negative probabilities
and probabilities larger than one will arise.
A solution is to consider a function to encapsulate the value of z = β0 + β1 x1 + . . . + βp xp , in R, and map
it back to [0, 1]. There are several alternatives to do so, based on distribution functions F : R −→ [0, 1] that
deliver y = F (z) ∈ [0, 1]. Different choices of F give rise to different models, the most common being the
logistic distribution function:

$$\mathrm{logistic}(z) := \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}}.$$

Its inverse, F −1 : [0, 1] −→ R, known as the logit function, is

$$\mathrm{logit}(p) := \mathrm{logistic}^{-1}(p) = \log\frac{p}{1-p}.$$

This is a link function, that is, a function that maps a given space (in this case [0, 1]) into R. The term link
function is employed in generalized linear models, which follow exactly the same philosophy of the logistic
regression – mapping the domain of Y to R in order to apply there a linear model. As said, different link
functions are possible, but we will concentrate here exclusively on the logit as a link function.
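In R, the logistic and logit functions are readily available as plogis and qlogis (the distribution and quantile functions of the standard logistic distribution), which makes quick checks straightforward:
# Logistic and logit functions in R
plogis(0)   # logistic(0) = 0.5
qlogis(0.5) # logit(0.5) = 0
curve(plogis(x), from = -5, to = 5, ylab = "logistic(x)")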
The logistic model is defined as the following parametric form for (3.7):

$$p(x_1,\ldots,x_p) = \mathrm{logistic}(\beta_0 + \beta_1x_1 + \ldots + \beta_px_p) = \frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + \ldots + \beta_px_p)}}. \tag{3.8}$$

The linear form inside the exponent has a clear interpretation:

• If β0 + β1x1 + ... + βpxp = 0, then p(x1, ..., xp) = 1/2 (Y = 1 and Y = 0 are equally likely).
• If β0 + β1x1 + ... + βpxp < 0, then p(x1, ..., xp) < 1/2 (Y = 1 is less likely).
• If β0 + β1x1 + ... + βpxp > 0, then p(x1, ..., xp) > 1/2 (Y = 1 is more likely).
To be more precise on the interpretation of the coefficients β0 , . . . , βp we need to introduce the concept of
odds. The odds is an equivalent way of expressing the distribution of probabilities in a binary
variable. Since P[Y = 1] = p and P[Y = 0] = 1 − p, both the success and failure probabilities can be
inferred from p. Instead of using p to characterize the distribution of Y , we can use

$$\mathrm{odds}(Y) = \frac{p}{1-p} = \frac{\mathrm{P}[Y=1]}{\mathrm{P}[Y=0]}. \tag{3.9}$$

The odds is the ratio between the probability of success and the probability of failure. It is extensively used
in betting due to its better interpretability. For example, if a horse Y has a probability p = 2/3 of winning
a race (Y = 1), then the odds of the horse is

$$\mathrm{odds} = \frac{p}{1-p} = \frac{2/3}{1/3} = 2.$$
This means that the horse has a probability of winning that is twice as large as the probability of losing. This
is sometimes written as 2 : 1 or 2 × 1 (spelled “two-to-one”). Conversely, if the odds of Y is given, we can
easily know the probability of success p, using the inverse of (3.9):

$$p = \mathrm{P}[Y=1] = \frac{\mathrm{odds}(Y)}{1 + \mathrm{odds}(Y)}.$$

For example, if the odds of the horse were 5, that would correspond to a probability of winning p = 5/6.
Remark. Recall that the odds is a number in [0, +∞]. The 0 and +∞ values are attained for p = 0 and
p = 1, respectively. The log-odds (or logit) is a number in [−∞, +∞].
We can rewrite (3.8) in terms of the odds (3.9). If we do so, we have:

$$\mathrm{odds}(Y|X_1=x_1,\ldots,X_p=x_p) = \frac{p(x_1,\ldots,x_p)}{1 - p(x_1,\ldots,x_p)} = e^{\beta_0 + \beta_1x_1 + \ldots + \beta_px_p} = e^{\beta_0}e^{\beta_1x_1}\cdots e^{\beta_px_p}.$$

This provides the following interpretation of the coefficients:



• eβ0 : is the odds of Y = 1 when X1 = . . . = Xp = 0.


• eβj , 1 ≤ j ≤ p: is the multiplicative increment of the odds for an increment of one unit in Xj = xj ,
provided that the remaining variables do not change. If the increment in Xj is of r units, then the
multiplicative increment in the odds is (eβj )r .

Model assumptions and estimation

Some probabilistic assumptions are required for performing inference on the model parameters β from the
sample (X1 , Y1 ), . . . , (Xn , Yn ). These assumptions are somewhat simpler than the ones for linear regression.

Figure 3.3: The key concepts of the logistic model. The blue bars represent the conditional distribution of
probability of Y for each cut in the X axis. The red points represent data following the model.

The assumptions of the logistic model are the following:


i. Linearity in the logit6: $\mathrm{logit}(p(\mathbf{x})) = \log\frac{p(\mathbf{x})}{1 - p(\mathbf{x})} = \beta_0 + \beta_1x_1 + \ldots + \beta_px_p$.
ii. Binariness: Y1 , . . . , Yn are binary variables.
iii. Independence: Y1 , . . . , Yn are independent.
A good one-line summary of the logistic model is the following (independence is assumed)
6 An equivalent way of stating this assumption is p(x) = logistic(β0 + β1 x1 + . . . + βp xp ).

$$Y|(X_1=x_1,\ldots,X_p=x_p) \sim \mathrm{Ber}\left(\mathrm{logistic}(\beta_0 + \beta_1x_1 + \ldots + \beta_px_p)\right) = \mathrm{Ber}\left(\frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + \ldots + \beta_px_p)}}\right).$$

Since Yi ∼ Ber(p(Xi )), i = 1, . . . , n, the log-likelihood of β is


$$\ell(\boldsymbol\beta) = \sum_{i=1}^n \log\left(p(\mathbf{X}_i)^{Y_i}(1 - p(\mathbf{X}_i))^{1-Y_i}\right) = \sum_{i=1}^n \left\{Y_i\log(p(\mathbf{X}_i)) + (1 - Y_i)\log(1 - p(\mathbf{X}_i))\right\}. \tag{3.10}$$

Unfortunately, due to the non-linearity of the optimization problem there are no explicit expressions for β̂.
These have to be obtained numerically by means of an iterative procedure, which may run into problems in
low sample situations with perfect classification. Unlike in the linear model, inference is not exact from the
assumptions, but approximate in terms of maximum likelihood theory. We do not explore this further and
refer the interested reader to these notes.
Figure 3.1.2 shows how the log-likelihood changes with respect to the values for (β0 , β1 ) in three data patterns.
The logistic regression fit and its dependence on β0 (horizontal displacement) and β1 (steepness of the curve).
Recall the effect of the sign of β1 in the curve: if positive, the logistic curve has an ‘s’ form; if negative, the
form is a reflected ‘s’. Application also available here.
The data of the illustration has been generated with the following code:
Let’s check that indeed the coefficients given by R’s glm are the ones that maximize the likelihood of the
animation of Figure 3.1.2. We do so for y1 ~ x.
# Create the data employed in Figure 3.4

# Data
set.seed(34567)
x <- rnorm(50, sd = 1.5)
y1 <- -0.5 + 3 * x
y2 <- 0.5 - 2 * x
y3 <- -2 + 5 * x
y1 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y1)))
y2 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y2)))
y3 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y3)))

# Data
dataMle <- data.frame(x = x, y1 = y1, y2 = y2, y3 = y3)

# Call glm
# glm employs formula = response ~ predictor1 + predictor2 + ...
# (names according to the data frame names) for denoting the regression
# to be done. We need to specify family = "binomial" to make a
# logistic regression
mod <- glm(y1 ~ x, family = "binomial", data = dataMle)
summary(mod)
##

## Call:
## glm(formula = y1 ~ x, family = "binomial", data = dataMle)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.47853 -0.40139 0.02097 0.38880 2.12362
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.1692 0.4725 -0.358 0.720274
## x 2.4282 0.6599 3.679 0.000234 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 69.315 on 49 degrees of freedom
## Residual deviance: 29.588 on 48 degrees of freedom
## AIC: 33.588
##
## Number of Fisher Scoring iterations: 6

# mod is a list with a lot of information


# str(mod) # Long output

# Coefficients
mod$coefficients
## (Intercept) x
## -0.1691947 2.4281626
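As a sanity check, the log-likelihood (3.10) can be evaluated directly at the estimated coefficients. The minimal sketch below assumes a single predictor (the helper name is illustrative) and matches the maximized log-likelihood reported by glm.
# Log-likelihood (3.10) for a single predictor
logLikLogistic <- function(beta, x, y) {
  p <- 1 / (1 + exp(-(beta[1] + beta[2] * x)))
  sum(y * log(p) + (1 - y) * log(1 - p))
}

# Evaluate at the glm estimate and compare with logLik
logLikLogistic(beta = mod$coefficients, x = x, y = y1)
logLik(mod)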

# Plot the fitted regression curve


xGrid <- seq(-5, 5, l = 200)
yGrid <- 1 / (1 + exp(-(mod$coefficients[1] + mod$coefficients[2] * xGrid)))
plot(xGrid, yGrid, type = "l", col = 2, xlab = "x", ylab = "y")
points(x, y1)

(Figure: fitted logistic curve for y1 ~ x together with the data points.)
Exercise 3.3. For the regressions y2 ~ x and y3 ~ x, do the following:
• Check that β̂ is indeed maximizing the likelihood, as compared with Figure 3.1.2.
• Plot the fitted logistic curve and compare it with the one in Figure 3.1.2.

3.2 Kernel regression estimation

3.2.1 Nadaraya-Watson estimator

Our objective is to estimate the regression function m nonparametrically. Due to its definition, we can
rewrite it as follows:

$$m(x) = \mathrm{E}[Y|X=x] = \int yf_{Y|X=x}(y)\,dy = \frac{\int yf(x,y)\,dy}{f_X(x)}.$$

This expression shows an interesting point: the regression function can be computed from the joint density f
and the marginal fX . Therefore, given a sample (X1 , Y1 ), . . . , (Xn , Yn ), an estimate of m follows by replacing
the previous densities by their kdes. To that aim, recall that in Exercise 2.15 we defined a multivariate
extension of the kde based on product kernels. For the two dimensional case, the kde with equal bandwidths
h = (h, h) is

$$\hat f(x, y; \mathbf{h}) = \frac{1}{n}\sum_{i=1}^n K_h(x - X_i)K_h(y - Y_i). \tag{3.11}$$

Using (3.11),


$$\begin{aligned}
m(x) &\approx \frac{\int y\hat f(x, y; \mathbf{h})\,dy}{\hat f_X(x; h)}\\
&= \frac{\int y\frac{1}{n}\sum_{i=1}^n K_h(x - X_i)K_h(y - Y_i)\,dy}{\frac{1}{n}\sum_{i=1}^n K_h(x - X_i)}\\
&= \frac{\frac{1}{n}\sum_{i=1}^n K_h(x - X_i)\int yK_h(y - Y_i)\,dy}{\frac{1}{n}\sum_{i=1}^n K_h(x - X_i)}\\
&= \frac{\frac{1}{n}\sum_{i=1}^n K_h(x - X_i)Y_i}{\frac{1}{n}\sum_{i=1}^n K_h(x - X_i)}.
\end{aligned}$$

This yields the so-called Nadaraya-Watson7 estimate:

$$\hat m(x; 0, h) := \sum_{i=1}^n \frac{K_h(x - X_i)}{\sum_{j=1}^n K_h(x - X_j)}Y_i = \sum_{i=1}^n W_i^0(x)Y_i, \tag{3.12}$$

where $W_i^0(x) := \frac{K_h(x - X_i)}{\sum_{j=1}^n K_h(x - X_j)}$. This estimate can be seen as a weighted average of Y1, ..., Yn by means of
the set of weights {Wi0(x)}ni=1 (check that they add up to one). The set of weights depends on the evaluation
point x. That means that the Nadaraya-Watson estimator is a local mean of Y1, ..., Yn around X = x
(see Figure 3.2.2).
Let’s implement the Nadaraya-Watson estimate to get a feeling of how it works in practice.
# Nadaraya-Watson
mNW <- function(x, X, Y, h, K = dnorm) {

# Arguments
# x: evaluation points
# X: vector (size n) with the predictors
# Y: vector (size n) with the response variable
# h: bandwidth
# K: kernel

# Matrix of size length(x) x n (one row per evaluation point)


Kx <- sapply(X, function(Xi) K((x - Xi) / h) / h)

# Weights
W <- Kx / rowSums(Kx) # Column recycling!

# Means at x ("drop" to drop the matrix attributes)


drop(W %*% Y)

}

# Generate some data to test the implementation


set.seed(12345)
n <- 100
7 Termed after the contemporaneous proposals by Nadaraya (1964) and Watson (1964).

eps <- rnorm(n, sd = 2)


m <- function(x) x^2 * cos(x)
X <- rnorm(n, sd = 2)
Y <- m(X) + eps
xGrid <- seq(-10, 10, l = 500)

# Bandwidth
h <- 0.5

# Plot data
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(xGrid, mNW(x = xGrid, X = X, Y = Y, h = h), col = 2)
legend("topright", legend = c("True regression", "Nadaraya-Watson"),
lwd = 2, col = 1:2)
(Figure: simulated data with the true regression function and the Nadaraya–Watson estimate.)
Exercise 3.4. Implement your own version of the Nadaraya-Watson estimator in R and compare it with mNW.
You may focus only on the normal kernel and reduce the accuracy of the final computation up to 1e-7 to
achieve better efficiency. Are you able to improve the speed of mNW? Use system.time or the microbenchmark
package to measure the running times for a sample size of n = 10000.

Winner-takes-all bonus: the significantly fastest version of the Nadaraya-Watson estimator (under the
above conditions) will get a bonus of 0.5 points.

The code below illustrates the effect of varying h for the Nadaraya-Watson estimator using manipulate.
# Simple plot of N-W for varying h's
manipulate({

# Plot data
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)

lines(xGrid, mNW(x = xGrid, X = X, Y = Y, h = h), col = 2)


legend("topright", legend = c("True regression", "Nadaraya-Watson"),
lwd = 2, col = 1:2)

}, h = slider(min = 0.01, max = 2, initial = 0.5, step = 0.01))

3.2.2 Local polynomial regression

Nadaraya-Watson can be seen as a particular case of a local polynomial fit, specifically, the one corresponding
to a local constant fit. The motivation for the local polynomial fit comes from attempting to minimize
the RSS

$$\sum_{i=1}^n (Y_i - m(X_i))^2. \tag{3.13}$$

This is not achievable directly, since no knowledge of m is available. However, by a p-th order Taylor
expansion it is possible to obtain that, for x close to Xi,

$$m(X_i) \approx m(x) + m'(x)(X_i - x) + \frac{m''(x)}{2}(X_i - x)^2 + \cdots + \frac{m^{(p)}(x)}{p!}(X_i - x)^p. \tag{3.14}$$

Replacing (3.14) in (3.13), we have that

$$\sum_{i=1}^n \left(Y_i - \sum_{j=0}^p \frac{m^{(j)}(x)}{j!}(X_i - x)^j\right)^2. \tag{3.15}$$

This expression is still not workable: it depends on m(j)(x), j = 0, ..., p, which of course are unknown!
The great idea is to set $\beta_j := \frac{m^{(j)}(x)}{j!}$ and turn (3.15) into a linear regression problem where the unknown
parameters are β = (β0, β1, ..., βp)′:

$$\sum_{i=1}^n \left(Y_i - \sum_{j=0}^p \beta_j(X_i - x)^j\right)^2. \tag{3.16}$$

While doing so, an estimate of β will automatically yield estimates for m(j)(x), j = 0, ..., p, and we know
how to obtain β̂ by minimizing (3.16). The final touch is to make the contributions of Xi dependent on the
distance to x by weighting with kernels:

$$\hat{\boldsymbol\beta} := \arg\min_{\boldsymbol\beta\in\mathbb{R}^{p+1}} \sum_{i=1}^n \left(Y_i - \sum_{j=0}^p \beta_j(X_i - x)^j\right)^2 K_h(x - X_i). \tag{3.17}$$
70 CHAPTER 3. REGRESSION ESTIMATION

Denoting

$$\mathbf{X} := \begin{pmatrix} 1 & X_1 - x & \cdots & (X_1 - x)^p \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_n - x & \cdots & (X_n - x)^p \end{pmatrix}_{n \times (p+1)}$$

and

$$\mathbf{W} := \mathrm{diag}(K_h(X_1 - x), \ldots, K_h(X_n - x)), \qquad \mathbf{Y} := \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}_{n \times 1},$$

we can re-express (3.17) as a weighted least squares problem whose exact solution is

$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^{p+1}} (\mathbf{Y} - \mathbf{X}\beta)'\mathbf{W}(\mathbf{Y} - \mathbf{X}\beta) \tag{3.18}$$
$$= (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{Y}. \tag{3.19}$$

Exercise 3.5. Using the equalities given in Exercise 3.1, prove (3.19).
The estimate for m(x) is then computed as

$$\hat m(x; p, h) := \hat\beta_0 = \mathbf{e}_1'(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{Y} = \sum_{i=1}^{n} W_i^{p}(x) Y_i,$$

where $W_i^{p}(x) := \mathbf{e}_1'(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{e}_i$ and $\mathbf{e}_i$ is the i-th canonical vector. Just as the Nadaraya-Watson estimator, the local polynomial estimator is a linear combination of the responses. Two cases deserve special attention:
• p = 0 is the local constant estimator or the Nadaraya-Watson estimator (Exercise 3.6). In this situation, the estimator has explicit weights, as we saw before:

$$W_i^{0}(x) = \frac{K_h(x - X_i)}{\sum_{j=1}^{n} K_h(x - X_j)}.$$

• p = 1 is the local linear estimator, which has weights equal to (Exercise 3.9):

$$W_i^{1}(x) = \frac{\hat s_2(x; h) - \hat s_1(x; h)(X_i - x)}{\hat s_2(x; h)\hat s_0(x; h) - \hat s_1(x; h)^2}\, K_h(x - X_i), \tag{3.20}$$

where $\hat s_r(x; h) := \frac{1}{n}\sum_{i=1}^{n} (X_i - x)^r K_h(x - X_i)$.
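For illustration, a direct (and deliberately naive) implementation of (3.19) is sketched below. The function locPolyHat and its interface are made up for this example only (it is not part of any package), and the snippet reuses the X, Y, and h generated above for the Nadaraya-Watson example.

# Local polynomial estimator computed directly from (3.19) (illustrative sketch)
locPolyHat <- function(x, X, Y, h, p = 1, K = dnorm) {
  sapply(x, function(x0) {
    Xmat <- sapply(0:p, function(j) (X - x0)^j)  # design matrix at x0
    W <- diag(K((X - x0) / h) / h)               # kernel weights K_h(x0 - X_i)
    beta <- solve(t(Xmat) %*% W %*% Xmat, t(Xmat) %*% W %*% Y)
    beta[1]                                      # betaHat_0 = mHat(x0; p, h)
  })
}

# Sanity check: p = 0 must recover the Nadaraya-Watson estimator
max(abs(locPolyHat(x = X, X = X, Y = Y, h = h, p = 0) -
          mNW(x = X, X = X, Y = Y, h = h)))  # essentially zero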
Figure 3.2.2 illustrates the construction of the local polynomial estimator (up to cubic degree) and shows
how β̂0 = m̂(x; p, h), the intercept of the local fit, estimates m at x.
Construction of the local polynomial estimator. The animation shows how local polynomial fits in a neighbor-
hood of x are combined to provide an estimate of the regression function, which depends on the polynomial
degree, bandwidth, and kernel (gray density at the bottom). The data points are shaded according to their
weights for the local fit at x. Application also available here.
KernSmooth’s locpoly implements the local polynomial estimator. Below are some examples of its usage.

# Generate some data to test the implementation
set.seed(123456)
n <- 100
eps <- rnorm(n, sd = 2)
m <- function(x) x^3 * sin(x)
X <- rnorm(n, sd = 1.5)
Y <- m(X) + eps
xGrid <- seq(-10, 10, l = 500)

# Bandwidth
h <- 0.25

# locpoly fits (KernSmooth is bundled with R, but needs to be loaded)
library(KernSmooth)
lp0 <- locpoly(x = X, y = Y, bandwidth = h, degree = 0, range.x = c(-10, 10),
               gridsize = 500)
lp1 <- locpoly(x = X, y = Y, bandwidth = h, degree = 1, range.x = c(-10, 10),
               gridsize = 500)

# Plot data
plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(lp0$x, lp0$y, col = 2)
lines(lp1$x, lp1$y, col = 3)
legend("bottom", legend = c("True regression", "Local constant",
                            "Local linear"),
       lwd = 2, col = 1:3)
# Simple plot of local polynomials for varying h's
manipulate({

  # Plot data
  lpp <- locpoly(x = X, y = Y, bandwidth = h, degree = p, range.x = c(-10, 10),
                 gridsize = 500)
  plot(X, Y)
  rug(X, side = 1); rug(Y, side = 2)
  lines(xGrid, m(xGrid), col = 1)
  lines(lpp$x, lpp$y, col = p + 2)
  legend("bottom", legend = c("True regression", "Local polynomial fit"),
         lwd = 2, col = c(1, p + 2))

}, h = slider(min = 0.01, max = 2, initial = 0.5, step = 0.01),
   p = slider(min = 0, max = 4, initial = 0, step = 1))

3.3 Asymptotic properties


The purpose of this section is to provide some key asymptotic results for the bias, variance, and asymptotic normality of the local linear and local constant estimators. These provide useful insights into the effect of p, m, f, and σ² on the performance of the estimators. Proofs and detailed analyses are skipped; we refer the interested reader to Ruppert and Wand (1994), Section 5.3 of Wand and Jones (1995), and Section 3.2 of Fan and Gijbels (1996).
Along this section we will make the following assumptions:
• A1. m is twice continuously differentiable.
• A2. σ 2 is continuous and positive.
• A3. f is continuously differentiable and positive.
• A4. The kernel K is a symmetric and bounded pdf with finite second moment and is square integrable.
• A5. h = hn is a deterministic sequence of bandwidths such that, when n → ∞, h → 0 and nh → ∞.
The bias and variance are expanded in their conditional versions on the predictor's sample X1, . . . , Xn. The reason for analyzing the conditional rather than the unconditional versions is to avoid the technical difficulties that integration with respect to the predictor's density may pose.
Theorem 3.2. Under A1–A5, the conditional bias and variance of the local constant (p = 0) and local linear (p = 1) estimators are

$$\mathrm{Bias}[\hat m(x; p, h) \mid X_1, \ldots, X_n] = B_p(x)\, h^2 + o_P(h^2), \tag{3.21}$$
$$\mathrm{Var}[\hat m(x; p, h) \mid X_1, \ldots, X_n] = \frac{R(K)}{nh f(x)}\,\sigma^2(x) + o_P((nh)^{-1}), \tag{3.22}$$

where

$$B_p(x) = \frac{\mu_2(K)}{2}\left\{ m''(x) + 2\,\frac{m'(x) f'(x)}{f(x)}\,1_{\{p=0\}} \right\}.$$

The bias and variance expressions (3.21) and (3.22) yield interesting insights:
• The bias decreases with h quadratically for p = 0, 1. The bias at x is directly proportional to m′′(x) if p = 1 and affected by m′′(x) if p = 0. This has the same interpretation as in the density setting:
– The bias is negative in concave regions, i.e. {x ∈ R : m′′(x) < 0}. These regions correspond to peaks and modes of m.
– Conversely, the bias is positive in convex regions, i.e. {x ∈ R : m′′(x) > 0}. These regions correspond to valleys of m.
– The wilder the curvature m′′, the harder it is to estimate m.
• The bias for p = 0 at x is also affected by m′(x), f′(x), and f(x). Precisely, the lower the density f(x), the larger the bias; and the faster m and f change at x, the larger the bias. Thus the bias of the local constant estimator is much more sensitive to m(x) and f(x) than that of the local linear estimator (which is only sensitive to m′′(x)). In particular, the fact that it depends on f′(x) and f(x) is referred to as the design bias, since it depends merely on the predictor's distribution.
• The variance depends directly on $\frac{\sigma^2(x)}{f(x)}$ for p = 0, 1. As a consequence, the lower the density and the larger the conditional variance, the more variable $\hat m(\cdot; p, h)$ is. The variance decreases at a factor of $(nh)^{-1}$ due to the effective sample size.
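The design bias can be visualized with a short simulation. The sketch below uses locpoly from KernSmooth (already loaded above) and new object names, chosen so as not to overwrite the objects of the previous section; the seed and the design are arbitrary choices. Since m is linear, the local linear fit is essentially unbiased, whereas the local constant fit bends towards the center of the design, where the data concentrate.

# Illustration of the design bias of the local constant estimator (a sketch)
set.seed(987654)
nD <- 300
Xd <- rnorm(nD)                 # density decays in the tails
Yd <- Xd + rnorm(nD, sd = 0.5)  # m(x) = x
hD <- 0.75
lcFit <- locpoly(x = Xd, y = Yd, bandwidth = hD, degree = 0,
                 range.x = c(-3, 3), gridsize = 500)
llFit <- locpoly(x = Xd, y = Yd, bandwidth = hD, degree = 1,
                 range.x = c(-3, 3), gridsize = 500)
plot(Xd, Yd, col = "gray")
abline(a = 0, b = 1, lwd = 2)   # true regression
lines(lcFit$x, lcFit$y, col = 2)
lines(llFit$x, llFit$y, col = 3)
legend("topleft", legend = c("True regression", "Local constant", "Local linear"),
       lwd = 2, col = 1:3)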
An extended version of Theorem 3.2, given in Theorem 3.1 of Fan and Gijbels (1996), shows that odd order polynomial fits are preferable to even order polynomial fits. The reason is that odd orders introduce an extra coefficient in the polynomial fit that allows the bias to be reduced while keeping the variance unchanged. In summary, the conclusions of the above analysis of p = 0 vs. p = 1, namely that p = 1 has smaller bias than p = 0 (but of the same order) and the same variance as p = 0, extend to the case p = 2ν vs. p = 2ν + 1, ν ∈ N. This allows one to claim that local polynomial fitting is an odd world (Fan and Gijbels (1996)).
Finally, we have the asymptotic pointwise normality of the estimator.
Theorem 3.3. Assume that E[(Y − m(x))^{2+δ} | X = x] < ∞ for some δ > 0. Then, under A1–A5,

$$\sqrt{nh}\,\big(\hat m(x; p, h) - \mathbb{E}[\hat m(x; p, h)]\big) \stackrel{d}{\longrightarrow} \mathcal{N}\left(0, \frac{R(K)\sigma^2(x)}{f(x)}\right), \tag{3.23}$$
$$\sqrt{nh}\,\big(\hat m(x; p, h) - m(x) - B_p(x) h^2\big) \stackrel{d}{\longrightarrow} \mathcal{N}\left(0, \frac{R(K)\sigma^2(x)}{f(x)}\right). \tag{3.24}$$
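Theorem 3.3 can be visualized with a small Monte Carlo experiment. The sketch below checks (3.23) for the Nadaraya-Watson estimator under the design of the previous example (X ~ N(0, 1.5²), m(x) = x³ sin(x), σ = 2); x0, M, nMC, and hMC are arbitrary choices, and the histogram should only roughly match the overlaid limit density.

# A quick Monte Carlo look at (3.23) for the Nadaraya-Watson estimator (sketch)
x0 <- 1; M <- 500; nMC <- 500; hMC <- 0.5
mHatMC <- replicate(M, {
  Xs <- rnorm(nMC, sd = 1.5)
  Ys <- Xs^3 * sin(Xs) + rnorm(nMC, sd = 2)
  Kx <- dnorm((x0 - Xs) / hMC)
  sum(Kx * Ys) / sum(Kx)        # Nadaraya-Watson at x0
})
hist(sqrt(nMC * hMC) * (mHatMC - mean(mHatMC)), probability = TRUE, breaks = 20,
     main = "sqrt(nh) (mHat - mean(mHat))", xlab = "")
RK <- 1 / (2 * sqrt(pi))                 # R(K) for the normal kernel
vLim <- RK * 2^2 / dnorm(x0, sd = 1.5)   # R(K) sigma^2(x0) / f(x0)
curve(dnorm(x, sd = sqrt(vLim)), add = TRUE, col = 2)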

3.4 Bandwidth selection

Bandwidth selection is, as in kernel density estimation, of key practical importance. Several bandwidth selectors, in the spirit of the plug-in and cross-validation ideas discussed in Section 2.4, have been proposed. There are, for example, a rule-of-thumb analogue (see Section 4.2 in Fan and Gijbels (1996)) and a direct plug-in analogue (see Section 5.8 in Wand and Jones (1995)). For simplicity, we will focus only on the cross-validation bandwidth selector.

Following an analogy with the fit of the linear model, we could look for the bandwidth h that minimizes an RSS of the form

$$\frac{1}{n}\sum_{i=1}^{n} (Y_i - \hat m(X_i; p, h))^2. \tag{3.25}$$

However, this is a bad idea. Attempting to minimize (3.25) always leads to h ≈ 0, which results in a useless interpolation of the data. Let's see an example.
# Grid for representing (3.25)
hGrid <- seq(0.1, 1, l = 200)^2
error <- sapply(hGrid, function(h)
  mean((Y - mNW(x = X, X = X, Y = Y, h = h))^2))

# Error curve
plot(hGrid, error, type = "l")
rug(hGrid)
abline(v = hGrid[which.min(error)], col = 2)
The root of the problem is the comparison of Yi with m̂(Xi; p, h), since there is nothing that forbids h → 0 and, as a consequence, m̂(Xi; p, h) → Yi. We can change this behavior if we compare Yi with m̂−i(Xi; p, h), the leave-one-out estimate computed without the i-th datum (Xi, Yi):

$$\mathrm{CV}(h) := \frac{1}{n}\sum_{i=1}^{n} \big(Y_i - \hat m_{-i}(X_i; p, h)\big)^2, \qquad h_{\mathrm{CV}} := \arg\min_{h > 0} \mathrm{CV}(h).$$

The optimization of the above criterion might seem to be computationally expensive, since it is required to compute n regressions for a single evaluation of the objective function. Fortunately, this is not the case:

Proposition 3.1. The weights of the leave-one-out estimator $\hat m_{-i}(x; p, h) = \sum_{j \neq i} W^{p}_{-i,j}(x) Y_j$ can be obtained from the weights of $\hat m(x; p, h) = \sum_{i=1}^{n} W^{p}_i(x) Y_i$:

$$W^{p}_{-i,j}(x) = \frac{W^{p}_j(x)}{\sum_{k \neq i} W^{p}_k(x)} = \frac{W^{p}_j(x)}{1 - W^{p}_i(x)}.$$

This implies that

$$\mathrm{CV}(h) = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{Y_i - \hat m(X_i; p, h)}{1 - W^{p}_i(X_i)}\right)^2.$$
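Before using this formula, a quick numerical check of Proposition 3.1 for the Nadaraya-Watson estimator is sketched below; it reuses the X and Y generated in the last example, and hTest is an arbitrary bandwidth.

# Numerical check of Proposition 3.1 for the Nadaraya-Watson estimator (sketch)
hTest <- 0.5
cvShort <- mean(((Y - mNW(x = X, X = X, Y = Y, h = hTest)) /
                   (1 - dnorm(0) / colSums(dnorm(outer(X, X, "-") / hTest))))^2)
cvLoo <- mean(sapply(seq_along(X), function(i) {
  Kx <- dnorm((X[i] - X[-i]) / hTest)  # leave-one-out NW fit at X[i]
  (Y[i] - sum(Kx * Y[-i]) / sum(Kx))^2
}))
cvShort - cvLoo  # essentially zero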

Let’s implement this simple bandwidth selector in R.


# Generate some data to test the implementation
set.seed(12345)
n <- 100
eps <- rnorm(n, sd = 2)
m <- function(x) x^2 + sin(x)
X <- rnorm(n, sd = 1.5)
Y <- m(X) + eps
xGrid <- seq(-10, 10, l = 500)

# Objective function
cvNW <- function(X, Y, h, K = dnorm) {
  sum(((Y - mNW(x = X, X = X, Y = Y, h = h, K = K)) /
         (1 - K(0) / colSums(K(outer(X, X, "-") / h))))^2)
}

# Find optimum CV bandwidth, with sensible grid
bw.cv.grid <- function(X, Y,
                       h.grid = diff(range(X)) * (seq(0.1, 1, l = 200))^2,
                       K = dnorm, plot.cv = FALSE) {
  obj <- sapply(h.grid, function(h) cvNW(X = X, Y = Y, h = h, K = K))
  h <- h.grid[which.min(obj)]
  if (plot.cv) {
    plot(h.grid, obj, type = "o")
    rug(h.grid)
    abline(v = h, col = 2, lwd = 2)
  }
  h
}

# Bandwidth
h <- bw.cv.grid(X = X, Y = Y, plot.cv = TRUE)
h
## [1] 0.3117806

# Plot result

plot(X, Y)
rug(X, side = 1); rug(Y, side = 2)
lines(xGrid, m(xGrid), col = 1)
lines(xGrid, mNW(x = xGrid, X = X, Y = Y, h = h), col = 2)
legend("topright", legend = c("True regression", "Nadaraya-Watson"),
lwd = 2, col = 1:2)

3.5 Local likelihood


We explore in this section an extension of the local polynomial estimator. This extension aims to estimate the regression function by relying on the likelihood rather than on least squares. Thus, the idea behind the local likelihood is to fit, locally, parametric models by maximum likelihood.

We begin by seeing that local likelihood using the linear model is equivalent to local polynomial modelling. Theorem 3.1 showed that, under the assumptions given in Section 3.1.1, the maximum likelihood estimate of β in the linear model

Y |(X1 , . . . , Xp ) ∼ N (β0 + β1 X1 + . . . + βp Xp , σ 2 ) (3.26)

was equivalent to the least squares estimate, $\hat\beta_{\mathrm{ML}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$. The reason was the form of the conditional (on X1, . . . , Xp) log-likelihood:

$$\ell(\beta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_p X_{ip})^2.$$

If there is a single predictor X, polynomial fitting of order p of the conditional mean can be achieved by the well-known trick of identifying the j-th predictor Xj in (3.26) with X^j. This results in

$$Y \mid X \sim \mathcal{N}(\beta_0 + \beta_1 X + \cdots + \beta_p X^p, \sigma^2). \tag{3.27}$$

Therefore, maximizing with respect to β the weighted log-likelihood of the linear model (3.27) around x,

$$\ell_{x,h}(\beta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} \big(Y_i - \beta_0 - \beta_1 (X_i - x) - \cdots - \beta_p (X_i - x)^p\big)^2 K_h(x - X_i),$$

provides β̂0 = m̂(x; p, h), the local polynomial estimator, as it was obtained in (3.17).
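As a quick check of this equivalence, the sketch below compares the intercept of a kernel-weighted least squares fit with a direct numerical minimization of the weighted RSS (equivalently, maximization of the weighted log-likelihood, since σ² does not affect the maximizer); X, Y, and h are reused from the previous section and x0 is an arbitrary evaluation point. The two outputs should coincide up to numerical error.

# Weighted least squares vs. direct optimization of the weighted log-likelihood
# (a sketch for the first degree fit at an arbitrary x0)
x0 <- 0.5
Kx0 <- dnorm((X - x0) / h) / h
lm(Y ~ I(X - x0), weights = Kx0)$coefficients[1]
nlm(f = function(beta) {
  sum(Kx0 * (Y - beta[1] - beta[2] * (X - x0))^2)
}, p = c(0, 0))$estimate[1]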
The same idea can be applied for other parametric models. The family of generalized linear models8 , which
presents an extension of the linear model to different kinds of response variables, provides a particularly
convenient parametric framework. Generalized linear models are constructed by mapping the support of
Y to R by a link function g, and then modeling the transformed expectation by a linear model. Thus, a
generalized linear model for the predictors X1 , . . . , Xp assumes

g (E[Y |X1 = x1 , . . . , Xp = xp ]) = β0 + β1 x1 + . . . + βp xp

or, equivalently,

E[Y |X1 = x1 , . . . , Xp = xp ] = g −1 (β0 + β1 x1 + . . . + βp xp )

together with a distribution assumption for Y |(X1 , . . . , Xp ). The following table lists some useful transfor-
mations and distributions that are adequate to model responses in different supports. Recall that the linear
and logistic models of Sections 3.1.1 and 3.1.2 are obtained from the first and second rows, respectively.

| Support of Y | Distribution | Link g(µ) | g⁻¹(η) | Y given (X1 = x1, . . . , Xp = xp) |
|---|---|---|---|---|
| R | N(µ, σ²) | µ | η | N(β0 + β1x1 + . . . + βpxp, σ²) |
| (0, ∞) | Exp(λ) | µ⁻¹ | η⁻¹ | Exp(β0 + β1x1 + . . . + βpxp) |
| {0, 1} | B(1, p) | logit(µ) | logistic(η) | B(1, logistic(β0 + β1x1 + . . . + βpxp)) |
| {0, 1, 2, . . .} | Pois(λ) | log(µ) | e^η | Pois(e^(β0 + β1x1 + . . . + βpxp)) |
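The rows of this table correspond to the usual glm families in R. As a quick global (that is, non-local) sanity check, the sketch below simulates from the Poisson and Bernoulli rows and recovers the coefficients with glm; all object names are arbitrary and the estimates are only expected to be roughly close to the true values.

# Global glm fits for two rows of the table (a sketch with simulated data)
set.seed(123456)
z <- rnorm(100)
yPois <- rpois(n = 100, lambda = exp(0.5 + z))
glm(yPois ~ z, family = poisson())$coefficients   # roughly (0.5, 1)
yBer <- rbinom(n = 100, size = 1, prob = plogis(0.5 + z))  # plogis = logistic
glm(yBer ~ z, family = binomial())$coefficients   # roughly (0.5, 1)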

All the distributions of the table above are members of the exponential family of distributions, which is the family of distributions with pdf expressible as

$$f(y; \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\},$$

where a(·), b(·), and c(·, ·) are specific functions, φ is the scale parameter, and θ is the canonical parameter. If the canonical link function g is employed (the links in the table above are all canonical), then θ = g(µ) and, as a consequence,

$$\theta(x_1, \ldots, x_p) := g(\mathbb{E}[Y \mid X_1 = x_1, \ldots, X_p = x_p]).$$
Recall that, again, if there is only one predictor, identifying the j-th predictor Xj with X^j in the above expressions allows one to consider p-th order polynomial fits for g(E[Y | X = x]).
8 The logistic model is a generalized linear model, as seen in Section 3.1.2.

Construction of the local likelihood estimator. The animation shows how local likelihood fits in a neighbor-
hood of x are combined to provide an estimate of the regression function for binary response, which depends
on the polynomial degree, bandwidth, and kernel (gray density at the bottom). The data points are shaded
according to their weights for the local fit at x. Application also available here.
We illustrate the local likelihood principle for the logistic regression, the simplest non-linear model. In this case, the sample is (X1, Y1), . . . , (Xn, Yn) with

$$Y_i \mid X_i \sim \mathrm{Ber}(p(X_i)), \quad i = 1, \ldots, n,$$

for p(x) = P[Y = 1|X = x] = E[Y |X = x]. The link function is $g(x) = \mathrm{logit}(x) = \log\frac{x}{1-x}$ and θ(x) = logit(p(x)). Assuming that θ(x) = β0 + β1 x + · · · + βp x^p (if p = 1, we have the usual logistic model), we have that the log-likelihood of β is


$$\ell(\beta) = \sum_{i=1}^{n}\left\{Y_i \log(\mathrm{logistic}(\theta(X_i))) + (1 - Y_i)\log(1 - \mathrm{logistic}(\theta(X_i)))\right\} = \sum_{i=1}^{n} \ell(Y_i, \theta(X_i)),$$

where we consider the log-likelihood addend

$$\ell(y, \theta) = y\theta - \log(1 + e^{\theta})$$

and make explicit the dependence on θ(x), for clarity in the next developments, and implicit the dependence on β. The local log-likelihood of β around x is then

$$\ell_{x,h}(\beta) = \sum_{i=1}^{n} \ell(Y_i, \theta(X_i - x))\, K_h(x - X_i). \tag{3.28}$$

Maximizing the local log-likelihood (3.28) with respect to β provides

$$\hat\beta = \arg\max_{\beta \in \mathbb{R}^{p+1}} \ell_{x,h}(\beta),$$

for which there is no analytical solution, so numerical optimization is needed. The local likelihood estimate of θ(x) is

$$\hat\theta(x) := \hat\beta_0.$$

Note that the dependence of β̂0 on x and h is omitted. From θ̂(x), we can obtain the local logistic regression evaluated at x as

$$\hat m_{\ell}(x; h, p) := g^{-1}\big(\hat\theta(x)\big) = \mathrm{logistic}(\hat\beta_0). \tag{3.29}$$

The code below shows three different ways of implementing the local logistic regression (of first degree) in R.
# Simulate some data
n <- 200
logistic <- function(x) 1 / (1 + exp(-x))
p <- function(x) logistic(1 - 3 * sin(x))
set.seed(123456)
X <- runif(n = n, -3, 3)
Y <- rbinom(n = n, size = 1, prob = p(X))

# Set bandwidth and evaluation grid


h <- 0.25
x <- seq(-3, 3, l = 501)

# Optimize the weighted log-likelihood through glm's built in procedure


suppressWarnings(
fitGlm <- sapply(x, function(x) {
K <- dnorm(x = x, mean = X, sd = h)
glm.fit(x = cbind(1, X - x), y = Y, weights = K,
family = binomial())$coefficients[1]
})
)

# Optimize the weighted log-likelihood explicitly


suppressWarnings(
fitNlm <- sapply(x, function(x) {
K <- dnorm(x = x, mean = X, sd = h)
nlm(f = function(beta) {
-sum(K * (Y * (beta[1] + beta[2] * (X - x)) -
log(1 + exp(beta[1] + beta[2] * (X - x)))))
}, p = c(0, 0))$estimate[1]
})
)

# Using locfit
# Bandwidth cannot be controlled explicitly - only through nn in ?lp
library(locfit)
fitLocfit <- locfit(Y ~ lp(X, deg = 1, nn = 0.25), family = "binomial",
kern = "gauss")

# Compare fits
plot(x, p(x), ylim = c(0, 1.5), type = "l", lwd = 2)
lines(x, logistic(fitGlm), col = 2)
lines(x, logistic(fitNlm), col = 2, lty = 2)
plot(fitLocfit, add = TRUE, col = 4)
legend("topright", legend = c("p(x)", "glm", "nlm", "locfit"), lwd = 2,
       col = c(1, 2, 2, 4), lty = c(1, 1, 2, 1))

Bandwidth selection can be done by means of likelihood cross-validation. The objective is to maximize the local likelihood fit at (Xi, Yi), but removing the influence of the datum itself. That is, we maximize

$$\mathrm{LCV}(h) = \sum_{i=1}^{n} \ell(Y_i, \hat\theta_{-i}(X_i)), \tag{3.30}$$

where θ̂−i(Xi) represents the local fit at Xi without the i-th datum (Xi, Yi). Unfortunately, the nonlinearity of (3.29) precludes a simplifying result such as Proposition 3.1. Thus, in principle, n local likelihood fits with samples of size n − 1 are required to obtain a single evaluation of (3.30).
There is, however, an approximation (see Sections 4.3.3 and 4.4.3 of Loader (1999)) to (3.30) that only requires a local likelihood fit for a single sample. We sketch its basis as follows, without aiming to give full detail. The approximation considers the first and second derivatives of ℓ(y, θ) with respect to θ, ℓ̇(y, θ) and ℓ̈(y, θ). In the case of the logistic model, these are

$$\dot\ell(y, \theta) = y - \mathrm{logistic}(\theta), \qquad \ddot\ell(y, \theta) = -\mathrm{logistic}(\theta)(1 - \mathrm{logistic}(\theta)).$$

It can be seen that (Exercise 4.6 in Loader (1999))

$$\hat\theta_{-i}(X_i) \approx \hat\theta(X_i) - \mathrm{infl}(X_i)\,\dot\ell(Y_i, \hat\theta(X_i)), \tag{3.31}$$

where θ̂(Xi) represents the local fit at Xi and the influence function is defined as (page 75 of Loader (1999))

$$\mathrm{infl}(x) := \mathbf{e}_1'(\mathbf{X}_x'\mathbf{W}_x\mathbf{V}\mathbf{X}_x)^{-1}\mathbf{e}_1\, K(0)$$

for the matrices

$$\mathbf{X}_x := \begin{pmatrix} 1 & X_1 - x & \cdots & (X_1 - x)^p/p! \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_n - x & \cdots & (X_n - x)^p/p! \end{pmatrix}_{n \times (p+1)}$$

and

$$\mathbf{W}_x := \mathrm{diag}(K_h(X_1 - x), \ldots, K_h(X_n - x)), \qquad \mathbf{V} := -\mathrm{diag}\big(\ddot\ell(Y_1, \theta(X_1)), \ldots, \ddot\ell(Y_n, \theta(X_n))\big).$$

A one-term Taylor expansion of ℓ(Yi, θ̂−i(Xi)) using (3.31) gives

$$\ell(Y_i, \hat\theta_{-i}(X_i)) \approx \ell(Y_i, \hat\theta(X_i)) - \mathrm{infl}(X_i)\,\big(\dot\ell(Y_i, \theta(X_i))\big)^2.$$

Therefore:

$$\mathrm{LCV}(h) = \sum_{i=1}^{n} \ell(Y_i, \hat\theta_{-i}(X_i)) \approx \sum_{i=1}^{n}\left\{\ell(Y_i, \hat\theta(X_i)) - \mathrm{infl}(X_i)\,\big(\dot\ell(Y_i, \hat\theta(X_i))\big)^2\right\}.$$

Recall that the θ(Xi) are unknown and hence they must be estimated, for example by the local fits θ̂(Xi).

We conclude by illustrating how to compute the LCV function and optimize it (keep in mind that much
more efficient implementations are possible!).
# Exact LCV - recall that we *maximize* the LCV!
h <- seq(0.1, 2, by = 0.1)
suppressWarnings(
LCV <- sapply(h, function(h) {
sum(sapply(1:n, function(i) {
K <- dnorm(x = X[i], mean = X[-i], sd = h)
nlm(f = function(beta) {
-sum(K * (Y[-i] * (beta[1] + beta[2] * (X[-i] - X[i])) -
log(1 + exp(beta[1] + beta[2] * (X[-i] - X[i])))))
}, p = c(0, 0))$minimum
}))
})
)
plot(h, LCV, type = "o")
abline(v = h[which.max(LCV)], col = 2)


3.6 Exercises

This is the list of evaluable exercises for Chapter 3. The number of stars represents an estimate of their
difficulty: easy (⋆), medium (⋆⋆), and hard (⋆ ⋆ ⋆).
Exercise 3.6. (theoretical, ⋆) Show that the local polynomial estimator yields the Nadaraya-Watson estimator when p = 0. Use (3.19) to obtain (3.12).
Exercise 3.7. (theoretical, ⋆⋆) Obtain the optimization problem for the local Poisson regression (for the
first degree) and the local binomial regression (of first degree also).
Exercise 3.8. (theoretical, ⋆⋆) Show that the Nadaraya-Watson is unbiased (in conditional expectation
with respect to X1 , . . . , Xn ) when the regression function is constant: m(x) = c, c ∈ R. Show the same for
the local linear estimator for a linear regression function m(x) = a + bx, a, b ∈ R. Hint: use (3.20).
Exercise 3.9. (theoretical, ⋆ ⋆ ⋆) Obtain the weight expressions (3.20) of the local linear estimator. Hint:
use the matrix inversion formula for 2 × 2 matrices.
Exercise 3.10. (theoretical, ⋆ ⋆ ⋆) Prove the two implications of Proposition 3.1 for the Nadaraya-Watson
estimator (p = 0).
Exercise 3.11. (practical, ⋆⋆, Example 4.6 in Wasserman (2006)) The dataset at http://www.stat.cmu.edu/~larry/all-of-nonpar/=data/bpd.dat (alternative link) contains information about the presence of bronchopulmonary dysplasia (binary response) and the birth weight in grams (predictor) of 223 babies. Use the function locfit of the locfit library with the argument family = "binomial" and plot its output. Explore and comment on the resulting estimates, providing insights about the data.
Exercise 3.12. (practical, ⋆⋆) The ChickWeight dataset in R contains 578 observations of the weight and Time of chicks. Fit a local binomial or local Poisson regression of weight on Time. Use the function locfit of the locfit library with the argument family = "binomial" or family = "poisson" and explore the bandwidth effect. Explore and comment on the resulting estimates. What is the estimated expected time of a chick that weighs 200 grams?
Exercise 3.13. (practical, ⋆ ⋆ ⋆) Implement your own version of the local linear estimator. The function
must take a sample X, a sample Y, the points x at which the estimate should be obtained, the bandwidth h
and the kernel K. Test its correct behavior by estimating an artificial dataset that follows a linear model.
Exercise 3.14. (practical, ⋆ ⋆ ⋆) Implement your own version of the local likelihood estimator of first degree
for exponential response. The function must take a sample X, a sample Y, the points x at which the estimate
should be obtained, the bandwidth h, and the kernel K. Test its correct behavior by estimating an artificial dataset that follows a generalized linear model with exponential response, that is,

$$Y \mid X = x \sim \mathrm{Exp}(\lambda(x)), \qquad \lambda(x) = e^{\beta_0 + \beta_1 x},$$

using a cross-validated bandwidth. Hint: use optim or nlm for optimizing a function in R.
Appendix A

Installation of R and RStudio

This is what you have to do in order to install R and RStudio on your own computer:
1. In Mac OS X, download and install first XQuartz and log out and back on your Mac OS X account
(this is an important step that is required for 3D graphics to work). Be sure that your Mac OS X
system is up-to-date.
2. Download the latest version of R at https://cran.r-project.org/. For Windows, you can download it
directly here. For Mac OS X you can download the latest version (at the time of writing this, 3.4.1)
here.
3. Install R. In Windows, be sure to select the 'Startup options' and then choose 'SDI' in the 'Display
Mode' options. Leave the rest of installation options as default.
4. Download the latest version of RStudio for your system at https://www.rstudio.com/products/rstudio/
download/#download and install it.
If you are a Linux user, kindly follow the corresponding instructions here for installing R, download RStudio (only certain Ubuntu and Fedora versions are supported), and install it using your package manager.

Appendix B

Introduction to RStudio

RStudio is the most employed Integrated Development Environment (IDE) for R nowadays. When you start
RStudio you will see a window similar to Figure B.1. There are a lot of items in the GUI, most of them
described in the RStudio IDE Cheat Sheet. The most important things to keep in mind are:
1. The code is written in scripts in the source panel (upper-right panel in Figure B.1);
2. to run a line or a code selection from the script in the console (first tab in the lower-right panel in Figure B.1), use the keyboard shortcut 'Ctrl+Enter' (Windows and Linux) or 'Cmd+Enter' (Mac OS X).

Figure B.1: Main window of RStudio. The red shows the code panel and the yellow shows the console output. Extracted from https://www.rstudio.com/wp-content/uploads/2016/01/rstudio-IDE-cheatsheet.pdf.

Appendix C

Introduction to R

This section provides a collection of self-explanatory snippets of the programming language R (R Core Team, 2017) that show the very basics of the language. It is not meant to be an exhaustive introduction to R, but rather a reminder/panoramic view of a collection of basic functions and methods.
In the following, # denotes comments to the code and ## outputs of the code.

Simple computations

# The console can act as a simple calculator


1.0 + 1.1
## [1] 2.1
2 * 2
## [1] 4
3/2
## [1] 1.5
2^3
## [1] 8
1/0
## [1] Inf
0/0
## [1] NaN

# Use ";" for performing several operations in the same line


(1 + 3) * 2 - 1; 3 + 2
## [1] 7
## [1] 5

# Elemental mathematical functions


sqrt(2); 2^0.5
## [1] 1.414214
## [1] 1.414214
exp(1)
## [1] 2.718282
log(10) # Neperian logarithm
## [1] 2.302585
log10(10); log2(10) # Logs in base 10 and 2
## [1] 1


## [1] 3.321928
sin(pi); cos(0); asin(0)
## [1] 1.224647e-16
## [1] 1
## [1] 0
tan(pi/3)
## [1] 1.732051
sqrt(-1)
## Warning in sqrt(-1): NaNs produced
## [1] NaN
# Remember to close the parenthesis
1 +
(1 + 3
## Error: <text>:4:0: unexpected end of input
## 2: 1 +
## 3: (1 + 3
## ^
Exercise C.1. Compute:

• $\frac{e^2 + \sin(2)}{\cos^{-1}(\frac{1}{2}) + 2}$. Answer: 2.723274.
• $\sqrt{3^{2.5} + \log(10)}$. Answer: 4.22978.
• $\left(2^{0.93} - \log_2\left(3 + \sqrt{2 + \sin(1)}\right)\right) 10^{\tan(1/3)} \sqrt{3^{2.5} + \log(10)}$. Answer: -3.032108.

Variables and assignment

# Any operation that you perform in R can be stored in a variable (or object)
# with the assignment operator "<-"
x <- 1

# To see the value of a variable, simply type it


x
## [1] 1

# A variable can be overwritten


x <- 1 + 1

# Now the value of x is 2 and not 1, as before


x
## [1] 2

# Careful with capitalization


X
## Error in eval(expr, envir, enclos): object 'X' not found

# Different
X <- 3
x; X
## [1] 2
## [1] 3

# The variables are stored in your workspace (a .RData file)


# A handy tip to see what variables are in the workspace
ls()
## [1] "x" "X"
# Now you know which variables can be accessed!

# Remove variables
rm(X)
X
## Error in eval(expr, envir, enclos): object 'X' not found
Exercise C.2. Do the following:
• Store −123 in the variable y.
• Store the log of the square of y in z.
• Store $\frac{y - z}{y + z^2}$ in y and remove z.
• Output the value of y. Answer: 4.366734.

Vectors

# These are vectors - arrays of numbers


# We combine numbers with the function "c"
c(1, 3)
## [1] 1 3
c(1.5, 0, 5, -3.4)
## [1] 1.5 0.0 5.0 -3.4

# A handy way of creating integer sequences is the operator ":"


# Sequence from 1 to 5
1:5
## [1] 1 2 3 4 5

# Storing some vectors


myData <- c(1, 2)
myData2 <- c(-4.12, 0, 1.1, 1, 3, 4)
myData
## [1] 1 2
myData2
## [1] -4.12 0.00 1.10 1.00 3.00 4.00

# Entrywise operations
myData + 1
## [1] 2 3
myData^2
## [1] 1 4

# If you want to access a position of a vector, use [position]


myData[1]
## [1] 1
myData2[6]
## [1] 4

# You also can change elements



myData[1] <- 0
myData
## [1] 0 2

# Think on what you want to access...


myData2[7]
## [1] NA
myData2[0]
## numeric(0)

# If you want to access all the elements except a position, use [-position]
myData2[-1]
## [1] 0.0 1.1 1.0 3.0 4.0
myData2[-2]
## [1] -4.12 1.10 1.00 3.00 4.00

# Also with vectors as indexes


myData2[1:2]
## [1] -4.12 0.00
myData2[myData]
## [1] 0

# And also
myData2[-c(1, 2)]
## [1] 1.1 1.0 3.0 4.0

# But do not mix positive and negative indexes!


myData2[c(-1, 2)]
## Error in myData2[c(-1, 2)]: only 0's may be mixed with negative subscripts

# Remove the first element


myData2 <- myData2[-1]
Exercise C.3. Do the following:
• Create the vector x = (1, 7, 3, 4).
• Create the vector y = (100, 99, 98, . . . , 2, 1).
• Create the vector z = c(4, 8, 16, 32, 96).
• Compute $x_2 + y_4$ and $\cos(x_3) + \sin(x_2)e^{-y_2}$. Answers: 104 and -0.9899925.
• Set $x_2 = 0$ and $y_2 = -1$. Recompute the previous expressions. Answers: 97 and 2.785875.
• Index y by x + 1 and store it as z. What is the output? Answer: z is c(-1, 100, 97, 96).

Some functions

# Functions take arguments between parenthesis and transform them into an output
sum(myData)
## [1] 2
prod(myData)
## [1] 0

# Summary of an object
summary(myData)
## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 0.0 0.5 1.0 1.0 1.5 2.0

# Length of the vector


length(myData)
## [1] 2

# Mean, standard deviation, variance, covariance, correlation


mean(myData)
## [1] 1
var(myData)
## [1] 2
cov(myData, myData^2)
## [1] 4
cor(myData, myData * 2)
## [1] 1
quantile(myData)
## 0% 25% 50% 75% 100%
## 0.0 0.5 1.0 1.5 2.0

# Maximum and minimum of vectors


min(myData)
## [1] 0
which.min(myData)
## [1] 1

# Usually the functions have several arguments, which are set by "argument = value"
# In this case, the second argument is a logical flag to indicate the kind of sorting
sort(myData) # If nothing is specified, decreasing = FALSE is assumed
## [1] 0 2
sort(myData, decreasing = TRUE)
## [1] 2 0

# Do not know what are the arguments of a function? Use args and help!
args(mean)
## function (x, ...)
## NULL
?mean
Exercise C.4. Do the following:
• Compute the mean, median and variance of y. Answers: 49.5, 49.5, 843.6869.
• Do the same for y + 1. What are the differences?
• What is the maximum of y? Where is it placed?
• Sort y increasingly and obtain the 5th and 76th positions. Answer: c(4,75).
• Compute the covariance between y and y. Compute the variance of y. Why do you get the same
result?

Matrices, data frames, and lists

# A matrix is an array of vectors


A <- matrix(1:4, nrow = 2, ncol = 2)
A
## [,1] [,2]
## [1,] 1 3

## [2,] 2 4

# Another matrix
B <- matrix(1, nrow = 2, ncol = 2, byrow = TRUE)
B
## [,1] [,2]
## [1,] 1 1
## [2,] 1 1

# Matrix is a vector with dimension attributes


dim(A)
## [1] 2 2

# Binding by rows or columns


rbind(1:3, 4:6)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
cbind(1:3, 4:6)
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6

# Entrywise operations
A + 1
## [,1] [,2]
## [1,] 2 4
## [2,] 3 5
A * B
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4

# Accessing elements
A[2, 1] # Element (2, 1)
## [1] 2
A[1, ] # First row - this is a vector
## [1] 1 3
A[, 2] # First column - this is a vector
## [1] 3 4

# Obtain rows and columns as matrices (and not as vectors)


A[1, , drop = FALSE]
## [,1] [,2]
## [1,] 1 3
A[, 2, drop = FALSE]
## [,1]
## [1,] 3
## [2,] 4

# Matrix transpose
t(A)

## [,1] [,2]
## [1,] 1 2
## [2,] 3 4

# Matrix multiplication
A %*% B
## [,1] [,2]
## [1,] 4 4
## [2,] 6 6
A %*% B[, 1]
## [,1]
## [1,] 4
## [2,] 6
A %*% B[1, ]
## [,1]
## [1,] 4
## [2,] 6

# Care is needed
A %*% B[1, , drop = FALSE] # Incompatible product
## Error in A %*% B[1, , drop = FALSE]: non-conformable arguments

# Compute inverses with "solve"


solve(A) %*% A
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1

# A data frame is a matrix with column names


# Useful when you have multiple variables
myDf <- data.frame(var1 = 1:2, var2 = 3:4)
myDf
## var1 var2
## 1 1 3
## 2 2 4

# You can change names


names(myDf) <- c("newname1", "newname2")
myDf
## newname1 newname2
## 1 1 3
## 2 2 4

# The nice thing is that you can access variables by its name with
# the "$" operator
myDf$newname1
## [1] 1 2

# And create new variables also (it has to be of the same


# length as the rest of variables)
myDf$myNewVariable <- c(0, 1)
myDf
## newname1 newname2 myNewVariable

## 1 1 3 0
## 2 2 4 1

# A list is a collection of arbitrary variables


myList <- list(myData = myData, A = A, myDf = myDf)

# Access elements by names


myList$myData
## [1] 0 2
myList$A
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
myList$myDf
## newname1 newname2 myNewVariable
## 1 1 3 0
## 2 2 4 1

# Reveal the structure of an object


str(myList)
## List of 3
## $ myData: num [1:2] 0 2
## $ A : int [1:2, 1:2] 1 2 3 4
## $ myDf :'data.frame': 2 obs. of 3 variables:
## ..$ newname1 : int [1:2] 1 2
## ..$ newname2 : int [1:2] 3 4
## ..$ myNewVariable: num [1:2] 0 1
str(myDf)
## 'data.frame': 2 obs. of 3 variables:
## $ newname1 : int 1 2
## $ newname2 : int 3 4
## $ myNewVariable: num 0 1

# A less lengthy output


names(myList)
## [1] "myData" "A" "myDf"
Exercise C.5. Do the following:

• Create a matrix called M with rows given by y[3:5], y[3:5]^2 and log(y[3:5]).

• Create a data frame called myDataFrame with column names “y”, “y2” and “logy” containing the
vectors y[3:5], y[3:5]^2 and log(y[3:5]), respectively.

• Create a list, called l, with entries for x and M. Access the elements by their names.

• Compute the squares of myDataFrame and save the result as myDataFrame2.

• Compute the log of the sum of myDataFrame and myDataFrame2. Answer:

## y y2 logy
## 1 9.180087 18.33997 3.242862
## 2 9.159678 18.29895 3.238784
## 3 9.139059 18.25750 3.234656

More on data frames

# Let's begin importing the iris dataset


data(iris)

# "names" gives you the variables in the data frame


names(iris)
## [1] "species" "sepal.len" "sepal.wid" "petal.len" "petal.wid"

# The beginning of the data


head(iris)
## species sepal.len sepal.wid petal.len petal.wid
## 1 versicolor 7.0 3.2 4.7 1.4
## 2 versicolor 6.4 3.2 4.5 1.5
## 3 versicolor 6.9 3.1 4.9 1.5
## 4 versicolor 5.5 2.3 4.0 1.3
## 5 versicolor 6.5 2.8 4.6 1.5
## 6 versicolor 5.7 2.8 4.5 1.3

# So we can access variables by "$" or as in a matrix


iris$Sepal.Length[1:10]
## NULL
iris[1:10, 1]
## [1] versicolor versicolor versicolor versicolor versicolor versicolor
## [7] versicolor versicolor versicolor versicolor
## Levels: versicolor virginica
iris[3, 1]
## [1] versicolor
## Levels: versicolor virginica

# Information on the dimension of the data frame


dim(iris)
## [1] 100 5

# "str" gives the structure of any object in R


str(iris)
## 'data.frame': 100 obs. of 5 variables:
## $ species : Factor w/ 2 levels "versicolor","virginica": 1 1 1 1 1 1 1 1 1 1 ...
## $ sepal.len: num 7 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 ...
## $ sepal.wid: num 3.2 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 ...
## $ petal.len: num 4.7 4.5 4.9 4 4.6 4.5 4.7 3.3 4.6 3.9 ...
## $ petal.wid: num 1.4 1.5 1.5 1.3 1.5 1.3 1.6 1 1.3 1.4 ...

# Recall the species variable: it is a categorical variable (or factor),


# not a numeric variable
iris$Species[1:10]
## NULL

# Factors can only take certain values


levels(iris$Species)
## NULL

# If a file contains a variable with character strings as observations (either



# encapsulated by quotation marks or not), the variable will become a factor


# when imported into R
Exercise C.6. Do the following:
• Load the faithful dataset into R.
• Get the dimensions of faithful and show beginning of the data.
• Retrieve the fifth observation of eruptions in two different ways.
• Obtain a summary of waiting.

Vector-related functions

# The function "seq" creates sequences of numbers equally separated


seq(0, 1, by = 0.1)
## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
seq(0, 1, length.out = 5)
## [1] 0.00 0.25 0.50 0.75 1.00

# You can shorten the latter argument


seq(0, 1, l = 5)
## [1] 0.00 0.25 0.50 0.75 1.00

# Repeat number
rep(0, 5)
## [1] 0 0 0 0 0

# Reverse a vector
myVec <- c(1:5, -1:3)
rev(myVec)
## [1] 3 2 1 0 -1 5 4 3 2 1

# Another way
myVec[length(myVec):1]
## [1] 3 2 1 0 -1 5 4 3 2 1

# Count repetitions in your data


table(iris$Sepal.Length)
## < table of extent 0 >
table(iris$Species)
## < table of extent 0 >
Exercise C.7. Do the following:
• Create the vector x = (0.3, 0.6, 0.9, 1.2).
• Create a vector of length 100 ranging from 0 to 1 with entries equally separated.
• Compute the number of zeros and ones in x <- c(0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0). Check that they are the same as in rev(x).
• Compute the vector (0.1, 1.1, 2.1, ..., 100.1) in four different ways using seq and rev. Do the same but using : instead of seq. (Hint: add 0.1)

Logical conditions and subsetting

# Relational operators: x < y, x > y, x <= y, x >= y, x == y, x!= y


# They return TRUE or FALSE

# Smaller than
0 < 1
## [1] TRUE

# Greater than
1 > 1
## [1] FALSE

# Greater or equal to
1 >= 1 # Remember: ">=" and not "=>"!
## [1] TRUE

# Smaller or equal to
2 <= 1 # Remember: "<=" and not "=<"!
## [1] FALSE

# Equal
1 == 1 # Tests equality. Remember: "==" and not "="!
## [1] TRUE

# Unequal
1 != 0 # Tests inequality
## [1] TRUE

# TRUE is encoded as 1 and FALSE as 0


TRUE + 1
## [1] 2
FALSE + 1
## [1] 1

# In a vector-like fashion
x <- 1:5
y <- c(0, 3, 1, 5, 2)
x < y
## [1] FALSE TRUE FALSE TRUE FALSE
x == y
## [1] FALSE FALSE FALSE FALSE FALSE
x != y
## [1] TRUE TRUE TRUE TRUE TRUE

# Subsetting of vectors
x
## [1] 1 2 3 4 5
x[x >= 2]
## [1] 2 3 4 5
x[x < 3]
## [1] 1 2

# Easy way of work with parts of the data


data <- data.frame(x = c(0, 1, 3, 3, 0), y = 1:5)
data
## x y
## 1 0 1

## 2 1 2
## 3 3 3
## 4 3 4
## 5 0 5

# Data such that x is zero


data0 <- data[data$x == 0, ]
data0
## x y
## 1 0 1
## 5 0 5

# Data such that x is larger than 2


data2 <- data[data$x > 2, ]
data2
## x y
## 3 3 3
## 4 3 4

# In an example
iris$Sepal.Width[iris$Sepal.Width > 3]
## NULL

# Problem - what happened?


data[x > 2, ]
## x y
## 3 3 3
## 4 3 4
## 5 0 5

# In an example
summary(iris)
## species sepal.len sepal.wid petal.len
## versicolor:50 Min. :4.900 Min. :2.000 Min. :3.000
## virginica :50 1st Qu.:5.800 1st Qu.:2.700 1st Qu.:4.375
## Median :6.300 Median :2.900 Median :4.900
## Mean :6.262 Mean :2.872 Mean :4.906
## 3rd Qu.:6.700 3rd Qu.:3.025 3rd Qu.:5.525
## Max. :7.900 Max. :3.800 Max. :6.900
## petal.wid
## Min. :1.000
## 1st Qu.:1.300
## Median :1.600
## Mean :1.676
## 3rd Qu.:2.000
## Max. :2.500
summary(iris[iris$Sepal.Width > 3, ])
## species sepal.len sepal.wid petal.len petal.wid
## versicolor:0 Min. : NA Min. : NA Min. : NA Min. : NA
## virginica :0 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
## Median : NA Median : NA Median : NA Median : NA
## Mean :NaN Mean :NaN Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA

## Max. : NA Max. : NA Max. : NA Max. : NA

# On the factor variable only makes sense == and !=


summary(iris[iris$Species == "setosa", ])
## species sepal.len sepal.wid petal.len petal.wid
## versicolor:0 Min. : NA Min. : NA Min. : NA Min. : NA
## virginica :0 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
## Median : NA Median : NA Median : NA Median : NA
## Mean :NaN Mean :NaN Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
## Max. : NA Max. : NA Max. : NA Max. : NA

# Subset argument in lm
lm(Sepal.Width ~ Petal.Length, data = iris, subset = Sepal.Width > 3)
## Error in eval(predvars, data, env): object 'Sepal.Width' not found
lm(Sepal.Width ~ Petal.Length, data = iris, subset = iris$Sepal.Width > 3)
## Error in eval(predvars, data, env): object 'Sepal.Width' not found
# Both iris$Sepal.Width and Sepal.Width in subset are fine: data = iris
# tells R to look for Sepal.Width in the iris dataset

# AND operator "&"


TRUE & TRUE
## [1] TRUE
TRUE & FALSE
## [1] FALSE
FALSE & FALSE
## [1] FALSE

# OR operator "|"
TRUE | TRUE
## [1] TRUE
TRUE | FALSE
## [1] TRUE
FALSE | FALSE
## [1] FALSE

# Both operators are useful for checking for ranges of data


y
## [1] 0 3 1 5 2
index1 <- (y <= 3) & (y > 0)
y[index1]
## [1] 3 1 2
index2 <- (y < 2) | (y > 4)
y[index2]
## [1] 0 1 5

# In an example
summary(iris[iris$Sepal.Width > 3 & iris$Sepal.Width < 3.5, ])
## species sepal.len sepal.wid petal.len petal.wid
## versicolor:0 Min. : NA Min. : NA Min. : NA Min. : NA
## virginica :0 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
## Median : NA Median : NA Median : NA Median : NA
## Mean :NaN Mean :NaN Mean :NaN Mean :NaN

## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA


## Max. : NA Max. : NA Max. : NA Max. : NA
Exercise C.8. Do the following for the iris dataset:
• Compute the subset corresponding to Petal.Length either smaller than 1.5 or larger than 2. Save
this dataset as irisPetal.
• Compute and summarize a linear regression of Sepal.Width on Petal.Width + Petal.Length for the dataset irisPetal. What is the R2? Solution: 0.101.
• Check that the previous model is the same as regressing Sepal.Width on Petal.Width + Petal.Length for the dataset iris with the appropriate subset expression.
• Compute the variance of Petal.Width when Petal.Width is smaller than or equal to 1.5 and larger than 0.3. Solution: 0.1266541.

Plotting functions

# "plot" is the main function for plotting in R


# It has a different behaviour depending on the kind of object that it receives

# How to plot some data


plot(iris$Sepal.Length, iris$Sepal.Width, main = "Sepal.Length vs Sepal.Width")
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Error in plot.window(...): need finite 'xlim' values

# Alternatively
plot(iris[, 1:2], main = "Sepal.Length vs Sepal.Width")

# Change the axis limits


plot(iris$Sepal.Length, iris$Sepal.Width, xlim = c(0, 10), ylim = c(0, 10))

# How to plot a curve (a parabola)


x <- seq(-1, 1, l = 50)
y <- x^2
plot(x, y)
plot(x, y, main = "A dotted parabola")
plot(x, y, main = "A parabola", type = "l")
plot(x, y, main = "A red and thick parabola", type = "l", col = "red", lwd = 3)

# Plotting a more complicated curve between -pi and pi


x <- seq(-pi, pi, l = 50)
y <- (2 + sin(10 * x)) * x^2
plot(x, y, type = "l") # Kind of rough...

# Remember that we are joining points for creating a curve!


# More detailed plot
x <- seq(-pi, pi, l = 500)
y <- (2 + sin(10 * x)) * x^2
plot(x, y, type = "l")

# For more options in the plot customization see


?plot
?par

# "plot" is a first level plotting function. That means that whenever is called,
# it creates a new plot. If we want to add information to an existing plot, we
# have to use a second level plotting function such as "points", "lines" or "abline"

plot(x, y) # Create a plot


lines(x, x^2, col = "red") # Add lines
points(x, y + 10, col = "blue") # Add points
abline(a = 5, b = 1, col = "orange", lwd = 2) # Add a straight line y = a + b * x
Exercise C.9. Do the following:
• Plot the faithful dataset.
• Add the straight line y = 110 − 15x (red).
• Make a new plot for the function y = sin(x) (black). Add y = sin(2x) (red), y = sin(3x) (blue), and
y = sin(4x) (orange).

Distributions

# R allows to sample [r], compute density/probability mass functions [d],


# compute distribution function [p], and compute quantiles [q] for several
# continuous and discrete distributions. The format employed is [rdpq]name,
# where name stands for:
# - norm -> Normal
# - unif -> Uniform

# - exp -> Exponential


# - t -> Student's t
# - f -> Snedecor's F
# - chisq -> Chi squared
# - pois -> Poisson
# - binom -> Binomial
# More distributions:
?Distributions

# Sampling from a Normal - 100 random points from a N(0, 1)


rnorm(n = 10, mean = 0, sd = 1)
## [1] -1.12055305 -0.06232391 0.71414179 0.77304492 1.98370559
## [6] 0.88127216 -1.15256313 0.55526743 0.26835801 1.26793195

# If you want to have always the same result, set the seed of the random number
# generator
set.seed(45678)
rnorm(n = 10, mean = 0, sd = 1)
## [1] 1.4404800 -0.7195761 0.6709784 -0.4219485 0.3782196 -1.6665864
## [7] -0.5082030 0.4433822 -1.7993868 -0.6179521

# Plotting the density of a N(0, 1) - the Gauss bell


x <- seq(-4, 4, l = 100)
y <- dnorm(x = x, mean = 0, sd = 1)
plot(x, y, type = "l")
# Plotting the distribution function of a N(0, 1)
x <- seq(-4, 4, l = 100)
y <- pnorm(q = x, mean = 0, sd = 1)
plot(x, y, type = "l")
# Computing the 95% quantile for a N(0, 1)
qnorm(p = 0.95, mean = 0, sd = 1)
## [1] 1.644854

# All distributions have the same syntax: rname(n,...), dname(x,...), pname(q,...)
# and qname(p,...), but the parameters in ... change. Look them up in ?Distributions
# For example, here is the same for the uniform distribution

# Sampling from a U(0, 1)


set.seed(45678)
runif(n = 10, min = 0, max = 1)
## [1] 0.9251342 0.3339988 0.2358930 0.3366312 0.7488829 0.9327177 0.3365313
## [8] 0.2245505 0.6473663 0.0807549

# Plotting the density of a U(0, 1)


x <- seq(-2, 2, l = 100)
y <- dunif(x = x, min = 0, max = 1)
plot(x, y, type = "l")
# Computing the 95% quantile for a U(0, 1)
qunif(p = 0.95, min = 0, max = 1)
## [1] 0.95

# Sampling from a Bi(10, 0.5)


set.seed(45678)
samp <- rbinom(n = 200, size = 10, prob = 0.5)
table(samp) / 200
## samp
## 1 2 3 4 5 6 7 8 9
## 0.010 0.060 0.115 0.220 0.210 0.215 0.115 0.045 0.010

# Plotting the probability mass of a Bi(10, 0.5)


x <- 0:10
y <- dbinom(x = x, size = 10, prob = 0.5)
plot(x, y, type = "h") # Vertical bars
# Plotting the distribution function of a Bi(10, 0.5)
x <- 0:10
y <- pbinom(q = x, size = 10, prob = 0.5)
plot(x, y, type = "h")
Exercise C.10. Do the following:

• Compute the 90%, 95% and 99% quantiles of a F distribution with df1 = 1 and df2 = 5. Answer:
c(4.060420, 6.607891, 16.258177).
• Plot the distribution function of a U (0, 1). Does it make sense with its density function?
• Sample 100 points from a Poisson with lambda = 5.

• Sample 100 points from a U (−1, 1) and compute its mean.


• Plot the density of a t distribution with df = 1 (use a sequence spanning from -4 to 4). Add lines
of different colors with the densities for df = 5, df = 10, df = 50 and df = 100. Do you see any
pattern?

Functions

# A function is a way of encapsulating a block of code so it can be reused easily


# They are useful for simplifying repetitive tasks and organize the analysis

# This is a silly function that takes x and y and returns its sum
# Note the use of "return" to indicate what should be returned
add <- function(x, y) {
z <- x + y
return(z)
}

# Calling add - you need to run the definition of the function first!
add(x = 1, y = 2)
## [1] 3
add(1, 1) # Arguments names can be omitted
## [1] 2

# A more complex function: computes a linear model and its posterior summary.
# Saves us a few keystrokes when computing a lm and a summary
lmSummary <- function(formula, data) {
model <- lm(formula = formula, data = data)
summary(model)
}
# If no return(), the function returns the value of the last expression

# Usage
lmSummary(Sepal.Length ~ Petal.Width, iris)
## Error in eval(predvars, data, env): object 'Sepal.Length' not found

# Recall: there is no variable called model in the workspace.


# The function works on its own workspace!
model
## Error in eval(expr, envir, enclos): object 'model' not found

# Add a line to a plot


addLine <- function(x, beta0, beta1) {
lines(x, beta0 + beta1 * x, lwd = 2, col = 2)
}

# Usage
plot(x, y)
addLine(x, beta0 = 0.1, beta1 = 0)
# The function "sapply" allows to sequentially apply a function
sapply(1:10, sqrt)
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
## [8] 2.828427 3.000000 3.162278
sqrt(1:10) # The same
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
## [8] 2.828427 3.000000 3.162278

# The advantage of "sapply" is that you can use with any function
myFun <- function(x) c(x, x^2)
sapply(1:10, myFun) # Returns a 2 x 10 matrix
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 2 3 4 5 6 7 8 9 10
## [2,] 1 4 9 16 25 36 49 64 81 100

# "sapply" is usefuf for plotting non-vectorized functions


sumSeries <- function(n) sum(1:n)
plot(1:10, sapply(1:10, sumSeries), type = "l")
# "apply" applies iteratively a function to rows (1) or columns (2)
# of a matrix or data frame
A <- matrix(1:10, nrow = 5, ncol = 2)
A
## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10
apply(A, 1, sum) # Applies the function by rows
## [1] 7 9 11 13 15
apply(A, 2, sum) # By columns
## [1] 15 40

# With other functions


apply(A, 1, sqrt)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.00000 1.414214 1.732051 2 2.236068
## [2,] 2.44949 2.645751 2.828427 3 3.162278
apply(A, 2, function(x) x^2)
## [,1] [,2]
## [1,] 1 36
## [2,] 4 49
## [3,] 9 64
## [4,] 16 81
## [5,] 25 100
Exercise C.11. Do the following:
• Create a function that takes as argument n and returns the value of $\sum_{i=1}^{n} i^2$.
• Create a function that takes as input the argument N and then plots the curve $(n, \sum_{i=1}^{n}\sqrt{i})$ for n = 1, . . . , N. Hint: use sapply.

Control structures

# The "for" statement allows to create loops that run along a given vector
# Prints 5 times a message (i varies in 1:5)
for (i in 1:5) {
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

# Another example
a <- 0
for (i in 1:3) {
a <- i + a
}
a
## [1] 6

# Nested loops are possible


A <- matrix(0, nrow = 2, ncol = 3)
for (i in 1:2) {
for (j in 1:3) {
A[i, j] <- i + j
}
}

# The "if" statement allows to create conditional structures of the forms:


# if (condition) {
# # Something
# } else {
# # Something else
# }
# These structures are thought to be inside functions

# Is the number positive?


isPositive <- function(x) {
if (x > 0) {
print("Positive")
} else {
print("Not positive")
}
}
isPositive(1)
## [1] "Positive"
isPositive(-1)
## [1] "Not positive"

# A loop can be interrupted with the "break" statement


# Stop when x is above 100
x <- 1

for (i in 1:1000) {
x <- (x + 0.01) * x
print(x)
if (x > 100) {
break
}
}
## [1] 1.01
## [1] 1.0302
## [1] 1.071614
## [1] 1.159073
## [1] 1.35504
## [1] 1.849685
## [1] 3.439832
## [1] 11.86684
## [1] 140.9406
Exercise C.12. Do the following:
• Compute $C_{n\times k}$ in $C_{n\times k} = A_{n\times m} B_{m\times k}$ from A and B. Use that $c_{i,j} = \sum_{l=1}^{m} a_{i,l}\, b_{l,j}$. Test the implementation with simple examples.
• Create a function that samples a N (0, 1) and returns the first sampled point that is larger than 4.
• Create a function that simulates N samples from the distribution of max(X1 , . . . , Xn ) where X1 , . . . , Xn
are iid U(0, 1).
Bibliography

Allaire, J. J., Cheng, J., Xie, Y., McPherson, J., Chang, W., Allen, J., Wickham, H., Atkins, A., Hyndman,
R., and Arslan, R. (2017). rmarkdown: Dynamic Documents for R. R package version 1.6.

DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability. Springer Texts in Statistics. Springer,
New York.

Devroye, L. and Györfi, L. (1985). Nonparametric Density Estimation. Wiley Series in Probability and
Mathematical Statistics: Tracts on Probability and Statistics. John Wiley & Sons, Inc., New York. The
L1 view.

Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and its Applications, volume 66 of Monographs
on Statistics and Applied Probability. Chapman & Hall, London.

Loader, C. (1999). Local regression and likelihood. Statistics and Computing. Springer-Verlag, New York.

Marron, J. S. and Wand, M. P. (1992). Exact mean integrated squared error. Ann. Statist., 20(2):712–736.

Nadaraya, E. (1964). On estimating regression. Teor. Veroyatnost. i Primenen, 9(1):157–159.

Parzen, E. (1962). On estimation of a probability density function and mode. Ann. Math. Statist., 33(3):1065–
1076.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical
Computing, Vienna, Austria.

Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. Ann. Math. Statist.,
27(3):832–837.

Ruppert, D. and Wand, M. P. (1994). Multivariate locally weighted least squares regression. Ann. Statist.,
22(3):1346–1370.

Scott, D. W. (2015). Multivariate Density Estimation. Wiley Series in Probability and Statistics. John Wiley
& Sons, Inc., Hoboken, second edition. Theory, practice, and visualization.

Scott, D. W. and Terrell, G. R. (1987). Biased and unbiased cross-validation in density estimation. J. Am.
Stat. Assoc., 82(400):1131–1146.

Sheather, S. J. and Jones, M. C. (1991). A reliable data-based bandwidth selection method for kernel density
estimation. J. Roy. Statist. Soc. Ser. B, 53(3):683–690.

Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Monographs on Statistics
and Applied Probability. Chapman & Hall, London.

van der Vaart, A. W. (1998). Asymptotic Statistics, volume 3 of Cambridge Series in Statistical and Proba-
bilistic Mathematics. Cambridge University Press, Cambridge.

Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing, volume 60 of Monographs on Statistics and Applied
Probability. Chapman & Hall, Ltd., London.


Wasserman, L. (2004). All of Statistics. Springer Texts in Statistics. Springer-Verlag, New York. A concise
course in statistical inference.
Wasserman, L. (2006). All of Nonparametric Statistics. Springer Texts in Statistics. Springer, New York.
Watson, G. S. (1964). Smooth regression analysis. Sankhyā Ser. A, 26:359–372.

Xie, Y. (2015). Dynamic Documents with R and knitr. The R Series. Chapman & Hall/CRC, Boca Raton.
Xie, Y. (2016). bookdown: Authoring Books and Technical Documents with R Markdown. The R Series.
Chapman & Hall/CRC, Boca Raton.
