
DS-GA 1002 Lecture notes 9 Fall 2016

Learning Models from Data


In this section we consider the problem of estimating distributions from data. This is a
crucial problem in statistics and in machine learning, since it allows us to calibrate probabilistic
models that can then be used for prediction and other tasks. Parametric methods assume
that the data are generated according to a known distribution that depends only on certain
unknown parameters. Nonparametric methods estimate the distribution directly from the
data, without making any parametric assumptions.

1 Parametric estimation
The main assumption in parametric estimation is that we know what type of distribution
generated the data. The choice of distribution can be motivated by theoretical considerations
(for example, the central limit theorem which implies that many quantities are well modeled
as Gaussian), or by observing the data. The histograms in Figure 1, for instance, suggest
that the data might be well modeled as samples from an exponential (left) and Gaussian
(right) distribution. Once the type of distribution is chosen, the corresponding parameters
must be fit to the data in order to calibrate the estimate. Taking a frequentist viewpoint,
we model the parameters of the chosen distribution as deterministic quantities.

1.1 The method of moments

The method of moments is a simple way of fitting a distribution: the idea is to choose the
parameters so that the moments of the distribution coincide with the empirical moments of
the data. If the distribution only depends on one parameter, then we use the empirical mean
as a surrogate for the true mean and compute the corresponding value of the parameter. For
an exponential with parameter $\lambda$ and mean $\mu$ we have

$$\lambda = \frac{1}{\mu}. \qquad (1)$$

Assuming that we have access to $n$ iid samples $x_1, \ldots, x_n$ from the exponential distribution,
the method-of-moments estimate of $\lambda$ equals

$$\lambda_{\mathrm{MM}} := \frac{1}{\mathrm{av}(x_1, \ldots, x_n)}. \qquad (2)$$
Figure 1 shows the result of fitting an exponential to the call-center data in this way. Simi-
larly, to fit a Gaussian using the method of moments we set the mean equal to its empirical
mean and the variance equal to the empirical variance. An example is shown in Figure 1.
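As a quick illustration, the method-of-moments fit for an exponential takes only a few lines. The sketch below uses synthetic data with an assumed rate (here 2.0, chosen for illustration; the text's call-center data are not reproduced) and applies equation (2):

```python
import random

random.seed(0)
true_lam = 2.0  # assumed rate, for illustration only
# Synthetic iid interarrival times from an Exponential(true_lam) distribution.
samples = [random.expovariate(true_lam) for _ in range(100_000)]

# The exponential mean is 1/lambda, so the method-of-moments estimate
# sets lambda_MM = 1 / (empirical mean), as in equation (2).
lam_mm = 1.0 / (sum(samples) / len(samples))
```

With this many samples the estimate lands within a percent or so of the assumed rate.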
[Two histograms with fitted pdfs: exponential fit vs. real data (x-axis: interarrival times in seconds) and Gaussian fit vs. real data (x-axis: height in inches).]

Figure 1: Exponential distribution fitted to data consisting of inter-arrival times of calls at a call
center in Israel (left). Gaussian distribution fitted to height data (right).

1.2 Maximum likelihood

The most popular method for learning parametric models is maximum-likelihood fitting.
The likelihood function is the joint pmf or pdf of the data, interpreted as a function of the
unknown parameters. In more detail, let us denote the data by $x_1, \ldots, x_n$ and assume that
they are realizations of a set of discrete random variables $X_1, \ldots, X_n$ with a joint pmf
that depends on a vector of parameters $\vec{\theta}$. To emphasize that the joint pmf depends on $\vec{\theta}$ we
denote it by $p_{\vec{\theta}} := p_{X_1, \ldots, X_n}$. This pmf evaluated at the observed data,

$$p_{\vec{\theta}}(x_1, \ldots, x_n), \qquad (3)$$

is the likelihood function, when we interpret it as a function of $\vec{\theta}$. For continuous random
variables, we use the joint pdf of the data instead.

Definition 1.1 (Likelihood function). Given a realization $x_1, \ldots, x_n$ of a set of discrete
random variables $X_1, \ldots, X_n$ with joint pmf $p_{\vec{\theta}}$, where $\vec{\theta} \in \mathbb{R}^m$ is a vector of parameters, the
likelihood function is

$$\mathcal{L}_{x_1, \ldots, x_n}(\vec{\theta}) := p_{\vec{\theta}}(x_1, \ldots, x_n). \qquad (4)$$

If the random variables are continuous with pdf $f_{\vec{\theta}}$, where $\vec{\theta} \in \mathbb{R}^m$, the likelihood function is

$$\mathcal{L}_{x_1, \ldots, x_n}(\vec{\theta}) := f_{\vec{\theta}}(x_1, \ldots, x_n). \qquad (5)$$

The log-likelihood function is the logarithm of the likelihood function, $\log \mathcal{L}_{x_1, \ldots, x_n}(\vec{\theta})$.

When the data are modeled as iid samples, the likelihood factors into a product of the
marginal pmf or pdf, so the log likelihood can be decomposed into a sum.
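The product-to-sum decomposition is worth seeing concretely. This is a minimal sketch under an assumed exponential model (rate 1.5, chosen arbitrarily): the iid log-likelihood is just the sum of the marginal log-pdfs, one term per sample.

```python
import math
import random

random.seed(0)
# Synthetic iid samples from an assumed Exponential(1.5) model.
xs = [random.expovariate(1.5) for _ in range(1000)]

def log_pdf(x, lam):
    # Log of the exponential pdf: log(lam * exp(-lam * x)) = log(lam) - lam * x.
    return math.log(lam) - lam * x

def log_likelihood(lam):
    # Because the samples are iid, the joint pdf factors into a product,
    # so the log-likelihood decomposes into a sum of per-sample terms.
    return sum(log_pdf(x, lam) for x in xs)
```

Evaluating `log_likelihood` at the rate that generated the data gives a larger value than at rates far from it, which is exactly what maximum likelihood exploits.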
In the case of discrete distributions, for a fixed $\vec{\theta}$ the likelihood is the probability that
$X_1, \ldots, X_n$ equal the observed data. If we don't know $\vec{\theta}$, it makes sense to choose a value
for $\vec{\theta}$ such that this probability is as high as possible, i.e. to maximize the likelihood. For
continuous distributions we apply the same principle to the joint pdf of the data.

Definition 1.2 (Maximum-likelihood estimator). The maximum-likelihood (ML) estimator
for the vector of parameters $\vec{\theta} \in \mathbb{R}^m$ is

$$\vec{\theta}_{\mathrm{ML}}(x_1, \ldots, x_n) := \arg\max_{\vec{\theta}} \mathcal{L}_{x_1, \ldots, x_n}(\vec{\theta}) \qquad (6)$$
$$= \arg\max_{\vec{\theta}} \log \mathcal{L}_{x_1, \ldots, x_n}(\vec{\theta}). \qquad (7)$$

The maximum of the likelihood function and that of the log-likelihood function are at the
same location because the logarithm is monotone.

Under certain conditions, one can show that the maximum-likelihood estimator is consistent:
it converges in probability to the true parameter as the number of samples increases. One can
also show that its distribution converges to that of a Gaussian random variable (or vector),
just like the distribution of the empirical mean. These results are beyond the scope of the
course. Bear in mind, however, that they only hold if the data are indeed generated by the
type of distribution that we are considering.

We now show how to derive the maximum-likelihood estimator for a Bernoulli and a Gaussian
distribution. The resulting estimators for the parameters are the same as the method-of-moments
estimators (except for the variance estimator in the case of the Gaussian).

Example 1.3 (ML estimator of the parameter of a Bernoulli distribution). We model a set
of data $x_1, \ldots, x_n$ as iid samples from a Bernoulli distribution with parameter $\theta$ (in this case
there is only one parameter). The likelihood function is equal to

$$\mathcal{L}_{x_1, \ldots, x_n}(\theta) = p_{\theta}(x_1, \ldots, x_n) \qquad (8)$$
$$= \prod_{i=1}^{n} \left( 1_{x_i = 1}\, \theta + 1_{x_i = 0}\, (1 - \theta) \right) \qquad (9)$$
$$= \theta^{n_1} (1 - \theta)^{n_0} \qquad (10)$$

and the log-likelihood function to

$$\log \mathcal{L}_{x_1, \ldots, x_n}(\theta) = n_1 \log \theta + n_0 \log (1 - \theta), \qquad (11)$$

where $n_1$ is the number of samples equal to one and $n_0$ the number of samples equal to
zero. The ML estimator of the parameter is

$$\theta_{\mathrm{ML}} = \arg\max_{\theta} \log \mathcal{L}_{x_1, \ldots, x_n}(\theta) \qquad (12)$$
$$= \arg\max_{\theta} \; n_1 \log \theta + n_0 \log (1 - \theta). \qquad (13)$$

We compute the derivative and second derivative of the log-likelihood function,

$$\frac{\mathrm{d} \log \mathcal{L}_{x_1, \ldots, x_n}(\theta)}{\mathrm{d}\theta} = \frac{n_1}{\theta} - \frac{n_0}{1 - \theta}, \qquad (14)$$
$$\frac{\mathrm{d}^2 \log \mathcal{L}_{x_1, \ldots, x_n}(\theta)}{\mathrm{d}\theta^2} = -\frac{n_1}{\theta^2} - \frac{n_0}{(1 - \theta)^2} < 0. \qquad (15)$$

The function is concave, as the second derivative is negative. The maximum is consequently
at the point where the first derivative equals zero, namely

$$\theta_{\mathrm{ML}} = \frac{n_1}{n_0 + n_1}, \qquad (16)$$

the fraction of samples that are equal to one, which is a very reasonable estimate.
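The closed form (16) is easy to verify numerically. A minimal sketch, using synthetic data with an assumed parameter $\theta = 0.3$: we compute the ML estimate and check that it attains a higher log-likelihood (11) than nearby parameter values.

```python
import math
import random

random.seed(0)
theta_true = 0.3  # assumed value, for illustration only
xs = [1 if random.random() < theta_true else 0 for _ in range(10_000)]

n1 = sum(xs)               # number of samples equal to one
n0 = len(xs) - n1          # number of samples equal to zero
theta_ml = n1 / (n0 + n1)  # equation (16)

def log_likelihood(theta):
    # Equation (11): n1 log(theta) + n0 log(1 - theta).
    return n1 * math.log(theta) + n0 * math.log(1 - theta)

# The log-likelihood is strictly concave, so the ML estimate beats
# any nearby parameter value.
assert log_likelihood(theta_ml) > log_likelihood(theta_ml - 0.01)
assert log_likelihood(theta_ml) > log_likelihood(theta_ml + 0.01)
```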

Example 1.4 (ML estimator of the parameters of a Gaussian distribution). Let $x_1, x_2, \ldots$
be data that we wish to model as iid samples from a Gaussian distribution with mean $\mu$ and
standard deviation $\sigma$. The likelihood function is equal to

$$\mathcal{L}_{x_1, \ldots, x_n}(\mu, \sigma) = f_{\mu, \sigma}(x_1, \ldots, x_n) \qquad (17)$$
$$= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \qquad (18)$$

and the log-likelihood function to

$$\log \mathcal{L}_{x_1, \ldots, x_n}(\mu, \sigma) = -\sum_{i=1}^{n} \frac{(x_i - \mu)^2}{2\sigma^2} - n \log \sigma - \frac{n \log (2\pi)}{2}. \qquad (19)$$

The ML estimator of the parameters $\mu$ and $\sigma$ is

$$\{\mu_{\mathrm{ML}}, \sigma_{\mathrm{ML}}\} = \arg\max_{\{\mu, \sigma\}} \log \mathcal{L}_{x_1, \ldots, x_n}(\mu, \sigma) \qquad (20)$$
$$= \arg\max_{\{\mu, \sigma\}} \; -\sum_{i=1}^{n} \frac{(x_i - \mu)^2}{2\sigma^2} - n \log \sigma. \qquad (21)$$


Figure 2: The left column shows histograms of 50 iid samples from a Gaussian distribution,
together with the pdf of the original distribution, as well as the maximum-likelihood estimate. The
right column shows the log-likelihood function corresponding to the data and the location of its
maximum and of the point corresponding to the true parameters.

We compute the partial derivatives of the log-likelihood function,

$$\frac{\partial \log \mathcal{L}_{x_1, \ldots, x_n}(\mu, \sigma)}{\partial \mu} = \sum_{i=1}^{n} \frac{x_i - \mu}{\sigma^2}, \qquad (22)$$
$$\frac{\partial \log \mathcal{L}_{x_1, \ldots, x_n}(\mu, \sigma)}{\partial \sigma} = -\frac{n}{\sigma} + \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{\sigma^3}. \qquad (23)$$

The function we are trying to maximize is strictly concave in $\{\mu, \sigma\}$. To prove this, we would
have to show that the Hessian of the function is negative definite. We omit the calculations
that show that this is the case. Setting the partial derivatives to zero we obtain

$$\mu_{\mathrm{ML}} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad (24)$$
$$\sigma_{\mathrm{ML}}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_{\mathrm{ML}})^2. \qquad (25)$$

The estimator for the mean is just the empirical mean. The estimator for the variance is a
rescaled empirical variance.
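Equations (24) and (25) can be checked in a few lines. A minimal sketch, using synthetic samples with assumed parameters $\mu = 3$, $\sigma = 4$ (chosen for illustration):

```python
import math
import random

random.seed(0)
mu_true, sigma_true = 3.0, 4.0  # assumed values, for illustration only
xs = [random.gauss(mu_true, sigma_true) for _ in range(50_000)]

n = len(xs)
mu_ml = sum(xs) / n                             # equation (24): empirical mean
var_ml = sum((x - mu_ml) ** 2 for x in xs) / n  # equation (25): 1/n-scaled variance
sigma_ml = math.sqrt(var_ml)
```

Both estimates converge to the assumed parameters as the sample size grows, consistent with the consistency result mentioned above.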

Figure 2 displays the log-likelihood function corresponding to 50 iid samples from a Gaussian
distribution with $\mu := 3$ and $\sigma := 4$. It also shows the approximation to the true pdf
obtained by maximum likelihood. In Examples 1.3 and 1.4 the log-likelihood function is
strictly concave. This means that the function has a unique maximum that can be located
by methods such as gradient ascent. However, this is not always the case. As
illustrated by the following example, the log-likelihood function can have multiple local
maxima. In such situations, it may be intractable to compute the maximum-likelihood
estimator.

Example 1.5 (Log-likelihood function of a Gaussian mixture). Let $X$ be a Gaussian mixture
defined as

$$X := \begin{cases} G_1 & \text{with probability } \frac{1}{5}, \\ G_2 & \text{with probability } \frac{4}{5}, \end{cases} \qquad (26)$$

where $G_1$ is a Gaussian random variable with mean $-\mu$ and variance $\sigma^2$, whereas $G_2$ is also
Gaussian, with mean $\mu$ and variance $\sigma^2$. We have parameterized the mixture with just two
parameters so that we can visualize the log-likelihood in two dimensions. Let $x_1, x_2, \ldots$ be

Figure 3: The left image shows a histogram of 40 iid samples from the Gaussian mixture defined
in Example 1.5, together with the pdf of the original distribution. The right image shows the log-
likelihood function corresponding to the data, which has a local maximum apart from the global
maximum. The density estimates corresponding to the two maxima are shown on the left.

data modeled as iid samples from $X$. The likelihood function is equal to

$$\mathcal{L}_{x_1, \ldots, x_n}(\mu, \sigma) = f_{\mu, \sigma}(x_1, \ldots, x_n) \qquad (27)$$
$$= \prod_{i=1}^{n} \left( \frac{1}{5\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i + \mu)^2}{2\sigma^2}} + \frac{4}{5\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \right) \qquad (28)$$

and the log-likelihood function to

$$\log \mathcal{L}_{x_1, \ldots, x_n}(\mu, \sigma) = \sum_{i=1}^{n} \log \left( \frac{1}{5\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i + \mu)^2}{2\sigma^2}} + \frac{4}{5\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \right). \qquad (29)$$
Figure 3 shows the log-likelihood function for 40 iid samples of the distribution when $\mu := 4$
and $\sigma := 1$. The function has a local maximum away from the global maximum. This means
that if we use a local ascent method to find the ML estimator, we might not find the global
maximum, but remain stuck at the local maximum instead. The estimate corresponding to
the local maximum (shown on the left) has the same variance as the global maximum, but
its $\mu$ is close to $-4$ instead of $4$. Although this estimate doesn't fit the data very well, it is locally
optimal: small shifts of $\mu$ and $\sigma$ yield worse fits (in terms of the likelihood).
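The gap between the two maxima can be observed numerically by evaluating the log-likelihood (29) directly. This sketch draws samples from the mixture with the parameter values used for Figure 3 ($\mu = 4$, $\sigma = 1$) and compares the log-likelihood at those parameters with its value at the sign-flipped point $(-4, 1)$, which lies near the local maximum:

```python
import math
import random

random.seed(0)
mu_true, sigma_true = 4.0, 1.0  # the values used for Figure 3 in the text

def sample():
    # G1 has mean -mu (probability 1/5), G2 has mean +mu (probability 4/5).
    if random.random() < 0.2:
        return random.gauss(-mu_true, sigma_true)
    return random.gauss(mu_true, sigma_true)

xs = [sample() for _ in range(1000)]

def log_likelihood(mu, sigma):
    # Equation (29), evaluated term by term.
    c = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    total = 0.0
    for x in xs:
        p = (c / 5.0) * math.exp(-(x + mu) ** 2 / (2.0 * sigma ** 2)) \
            + (4.0 * c / 5.0) * math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2))
        total += math.log(p)
    return total

# The sign-flipped point swaps the 1/5 and 4/5 weights between the two
# bumps, so its likelihood is noticeably lower than at the true parameters.
gap = log_likelihood(mu_true, sigma_true) - log_likelihood(-mu_true, sigma_true)
```

A local ascent method started near $(-4, 1)$ would never see this gap, which is exactly the failure mode described above.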

To finish this section, we describe a machine-learning algorithm for supervised learning based
on parametric fitting using ML estimation.

Example 1.6 (Quadratic discriminant analysis). Quadratic discriminant analysis is an
algorithm for supervised learning. The inputs to the algorithm are two sets of training data,
consisting of $d$-dimensional vectors $\vec{a}_1, \ldots, \vec{a}_n$ and $\vec{b}_1, \ldots, \vec{b}_n$, which belong to two different
classes (the method can easily be extended to deal with more classes). The goal is to classify
new instances based on the structure of the data.

To perform quadratic discriminant analysis we first fit a $d$-dimensional Gaussian distribution
to the data of each class using the ML estimator for the mean and covariance matrix, which
correspond to the empirical mean and covariance matrix of the training data (up to a slight
rescaling of the empirical covariance). In more detail, $\vec{a}_1, \ldots, \vec{a}_n$ are used to estimate a mean
$\vec{\mu}_a$ and covariance matrix $\Sigma_a$, whereas $\vec{b}_1, \ldots, \vec{b}_n$ are used to estimate $\vec{\mu}_b$ and $\Sigma_b$,

$$\{\vec{\mu}_a, \Sigma_a\} := \arg\max_{\vec{\mu}, \Sigma} \mathcal{L}_{\vec{a}_1, \ldots, \vec{a}_n}(\vec{\mu}, \Sigma), \qquad (30)$$
$$\{\vec{\mu}_b, \Sigma_b\} := \arg\max_{\vec{\mu}, \Sigma} \mathcal{L}_{\vec{b}_1, \ldots, \vec{b}_n}(\vec{\mu}, \Sigma). \qquad (31)$$

Then for each new example $\vec{x}$, the value of the density function at the example is evaluated
for both classes. If

$$f_{\vec{\mu}_a, \Sigma_a}(\vec{x}) > f_{\vec{\mu}_b, \Sigma_b}(\vec{x}), \qquad (32)$$

then $\vec{x}$ is declared to belong to the first class; otherwise it is declared to belong to the second
class. Figure 4 illustrates the method with an example.
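A minimal sketch of the procedure for $d = 2$, using synthetic data with assumed class means $(0, 0)$ and $(4, 4)$ (correlation between coordinates is induced through a shared latent draw). The 2-dimensional case lets us evaluate the Gaussian density through the explicit inverse of a $2 \times 2$ covariance matrix:

```python
import math
import random

random.seed(0)

def sample_class(mu, n):
    # Synthetic correlated 2-d Gaussian samples around mean mu
    # (assumed generative model, for illustration only).
    pts = []
    for _ in range(n):
        z = random.gauss(0, 1)  # shared latent draw induces correlation
        pts.append((mu[0] + z + 0.5 * random.gauss(0, 1),
                    mu[1] + 0.5 * z + random.gauss(0, 1)))
    return pts

a = sample_class((0.0, 0.0), 300)
b = sample_class((4.0, 4.0), 300)

def fit_gaussian(data):
    # ML fit: empirical mean and 1/n-scaled empirical covariance,
    # as in equations (24)-(25) generalized to two dimensions.
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    sxx = sum((p[0] - mx) ** 2 for p in data) / n
    syy = sum((p[1] - my) ** 2 for p in data) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in data) / n
    return (mx, my), (sxx, sxy, syy)

def log_density(x, mu, cov):
    # Log pdf of a bivariate Gaussian, via the explicit 2x2 inverse.
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    quad = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + quad)

mu_a, cov_a = fit_gaussian(a)
mu_b, cov_b = fit_gaussian(b)

def classify(x):
    # Rule (32): declare the class whose fitted density is larger at x.
    return "a" if log_density(x, mu_a, cov_a) > log_density(x, mu_b, cov_b) else "b"
```

Comparing log-densities rather than densities avoids underflow and leaves the decision rule (32) unchanged, since the logarithm is monotone.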

2 Nonparametric estimation
In situations where a parametric model is not available or does not fit the data adequately,
we resort to nonparametric methods for learning probabilistic models. Learning a model
without relying on a given parametrization is challenging: we need to estimate the whole
distribution from the available samples. Without further assumptions this problem is ill
posed: infinitely many different distributions could have generated the data. However, as
we show below, with enough samples it is possible to obtain models that characterize the
underlying distribution quite accurately.

2.1 Empirical cdf

In this section, we consider the problem of estimating the cdf of an unknown distribution
from iid samples. A reasonable estimate for the cdf at a given point x is the fraction of

Figure 4: Quadratic-discriminant analysis applied to data from two different classes (left). The
data corresponding to the two different classes are colored red and blue. Three new examples are
colored in black. Bivariate Gaussians are fit to the data and used to classify the new examples
(right).

samples that are smaller than x. This produces a piecewise constant estimator known as the
empirical cdf.

Definition 2.1 (Empirical cdf). The empirical cdf corresponding to data $x_1, \ldots, x_n$ is

$$\widehat{F}_n(x) := \frac{1}{n} \sum_{i=1}^{n} 1_{x_i \le x}, \qquad (33)$$

where $x \in \mathbb{R}$.

The empirical cdf is an unbiased and consistent estimator of the true cdf. This is established
rigorously in Theorem 2.2 below, but is also illustrated empirically in Figure 5. The cdf
of the height data set of 25,000 measurements is compared to three realizations of the empirical cdf
computed from different numbers of iid samples. As the number of available samples grows, the
approximation becomes very accurate.

Theorem 2.2. Let $\widetilde{X}$ be an iid sequence with marginal cdf $F_X$. For any fixed $x \in \mathbb{R}$, $\widehat{F}_n(x)$
is an unbiased and consistent estimator of $F_X(x)$. In fact, $\widehat{F}_n(x)$ converges in mean square
to $F_X(x)$.

Proof. First, we verify

$$\mathrm{E}\left(\widehat{F}_n(x)\right) = \mathrm{E}\left(\frac{1}{n} \sum_{i=1}^{n} 1_{\widetilde{X}(i) \le x}\right) \qquad (34)$$
$$= \frac{1}{n} \sum_{i=1}^{n} \mathrm{P}\left(\widetilde{X}(i) \le x\right) \quad \text{by linearity of expectation} \qquad (35)$$
$$= F_X(x), \qquad (36)$$

so the estimator is unbiased. We now estimate its mean square

$$\mathrm{E}\left(\widehat{F}_n^2(x)\right) = \mathrm{E}\left(\frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} 1_{\widetilde{X}(i) \le x}\, 1_{\widetilde{X}(j) \le x}\right) \qquad (37)$$
$$= \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{P}\left(\widetilde{X}(i) \le x\right) + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} \mathrm{P}\left(\widetilde{X}(i) \le x, \widetilde{X}(j) \le x\right) \qquad (38)$$
$$= \frac{F_X(x)}{n} + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} F_{\widetilde{X}(i)}(x)\, F_{\widetilde{X}(j)}(x) \quad \text{by independence} \qquad (39)$$
$$= \frac{F_X(x)}{n} + \frac{n-1}{n}\, F_X^2(x). \qquad (40)$$

The variance is consequently equal to

$$\mathrm{Var}\left(\widehat{F}_n(x)\right) = \mathrm{E}\left(\widehat{F}_n(x)^2\right) - \mathrm{E}^2\left(\widehat{F}_n(x)\right) \qquad (41)$$
$$= \frac{F_X(x)\,(1 - F_X(x))}{n}. \qquad (42)$$

We conclude that

$$\lim_{n \to \infty} \mathrm{E}\left(\left(F_X(x) - \widehat{F}_n(x)\right)^2\right) = \lim_{n \to \infty} \mathrm{Var}\left(\widehat{F}_n(x)\right) = 0. \qquad (43)$$
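The estimator (33) and its convergence are easy to simulate. A minimal sketch, using standard-normal samples (an assumed distribution for illustration) whose true cdf at $0$ equals $1/2$:

```python
import random

def empirical_cdf(samples, x):
    # Equation (33): the fraction of samples that are at most x.
    return sum(1 for s in samples if s <= x) / len(samples)

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(100_000)]
# With this many samples, empirical_cdf(xs, 0.0) is very close to the
# true standard-normal cdf value of 0.5, as Theorem 2.2 predicts.
```

By (42), the standard deviation of the estimate at $x = 0$ is $\sqrt{0.25/n} \approx 0.0016$ here, so the agreement is tight.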

2.2 Density estimation

Estimating the pdf of a continuous quantity is much more challenging than estimating the
cdf. If we have sufficient data, the fraction of samples that are smaller than a certain $x$
provides a good estimate for the cdf at that point. However, no matter how much data we
have, there is negligible probability that we will see any samples exactly at $x$: a pointwise
empirical density estimator would equal zero almost everywhere (except at the available
samples).

Figure 5: Cdf of the height data in Figure 1 of Lecture Notes 4 along with three realizations of
the empirical cdf computed with $n$ iid samples for $n = 10, 100, 1000$.
Intuitively, an estimator for the pdf at a point x should take into account the presence
of samples at neighboring locations. If there are many samples close to x then we should
estimate a higher probability density at x, whereas if all the samples are far away, then the
density estimate should be small. Kernel density estimation achieves this by computing a
local weighted average at each point x. The average is weighted by a kernel, which determines
the contribution of the samples depending on their distance to x.

Definition 2.3 (Kernel density estimator). The kernel density estimate with bandwidth $h$
of the distribution of $x_1, \ldots, x_n$ at $x \in \mathbb{R}$ is

$$\widehat{f}_{h,n}(x) := \frac{1}{n h} \sum_{i=1}^{n} k\left(\frac{x - x_i}{h}\right), \qquad (44)$$

where $k$ is a kernel function centered at the origin that satisfies

$$k(x) \ge 0 \quad \text{for all } x \in \mathbb{R}, \qquad (45)$$
$$\int_{\mathbb{R}} k(x)\, \mathrm{d}x = 1. \qquad (46)$$

Choosing a rectangular kernel yields an empirical density estimate that is piecewise constant
and roughly looks like a histogram. A popular alternative is the Gaussian kernel
$k(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$, which produces a smooth density estimate. Figure 6 shows an example where the
Gaussian mixture described in Example 1.5 is estimated using kernel density estimation
from different numbers of iid samples. Figure 7 shows an example with real data: the aim
is to estimate the density of the weight of a sea-snail population.¹ The whole population
consists of 4177 individuals. The kernel density estimate is computed from 200 iid samples
for different values of the kernel bandwidth.
Figures 6 and 7 show the effect of varying the bandwidth parameter or, equivalently, the
width of the kernel. If the bandwidth is very small, individual samples have a large influence
on the density estimate. This makes it possible to reproduce irregular shapes, but it also
yields spurious fluctuations that are not present in the true curve. Increasing the bandwidth
smooths out such fluctuations. However, increasing the bandwidth too much may smooth
out structure that is actually present in the true pdf.
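Equation (44) with a Gaussian kernel takes only a few lines. A minimal sketch that estimates a standard-normal density (an assumed target, for illustration); the true density at $0$ is $1/\sqrt{2\pi} \approx 0.399$:

```python
import math
import random

def gaussian_kernel(u):
    # Nonnegative and integrates to one, as required by (45)-(46).
    return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

def kde(samples, x, h):
    # Equation (44): kernel density estimate with bandwidth h at point x.
    n = len(samples)
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (n * h)

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(10_000)]
```

Evaluating `kde(xs, 0.0, h=0.3)` gives a value close to the true density at the mode; shrinking `h` makes the estimate noisier and enlarging it flattens the peak, the bias-variance trade-off discussed above.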

¹ The data are available at archive.ics.uci.edu/ml/datasets/Abalone

[Grid of plots: kernel density estimates with bandwidth h = 0.1 (left column) and h = 0.5 (right column) for n = 5, 10², and 10⁴ samples (rows), each compared with the true distribution.]

Figure 6: Kernel density estimation for the Gaussian mixture described in Example 1.5 for different
number of iid samples and different values of the kernel bandwidth h.

[Two plots: kernel density estimates with bandwidths 0.05, 0.25, and 0.5 against the true pdf (above), and three estimates with bandwidth 0.25 (below); x-axis: weight (grams).]

Figure 7: Kernel density estimate for the weight of a population of abalone, a species of sea snail.
In the plot above the density is estimated from 200 iid samples using a Gaussian kernel with three
different bandwidths. Black crosses representing the individual samples are shown underneath. In
the plot below we see the result of repeating the procedure three times using a fixed bandwidth
equal to 0.25.