
Lecture 2

Linear algebra
Probability theory
Statistical inference

Let X be the n by 1 matrix (really, just a vector) with all entries equal to 1,
$$X = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}.$$
And consider the span of the columns (really, just one column) of X, the set of all vectors of the form
$$X\beta_0$$
(where $\beta_0$ here is any real number).


Consider some other n-dimensional vector Y,
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}.$$
Maybe Y is in the subspace. But if it is not, we can ask: what is the vector in the subspace closest to Y?
That is, what value of $\beta_0$ minimizes the (squared) distance between $X\beta_0$ and Y,
$$\|Y - X\beta_0\|^2 = (Y - X\beta_0)^T (Y - X\beta_0) = \sum_{i=1}^n (Y_i - \beta_0)^2.$$


Let's solve for the value of $\beta_0$ that minimizes the distance
$$\sum_{i=1}^n (Y_i - \beta_0)^2$$
by differentiating with respect to $\beta_0$ and setting to zero:
$$-2 \sum_{i=1}^n (Y_i - \beta_0) = 0.$$
In matrix notation, differentiate
$$(Y - X\beta_0)^T (Y - X\beta_0)$$
with respect to $\beta_0$ and set to zero:
$$-2 X^T (Y - X\beta_0) = 0.$$


Solving, we obtain for the nearest vector in the subspace
$$X\hat\beta_0 = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \hat\beta_0,$$
where
$$\hat\beta_0 = \frac{\sum_{i=1}^n Y_i}{n},$$
or equivalently,
$$X\hat\beta_0 = X(X^T X)^{-1} X^T Y.$$
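As a quick numerical sketch (NumPy code of my own, not part of the lecture), projecting Y onto the span of the column of ones via $X(X^TX)^{-1}X^TY$ reproduces the sample mean in every coordinate:

```python
import numpy as np

n = 5
X = np.ones((n, 1))          # the n-by-1 column of ones
Y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# beta0_hat = (X^T X)^{-1} X^T Y, which here reduces to the sample mean
beta0_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # shape (1,)
projection = X @ beta0_hat                      # nearest vector in the span

print(beta0_hat[0])          # 6.0, the mean of Y
print(projection)            # every entry equal to 6.0
```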


What if we take a more general (but one-dimensional) X: suppose that the entries of X are arbitrary numbers $X_{i1}$,
$$X = \begin{pmatrix} X_{11} \\ X_{21} \\ \vdots \\ X_{n1} \end{pmatrix}.$$
And consider the span of the columns of X, the set of all vectors of the form
$$X\beta_1$$
(where again $\beta_1$ here is any real number).


And as before, consider some other n-dimensional vector Y,
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}.$$
Maybe Y is in the subspace. But if it is not, we can ask, as before: what is the vector in the subspace closest to Y?
That is, what value of $\beta_1$ minimizes the (squared) distance between $X\beta_1$ and Y,
$$\|Y - X\beta_1\|^2 = (Y - X\beta_1)^T (Y - X\beta_1) = \sum_{i=1}^n (Y_i - \beta_1 X_{i1})^2.$$


Let's solve for the value of $\beta_1$ that minimizes the distance
$$\sum_{i=1}^n (Y_i - \beta_1 X_{i1})^2$$
by differentiating with respect to $\beta_1$ and setting to zero:
$$-2 \sum_{i=1}^n (Y_i - \beta_1 X_{i1}) X_{i1} = 0.$$
In matrix notation, differentiate
$$(Y - X\beta_1)^T (Y - X\beta_1)$$
with respect to $\beta_1$ and set to zero:
$$-2 X^T (Y - X\beta_1) = 0.$$


Solving, we obtain for the nearest vector in the subspace
$$X\hat\beta_1 = \begin{pmatrix} X_{11} \\ X_{21} \\ \vdots \\ X_{n1} \end{pmatrix} \hat\beta_1,$$
where
$$\hat\beta_1 = \frac{\sum_{i=1}^n Y_i X_{i1}}{\sum_{i=1}^n X_{i1}^2},$$
or equivalently,
$$X\hat\beta_1 = X(X^T X)^{-1} X^T Y.$$
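The one-column case can also be checked numerically (a NumPy sketch with made-up data): the scalar formula $\hat\beta_1 = \sum Y_i X_{i1} / \sum X_{i1}^2$ and the matrix projection formula agree:

```python
import numpy as np

# One-column X with arbitrary entries (regression through the origin).
X1 = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([2.0, 4.1, 5.9, 8.2])

beta1_hat = np.sum(Y * X1) / np.sum(X1**2)

# The matrix form X(X^T X)^{-1} X^T Y gives the same projection.
X = X1.reshape(-1, 1)
projection = X @ np.linalg.solve(X.T @ X, X.T @ Y)

print(np.allclose(projection, X1 * beta1_hat))   # True
```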


Now let's go to an n by 2 matrix X:
$$X = \begin{pmatrix} 1 & X_{11} \\ 1 & X_{21} \\ \vdots & \vdots \\ 1 & X_{n1} \end{pmatrix}.$$
And consider the span of the columns of X, the set of all vectors of the form
$$X\beta,$$
where $\beta$ here is the two-dimensional column vector $(\beta_0, \beta_1)^T$.


And as before, consider some other n-dimensional vector Y,
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}.$$
Maybe Y is in the subspace. But if it is not, we can ask, as before: what is the vector in the subspace closest to Y?
That is, what value of the two-dimensional $\beta$ minimizes the (squared) distance between $X\beta$ and Y,
$$\|Y - X\beta\|^2 = (Y - X\beta)^T (Y - X\beta) = \sum_{i=1}^n (Y_i - [\beta_0 + \beta_1 X_{i1}])^2.$$


Let's solve for the value of $\beta$ that minimizes the distance
$$\sum_{i=1}^n (Y_i - [\beta_0 + \beta_1 X_{i1}])^2$$
by taking the gradient with respect to $\beta$ and setting to zero:
$$-2 \sum_{i=1}^n (Y_i - [\beta_0 + \beta_1 X_{i1}]) = 0$$
$$-2 \sum_{i=1}^n (Y_i - [\beta_0 + \beta_1 X_{i1}]) X_{i1} = 0.$$
In matrix notation, take the gradient of
$$(Y - X\beta)^T (Y - X\beta)$$
with respect to $\beta$ and set to zero:
$$-2 X^T (Y - X\beta) = 0.$$


Solving, we obtain for the nearest vector in the subspace
$$X\hat\beta = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \hat\beta_0 + \begin{pmatrix} X_{11} \\ X_{21} \\ \vdots \\ X_{n1} \end{pmatrix} \hat\beta_1,$$
where
$$\hat\beta_0 = \bar Y - \hat\beta_1 \bar X_1, \qquad \hat\beta_1 = \frac{\sum_{i=1}^n (X_{i1} - \bar X_1) Y_i}{\sum_{i=1}^n (X_{i1} - \bar X_1)^2},$$
or equivalently,
$$X\hat\beta = X(X^T X)^{-1} X^T Y.$$
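A short NumPy sketch (my own illustration, with invented data) confirms that the centered slope/intercept formulas and the matrix formula $(X^TX)^{-1}X^TY$ give the same coefficients:

```python
import numpy as np

# Simple linear regression: closed-form slope/intercept vs. the matrix formula.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

xbar, Ybar = x.mean(), Y.mean()
beta1_hat = np.sum((x - xbar) * Y) / np.sum((x - xbar) ** 2)
beta0_hat = Ybar - beta1_hat * xbar

# Matrix version: X has a column of ones and a column of the x's.
X = np.column_stack([np.ones_like(x), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

print(np.allclose(beta_hat, [beta0_hat, beta1_hat]))  # True
```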


In general, for an n by p matrix X, the vector in the span of the columns of X nearest to Y, the so-called projection of Y onto the span of the columns of X, is the vector
$$X\hat\beta,$$
where $\hat\beta$ is the minimizer of
$$\|Y - X\beta\|^2 = (Y - X\beta)^T (Y - X\beta).$$
If we take the gradient with respect to $\beta$ we arrive at
$$X^T (Y - X\beta) = 0,$$
from which it follows that
$$\hat\beta = (X^T X)^{-1} X^T Y.$$
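For the general n by p case, a NumPy sketch (my own, with random data): the normal equations and `np.linalg.lstsq` agree, and the residual is orthogonal to the columns of X. In practice `lstsq` is the numerically preferred route, since forming $(X^TX)^{-1}$ explicitly can amplify rounding error for ill-conditioned X:

```python
import numpy as np

# General n-by-p case: the projection via the normal equations, and via
# np.linalg.lstsq, which is the numerically preferred route in practice.
rng = np.random.default_rng(1)
n, p = 20, 3
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

beta_normal = np.linalg.solve(X.T @ X, X.T @ Y)
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))  # True

# The residual Y - X beta_hat is orthogonal to every column of X.
residual = Y - X @ beta_normal
print(np.allclose(X.T @ residual, 0.0))      # True
```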


Let $\Sigma = D$ be a diagonal matrix with all positive entries.
Note that $\Sigma$ is a simple example of a symmetric, positive definite matrix.
Note that we could write $\Sigma = IDI^T$, where $I$ is the identity matrix.
Note that the columns of $I$ are orthogonal, of unit length, and they are eigenvectors, with eigenvalues equal to the corresponding elements of $D$.


Let c be any unit length vector, and consider the decomposition of c as a weighted sum of the columns of $I$,
$$c = c_1 \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} + c_2 \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix} + \cdots + c_p \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}.$$
What happens when you compute $\Sigma c$? You get
$$\Sigma c = d_1 c_1 \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} + d_2 c_2 \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix} + \cdots + d_p c_p \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}.$$


And if you further compute $c^T \Sigma c$, you get
$$c^T \Sigma c = d_1 c_1 c^T \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} + d_2 c_2 c^T \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix} + \cdots + d_p c_p c^T \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix} = c_1^2 d_1 + c_2^2 d_2 + \cdots + c_p^2 d_p.$$


Suppose you wanted to maximize $c^T \Sigma c$ among unit length c. That is, how do you find c to maximize
$$c_1^2 d_1 + c_2^2 d_2 + \cdots + c_p^2 d_p$$
subject to the constraint that
$$c_1^2 + c_2^2 + \cdots + c_p^2 = 1?$$
Take c to be the eigenvector associated with the largest d!


Let's start all over again. But this time, we'll not take $\Sigma = D$ a diagonal matrix with all positive entries. Instead, take $\Sigma = PDP^T$, where D is again a diagonal matrix with all positive entries, and P is a matrix whose columns are orthonormal (and span p-dimensional space).
Note that $\Sigma$ is a complicated example of a symmetric, positive definite matrix.
Note that we write $\Sigma = PDP^T$, where P is not the identity matrix any more, but rather some other orthonormal matrix.
Note that the columns of P are by definition orthogonal, of unit length.
And just like the columns of I were eigenvectors, so are the columns of P, again with eigenvalues equal to the corresponding elements of D.


Let c be any unit length vector, and consider the decomposition of c as a weighted sum of the columns of P (not of I now, but rather of P),
$$c = c_1 P_1 + c_2 P_2 + \cdots + c_p P_p.$$
(The columns of P are a basis for p-dimensional space.)
What happens when you compute $\Sigma c$? You get
$$\Sigma c = PDP^T c = PDP^T (c_1 P_1 + c_2 P_2 + \cdots + c_p P_p) = PD \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_p \end{pmatrix} = P \begin{pmatrix} d_1 c_1 \\ d_2 c_2 \\ \vdots \\ d_p c_p \end{pmatrix} = c_1 d_1 P_1 + c_2 d_2 P_2 + \cdots + c_p d_p P_p.$$


And if you further compute $c^T \Sigma c$, you get
$$c^T \Sigma c = d_1 c_1 c^T P_1 + d_2 c_2 c^T P_2 + \cdots + d_p c_p c^T P_p = c_1^2 d_1 + c_2^2 d_2 + \cdots + c_p^2 d_p.$$


Suppose you wanted to maximize $c^T \Sigma c$ as a function of unit length vectors c. That is, how do you find c to maximize
$$c_1^2 d_1 + c_2^2 d_2 + \cdots + c_p^2 d_p$$
subject to the constraint that
$$c_1^2 + c_2^2 + \cdots + c_p^2 = 1?$$
Again, take c to be the eigenvector associated with the largest d!
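A numerical sketch of this claim (NumPy code of my own, with a randomly generated positive definite matrix): the quadratic form at the top eigenvector equals the top eigenvalue, and no random unit vector exceeds it:

```python
import numpy as np

# For symmetric positive definite Sigma, c^T Sigma c over unit vectors c
# is maximized by the eigenvector with the largest eigenvalue.
rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + 4 * np.eye(4)           # symmetric positive definite

eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order
top_val, top_vec = eigvals[-1], eigvecs[:, -1]

# The quadratic form at the top eigenvector equals the top eigenvalue ...
print(np.isclose(top_vec @ Sigma @ top_vec, top_val))   # True

# ... and no random unit vector does better.
c = rng.normal(size=(1000, 4))
c /= np.linalg.norm(c, axis=1, keepdims=True)
quad_forms = np.einsum('ij,jk,ik->i', c, Sigma, c)      # c_i^T Sigma c_i
print(np.all(quad_forms <= top_val + 1e-9))             # True
```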


One last fact that will be relevant when we use these results: every symmetric positive definite matrix $\Sigma$ can be written in the form $PDP^T$, where the columns of P are an orthonormal basis for p-dimensional space, those columns are the eigenvectors of $\Sigma$, and the diagonal matrix D has as components the corresponding (positive) eigenvalues.
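This spectral decomposition is easy to exhibit numerically (a NumPy sketch, my own illustration): `np.linalg.eigh` returns exactly the P and D of the slide:

```python
import numpy as np

# Spectral decomposition sketch: eigh returns orthonormal eigenvectors P
# and eigenvalues d with Sigma = P diag(d) P^T.
rng = np.random.default_rng(3)
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3 * np.eye(3)   # symmetric positive definite

d, P = np.linalg.eigh(Sigma)

print(np.allclose(P @ np.diag(d) @ P.T, Sigma))  # True: Sigma = P D P^T
print(np.allclose(P.T @ P, np.eye(3)))           # True: columns orthonormal
print(np.all(d > 0))                             # True: positive eigenvalues
```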


It might help to think of $\Sigma = PDP^T$ as a linear transformation. Think of how it maps the unit sphere...
The transformation corresponding to $\Sigma$ maps the orthonormal eigenvectors, the columns of P, into stretched or shrunken versions of themselves. That is, $\Sigma$ maps the unit sphere into an ellipsoid, with axes equal to the eigenvectors, and the lengths of the axes equal to twice the eigenvalues.
From this point of view, does it make sense that to maximize $c^T \Sigma c$ for c on the unit sphere, one can do no better than taking c equal to the eigenvector with the largest eigenvalue?


Suppose that a p by p matrix $\Sigma$ is symmetric, so that $\Sigma = \Sigma^T$. Suppose also that $\Sigma$ is positive definite, so that for any non-zero p-dimensional vector c, $c^T \Sigma c$ is greater than zero. Then:
All of the eigenvalues of $\Sigma$ are real and positive.
All of the eigenvectors of $\Sigma$ are orthogonal (or have the same eigenvalue).
We can find p linearly independent orthogonal unit-length p-dimensional eigenvectors $P_j$.
Let P be the p by p matrix whose columns are the $P_j$, and let D be the corresponding diagonal matrix whose entries are the eigenvalues. Then
$$\Sigma = PDP^T.$$


Suppose you want to maximize $c^T \Sigma c$ with respect to p-dimensional unit vectors c.
Local maximizers are given by $c = P_j$, and the corresponding local maxima are the eigenvalues.


Suppose you want to maximize $c^T \Sigma c$ with respect to p-dimensional unit vectors c such that
$$c^T P_j = 0$$
for the $P_j$ corresponding to some set of eigenvectors.
Local maximizers are given by the $P_j$ associated with the other eigenvectors, and the corresponding local maxima are the eigenvalues.


The linear transformation $\Sigma$ maps the unit sphere to an ellipsoid with axes the $P_j$, and with the lengths of the axes equal to twice the eigenvalues.
For a vector c, $P^T c$ has as its components the $a_j$ for which
$$\sum_{j=1}^p a_j P_j = c.$$
So $DP^T c$ stretches or shrinks those $a_j$ by the associated eigenvalues, $d_j$, and so $PDP^T c$ is
$$\sum_{j=1}^p a_j d_j P_j.$$
In short,
$$\sum_{j=1}^p a_j P_j \;\mapsto\; \sum_{j=1}^p a_j d_j P_j.$$


Let $\theta$ be a (vector of) random variable(s) with (joint) density $\pi(\theta)$.
Let Y be a (vector of) random variable(s) with (joint) conditional density $f_\theta(y)$ given $\theta$.
The conditional density of $\theta$ given Y = y is
$$\pi(\theta \,|\, y) = \frac{\pi(\theta) f_\theta(y)}{\int \pi(\theta) f_\theta(y)\, d\theta}.$$
The conditional expectation of $\theta$ given Y = y is
$$E\{\theta \,|\, y\} = \int \theta\, \pi(\theta \,|\, y)\, d\theta,$$
and the value of $\theta$ that maximizes the posterior likelihood solves
$$\frac{d}{d\theta} \ln \pi(\theta) + \frac{d}{d\theta} \ln f_\theta(y) = 0.$$


Suppose that X and Y are jointly distributed random variables with joint density $f_{XY}(x, y)$. Then the density of Y is
$$f_Y(y) = \int f_{XY}(x, y)\, dx,$$
the density of X is
$$f_X(x) = \int f_{XY}(x, y)\, dy,$$
and the conditional density of Y given X is
$$f_{Y|X}(y \,|\, x) = f_{XY}(x, y) / f_X(x).$$


The expectations of X and Y and the conditional expectation of Y given X are
$$E\{X\} = \int f_X(x)\, x\, dx$$
$$E\{Y\} = \int f_Y(y)\, y\, dy$$
$$E\{Y \,|\, X\} = \int f_{Y|X}(y \,|\, X)\, y\, dy.$$
And we have
$$E\{Y\} = E\{E\{Y \,|\, X\}\}.$$
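The law of total expectation can be illustrated by Monte Carlo (a sketch of my own; the uniform/normal hierarchy is an invented example, not from the lecture):

```python
import numpy as np

# Monte Carlo sketch of E{Y} = E{E{Y|X}} for a simple hierarchy:
# X ~ Uniform(0, 1) and Y | X ~ Normal(mean=2*X, sd=1),
# so E{Y|X} = 2X and E{Y} = 2 * E{X} = 1.
rng = np.random.default_rng(4)
n = 200_000
X = rng.uniform(0.0, 1.0, size=n)
Y = rng.normal(loc=2 * X, scale=1.0)

print(Y.mean())          # close to 1.0
print((2 * X).mean())    # E{E{Y|X}} estimate, also close to 1.0
```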


The law of the unconscious statistician says that
$$E\{g(X)\} = \int f_X(x)\, g(x)\, dx,$$
so that also
$$\mathrm{Var}\{g(X)\} = \int f_X(x)\, (g(x) - E\{g(X)\})^2\, dx.$$


The variance of Y and the conditional variance of Y given X are
$$\mathrm{Var}(Y) = \int f_Y(y)\, (y - E\{Y\})^2\, dy$$
$$\mathrm{Var}(Y \,|\, X) = \int f_{Y|X}(y \,|\, X)\, (y - E\{Y \,|\, X\})^2\, dy.$$


The covariance between two random variables X and Y is defined as
$$E\{(Y - E\{Y\})(X - E\{X\})\},$$
and we have
$$\mathrm{Var}(Y) = E\{\mathrm{Var}(Y \,|\, X)\} + \mathrm{Var}(E\{Y \,|\, X\}).$$
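The law of total variance can be checked the same way (my own Monte Carlo sketch, same invented uniform/normal hierarchy as above):

```python
import numpy as np

# Monte Carlo sketch of Var(Y) = E{Var(Y|X)} + Var(E{Y|X}):
# X ~ Uniform(0,1), Y | X ~ Normal(2*X, 1).
# Here Var(Y|X) = 1 always and E{Y|X} = 2X, so
# Var(Y) = 1 + Var(2X) = 1 + 4/12.
rng = np.random.default_rng(5)
n = 500_000
X = rng.uniform(0.0, 1.0, size=n)
Y = rng.normal(loc=2 * X, scale=1.0)

lhs = Y.var()
rhs = 1.0 + (2 * X).var()   # E{Var(Y|X)} + Var(E{Y|X})

print(abs(lhs - rhs) < 0.02)            # True
print(abs(lhs - (1 + 4 / 12)) < 0.02)   # True
```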


For a vector of random variables X, we define the expectation vector: E{X} is the vector with entries equal to the expectations of the components of X.


And we define the covariance matrix, Cov(X),
$$\mathrm{Cov}(X) = E\{(X - E\{X\})(X - E\{X\})^T\},$$
with diagonal entries equal to the variances of the components of X, and the covariances arranged in the off-diagonals.
Note that a covariance matrix is symmetric and, as long as the components of X are not linear functions of each other, positive definite.


Independence
Random variables are independent if their joint density is equal to the product of their marginals.
Independence captures the notion of one random variable's value having no implications for the value of the other.
If two random variables are independent, their covariance is equal to zero.


If X is a q-dimensional vector of random variables with expectation vector $\mu$ and covariance matrix $\Sigma$, and if M is an r by q matrix of constants, and $\nu$ is an r-dimensional vector of constants, then
$$E\{MX + \nu\} = M\mu + \nu$$
and
$$\mathrm{Cov}(MX + \nu) = M\,\mathrm{Cov}(X)\,M^T = M \Sigma M^T.$$
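A Monte Carlo sketch of the transformation rule (my own NumPy illustration; the particular $\Sigma$, M, and $\nu$ are invented):

```python
import numpy as np

# Monte Carlo sketch of Cov(MX + nu) = M Cov(X) M^T.
rng = np.random.default_rng(6)
n = 200_000
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=n)  # n draws

M = np.array([[1.0, 2.0], [0.0, 3.0]])
nu = np.array([5.0, -1.0])
Z = X @ M.T + nu                 # each row is M x + nu

empirical = np.cov(Z, rowvar=False)
theoretical = M @ Sigma @ M.T

print(np.allclose(empirical, theoretical, atol=0.15))  # True
```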


Chebyshev's inequality:
$$P\{|X - \mu| \geq \varepsilon\} \leq \mathrm{Var}(X)/\varepsilon^2.$$


Suppose the $X_i$ are all independent, i from 1 to n, and suppose that each $X_i$ has a finite variance (which we will denote $\sigma_i^2$). Then the variance of $\bar X$, that is, the variance of
$$\frac{1}{n} \sum_{i=1}^n X_i,$$
is equal to
$$\frac{1}{n^2} \sum_{i=1}^n \sigma_i^2.$$
And, in particular, if the $\sigma_i^2$ have an upper bound in common, then the variance tends to zero with large values of n.
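A simulation sketch of this variance formula (my own, with an invented common variance of 4, so all $\sigma_i^2 = 4$ and $\mathrm{Var}(\bar X) = 4/n$):

```python
import numpy as np

# The variance of the sample mean of n independent draws is
# (1/n^2) * sum of the individual variances; here every sigma_i^2 = 4,
# so Var(Xbar) = 4 / n.
rng = np.random.default_rng(7)
n, reps = 50, 100_000
means = rng.normal(loc=0.0, scale=2.0, size=(reps, n)).mean(axis=1)

print(abs(means.var() - 4 / n) < 0.005)   # True: close to 4/50 = 0.08
```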


In this situation, from Chebyshev's inequality, we find that
$$P\{|\bar X - \bar\mu| \geq \varepsilon\}$$
is also small. Here, $\bar\mu$ is the average of the expectations of the $X_i$.
In short, by taking more data, we can learn.


With minimal technical assumptions beyond the finite variance of the independent $X_i$, we can go beyond the behavior of $\bar X - \bar\mu$ and consider not just that it tends to zero, but also how it varies around zero:
$$P\left\{ \sqrt{n}\left( \frac{1}{n}\sum_{i=1}^n X_i - \frac{1}{n}\sum_{i=1}^n E\{X_i\} \right) \leq x \sqrt{\frac{1}{n}\sum_{i=1}^n \sigma_i^2} \right\} \;\to\; \int_{-\infty}^x \frac{e^{-t^2/2}}{\sqrt{2\pi}}\, dt.$$
Not only can we learn, we can know how well we've learned!
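A central limit theorem sketch (my own simulation, using invented exponential data to emphasize that the summands need not be normal): standardized means behave like a standard normal, so their distribution function matches $\Phi$:

```python
import numpy as np

# CLT sketch: standardized means of skewed (exponential) draws look
# standard normal; compare P{Z <= 0} and P{Z <= 1} to Phi(0), Phi(1).
rng = np.random.default_rng(8)
n, reps = 400, 100_000
samples = rng.exponential(scale=1.0, size=(reps, n))   # mean 1, variance 1

Z = np.sqrt(n) * (samples.mean(axis=1) - 1.0) / 1.0    # standardized means

print(abs((Z <= 0).mean() - 0.5) < 0.02)      # True: Phi(0) = 0.5
print(abs((Z <= 1).mean() - 0.8413) < 0.02)   # True: Phi(1) is about 0.8413
```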


When you analyze data,


There is data
A statistical method is applied
There are the results of your method
You also produce some indication of the precision of your results
The results and precision estimates are used to draw conclusions

How do you know what method to apply?


Given a statistical model, and given an analytic goal, there is (almost always) an appropriate method already in SAS, R, SPSS, Matlab, Minitab, Systat, et cetera.
What is a statistical model?
What is an analytic goal?
How does one elicit them from the client?


A probability model has
A sample space for the observables (the random variables)
A joint distribution on the sample space
The joint distribution reflects all the sources of variability that are inherent in the random variables.


A statistical model is a family of probability models on the sample space for the observables (the random variables).
What is specified about the joint distribution reflects what is known about the distribution of the random variables.
What is unspecified reflects what is unknown about the distribution of the random variables.
That we have random variables reflects that even if we knew everything that could be known about the distribution, there would still be randomness.
Parameters index the possible distributions for the data.


What must be considered in devising a statistical model? What is known, what is unknown about:
The sources of variability
The sampling plan
Mechanisms underlying the phenomena under examination
Counterfactuals: issues of causation and confounding
Practical issues relating to complexity, sample size, and computation


The analytic goal is a statement of the researcher's goal in terms of the parameters. The mathematical version of this is decision theory:
Parameters $\theta$ indexing probability models on outcomes Y
Possible actions A
A loss associated with parameter-action pairs, $L(\theta, a)$
Decision rules map data to actions, $d : Y \to A$.
We evaluate decision rules via
$$E_\theta\{L(\theta, d(Y))\}.$$


n subjects, randomly assigned to treatment or placebo


Cure or Failure recorded for all
Researchers wish to convince EPA that the treatment is

efficacious, but only if it really is


Model? Analytic Goal?


n patient charts chosen at random from a physicians practice


Total gains generated by up-coding recorded for each
Prosecutors need to assess the total gains in order to recommend

the amount to be recovered


Model? Analytic Goal?


n loan applications, or cell histologies, or examples of past

weather patterns
associated foreclosure outcomes, or cancer outcome, or rainfall
Researchers want to help others do prediction with new data

Model? Analytic Goal?


A parameterization is a mapping from the parameter space to the probability models for the data.
The likelihood is the density (or probability mass function) of the observed data as a function of the parameter.
The maximum likelihood estimator is the value of the parameter that maximizes the likelihood.
We usually find the MLE by differentiating the logarithm of the likelihood and setting it to zero.
Maximum likelihood estimates are
Asymptotically unbiased ($E_\theta\{\hat\theta\} \to \theta$)
Asymptotically efficient, in the sense of attaining the smallest variance among (asymptotically) unbiased estimates
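A small worked MLE (my own NumPy sketch, with simulated data): for a Normal sample, setting the derivative of the log-likelihood to zero gives the sample mean and the 1/n variance:

```python
import numpy as np

# MLE sketch for a Normal(mu, sigma^2) sample: solving the score equations
# gives mu_hat = sample mean and sigma2_hat = (1/n) * sum (x_i - mu_hat)^2.
rng = np.random.default_rng(9)
x = rng.normal(loc=3.0, scale=2.0, size=10_000)

mu_hat = x.mean()
sigma2_hat = np.mean((x - mu_hat) ** 2)   # note the 1/n, not 1/(n-1)

print(abs(mu_hat - 3.0) < 0.1)       # True: close to the true mean
print(abs(sigma2_hat - 4.0) < 0.2)   # True: close to the true variance
```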


Ordinary least squares linear regression as maximum likelihood.


The (conditional) model
The likelihood
The score equations


Mixture models and the EM algorithm


Mixture models when the component identifiers are available
The likelihood when they are not
An iterative approach to estimation


Suppose you really believe that the parameter $\theta$ has a distribution $\pi(\theta)$, and that nature or god or ... chose $\theta$ from that distribution.
And suppose you wanted to estimate $\theta$.
Suppose we have some loss function, say
$$L(\hat\theta, \theta),$$
so that we need to find $\hat\theta$ to minimize
$$E\{L(\hat\theta, \theta)\}.$$
What expectation are we talking about? The expectation over $\theta$!
Find $\hat\theta$ to minimize
$$E\{L(\hat\theta(Y), \theta) \,|\, Y\}.$$
Or maybe just approximate that optimal choice with the expectation or mode or ...


Bayes theorem:
$$\theta \sim \pi(\theta), \qquad Y \,|\, \theta \sim f_\theta(y),$$
$$\pi(\theta \,|\, y) = \frac{\pi(\theta) f_\theta(y)}{\int \pi(\theta) f_\theta(y)\, d\theta}.$$
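Bayes theorem can be exercised numerically (a sketch of my own; the Beta-Binomial coin example is an invented illustration, not from the lecture): a grid application of prior times likelihood, normalized, recovers the known conjugate posterior:

```python
import numpy as np

# Beta(a, b) prior on a coin's heads probability theta, Binomial likelihood
# with 7 heads and 3 tails. The conjugate posterior is Beta(a+heads, b+tails);
# we check that against a brute-force grid application of Bayes theorem.
a, b = 2.0, 2.0
heads, tails = 7, 3

theta = np.linspace(1e-6, 1 - 1e-6, 100_001)
prior = theta ** (a - 1) * (1 - theta) ** (b - 1)
likelihood = theta ** heads * (1 - theta) ** tails

weights = prior * likelihood
weights = weights / weights.sum()          # discretized posterior pi(theta|y)

post_mean = (theta * weights).sum()
exact_mean = (a + heads) / (a + b + heads + tails)   # mean of Beta(9, 5)
print(abs(post_mean - exact_mean) < 1e-4)  # True
```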


And suppose you wanted to estimate $\theta$ after observing some data generated according to $f_\theta(y)$.
Suppose we have some loss function, say
$$L(\hat\theta, \theta),$$
so that we need to find a function $\hat\theta(y)$ to minimize
$$E\{L(\hat\theta(Y), \theta)\}.$$
What expectation are we talking about? The expectation over $\theta$ and Y!
$$E\{L(\hat\theta(Y), \theta)\} = E\{E\{L(\hat\theta(Y), \theta) \,|\, \theta\}\} = E\{E\{L(\hat\theta(Y), \theta) \,|\, Y\}\}$$
Find $\hat\theta(Y)$ to minimize
$$E\{L(\hat\theta(Y), \theta) \,|\, Y\}.$$
Or maybe just approximate that optimal choice with the posterior expectation, or mode, or ...