
# Lecture 2: Linear algebra, probability theory, statistical inference

Let $X$ be the $n \times 1$ matrix (really, just a vector) with all entries equal to 1,
$$X = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix},$$
and consider the span of the columns (really, just one column) of $X$: the set of all vectors of the form
$$X\beta_0$$
(where $\beta_0$ here is any real number).


## Consider some other n-dimensional vector Y

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}$$

Maybe $Y$ is in the subspace. But if it is not, we can ask: what is the vector in the subspace closest to $Y$?

That is, what value of $\beta_0$ minimizes the (squared) distance between $X\beta_0$ and $Y$,
$$\|Y - X\beta_0\|^2 = (Y - X\beta_0)^T (Y - X\beta_0) = \sum_{i=1}^n (Y_i - \beta_0)^2.$$


To minimize, differentiate
$$\sum_{i=1}^n (Y_i - \beta_0)^2$$
with respect to $\beta_0$ and set the derivative to zero:
$$-2\sum_{i=1}^n (Y_i - \beta_0) = 0.$$

## In matrix notation, differentiate

$$(Y - X\beta_0)^T (Y - X\beta_0)$$
with respect to $\beta_0$ and set to zero:
$$-2X^T(Y - X\beta_0) = 0.$$


## Solving, we obtain for the nearest vector in the subspace

$$X\hat\beta_0 = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}\hat\beta_0,$$
where
$$\hat\beta_0 = \frac{\sum_{i=1}^n Y_i}{n} = \bar Y,$$
or equivalently,
$$X\hat\beta_0 = X(X^TX)^{-1}X^TY.$$
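As a quick numerical check (a sketch using NumPy with made-up data): projecting $Y$ onto the span of the all-ones column reproduces the sample mean in every coordinate.

```python
import numpy as np

# Made-up data, purely for illustration
Y = np.array([2.0, 5.0, 3.0, 6.0])
n = len(Y)
X = np.ones((n, 1))  # the n-by-1 column of ones

# Projection of Y onto the span of the columns of X: X (X^T X)^{-1} X^T Y
proj = X @ np.linalg.solve(X.T @ X, X.T @ Y)

print(proj)      # [4. 4. 4. 4.] -- the sample mean, repeated
print(Y.mean())  # 4.0
```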


## What if we take a more general (but one-dimensional) X?

Suppose that the entries of $X$ are arbitrary numbers $X_{i1}$,
$$X = \begin{pmatrix} X_{11} \\ X_{21} \\ \vdots \\ X_{n1} \end{pmatrix},$$
and consider the span of the columns of $X$: the set of all vectors of the form
$$X\beta_1$$
(where again $\beta_1$ here is any real number).


## And as before, consider some other n-dimensional vector Y

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}$$

Maybe $Y$ is in the subspace. But if it is not, we can ask, as before: what is the vector in the subspace closest to $Y$?

That is, what value of $\beta_1$ minimizes the (squared) distance between $X\beta_1$ and $Y$,
$$\|Y - X\beta_1\|^2 = (Y - X\beta_1)^T (Y - X\beta_1) = \sum_{i=1}^n (Y_i - \beta_1 X_{i1})^2.$$


To minimize, differentiate
$$\sum_{i=1}^n (Y_i - \beta_1 X_{i1})^2$$
with respect to $\beta_1$ and set the derivative to zero:
$$-2\sum_{i=1}^n (Y_i - \beta_1 X_{i1})\,X_{i1} = 0.$$

## In matrix notation, differentiate

$$(Y - X\beta_1)^T (Y - X\beta_1)$$
with respect to $\beta_1$ and set to zero:
$$-2X^T(Y - X\beta_1) = 0.$$


## Solving, we obtain for the nearest vector in the subspace

$$X\hat\beta_1 = \begin{pmatrix} X_{11} \\ X_{21} \\ \vdots \\ X_{n1} \end{pmatrix}\hat\beta_1,$$
where
$$\hat\beta_1 = \frac{\sum_{i=1}^n Y_i X_{i1}}{\sum_{i=1}^n X_{i1}^2},$$
or equivalently,
$$X\hat\beta_1 = X(X^TX)^{-1}X^TY.$$
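A small numerical sketch (made-up data) checking that the closed-form $\hat\beta_1$ agrees with the matrix formula $(X^TX)^{-1}X^TY$:

```python
import numpy as np

# Made-up one-column X and response Y, purely for illustration
x1 = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([2.0, 4.1, 5.9, 8.2])
X = x1.reshape(-1, 1)

# Closed form: beta1_hat = sum_i Y_i X_i1 / sum_i X_i1^2
beta1_hat = np.sum(Y * x1) / np.sum(x1 ** 2)

# Matrix form: (X^T X)^{-1} X^T Y gives the same number
beta1_matrix = np.linalg.solve(X.T @ X, X.T @ Y)[0]

print(beta1_hat, beta1_matrix)  # the two agree
```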


## Now let's go to an n by 2 matrix X

$$X = \begin{pmatrix} 1 & X_{11} \\ 1 & X_{21} \\ \vdots & \vdots \\ 1 & X_{n1} \end{pmatrix}$$

And consider the span of the columns of $X$, the set of all vectors of the form
$$X\beta,$$
where $\beta$ here is the two-dimensional column vector $(\beta_0, \beta_1)^T$.


## And as before, consider some other n-dimensional vector Y

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}$$

Maybe $Y$ is in the subspace. But if it is not, we can ask, as before: what is the vector in the subspace closest to $Y$?

That is, what value of the two-dimensional $\beta$ minimizes the (squared) distance between $X\beta$ and $Y$,
$$\|Y - X\beta\|^2 = (Y - X\beta)^T (Y - X\beta) = \sum_{i=1}^n \left(Y_i - [\beta_0 + \beta_1 X_{i1}]\right)^2.$$


## Let's solve for the value of β to minimize the distance

$$\sum_{i=1}^n \left(Y_i - [\beta_0 + \beta_1 X_{i1}]\right)^2$$

by taking the gradient with respect to $\beta$ and setting it to zero:
$$-2\sum_{i=1}^n \left(Y_i - [\beta_0 + \beta_1 X_{i1}]\right) = 0$$
$$-2\sum_{i=1}^n \left(Y_i - [\beta_0 + \beta_1 X_{i1}]\right) X_{i1} = 0$$

## In matrix notation, take the gradient of

$$(Y - X\beta)^T (Y - X\beta)$$
with respect to $\beta$ and set to zero:
$$-2X^T(Y - X\beta) = 0.$$


## Solving, we obtain for the nearest vector in the subspace

$$X\hat\beta = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}\hat\beta_0 + \begin{pmatrix} X_{11} \\ X_{21} \\ \vdots \\ X_{n1} \end{pmatrix}\hat\beta_1,$$
where
$$\hat\beta_0 = \bar Y - \hat\beta_1 \bar X_1, \qquad \hat\beta_1 = \frac{\sum_{i=1}^n (X_{i1} - \bar X_1)\,Y_i}{\sum_{i=1}^n (X_{i1} - \bar X_1)^2},$$
or equivalently,
$$X\hat\beta = X(X^TX)^{-1}X^TY.$$
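The two closed-form estimates can be checked against the matrix formula numerically; this sketch uses simulated data with assumed true values 1.5 and 2.0.

```python
import numpy as np

# Simulated data (assumed values, for illustration): Y = 1.5 + 2.0 x + noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
Y = 1.5 + 2.0 * x1 + rng.normal(size=50)

# Closed-form estimates
beta1_hat = np.sum((x1 - x1.mean()) * Y) / np.sum((x1 - x1.mean()) ** 2)
beta0_hat = Y.mean() - beta1_hat * x1.mean()

# Matrix form with the n-by-2 design matrix [1, x1]
X = np.column_stack([np.ones_like(x1), x1])
beta_matrix = np.linalg.solve(X.T @ X, X.T @ Y)

print(beta0_hat, beta1_hat)
print(beta_matrix)  # the same two numbers
```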


## In general, for an n by p matrix X

The vector in the span of the columns of $X$ nearest to $Y$, the so-called projection of $Y$ onto the span of the columns of $X$, is the vector
$$X\hat\beta,$$
where $\hat\beta$ is the minimizer of
$$\|Y - X\beta\|^2 = (Y - X\beta)^T (Y - X\beta).$$
If we take the gradient with respect to $\beta$ we arrive at
$$X^T(Y - X\beta) = 0,$$
from which it follows that
$$\hat\beta = (X^TX)^{-1}X^TY.$$
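A sketch of the general case with random (assumed) data; the defining property of the projection is that the residual is orthogonal to every column of $X$.

```python
import numpy as np

# Random design (assumed shapes, for illustration)
rng = np.random.default_rng(1)
n, p = 20, 3
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

# beta_hat = (X^T X)^{-1} X^T Y, via a linear solve rather than an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# The residual is orthogonal to every column of X: X^T (Y - X beta_hat) = 0
residual = Y - X @ beta_hat
print(X.T @ residual)  # numerically zero
```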


## Let Σ = D be a diagonal matrix with all positive entries

- Note that $\Sigma$ is a simple example of a symmetric, positive definite matrix.
- Note that we could write $\Sigma = IDI^T$, where $I$ is the identity matrix.
- Note that the columns of $I$ are orthogonal, of unit length, and they are eigenvectors, with eigenvalues equal to the corresponding elements of $D$.


## Let c be any unit length vector

Consider the decomposition of $c$ as a weighted sum of the columns of $I$,
$$c = c_1\begin{pmatrix}1\\0\\\vdots\\0\end{pmatrix} + c_2\begin{pmatrix}0\\1\\\vdots\\0\end{pmatrix} + \cdots + c_p\begin{pmatrix}0\\0\\\vdots\\1\end{pmatrix}.$$

What happens when you compute $\Sigma c$? You get
$$\Sigma c = d_1 c_1\begin{pmatrix}1\\0\\\vdots\\0\end{pmatrix} + d_2 c_2\begin{pmatrix}0\\1\\\vdots\\0\end{pmatrix} + \cdots + d_p c_p\begin{pmatrix}0\\0\\\vdots\\1\end{pmatrix}.$$


## And if you further compute c^TΣc, you get

$$c^T\Sigma c = d_1 c_1 c^T\begin{pmatrix}1\\0\\\vdots\\0\end{pmatrix} + d_2 c_2 c^T\begin{pmatrix}0\\1\\\vdots\\0\end{pmatrix} + \cdots + d_p c_p c^T\begin{pmatrix}0\\0\\\vdots\\1\end{pmatrix} = c_1^2 d_1 + c_2^2 d_2 + \cdots + c_p^2 d_p.$$


## Suppose you wanted to maximize c^TΣc among unit length c

That is, how do you find $c$ to maximize
$$c_1^2 d_1 + c_2^2 d_2 + \cdots + c_p^2 d_p$$
subject to the constraint that
$$c_1^2 + c_2^2 + \cdots + c_p^2 = 1?$$

Take $c$ to be the eigenvector associated with the largest $d$!


Let's start all over again. But this time, we'll not take $\Sigma = D$ a diagonal matrix with all positive entries. Instead, take $\Sigma = PDP^T$, where $D$ is again a diagonal matrix with all positive entries, and $P$ is a matrix whose columns are orthonormal (and span $p$-dimensional space).

- Note that $\Sigma$ is a complicated example of a symmetric, positive definite matrix.
- Note that we write $\Sigma = PDP^T$, where $P$ is not the identity matrix any more, but rather some other orthonormal matrix.
- Note that the columns of $P$ are by definition orthogonal, of unit length.
- And just like the columns of $I$ were eigenvectors, so are the columns of $P$, again with eigenvalues equal to the corresponding elements of $D$.


## Let c be any unit length vector

Consider the decomposition of $c$ as a weighted sum of the columns of $P$ (not of $I$ now, but rather of $P$),
$$c = c_1 P_1 + c_2 P_2 + \cdots + c_p P_p.$$
(The columns of $P$ are a basis for $p$-dimensional space.)

What happens when you compute $\Sigma c$? You get
$$\Sigma c = PDP^T c = PDP^T(c_1 P_1 + c_2 P_2 + \cdots + c_p P_p)$$
$$= PD\begin{pmatrix}c_1\\c_2\\\vdots\\c_p\end{pmatrix} = P\begin{pmatrix}d_1 c_1\\d_2 c_2\\\vdots\\d_p c_p\end{pmatrix} = c_1 d_1 P_1 + c_2 d_2 P_2 + \cdots + c_p d_p P_p.$$


## And if you further compute c^TΣc, you get

$$c^T\Sigma c = d_1 c_1 c^T P_1 + d_2 c_2 c^T P_2 + \cdots + d_p c_p c^T P_p = c_1^2 d_1 + c_2^2 d_2 + \cdots + c_p^2 d_p.$$


## Suppose you wanted to maximize c^TΣc as a function of unit length vectors c

That is, how do you find $c$ to maximize
$$c_1^2 d_1 + c_2^2 d_2 + \cdots + c_p^2 d_p$$
subject to the constraint that
$$c_1^2 + c_2^2 + \cdots + c_p^2 = 1?$$

Again, take $c$ to be the eigenvector associated with the largest $d$!


One last fact that will be relevant when we use these results: every symmetric positive definite matrix $\Sigma$ can be written in the form $PDP^T$, where the columns of $P$ are an orthonormal basis for $p$-dimensional space, the columns are the eigenvectors of $\Sigma$, and the diagonal matrix $D$ has as components the corresponding (positive) eigenvalues.


## It might help to think of Σ = PDP^T as a linear transformation

Think of how it maps the unit sphere. The transformation corresponding to $\Sigma$ maps the orthonormal eigenvectors, the columns of $P$, into stretched or shrunken versions of themselves. That is, $\Sigma$ maps the unit sphere into an ellipsoid, with axes equal to the eigenvectors, and the lengths of the axes equal to twice the eigenvalues.

From this point of view, does it make sense that to maximize $c^T\Sigma c$ for $c$ on the unit sphere, one can do no better than taking $c$ equal to the eigenvector with the largest eigenvalue?


## Suppose that a p by p matrix Σ is symmetric, so that Σ = Σ^T

Suppose also that $\Sigma$ is positive definite, so that for any non-zero $p$-dimensional vector $c$, $c^T\Sigma c$ is greater than zero. Then:

- All of the eigenvalues of $\Sigma$ are real and positive.
- All of the eigenvectors of $\Sigma$ are orthogonal (or share the same eigenvalue).
- We can find $p$ linearly independent, orthogonal, unit-length $p$-dimensional eigenvectors $P_j$.
- Let $P$ be the $p$ by $p$ matrix whose columns are the $P_j$, and let $D$ be the corresponding diagonal matrix whose entries are the eigenvalues. Then
$$\Sigma = PDP^T.$$
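These facts can be seen numerically with `numpy.linalg.eigh`, which is designed for symmetric matrices; the matrix below is built from random (assumed) entries.

```python
import numpy as np

# Build a symmetric, positive definite Sigma (assumed values, for illustration)
rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + 4 * np.eye(4)  # A A^T is PSD; adding 4I makes it safely PD

# eigh returns eigenvalues in ascending order and orthonormal eigenvectors
eigenvalues, P = np.linalg.eigh(Sigma)
D = np.diag(eigenvalues)

print(np.allclose(Sigma, P @ D @ P.T))  # True: Sigma = P D P^T
print(np.allclose(P.T @ P, np.eye(4)))  # True: columns of P are orthonormal
print(np.all(eigenvalues > 0))          # True: positive definite

# The unit vector maximizing c^T Sigma c is the last column of P
c_best = P[:, -1]
c_rand = rng.normal(size=4)
c_rand /= np.linalg.norm(c_rand)
print(c_best @ Sigma @ c_best)  # equals the largest eigenvalue
print(c_rand @ Sigma @ c_rand)  # no larger than that
```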


## Suppose you want to maximize c^TΣc with respect to p-dimensional unit vectors c

- Local maximizers are given by $c = P_j$,
- and the corresponding local maxima are the eigenvalues.


## Suppose you want to maximize c^TΣc with respect to p-dimensional unit vectors c such that

$$c^T P_j = 0$$
for the $P_j$ corresponding to some set of eigenvectors.

- Local maximizers are given by the $P_j$ associated with the other eigenvectors,
- and the corresponding local maxima are the eigenvalues.


## The linear transformation Σ maps the unit sphere to an ellipsoid

- The axes are the $P_j$, with the lengths of the axes equal to twice the eigenvalues.
- For a vector $c$, $P^T c$ has as its components the $a_j$ for which
$$\sum_{j=1}^p a_j P_j = c.$$
- So $DP^T c$ stretches or shrinks those $a_j$ by the associated eigenvalues $\lambda_j$, and so $PDP^T c$ is
$$\sum_{j=1}^p a_j \lambda_j P_j.$$

In short,
$$\sum_{j=1}^p a_j P_j \;\mapsto\; \sum_{j=1}^p a_j \lambda_j P_j.$$


## Let θ be a (vector of) random variable(s) with (joint) density π(θ)

Let $Y$ be a (vector of) random variable(s) with (joint) conditional density $f_\theta(y)$ given $\theta$.

The conditional density of $\theta$ given $Y = y$ is
$$\pi(\theta \mid y) = \frac{\pi(\theta) f_\theta(y)}{\int \pi(\theta) f_\theta(y)\, d\theta}.$$
The conditional expectation of $\theta$ given $Y = y$ is
$$E\{\theta \mid y\} = \int \theta\, \pi(\theta \mid y)\, d\theta,$$
and the value of $\theta$ that maximizes the posterior likelihood solves
$$\frac{d}{d\theta} \ln \pi(\theta) + \frac{d}{d\theta} \ln f_\theta(y) = 0.$$


## Suppose that X and Y are jointly distributed random variables with joint density f_XY(x, y)

The density of $Y$ is
$$f_Y(y) = \int f_{XY}(x, y)\, dx,$$
the density of $X$ is
$$f_X(x) = \int f_{XY}(x, y)\, dy,$$
and the conditional density of $Y$ given $X = x$ is
$$f_{Y|X}(y \mid x) = f_{XY}(x, y)/f_X(x).$$


## The expectations of X and Y and the conditional expectation of Y given X are

$$E\{X\} = \int f_X(x)\, x\, dx$$
$$E\{Y\} = \int f_Y(y)\, y\, dy$$
$$E\{Y \mid X\} = \int f_{Y|X}(y \mid X)\, y\, dy$$

And we have
$$E\{Y\} = E\{E\{Y \mid X\}\}.$$
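A Monte Carlo sketch of the tower property, for an assumed hierarchical model (X uniform, Y conditionally normal):

```python
import numpy as np

# Assumed model: X ~ Uniform(0, 1) and Y | X ~ Normal(2 X, 1),
# so E{Y|X} = 2 X and E{Y} = E{E{Y|X}} = 2 E{X} = 1.
rng = np.random.default_rng(3)
N = 200_000
X = rng.uniform(0.0, 1.0, size=N)
Y = rng.normal(loc=2.0 * X, scale=1.0)

print(Y.mean())          # close to 1.0, the direct estimate of E{Y}
print((2.0 * X).mean())  # also close to 1.0, the estimate of E{E{Y|X}}
```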


For a function $g$,
$$E\{g(X)\} = \int f_X(x)\, g(x)\, dx,$$
so that also
$$\mathrm{Var}\{g(X)\} = \int f_X(x)\, (g(x) - E\{g(X)\})^2\, dx.$$


## The variance of Y and the conditional variance of Y given X are

$$\mathrm{Var}(Y) = \int f_Y(y)\,(y - E\{Y\})^2\, dy$$
$$\mathrm{Var}(Y \mid X) = \int f_{Y|X}(y \mid X)\,(y - E\{Y \mid X\})^2\, dy$$


## The covariance between two random variables X and Y is defined as

$$E\{(Y - E\{Y\})(X - E\{X\})\},$$

and we have
$$\mathrm{Var}(Y) = E\{\mathrm{Var}(Y \mid X)\} + \mathrm{Var}(E\{Y \mid X\}).$$
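The variance decomposition can be sketched by simulation, for an assumed hierarchical model with X uniform and Y conditionally normal:

```python
import numpy as np

# Assumed model: X ~ Uniform(0, 1), Y | X ~ Normal(2 X, 1). Then
# Var(Y|X) = 1 and E{Y|X} = 2 X, so the decomposition says
# Var(Y) = E{Var(Y|X)} + Var(E{Y|X}) = 1 + 4/12, about 1.333.
rng = np.random.default_rng(4)
N = 500_000
X = rng.uniform(0.0, 1.0, size=N)
Y = rng.normal(loc=2.0 * X, scale=1.0)

total = Y.var()
decomposed = 1.0 + (2.0 * X).var()
print(total, decomposed)  # both close to 1.333
```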


## For a vector of random variables X, we define the expectation vector

$E\{X\}$ is the vector with entries equal to the expectations of the components of $X$.


## And we define the covariance matrix, Cov(X)

$$\mathrm{Cov}(X) = E\{(X - E\{X\})(X - E\{X\})^T\},$$
with diagonal entries equal to the variances of the components of $X$, and the covariances arranged in the off-diagonals.

Note that a covariance matrix is symmetric and, as long as the components of $X$ are not linear functions of each other, positive definite.


## Independence

- Random variables are independent if their joint density is equal to the product of their marginals.
- Independence captures the notion of one random variable's value having no implications for the value of the other.
- If two random variables are independent, their covariance is equal to zero.


## If X is a q-dimensional vector of random variables

with expectation vector $\mu$ and covariance matrix $\Sigma$, and if $M$ is an $r$ by $q$ matrix of constants, and $\gamma$ is an $r$-dimensional vector of constants, then
$$E\{MX + \gamma\} = M\mu + \gamma$$
and
$$\mathrm{Cov}(MX + \gamma) = M\,\mathrm{Cov}(X)\,M^T.$$
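A numerical sketch of the covariance rule, with assumed values for M, the shift vector, and Cov(X):

```python
import numpy as np

# Assumed covariance of X, transformation M, and shift gamma
rng = np.random.default_rng(5)
N = 200_000
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=N)  # rows are draws of X
M = np.array([[1.0, 2.0],
              [0.0, 3.0]])
gamma = np.array([10.0, -5.0])

Z = X @ M.T + gamma            # each row is M x + gamma
empirical = np.cov(Z, rowvar=False)
theoretical = M @ Sigma @ M.T  # [[8.0, 7.5], [7.5, 9.0]]
print(empirical)
print(theoretical)  # the two matrices agree to about two decimal places
```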


Chebyshev's inequality:
$$P\{|X - \mu| \ge \epsilon\} \le \mathrm{Var}(X)/\epsilon^2$$


## Suppose the Xi are all independent, i from 1 to n

and suppose that each $X_i$ has a finite variance (which we will denote $\sigma_i^2$). Then the variance of $\bar X$, that is, the variance of
$$\bar X = \frac{1}{n}\sum_{i=1}^n X_i,$$
is equal to
$$\frac{1}{n^2}\sum_{i=1}^n \sigma_i^2.$$

And, in particular, if the $\sigma_i^2$ have an upper bound in common, then the variance tends to zero with large values of $n$.


## In this situation, from Chebyshev's inequality, we find that

$$P\{|\bar X - \bar\mu| \ge \epsilon\}$$

is also small. Here, $\bar\mu$ is the average of the expectations of the $X_i$.

In short, by taking more data, we can learn.
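A quick sketch of this concentration, using i.i.d. Exponential(1) draws (an assumed example distribution with expectation 1):

```python
import numpy as np

# The sample mean of i.i.d. Exponential(1) draws (expectation 1) has variance
# 1/n, so it concentrates around 1 as n grows, as Chebyshev's inequality promises.
rng = np.random.default_rng(6)
for n in (10, 1_000, 100_000):
    xbar = rng.exponential(1.0, size=n).mean()
    print(n, xbar)  # xbar is a sample mean based on n draws
```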


## With minimal technical assumptions about the finite variances of the independent Xi

we can go beyond the behavior of $\bar X - \bar\mu$ to consider not just that it tends to zero, but also how it varies around zero:

$$P\left\{\frac{1}{n}\sum_{i=1}^n X_i - \frac{1}{n}\sum_{i=1}^n E\{X_i\} \le x\sqrt{\frac{1}{n^2}\sum_{i=1}^n \sigma_i^2}\right\} \to \int_{-\infty}^{x} \frac{e^{-t^2/2}}{\sqrt{2\pi}}\, dt.$$

Not only can we learn, we can know how well we've learned!
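A simulation sketch of this normal approximation, with Exponential(1) draws (assumed as an example; mean and variance both 1):

```python
import numpy as np

# Standardized sample means of Exponential(1) draws behave like a
# standard normal for moderately large n.
rng = np.random.default_rng(7)
n, reps = 200, 50_000
samples = rng.exponential(1.0, size=(reps, n))
z = (samples.mean(axis=1) - 1.0) * np.sqrt(n)  # standardized sample means

# P{Z <= 1} for a standard normal is about 0.8413
print((z <= 1.0).mean())
```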


## When you analyze data

- There is data.
- A statistical method is applied.
- There are the results of your method.
- You also produce some indication of the precision of your results.
- The results and precision estimates are used to draw conclusions.

How do you know what method to apply?


## Given a statistical model, and given an analytic goal, there is (almost always) an appropriate method already in SAS, R, SPSS, Matlab, Minitab, Systat, et cetera

- What is a statistical model?
- What is an analytic goal?
- How does one elicit them from the client?


## A probability model has

- A sample space for the observables (the random variables)
- A joint distribution on the sample space

The joint distribution reflects all the sources of variability that are inherent in the random variables.


## A statistical model is a family of probability models

on the sample space for the observables (the random variables).

- What is specified about the joint distribution reflects what is known about the distribution of the random variables.
- What is unspecified reflects what is unknown about the distribution of the random variables.
- That we have random variables reflects that even if we knew everything that could be known about the distribution, there would still be randomness.
- Parameters index the possible distributions for the data.


## What must be considered in devising a statistical model?

What is known, and what is unknown, about:

- The sources of variability
- The sampling plan
- Mechanisms underlying the phenomena under examination
- Counterfactuals: issues of causation and confounding
- Practical issues relating to complexity, sample size, and computation


## The analytic goal is a statement of the researcher's goal in terms of the parameters

The mathematical version of this is decision theory:

- Parameters $\theta$ indexing probability models on outcomes $Y$
- Possible actions $A$
- A loss associated with parameter-action pairs, $L(\theta, a)$
- Decision rules map data to actions, $d : Y \to A$.

We evaluate decision rules via
$$E_\theta\{L(\theta, d(Y))\}.$$


- n subjects, randomly assigned to treatment or placebo
- Cure or Failure recorded for all
- Researchers wish to convince the EPA that the treatment is efficacious, but only if it really is

Model? Analytic goal?


- n patient charts chosen at random from a physician's practice
- Total gains generated by up-coding recorded for each
- Prosecutors need to assess the total gains in order to recommend the amount to be recovered

Model? Analytic goal?


- n loan applications, or cell histologies, or examples of past weather patterns
- Associated foreclosure outcomes, or cancer outcomes, or rainfall
- Researchers want to help others do prediction with new data

Model? Analytic goal?


## Given a family of probability models for the data

- The likelihood is the density (or probability mass function) of the observed data, viewed as a function of the parameter.
- The maximum likelihood estimator is the value of the parameter that maximizes the likelihood.
- We usually find the MLE by differentiating the logarithm of the likelihood and setting it to zero.

Maximum likelihood estimates are, under regularity conditions and asymptotically:

- Unbiased ($E_\theta\{\hat\theta\} \to \theta$)
- Efficient, in the sense of having the smallest variance among unbiased estimates
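A minimal sketch of maximum likelihood in the simplest case, the mean of Normal(mu, 1) data (simulated, with an assumed true mean of 3): the grid maximizer of the log-likelihood lands at the sample mean, as the score equation predicts.

```python
import numpy as np

# The log-likelihood for Normal(mu, 1) data is -(1/2) sum_i (x_i - mu)^2 + const;
# setting its derivative sum_i (x_i - mu) to zero gives mu_hat = sample mean.
rng = np.random.default_rng(8)
x = rng.normal(loc=3.0, scale=1.0, size=1_000)

mus = np.linspace(2.0, 4.0, 2001)  # grid search, to see the maximum directly
loglik = np.array([-0.5 * np.sum((x - mu) ** 2) for mu in mus])
mu_mle = mus[np.argmax(loglik)]

print(mu_mle, x.mean())  # the grid maximizer sits at (about) the sample mean
```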


## Ordinary least squares linear regression as maximum likelihood

- The (conditional) model
- The likelihood
- The score equations


## Mixture models and the EM algorithm

- Mixture models when the component identifiers are available
- The likelihood when they are not
- An iterative approach to estimation


Suppose you really believe that the parameter $\theta$ has a distribution $\pi(\theta)$, and that nature or god or . . . chose $\theta$ from that distribution when it created $\theta$. And suppose you wanted to estimate $\theta$.

Suppose we have some loss function, say
$$L(\hat\theta, \theta),$$
so that we need to find a function $\hat\theta$ to minimize
$$E\{L(\hat\theta, \theta)\}.$$
What expectation are we talking about? The expectation over $\theta$!

Find $\hat\theta$ to minimize
$$E\{L(\hat\theta(Y), \theta) \mid Y\},$$
or maybe just approximate that optimal choice with the expectation or mode or . . .


Bayes' theorem:
$$\theta \sim \pi(\theta), \qquad Y \mid \theta \sim f_\theta(y),$$
$$\pi(\theta \mid y) = \frac{\pi(\theta) f_\theta(y)}{\int \pi(\theta) f_\theta(y)\, d\theta}.$$
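A sketch of Bayes' theorem in a conjugate example with assumed numbers: a uniform (Beta(1, 1)) prior and a Binomial likelihood give a Beta(1 + y, 1 + n - y) posterior, which we can confirm by normalizing numerically.

```python
import numpy as np
from math import comb

# Assumed example: theta ~ Uniform(0, 1) (a Beta(1, 1) prior),
# Y | theta ~ Binomial(n, theta), observed y = 7 successes out of n = 10.
n, y = 10, 7
theta = np.linspace(0.0005, 0.9995, 1000)
d = theta[1] - theta[0]
prior = np.ones_like(theta)                                   # pi(theta)
likelihood = comb(n, y) * theta**y * (1.0 - theta)**(n - y)   # f_theta(y)

# Bayes' theorem: posterior = prior * likelihood, normalized
unnormalized = prior * likelihood
posterior = unnormalized / (unnormalized.sum() * d)

# The Beta(8, 4) posterior has mean 8 / 12
post_mean = (theta * posterior).sum() * d
print(post_mean)  # close to 0.6667
```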


## Suppose θ has prior π(θ) and Y is then generated according to f_θ(y)

Suppose we have some loss function, say
$$L(\hat\theta, \theta),$$
so that we need to find a function $\hat\theta(y)$ to minimize
$$E\{L(\hat\theta(Y), \theta)\}.$$
What expectation are we talking about? The expectation over $\theta$ and $Y$!
$$E\{L(\hat\theta(Y), \theta)\} = E\{E\{L(\hat\theta(Y), \theta) \mid \theta\}\} = E\{E\{L(\hat\theta(Y), \theta) \mid Y\}\}.$$
Find $\hat\theta(Y)$ to minimize
$$E\{L(\hat\theta(Y), \theta) \mid Y\},$$
or maybe just approximate that optimal choice with the posterior expectation, or mode, or . . .