Data Mining Lecture 2
Columbia University


Lecture 2: Linear algebra, Probability theory, Statistical inference

Linear algebra

Let X be the n by 1 matrix (really, just a vector) with all entries equal to 1,

$$X = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix},$$

and consider the span of the columns (really, just one column) of X, the set of all vectors of the form $X\beta_0$ (where $\beta_0$ here is any real number).


Now let Y be an arbitrary n-dimensional vector,

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}.$$

Maybe Y is in the subspace. But if it is not, we can ask: what is the vector in the subspace closest to Y?

That is, what value of $\beta_0$ minimizes the (squared) distance between $X\beta_0$ and Y,

$$\|Y - X\beta_0\|^2 = (Y - X\beta_0)^T (Y - X\beta_0) = \sum_{i=1}^n (Y_i - \beta_0)^2.$$


Take the derivative of $\sum_{i=1}^n (Y_i - \beta_0)^2$ with respect to $\beta_0$ and set it to zero:

$$-2 \sum_{i=1}^n (Y_i - \beta_0) = 0.$$

Equivalently, differentiate $(Y - X\beta_0)^T (Y - X\beta_0)$ with respect to $\beta_0$ and set to zero:

$$-2 X^T (Y - X\beta_0) = 0.$$


The minimizing value $\hat\beta_0$ gives the closest vector

$$X\hat\beta_0 = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \hat\beta_0, \qquad \text{where} \quad \hat\beta_0 = \frac{1}{n}\sum_{i=1}^n Y_i,$$

or equivalently,

$$X\hat\beta_0 = X (X^T X)^{-1} X^T Y.$$
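To make this concrete, here is a minimal numerical sketch (Python with NumPy; the data values are made up for illustration) checking that the projection of Y onto the all-ones column is the vector of sample means, and that it agrees with the general formula $X(X^T X)^{-1} X^T Y$.

```python
import numpy as np

# Made-up response vector for illustration.
Y = np.array([2.0, 3.5, 1.0, 4.5])
n = len(Y)

# Design "matrix": a single column of ones.
X = np.ones((n, 1))

# Closed form: the minimizing beta0 is the sample mean.
beta0_hat = Y.sum() / n

# General projection formula: X (X^T X)^{-1} X^T Y.
projection = X @ np.linalg.inv(X.T @ X) @ X.T @ Y

print(beta0_hat)   # 2.75
print(projection)  # [2.75 2.75 2.75 2.75]
```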


Now suppose instead that the entries of X are arbitrary numbers $X_{i1}$,

$$X = \begin{pmatrix} X_{11} \\ X_{21} \\ \vdots \\ X_{n1} \end{pmatrix},$$

and consider the span of the columns of X, the set of all vectors of the form $X\beta_1$ (where again $\beta_1$ here is any real number).


As before, let

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}.$$

Maybe Y is in the subspace. But if it is not, we can ask, as before: what is the vector in the subspace closest to Y?

That is, what value of $\beta_1$ minimizes the (squared) distance between $X\beta_1$ and Y,

$$\|Y - X\beta_1\|^2 = (Y - X\beta_1)^T (Y - X\beta_1) = \sum_{i=1}^n (Y_i - \beta_1 X_{i1})^2.$$


Take the derivative of $\sum_{i=1}^n (Y_i - \beta_1 X_{i1})^2$ with respect to $\beta_1$ and set it to zero:

$$-2 \sum_{i=1}^n (Y_i - \beta_1 X_{i1})\, X_{i1} = 0.$$

Equivalently, differentiate $(Y - X\beta_1)^T (Y - X\beta_1)$ with respect to $\beta_1$ and set to zero:

$$-2 X^T (Y - X\beta_1) = 0.$$


The minimizing value $\hat\beta_1$ gives the closest vector

$$X\hat\beta_1 = \begin{pmatrix} X_{11} \\ X_{21} \\ \vdots \\ X_{n1} \end{pmatrix} \hat\beta_1, \qquad \text{where} \quad \hat\beta_1 = \frac{\sum_{i=1}^n Y_i X_{i1}}{\sum_{i=1}^n X_{i1}^2},$$

or equivalently,

$$X\hat\beta_1 = X (X^T X)^{-1} X^T Y.$$
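A similar sketch (Python/NumPy, made-up data) for the single-column case: the closed-form ratio agrees with the matrix solution.

```python
import numpy as np

# Made-up data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0])   # the single column X_{i1}
Y = np.array([2.1, 3.9, 6.2, 7.8])

# Closed form for the through-the-origin slope.
beta1_hat = (Y * x).sum() / (x ** 2).sum()

# Same answer from the general projection formula.
X = x.reshape(-1, 1)
beta_from_matrix = np.linalg.solve(X.T @ X, X.T @ Y)

print(beta1_hat, beta_from_matrix[0])  # both approximately 1.99
```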


Now let X be the n by 2 matrix whose first column is all ones and whose second column holds the $X_{i1}$,

$$X = \begin{pmatrix} 1 & X_{11} \\ 1 & X_{21} \\ \vdots & \vdots \\ 1 & X_{n1} \end{pmatrix},$$

and consider the span of the columns of X, the set of all vectors of the form $X\beta$, where $\beta$ here is the two-dimensional column vector $(\beta_0, \beta_1)^T$.


As before, let

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}.$$

Maybe Y is in the subspace. But if it is not, we can ask, as before: what is the vector in the subspace closest to Y?

That is, what value of the two-dimensional $\beta$ minimizes the (squared) distance between $X\beta$ and Y,

$$\|Y - X\beta\|^2 = (Y - X\beta)^T (Y - X\beta) = \sum_{i=1}^n (Y_i - [\beta_0 + \beta_1 X_{i1}])^2.$$


Take the partial derivatives of $\sum_{i=1}^n (Y_i - [\beta_0 + \beta_1 X_{i1}])^2$ with respect to $\beta_0$ and $\beta_1$ and set them to zero:

$$-2 \sum_{i=1}^n (Y_i - [\beta_0 + \beta_1 X_{i1}]) = 0, \qquad -2 \sum_{i=1}^n (Y_i - [\beta_0 + \beta_1 X_{i1}])\, X_{i1} = 0.$$

Equivalently, differentiate $(Y - X\beta)^T (Y - X\beta)$ with respect to $\beta$ and set to zero:

$$-2 X^T (Y - X\beta) = 0.$$


The minimizing $\hat\beta = (\hat\beta_0, \hat\beta_1)^T$ gives the closest vector

$$X\hat\beta = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \hat\beta_0 + \begin{pmatrix} X_{11} \\ X_{21} \\ \vdots \\ X_{n1} \end{pmatrix} \hat\beta_1,$$

where

$$\hat\beta_0 = \bar Y - \hat\beta_1 \bar X_1, \qquad \hat\beta_1 = \frac{\sum_{i=1}^n (X_{i1} - \bar X_1)\, Y_i}{\sum_{i=1}^n (X_{i1} - \bar X_1)^2},$$

or equivalently,

$$X\hat\beta = X (X^T X)^{-1} X^T Y.$$
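Once more as a sketch (Python/NumPy, made-up data): the centered closed forms for $\hat\beta_0$ and $\hat\beta_1$ match the matrix solution $(X^T X)^{-1} X^T Y$ with an explicit intercept column.

```python
import numpy as np

# Made-up data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

xbar, Ybar = x.mean(), Y.mean()

# Closed forms from the normal equations.
beta1_hat = ((x - xbar) * Y).sum() / ((x - xbar) ** 2).sum()
beta0_hat = Ybar - beta1_hat * xbar

# Matrix solution with an explicit intercept column.
X = np.column_stack([np.ones_like(x), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

print(beta_hat, (beta0_hat, beta1_hat))  # identical pairs
```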


In general, the vector in the span of the columns of X nearest to Y, the so-called projection of Y onto the span of the columns of X, is the vector

$$X\hat\beta,$$

where $\hat\beta$ is the minimizer of

$$\|Y - X\beta\|^2 = (Y - X\beta)^T (Y - X\beta).$$

If we take the gradient with respect to $\beta$ we arrive at

$$X^T (Y - X\beta) = 0,$$

from which it follows that

$$\hat\beta = (X^T X)^{-1} X^T Y.$$
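The same recipe works for any number of columns. A sketch (Python/NumPy, random made-up design): solve the normal equations and confirm the defining property that the residual is orthogonal to every column of X.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up design with p = 3 columns and a made-up response.
X = rng.normal(size=(50, 3))
Y = rng.normal(size=50)

# beta_hat = (X^T X)^{-1} X^T Y, computed with a linear solver.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# The residual Y - X beta_hat is orthogonal to the columns of X.
residual = Y - X @ beta_hat
print(X.T @ residual)  # numerically zero
```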


Now take $\Sigma = D$, a p by p diagonal matrix with all positive entries $d_1, \dots, d_p$ on the diagonal.

Note that $\Sigma$ is a simple example of a symmetric, positive definite matrix.

Note that we could write $\Sigma = I D I^T$, where I is the identity matrix.

Note that the columns of I are orthogonal, of unit length, and they are eigenvectors, with eigenvalues equal to the corresponding elements of D.


Write any p-dimensional vector c as a weighted sum of the columns of I,

$$c = c_1 \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} + c_2 \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix} + \cdots + c_p \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}.$$

What happens when you compute $\Sigma c$? You get

$$\Sigma c = d_1 c_1 \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} + d_2 c_2 \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix} + \cdots + d_p c_p \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}.$$


And then

$$c^T \Sigma c = d_1 c_1\, c^T \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} + d_2 c_2\, c^T \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix} + \cdots + d_p c_p\, c^T \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix} = c_1^2 d_1 + c_2^2 d_2 + \cdots + c_p^2 d_p.$$


How do you maximize $c^T \Sigma c$ over unit vectors c? That is, how do you find c to maximize

$$c_1^2 d_1 + c_2^2 d_2 + \cdots + c_p^2 d_p$$

subject to the constraint that

$$c_1^2 + c_2^2 + \cdots + c_p^2 = 1?$$

(In this diagonal case, clearly the best you can do is put all of the weight on a coordinate with the largest $d_j$.)


Let's start all over again. But this time, we'll not take $\Sigma = D$, a diagonal matrix with all positive entries. Instead, take $\Sigma = P D P^T$, where D is again a diagonal matrix with all positive entries, and P is a matrix whose columns are orthonormal (and span p-dimensional space).

Note that $\Sigma$ is a complicated example of a symmetric, positive definite matrix.

Note that we write $\Sigma = P D P^T$, where P is now not the identity.

Note that the columns of P are by definition orthogonal, of unit length.

And just like the columns of I were eigenvectors, so are the columns of P, again with eigenvalues equal to the corresponding elements of D.


Write any p-dimensional vector c as a weighted sum of the columns of P (not of I now, but rather of P),

$$c = c_1 P_1 + c_2 P_2 + \cdots + c_p P_p.$$

(The columns of P are a basis for p-dimensional space.)

What happens when you compute $\Sigma c$? Using $P^T P_j = e_j$, the j-th standard basis vector, you get

$$\Sigma c = P D P^T c = P D P^T (c_1 P_1 + \cdots + c_p P_p) = P D \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_p \end{pmatrix} = P \begin{pmatrix} d_1 c_1 \\ d_2 c_2 \\ \vdots \\ d_p c_p \end{pmatrix} = c_1 d_1 P_1 + c_2 d_2 P_2 + \cdots + c_p d_p P_p.$$


And then

$$c^T \Sigma c = d_1 c_1\, c^T P_1 + d_2 c_2\, c^T P_2 + \cdots + d_p c_p\, c^T P_p = c_1^2 d_1 + c_2^2 d_2 + \cdots + c_p^2 d_p,$$

since $c^T P_j = c_j$.


So how do you maximize $c^T \Sigma c$ over unit vectors c? That is, how do you find c to maximize

$$c_1^2 d_1 + c_2^2 d_2 + \cdots + c_p^2 d_p$$

subject to the constraint that

$$c_1^2 + c_2^2 + \cdots + c_p^2 = 1?$$


One last fact that will be relevant when we use these results: every symmetric positive definite matrix $\Sigma$ can be written in the form $\Sigma = P D P^T$, where the columns of P are an orthonormal basis for p-dimensional space, the columns are the eigenvectors of $\Sigma$, and the diagonal matrix D has as components the corresponding (positive) eigenvalues.
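A short sketch (Python/NumPy, made-up matrix) of this fact: numpy.linalg.eigh returns the orthonormal P and the eigenvalues that go on the diagonal of D, and no unit vector beats the top eigenvector at maximizing $c^T \Sigma c$.

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up symmetric positive definite matrix: A A^T + I.
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + np.eye(4)

# eigh returns eigenvalues (ascending) and orthonormal eigenvectors.
d, P = np.linalg.eigh(Sigma)

assert np.allclose(Sigma, P @ np.diag(d) @ P.T)  # Sigma = P D P^T
assert np.allclose(P.T @ P, np.eye(4))           # P is orthonormal

# Over many random unit vectors, c^T Sigma c never exceeds the top
# eigenvalue, which is attained at the corresponding eigenvector.
c = rng.normal(size=(4, 10_000))
c /= np.linalg.norm(c, axis=0)
print((c * (Sigma @ c)).sum(axis=0).max(), d[-1])
```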


One way to think of $\Sigma$ is in terms of how it maps the unit sphere...

The transformation corresponding to $\Sigma$ maps the orthonormal eigenvectors, the columns of P, into stretched or shrunken versions of themselves. That is, $\Sigma$ maps the unit sphere into an ellipsoid, with axes along the eigenvectors, and the lengths of the axes equal to twice the eigenvalues.

From this point of view, does it make sense that to maximize $c^T \Sigma c$ for c on the unit sphere, one can do no better than taking c equal to the eigenvector with the largest eigenvalue?


Suppose $\Sigma$ is a symmetric p by p matrix. Suppose also that $\Sigma$ is positive definite, so that for any non-zero p-dimensional vector c, $c^T \Sigma c$ is greater than zero. Then:

All of the eigenvalues of $\Sigma$ are real and positive.

All of the eigenvectors of $\Sigma$ are orthogonal (or have the same eigenvalue).

We can find p linearly independent orthogonal unit-length p-dimensional eigenvectors $P_j$.

Let P be the p by p matrix whose columns are the $P_j$, and let D be the diagonal matrix whose entries are the corresponding eigenvalues. Then

$$\Sigma = P D P^T.$$


Consider maximizing $c^T \Sigma c$ over unit vectors c. Local maximizers are given by $c = P_j$, and the corresponding local maxima are the eigenvalues.


Now consider maximizing $c^T \Sigma c$ over unit vectors c such that

$$c^T P_j = 0$$

for the $P_j$ corresponding to some set of eigenvectors. Local maximizers are given by the $P_j$ associated with the other eigenvectors, and the corresponding local maxima are the eigenvalues.


$\Sigma$ maps the unit sphere into an ellipsoid with axes the $P_j$, and with the lengths of the axes equal to twice the eigenvalues.

For a vector c, $P^T c$ has as its components the $a_j$ for which

$$\sum_{j=1}^p a_j P_j = c.$$

$D P^T c$ multiplies those components by the eigenvalues $\lambda_j$, and so $P D P^T c$ is

$$\sum_{j=1}^p a_j \lambda_j P_j.$$

In short, $\Sigma$ maps

$$\sum_{j=1}^p a_j P_j \quad\text{to}\quad \sum_{j=1}^p a_j \lambda_j P_j.$$

Probability theory

Let $\theta$ be a random variable with density $\pi(\theta)$. Let Y be a (vector of) random variable(s) with (joint) conditional density $f(y|\theta)$ given $\theta$.

The conditional density of $\theta$ given Y = y is

$$\pi(\theta|y) = \frac{\pi(\theta)\, f(y|\theta)}{\int \pi(\theta)\, f(y|\theta)\, d\theta}.$$

The conditional expectation of $\theta$ given Y = y is

$$E\{\theta|y\} = \int \theta\, \pi(\theta|y)\, d\theta.$$

The posterior mode, where the densities are differentiable, solves

$$\frac{d}{d\theta} \ln \pi(\theta) + \frac{d}{d\theta} \ln f(y|\theta) = 0.$$
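A small numerical sketch (Python/NumPy; the prior and likelihood are made-up choices): evaluate $\pi(\theta) f(y|\theta)$ on a grid, normalize to get $\pi(\theta|y)$, and read off the posterior mean and mode.

```python
import numpy as np

# Grid over theta.
theta = np.linspace(-5.0, 5.0, 2001)
dtheta = theta[1] - theta[0]

# Made-up choices: standard normal prior, one observation y ~ N(theta, 1).
# Normalizing constants are omitted; they cancel in the normalization.
y = 1.8
prior = np.exp(-theta**2 / 2)
likelihood = np.exp(-(y - theta) ** 2 / 2)

# Posterior: proportional to prior times likelihood, normalized on the grid.
unnorm = prior * likelihood
posterior = unnorm / (unnorm.sum() * dtheta)

post_mean = (theta * posterior).sum() * dtheta
post_mode = theta[np.argmax(posterior)]
print(post_mean, post_mode)  # both approximately y/2 = 0.9 here
```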


If X and Y have joint density $f_{XY}(x, y)$, the density of Y is

$$f_Y(y) = \int f_{XY}(x, y)\, dx,$$

and the density of X is

$$f_X(x) = \int f_{XY}(x, y)\, dy.$$


The expectations of X and Y, and the conditional expectation of Y given X, are

$$E\{X\} = \int f_X(x)\, x\, dx,$$

$$E\{Y\} = \int f_Y(y)\, y\, dy,$$

$$E\{Y|X\} = \int f_{Y|X}(y|X)\, y\, dy.$$

And we have

$$E\{Y\} = E\{E\{Y|X\}\}.$$
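A quick simulation sketch (Python/NumPy; the joint distribution is made up) of the identity $E\{Y\} = E\{E\{Y|X\}\}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Made-up hierarchical model: X ~ Uniform(0, 1), Y | X ~ N(2 X, 1).
X = rng.uniform(0, 1, size=n)
Y = rng.normal(loc=2 * X, scale=1.0)

print(Y.mean())        # direct estimate of E{Y}, approximately 1.0
print((2 * X).mean())  # E{E{Y|X}} = E{2 X} = 1.0 as well
```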


More generally,

$$E\{g(X)\} = \int f_X(x)\, g(x)\, dx,$$

so that also

$$\mathrm{Var}\{g(X)\} = \int f_X(x)\, \big(g(x) - E\{g(X)\}\big)^2\, dx.$$


Similarly,

$$\mathrm{Var}(Y) = \int f_Y(y)\, (y - E\{Y\})^2\, dy,$$

$$\mathrm{Var}(Y|X) = \int f_{Y|X}(y|X)\, (y - E\{Y|X\})^2\, dy.$$


The covariance of Y and X is

$$\mathrm{Cov}(Y, X) = E\{(Y - E\{Y\})(X - E\{X\})\},$$

and we have

$$\mathrm{Var}(Y) = E\{\mathrm{Var}(Y|X)\} + \mathrm{Var}(E\{Y|X\}).$$
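Continuing the same made-up model as above, a sketch checking the variance decomposition numerically: here Var(Y|X) = 1 everywhere and E{Y|X} = 2X, so Var(Y) should be 1 + 4 Var(X) = 1 + 4/12.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

# Made-up model again: X ~ Uniform(0, 1), Y | X ~ N(2 X, 1).
X = rng.uniform(0, 1, size=n)
Y = rng.normal(loc=2 * X, scale=1.0)

print(Y.var())              # approximately 1.333
print(1.0 + np.var(2 * X))  # E{Var(Y|X)} + Var(E{Y|X}), same value
```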


For a random vector X, E{X} is the vector with entries equal to the expectations of the components of X.


The covariance matrix of X is

$$\mathrm{Cov}(X) = E\{(X - E\{X\})(X - E\{X\})^T\},$$

with diagonal entries equal to the variances of the components of X, and the covariances arranged in the off-diagonals.

Note that a covariance matrix is symmetric and, as long as the components of X are not linear functions of each other, positive definite.
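A sketch (Python/NumPy, made-up correlated data): estimate the covariance matrix from samples and confirm that it is symmetric with positive eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up 3-dimensional data with correlated components.
Z = rng.normal(size=(10_000, 3))
X = Z @ np.array([[1.0, 0.5, 0.0],
                  [0.0, 1.0, 0.3],
                  [0.0, 0.0, 1.0]])

# Sample covariance matrix (rows of X are observations).
C = np.cov(X, rowvar=False)

print(np.allclose(C, C.T))        # symmetric: True
print(np.linalg.eigvalsh(C) > 0)  # all eigenvalues positive
```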


Independence

Random variables are independent if their joint density is equal to the product of their marginal densities.

Independence captures the notion of one random variable's value carrying no information about another's.

If two random variables are independent, their covariance is equal to zero.


If X is a q-dimensional random vector with mean vector $\mu$ and covariance matrix $\Sigma$, and if M is an r by q matrix of constants, and $\nu$ is an r-dimensional vector of constants, then

$$E\{M X + \nu\} = M \mu + \nu$$

and

$$\mathrm{Cov}(M X + \nu) = M\, \mathrm{Cov}(X)\, M^T.$$
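A simulation sketch (Python/NumPy; M, $\nu$, and the distribution of X are made up) of the rule $\mathrm{Cov}(MX + \nu) = M\,\mathrm{Cov}(X)\,M^T$.

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up X with q = 2 correlated components; map it to r = 3 dimensions.
X = rng.normal(size=(200_000, 2)) @ np.array([[1.0, 0.4],
                                              [0.0, 1.0]])
M = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, -1.0]])
nu = np.array([0.5, -0.5, 1.0])

Y = X @ M.T + nu  # each row is M x + nu

print(np.cov(Y, rowvar=False))            # simulated covariance
print(M @ np.cov(X, rowvar=False) @ M.T)  # M Cov(X) M^T, nearly identical
```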


Chebyshev's inequality: if X has mean $\mu$ and finite variance, then for any $\epsilon > 0$,

$$P\{|X - \mu| \ge \epsilon\} \le \mathrm{Var}(X) / \epsilon^2.$$
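A sketch (Python/NumPy; the Exponential(1) example is made up) comparing the empirical tail probability with the Chebyshev bound.

```python
import numpy as np

rng = np.random.default_rng(6)

# Made-up example: X ~ Exponential(1), so mu = 1 and Var(X) = 1.
X = rng.exponential(1.0, size=1_000_000)
mu, var = 1.0, 1.0

for eps in [1.0, 2.0, 3.0]:
    empirical = np.mean(np.abs(X - mu) >= eps)
    bound = var / eps**2
    print(eps, empirical, bound)  # empirical probability <= bound
```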


Suppose the $X_i$ are independent and each $X_i$ has a finite variance (which we will denote $\sigma_i^2$). Then the variance of $\bar X$, that is, the variance of

$$\frac{1}{n} \sum_{i=1}^n X_i,$$

is equal to

$$\frac{1}{n^2} \sum_{i=1}^n \sigma_i^2.$$

If the $\sigma_i^2$ have an upper bound in common, then the variance tends to zero for large values of n.


So, by Chebyshev's inequality, for any $\epsilon > 0$,

$$P\{|\bar X - \bar\mu| \ge \epsilon\}$$

is also small. Here, $\bar\mu$ is the average of the expectations of the $X_i$.

In short, by taking more data, we can learn.
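A sketch of this law of large numbers in action (Python/NumPy; the Uniform(0, 1) example is made up): the sample mean settles near the true mean 0.5 as n grows.

```python
import numpy as np

rng = np.random.default_rng(7)

# Made-up example: X_i ~ Uniform(0, 1), so the true mean is 0.5.
for n in [10, 1_000, 100_000]:
    xbar = rng.uniform(0, 1, size=n).mean()
    print(n, xbar)  # the sample mean concentrates near 0.5
```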


The central limit theorem: for independent $X_i$, we can go beyond the behavior of $\bar X - \bar\mu$ to consider not just that it tends to zero, but also how it varies around zero,

$$P\left\{ \frac{\bar X - \frac{1}{n}\sum_{i=1}^n E\{X_i\}}{\sqrt{\frac{1}{n^2}\sum_{i=1}^n \sigma_i^2}} \le t \right\} \longrightarrow \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi}}\, e^{-s^2/2}\, ds.$$

Not only can we learn, we can know how well we've learned!
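A sketch of the central limit theorem (Python; the Uniform(0, 1) summands are a made-up example): standardized sums of uniforms already look very normal, as a comparison with the standard normal CDF shows.

```python
from math import erf, sqrt

import numpy as np

rng = np.random.default_rng(8)

# Made-up example: standardized sums of n i.i.d. Uniform(0, 1) variables.
n, reps = 50, 100_000
X = rng.uniform(0, 1, size=(reps, n))
Z = (X.sum(axis=1) - n * 0.5) / np.sqrt(n / 12)

# Compare P{Z <= t} with the standard normal CDF at a few points.
for t in [-1.0, 0.0, 1.0]:
    print(t, np.mean(Z <= t), 0.5 * (1 + erf(t / sqrt(2))))
```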

Statistical inference

There is data

A statistical method is applied

There are the results of your method

You also produce some indication of the precision of your results

The results and precision estimates are used to draw conclusions


You can often (not always) find the appropriate method already in SAS, R, SPSS, Matlab, Minitab, Systat, et cetera.

What is a statistical model?

What is an analytic goal?

How does one elicit them from the client?


A statistical model consists of:

A sample space for the observables (the random variables).

A joint distribution on the sample space.

The joint distribution reflects all the sources of variability that affect the data.


A statistical model, then, is a family of probability distributions on the sample space for the observables (the random variables).

What is specified about the joint distribution reflects what is known.

What is unspecified reflects what is unknown about the phenomenon.

That we have random variables at all reflects that even if we knew everything, there would still be randomness.

Parameters index the possible distributions for the data.


The model reflects what is known, and what is unknown, about:

The sources of variability.

The sampling plan.

Mechanisms underlying the phenomena under examination.

Counterfactuals - issues of causation and confounding.

Practical issues relating to complexity, sample size, and computation.


An analytic goal states what you want to learn about the parameters. The mathematical version of this is decision theory:

Parameters $\theta$ indexing probability models on outcomes Y.

Possible actions A.

A loss associated with parameter-action pairs, $L(\theta, a)$.

Decision rules map data to actions, $d : Y \to A$.

We evaluate decision rules via the risk

$$E_\theta\{L(\theta, d(Y))\}.$$
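A minimal sketch of evaluating decision rules (Python/NumPy; all choices are made up): compare the sample mean and the sample median as estimators of a normal mean under squared-error loss by simulating the risk $E_\theta\{L(\theta, d(Y))\}$.

```python
import numpy as np

rng = np.random.default_rng(9)
theta, n, reps = 1.0, 25, 20_000

# Simulate data sets Y of size n from N(theta, 1), then apply two rules.
Y = rng.normal(theta, 1.0, size=(reps, n))
risk_mean = np.mean((Y.mean(axis=1) - theta) ** 2)
risk_median = np.mean((np.median(Y, axis=1) - theta) ** 2)

print(risk_mean)    # approximately 1/n = 0.04
print(risk_median)  # larger, approximately (pi/2)/n for normal data
```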


Example: a treatment is applied, and Cure or Failure is recorded for all subjects. Researchers wish to convince the EPA that the treatment is effective. Model? Analytic goal?


Example: total gains generated by up-coding are recorded for each case in a sample. Prosecutors need to assess the total gains in order to recommend a penalty. Model? Analytic goal?


Example: researchers collect predictors such as weather patterns, with associated foreclosure outcomes, or cancer outcomes, or rainfall. Researchers want to help others do prediction with new data.


The likelihood is the density (or probability mass function) of the data, viewed as a function of the parameter.

The maximum likelihood estimator (MLE) is the value of the parameter that maximizes the likelihood.

We usually find the MLE by differentiating the logarithm of the likelihood and setting the derivative to zero.

Maximum likelihood estimates are (at least asymptotically):

Unbiased ($E\{\hat\theta\} = \theta$).

Efficient, in the sense of having smallest variance among unbiased estimates.
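A sketch (Python/NumPy, made-up Bernoulli data): the log-likelihood is maximized at the sample proportion, which a grid search confirms.

```python
import numpy as np

rng = np.random.default_rng(10)

# Made-up Bernoulli data with true p = 0.3.
y = rng.binomial(1, 0.3, size=500)

# Log-likelihood of p for i.i.d. Bernoulli observations.
p = np.linspace(0.001, 0.999, 999)
loglik = y.sum() * np.log(p) + (len(y) - y.sum()) * np.log(1 - p)

# The grid maximizer matches the closed-form MLE (the sample mean)
# up to the grid resolution.
print(p[np.argmax(loglik)], y.mean())
```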


The (conditional) model

The likelihood

The score equations


Mixture models when the component identifiers are available

The likelihood when they are not

An iterative approach to estimation
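As a sketch of that iterative approach (a minimal EM loop for a two-component Gaussian mixture, in Python/NumPy; the data and initialization are made up), alternate posterior component probabilities (E-step) with weighted parameter updates (M-step):

```python
import numpy as np

rng = np.random.default_rng(11)

# Made-up data from a two-component mixture.
y = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

def normal_pdf(y, mu, sigma):
    return np.exp(-(y - mu) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Crude initialization.
w, mu1, mu2, s1, s2 = 0.5, y.min(), y.max(), y.std(), y.std()

for _ in range(100):
    # E-step: posterior probability that each point came from component 1.
    a = w * normal_pdf(y, mu1, s1)
    b = (1 - w) * normal_pdf(y, mu2, s2)
    r = a / (a + b)

    # M-step: weighted maximum likelihood updates.
    w = r.mean()
    mu1 = (r * y).sum() / r.sum()
    mu2 = ((1 - r) * y).sum() / (1 - r).sum()
    s1 = np.sqrt((r * (y - mu1) ** 2).sum() / r.sum())
    s2 = np.sqrt(((1 - r) * (y - mu2) ** 2).sum() / (1 - r).sum())

print(w, mu1, mu2)  # close to 0.3, -2, and 3
```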


Suppose you really believe that the parameter $\theta$ has a distribution $\pi(\theta)$, and that nature or god or ... chose $\theta$ from that distribution. And suppose you wanted to estimate $\theta$.

Suppose we have some loss function, say

$$L(\hat\theta, \theta),$$

so that we need to find an estimate $\hat\theta$ to minimize

$$E\{L(\hat\theta, \theta)\}.$$

What expectation are we talking about? The expectation over $\theta$!

Find $\hat\theta$ to minimize

$$E\{L(\hat\theta(Y), \theta)\,|\,Y\}.$$

Or maybe just approximate that optimal choice with the posterior expectation or mode or ...


The conditional density of $\theta$ given Y = y comes from Bayes theorem:

$$\pi(\theta|y) = \frac{\pi(\theta)\, f(y|\theta)}{\int \pi(\theta)\, f(y|\theta)\, d\theta}.$$


Suppose we have some loss function, say

$$L(\hat\theta, \theta),$$

so that we need to find a function $\hat\theta(y)$ to minimize

$$E\{L(\hat\theta(Y), \theta)\}.$$

What expectation are we talking about? The expectation over $\theta$ and Y!

$$E\{L(\hat\theta(Y), \theta)\} = E\{E\{L(\hat\theta(Y), \theta)\,|\,\theta\}\} = E\{E\{L(\hat\theta(Y), \theta)\,|\,Y\}\}.$$

Find $\hat\theta(Y)$ to minimize

$$E\{L(\hat\theta(Y), \theta)\,|\,Y\}.$$

Or maybe just approximate that optimal choice with the posterior expectation, or mode, or ...
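A closing sketch (Python/NumPy; the same made-up grid posterior as before): under squared-error loss, the posterior mean minimizes the posterior expected loss $E\{L(\hat\theta(Y), \theta)|Y\}$, beating other constant guesses.

```python
import numpy as np

# Same made-up grid posterior as before: N(0, 1) prior, y ~ N(theta, 1).
theta = np.linspace(-5.0, 5.0, 2001)
dtheta = theta[1] - theta[0]
y = 1.8
unnorm = np.exp(-theta**2 / 2) * np.exp(-(y - theta) ** 2 / 2)
posterior = unnorm / (unnorm.sum() * dtheta)

def posterior_loss(guess):
    # Posterior expected squared-error loss of the point estimate `guess`.
    return ((guess - theta) ** 2 * posterior).sum() * dtheta

post_mean = (theta * posterior).sum() * dtheta
for guess in [0.0, post_mean, y]:
    print(guess, posterior_loss(guess))  # smallest at the posterior mean
```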
