
Advanced Econometrics

Arnan Viriyavejkul

June 12, 2019


Contents

1 Preliminaries
  1.1 Bivariate Statistics
  1.2 Linear Algebra

2 Ordinary Least Squares

3 Linear Regression Model
  3.1 Ordinary Least Squares
  3.2 Generalized Least Squares

4 Instrumental Variables
  4.1 Estimator
    4.1.1 Case 1: L = K
    4.1.2 Case 2: L ≥ K
  4.2 Application: Partitioned Regression
    4.2.1 Consistency
    4.2.2 A Plot Twist

5 Maximum Likelihood Estimation
  5.1 Normal Distribution with Unit Variance
    5.1.1 Expectation
    5.1.2 Variance
    5.1.3 Score function
    5.1.4 Fisher information
    5.1.5 Decomposition
    5.1.6 Asymptotic distribution of µ̂ML
  5.2 Normal Regression Model
    5.2.1 Estimators
    5.2.2 Score Vector
    5.2.3 Hessian matrix
    5.2.4 Fisher information
    5.2.5 Asymptotic distribution

Appendix A Probability Theory
  A.1 Moments
  A.2 Conditional Expectation
  A.3 Convergence of Random Variables

Appendix B Linear Algebra
  B.1 Inner Product
  B.2 Orthogonality
    B.2.1 Orthogonal Projection
    B.2.2 Orthogonal Subspaces
  B.3 Vector Spaces
    B.3.1 Span and Linear Independence
    B.3.2 Linear Independence and Dependence
    B.3.3 Basis and Dimension
    B.3.4 Kernel
  B.4 List of Variables and Dimensions

Chapter 1

Preliminaries

1.1 Bivariate Statistics


Covariance
1. Cov(X, Y ) = Cov(X, E(Y |X))

Cov(X, Y ) = E[XY ] − E[X]E[Y ]


= E [E[XY |X]] − E[X]E[E[Y |X]]
= E[X·E[Y|X]] − E[X]·E[E[Y|X]]
= Cov(X, E[Y|X])

2. If X ∈ {0, 1}, then Cov(X, Y)/Var(X) = E(Y|X = 1) − E(Y|X = 0)

P (X = 1) = p
P (X = 0) = 1 − p
E[X] = E[X|X = 1]p + E[X|X = 0] · (1 − p) = p
E[X²] = E[X²|X = 1]·p + E[X²|X = 0]·(1 − p) = p
E[Y] = E[Y|X = 1]p + E[Y|X = 0](1 − p)
E[Y] = E[E[Y|X]]
E[XY] = E[E[XY|X]]
= E[X·E[Y|X]]
= p·1·E[Y|X = 1] + (1 − p)·0·E[Y|X = 0]
= pE[Y |X = 1]


Cov(X, Y)/Var(X) = (pE[Y|X = 1] − p[E[Y|X = 1]p + E[Y|X = 0](1 − p)]) / (p − p²)
= p[E[Y|X = 1] − E[Y|X = 1]p − E[Y|X = 0](1 − p)] / (p − p²)
= p(1 − p)[E[Y|X = 1] − E[Y|X = 0]] / (p − p²)
= E[Y|X = 1] − E[Y|X = 0]

3. E[XY ] is an inner product

⟨X, Y ⟩ = E[XY ] = E[Y X] = ⟨Y, X⟩


⟨aX, Y ⟩ = E[aXY ] = aE[XY ] = a⟨X, Y ⟩
⟨X1 + X2 , Y ⟩ = E[(X1 + X2 )Y ] = E[X1 Y ] + E[X2 Y ] = ⟨X1 , Y ⟩ + ⟨X2 , Y ⟩
X = 0 ⇒ E[X 2 ] = 0
E[X 2 ] = 0 ⇒ X = 0
¬(X = 0) ⇒ ¬[E[X 2 ] = 0]
¬(X = 0) ⇒ P (X = 0) < 1 ⇒ E[X 2 ] > 0
¬(X = 0) ⇒ P (X ̸= 0) > 0 ⇒ E[X 2 ] ̸= 0

4. Let ȲN := Σ_{i=1}^N Yi / N. Derive E[ȲN] and Var[ȲN].

µY = E[Yi]
σY² = Var[Yi] < ∞
ȲN = (1/N) Σ_{i=1}^N Yi

E[ȲN] = E[(1/N) Σ_{i=1}^N Yi] = (1/N) E[Σ_{i=1}^N Yi]
= (1/N) Σ_{i=1}^N E[Yi] = (1/N)(µY + … + µY)   (N times)
= (1/N)(N µY) = µY
Var[ȲN] = Var[(1/N) Σ_{i=1}^N Yi] = (1/N²) Var[Σ_{i=1}^N Yi]
= (1/N²) Var[Y1 + Y2 + … + YN]
= (1/N²)(σY² + … + σY²)   (N times)
= (1/N²)(N σY²) = σY²/N

Define the standardized mean ZN := (ȲN − E[ȲN]) / √Var[ȲN]. Then

E[ZN] = E[(ȲN − µY) / √(σY²/N)]
= (√N/σY) E[ȲN − µY]
= (√N/σY)(µY − µY) = 0

Var[ZN] = Var[(ȲN − µY) / √(σY²/N)]
= (N/σY²) Var[ȲN]
= (N/σY²)(σY²/N) = 1

5. Finiteness of moments by the Cauchy-Schwarz inequality

|⟨X, Y⟩|² ≤ ⟨X, X⟩ · ⟨Y, Y⟩
|E[XY]|² ≤ E[X²] E[Y²], i.e. E|XY| ≤ √(E[X²]) √(E[Y²])
|E[X]|² ≤ E[X²] E[1²]

To prove the finiteness,

|E[X]| ≤ √(E[X²]) < ∞
|E[Y]|² ≤ E[Y²] E[1²]
|E[Y]| ≤ √(E[Y²]) < ∞
|E[XY]|² ≤ E[X²] E[Y²]
E[X²] < ∞, E[Y²] < ∞ ⇒ |E[XY]|² < ∞ ⇒ |E[XY]| < ∞

Application: Treatment Effect


When the potential outcomes framework is translated into a regression model, it can be represented by

Yi = β0 + β1i Xi + ui
Xi = π0 + π1i Zi + vi

You have an iid sample (Xi, Yi, Zi) and compute

β̂IV = sZY/sZX = σZY/σZX + op(1)

where sZY denotes the sample covariance and σZY the population covariance of Zi and Yi.

The goal here is to express σZY and σZX in terms of moments of π1i and β1i, assuming random assignment of Zi.
1. Prove that σZX = σZ2 E (π1i )

P (Zi = 1) = p
P (Zi = 0) = 1 − p
E[Zi ] = E[Zi |Zi = 1]p + E[Zi |Zi = 0] · (1 − p) = p
E[Zi²] = E[Zi²|Zi = 1]·p + E[Zi²|Zi = 0]·(1 − p) = p
E[Xi] = E[Xi|Zi = 1]p + E[Xi|Zi = 0](1 − p)
E[Xi] = E[E[Xi|Zi]]
E[Zi Xi] = E[E[Zi Xi|Zi]]
= E[Zi·E[Xi|Zi]]
= p·1·E[Xi|Zi = 1] + (1 − p)·0·E[Xi|Zi = 0]
= pE[Xi|Zi = 1]

Cov(Zi, Xi) = pE[Xi|Zi = 1] − p[E[Xi|Zi = 1]p + E[Xi|Zi = 0](1 − p)]
= p[E[Xi|Zi = 1] − E[Xi|Zi = 1]p − E[Xi|Zi = 0](1 − p)]
= p(1 − p)[E[Xi|Zi = 1] − E[Xi|Zi = 0]]
= Var(Zi)[E[Xi|Zi = 1] − E[Xi|Zi = 0]]
= σZ² E[π1i]

The last step uses E[π1i] = E[Xi|Zi = 1] − E[Xi|Zi = 0]. Also, from the two-stage regression
equations,

[E[Xi |Zi = 1] − E[Xi |Zi = 0]] = E[π0 + π1i + vi ] − E[π0 + vi ]


= E[π1i ]

2. Prove that σZY = σZ2 E (β1i π1i )

Cov(Zi, Yi) = pE[Yi|Zi = 1] − p[E[Yi|Zi = 1]p + E[Yi|Zi = 0](1 − p)]
= p[E[Yi|Zi = 1] − E[Yi|Zi = 1]p − E[Yi|Zi = 0](1 − p)]
= p(1 − p)[E[Yi|Zi = 1] − E[Yi|Zi = 0]]

Analogously, by plugging in,

E[Yi|Zi = 1] − E[Yi|Zi = 0] = E[β0 + β1i π0 + β1i π1i + β1i vi + ui] − E[β0 + β1i π0 + β1i vi + ui]
= E[β1i π1i]

Finally, one gets,

Cov(Zi , Yi ) = p(1 − p)[E[Yi |Zi = 1] − E[Yi |Zi = 0]]


= Var(Zi )[E[Yi |Zi = 1] − E[Yi |Zi = 0]]
= σZ2 E [β1i π1i ]

Putting terms together, one gets local average treatment effect (LATE),

β̂IV = sZY/sZX = σZY/σZX + op(1) = E[β1i π1i]/E[π1i] + op(1)
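
As a quick numerical check of this LATE result, here is a minimal simulation sketch (not part of the original notes): it draws heterogeneous β1i and π1i, a randomly assigned binary Zi, and compares the sample IV ratio with E[β1i π1i]/E[π1i] and with E[β1i]. All names and parameter choices below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# heterogeneous first-stage effects and treatment effects (correlated)
pi1 = rng.uniform(0.2, 1.0, N)                     # pi_1i
beta1 = 1.0 + 2.0 * (pi1 - 0.6) + rng.normal(0, 0.5, N)   # beta_1i, Cov(beta1, pi1) > 0

Z = rng.binomial(1, 0.5, N)                        # randomly assigned instrument
v = rng.normal(0, 1, N)
u = rng.normal(0, 1, N) + 0.5 * v                  # endogeneity: u correlated with v

X = 0.3 + pi1 * Z + v                              # first stage
Y = 0.7 + beta1 * X + u                            # outcome equation

beta_iv = np.cov(Z, Y)[0, 1] / np.cov(Z, X)[0, 1]  # sample analogue s_ZY / s_ZX
late = np.mean(beta1 * pi1) / np.mean(pi1)         # E[beta_1i pi_1i] / E[pi_1i]
ate = np.mean(beta1)                               # E[beta_1i]

print(beta_iv, late, ate)   # beta_iv is close to LATE, not to ATE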

1.2 Linear Algebra


Matrices
1. Projection matrix

PX Y = Ŷ
= X(X′X)⁻¹X′Y
= X(X′X)⁻¹X′(Xβ⋆ + u)
= X(X′X)⁻¹X′Xβ⋆ + X(X′X)⁻¹X′u
= Xβ⋆ + X(X′X)⁻¹X′u
= X(β⋆ + (X′X)⁻¹X′u) = Xβ̂OLS = Ŷ

2. Residual maker matrix

MX Y = û
= (IN − PX)Y
= (IN − X(X′X)⁻¹X′)Y
= Y − X(X′X)⁻¹X′Y
= Y − Xβ̂OLS = û

MX u = (IN − PX)u
= (IN − X(X′X)⁻¹X′)u
= IN u − X(X′X)⁻¹X′u
= u − X(X′X)⁻¹X′u
= u − X(β̂OLS − β⋆)
= u + Xβ⋆ − Xβ̂OLS
= Y − Xβ̂OLS = û

3. PX and MX are symmetric


PX = X(X′X)⁻¹X′
PX′ = (X(X′X)⁻¹X′)′
= (X′)′((X′X)⁻¹)′X′
= X((X′X)′)⁻¹X′
= X(X′X)⁻¹X′ = PX

MX = IN − PX
MX′ = (IN − X(X′X)⁻¹X′)′
= IN − X(X′X)⁻¹X′
= IN − PX = MX

4. PX and MX are idempotent


PX PX = (X(X′X)⁻¹X′)(X(X′X)⁻¹X′)
= X(X′X)⁻¹(X′X)(X′X)⁻¹X′
= X(X′X)⁻¹X′
= PX

MX MX = (IN − PX)(IN − PX)
= IN IN − IN PX − PX IN + PX PX
= IN − PX − PX + PX
= IN − PX = MX

5. The rank and trace of PX and MX are identical


tr(PX) = tr(X(X′X)⁻¹X′)
= tr((X′X)⁻¹X′X)
= tr(IK) = K

tr(MX) = tr(IN − PX)
= tr(IN) − tr(PX)
= N − K

tr(PX) = rank(PX) = K
tr(MX) = rank(MX) = N − K

Remarks: since PX and MX are idempotent matrices, their eigenvalues are only 0 and 1, and the multiplicity of the eigenvalue 1 is precisely the rank. Using the spectral decomposition PX = CΛC′ with C orthogonal (C′C = I), one can also verify that

PX PX = PX
⇒ CΛC′ · CΛC′ = CΛC′
⇒ CΛΛC′ = CΛC′
⇒ C′CΛΛC′C = C′CΛC′C
⇒ ΛΛ = Λ
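
As a quick numerical sanity check of these properties, the following sketch (not from the notes; all names are illustrative) builds PX and MX for a random design matrix and verifies symmetry, idempotence, and the trace/rank results.

import numpy as np

rng = np.random.default_rng(0)
N, K = 50, 3
X = rng.normal(size=(N, K))

P = X @ np.linalg.inv(X.T @ X) @ X.T     # projection matrix P_X
M = np.eye(N) - P                        # residual maker M_X

print(np.allclose(P, P.T), np.allclose(M, M.T))       # symmetry
print(np.allclose(P @ P, P), np.allclose(M @ M, M))   # idempotence
print(np.trace(P), np.linalg.matrix_rank(P))          # both equal K
print(np.trace(M), np.linalg.matrix_rank(M))          # both equal N - K
print(np.sort(np.round(np.linalg.eigvalsh(P), 8)))    # eigenvalues are only 0s and 1s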
Chapter 2

Ordinary Least Squares

Estimator
1. β̃ := argmin_{b∈R^K} E[(Y − X′b)²]

dim X = K × 1
dim a = K × 1
dim A = K × K

∂(A′X)/∂X = ∂(X′A)/∂X = A
∂(AX)/∂X′ = A
∂(X′AX)/∂X = (A + A′)X
∂²(X′AX)/∂X∂X′ = A + A′

s(b) = E[(Y − X ′ b)2 ]


= E[Y 2 − 2Y (X ′ b) + (X ′ b)′ (X ′ b)]
= E[Y 2 − 2b′ XY + b′ XX ′ b]
= E[Y 2 ] − 2b′ E[XY ] + b′ E[XX ′ ]b

∂s(b)/∂b = −2E[XY] + 2E[XX′]b
Setting the derivative to zero at b = β̃:
−2E[XY] + 2E[XX′]β̃ = 0
β̃ = E[XX′]⁻¹E[XY]


2. β̂OLS := argmin_{b∈R^K} (Y − Xb)′(Y − Xb)

e(b) = Y − Xb
RSS(b) = e(b)′e(b)
= (Y − Xb)′(Y − Xb)
= (Y′ − b′X′)(Y − Xb)
= Y′Y − Y′Xb − b′X′Y + b′X′Xb
= Y′Y − 2b′X′Y + b′X′Xb

∂RSS(b)/∂b = −2X′Y + (X′X + (X′X)′)b
= −2X′Y + 2X′Xb
At the minimizer: −2X′Y + 2X′Xβ̂OLS = 0
β̂OLS = (X′X)⁻¹(X′Y)
= (Σ_{i=1}^N Xi Xi′)⁻¹(Σ_{i=1}^N Xi Yi)
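
A minimal NumPy implementation of β̂OLS from the normal equations (a sketch, not part of the notes; data and names are illustrative):

import numpy as np

def ols(X, Y):
    """OLS coefficients via the normal equations: (X'X)^{-1} X'Y."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

rng = np.random.default_rng(1)
N, K = 1_000, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
beta_star = np.array([1.0, 2.0, -0.5])
Y = X @ beta_star + rng.normal(size=N)

print(ols(X, Y))   # close to beta_star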

Asymptotic Distribution
1. √N(β̂OLS − β⋆) under homoskedasticity.

β⋆ = E[Xi Xi′]⁻¹E[Xi Yi]
β̂OLS = (X′X)⁻¹X′Y
= (X′X)⁻¹X′(Xβ⋆ + u)
= β⋆ + (X′X)⁻¹X′u
= β⋆ + (N⁻¹ Σ_{i=1}^N Xi Xi′)⁻¹(N⁻¹ Σ_{i=1}^N Xi ui)

√N(β̂OLS − β⋆) = (N⁻¹ Σ_{i=1}^N Xi Xi′)⁻¹(N^{−1/2} Σ_{i=1}^N Xi ui)
(N⁻¹ Σ_{i=1}^N Xi Xi′)⁻¹ →p E[Xi Xi′]⁻¹

N^{−1/2} Σ_{i=1}^N Xi ui = √N (N⁻¹ Σ_{i=1}^N Xi ui) = √N Z̄N, where Zi := Xi ui,
→d Z ∼ N(0, E[Zi Zi′]) = N(0, E[ui² Xi Xi′])

Therefore,

√N(β̂OLS − β⋆) = (E[Xi Xi′]⁻¹ + op(1)) N^{−1/2} Σ_{i=1}^N Xi ui
→d N(0, E[Xi Xi′]⁻¹ E[ui² Xi Xi′] E[Xi Xi′]⁻¹)

Under homoskedasticity,

E[ui² Xi Xi′] = E[E[ui² Xi Xi′ | X]]
= E[Xi Xi′ E[ui² | X]]
= σu² E[Xi Xi′]

Finally,

√N(β̂OLS − β⋆) →d N(0, E[Xi Xi′]⁻¹ σu² E[Xi Xi′] E[Xi Xi′]⁻¹) = N(0, σu² E[Xi Xi′]⁻¹)

2. A consistent estimator for the asymptotic variance of √N(β̂OLS − β⋆) under homoskedasticity.

Su² = (1/(N − K)) Σ_{i=1}^N ûi² = (1/(N − K)) Σ_{i=1}^N (Yi − Xi′β̂OLS)²

ûi = Yi − Xi′β̂OLS
= ui + Xi′β⋆ − Xi′β̂OLS
= ui − Xi′(β̂OLS − β⋆)
Su² = (1/(N − K)) Σ_{i=1}^N [ui − Xi′(β̂OLS − β⋆)]²
= (N/(N − K)) [N⁻¹ Σ_i ui²]
  + (β̂OLS − β⋆)′ [Σ_i Xi Xi′/(N − K)] (β̂OLS − β⋆)
  − 2(β̂OLS − β⋆)′ [Σ_i Xi ui/(N − K)]

Su² →p 1·σu² + 0′·E[Xi Xi′]·0 − 2·0′·0 = σu²

The proof of consistency ends here. However, to take it a bit further, note that û = MX u, so

û′û = u′MX u
E[û′û | X] = E[u′MX u | X]
= E[tr(MX uu′) | X]
= tr(MX E[uu′ | X])
= tr(σu² MX)
= σu² tr(MX)
= σu²(N − K)

E[s² | X] = E[û′û | X]/(N − K) = σu²(N − K)/(N − K) = σu²

σ̂u² = û′û/(N − K)
= u′MX u/(N − K)
= (1/(N − K)) [u′u − u′X(X′X)⁻¹X′u]
= (N/(N − K)) [u′u/N − (u′X/N)(X′X/N)⁻¹(X′u/N)]

X′u/N →p 0,   u′u/N →p σu²,   N/(N − K) → 1

(û′û/(N − K)) (X′X/N)⁻¹ →p σu² E[Xi Xi′]⁻¹
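
A short sketch (not from the notes; simulated data and names are illustrative) that computes s² and the homoskedastic variance estimate s²·(X′X/N)⁻¹, together with the resulting standard errors:

import numpy as np

rng = np.random.default_rng(2)
N, K = 2_000, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
beta_star = np.array([1.0, 2.0, -0.5])
u = rng.normal(scale=1.5, size=N)              # homoskedastic errors, sigma_u = 1.5
Y = X @ beta_star + u

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
u_hat = Y - X @ beta_hat

s2 = u_hat @ u_hat / (N - K)                   # estimate of sigma_u^2
avar_hat = s2 * np.linalg.inv(X.T @ X / N)     # estimate of sigma_u^2 E[X_i X_i']^{-1}
se = np.sqrt(np.diag(avar_hat) / N)            # standard errors of beta_hat

print(s2)        # close to 1.5**2 = 2.25
print(se)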
Chapter 3

Linear Regression Model

3.1 Ordinary Least Squares


Consistency

β̂OLS = (X′X)⁻¹(X′Y)
= (X′X)⁻¹X′(Xβ + e)
= (X′X)⁻¹X′Xβ + (X′X)⁻¹X′e
= β + (X′X)⁻¹X′e
= β + Op(1)·op(1)
= β + op(1)

plim β̂OLS = β + plim[(X′X/N)⁻¹(X′e/N)]
plim(X′e/N) = 0
plim(X′X/N)⁻¹ = E[Xi Xi′]⁻¹
β̂OLS →p β

Expectation and Variance

E[β̂OLS] = E[β + (X′X)⁻¹X′e]
= β + E[E[(X′X)⁻¹X′e | X]]
= β + E[(X′X)⁻¹X′E[e|X]]
= β   (since E[e|X] = 0)

Var(β̂OLS|X) = E[(β̂OLS − E[β̂OLS])(β̂OLS − E[β̂OLS])′ | X]
= E[((X′X)⁻¹X′e)((X′X)⁻¹X′e)′ | X]
= E[(X′X)⁻¹X′ee′X(X′X)⁻¹ | X]
= (X′X)⁻¹X′E[ee′|X]X(X′X)⁻¹
= (X′X)⁻¹X′ΣX(X′X)⁻¹

Compare this to any other linear unbiased estimator β̃ := CY, where dim C′ = N × K.

E[β̃|X] = E[CY|X] = CE[Y|X] = CXβ
CXβ = β ⇒ CX = IK

Var(β̃|X) = E[Ce(Ce)′|X]
= CE[ee′|X]C′
= CΣC′

Guess C⋆ from GLS,

C⋆ = (X′Σ⁻¹X)⁻¹X′Σ⁻¹
D = C − C⋆
CX = IK

Finally, the variance is,

Var(β̃|X) = CΣC′ − (X′Σ⁻¹X)⁻¹ + (X′Σ⁻¹X)⁻¹
= CΣC′ − C⋆ΣC⋆′ + (X′Σ⁻¹X)⁻¹
= (C − C⋆ + C⋆)Σ(C − C⋆ + C⋆)′ − C⋆ΣC⋆′ + (X′Σ⁻¹X)⁻¹
= DΣD′ + DΣC⋆′ + C⋆ΣD′ + C⋆ΣC⋆′ − C⋆ΣC⋆′ + (X′Σ⁻¹X)⁻¹
= DΣD′ + (X′Σ⁻¹X)⁻¹
It is important to note that

DΣC⋆′ = (C − C⋆)ΣΣ⁻¹X(X′Σ⁻¹X)⁻¹
= (CX − C⋆X)(X′Σ⁻¹X)⁻¹ = 0

To check positive semidefiniteness: q′DΣD′q ≥ 0 for every vector q, so DΣD′ is positive semidefinite and hence Var(β̃|X) ≥ (X′Σ⁻¹X)⁻¹.

3.2 Generalized Least Squares


Expectation and Variance

β̂GLS = (X̃′X̃)⁻¹X̃′Ỹ
= ((Γ^{−1/2}X)′(Γ^{−1/2}X))⁻¹(Γ^{−1/2}X)′(Γ^{−1/2}Y)
= (X′Γ^{−1/2}Γ^{−1/2}X)⁻¹(X′Γ^{−1/2}Γ^{−1/2}Y)
= (X′Γ⁻¹X)⁻¹X′Γ⁻¹Y
= (X′Γ⁻¹X)⁻¹X′Γ⁻¹(Xβ + e)
= (X′Γ⁻¹X)⁻¹X′Γ⁻¹Xβ + (X′Γ⁻¹X)⁻¹X′Γ⁻¹e
= β + (X′Γ⁻¹X)⁻¹X′Γ⁻¹e

E[β̂GLS|X] = E[β + (X′Γ⁻¹X)⁻¹X′Γ⁻¹e | X]
= β + (X′Γ⁻¹X)⁻¹X′Γ⁻¹E[e|X]
= β
Var[β̂GLS|X] = E[(β̂GLS − β)(β̂GLS − β)′ | X]
= E[(X′Γ⁻¹X)⁻¹X′Γ⁻¹ee′Γ⁻¹X(X′Γ⁻¹X)⁻¹ | X]
= (X′Γ⁻¹X)⁻¹X′Γ⁻¹E[ee′|X]Γ⁻¹X(X′Γ⁻¹X)⁻¹
= (X′Γ⁻¹X)⁻¹X′Γ⁻¹ΣΓ⁻¹X(X′Γ⁻¹X)⁻¹
= σ²(X′Γ⁻¹X)⁻¹X′Γ⁻¹ΓΓ⁻¹X(X′Γ⁻¹X)⁻¹     (writing Σ = σ²Γ)
= σ²(X′Γ⁻¹X)⁻¹(X′Γ⁻¹X)(X′Γ⁻¹X)⁻¹
= σ²(X′Γ⁻¹X)⁻¹
= (X′(σ²Γ)⁻¹X)⁻¹
= (X′Σ⁻¹X)⁻¹

It is worth noting that Var[β̃|X] ≥ Var[β̂GLS|X].
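
A sketch of the GLS estimator on simulated heteroskedastic data (not from the notes; Γ is assumed known here and all names are illustrative):

import numpy as np

def gls(X, Y, Gamma):
    """GLS: beta = (X' Gamma^{-1} X)^{-1} X' Gamma^{-1} Y."""
    Gi = np.linalg.inv(Gamma)
    return np.linalg.solve(X.T @ Gi @ X, X.T @ Gi @ Y)

rng = np.random.default_rng(3)
N = 1_000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta = np.array([1.0, 2.0])

w = 1.0 + X[:, 1] ** 2                 # Var(e_i | X) proportional to 1 + X_i2^2
Gamma = np.diag(w)                     # Sigma = sigma^2 * Gamma
e = rng.normal(size=N) * np.sqrt(w)
Y = X @ beta + e

beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
beta_gls = gls(X, Y, Gamma)
print(beta_ols, beta_gls)              # both consistent; GLS has the smaller variance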
Chapter 4

Instrumental Variables

Recall that for a general model Y = Xβ + e (or Yi = Xi′ β + ei ), one of the most important
assumptions of OLS theory is the exogeneity of the independent variables,

E (e|X) = 0

When this assumption is violated we essentially have that,


Cov(X, e) = E(X′e) − E(X)E(e) = E(X′e), i.e. E[ei Xi] ≠ 0,

and the OLS estimator loses all its advantages. In particular, it is easy to verify that since E[ei Xi] ≠ 0,

E[β̂] = β + E[(X′X)⁻¹X′e]
≈ β + E(Xi Xi′)⁻¹E(Xi ei)
≠ β

When E[Xi ei] ≠ 0 we may have an errors-in-variables problem, in which the independent variable X is measured with error. Namely, we want to estimate the model

Yi = Xi′ β + ei

but we can only observe

X̃i = Xi + ri

where ri is a K × 1 measurement error, independent of ei and Xi. The regression we perform is Y on X̃. The estimator of β is expressed as:

β̂ = (X̃′X̃)⁻¹X̃′Y
= (X′X + r′r + X′r + r′X)⁻¹(X + r)′(Xβ + e)
E[β̂|X] ≈ (X′X + r′r)⁻¹X′Xβ

Measurement error on X leads to a biased OLS estimate, biased towards zero. This is also called attenuation bias or measurement-error bias. In this case the estimator is inconsistent:

β̂ →p β(1 − E(ri²)/E(X̃i X̃i′))
Proof. Recall from the least squares estimation,
β̂ = (Σ_{i=1}^N X̃i X̃i′)⁻¹(Σ_{i=1}^N X̃i Yi)
= (Σ_i X̃i X̃i′)⁻¹ Σ_i X̃i (X̃i′β + vi)
= β + (Σ_i X̃i X̃i′)⁻¹ Σ_i X̃i vi
= β + (Σ_i X̃i X̃i′)⁻¹ Σ_i X̃i (ei − ri′β)
= β + (Σ_i X̃i X̃i′)⁻¹ Σ_i (Xi + ri)(ei − ri′β)
= β + (Σ_i X̃i X̃i′)⁻¹ Σ_i (Xi ei − Xi ri′β + ri ei − ri ri′β)

β̂ = β + (N⁻¹ Σ_i X̃i X̃i′)⁻¹ (N⁻¹ Σ_i Xi ei − N⁻¹ Σ_i Xi ri′β + N⁻¹ Σ_i ri ei − N⁻¹ Σ_i ri ri′β)
By taking probability limits, and since

N⁻¹ Σ_{i=1}^N Xi ei →p 0
N⁻¹ Σ_{i=1}^N Xi ri′β →p 0
N⁻¹ Σ_{i=1}^N ri ei →p 0
N⁻¹ Σ_{i=1}^N X̃i X̃i′ →p E[X̃i X̃i′]
N⁻¹ Σ_{i=1}^N ri ri′β →p E[ri ri′]β

Notice that,

plim (N⁻¹ Σ_i X̃i X̃i′)⁻¹ = (E[Xi Xi′] + E[ri ri′])⁻¹
plim (N⁻¹ Σ_i X̃i ei) = E[Xi ei] + E[ri ei] = 0
plim (N⁻¹ Σ_i X̃i ri′) = E[Xi ri′] + E[ri ri′] = E[ri ri′]

plim β̂ = β + (E[Xi Xi′] + E[ri ri′])⁻¹(−E[ri ri′]β)

In the scalar case,

plim β̂ = β + (E[Xi²] + E[ri²])⁻¹(−E[ri²]β)
= β(1 − E[ri²]/(E[Xi²] + E[ri²]))

E[X̃i X̃i′] is positive definite,

E[X̃i X̃i′] = E[(Xi + ri)(Xi + ri)′]
= E[Xi Xi′] + E[Xi ri′] + E[ri Xi′] + E[ri ri′]
= E[Xi Xi′] + E[ri ri′]
Therefore,

β̂ →p β(1 − E(ri²)/E(X̃i X̃i′))
→p β(1 − σr²/(σx² + σr²))
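
A minimal simulation of attenuation bias in the scalar case (a sketch, not from the notes; σx and σr values are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(4)
N = 100_000
beta = 2.0
sigma_x, sigma_r = 1.0, 0.5

X = rng.normal(0, sigma_x, N)
r = rng.normal(0, sigma_r, N)              # measurement error
e = rng.normal(0, 1, N)

Y = beta * X + e
X_tilde = X + r                            # observed, mismeasured regressor

beta_hat = (X_tilde @ Y) / (X_tilde @ X_tilde)
shrink = 1 - sigma_r**2 / (sigma_x**2 + sigma_r**2)
print(beta_hat, beta * shrink)             # beta_hat approximately 2 * 0.8 = 1.6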

4.1 Estimator
General theory: a consistent estimator of β for the general model Y = Xβ + e, when E[X′e] ≠ 0, can be obtained if we can find a matrix of instruments Z of order N × L, with L ≥ K (at least as many instruments as regressors), such that:

1. The variables in Z are correlated with those in X, and Z′X/N →p ΣZX, finite and of full column rank.

2. Z ′ e/N →p 0

Idea 4.1.1. By projecting (regressing) X on Z, hence creating X̂, we are taking away the
share of X related to e, making β̂IV consistent!

Xi = π ′ Zi + vi

such that π = E(Zi Zi′)⁻¹E(Zi Xi′), implying E(Zi vi′) = 0. The reduced form for Xi can be
plugged into the original regression:

Yi = Xi′ β + ei
( )′
= π ′ Zi + vi β + ei
= Zi′ λ + wi

where λ := πβ and wi := vi′β + ei. From the two regressions above,

π = E(Zi Zi′)⁻¹E(Zi Xi′)
λ = E(Zi Zi′)⁻¹E(Zi Yi)

4.1.1 Case 1: L = K
Recall a useful theorem from Linear Algebra
Theorem 4.1.2. The linear system Ax = b has a solution if and only if its augmented
matrix and coefficient matrix have the same rank.
That means the augmented matrix (π λ) has full rank K, so there exists a unique solution for β:

β = π⁻¹λ
= [E(Zi Zi′)⁻¹E(Zi Xi′)]⁻¹E(Zi Zi′)⁻¹E(Zi Yi)
= E(Zi Xi′)⁻¹E(Zi Zi′)E(Zi Zi′)⁻¹E(Zi Yi)
= E(Zi Xi′)⁻¹E(Zi Yi)
Applying the analogy principle delivers the estimator
β̂IV = (Σ_{i=1}^N Zi Xi′)⁻¹(Σ_{i=1}^N Zi Yi)
= (Σ_i Zi Xi′)⁻¹ Σ_i Zi (Xi′β + ei)
= (Σ_i Zi Xi′)⁻¹(Σ_i Zi Xi′)β + (Σ_i Zi Xi′)⁻¹ Σ_i Zi ei
= β + (Σ_i Zi Xi′)⁻¹ Σ_i Zi ei

General Form
To derive the IV estimator we start from the classic regression setup
Y = Xβ + e

and pre-multiply it by Z (Z ′ Z)−1 Z ′ obtaining


Z(Z′Z)⁻¹Z′Y = Z(Z′Z)⁻¹Z′Xβ + Z(Z′Z)⁻¹Z′e

Denote each term as,

Ŷ = Z(Z′Z)⁻¹Z′Y
X̂ = Z(Z′Z)⁻¹Z′X
ê = Z(Z′Z)⁻¹Z′e
such that,

Ŷ = X̂β + ê

The instrumental variable estimator can then be obtained by applying OLS to this modified model:

β̂IV = (X̂′X̂)⁻¹X̂′Ŷ
= [(Z(Z′Z)⁻¹Z′X)′(Z(Z′Z)⁻¹Z′X)]⁻¹(Z(Z′Z)⁻¹Z′X)′(Z(Z′Z)⁻¹Z′Y)
= [X′Z(Z′Z)⁻¹Z′Z(Z′Z)⁻¹Z′X]⁻¹X′Z(Z′Z)⁻¹Z′Z(Z′Z)⁻¹Z′Y
= [X′Z(Z′Z)⁻¹Z′X]⁻¹X′Z(Z′Z)⁻¹Z′Y

This is how the IV estimator looks in general. If the number of instruments L equals the number of regressors K (i.e. L = K), then the product Z′X is a square matrix of dimension K × K (or L × L), which is non-singular (i.e. invertible). Therefore, we can rewrite the term in square brackets as

β̂IV = [X′Z(Z′Z)⁻¹Z′X]⁻¹X′Z(Z′Z)⁻¹Z′Y
= (Z′X)⁻¹(Z′Z)(X′Z)⁻¹X′Z(Z′Z)⁻¹Z′Y
= (Z′X)⁻¹(Z′Z)(Z′Z)⁻¹Z′Y
= (Z′X)⁻¹Z′Y
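
A sketch of the just-identified IV estimator β̂IV = (Z′X)⁻¹Z′Y on simulated data with one endogenous regressor and one excluded instrument (not from the notes; all names and parameters are illustrative):

import numpy as np

rng = np.random.default_rng(5)
N = 50_000
beta = np.array([1.0, 2.0])              # intercept and slope

z1 = rng.normal(size=N)                  # excluded instrument
v = rng.normal(size=N)
e = rng.normal(size=N) + 0.8 * v         # endogeneity through v

x = 0.5 + 1.0 * z1 + v                   # first stage
X = np.column_stack([np.ones(N), x])
Z = np.column_stack([np.ones(N), z1])    # instrument matrix, L = K = 2
Y = X @ beta + e

beta_iv = np.linalg.solve(Z.T @ X, Z.T @ Y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_iv)    # close to (1, 2)
print(beta_ols)   # slope biased by the endogeneity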

Consistency
The estimator is consistent:

β̂IV = β + (N⁻¹ Σ_{i=1}^N Zi Xi′)⁻¹(N⁻¹ Σ_{i=1}^N Zi ei)

By taking probability limits,

(N⁻¹ Σ_i Zi Xi′)⁻¹ →p E[Zi Xi′]⁻¹
N⁻¹ Σ_i Zi ei →p 0
β̂IV →p β
Asymptotic Distribution of √N(β̂IV − β)

β̂IV = β + (N⁻¹ Σ_{i=1}^N Zi Xi′)⁻¹(N⁻¹ Σ_{i=1}^N Zi ei)
√N(β̂IV − β) = (N⁻¹ Σ_i Zi Xi′)⁻¹(N^{−1/2} Σ_i Zi ei)

Consider each term separately,

(N⁻¹ Σ_i Zi Xi′)⁻¹ →p E[Zi Xi′]⁻¹
N^{−1/2} Σ_i Zi ei →d N(0, E[ei² Zi Zi′])

Therefore,

√N(β̂IV − β) →d N(0, Ω)

where Ω = E(Zi Xi′)⁻¹E(ei² Zi Zi′)E(Xi Zi′)⁻¹.

4.1.2 Case 2: L ≥ K
Recall from the general form that

β̂2SLS = (X̂′X̂)⁻¹X̂′Ŷ
= [(Z(Z′Z)⁻¹Z′X)′(Z(Z′Z)⁻¹Z′X)]⁻¹(Z(Z′Z)⁻¹Z′X)′(Z(Z′Z)⁻¹Z′Y)
= [X′Z(Z′Z)⁻¹Z′X]⁻¹X′Z(Z′Z)⁻¹Z′Y

such that it can also be written as

β̂2SLS = (Σ_i Xi Zi′ (Σ_i Zi Zi′)⁻¹ Σ_i Zi Xi′)⁻¹ (Σ_i Xi Zi′ (Σ_i Zi Zi′)⁻¹ Σ_i Zi Yi)

bP = (Σ_{i=1}^N (P Zi) Xi′)⁻¹ (Σ_{i=1}^N (P Zi) Yi)

Different choices for the matrix P result in different estimators. For example, the simple IV
estimator for the exactly identified case simply sets P = I. It can be shown that another
choice, namely P = P ∗ := E (Xi Zi′ ) E (Zi Zi′ )−1 , results in an estimator, bP ∗ , with minimal
asymptotic variance. Notice, however, that bP ∗ is an infeasible estimator because you do
not observe P ∗ . Replace P ∗ by P̂ = P ∗ + op (1) resulting in the feasible estimator bP̂ .
bP̂ = (Σ_{i=1}^N P̂ Zi Xi′)⁻¹(Σ_{i=1}^N P̂ Zi Yi)
= (Σ_i P̂ Zi Xi′)⁻¹ Σ_i P̂ Zi (Xi′β + ei)
= β + (Σ_i P̂ Zi Xi′)⁻¹(Σ_i P̂ Zi ei)
= β + (N⁻¹ Σ_i P̂ Zi Xi′)⁻¹(N⁻¹ Σ_i P̂ Zi ei)

Consistency

bP̂ = β + (Op (1) · Op (1))−1 OP (1) · op (1)


= β + Op (1) · op (1)
= β + op (1)
p
Therefore, bP̂ −→ β.
Asymptotic Distribution of √N(bP̂ − β)

√N(bP̂ − β) = (N⁻¹ Σ_{i=1}^N P̂ Zi Xi′)⁻¹ (N^{−1/2} Σ_{i=1}^N P̂ Zi ei)
Consider each term separately,

plim (N⁻¹ Σ_i P̂ Zi Xi′)⁻¹ P̂ = (P∗ E[Zi Xi′])⁻¹P∗ = (P∗ CZX)⁻¹P∗

N^{−1/2} Σ_i Zi ei →d N(0, E[Zi ei ei′ Zi′]) = N(0, E[ei² Zi Zi′])

which under homoskedasticity becomes N(0, σe² E[Zi Zi′]) = N(0, σe² CZZ⁻¹).

To sum up,

√N(bP̂ − β) →d (CXZ CZZ CZX)⁻¹CXZ CZZ · N(0, E[ei² Zi Zi′])

where CXZ := E(Xi Zi′), CZX := E(Zi Xi′), and CZZ := E(Zi Zi′)⁻¹. To simplify the notation,

√N(bP̂ − β) →d N(0, AVA′)

where

A = (CXZ CZZ CZX)⁻¹CXZ CZZ = (P∗ CZX)⁻¹P∗
V = E[ei² Zi Zi′]

Asymptotic Variance Under Homoskedasticity


We assume that E(ei² Zi Zi′) = σe² E(Zi Zi′). Then

avar(√N(bP̂ − β)) = AVA′
= (CXZ CZZ CZX)⁻¹CXZ CZZ E[ei² Zi Zi′] ((CXZ CZZ CZX)⁻¹CXZ CZZ)′
= σe²(P∗ CZX)⁻¹P∗ CZZ⁻¹ CZZ CZX (P∗ CZX)⁻¹
= σe²(P∗ CZX)⁻¹(P∗ CZX)(P∗ CZX)⁻¹
= σe²(P∗ CZX)⁻¹
Good Estimator for P∗

Recall that

P∗ = E[Xi Zi′]E[Zi Zi′]⁻¹

By the analogy principle,

P̂ = (Σ_{i=1}^N Xi Zi′)(Σ_{i=1}^N Zi Zi′)⁻¹ = (N⁻¹ Σ_i Xi Zi′)(N⁻¹ Σ_i Zi Zi′)⁻¹

To test its consistency, we take the probability limit,

N⁻¹ Σ_i Xi Zi′ →p E[Xi Zi′]
(N⁻¹ Σ_i Zi Zi′)⁻¹ →p E[Zi Zi′]⁻¹

which is equivalent to

N⁻¹ Σ_i Xi Zi′ = E[Xi Zi′] + op(1)
(N⁻¹ Σ_i Zi Zi′)⁻¹ = E[Zi Zi′]⁻¹ + op(1)

Plugging these terms back in,

P̂ = (E[Xi Zi′] + op(1))(E[Zi Zi′]⁻¹ + op(1))
= E[Xi Zi′]E[Zi Zi′]⁻¹ + E[Xi Zi′]op(1) + op(1)E[Zi Zi′]⁻¹ + op(1)op(1)
= E[Xi Zi′]E[Zi Zi′]⁻¹ + op(1)
= P∗ + op(1)
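
For the overidentified case, here is a sketch of β̂2SLS = (X′PZX)⁻¹X′PZY with L = 3 > K = 2 (intercept plus one endogenous slope, two excluded instruments); the data-generating choices are illustrative assumptions, not from the notes.

import numpy as np

rng = np.random.default_rng(6)
N = 50_000
beta = np.array([1.0, 2.0])

z1, z2 = rng.normal(size=N), rng.normal(size=N)
v = rng.normal(size=N)
e = rng.normal(size=N) + 0.8 * v

x = 0.5 + 0.7 * z1 - 0.4 * z2 + v              # first stage
X = np.column_stack([np.ones(N), x])           # K = 2
Z = np.column_stack([np.ones(N), z1, z2])      # L = 3 > K

PZ_X = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)   # P_Z X = Z (Z'Z)^{-1} Z'X
Y = X @ beta + e

beta_2sls = np.linalg.solve(PZ_X.T @ X, PZ_X.T @ Y)
print(beta_2sls)    # close to (1, 2)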

4.2 Application: Partitioned Regression


The linear model under endogeneity is
Y = Xβ + e
X = Zπ + v

where E (ei Xi ) ̸= 0 and E (ei Zi ) = 0. Notice dim X = N × K, dim β = K × 1, dim Z =


N × L, dim π = L × K, and dim v = N × K.

The source of the endogeneity is correlation between the two error terms, write

e = vρ + w

where E (vi wi ) = 0. Notice dim ρ = K × 1, and dim w = N × 1.

Combining, we obtain

Y = Xβ + vρ + w

You have available an iid data set (Xi , Yi , vi ),


 
Y = [X  v] (β′, ρ′)′ + w

Normal equations:

[ X′X  X′v ] [ β̂ ]   [ X′Y ]
[ v′X  v′v ] [ ρ̂ ] = [ v′Y ]

We have

X′Xβ̂ + X′vρ̂ = X′Y
v′Xβ̂ + v′vρ̂ = v′Y

and by rearranging,

β̂ = (X′X)⁻¹X′(Y − vρ̂)
ρ̂ = (v′v)⁻¹v′(Y − Xβ̂)

Since Pv = v(v′v)⁻¹v′ and Mv = I − Pv,

β̂ = (X′X)⁻¹X′(Y − v(v′v)⁻¹v′(Y − Xβ̂))
β̂OLS = (X′Mv X)⁻¹X′Mv Y
4.2.1 Consistency

β̂OLS = (X′Mv X)⁻¹X′Mv Y
= (X′Mv X)⁻¹X′Mv (Xβ + vρ + w)
= β + 0 + (X′Mv X)⁻¹X′Mv w

Notice that

Mv v = (I − v(v′v)⁻¹v′)v = 0

Rewriting,

β̂OLS − β = (X′Mv X)⁻¹X′Mv w
= (N⁻¹X′Mv X)⁻¹(N⁻¹X′Mv w)

Consider each term:

N⁻¹X′Mv X = N⁻¹X′(I − v(v′v)⁻¹v′)X
= N⁻¹(X′X − X′v(v′v)⁻¹v′X)
= X′X/N − (X′v/N)(v′v/N)⁻¹(v′X/N)
= (N⁻¹Σ_i Xi Xi′) − (N⁻¹Σ_i Xi vi′)(N⁻¹Σ_i vi vi′)⁻¹(N⁻¹Σ_i vi Xi′)
= Op(1) − Op(1)·Op(1)⁻¹·Op(1)
= Op(1)

N⁻¹X′Mv w = N⁻¹X′(I − v(v′v)⁻¹v′)w
= X′w/N − (X′v/N)(v′v/N)⁻¹(v′w/N)
= (N⁻¹Σ_i Xi wi) − (N⁻¹Σ_i Xi vi′)(N⁻¹Σ_i vi vi′)⁻¹(N⁻¹Σ_i vi wi)
= op(1) − Op(1)·Op(1)⁻¹·op(1)
= op(1)

Therefore,

β̂OLS − β = Op(1)·op(1) = op(1)
β̂OLS →p β
4.2.2 A Plot Twist

You do not have available an iid data set (Xi, Yi, vi). Instead, you have available an iid data set (Xi, Yi, Zi). You cannot run a regression of Y on X and v, but you can instead run a regression of Y on X and v̂, where v̂ is the first-stage residual:

β̂OLS = (X′Mv̂ X)⁻¹X′Mv̂ Y

Recall that

Pv̂ = v̂(v̂′v̂)⁻¹v̂′
X = Zπ + v
v̂ = X − Zπ̂
= X − Z(Z′Z)⁻¹Z′X
= X − PZ X
= (I − PZ)X
= MZ X
Pv̂ = MZ X(X′MZ′MZ X)⁻¹X′MZ′
= MZ X(X′MZ X)⁻¹X′MZ

Remember that Pv̂ X = MZ X, Mv̂ X = PZ X, and PZ = Z(Z′Z)⁻¹Z′. Hence

β̂OLS = (X′PZ X)⁻¹(X′PZ Y)
β̂2SLS = (X′Z(Z′Z)⁻¹Z′X)⁻¹(X′Z(Z′Z)⁻¹Z′Y)

Remark: the estimate derived in the former case is more precise but since you cannot
observe vi in reality, the estimate from the latter case is more practical.
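
The equivalence above can be checked numerically: regressing Y on X and the first-stage residual v̂ returns, for the coefficients on X, exactly (X′PZX)⁻¹X′PZY. A sketch on simulated data (names and parameters are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(7)
N = 20_000
z = rng.normal(size=N)
v = rng.normal(size=N)
e = rng.normal(size=N) + 0.8 * v

x = 0.5 + 1.0 * z + v
X = np.column_stack([np.ones(N), x])
Z = np.column_stack([np.ones(N), z])
Y = X @ np.array([1.0, 2.0]) + e

# 2SLS: (X'P_Z X)^{-1} X'P_Z Y
PZX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
beta_2sls = np.linalg.solve(PZX.T @ X, PZX.T @ Y)

# regression of Y on X and the first-stage residual v_hat
v_hat = x - Z @ np.linalg.solve(Z.T @ Z, Z.T @ x)
W = np.column_stack([X, v_hat])
coef = np.linalg.solve(W.T @ W, W.T @ Y)

print(beta_2sls, coef[:2])   # the coefficients on X coincide with 2SLS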
Chapter 5

Maximum Likelihood Estimation

5.1 Normal Distribution with Unit Variance


You have Y1, …, YN iid with pdf

fY(y|µ) = (1/√(2π)) exp(−(1/2)(y − µ)²)

The likelihood function

L(µ|y) = (1/√(2π)) e^{−(1/2)(y1 − µ)²} · … · (1/√(2π)) e^{−(1/2)(yN − µ)²}

The log likelihood function

L(µ|y) = ln[(1/√(2π)) e^{−(1/2)(y1 − µ)²}] + … + ln[(1/√(2π)) e^{−(1/2)(yN − µ)²}]
= [ln(1/√(2π)) − (1/2)(y1 − µ)²] + … + [ln(1/√(2π)) − (1/2)(yN − µ)²]
= Σ_{i=1}^N [ln(1/√(2π)) − (1/2)(yi − µ)²]
= N ln(1/√(2π)) − (1/2) Σ_{i=1}^N (yi − µ)²

Differentiating with respect to µ,

∂L/∂µ (µ|y1, …, yN) = ∂/∂µ [N ln(1/√(2π)) − (1/2) Σ_{i=1}^N (yi − µ)²]
= Σ_{i=1}^N (yi − µ)

Setting the derivative to zero,

0 = Σ_{i=1}^N (yi − µ̂ML) = Σ_{i=1}^N yi − Nµ̂ML
µ̂ML = (Σ_{i=1}^N yi)/N = ȳ
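
As a quick numerical confirmation (a sketch, not from the notes), the log likelihood above can be evaluated on a grid of µ values; its maximizer is approximately the sample mean.

import numpy as np

rng = np.random.default_rng(8)
y = rng.normal(loc=1.7, scale=1.0, size=500)    # unit-variance sample

def loglik(mu, y):
    # N ln(1/sqrt(2*pi)) - 0.5 * sum((y_i - mu)^2)
    return len(y) * np.log(1 / np.sqrt(2 * np.pi)) - 0.5 * np.sum((y - mu) ** 2)

grid = np.linspace(0.0, 3.0, 3001)
values = np.array([loglik(m, y) for m in grid])
print(grid[values.argmax()], y.mean())          # approximately equal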

5.1.1 Expectation

E[µ̂ML] = E[(1/N)(y1 + … + yN)]
= (1/N)(E[y1] + … + E[yN])
= (1/N)·Nµ = µ

5.1.2 Variance

Var(µ̂ML) = Var[(1/N) Σ_{i=1}^N yi]
= (1/N²)(Var(y1) + … + Var(yN))
= (1/N²)·N = 1/N

5.1.3 Score function

S(y|µ) = ∂ ln fY(y|µ)/∂µ = y − µ

5.1.4 Fisher information

I(µ) = E[S(y|µ)²] = Var(S(y|µ)) = Var(y − µ) = 1

The Cramér-Rao bound for an unbiased estimator T(Y1, …, YN) is

Var(T(Y1, …, YN)) ≥ 1/(N I(µ)) = 1/N

To confirm the information equality,

∂S(y|µ)/∂µ = −1
−E[∂S(y|µ)/∂µ] = 1 = I(µ)

The information equality holds. Since E[µ̂ML] = µ and Var(µ̂ML) = 1/N, the ML estimator is unbiased and attains the Cramér-Rao bound.

5.1.5 Decomposition

(1/N) Σ_{i=1}^N S(yi|µ) = a(µ)·(T(Y1, …, YN) − µ)

(1/N) Σ_{i=1}^N S(yi|µ) = (1/N)[(y1 − µ) + … + (yN − µ)]
= (1/N) Σ_{i=1}^N yi − (1/N)Nµ
= (1/N) Σ_{i=1}^N yi − µ
= 1·((1/N) Σ_{i=1}^N yi − µ)

Therefore, a(µ) = 1, which is the Fisher information!

5.1.6 Asymptotic distribution of µ̂ML

√N(µ̂ML − µ) →d N(0, 1)
so that, approximately for large N,
µ̂ML ≈ N(µ, 1/N)
5.2 Normal Regression Model

Yi = Xi′β + ei
ei|Xi ∼ N(0, σe²)

The novelty here is that the errors are assumed to have a normal distribution. The unknown parameters are β ∈ R^K and σe². Notice that the above normal regression model can be regarded, equivalently, as a statement about the density of Yi given Xi. That conditional density is

fY(y|x, β, σe²) = (1/√(2πσe²)) exp(−(1/(2σe²))(y − x′β)²)

You have available a random sample (Xi, Yi), where the Yi are iid with pdf fY(y|x).

The likelihood function

L(β, σe²|x, y) = (1/√(2πσe²)) exp(−(1/(2σe²))(y1 − x1′β)²) · … · (1/√(2πσe²)) exp(−(1/(2σe²))(yN − xN′β)²)
= (1/(σe^N (2π)^{N/2})) exp(−(1/(2σe²)) Σ_{i=1}^N (yi − xi′β)²)

The log likelihood function

L(β, σe²|x, y) = −N log(σe) − (N/2) log(2π) − (1/(2σe²)) Σ_{i=1}^N (yi − xi′β)²
= −(N/2) log(σe²) − (N/2) log(2π) − (1/(2σe²)) Σ_{i=1}^N (yi − xi′β)²

5.2.1 Estimators

∂L/∂σe² = −N/(2σe²) + (1/(2σe⁴)) Σ_{i=1}^N (yi − xi′β)² = 0

σ̂e²ML = (1/N) Σ_{i=1}^N (yi − xi′β̂ML)² = (1/N) Σ_{i=1}^N êi²
∂L/∂β = −(2/(2σe²)) Σ_{i=1}^N (yi − xi′β)(−xi) = 0

Σ_{i=1}^N xi yi − Σ_{i=1}^N xi xi′β̂ML = 0

β̂ML = (Σ_{i=1}^N xi xi′)⁻¹(Σ_{i=1}^N xi yi)

5.2.2 Score Vector

∂ ln fY/∂β (y|x, β, σe²) = (1/σe²) xi (yi − xi′β)
∂ ln fY/∂σe² (y|x, β, σe²) = −1/(2σe²) + (1/(2σe⁴))(yi − xi′β)²

S(y|x, β, σe²) := [ (1/σe²) xi (yi − xi′β)
                    −1/(2σe²) + (1/(2σe⁴))(yi − xi′β)² ]

5.2.3 Hessian matrix

∂² ln fY/∂β∂β′ = −(1/σe²) xi xi′
∂² ln fY/∂(σe²)² = 1/(2σe⁴) − (1/σe⁶)(yi − xi′β)²
∂² ln fY/∂σe²∂β′ = −(1/σe⁴)(yi − xi′β) xi′
∂² ln fY/∂β∂σe² = −(1/σe⁴) xi (yi − xi′β)

H(x, y) = [ −(1/σe²) xi xi′      −(1/σe⁴) xi ei
            −(1/σe⁴) ei xi′      1/(2σe⁴) − (1/σe⁶) ei² ]

Notice that ∂² ln fY/∂β∂σe² = (∂² ln fY/∂σe²∂β′)′.
5.2.4 Fisher information

Since I(β, σe²) = −E(H(Xi, Yi)),

I(β, σe²) = [ (1/σe²) E(Xi Xi′)     0
              0                    1/(2σe⁴) ]

5.2.5 Asymptotic distribution

    [ β̂ML − β      ]        ( [0]   [ σe² E(Xi Xi′)⁻¹     0    ] )
√N  [ σ̂e²ML − σe²  ]  →d  N ( [0] , [ 0                 2σe⁴   ] )
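
A numerical cross-check (a sketch, not from the notes): in the normal regression model the closed-form β̂ML coincides with OLS, and σ̂e²ML divides by N rather than by N − K. Simulated data and names below are illustrative.

import numpy as np

rng = np.random.default_rng(9)
N, K = 500, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
beta = np.array([1.0, -0.5, 2.0])
sigma_e = 1.3
Y = X @ beta + rng.normal(scale=sigma_e, size=N)

beta_ml = np.linalg.solve(X.T @ X, X.T @ Y)     # identical to the OLS estimator
e_hat = Y - X @ beta_ml
sigma2_ml = e_hat @ e_hat / N                   # ML estimator (divides by N)
s2 = e_hat @ e_hat / (N - K)                    # unbiased estimator (divides by N - K)

print(beta_ml)
print(sigma2_ml, s2, sigma_e**2)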
Appendix A

Probability Theory

A.1 Moments
Definition A.1.1. (Expectation) Let X be a continuous random variable with density
f (x). Then, the expected value of X, denoted by E[X], is defined to be
∫ ∞
E[X] = xf (x)dx
−∞

if the integral is absolutely convergent. The expected value does not exist when both of the following hold:
∫ 0
xf (x)dx = −∞
−∞
∫ ∞
xf (x)dx = ∞
0

Definition A.1.2. (Variance) The variance of X measures the expected square of the
deviation of X from its expected value
[ ]
Var(X) = E (X − E[X])2

Definition A.1.3. (Covariance) The covariance of any two random variables X and Y ,
denoted by Cov(X, Y ), is defined by

Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])]


= E[XY − Y E[X] − XE[Y ] + E[X]E[Y ]]
= E[XY ] − E[Y ]E[X] − E[X]E[Y ] + E[X]E[Y ]
= E[XY ] − E[X]E[Y ]


Properties of Covariance
For any random variables X, Y, Z and constant c ∈ R,

1. Cov(X, X) = Var(X)

2. Cov(X, Y ) = Cov(Y, X)

3. Cov(cX, Y ) = c Cov(X, Y )

4. Cov(X, Y + Z) = Cov(X, Y ) + Cov(X, Z)

Cov(X, Y + Z) = E[X(Y + Z)] − E[X]E[Y + Z]


= E[XY ] − E[X]E[Y ] + E[XZ] − E[X]E[Z]
= Cov(X, Y ) + Cov(X, Z)

A.2 Conditional Expectation


Definition A.2.1. If X and Y are discrete random variables, then the conditional expec-
tation of X given that Y = y is defined by

E[X|Y = y] = Σ_x x P{X = x | Y = y}
= Σ_x x pX|Y(x|y)

Definition A.2.2. Let us denote by E[X|Y ] that function of the random variable Y whose
value at Y = y is E[X|Y = y]. Note that E[X|Y ] is itself a random variable. An extremely
important property of conditional expectation is that for all random variables X and Y

E[X] = E[E[X|Y ]]

If Y is a discrete random variable, then,



E[X] = Σ_y E[X|Y = y] P{Y = y}
Proof.

Σ_y E[X|Y = y] P{Y = y} = Σ_y Σ_x x P{X = x, Y = y}/P{Y = y} · P{Y = y}
= Σ_y Σ_x x P{X = x, Y = y}
= Σ_x x Σ_y P{X = x, Y = y}
= Σ_x x P{X = x}
= E[X]

One way to understand the proof is to interpret it as follows. It states that to calculate
E[X] we may take a weighted average of the conditional expected value of X given that
Y = y, each of the terms E[X|Y = y] being weighted by the probability of the event on
which it is conditioned.

Remark: the usefulness of above representation is illustrated in section 1.1

A.3 Convergence of Random Variables


Convergence in probability
Definition A.3.1. (a) We say that a sequence of random variables Xn (not necessarily
defined on the same probability space) converges in probability to a real number c, and
p
write Xn → c, if

lim P (|Xn − c| ≥ ϵ) = 0, ∀ϵ > 0


n→∞

(b) Suppose that X and Xn , n ∈ N are all defined on the same probability space. We say
p
that the sequence Xn converges to X, in probability, and write Xn → X, if Xn − X
converges to zero, in probability, i.e.,

lim P (|Xn − X| ≥ ϵ) = 0, ∀ϵ > 0


n→∞
When X in part (b) of the definition is deterministic, say equal to some constant c, then the two parts of the above definition are consistent with each other.

The intuitive content of the statement Xn →p c is that in the limit as n increases, almost all of the probability mass becomes concentrated in a small interval around c, no matter how small this interval is. On the other hand, for any fixed n, there can be a small probability mass outside this interval, with a slowly decaying tail. Such a tail can have a strong impact on expected values. For this reason, convergence in probability does not have any implications for expected values: one can have Xn →p X while E[Xn] does not converge to E[X].

Convergence in distribution
Definition A.3.2. Let X and Xn , n ∈ N, be random variables with CDFs F and Fn ,
respectively. We say that the sequence Xn converges to X in distribution, and write
d
Xn → X, if

lim Fn (x) = F (x)


n→∞

for every x ∈ R at which F is continuous.


Convergence in probability is the stronger notion of convergence: Xn →p X implies Xn →d X, but in general not conversely. For a constant c, however, Xn →d c does imply Xn →p c.

For convenience, we write op for convergence to zero and Op for boundedness in probability. Let xn be a sequence of non-negative real-valued random variables.

1. xn = op(1) means xn →p 0 as n grows.

2. xn = op(bn) for a non-negative sequence {bn} means xn/bn = op(1).

3. xn = op(yn) for a sequence of non-negative random variables {yn} means xn/yn = op(1).

Furthermore,

1. xn = Op(1) means {xn} is bounded in probability; i.e. for any ϵ > 0, ∃bϵ > 0 such that supn P(xn > bϵ) ≤ ϵ.

2. xn = Op(bn) means xn/bn = Op(1).

3. xn = Op(yn) means xn/yn = Op(1).
Slutsky's Theorem

Let {Xn}, {Yn} be sequences of scalar/vector/matrix random elements. If Xn →d X and Yn →p c, then,

1. Xn + Yn →d X + c

2. Xn Yn →d cX

3. Xn/Yn →d X/c (provided c ≠ 0)
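
A small simulation (sketch, not from the notes) of the CLT combined with Slutsky's theorem: the studentized mean, which replaces the unknown σ by a consistent estimate, is still approximately standard normal. Sample sizes and distributions below are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(10)
N, R = 500, 20_000                            # sample size, Monte Carlo replications
Z = rng.exponential(scale=1.0, size=(R, N))   # mean 1, variance 1, non-normal

Zbar = Z.mean(axis=1)
S = Z.std(axis=1, ddof=1)                     # consistent estimator of sigma = 1

T = np.sqrt(N) * (Zbar - 1.0) / S             # CLT: sqrt(N)(Zbar - mu) -> N(0, 1);
                                              # Slutsky: dividing by S (-> 1) preserves the limit
print(T.mean(), T.std())                      # approximately 0 and 1
print(np.mean(np.abs(T) > 1.96))              # approximately 0.05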
Appendix B

Linear Algebra

B.1 Inner Product


The prototypical inner product is the familiar dot product

v · w = v1 w1 + v2 w2 + ⋯ + vn wn = Σ_{i=1}^n vi wi

between (column) vectors v = (v1, v2, …, vn)′ and w = (w1, w2, …, wn)′, both lying in the Euclidean space R^n; it equals the matrix product v′w of the row vector v′ and the column vector w. The dot product of a vector with itself, v · v = v1² + v2² + ⋯ + vn², is the sum of the squares of its entries and hence, by the Pythagorean Theorem, equals the square of its length. Consequently, the Euclidean norm or length of a vector is ∥v∥ = √(v · v) = √(v1² + ⋯ + vn²). Every nonzero vector v ≠ 0 has positive Euclidean norm, ∥v∥ > 0, while only the zero vector has zero norm: ∥v∥ = 0 if and only if v = 0.

Definition B.1.1. An inner product on the real vector space V is a pairing that takes two vectors v, w ∈ V and produces a real number ⟨v, w⟩ ∈ R. The inner product is required to satisfy the following three axioms for all u, v, w ∈ V, and scalars c, d ∈ R.

(i) Bilinearity

⟨cu + dv, w⟩ = c⟨u, w⟩ + d⟨v, w⟩,


⟨u, cv + dw⟩ = c⟨u, v⟩ + d⟨u, w⟩

(ii) Symmetry
⟨v, w⟩ = ⟨w, v⟩

(iii) Positivity
⟨v, v⟩ > 0 whenever v ̸= 0, while ⟨0, 0⟩ = 0

Given an inner product, the associated norm of a vector v ∈ V is defined as the positive
square root of the inner product of the vector with itself.

∥v∥ = √⟨v, v⟩

B.2 Orthogonality
Orthogonal and Orthonormal Bases
Definition B.2.1. A basis u1, …, un of an n-dimensional inner product space V is called orthogonal if ⟨ui, uj⟩ = 0 for all i ≠ j. The basis is called orthonormal if, in addition, each vector has unit length: ∥ui∥ = 1 for all i = 1, …, n.
Proposition B.2.2. Let v1 , . . . , vk ∈ V be nonzero, mutually orthogonal elements, so
vi ̸= 0 and ⟨vi , vj ⟩ = 0 for all i ̸= j. Then, v1 , . . . , vk are linearly independent.
Lemma B.2.3. If v1 , . . . , vn is an orthogonal basis of a vector space V , then the normalized
vectors ui = vi / ∥vi ∥ , i = 1, . . . , n, form an orthonormal basis
Theorem B.2.4. Let u1 , . . . , un be an orthonormal basis for an inner product space V .
Then one can write any element v ∈ V as a linear combination
v = c1 u1 + · · · + cn un
where its coordinates
ci = ⟨v, ui ⟩ , i = 1, . . . , n
are explicitly given as inner products. Moreover, its norm is given by the Pythagorean
formula
∥v∥ = √(c1² + ⋯ + cn²) = √(Σ_{i=1}^n ⟨v, ui⟩²)

namely, the square root of the sum of the squares of its orthonormal basis coordinates.

Definition B.2.5. A square matrix Q is called orthogonal if it satisfies

QT Q = QQT = I

The orthogonality condition implies that one can easily invert an orthogonal matrix

Q−1 = QT

B.2.1 Orthogonal Projection


Let W ⊂ V be a finite-dimensional subspace of a real inner product space V. Then we have the following definitions.

Definition B.2.6. A vector z ∈ V is said to be orthogonal to the subspace W ⊂ V if it is orthogonal to every vector in W, so that ⟨z, w⟩ = 0 for all w ∈ W.

Definition B.2.7. The orthogonal projection of v onto the subspace W is the element w ∈ W that makes the difference z = v − w orthogonal to W.

Theorem B.2.8. Let u1, …, un be an orthonormal basis for the subspace W ⊂ V. Then the orthogonal projection w ∈ W of v ∈ V is given by

w = c1 u1 + ⋯ + cn un,   where ci = ⟨v, ui⟩, i = 1, …, n.

Proof. Since u1, …, un form a basis of the subspace, the orthogonal projection element must be some linear combination thereof: w = c1 u1 + ⋯ + cn un. Definition B.2.7 requires that the difference z = v − w be orthogonal to W, and it suffices to check orthogonality to the basis vectors. By our orthonormality assumption,

⟨z, ui ⟩ = ⟨v − w, ui ⟩
= ⟨v − c1 u1 − · · · − cn un , ui ⟩
= ⟨v, ui ⟩ − c1 ⟨u1 , ui ⟩ − · · · − cn ⟨un , ui ⟩
= ⟨v, ui ⟩ − ci
=0

The coefficients ci = ⟨v, ui⟩ of the orthogonal projection w are thus uniquely prescribed by the orthogonality requirement, which thereby proves its uniqueness.

B.2.2 Orthogonal Subspaces

Definition B.2.9. Two subspaces W, Z ⊂ V are called orthogonal if every vector in W is orthogonal to every vector in Z.

Lemma B.2.10. If w1, …, wk span W and z1, …, zl span Z, then W and Z are orthogonal subspaces if and only if ⟨wi, zj⟩ = 0 for all i = 1, …, k and j = 1, …, l.

Definition B.2.11. The orthogonal complement of a subspace W ⊂ V, denoted W⊥, is defined as the set of all vectors that are orthogonal to W:

W⊥ = {v ∈ V | ⟨v, w⟩ = 0 for all w ∈ W}

Proposition B.2.12. Suppose that W ⊂ V is a finite-dimensional subspace of an inner product space. Then every vector v ∈ V can be uniquely decomposed into v = w + z, where w ∈ W and z ∈ W⊥.

Proof. Let w ∈ W be the orthogonal projection of v onto W. Then z = v − w is, by definition, orthogonal to W and hence belongs to W⊥. Note that z can be viewed as the orthogonal projection of v onto the complementary subspace W⊥. If we are given two such decompositions, v = w + z = w̃ + z̃, then w − w̃ = z̃ − z. The left-hand side of this equation lies in W, while the right-hand side belongs to W⊥. But the only vector that belongs to both W and W⊥ is the zero vector. Thus, w − w̃ = 0 = z̃ − z, so w = w̃ and z = z̃, which proves uniqueness.

Proposition B.2.13. If W ⊂ V is a subspace with dim W = n and dim V = m, then


dim W ⊥ = m − n.

B.3 Vector Spaces


B.3.1 Span and Linear Independence
Definition B.3.1. Let v1, …, vk be elements of a vector space V. A sum of the form

c1 v1 + c2 v2 + ⋯ + ck vk = Σ_{i=1}^k ci vi,

where the coefficients c1, c2, …, ck are any scalars, is known as a linear combination of the elements v1, …, vk. Their span is the subset W = span{v1, …, vk} ⊂ V consisting of all possible linear combinations with scalars c1, …, ck ∈ R.

The key observation is that the span always forms a subspace.

Proposition B.3.2. The span W = span{v1, …, vk} of any finite collection of vector space elements v1, …, vk ∈ V is a subspace of the underlying vector space V.

Proof. We need to show that if v = c1 v1 + ⋯ + ck vk and v̂ = ĉ1 v1 + ⋯ + ĉk vk

are any two linear combinations, then their sum is also a linear combination, since

v + v̂ = (c1 + ĉ1)v1 + ⋯ + (ck + ĉk)vk = c̃1 v1 + ⋯ + c̃k vk

where c̃i = ci + ĉi. Similarly, for any scalar multiple,

av = (ac1)v1 + ⋯ + (ack)vk = c1∗ v1 + ⋯ + ck∗ vk

where ci∗ = aci, which completes the proof.

B.3.2 Linear Independence and Dependence


Definition B.3.3. The vector space elements v1 , . . . , vk ∈ V are called linearly dependent
if there exist scalars c1 , . . . , ck , not all zero, such that

c1 v1 + · · · + ck vk = 0

Elements that are not linearly dependent are called linearly independent.

B.3.3 Basis and Dimension


Definition B.3.4. A basis of a vector space V is a finite collection of elements v1 , . . . , vn ∈
V that,

1. spans V

2. is linearly independent

Theorem B.3.5. Every basis of Rn consists of exactly n vectors. Furthermore, a set of


n vectors v1 , . . . , vn ∈ Rn is a basis if and only if the n × n matrix A = (v1 . . . vn ) is
nonsingular: rank A = n.

Theorem B.3.6. Suppose the vector space V has a basis v1 , . . . , vn for some n ∈ N . Then
every other basis of V has the same number, n, of elements in it. This number is called
the dimension of V , and written dim V = n.

B.3.4 Kernel
Definition B.3.7. The image of an m × n matrix A is the subspace imgA ⊂ Rm spanned
by its columns. The kernel of A is the subspace ker A ⊂ Rn consisting of all vectors that
are annihilated by A,

ker A = {z ∈ Rn |Az = 0} ⊂ Rn

B.4 List of Variables and Dimensions

Variables Dimension

Y N ×1
X N ×K
β K ×1
e N ×1
r N ×K
Z N ×L
v N ×K
π L×K
λ L×1
w N ×1
Yi 1×1
Xi K ×1
ei 1×1
ri K ×1
Zi L×1
vi K ×1
wi 1×1
APPENDIX

Lecture 1: Projection Theorem


1. Variance: Var(X) := E[(X − E[X])²] = E[X²] − (E[X])².

2. Covariance: Cov(X, Y ) := E [(X − µX ) (Y − µY )] = E[XY ] − E[X]E[Y ] = Cov(X, E(Y |X))


3. If X ∈ {0, 1} then Cov(X, Y)/Var(X) = E[Y|X = 1] − E[Y|X = 0] (a spoiler for Lecture 7).

4. Conditional expectation: a function µ : R → R such that E(Y |X) := µ(X).

5. Law of iterated expectations: E[Y ] = E [E[Y |X]]

6. Hilbert space: a complete inner product space.

7. Inner product: ⟨X, Y ⟩ = E[XY ] has following properties:

(i) ⟨X, Y ⟩ = ⟨Y, X⟩ (commutative)


(ii) ⟨X + Y, Z⟩ = ⟨X, Z⟩ + ⟨Y, Z⟩ (distributive)
(iii) ⟨aX, Y ⟩ = a⟨X, Y ⟩
(iv) ⟨X, X⟩ ≥ 0
(v) ⟨X, X⟩ = 0 if and only if X = 0

8. Norm: ∥X∥ := √⟨X, X⟩ is the length and ∥X − Y∥ is the distance.

9. Projection theorem: If S is a closed subspace of the Hilbert space H and Y ∈ H, then

(i) there is a unique Ŷ ∈ S such that ∥Y − Ŷ ∥ = inf Z∈S ∥Y − Z∥


(ii) Ŷ ∈ S and ∥Y − Ŷ∥ = inf_{Z∈S} ∥Y − Z∥ if and only if (Y − Ŷ) ∈ S⊥

10. Space L2 is the collection of all rvs X defined on (Ω, F , P ) such that E|X|2 < ∞ (finite variance).

11. Linear subspaces of L2: for any Xk ∈ L2, k = 1, …, K, and bk ∈ R, the combination b1X1 + ⋯ + bKXK ∈ L2.


12. Span: sp(X1, …, XK) := { Σ_{k=1}^K bk Xk : Xk ∈ L2, bk ∈ R }.

13. Orthonormal basis: X̃k ∈ L2 such that ⟨X̃j, X̃l⟩ = E(X̃j · X̃l) = 1 for j = l and 0 otherwise.

14. Gram-Schmidt orthogonalization (see the sketch after this list):

Step 1: Let X̃1 := 1

Step 2: X̃2 := Ẍ2/√Var(Ẍ2) where Ẍ2 := X2 − E(X2 X̃1)X̃1

Step 3: X̃3 := Ẍ3/√Var(Ẍ3) where Ẍ3 := X3 − E(X3 X̃1)X̃1 − E(X3 X̃2)X̃2

15. Projection (orthogonal): P(Y) = Ŷ = argmin_{Z∈sp(X1)} E[Y − Z]²; with a single X1 this is b̂X1 for b̂ = argmin_{b∈R} E(Y − bX1)². It is the orthogonal projection of Y onto S.

16. Projection (orthonormal): P(Y) = Ŷ = Σ_{k=1}^K E(X̃k · Y) X̃k = X′β∗ where the X̃k are an orthonormal basis of sp(X), X is a K × 1 vector, and β∗ := E(XX′)⁻¹E(XY).

17. Linear model representation: Y = PX Y + (Y − PX Y ) = X ′ β ∗ + u where E(Xu) = 0 with


requirements that EY 2 < ∞, E∥X∥2 < ∞, and E (XX ′ ) be positive definite.
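
A sketch of the Gram-Schmidt steps from item 14, using sample moments in place of expectations on simulated random variables (not part of the lecture summary; all names and parameters are illustrative):

import numpy as np

rng = np.random.default_rng(11)
N = 100_000
X2 = rng.normal(2.0, 1.5, N)
X3 = 0.5 * X2 + rng.normal(0.0, 1.0, N)

X1t = np.ones(N)                               # Step 1: X~_1 := 1

Xdd2 = X2 - np.mean(X2 * X1t) * X1t            # Step 2: remove the component along X~_1
X2t = Xdd2 / np.sqrt(np.mean(Xdd2 ** 2))       # normalize so that E[X~_2^2] = 1

Xdd3 = X3 - np.mean(X3 * X1t) * X1t - np.mean(X3 * X2t) * X2t
X3t = Xdd3 / np.sqrt(np.mean(Xdd3 ** 2))       # Step 3

# check orthonormality under the inner product <X, Y> = E[XY]
for a in (X1t, X2t, X3t):
    print([round(np.mean(a * b), 3) for b in (X1t, X2t, X3t)])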

Lecture 2: Ordinary Least Squares Estimation
1. Feature: Let Z ∈ L2 and P ∈ P where P is a class of distributions on Z. A feature of P is an
object of the form γ(P ) for some γ : P → S where S is often times R.

2. Statistic: gN : {Z1 , . . . , ZN } → S with observed data Zi .

3. An estimator γ̂N is a statistic used to infer some feature γ(P ) of an unknown distribution P .

4. The empirical distribution PN of the sample {Z1 , . . . , ZN } is the discrete distribution that puts
equal probability 1/N on each sample point Zi , i = 1, . . . , N.

5. Analogy principle: γ̂N := γ(PN).

6. Convergence in probability: A sequence of random variables Z1, Z2, … converges in probability to c ∈ R if ∀ε > 0, limN→∞ P(|ZN − c| > ε) = 0; then we write ZN →p c. If ZN →p 0 we write ZN = op(1).

7. Bounded in probability: A sequence of random variables Z1 , Z2 , . . . is bounded in probability if


∀ε > 0, ∃bε ∈ R and Nε ∈ Z such that P (|ZN | ≥ bε ) < ε ∀N ≥ Nε , write ZN = Op (1).

8. If ZN = c + op (1) then ZN = Op (1) for c ∈ R.

9. Slutsky theorem: g (c + op (1)) = g(c) + op (1).

10. Weak law of large numbers: Let Z1 , Z2 , . . . be a sequence of iid rvs with E[Zi ] = µZ . Define
∑ p p
Z N := Ni=1 Zi /N. Then Z N − µZ → 0 or Z N → µZ , or write Z N = µZ + op (1).
11. Consistency of an estimator: γ̂N →p γ.

12. OLS: β̂OLS := argmin_{b∈R^K} Σ_{i=1}^N (Yi − Xi′b)² = (N⁻¹ Σ_{i=1}^N Xi Xi′)⁻¹(N⁻¹ Σ_{i=1}^N Xi Yi) = (X′X)⁻¹X′Y.

13. Method of Moments: β∗ = E(XX′)⁻¹E(XY), and by the analogy principle (N⁻¹ Σ_{i=1}^N Xi Xi′)⁻¹(N⁻¹ Σ_{i=1}^N Xi Yi) = β̂MM.

14. Representation: Σ_i Xi Xi′ = X′X and Σ_i Xi Yi = X′Y.

Lecture 3: Intermediate Ordinary Least Squares Estimation


1. Convergence in distribution: A sequence of random variables Z1 , Z2 , . . . converges in distribution
to the continuous random variable Z if limN →∞ FN (t) = F (t) ∀t ∈ R where FN is the CDF of
d
ZN . Write ZN → Z or ZN = Op (1).
d d
2. Continuous mapping theorem: If ZN → Z then g (ZN ) → g(Z) for continuous g(·).
3. If ZN →d N(0, Ω) then

(i) AZN →d N(0, AΩA′)
(ii) ZN′Ω⁻¹ZN →d χ²(dim(ZN))
(iii) (A + op(1))ZN →d N(0, AΩA′)
(iv) ZN′(Ω + op(1))⁻¹ZN →d χ²(dim(ZN))

4. Central limit theorem (CLT): Let Z1, Z2, … be a sequence of iid random vectors with µZ := E[Zi] and E∥Zi∥² < ∞. Define Z̄N := Σ_{i=1}^N Zi/N. Then √N(Z̄N − µZ) →d N(0, E[(Zi − µZ)(Zi − µZ)′]).

5. Asymptotic distribution: √N(β̂OLS − β∗) = (N⁻¹ Σ_{i=1}^N Xi Xi′)⁻¹(N^{−1/2} Σ_{i=1}^N Xi ui) →d N(0, Ω)

6. Projection matrix: PX := X (X ′ X)−1 X ′ .

7. Residual maker: MX := IN − PX .

8. Trace: tr A := Σ_{k=1}^K akk, with tr(AB) = tr(BA) and tr(A + B) = tr A + tr B.

9. Since σ̂u² := Σ_{i=1}^N ûi²/N has E(σ̂u²|X) < σu², use Su² := (N/(N − K)) σ̂u²; then limN→∞ σ̂u² = limN→∞ Su².

10. Heteroskedasticity-robust variance estimator: Ω̂ = (N⁻¹ Σ_i Xi Xi′)⁻¹((N − K)⁻¹ Σ_i ûi² Xi Xi′)(N⁻¹ Σ_i Xi Xi′)⁻¹

11. Confidence interval construction: let √N(r′β̂OLS − r′β∗) ∼ N(0, r′Ωr), where r is a non-stochastic K × 1 vector, and set r = ek. Then

P( ek′β̂OLS − 1.96·√(ek′Ωek)/√N < ek′β∗ < ek′β̂OLS + 1.96·√(ek′Ωek)/√N ) = 0.95

the probability is 95% that the random interval contains βk∗.

12. Confidence interval construction: √N(r′β̂OLS − r′β∗) →d N(0, r′Ω̂r) where Ω̂ = Ω + op(1) and r is a non-stochastic K × 1 vector; setting r = ek,

P( ek′β̂OLS − 1.96·√(ek′Ω̂ek)/√N < ek′β∗ < ek′β̂OLS + 1.96·√(ek′Ω̂ek)/√N ) → 0.95

13. Standard error: se(β̂kOLS) = √(ek′Ω̂ek)/√N = ek′·se(β̂OLS) where se(β̂OLS) = √(diag(Ω̂)/N).

14. t-statistic: tOLS(βk^null) := (β̂kOLS − βk^null)/se(β̂kOLS); if β̂kOLS = βk^null + op(1), then tOLS(βk^null) →d N(0, 1).

15. Joint hypotheses test: Suppose that Rβ∗ = q where R is an m × K dimensional nonstochastic matrix and q is an m-dimensional nonstochastic vector. Then,

WALD := N(Rβ̂OLS − q)′(RΩ̂R′)⁻¹(Rβ̂OLS − q) →d χ²(m)

Lecture 4: Advanced Ordinary Least Squares Estimation


1. Conditional expectation function: µ : R → R, µ (Xi ) ∈ L2 .

2. Conditional mean independence: E (ei ) = E (E (ei |Xi )) = 0

3. Full model: Let X1 , . . . , XM be an exhaustive list of variables that explain Y then Y =


h (X1 , . . . , XM ) = h(X, v) where h(·) is a well behaved function, observed X := (X1 , . . . , XK )′ ,
and unobserved v := (XK+1 , . . . , XM )′ .

4. In linear projection model, Yi = g (Xi ) + ei with E (ei |Xi ) = 0 implies Yi = µ (Xi ) + ei .


∂h(X1 ,...,XM )
5. Causal effect: Ck (X1 , . . . , XM ) := ∂Xk for k ∈ {1, . . . , M }.

6. Average causal effect: ACEk(X) := ∫ Ck(X, v) · f(v|X) dv

7. µ(X) = E (h (X1 , . . . , XM ) |X1 , . . . , XK ) = E(h(X, v)|X) = X ′ β + E(v|X).

8. Linear regression model: Yi = Xi′ β + ei with E (ei |Xi ) = 0, EYi2 < ∞, and E ∥Xi ∥2 < ∞

9. Exogeneity: E (ei |Xi ) = 0.

10. Homoskedasticity: E (ee′ |X) = σe2 IN

11. Variance decomposition: For P, Q ∈ L2 , Var P = E Var(P |Q) + Var E(P |Q)

12. Estimator: β̂ OLS = (X ′ X)−1 X ′ Y .

13. Unbiased: Eβ̂ OLS = β under exogeneity.


14. Unconditional variance: Var[β̂OLS] = σe² · E[(X′X)⁻¹] under homoskedasticity.

15. A symmetric matrix P is nonnegative definite if q′Pq ≥ 0 for all vectors q.

16. Gauss Markov theorem: In the linear regression model with homoskedastic errors, β̂OLS is the linear unbiased estimator with minimum variance (BLUE): Var(β̃|X) ≥ Var(β̂OLS|X).

17. Generalised least squares estimator β̂ GLS is the minimum variance unbiased estimator under
heteroskedasticity by Gauss Markov theorem.

18. GLS is an infeasible estimator because E(ee′|X) cannot be observed. To make it feasible, run OLS, obtain ê and Σ̂, then compute β̂GLS_feas := (X′Σ̂⁻¹X)⁻¹X′Σ̂⁻¹Y; this feasible version doesn't satisfy the Gauss Markov theorem.

19. Frisch-Waugh-Lovell theorem: Y = Xβ + e = X1β1 + X2β2 + e. The OLS estimator β̂2OLS results from regressing Y on X2 adjusted for X1β̂1OLS:

[ X1′X1  X1′X2 ] [ β̂1OLS ]   [ X1′Y ]
[ X2′X1  X2′X2 ] [ β̂2OLS ] = [ X2′Y ]

Estimators: β̂1OLS = (X̃1′X̃1)⁻¹(X̃1′Ỹ) and β̂2OLS = (X̃2′X̃2)⁻¹(X̃2′Ỹ), without the '~' if X1 and X2 have zero sample covariance.

Lecture 5: Instrument Variable I


1. Yi = Xi′ β +ei with E (ei |Xi ) = 0 where the linearity is with respect to β and Xi can be nonlinear
(e.g. quadratic, interaction, logarithms terms).

2. OLS is the best estimation choice under exogeneity, E (ei Xi ) = 0.

3. Exogeneity: E (ei · f (Xi )) = E (f (Xi ) · E (ei |Xi )) = 0


( )
4. Measurement error: X̃i = Xi + ri . Only observe X̃i , Yi so the model: Yi = Xi′ β + ri′ β + vi

5. Omitted variable non-bias: E (ei |Xi ) ̸= 0 but E (ei Xi ) = 0

6. OLS makes sense whenever E (ei Xi ) = 0.

7. Instrumental variable: if E(ei Xi) ≠ 0, then find Zi such that E(ei Zi) = 0, where dim Zi = L × 1 and rank E(Zi Xi′) = K (full column rank), so we need L ≥ K.

8. First stage regression: Xi = π ′ Zi + vi ⇒ π = E (Zi Zi′ )−1 E (Zi Xi′ ) where E (Zi vi ) = 0.

9. Second stage regression: Yi = (π ′ Zi + vi )′ β + ei = Zi′ λ + wi ⇒ λ = E (Zi Zi′ )−1 E (Zi Yi )

10. Exogenous regressors: E (ei Xi1 ) = 0

11. Endogenous regressors: E (ei Xi2 ) ̸= 0 ⇒ β ̸= β ∗ , use IV instead of OLS.

12. Identification: if a parameter can be written as an explicit function of population moments, then
    it is identified. With exogenous regressors (dim Zi1 = K1 ) and additional instruments for the
    endogenous regressors (dim Zi2 = L2 ),

        Zi := (Zi1′ , Zi2′ )′ = (Xi1′ , Zi2′ )′

    Three cases: L = K (exactly identified), L > K (over-identified), L < K (under-identified). λ
    and π are explicit functions of population moments and thus identified.

13. Existence and uniqueness: if Rank(π) = Rank([π λ]) then a solution of πβ = λ exists; it is unique
    if dim ker(π) = 0, i.e. Rank(π) = K. We therefore need rank E(Zi Xi2′ ) = K2 .
14. Case 1: L = K, β = π −1 λ = E(Zi Xi′ )−1 E(Zi Yi ). Thus, β̂ IV = (∑N i=1 Zi Xi′ )−1 (∑N i=1 Zi Yi );
    see the sketch at the end of this list.

15. Case 2: L > K, see Lecture 8.

16. Consistency: β̂ IV = β + op (1)


17. Asymptotic distribution: √N (β̂ IV − β) →d N( 0, E(Zi Xi′ )−1 E(e2i Zi Zi′ ) E(Xi Zi′ )−1 )
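A minimal sketch of the just-identified IV estimator of item 14 and the variance formula of item 17, on simulated data with one endogenous regressor and one instrument (all numbers are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(2)
N = 2000
z = rng.normal(size=N)                      # instrument
v = rng.normal(size=N)
e = 0.8 * v + rng.normal(size=N)            # structural error correlated with v
x = z + v                                   # first stage: X depends on Z and v => endogenous
y = 2.0 * x + e                             # structural equation with beta = 2

Z = np.column_stack([np.ones(N), z])        # L = K = 2 (constant plus one instrument)
X = np.column_stack([np.ones(N), x])

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)   # inconsistent because E(x e) != 0
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)    # (sum Zi Xi')^{-1} (sum Zi Yi)

# Sample analogue of the asymptotic variance E(ZX')^{-1} E(e^2 ZZ') E(XZ')^{-1}
ehat = y - X @ beta_iv
A = np.linalg.inv(Z.T @ X / N)
B = (Z * (ehat ** 2)[:, None]).T @ Z / N
se_iv = np.sqrt(np.diag(A @ B @ A.T) / N)
print(beta_ols, beta_iv, se_iv)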

Lecture 7: Treatment Effects with IV


1. Treatment: Xi ∈ {0, 1}, is a dummy variable.

2. Outcome function: Yi : {0, 1} → R.

3. Individual treatment effect (ITE): Yi (1) − Yi (0); it cannot be observed. A fix would require an
   identical individual j ̸= i, that is Yj (p) = Yi (p) for p ∈ {0, 1}, so that Yi (1) − Yj (0) recovers the ITE.

4. The regression model: Yi := Yi (1) · Xi + Yi (0) · (1 − Xi ) = β0 + β1i Xi + ũi ; do not use OLS
   unless β1i is constant. To remove the i-subscript on the slope, add and subtract E(Yi (1) − Yi (0)) · Xi ,
   which yields Yi = β0 + β1 Xi + ui with E(Xi ui ) = 0.

5. Average treatment effect: β1 := E (Yi (1) − Yi (0))

6. Independence: (Yi (1), Yi (0)) ⊥ Xi

7. OLS estimator of β1 in the model Yi = β0 + β1 Xi + ui is an estimator of the average treatment


effect E (Yi (1) − Yi (0)).

8. Randomised eligibility: if the treatment itself cannot be randomised, randomise eligibility Zi ∈ {0, 1} and use it as an instrument.


9. Estimator: β̂ IV = E(β1i · π1i )/E(π1i ) + op (1) ̸= E(β1i ) + op (1); it estimates the causal effect for
   those on whom Zi is most influential (large π1i ).
10. Local average treatment effect (LATE): E(β1i · π1i )/E(π1i ) = ATE + Cov(β1i , π1i )/E(π1i ), where
    π1i /E(π1i ) is interpreted as a weight. Unrealistic special cases: if β1i = β1 , or π1i = π1 , or β1i and
    π1i are independent, then LATE = ATE.
11. If there are four types — always-taker, complier, defier, never-taker — then LATE =
    E(β1i · π1i )/E(π1i ) = E(β1i |complier) ̸= E(β1i ), i.e. the treatment effect for compliers.

12. Intention-to-treat effect: the effect of eligibility Zi on the outcome; it coincides with the treatment effect only under full compliance.


13. Wald estimator: β̂ IV = (Ȳ1 − Ȳ0 )/(X̄1 − X̄0 ), the difference in mean outcomes between the Zi = 1
    and Zi = 0 groups divided by the difference in mean treatment take-up.
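A minimal sketch of the Wald estimator in item 13 with a binary instrument Zi and binary treatment Xi; the compliance behaviour simulated below is an illustrative assumption (the treatment effect is constant, so LATE = ATE):

import numpy as np

rng = np.random.default_rng(3)
N = 5000
Z = rng.integers(0, 2, size=N)                       # randomised eligibility
u = rng.normal(size=N)
X = (0.2 + 0.6 * Z + u > 0.5).astype(float)          # take-up depends on Z and on u
Y0 = 1.0 + u + rng.normal(size=N)
Y = (Y0 + 2.0) * X + Y0 * (1 - X)                    # constant treatment effect beta1 = 2

# Wald estimator: difference in mean outcomes over difference in mean take-up, across Z-groups
wald = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (X[Z == 1].mean() - X[Z == 0].mean())

# Equivalent IV slope with instrument Z
Zm = np.column_stack([np.ones(N), Z])
Xm = np.column_stack([np.ones(N), X])
beta_iv = np.linalg.solve(Zm.T @ Xm, Zm.T @ Y)
print(wald, beta_iv[1])                              # identical up to rounding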

Lecture 8: Instrumental Variables II
1. Case 2: L > K, π ′ πβ = π ′ λ ⇒ β = (π ′ π)−1 π ′ λ, motivates 2SLS.
2. Two-stage least squares estimator: β̂ 2SLS = (X ′ Z (Z ′ Z)−1 Z ′ X)−1 X ′ Z (Z ′ Z)−1 Z ′ Y

3. Representation: β̂ 2SLS = (X ′ PZ X)−1 X ′ PZ Y = (X̂ ′ X)−1 X̂ ′ Y = (X̂ ′ X̂)−1 X̂ ′ Y .

4. First stage: X on Z, π̂ = (Z ′ Z)−1 Z ′ X where X̂ = Z π̂ = PZ X


5. Second stage: regress Y on X̂, so β̂ 2SLS = (X̂ ′ X̂)−1 X̂ ′ Y ; see the sketch at the end of this list.
6. Consistency: β̂ 2SLS →p β.
7. Asymptotic distribution: √N (β̂ 2SLS − β) →d N(0, Ω)

8. Bias of 2SLS: Hahn and Kuersteiner (2002) use the concentration parameter µ; when µ is large
   enough, the bias approaches zero.

9. Invalid instrument: think of π = E(Zi Zi′ )−1 E(Zi Xi′ ) when E(Zi Xi′ ) = 0. Then β̂ IV is not consistent
   and converges to a Cauchy distribution, and the t-statistic does not converge to a normal distribution.
10. Generic t-statistic: t(β null ) := (β̂ − β null )/se(β̂).

11. Degree of endogeneity (ρ): affects the asymptotic distribution of t. As ρ → 1 (worst case),
    ξ1 →p ξ2 , σ̂e2 →p 0, se(β̂ IV ) → 0, S(ρ) → ∞, and t → ∞ ⇒ rejecting H0 : β = 0.
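A minimal sketch of 2SLS (items 2–5) in an over-identified design with L = 3 > K = 2, checking that the explicit formula and the two-stage recipe give the same numbers (simulated data, illustrative assumptions):

import numpy as np

rng = np.random.default_rng(4)
N = 3000
Z = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # L = 3 (constant + 2 instruments)
v = rng.normal(size=N)
e = 0.7 * v + rng.normal(size=N)
x = Z @ np.array([0.0, 1.0, 0.5]) + v                       # first stage
X = np.column_stack([np.ones(N), x])                        # K = 2
y = X @ np.array([1.0, 2.0]) + e

# Explicit formula: (X'Z (Z'Z)^{-1} Z'X)^{-1} X'Z (Z'Z)^{-1} Z'y
A = X.T @ Z @ np.linalg.inv(Z.T @ Z)
beta_2sls = np.linalg.solve(A @ Z.T @ X, A @ Z.T @ y)

# Two-stage recipe: regress X on Z, then y on the fitted values X_hat = P_Z X
pi_hat = np.linalg.solve(Z.T @ Z, Z.T @ X)
X_hat = Z @ pi_hat
beta_two_step = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
print(beta_2sls, beta_two_step)                             # equal up to rounding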

Lecture 9: Weak Instruments


1. Weak instrument: π ̸= 0 but π ≈ 0

2. Asymptotic distribution of t: depends on ρ and τ (the strength of the instrument), neither of which
   can be estimated. If ρ = 1, then ξ1 = ξ2 and the t-statistic becomes S(1, τ ) = ξ1 + ξ2 /τ .

3. The mixture: S(1, τ ) is a mixture involving the χ21 distribution, controlled by τ .

4. Strong instrument: large τ , S(1, τ ) close to N(0, 1).

5. Weak instrument: small τ , χ21 dominates.

6. Nightmare: limτ →0 S(1, τ ) = ∞, so t misleadingly suggests a significant β.

7. Dealing with weak instruments: Staiger and Stock (1997) use the first-stage F -statistic from regressing
   X on Z. The instrument is strong if F > 10 (safe to use β̂ IV and β̂ 2SLS ) and weak if F < 10; see
   the sketch at the end of this list.

8. Nominal size (Type I error): P(rejecting H0 |H0 is true) = α.

9. Stock and Yogo: provide a table of F -statistic critical values based on the actual size. The larger the
   size distortion (higher α) you can tolerate, the more likely you are to reject the null and conclude
   that the instrument is strong.
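A minimal sketch of the first-stage F diagnostic from item 7; the homoskedastic F formula used below is an assumption made for simplicity:

import numpy as np

rng = np.random.default_rng(5)
N = 1000
z = rng.normal(size=(N, 2))                           # two excluded instruments
x = z @ np.array([0.05, 0.03]) + rng.normal(size=N)   # deliberately weak first stage

# Unrestricted first-stage regression of x on a constant and the instruments
Z = np.column_stack([np.ones(N), z])
pi_hat = np.linalg.solve(Z.T @ Z, Z.T @ x)
rss1 = np.sum((x - Z @ pi_hat) ** 2)

# Restricted model: constant only (instruments excluded)
rss0 = np.sum((x - x.mean()) ** 2)

q = 2                                                 # number of excluded instruments
F = ((rss0 - rss1) / q) / (rss1 / (N - Z.shape[1]))
print(F)                                              # compare with the rule-of-thumb threshold of 10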

Lecture 10: Maximum Likelihood Estimation
1. Requirements: fY (y|θ) is known given an iid random sample Y1 , . . . , YN .

2. Likelihood function: L(θ|y) := fY1 ,...,YN (y1 , . . . , yN |θ) = ∏N i=1 fY (yi |θ).

3. Log-likelihood function: L(θ|y) := ln L(θ|y) = ∑N i=1 ln fY (yi |θ).

4. Maximum Likelihood Estimator: θ̂ML (Y ) := argmaxθ L (θ|Y ) = argmaxθ L(θ|Y ).


5. Jensen’s inequality: E[− ln(fY (Y |τ )/fY (Y |θ))] ≥ − ln E[fY (Y |τ )/fY (Y |θ)].
(Y |θ) ≥ − ln E fY (Y |θ) .

6. Invariance property: if θ̂ML is the ML estimator of θ and κ : Θ → R is a function, then the ML
   estimator of κ(θ) is κ(θ̂ML ).

7. Score function: S(y|θ) := ∂ ln fY (y|θ)/∂θ; evaluated at a random variable Y , S(Y |θ) is itself random.
8. Fisher information: I(θ) := E[S(Y |θ)2 ] = Var S(Y |θ).

9. Information equality: E[(∂ ln fY (Y |θ)/∂θ)2 ] = −E[∂ 2 ln fY (Y |θ)/∂θ2 ].

10. Cramér–Rao bound: Var T (Y1 , . . . , YN ) ≥ 1/(N · I(θ)), where T (·) is an unbiased estimator of θ;
    the bound is attained if and only if (1/N ) ∑N i=1 S(yi |θ) = a(θ) · (T (Y1 , . . . , YN ) − θ).

11. Consistency: θ̂M L → θ.


12. Asymptotic distribution: √N (θ̂ML − θ) →d N(0, I(θ)−1 ) where I(θ)−1 is the Cramér–Rao bound.
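A minimal numerical sketch of items 2–4 and 12; the exponential density is chosen purely for illustration because its ML estimator (1/Ȳ) and Fisher information (1/λ²) have closed forms against which the numerical optimiser can be checked:

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(6)
N, lam_true = 2000, 1.5
y = rng.exponential(scale=1.0 / lam_true, size=N)   # f(y|lam) = lam * exp(-lam * y)

# Log-likelihood L(lam|y) = N ln(lam) - lam * sum(y); maximise numerically
def negloglik(lam):
    return -(N * np.log(lam) - lam * y.sum())

lam_ml_numeric = minimize_scalar(negloglik, bounds=(1e-6, 50.0), method="bounded").x
lam_ml_closed = 1.0 / y.mean()                      # closed-form ML estimator

# I(lam) = 1/lam^2, so the asymptotic standard error is lam / sqrt(N)
se_ml = lam_ml_closed / np.sqrt(N)
print(lam_ml_numeric, lam_ml_closed, se_ml)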

Lecture 11: Limited Dependent Variable (LDV) Models


1. Limited dependent variables: binary outcome: Y ∈ {0, 1}, multinomial outcome: Y ∈ {0, 1, . . . , s},
integer outcome: Y ∈ {0, 1, . . .}, or censored outcome: Y ∈ R+ .

2. Estimation: parametric MLE

3. Linear probability model (lpm): E (Yi |Xi ) = Pr (Yi = 1|Xi ) = Xi′ β where β is the effect of Xi
on the probability of success Pr (Yi = 1|Xi ). Use OLS to estimate β.

4. Limitations of the lpm: P̂(Yi = 1|Xi ) < 0 or P̂(Yi = 1|Xi ) > 1 can occur, and linearity of the
   probability is restrictive; fixed by P(Yi = 1|Xi ) = G(Xi′ β) where G (probit or logit) is a CDF. Use
   MLE to estimate β.
5. Likelihood function: L(β|x, y) = ∏N i=1 Pr(Yi = yi |Xi = xi ) Pr(Xi = xi ) = ∏N i=1 fY (yi |xi , β) fX (xi )
6. MLE: argmaxβ∈B L(β|x, y) = argmaxβ∈B ∑N i=1 ln fY (yi |xi , β)

7. Latent variable representation: Yi∗ = Xi′ β + ei where Yi = 1 if Yi∗ > 0 and Yi = 0 otherwise.

8. In the binary outcome model, fY (y|x, β) = Pr (Yi = y|Xi = x) = G (x′ β)y · (1 − G (x′ β))1−y .

9. Let G be the standard normal or the logistic CDF. Then L(β|x, y) is globally concave.
10. Score: S(y|x, β) = [(y − G(x′ β))/(G(x′ β)(1 − G(x′ β)))] · g(x′ β) · x; let the computer find β such
    that ∑N i=1 S(yi |xi , β) = 0.
11. Asymptotic distribution of MLE: √N (β̂ ML − β) →d N(0, I(β)−1 )

12. Information matrix: I(β) = E[S(Yi |Xi , β) S(Yi |Xi , β)′ ] = E[ g(Xi′ β)2 /(G(Xi′ β)(1 − G(Xi′ β))) · Xi Xi′ ]
· Xi Xi′

13. Causal effect: ∂E(Yi |Xi )/∂Xi = ∂ Pr(Yi = 1|Xi )/∂Xi = g(Xi′ β) β ̸= β, by the chain rule. Taking
    expectations, ϕ := E[∂ Pr(Yi = 1|Xi )/∂Xi ] = E[g(Xi′ β) β], and by the analogy principle
    ϕ̂ = (1/N ) ∑N i=1 g(Xi′ β̂ ML ) β̂ ML ; see the first sketch at the end of this list.
14. Delta method: let √N (θ̂ − θ) →d N(0, Ω) with dim θ = K. Take a continuously differentiable
    function C : Θ → RQ with Q ≤ K. Then √N (C(θ̂) − C(θ)) →d N(0, c(θ) · Ω · c(θ)′ ) where
    c(θ) := ∂C(θ)/∂θ′ .

15. Sample selection model: Yi∗ = Xi′ β + ei and Di = 1(Zi′ γ + vi > 0), where Yi = Yi∗ if Di = 1 and
    unobserved if Di = 0; the data are (Di , Xi , Yi , Zi ).
16. Inverse Mills ratio: E(vi |vi > −c) = ϕ(c)/Φ(c) =: λ(c)

17. Estimating γ (first stage): run a probit estimation of Di on Zi and obtain γ̂ ML .

18. Regression model (second stage): Yi = Xi′ β + ρλ(Zi′ γ) + ri . First derive E(ei |Di = 1, Xi = x, Zi =
    z) = ρλ(z ′ γ), hence E(Yi∗ |Di = 1, Xi , Zi ) = Xi′ β + ρλ(Zi′ γ). Use OLS with λ(Zi′ γ̂ ML ) as an added
    regressor; see the second sketch at the end of this list.

19. Exclusion restriction: at least one instrument that is not in Xi

20. Important note: β is only identified by imposing some functional form on the joint error distri-
bution.
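Two minimal sketches for this lecture on simulated data; everything below is an illustrative implementation, not the only possible one. First, probit estimation by maximising the log-likelihood of item 6, followed by the average partial effect ϕ̂ of item 13:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(7)
N = 2000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
Y = (X @ np.array([-0.5, 1.0]) + rng.normal(size=N) > 0).astype(float)   # latent-variable model

def negloglik(b):
    p = np.clip(norm.cdf(X @ b), 1e-10, 1 - 1e-10)   # guard against log(0)
    return -np.sum(Y * np.log(p) + (1 - Y) * np.log(1 - p))

beta_ml = minimize(negloglik, x0=np.zeros(2), method="BFGS").x

# Average partial effect (item 13): phi_hat = (1/N) sum g(Xi' beta_ml) * beta_ml
ape = norm.pdf(X @ beta_ml).mean() * beta_ml
print(beta_ml, ape)

Second, the two-step estimator of items 15–18 (probit first stage for Di, then OLS with the inverse Mills ratio λ(Zi′ γ̂) added as a regressor); the jointly normal, correlated errors simulated here are the assumption that makes the second step valid:

rng2 = np.random.default_rng(8)
N2 = 4000
Zx = rng2.normal(size=N2)                            # excluded instrument
Xs = rng2.normal(size=N2)
v = rng2.normal(size=N2)
e = 0.6 * v + 0.8 * rng2.normal(size=N2)             # errors correlated across equations
D = (0.3 + Zx + v > 0).astype(float)                 # selection equation
Ystar = 1.0 + 2.0 * Xs + e                           # outcome, observed only when D = 1

Zmat = np.column_stack([np.ones(N2), Zx])

def negloglik_probit(g):
    p = np.clip(norm.cdf(Zmat @ g), 1e-10, 1 - 1e-10)
    return -np.sum(D * np.log(p) + (1 - D) * np.log(1 - p))

gamma_hat = minimize(negloglik_probit, x0=np.zeros(2), method="BFGS").x

# Second stage on the selected sample, adding the inverse Mills ratio lambda(Z'gamma_hat)
sel = D == 1.0
idx = Zmat[sel] @ gamma_hat
mills = norm.pdf(idx) / norm.cdf(idx)
W = np.column_stack([np.ones(sel.sum()), Xs[sel], mills])
coef = np.linalg.solve(W.T @ W, W.T @ Ystar[sel])
print(gamma_hat, coef)                               # last entry estimates the coefficient rho on lambda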

Lecture 12: Extremum Estimators and M-Estimation


1. Extremum estimators: θ̂EE := argmaxθ∈Θ QN (Wi , θ) given data Wi , i = 1, . . . , N , where
   QN (Wi , θ) = (1/N ) ∑N i=1 q(Wi , θ) and q(Wi , θ) is ln fY (Yi |θ) for MLE.
2. Convergence: there exists Q0 (θ) such that QN (Wi , θ) →u.p. Q0 (θ).


3. Sufficient condition: supθ∈Θ (1/N ) ∑N i=1 (q(Wi , θ) − E(q(Wi , θ))) = op (1)

4. Regularity conditions: Q0 (θ) is continuous (by inspection) and QN (Wi , θ) →u.p. Q0 (θ).

5. Substantive conditions: Q0 (θ) is uniquely maximized at θ0 (checked naturally in MLE) and Θ


is compact (verified when estimating a probability or running a regression).

6. Consistency: regularity, substantive conditions and identification must be satisfied.


7. M-estimator: (1/N ) ∑N i=1 s(Wi , θ̂M ) = 0 given data Wi , i = 1, . . . , N , where s(Wi , θ) := ∂q(Wi , θ)/∂θ.
8. Extremum vs M: θ̂EE := argmaxθ∈Θ QN (Wi , θ) and (∂QN /∂θ)(Wi , θ̂M ) = 0.
9. Asymptotic distribution: √N (θ̂EE − θ0 ) →d N(0, H −1 ΣH −1 ).
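A minimal sketch of the sandwich form H^{-1} Σ H^{-1} in item 9, treating OLS as an M-estimator with q(Wi , θ) = −(Yi − Xi′ θ)²/2; this particular choice of q is an assumption made for illustration:

import numpy as np

rng = np.random.default_rng(9)
N = 1500
X = np.column_stack([np.ones(N), rng.normal(size=N)])
e = rng.normal(size=N) * (1.0 + 0.5 * np.abs(X[:, 1]))   # heteroskedastic errors
Y = X @ np.array([1.0, 2.0]) + e

# M-estimator condition: (1/N) sum s(Wi, theta_hat) = 0 with s(Wi, theta) = (Yi - Xi'theta) Xi
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
u = Y - X @ theta_hat

# H = E[ds/dtheta'] = -E[Xi Xi'],  Sigma = E[s s'] = E[u^2 Xi Xi']  (sample analogues)
H = -(X.T @ X) / N
Sigma = (X * (u ** 2)[:, None]).T @ X / N
Hinv = np.linalg.inv(H)
avar = Hinv @ Sigma @ Hinv          # asymptotic variance of sqrt(N) (theta_hat - theta_0)
se = np.sqrt(np.diag(avar) / N)
print(theta_hat, se)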
